How Ripple's C++ Team Cut rippled's Memory Footprint Down To Size

One of the best ways to make software more accessible is to reduce the hardware resources needed to run it. Blockchain software is no exception. The XRP Ledger is already one of the greenest blockchains due to its pioneering consensus protocol, but its ecosystem can still benefit from more efficient resource usage. Reduced inefficiencies benefit businesses, developers, and enthusiasts alike.

So how much can we really improve?

The C++ team at Ripple spent a significant amount of time in 2020 focusing on how to make better use of available system resources. We're proud to report that, compared to previous versions, the improvements we've contributed to version 1.7.0 of rippled, our reference implementation of the XRP Ledger server, cut its memory usage to less than half.

That's not a typo and these aren't theoretical numbers. These are real-world improvements of more than 50% in servers running rippled on the XRP Ledger mainnet. The operator of alloy.ee calls it "sheer voodoo."

Keep reading to learn the technical details that made this engineering work possible.

Motivation Behind Reducing rippled’s Memory Footprint

Alongside our other experiments, we set out to measure and understand how rippled uses resources and how to improve it.

As a result of these measurements, we contributed support for link compression in version 1.6.0, which reduces the amount of bandwidth used by servers, benefiting operators of single servers and large clusters alike.

We're also working on optimizing how messages are propagated and routed throughout the XRP Ledger network so that every server receives everything it needs to know without getting redundant copies of the same information.

However, the primary pain point for many operators of the software is RAM usage. Memory is a critical and scarce resource. Reducing the amount required by the server could have an outsized impact on the performance of individual servers and, potentially, the network as a whole.

To get the best return on our effort, we focused on identifying what parts of the server used the most RAM and experimented with what we could do about them.

Understanding SHAMap

In rippled, one of the largest targets was the SHAMap and its constituent parts—specifically, the nodes within the SHAMap's tree-like structure.

This data structure holds the state of the ledger itself—all the accounts, balances, settings, exchange orders, and everything else the XRP Ledger tracks. On the mainnet, the SHAMap consists of millions of nodes, which means that even small savings in the size of individual nodes can result in large savings overall.

More precisely, the SHAMap is a combination of a Merkle tree and a Radix tree with a branching factor of 16. This branching factor is especially relevant in this instance. It means that the inner nodes in the tree (the nodes other than the "leaves") can have up to 16 children. For example, consider this simple tree:

Each inner node has 16 child slots, whether they point to something or not. And that was the key insight: just because nodes can have up to 16 children doesn't mean that they actually do in practice.

The question then became, how many children does a "typical" inner node have? We gathered data from Ripple-operated servers, which resulted in the following distribution:

The data we collected shows that most inner nodes only have a handful of children. With that in mind, we examined if the nodes of the tree could be adjusted, at runtime, to accommodate only as many children as they needed.

In other words, what if we changed the tree to look more like this?

The Naive SHAMap Design

The SHAMap is a tree-like data structure where keys are 256-bit hashes, and a node in the tree (other than the leaves) can have up to 16 children. Each child is stored as a 256-bit hash value that can be used to retrieve the child node from a database. Children are also sometimes cached directly in a node for quick traversal.

The original implementation of the non-leaf nodes had a C++ class that looked a bit like this:

struct SHAMapInnerNode {
    std::bitset<16> isBranch_;      // which of the 16 branches are occupied
    SHAMapHash hashes_[16];         // child hashes, one slot per branch
    std::shared_ptr<SHAMapTreeNode> cachedChildren_[16];  // cached child nodes
};

To traverse the tree, the code looks at the 256-bit key four bits at a time. These 4-bit numbers represent which of the sixteen children to transition to next. This means the children are ordered, and it's possible that some branches (such as 2, 5, and 7) are occupied while the rest are empty.

The isBranch_ variable tracks which branches have children and which are empty, but the hashes_[16] array always leaves space for sixteen 256-bit identifiers anyway.
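To make the traversal concrete, here is a minimal sketch of the nibble extraction. The names (Key256, branchAt) are hypothetical stand-ins for illustration, not rippled's actual types:

#include <array>
#include <cstdint>

// Hypothetical stand-in for a 256-bit key: 32 raw bytes, most
// significant byte first.
using Key256 = std::array<std::uint8_t, 32>;

// Which of the 16 branches to follow at a given depth in the tree:
// consume the key one 4-bit nibble at a time, high nibble of each
// byte first.
int
branchAt(Key256 const& key, int depth)
{
    std::uint8_t const byte = key[depth / 2];
    return (depth % 2 == 0) ? (byte >> 4) : (byte & 0x0F);
}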

Sparse SHAMap

Instead of always storing sixteen children, we want to use a space-saving sparse representation when there are only a few children and only switch to the dense representation when there are more children.

Such a representation could look like this:

struct SHAMapInnerNode {
    std::bitset<16> isBranch_;
    SHAMapHash* hashes_;                               // sparse array of child hashes
    std::shared_ptr<SHAMapTreeNode>* cachedChildren_;  // sparse array of cached children
    std::uint8_t arraySize_;                           // allocated size of the arrays
};

A sparse representation would store hashes_ and cachedChildren_ without gaps. For example, if branches 2, 5, and 7 are occupied while the rest are empty, then hashes_ would store these values at indexes 0, 1, and 2. To look up a child in a sparse array, count how many non-empty branches come before it.

For example, if branches 2, 5, and 7 are occupied, and you want to find the child in branch 5, there's 1 branch with an index less than 5, so you'd look at array index 1 of the hashes_ array.

How much memory does this save? Each hash is 256 bits—that's 32 bytes—and each shared pointer is 16 bytes. That means that every time the sparse representation doesn't leave empty space for a nonexistent child that the naive, dense implementation would have, you save 48 bytes.

With the data we had collected on how many children the nodes in the SHAMap really had when used in production, we could calculate how much memory could be saved by using a sparse representation.

The following Python code demonstrates how to perform such a calculation:

def savings(child_hist, arg_indexes):
    """
    Given a histogram of the number of children, and the array size boundaries,
    calculate the ratio of memory used for a sparse representation/memory used
    for a dense representation
    """
    if not arg_indexes:
        return 1.0
    # add one (and copy so we don't modify original list)
    indexes = [a + 1 for a in arg_indexes]
    if indexes[0] != 0:
        indexes.insert(0, 0)
    if indexes[-1] != 17:
        indexes.append(17)
    per_child_bytes = 48.0  # 32 for the hash, 16 for the shared_ptr
    dense_bytes = 16.0 * per_child_bytes * sum(child_hist)
    sparse_bytes = 0.0
    for i in range(1, len(indexes)):
        sparse_bytes += ((indexes[i] - 1) *
                         per_child_bytes *
                         sum(child_hist[indexes[i - 1]:indexes[i]]))
    return sparse_bytes / dense_bytes

How Sparse to Go?

It was clear that a sparse representation could save a substantial amount of memory, but the next challenge was to determine how sparse to go.

If we used 16 different sparse array sizes, one for every possible number of children, then no space would be wasted at all. But doing that brings in other costs. For example, when you have a size-10 array and you need to add an 11th child, you need to allocate space for a size-11 array and copy all 10 existing entries into the new array.

Another approach is to have just two sizes of arrays. For example, you might use a sparse array with room for up to 5 children, and a dense array when you need to hold 6 to 16.

This approach still wastes some space (for example, when the sparse array has just 1 of its 5 spots occupied, or when the dense array uses only 6 or 7 of its 16 spots), but it's simpler and faster than constantly changing the size of the array. Alternatively, we could have 3 or more different sizes (for example, 3, 6, and 16).

To decide between all the possibilities, we had to consider the trade-offs of different numbers of sparse array sizes. This kind of analysis is common in engineering and even in other disciplines; it's kind of like deciding how many different sizes of jeans you want to manufacture to make the most profit.

In our case, we considered all of the following in deciding how to approach the problem:

  1. Costs of moving data when the sparse array changes representations. If you need to add a child and the current sparse array is full, you need to allocate a new array and move the data into it (see the sketch after this list).
  2. Costs of inserting and removing data in the sparse array. Since the sparse arrays are sorted, you have to move elements around whenever you add or remove children from them. Keeping the sparse arrays small reduces the amount of data movement necessary.
  3. Space savings. The more sparse array sizes, the greater the savings. However, we expected diminishing returns from each additional size option.
  4. Implementation considerations. A tagged pointer (described below) can be used to store a small number of bits without using additional space. Keeping the number of array sizes below this number results in additional savings.
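As a rough illustration of the data movement in items 1 and 2, here is a minimal sketch of growing a full sparse array. All names are assumptions for illustration: SHAMapHash stands in for rippled's 256-bit hash type, and the boundary values are just an example (the article settles on actual values below). Only the hash array is shown; the cached-children array moves the same way.

#include <algorithm>
#include <cstdint>
#include <memory>

// Stand-in for rippled's 256-bit hash type.
struct SHAMapHash
{
    std::uint8_t bytes[32];
};

// Example boundary list of array sizes.
constexpr std::uint8_t exampleBoundaries[] = {2, 4, 6, 16};

// Smallest boundary that can hold `count` children.
std::uint8_t
nextSizeUp(std::uint8_t count)
{
    for (auto b : exampleBoundaries)
        if (b >= count)
            return b;
    return 16;
}

// When a full sparse array needs one more child, allocate the next
// size up and move the existing hashes across.
std::unique_ptr<SHAMapHash[]>
grow(std::unique_ptr<SHAMapHash[]> hashes, std::uint8_t count)
{
    auto bigger = std::make_unique<SHAMapHash[]>(nextSizeUp(count + 1));
    std::move(hashes.get(), hashes.get() + count, bigger.get());
    return bigger;
}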

Fortunately, we didn't have to make these decisions based on pure intuition and generalizations. Using our histogram data, we could calculate the actual space savings of various numbers and sizes of sparse arrays, which helped us figure out where the diminishing returns actually kicked in.

The following code helps compute this.

import itertools
import numpy

# collected_child_histogram is the histogram gathered from production
# servers (not reproduced here): element k is the number of inner nodes
# observed with exactly k children.

def savings_dim(dim=1):
    if dim == 0:
        s = savings(collected_child_histogram, [])
        return ([s], [()], 0)
    args = [arg for arg in itertools.combinations(range(17), dim)]
    r = [savings(collected_child_histogram, arg) for arg in args]
    return (r, args, numpy.argmin(r))

def doit():
    result = []
    for dim in range(17):
        r, args, arg_min = savings_dim(dim)
        result.append([r[arg_min], args[arg_min]])
    return result

This gives the following result:

[[1.000, ()],
[0.384, (5,)],
[0.293, (3, 6)],
[0.259, (2, 4, 6)],
[0.246, (2, 3, 5, 7)],
[0.237, (2, 3, 4, 5, 7)],
[0.233, (1, 2, 3, 4, 5, 7)],
[0.230, (1, 2, 3, 4, 5, 6, 8)],
[0.228, (1, 2, 3, 4, 5, 6, 8, 15)],
[0.228, (1, 2, 3, 4, 5, 6, 7, 9, 15)],
[0.227, (1, 2, 3, 4, 5, 6, 7, 9, 14, 15)],
[0.227, (1, 2, 3, 4, 5, 6, 7, 8, 9, 14, 15)],
[0.227, (1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 14, 15)],
[0.227, (1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 14, 15)],
[0.227, (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15)],
[0.227, (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)],
[0.227, (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)]]

The first row establishes the baseline, with no sparse arrays, just the naive SHAMap design. In other words, we defined "however much memory it was using originally" as 1.000 or 100%.

The next row represents an approach using one sparse array of size 5 and one dense array ((5,)) and estimates that this would use 38.4% as much space as the naive design (0.384). The third row tests an approach with two sparse arrays of sizes 3 and 6, which occupies 29.3% as much space, and so on.

Most of the savings come from having one sparse array of size 5 and one dense array. There are some additional savings with 2 or 3 different sparse array sizes, but anything beyond that doesn't make much difference.

You can see the diminishing returns more clearly if we plot this as a graph:

With this information, we knew we wanted to use either one, two, or three different sizes of sparse arrays. It was tempting to go with just one sparse and one dense size, since that already gives us most of the space savings, and the implementation is simpler to understand. However, we decided that the extra savings of going a little bit further were worth the slight increase of complexity.

In the end, we opted to use three sparse array sizes (2, 4, and 6) plus the dense array (16). Conveniently, we can use two bits of a tagged pointer to encode these four sizes in a compact way. (More on this later.)

Development note: We actually coded up the "one sparse array" implementation first even though we knew it was never going to be used. It's often better to build and test an intermediate milestone like this rather than implement the final, complex version right away.

Further Optimizations for Memory Savings

Earlier, when we calculated "48 bytes" of space savings per unused node, it was a bit of an idealized scenario. In practice, the sparse array design contains several new details that reduce the savings from the potential maximum:

  1. Storing pointers to arrays instead of arrays directly.
  2. Storing the array sizes.
  3. Overhead from malloc.

What happened next is a good illustration of how software ends up becoming so complex.

The actual core idea of using sparse arrays is relatively simple. You can save a lot of space by implementing them in the most straightforward way possible. But if you want to get even more efficient, there are various tricks you can use to squeeze things into less space.

These techniques end up saving RAM, but they also make the code more difficult to understand and debug. The team is aware of the long-term costs this imposes on maintaining the software.

Blindly trying to maximize the efficiency of every piece of code is a "penny wise, pound foolish" move. You may save an imperceptible amount of RAM or CPU up front, only to have a later update break the code and cost far more resources down the road to fix it.

But on the other hand, some things are worth optimizing. In the case of the SHAMap nodes, even small improvements to individual nodes are multiplied millions of times over because there are so many nodes in the overall structure. Given this, we opted to implement the smaller space savings as well.

One Opaque Pointer

On a 64-bit system, it takes 8 bytes to store a pointer, and we need to store pointers to two objects: the array of child hashes and the array of cached child objects. Fortunately, we know exactly how long the hashes_ array will be.

It happens that the hashes are 32 bytes, so the end of the array will always be cleanly aligned on an 8-byte boundary in memory. If we always store the cachedChildren_ array directly following the hashes_ array, we can use one pointer to refer to both of them. This saves 8 bytes for every inner node.

For example, an implementation could be:

struct SHAMapInnerNode {
    std::bitset<16> isBranch_;
    // Points to the hashes_ array; the cachedChildren_ array is stored
    // immediately after it in the same allocation.
    void* hashesAndCachedChildren_;
    std::uint8_t arraySize_;

    SHAMapHash* hashes()
    {
        return reinterpret_cast<SHAMapHash*>(hashesAndCachedChildren_);
    }

    std::shared_ptr<SHAMapTreeNode>* cachedChildren()
    {
        // Since hashes are 32 bytes, this will always be correctly aligned.
        return reinterpret_cast<std::shared_ptr<SHAMapTreeNode>*>(
            hashes() + arraySize_);
    }
};
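A hedged sketch of making that single combined allocation (not rippled's actual code; the std::shared_ptr slots would still need to be constructed with placement new before use):

void* allocateArrays(std::uint8_t arraySize)
{
    // One block: `arraySize` hashes followed by `arraySize` cached-child
    // pointers. Because sizeof(SHAMapHash) is 32 and the allocation start
    // is suitably aligned, the pointer array begins on a correctly
    // aligned boundary.
    std::size_t const bytes = arraySize * sizeof(SHAMapHash) +
        arraySize * sizeof(std::shared_ptr<SHAMapTreeNode>);
    return ::operator new(bytes);
}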

Storing Array Size

One of the added pieces of information in the sparse array is the field that indicates what size array any given node has: 2, 4, 6, or the full 16. If you recall, our example design for the sparse SHAMapInnerNode stored this as a 1-byte value, arraySize_. We also wanted to avoid storing this value.

Approach 1: Counting Occupied Branches

If you look over the sparse SHAMapInnerNode's other fields, you might notice that the isBranch_ field can pull double duty. If you count how many of its bits are set, you know how many children the node has, and you can use that to deduce what size array the node is using.

This involves slightly more computing since the code has to count bits each time you check the node. However, the overhead is negligible, especially if you have a hardware-assisted popcount instruction available.
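For example, here is a minimal sketch of that deduction, using C++20's std::popcount (a compiler builtin works equally well on older standards):

#include <bit>
#include <cstdint>

// Deduce the smallest array size (2, 4, 6, or 16) that fits the
// children recorded in the isBranch bitmap.
std::uint8_t
smallestArraySize(std::uint16_t isBranch)
{
    int const children = std::popcount(isBranch);
    for (std::uint8_t size : {2, 4, 6})
        if (children <= size)
            return size;
    return 16;
}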

Counting bits in isBranch_ has a serious drawback, though: it only works if every node always stores its children in the smallest sparse array possible. This isn't always desirable. Sometimes it's better to temporarily keep a denser representation.

For example, if you're batch adding or removing many child nodes, it would make sense to set aside a dense array from the start rather than resize the same node several times in a short period. Alternatively, you might want to reduce how often arrays need to be copied by using a technique such as hysteresis or some other predictive heuristic.

Since we wanted to keep these options open, we decided against using isBranch_ to determine the array size.

Approach 2: Tagged Pointer

Pointers, like the one we use to refer to the array of child hashes, are the addresses of data structures in memory.

On a 64-bit system, a pointer is always 64 bits (8 bytes) so that it can point to any address in memory. However, memory allocators don't place data structures at every possible address; they generally align allocations along nice even boundaries of 4, 8, or 16 bytes. This means that in practice, the lowest few bits of a pointer are always going to be 0's.

The idea of a "tagged pointer" is to use those bits to store other useful data. Since we have only four sizes of arrays (2, 4, 6, and 16), we can encode the size of the array into just two bits. We then "fold" this information into the pointer and pull the separate pieces apart using bit masks when we need to use them.

One implementation of this could be:

constexpr std::array<std::uint8_t, 4> boundaries{
    2,
    4,
    6,
    16};

struct SHAMapInnerNode {
    static_assert(
        alignof(SHAMapHash) >= 4,
        "Bad alignment: Tag pointer requires low two bits to be zero.");
    /** Upper bits are the pointer, lowest two bits are the tag.
        A moved-from object will have a tp_ of zero.
    */
    std::uintptr_t tp_ = 0;
    /** bit-and with this mask to get the tag bits (lowest two bits) */
    static constexpr std::uintptr_t tagMask = 3;
    /** bit-and with this mask to get the pointer bits (mask out the tag) */
    static constexpr std::uintptr_t ptrMask = ~tagMask;

    // decode the tagged pointer into the tag and pointer
    [[nodiscard]] std::pair<std::uint8_t, void*>
    decode() const
    {
        return {
            static_cast<std::uint8_t>(tp_ & tagMask),
            reinterpret_cast<void*>(tp_ & ptrMask)};
    }

    // return the array size and the two arrays
    [[nodiscard]] std::tuple<std::uint8_t, SHAMapHash*, std::shared_ptr<SHAMapTreeNode>*>
    getHashesAndChildren() const
    {
        auto const [tag, ptr] = decode();
        auto const hashes = reinterpret_cast<SHAMapHash*>(ptr);
        std::uint8_t const numAllocated = boundaries[tag];
        auto const children = reinterpret_cast<std::shared_ptr<SHAMapTreeNode>*>(
            hashes + numAllocated);
        return {numAllocated, hashes, children};
    }
};

The actual implementation in rippled wraps this logic in the TaggedPointer class.
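Encoding is the inverse operation, a simple bitwise OR. A minimal sketch of a member function that could sit alongside decode() above (an assumed helper, not rippled's exact code; assert requires <cassert>):

// Pack an aligned pointer and a 2-bit size tag into a single word.
// The alignment static_assert above guarantees the pointer's low two
// bits are zero, so the tag can live there without losing information.
void
encode(void* ptr, std::uint8_t tag)
{
    assert(tag <= tagMask);
    assert((reinterpret_cast<std::uintptr_t>(ptr) & tagMask) == 0);
    tp_ = reinterpret_cast<std::uintptr_t>(ptr) | tag;
}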

In the end, we were able to store the size of the array in bits that were otherwise unused, instead of adding 8 bytes per node for an arraySize_ field.

Avoid std::bitset

We still have the isBranch_ field to indicate which branches actually contain children and which ones are empty. Since this field tracks 16 distinct “yes-no” states, the smallest possible size of this field is 16 bits (2 bytes).

Theoretically, this is what std::bitset<16> represents. However, in practice, most implementations of the C++ standard library don't go to extremes to save space, so when you ask for a std::bitset<16> you often get 32 or even 64 bits, depending on the machine. Those extra 2 to 6 bytes can add up when there are millions of instances of the same data structure.

To avoid this inefficiency, we opted to use a std::uint16_t, which is guaranteed to be exactly 16 bits. Depending on the platform, this can end up saving as much as 6 bytes for every inner node.
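A quick way to check the difference on your own platform (output varies by implementation; many 64-bit systems print 8 and 2):

#include <bitset>
#include <cstdint>
#include <iostream>

int main()
{
    std::cout << sizeof(std::bitset<16>) << '\n';  // often a full word
    std::cout << sizeof(std::uint16_t) << '\n';    // always 2 bytes
}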

Custom allocators

There's typically an 8 to 16 byte space overhead when malloc is used to get memory from the heap. malloc is also a relatively slow operation.

Since the SHAMap is allocating lots of these inner nodes, and they all have the same size (or rather four sizes, but we can easily use four allocators), this is a good use-case for custom allocators. C++17 introduced polymorphic memory resources. The new pool allocator is a good match here because it promises fast allocation and deallocation and has zero extra space overhead.

Unfortunately, on macOS, the Clang compiler's standard library does not yet have an implementation for this. Additionally, a microbenchmark showed that the GCC compiler's std::synchronized_pool_resource was much too slow, and resulted in a significant performance regression in rippled.

Fortunately, Boost has an implementation of pooled resources that is significantly faster. It fixed this performance regression and also works on macOS.
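As a sketch of what wiring in Boost's pooled resource can look like (this illustrates the general boost::container::pmr API, not rippled's exact integration):

#include <boost/container/pmr/synchronized_pool_resource.hpp>

#include <cstddef>

// A process-wide pool; because the inner-node arrays come in only a
// few sizes, a pool resource can recycle freed blocks very cheaply.
boost::container::pmr::synchronized_pool_resource pool;

void*
allocateArrays(std::size_t bytes)
{
    return pool.allocate(bytes);
}

void
deallocateArrays(void* p, std::size_t bytes)
{
    pool.deallocate(p, bytes);
}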

Our team is already looking into why GCC's implementation delivered such poor results. If we identify the root cause, we will submit code to fix the problem. Furthermore, we are also working on a custom slab allocator with thread-aware freelist caching that promises even better performance.

Bit twiddling

One of the most common operations you perform on a sparse array is to determine where in the sparse array a given branch is stored.

The example we used earlier was that if branches 2, 5, and 7 are occupied, and we are looking for a child in branch 5, we count how many branches are occupied below that (just one, branch 2) and use that as the index into the sparse array.

We wanted to make sure that this operation was very fast so the sparse arrays performed well. We realized that we could use bitwise operations on the isBranch_ field to perform this calculation in just a couple of steps.

First, mask out all the children at or above the index you're looking for. Then, count how many bits are still on. The popcnt16 function maps to a population-count instruction built into modern CPUs, so the entire calculation takes just a few processor cycles to run.

inline int
TaggedPointer::getChildIndex(std::uint16_t isBranch, int i) const
{
    // Sparse children are stored sorted. This means the index
    // of a child in the array is the number of non-empty children
    // before it. Since `isBranch_` is a bitset of the stored
    // children, we simply need to mask out (and set to zero) all
    // the bits in `isBranch_` at or above `i` and count the
    // remaining bits.

    // mask sets all the bits >= i to zero and all the bits < i to
    // one.
    auto const mask = (1 << i) - 1;
    return popcnt16(isBranch & mask);
}

Oh... one more thing.

Our CTO, David Schwartz, was inspired to go one step further, examining some of the other heavy users of memory. He implemented a change that eliminates a layer of caching, which resulted in further memory improvements, and surprisingly, improved execution time.

Real-world Impact

The following chart compares the memory usage of three versions of the rippled server:

  • The first (blue) is unpatched, running the previous version of the code
  • The second (orange) has the "sparse inner node" changes described in this article
  • The third (green) has both the "sparse inner node" changes and the improvements added by David Schwartz

It's easy to see the significant impact these two changes have on memory usage. When combined, they result in total memory savings of over 50%, at some points using 7 GB less RAM than the original code.

Some members of the community, like Alloy Networks, have tested out the new code to confirm that it works for them as well as it did for us. With confirmation that it does, we are excited to say that these changes will be part of the upcoming release of rippled version 1.7.0.

Conclusions

It's often lamented that as computer hardware gets better, software only gets worse to compensate. At Ripple, we believe it doesn't have to be that way.

In some ways the biggest change is no change. If you’re running rippled, you won’t have to change anything to take advantage of the improvements. You can use the recommended system specs and still experience dramatically better memory usage and execution time.

We look forward to continuing to help the XRP Ledger grow to be the best, most efficient, and greenest system out there for payments and more.