Always Bump Downwards
When writing a bump allocator, always bump downwards. That is, allocate from high addresses, down towards lower addresses by decrementing the bump pointer. Although it is perhaps less natural to think about, it is more efficient than incrementing the bump pointer and allocating from lower addresses up to higher ones.
What is Bump Allocation?
Bump allocation is a super fast method for allocating objects. We have a chunk of memory, and we maintain a “bump pointer” within that memory. Whenever we allocate an object, we do a quick test that we have enough capacity left in the chunk, and then, assuming we have enough room, we move the bump pointer over by `sizeof(object)` bytes and return the pointer to the space we just reserved for the object within the chunk.

That’s it!
Here is some pseudo-code showing off the algorithm:
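Something along these lines (a sketch only; note that it deliberately doesn’t say which direction the pointer moves):

```
bump(size):
    if size is larger than the capacity left in the chunk:
        fail
    move self.ptr over by size bytes
    return the pointer to the size bytes we just reserved
```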
The trade off with bump allocation is that we can’t deallocate individual objects in the general case. We can deallocate all of them en masse by resetting the bump pointer back to its initial location. We can deallocate in a LIFO, stack order by moving the bump pointer in reverse. But we can’t deallocate an arbitrary object in the middle of the chunk and reclaim its space for new allocations.
Finally, notice that the chunk of memory we are bump allocating within is always split in two: the side holding allocated objects and the side with free memory. The bump pointer separates the two sides. Furthermore, note that I haven’t defined which side of the bump pointer is free or allocated space, and I’ve carefully avoided saying whether the bump pointer is incremented or decremented.
Bumping Upwards
First, let’s consider what we shouldn’t do: bump upwards by initializing the bump pointer at the low end of our memory chunk and incrementing the bump pointer on each allocation.
We begin with a `struct` that holds the start and end addresses of our chunk of memory, as well as our current bump pointer:
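In Rust, that might look something like this (a sketch; bumpalo’s real types carry more state than this):

```rust
pub struct BumpUp {
    // Pointer to the first byte of our chunk of memory.
    start: *mut u8,
    // Pointer just past the last byte of our chunk of memory.
    end: *mut u8,
    // The bump pointer. Invariant: start <= ptr <= end.
    ptr: *mut u8,
}
```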
Constructing our upwards bump allocator requires giving it the `start` and `end` pointers, and it will initialize its bump pointer to the `start` address:
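A sketch of such a constructor (the `new` signature is illustrative):

```rust
impl BumpUp {
    pub unsafe fn new(start: *mut u8, end: *mut u8) -> BumpUp {
        debug_assert!(start as usize <= end as usize);
        // Bumping upwards: the bump pointer begins at the low end.
        BumpUp { start, end, ptr: start }
    }
}
```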
To allocate an object, we will begin by grabbing the current bump pointer, and saving it in a temporary variable: this is going to be the pointer to the newly allocated space. Then we increment the bump pointer by the requested size, and check that it is still less than or equal to `end`. If so, then we have capacity for the allocation, and can commit the new bump pointer to `self.ptr` and return the temporary pointing to the freshly allocated space.
But first, there is one thing that the pseudo-code ignored, but which a real implementation cannot: alignment. We need to round up the initial bump pointer to a multiple of the requested alignment before we compute the new bump pointer by adding the requested size.⁰
Put all that together, and it looks like this:
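A sketch of the whole routine (names and signature are illustrative, not bumpalo’s actual API):

```rust
impl BumpUp {
    pub unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
        debug_assert!(align.is_power_of_two());

        // Grab the current bump pointer; once aligned, it will point at
        // the newly allocated space.
        let ptr = self.ptr as usize;

        // Round the bump pointer up to the requested alignment.
        let aligned = (ptr + align - 1) & !(align - 1);

        // Bump over by `size` bytes to get the new bump pointer.
        let new_ptr = aligned + size;

        if new_ptr > self.end as usize {
            // Not enough capacity left in the chunk!
            return std::ptr::null_mut();
        }

        // Commit the new bump pointer and return the allocated space.
        self.ptr = new_ptr as *mut u8;
        aligned as *mut u8
    }
}
```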
If we compile this allocation routine to x86-64 with optimizations, we get the following code, which I’ve lightly annotated:
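(The listing below is a representative, hand-annotated sketch; exact register allocation and instruction selection vary with the compiler version. It assumes the `start`, `end`, and `ptr` fields live at offsets 0, 8, and 16.)

```asm
; rdi = &mut BumpUp, rsi = size, rdx = align (a power of two)
bump_up:
    mov  rax, [rdi + 16]       ; load the bump pointer
    lea  rax, [rax + rdx - 1]  ; ptr + (align - 1)
    neg  rdx                   ; -align == !(align - 1)
    and  rax, rdx              ; aligned pointer (the return value)
    add  rsi, rax              ; new_ptr = aligned + size
    cmp  rsi, [rdi + 8]        ; compare new_ptr with end
    ja   .fail                 ; the single conditional branch
    mov  [rdi + 16], rsi       ; commit the new bump pointer
    ret
.fail:
    xor  eax, eax              ; return null
    ret
```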
I’m not going to explain each individual instruction. What’s important to appreciate here is that this is just a small handful of fast instructions with only a single branch to handle the not-enough-capacity case. This is what makes bump allocation so fast — great!
But before we get too excited, there is another practicality to consider: to maintain memory safety, we must handle potential integer overflows in the allocation procedure, or else we could have bugs where we return pointers outside the bounds of our memory chunk. No good!
There are two opportunities for overflow we must take care of:
- If the requested allocation’s size is large enough, `aligned + size` can overflow.
- If the requested allocation’s alignment is large enough, the `ptr + align - 1` sub-expression we use when rounding up to the alignment can overflow.
To handle both these cases, we will use checked addition and return a null pointer if either addition overflows. Here is the new Rust source code:
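(Sketched in the same style as before, rather than bumpalo’s exact source:)

```rust
impl BumpUp {
    pub unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
        debug_assert!(align.is_power_of_two());

        let ptr = self.ptr as usize;

        // Round up to the requested alignment, checking for overflow.
        let aligned = match ptr.checked_add(align - 1) {
            Some(sum) => sum & !(align - 1),
            None => return std::ptr::null_mut(),
        };

        // Bump over by `size` bytes, checking for overflow.
        let new_ptr = match aligned.checked_add(size) {
            Some(p) => p,
            None => return std::ptr::null_mut(),
        };

        if new_ptr > self.end as usize {
            // Not enough capacity left in the chunk!
            return std::ptr::null_mut();
        }

        self.ptr = new_ptr as *mut u8;
        aligned as *mut u8
    }
}
```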
Now that we’re handling overflows in addition to the alignment requirements, let’s take a look at the x86-64 code that `rustc` and LLVM produce for the function now:
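(As before, a representative sketch of the shape of the generated code, not a verbatim listing:)

```asm
; rdi = &mut BumpUp, rsi = size, rdx = align (a power of two)
bump_up:
    mov  rax, [rdi + 16]   ; load the bump pointer
    dec  rdx               ; align - 1
    add  rax, rdx          ; ptr + (align - 1)...
    jb   .fail             ; ...did the rounding overflow?
    not  rdx               ; !(align - 1)
    and  rax, rdx          ; aligned pointer (the return value)
    add  rsi, rax          ; new_ptr = aligned + size...
    jb   .fail             ; ...did adding the size overflow?
    cmp  rsi, [rdi + 8]    ; compare new_ptr with end
    ja   .fail             ; not enough capacity?
    mov  [rdi + 16], rsi   ; commit the new bump pointer
    ret
.fail:
    xor  eax, eax          ; return null
    ret
```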
Now there are three conditional branches rather than one. The two new branches are from those two new overflow checks that we added. Less than ideal.
Can bumping downwards do better?
Bumping Downwards
Now let’s implement a bump allocator where the bump pointer is initialized at the end of the memory chunk and is decremented on allocation, so that it moves downwards towards the start of the memory chunk.
The `struct` is identical to the previous version:
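```rust
pub struct BumpDown {
    // Pointer to the first byte of our chunk of memory.
    start: *mut u8,
    // Pointer just past the last byte of our chunk of memory.
    end: *mut u8,
    // The bump pointer. Invariant: start <= ptr <= end.
    ptr: *mut u8,
}
```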
Constructing a `BumpDown` is similar to constructing a `BumpUp`, except we initialize the `ptr` to `end` rather than `start`:
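```rust
impl BumpDown {
    pub unsafe fn new(start: *mut u8, end: *mut u8) -> BumpDown {
        debug_assert!(start as usize <= end as usize);
        // Bumping downwards: the bump pointer begins at the high end.
        BumpDown { start, end, ptr: end }
    }
}
```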
When we were allocating by incrementing the bump pointer, the original bump pointer value before it was incremented pointed at the space that was about to be reserved for the allocation. When we are allocating by decrementing the bump pointer, the original bump pointer is pointing at either the end of the memory chunk, or at the last allocation we made. What we want to return is the value of the bump pointer after we decrement it down, at which time it will be pointing at our allocated space.
First we subtract the allocation size from the bump pointer. This subtraction might overflow, so we check for that and return a null pointer if that is the case, just like we did in the previous, upward-bumping function. Then, we round that down to the nearest multiple of `align` to ensure that the allocated space has the object’s alignment. At this point, we check if we are down past the start of our memory chunk, in which case we don’t have the capacity to fulfill this allocation, and we return null. Otherwise, we update the bump pointer to its new value and return the pointer!
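Putting that together, a sketch of the downward-bumping routine (same illustrative names as before):

```rust
impl BumpDown {
    pub unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
        debug_assert!(align.is_power_of_two());

        let ptr = self.ptr as usize;

        // Subtract the size from the bump pointer, checking for overflow.
        let new_ptr = match ptr.checked_sub(size) {
            Some(p) => p,
            None => return std::ptr::null_mut(),
        };

        // Round down to the requested alignment. Masking can't overflow,
        // so there is no check here.
        let new_ptr = new_ptr & !(align - 1);

        if new_ptr < self.start as usize {
            // Not enough capacity left in the chunk!
            return std::ptr::null_mut();
        }

        // Commit the new bump pointer, which points at the fresh space.
        self.ptr = new_ptr as *mut u8;
        self.ptr
    }
}
```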
And here is the x86-64 code generated for this downward-bumping allocation routine!
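(Again a representative sketch rather than verbatim compiler output, with the same assumed field offsets:)

```asm
; rdi = &mut BumpDown, rsi = size, rdx = align (a power of two)
bump_down:
    mov  rax, [rdi + 16]   ; load the bump pointer
    sub  rax, rsi          ; ptr - size...
    jb   .fail             ; ...did the subtraction overflow?
    neg  rdx               ; -align == !(align - 1)
    and  rax, rdx          ; round down to align: a mask, no check needed
    cmp  rax, [rdi]        ; compare new_ptr with start
    jb   .fail             ; not enough capacity?
    mov  [rdi + 16], rax   ; commit; rax is also the return value
    ret
.fail:
    xor  eax, eax          ; return null
    ret
```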
Because rounding down doesn’t require an addition or subtraction operation, it doesn’t have an associated overflow check. That means one less conditional branch in the generated code, and downward bumping only has two conditional branches versus the three that upward bumping has.
Additionally, because we don’t need to save the original bump pointer value, this version uses fewer registers than the upward-bumping version. Bump allocation functions are designed to be fast paths that are inlined into callers, which means that downward bumping creates less register pressure at every call site.
Finally, this downwards-bumping version is implemented with eleven instructions, while the upwards-bumping version requires thirteen instructions. In general, fewer instructions implies a shorter run time.
Benchmarks
I recently switched the `bumpalo` crate from bumping upwards to bumping downwards. It has a nice, little micro-benchmark suite that is written with the excellent, statistics-driven Criterion.rs benchmarking framework. With Criterion’s built-in support for defining a baseline measurement and comparing an alternate implementation of the code against it, I compared the new, downwards-bumping implementation against the original, upwards-bumping implementation.
The new, downwards-bumping implementation has up to 19% better allocation throughput than the original, upwards-bumping implementation! We’re down to 2.7 nanoseconds per allocation.
The plot below shows the probability of allocating 10,000 small objects taking a certain amount of time. The red curve represents the old, upwards-bumping implementation, while the blue curve shows the new, downwards-bumping implementation. The lines represent the mean time.
You can view the complete, nitty-gritty benchmark results in the pull request.
The One Downside: Losing a `realloc` Fast Path
`bumpalo` doesn’t only provide an allocation method, it also provides a `realloc` method to resize an existing allocation. `realloc` is O(n) because in the worst-case scenario it needs to allocate a whole new region of memory and copy the data from the old to the new region. But the old, upwards-bumping implementation had a fast path for growing the last allocation: it would add the delta size to the bump pointer, leaving the allocation in place and avoiding that copy. The new, downwards-bumping implementation also has a fast path for resizing the last allocation, but even if we reuse that space, the start of the allocated region of memory has shifted, and so we can’t avoid the data copy.
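For illustration, the old fast path looked roughly like this (a sketch against the `BumpUp` type from above; `grow_last` is a hypothetical method name, not bumpalo’s actual `realloc` entry point):

```rust
impl BumpUp {
    /// Try to grow the most recent allocation in place, without copying.
    /// Returns null if `last` isn't the most recent allocation or if
    /// there isn't enough capacity to grow it.
    pub unsafe fn grow_last(
        &mut self,
        last: *mut u8,
        old_size: usize,
        new_size: usize,
    ) -> *mut u8 {
        debug_assert!(new_size >= old_size);

        // `last` was the most recent allocation iff it ends exactly at
        // the current bump pointer.
        if last as usize + old_size != self.ptr as usize {
            return std::ptr::null_mut();
        }

        let new_end = match (last as usize).checked_add(new_size) {
            Some(p) => p,
            None => return std::ptr::null_mut(),
        };
        if new_end > self.end as usize {
            return std::ptr::null_mut();
        }

        // Bump the pointer over by the size delta; the allocation's
        // start doesn't move, so no bytes are copied.
        self.ptr = new_end as *mut u8;
        last
    }
}
```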
The loss of that fast path leads to a 4% slowdown in our `realloc` benchmark that formats a string into a bump-allocated buffer, triggering a number of `realloc`s as the string is constructed. We felt that this was worth the trade off for faster allocation.
Less Work with More Alignment?
It is rare for types to require more than word alignment. We could enforce a minimum alignment on the bump pointer at all times that is greater than or equal to the vast majority of our allocations’ alignment requirements. If our allocation routine is monomorphized for the type of the allocation it’s making, or it is aggressively inlined — and it definitely should be — then we should be able to completely avoid generating any code to align the bump pointer in most cases, including the conditional branch on overflow if we are rounding up for upwards bumping.
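As a sketch of what that could look like for the downward bumper (the `MIN_ALIGN` constant and the `alloc_typed` method are illustrative assumptions, not bumpalo’s actual API):

```rust
use std::mem;

// An assumed minimum alignment that the bump pointer maintains at all
// times; word alignment on typical 64-bit targets.
const MIN_ALIGN: usize = 8;

impl BumpDown {
    pub unsafe fn alloc_typed<T>(&mut self) -> *mut T {
        // Round the size up to `MIN_ALIGN` so that subtracting it keeps
        // the bump pointer aligned. `size_of` is bounded well below
        // `usize::MAX`, so this addition can't overflow, and after
        // monomorphization it is a compile-time constant.
        let size = (mem::size_of::<T>() + MIN_ALIGN - 1) & !(MIN_ALIGN - 1);

        let new_ptr = match (self.ptr as usize).checked_sub(size) {
            Some(p) => p,
            None => return std::ptr::null_mut(),
        };

        // Only types with unusually large alignment need any rounding;
        // in the common case this whole branch constant-folds away.
        let align = mem::align_of::<T>();
        let new_ptr = if align > MIN_ALIGN {
            new_ptr & !(align - 1)
        } else {
            new_ptr
        };

        if new_ptr < self.start as usize {
            return std::ptr::null_mut();
        }

        self.ptr = new_ptr as *mut u8;
        self.ptr as *mut T
    }
}
```

With `MIN_ALIGN = 8`, for example, allocating a `u32` burns four bytes of padding per allocation, which is exactly the wasted space described next.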
The trade off is extra memory overhead from introducing wasted space between small allocations that don’t require that extra alignment.
Conclusion
If you are writing your own bump allocator, you should bump downwards: initialize the bump pointer to the end of the chunk of memory you are allocating from within, and decrement it on each allocation so that it moves down towards the start of the memory chunk. Downwards bumping requires fewer registers, fewer instructions, and fewer conditional branches. Ultimately, that makes it faster than bumping upwards.
The one exception is if, for some reason, you frequently use `realloc` to grow the last allocation you made, in which case you might get more out of a fast path for growing the last allocation in place without copying any data. And if you do decide to bump upwards, then you should strongly consider enforcing a minimum alignment on the bump pointer to recover some of the performance that you’re otherwise leaving on the table.
Finally, I’d like to thank Jim Blandy, Alex Crichton, Jeena Lee, and Jason Orendorff for reading an early draft of this blog post, for discussing these ideas with me, and for being super friends :)
⁰ The simple way to round `n` up to a multiple of `align` is

```
(n + align - 1) / align * align
```

Consider the numerator: `n + align - 1`. This is ensuring that if there is any remainder for `n / align`, then the result of the division sub-expression is one greater than `n / align`, and that otherwise we get exactly the same result as `n / align` due to integer division rounding off the remainder. In other words, we only round up if `n` is not aligned to `align`.

However, we know `align` is a power of two, and therefore `anything / align` is equivalent to `anything >> log2(align)` and `anything * align` is equivalent to `anything << log2(align)`. We can therefore rewrite our expression into:

```
(n + align - 1) >> log2(align) << log2(align)
```

But shifting a value right by some number of bits `b` and then shifting it left by that same number of bits `b` is equivalent to clearing the bottom `b` bits of the number. We can clear the bottom `b` bits of a number by bit-wise and’ing the number with the bit-wise not of `2^b - 1`. Plugging this into our equation and simplifying, we get:

```
(n + align - 1) >> log2(align) << log2(align)
    = (n + align - 1) & !(2^log2(align) - 1)
    = (n + align - 1) & !(align - 1)
```
And now we have our final version of rounding up to a multiple of a power of two!
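As a quick sanity check (a sketch, with `round_up` as an illustrative helper):

```rust
/// Round `n` up to the nearest multiple of `align`.
/// `align` must be a power of two.
fn round_up(n: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (n + align - 1) & !(align - 1)
}

fn main() {
    assert_eq!(round_up(12, 8), 16); // 12 isn't aligned, so it rounds up
    assert_eq!(round_up(16, 8), 16); // already aligned, stays put
}
```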
If you find these bit twiddling hacks fun, definitely find yourself a copy of Hacker’s Delight. It’s a wonderful book!