Always Bump Downwards
When writing a bump allocator, always bump downwards. That is, allocate from high addresses, down towards lower addresses by decrementing the bump pointer. Although it is perhaps less natural to think about, it is more efficient than incrementing the bump pointer and allocating from lower addresses up to higher ones.
What is Bump Allocation?
Bump allocation is a super fast method for allocating objects. We have a chunk of memory, and we maintain a “bump pointer” within that memory. Whenever we allocate an object, we do a quick test that we have enough capacity left in the chunk, and then, assuming we have enough room, we move the bump pointer over by `sizeof(object)` bytes and return the pointer to the space we just reserved for the object within the chunk.

That’s it!
Here is some pseudo-code showing off the algorithm:
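Something along these lines (a sketch only; note that it deliberately doesn’t say which direction the pointer moves):

```
bump(size):
    if size is larger than the capacity left in the chunk:
        fail
    move self.ptr over by size bytes
    return the pointer to the size bytes we just reserved
```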
The trade off with bump allocation is that we can’t deallocate individual objects in the general case. We can deallocate all of them en masse by resetting the bump pointer back to its initial location. We can deallocate in a LIFO, stack order by moving the bump pointer in reverse. But we can’t deallocate an arbitrary object in the middle of the chunk and reclaim its space for new allocations.
Finally, notice that the chunk of memory we are bump allocating within is always split in two: the side holding allocated objects and the side with free memory. The bump pointer separates the two sides. Furthermore, note that I haven’t defined which side of the bump pointer is free or allocated space, and I’ve carefully avoided saying whether the bump pointer is incremented or decremented.
Bumping Upwards
First, let’s consider what we shouldn’t do: bump upwards by initializing the bump pointer at the low end of our memory chunk and incrementing the bump pointer on each allocation.
We begin with a `struct` that holds the start and end addresses of our chunk of memory, as well as our current bump pointer:
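In Rust, that might look something like this (a sketch; bumpalo’s real types carry more state than this):

```rust
pub struct BumpUp {
    // Pointer to the first byte of our chunk of memory.
    start: *mut u8,
    // Pointer just past the last byte of our chunk of memory.
    end: *mut u8,
    // The bump pointer. Invariant: start <= ptr <= end.
    ptr: *mut u8,
}
```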
Constructing our upwards bump allocator requires giving it the `start` and `end` pointers, and it will initialize its bump pointer to the `start` address:
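A sketch of such a constructor (the `new` signature is illustrative):

```rust
impl BumpUp {
    pub unsafe fn new(start: *mut u8, end: *mut u8) -> BumpUp {
        debug_assert!(start as usize <= end as usize);
        // Bumping upwards: the bump pointer begins at the low end.
        BumpUp { start, end, ptr: start }
    }
}
```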
To allocate an object, we will begin by grabbing the current bump pointer, and saving it in a temporary variable: this is going to be the pointer to the newly allocated space. Then we increment the bump pointer by the requested size, and check that it is still less than or equal to `end`. If so, then we have capacity for the allocation, and can commit the new bump pointer to `self.ptr` and return the temporary pointing to the freshly allocated space.
But first, there is one thing that the pseudo-code ignored, but which a real implementation cannot: alignment. We need to round up the initial bump pointer to a multiple of the requested alignment before we compute the new bump pointer by adding the requested size.⁰
Put all that together, and it looks like this:
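A sketch of the whole routine (names and signature are illustrative, not bumpalo’s actual API):

```rust
impl BumpUp {
    pub unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
        debug_assert!(align.is_power_of_two());

        // Grab the current bump pointer; once aligned, it will point at
        // the newly allocated space.
        let ptr = self.ptr as usize;

        // Round the bump pointer up to the requested alignment.
        let aligned = (ptr + align - 1) & !(align - 1);

        // Bump over by `size` bytes to get the new bump pointer.
        let new_ptr = aligned + size;

        if new_ptr > self.end as usize {
            // Not enough capacity left in the chunk!
            return std::ptr::null_mut();
        }

        // Commit the new bump pointer and return the allocated space.
        self.ptr = new_ptr as *mut u8;
        aligned as *mut u8
    }
}
```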
If we compile this allocation routine to x86-64 with optimizations, we get the following code, which I’ve lightly annotated:
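(The listing below is a representative, hand-annotated sketch; exact register allocation and instruction selection vary with the compiler version. It assumes the `start`, `end`, and `ptr` fields live at offsets 0, 8, and 16.)

```asm
; rdi = &mut BumpUp, rsi = size, rdx = align (a power of two)
bump_up:
    mov  rax, [rdi + 16]       ; load the bump pointer
    lea  rax, [rax + rdx - 1]  ; ptr + (align - 1)
    neg  rdx                   ; -align == !(align - 1)
    and  rax, rdx              ; aligned pointer (the return value)
    add  rsi, rax              ; new_ptr = aligned + size
    cmp  rsi, [rdi + 8]        ; compare new_ptr with end
    ja   .fail                 ; the single conditional branch
    mov  [rdi + 16], rsi       ; commit the new bump pointer
    ret
.fail:
    xor  eax, eax              ; return null
    ret
```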
I’m not going to explain each individual instruction. What’s important to appreciate here is that this is just a small handful of fast instructions with only a single branch to handle the not-enough-capacity case. This is what makes bump allocation so fast — great!
But before we get too excited, there is another practicality to consider: to maintain memory safety, we must handle potential integer overflows in the allocation procedure, or else we could have bugs where we return pointers outside the bounds of our memory chunk. No good!
There are two opportunities for overflow we must take care of:
- If the requested allocation’s size is large enough, `aligned + size` can overflow.
- If the requested allocation’s alignment is large enough, the `ptr + align - 1` sub-expression we use when rounding up to the alignment can overflow.
To handle both these cases, we will use checked addition and return a null pointer if either addition overflows. Here is the new Rust source code:
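(Sketched in the same style as before, rather than bumpalo’s exact source:)

```rust
impl BumpUp {
    pub unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
        debug_assert!(align.is_power_of_two());

        let ptr = self.ptr as usize;

        // Round up to the requested alignment, checking for overflow.
        let aligned = match ptr.checked_add(align - 1) {
            Some(sum) => sum & !(align - 1),
            None => return std::ptr::null_mut(),
        };

        // Bump over by `size` bytes, checking for overflow.
        let new_ptr = match aligned.checked_add(size) {
            Some(p) => p,
            None => return std::ptr::null_mut(),
        };

        if new_ptr > self.end as usize {
            // Not enough capacity left in the chunk!
            return std::ptr::null_mut();
        }

        self.ptr = new_ptr as *mut u8;
        aligned as *mut u8
    }
}
```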
Now that we’re handling overflows in addition to the alignment requirements, let’s take a look at the x86-64 code that `rustc` and LLVM produce for the function now:
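(As before, a representative sketch of the shape of the generated code, not a verbatim listing:)

```asm
; rdi = &mut BumpUp, rsi = size, rdx = align (a power of two)
bump_up:
    mov  rax, [rdi + 16]   ; load the bump pointer
    dec  rdx               ; align - 1
    add  rax, rdx          ; ptr + (align - 1)...
    jb   .fail             ; ...did the rounding overflow?
    not  rdx               ; !(align - 1)
    and  rax, rdx          ; aligned pointer (the return value)
    add  rsi, rax          ; new_ptr = aligned + size...
    jb   .fail             ; ...did adding the size overflow?
    cmp  rsi, [rdi + 8]    ; compare new_ptr with end
    ja   .fail             ; not enough capacity?
    mov  [rdi + 16], rsi   ; commit the new bump pointer
    ret
.fail:
    xor  eax, eax          ; return null
    ret
```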
Now there are three conditional branches rather than one. The two new branches are from those two new overflow checks that we added. Less than ideal.
Can bumping downwards do better?
Bumping Downwards
Now let’s implement a bump allocator where the bump pointer is initialized at the end of the memory chunk and is decremented on allocation, so that it moves downwards towards the start of the memory chunk.
The `struct` is identical to the previous version:
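```rust
pub struct BumpDown {
    // Pointer to the first byte of our chunk of memory.
    start: *mut u8,
    // Pointer just past the last byte of our chunk of memory.
    end: *mut u8,
    // The bump pointer. Invariant: start <= ptr <= end.
    ptr: *mut u8,
}
```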
Constructing a `BumpDown` is similar to constructing a `BumpUp`, except we initialize the `ptr` to `end` rather than `start`:
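```rust
impl BumpDown {
    pub unsafe fn new(start: *mut u8, end: *mut u8) -> BumpDown {
        debug_assert!(start as usize <= end as usize);
        // Bumping downwards: the bump pointer begins at the high end.
        BumpDown { start, end, ptr: end }
    }
}
```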
When we were allocating by incrementing the bump pointer, the original bump pointer value before it was incremented pointed at the space that was about to be reserved for the allocation. When we are allocating by decrementing the bump pointer, the original bump pointer is pointing at either the end of the memory chunk, or at the last allocation we made. What we want to return is the value of the bump pointer after we decrement it down, at which time it will be pointing at our allocated space.
First we subtract the allocation size from the bump pointer. This subtraction might overflow, so we check for that and return a null pointer if that is the case, just like we did in the previous, upward-bumping function. Then, we round that down to the nearest multiple of `align` to ensure that the allocated space has the object’s alignment. At this point, we check if we are down past the start of our memory chunk, in which case we don’t have the capacity to fulfill this allocation, and we return null. Otherwise, we update the bump pointer to its new value and return the pointer!
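Putting that together, a sketch of the downward-bumping routine (same illustrative names as before):

```rust
impl BumpDown {
    pub unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
        debug_assert!(align.is_power_of_two());

        let ptr = self.ptr as usize;

        // Subtract the size from the bump pointer, checking for overflow.
        let new_ptr = match ptr.checked_sub(size) {
            Some(p) => p,
            None => return std::ptr::null_mut(),
        };

        // Round down to the requested alignment. Masking can't overflow,
        // so there is no check here.
        let new_ptr = new_ptr & !(align - 1);

        if new_ptr < self.start as usize {
            // Not enough capacity left in the chunk!
            return std::ptr::null_mut();
        }

        // Commit the new bump pointer, which points at the fresh space.
        self.ptr = new_ptr as *mut u8;
        self.ptr
    }
}
```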
And here is the x86-64 code generated for this downward-bumping allocation routine!
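(Again a representative sketch rather than verbatim compiler output, with the same assumed field offsets:)

```asm
; rdi = &mut BumpDown, rsi = size, rdx = align (a power of two)
bump_down:
    mov  rax, [rdi + 16]   ; load the bump pointer
    sub  rax, rsi          ; ptr - size...
    jb   .fail             ; ...did the subtraction overflow?
    neg  rdx               ; -align == !(align - 1)
    and  rax, rdx          ; round down to align: a mask, no check needed
    cmp  rax, [rdi]        ; compare new_ptr with start
    jb   .fail             ; not enough capacity?
    mov  [rdi + 16], rax   ; commit; rax is also the return value
    ret
.fail:
    xor  eax, eax          ; return null
    ret
```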
Because rounding down doesn’t require an addition or subtraction operation, it doesn’t have an associated overflow check. That means one less conditional branch in the generated code, and downward bumping only has two conditional branches versus the three that upward bumping has.
Additionally, because we don’t need to save the original bump pointer value, this version uses fewer registers than the upward-bumping version. Bump allocation functions are designed to be fast paths that are inlined into callers, which means that downward bumping creates less register pressure at every call site.
Finally, this downwards-bumping version is implemented with eleven instructions, while the upwards-bumping version requires thirteen instructions. In general, fewer instructions implies a shorter run time.
Benchmarks
I recently switched the `bumpalo` crate from bumping upwards to bumping downwards. It has a nice, little micro-benchmark suite that is written with the excellent, statistics-driven Criterion.rs benchmarking framework. With Criterion’s built-in support for defining a baseline measurement and comparing an alternate implementation of the code against it, I compared the new, downwards-bumping implementation against the original, upwards-bumping implementation.
The new, downwards-bumping implementation has up to 19% better allocation throughput than the original, upwards-bumping implementation! We’re down to 2.7 nanoseconds per allocation.
The plot below shows the probability of allocating 10,000 small objects taking a certain amount of time. The red curve represents the old, upwards-bumping implementation, while the blue curve shows the new, downwards-bumping implementation. The lines represent the mean time.
You can view the complete, nitty-gritty benchmark results in the pull request.
The One Downside: Losing a `realloc` Fast Path
`bumpalo` doesn’t only provide an allocation method, it also provides a `realloc` method to resize an existing allocation. `realloc` is O(n) because in the worst-case scenario it needs to allocate a whole new region of memory and copy the data from the old to the new region. But the old, upwards-bumping implementation had a fast path for growing the last allocation: it would add the delta size to the bump pointer, leaving the allocation in place and avoiding that copy. The new, downwards-bumping implementation also has a fast path for resizing the last allocation, but even if we reuse that space, the start of the allocated region of memory has shifted, and so we can’t avoid the data copy.
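For illustration, the old fast path looked roughly like this (a sketch against the `BumpUp` type from above; `grow_last` is a hypothetical method name, not bumpalo’s actual `realloc` entry point):

```rust
impl BumpUp {
    /// Try to grow the most recent allocation in place, without copying.
    /// Returns null if `last` isn't the most recent allocation or if
    /// there isn't enough capacity to grow it.
    pub unsafe fn grow_last(
        &mut self,
        last: *mut u8,
        old_size: usize,
        new_size: usize,
    ) -> *mut u8 {
        debug_assert!(new_size >= old_size);

        // `last` was the most recent allocation iff it ends exactly at
        // the current bump pointer.
        if last as usize + old_size != self.ptr as usize {
            return std::ptr::null_mut();
        }

        let new_end = match (last as usize).checked_add(new_size) {
            Some(p) => p,
            None => return std::ptr::null_mut(),
        };
        if new_end > self.end as usize {
            return std::ptr::null_mut();
        }

        // Bump the pointer over by the size delta; the allocation's
        // start doesn't move, so no bytes are copied.
        self.ptr = new_end as *mut u8;
        last
    }
}
```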
The loss of that fast path leads to a 4% slowdown in our `realloc` benchmark that formats a string into a bump-allocated buffer, triggering a number of `realloc`s as the string is constructed. We felt that this was worth the trade off for faster allocation.
Less Work with More Alignment?
It is rare for types to require more than word alignment. We could enforce a minimum alignment on the bump pointer at all times that is greater than or equal to the vast majority of our allocations’ alignment requirements. If our allocation routine is monomorphized for the type of the allocation it’s making, or it is aggressively inlined — and it definitely should be — then we should be able to completely avoid generating any code to align the bump pointer in most cases, including the conditional branch on overflow if we are rounding up for upwards bumping.
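As a sketch of what that could look like for the downward bumper (the `MIN_ALIGN` constant and the `alloc_typed` method are illustrative assumptions, not bumpalo’s actual API):

```rust
use std::mem;

// An assumed minimum alignment that the bump pointer maintains at all
// times; word alignment on typical 64-bit targets.
const MIN_ALIGN: usize = 8;

impl BumpDown {
    pub unsafe fn alloc_typed<T>(&mut self) -> *mut T {
        // Round the size up to `MIN_ALIGN` so that subtracting it keeps
        // the bump pointer aligned. `size_of` is bounded well below
        // `usize::MAX`, so this addition can't overflow, and after
        // monomorphization it is a compile-time constant.
        let size = (mem::size_of::<T>() + MIN_ALIGN - 1) & !(MIN_ALIGN - 1);

        let new_ptr = match (self.ptr as usize).checked_sub(size) {
            Some(p) => p,
            None => return std::ptr::null_mut(),
        };

        // Only types with unusually large alignment need any rounding;
        // in the common case this whole branch constant-folds away.
        let align = mem::align_of::<T>();
        let new_ptr = if align > MIN_ALIGN {
            new_ptr & !(align - 1)
        } else {
            new_ptr
        };

        if new_ptr < self.start as usize {
            return std::ptr::null_mut();
        }

        self.ptr = new_ptr as *mut u8;
        self.ptr as *mut T
    }
}
```

With `MIN_ALIGN = 8`, for example, allocating a `u32` burns four bytes of padding per allocation, which is exactly the wasted space described next.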
The trade off is extra memory overhead from introducing wasted space between small allocations that don’t require that extra alignment.
Conclusion
If you are writing your own bump allocator, you should bump downwards: initialize the bump pointer to the end of the chunk of memory you are allocating from within, and decrement it on each allocation so that it moves down towards the start of the memory chunk. Downwards bumping requires fewer registers, fewer instructions, and fewer conditional branches. Ultimately, that makes it faster than bumping upwards.
The one exception is if, for some reason, you frequently use `realloc` to grow the last allocation you made, in which case you might get more out of a fast path for growing the last allocation in place without copying any data. And if you do decide to bump upwards, then you should strongly consider enforcing a minimum alignment on the bump pointer to recover some of the performance that you’re otherwise leaving on the table.
Finally, I’d like to thank Jim Blandy, Alex Crichton, Jeena Lee, and Jason Orendorff for reading an early draft of this blog post, for discussing these ideas with me, and for being super friends :)
⁰ The simple way to round `n` up to a multiple of `align` is

```
(n + align - 1) / align * align
```

Consider the numerator: `n + align - 1`. This is ensuring that if there is any remainder for `n / align`, then the result of the division sub-expression is one greater than `n / align`, and that otherwise we get exactly the same result as `n / align` due to integer division rounding off the remainder. In other words, we only round up if `n` is not aligned to `align`.

However, we know `align` is a power of two, and therefore `anything / align` is equivalent to `anything >> log2(align)` and `anything * align` is equivalent to `anything << log2(align)`. We can therefore rewrite our expression into:

```
(n + align - 1) >> log2(align) << log2(align)
```

But shifting a value right by some number of bits `b` and then shifting it left by that same number of bits `b` is equivalent to clearing the bottom `b` bits of the number. We can clear the bottom `b` bits of a number by bit-wise and’ing the number with the bit-wise not of `2^b - 1`. Plugging this into our equation and simplifying, we get:

```
(n + align - 1) >> log2(align) << log2(align)
    = (n + align - 1) & !(2^log2(align) - 1)
    = (n + align - 1) & !(align - 1)
```
And now we have our final version of rounding up to a multiple of a power of two!
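As a quick sanity check (a sketch, with `round_up` as an illustrative helper):

```rust
/// Round `n` up to the nearest multiple of `align`.
/// `align` must be a power of two.
fn round_up(n: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (n + align - 1) & !(align - 1)
}

fn main() {
    assert_eq!(round_up(12, 8), 16); // 12 isn't aligned, so it rounds up
    assert_eq!(round_up(16, 8), 16); // already aligned, stays put
}
```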
If you find these bit twiddling hacks fun, definitely find yourself a copy of Hacker’s Delight. It’s a wonderful book!