A Wee Allocator for WebAssembly
Recently, we introduced WebAssembly (compiled from Rust) into the source-map
JavaScript library.
We saw parsing and querying source maps get a whole lot faster. But keeping the
compiled .wasm
code size small was a challenge.
There are a variety of tools available for removing dead code from WebAssembly
and compressing .wasm
sizes:
- wasm-gc constructs the callgraph of all the functions in a .wasm file, determines which functions are never transitively called by an exported function, and removes them.0
- wasm-snip replaces a WebAssembly function’s body with a single unreachable instruction. This is useful for manually removing functions that will never be called at runtime, but which the compiler and wasm-gc couldn’t statically prove were dead. After snipping a function, other functions that were only called by the snipped function become unreachable, so running wasm-gc again after wasm-snip often yields further code size reductions.
- wasm-opt runs binaryen’s sophisticated optimization passes over a .wasm file, both shrinking its size and improving its runtime performance.
After using these tools, we looked at what was left in our .wasm
file, and
broke the results down by crate:
Half of our code size was coming from dlmalloc, the allocator that Rust uses by default for the wasm32-unknown-unknown target. Source map parsing requires just a couple of large, long-lived allocations, and then does its heavy lifting without allocating any further. Querying a parsed source map doesn’t require any allocation. So we are paying a lot, in code size, for an allocator that we aren’t really using that much.
Allocator implementations have trade-offs, but Rust’s default is the wrong choice for the source map parsing and querying scenario.
Introducing wee_alloc
wee_alloc
is a work-in-progress memory allocator designed for
WebAssembly. It has a tiny code size footprint, compiling down to only a
kilobyte of .wasm
code.
wee_alloc
is designed for scenarios like those described above: where there
are a handful of initial, long-lived allocations, after which the code does its
heavy lifting without any further allocations. This scenario requires some
allocator to exist, but we are more than happy to trade performance for small
code size.
In contrast, wee_alloc
is not designed for, and would be a poor choice in,
scenarios where allocation is a performance bottleneck.
Although WebAssembly is the primary target, wee_alloc also has an mmap-based implementation for unix systems. This enables testing wee_alloc, and using wee_alloc in your own code, without a browser or WebAssembly engine.
How wee_alloc Works
Allocating WebAssembly Pages
WebAssembly module instances have a linear memory space1, and use store and load instructions to access values within it via an index. If an instruction attempts to access the value at an index that is beyond the memory’s bounds, a trap is raised.
There are two instructions for manipulating the linear memory space itself, rather than its contents: current_memory and grow_memory2. The current_memory instruction gives the current size of memory, in units of pages. The grow_memory instruction takes an operand n, grows the memory space by n pages, and gives back the old size of memory, in units of pages. Alternatively, if growing memory fails, -1 is returned.
WebAssembly does not have any facilities for shrinking memory, at least right now.
To implement allocating n pages of memory in Rust, we need to use LLVM’s intrinsics. Right now, the intrinsic for grow_memory doesn’t expose the return value, but this should be fixed once Rust updates its LLVM. Therefore, our page allocation routine must use current_memory before grow_memory, when it should just use the result of grow_memory instead. It also means we can’t check for failure yet.
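As a stand-in for the routine itself, here is a minimal sketch of what page allocation might look like, assuming extern declarations standing in for the current_memory and grow_memory intrinsics (the real bindings are toolchain-specific and elided here), a PAGE_SIZE_IN_BYTES constant of my own, and a Pages newtype like the one described below:

```rust
// A sketch, not wee_alloc's exact code.
extern "C" {
    /// Current size of linear memory, in pages.
    fn current_memory() -> usize;
    /// Grow linear memory by `delta` pages. The instruction returns the old
    /// size (or -1 on failure), but the intrinsic doesn't expose it yet.
    fn grow_memory(delta: usize);
}

/// Unit newtype for a count of pages (see the discussion below).
pub struct Pages(pub usize);

/// The WebAssembly page size is 64KiB.
const PAGE_SIZE_IN_BYTES: usize = 64 * 1024;

/// Allocate `n` fresh pages from the WebAssembly engine and return a pointer
/// to the start of the newly grown region.
unsafe fn alloc_pages(n: Pages) -> *mut u8 {
    // Because we can't see `grow_memory`'s result, ask for the current size
    // first: the new pages begin where memory currently ends. This also means
    // we cannot detect failure yet.
    let start_in_bytes = current_memory() * PAGE_SIZE_IN_BYTES;
    grow_memory(n.0);
    start_in_bytes as *mut u8
}
```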
Careful readers will have noticed the Pages type in alloc_pages’s type signature. wee_alloc uses newtypes for units of bytes, words, and pages. Each of these is a thin wrapper around a usize with relevant operator overloads and inter-conversions. This has been very helpful in catching bugs at compile time, like attempts to offset a pointer by two words rather than two bytes, and compiles away to nothing in the emitted .wasm.
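For illustration, here is a minimal sketch of what such unit newtypes can look like; the type names match the ones described above, but wee_alloc’s actual operator overloads and conversions are more extensive than shown:

```rust
use core::mem::size_of;
use core::ops::Add;

pub struct Bytes(pub usize);
pub struct Words(pub usize);
pub struct Pages(pub usize);

// Converting a count of words or pages into bytes is explicit, so mixing up
// units no longer type checks.
impl From<Words> for Bytes {
    fn from(w: Words) -> Bytes {
        Bytes(w.0 * size_of::<usize>())
    }
}

impl From<Pages> for Bytes {
    fn from(p: Pages) -> Bytes {
        // The WebAssembly page size is 64KiB.
        Bytes(p.0 * 64 * 1024)
    }
}

// Operator overloads only combine like units.
impl Add for Bytes {
    type Output = Bytes;
    fn add(self, rhs: Bytes) -> Bytes {
        Bytes(self.0 + rhs.0)
    }
}
```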
Free Lists
But we don’t satisfy individual allocation requests by directly allocating pages. First, the WebAssembly page size is 64KiB, which is much larger than most allocations. Second, because there is no way to return unused pages to the WebAssembly engine, it would be incredibly wasteful if we didn’t reuse pages. Instead, we maintain a free list of blocks of memory we’ve already allocated from the WebAssembly engine.
Free lists have low complexity and are easy to implement. These properties also
lend themselves to a small implementation. The basic idea is to maintain an
intrusive linked list of memory blocks that are available. Allocation removes a
block from the free list. If the block is larger than the requested allocation
size, we can split it in two. Or, if there is no suitable block available in the
free list, we can fall back to alloc_pages
to get a fresh block of
memory. Deallocation reinserts the given block back into the free list, so that
it can be reused for future allocations. Because a block is only in the free
list if it is not allocated and is therefore unused, we can use a word from the
data itself to store the free list links, so long as we ensure that the data is
always at least a word in size.
Here is a diagram of what the free list looks like in memory, showing a free block, followed by an allocated block, followed by another free block:
--. .--------------------------------------. ,-----------
| | | |
V | V |
+----------------\\-----+---------\\-----+---------------
| free ; data... // ... | data... // ... | free ; data...
+----------------\\-----+---------\\-----+---------------
Even after choosing to use free lists, we have more design choices to make. How should we choose which block to use from the free list? The first that can satisfy this allocation, aka first fit? The block that is closest to the requested allocation size, in an effort to cut down on fragmentation (more on this in a minute), aka best fit? Pick up the search where we left off last time, aka next fit? Regardless of which of first fit, best fit, and next fit we choose, we are dealing with an O(n) search. Indeed, this is the downside to the trade-off we made when choosing free lists for their simplicity of implementation.
A common technique for speeding up free list allocation is to have separate free lists for allocations of different sizes, which is known as segregated fits or having size classes. With this approach, we can guarantee the invariant that every block in the free list for a particular size can satisfy an allocation request of that size. All we need to do is avoid splitting any block into pieces smaller than that size.
Maintaining this invariant gives us amortized O(1) allocation for these sizes with their own free lists. Using first fit within a size’s free list is guaranteed to only look at the first block within the free list, because the invariant tells us that the first block can satisfy this request. It is amortized because, whenever a size’s free list is empty, we must fall back to the O(n) allocation on the main free list to refill it.
If we reuse the same first fit allocation routine for both our size classes’ free lists and our main free list, then we get the benefits of size classes without paying for them in extra code size.
This is the approach that wee_alloc
takes.
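As a concrete (and hypothetical: the real cutoff and indexing scheme may differ) illustration of how an allocation request could be routed to a size class, the lookup can be as simple as:

```rust
use core::mem::size_of;

/// Hypothetical upper bound on which sizes get their own free list; larger
/// requests fall back to the main free list.
const NUM_SIZE_CLASSES: usize = 256;

/// Map a requested allocation size in bytes to a size class index, if there
/// is a free list dedicated to that size.
fn size_class_index(size_in_bytes: usize) -> Option<usize> {
    let word = size_of::<usize>();
    // Round up to a whole number of words; every block is at least a word.
    let words = (size_in_bytes + word - 1) / word;
    if words >= 1 && words <= NUM_SIZE_CLASSES {
        Some(words - 1)
    } else {
        None
    }
}
```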
Fragmentation
Our other main concern is avoiding fragmentation. Fragmentation is the degree of wasted space between allocations in memory. High fragmentation can lead to situations where there are many free blocks of memory, but none of them can fulfill some allocation request, because each individual free block’s size is too small, even if the sum of their sizes is more than enough for the requested allocation. Therefore, a high degree of fragmentation can effectively break an allocator. It had one job — allocate memory — and it can’t even do that anymore. So wee_alloc really should have some kind of story here; punting 100% on fragmentation is not a practical option.
Once again there are trade-offs, and avoiding fragmentation is not a binary choice. On one end of the spectrum, compacting garbage collectors can re-arrange objects in memory and pack them tightly next to each other, effectively leading to zero fragmentation. The cost you pay is the size and time overhead of a full tracing garbage collector that can enumerate all pointers in the system and patch them to point to moved objects’ new locations. On the opposite end of the spectrum, if we never re-consolidate two adjacent blocks of memory that we previously split from what had originally been a single contiguous block, then we can expect a lot of fragmentation. As we split blocks into smaller blocks for smaller allocations, we get small bits of wasted space between them, and even after we free all these small allocations, we won’t have any large block in the free list for a large allocation, even if we have multiple adjacent, small blocks in the free list that could be merged together to satisfy it.
One possibility is keeping the free list sorted by each block’s address, and then deallocating a block would re-insert it into the free list at the sorted location. If either of its neighbors in the free list are immediately adjacent in memory, we could consolidate them. But then deallocation is an O(n) search through the free list, instead of the O(1) push onto its front. We could lower that to O(log n) by representing the free list as a balanced search tree or btree. But the implementation complexity goes up, and I suspect code size will go up with it.
Instead of a free list, we could use bitmaps to track which portions of our heap are free or allocated, and then consolidation would happen automatically as adjacent bits are reset. But then we need to restrict our allocator to parceling out portions of a single, contiguous region of memory. This implies that only a single, global allocator can exist, since if there were multiple, each instance would want to “own” the end of the WebAssembly linear memory, and have the power to grow it to satisfy more and larger allocations. And maybe this is a fair constraint to impose in the context of WebAssembly, where memory is already linear and contiguous. But lifting this constraint, while still using bitmaps, implies a hybrid free list and bitmap implementation. The downside to that is more implementation complexity, and a larger code size footprint.
wee_alloc
takes a third approach: trading some space overhead for easy and
fast merging. We maintain a sorted, doubly-linked list of all blocks, whether
allocated or free. This adds two words of space overhead to every heap
allocation. When freeing a block, we check if either of its adjacent blocks are
also free, and if so, merge them together with a handful of updates to the next
and previous pointers. If neither of the neighbors are free, then we push this
block onto the front of the free list. In this way, we keep both O(1)
deallocation and our simple free list implementation.
Here is a diagram of what this sorted, doubly-linked list looks like in memory:
,---------------------------------. ,---------------------
| ,--|---------------------------|--.
| X | | | |
V | V | V |
+-----------------------\\-----+-----------------------\\-----+----------------------
| prev ; next ; data... // ... | prev ; next ; data... // ... | prev ; next ; data...
+-----------------------\\-----+-----------------------\\-----+----------------------
| ^ | ^ |
| | | | |
`---------------------' `---------------------' `------------
CellHeader, FreeCell, and AllocatedCell
The CellHeader contains the common data members found within both allocated and free memory blocks: the next and previous doubly-linked list pointers. We use a low bit of the next_cell_raw pointer to distinguish whether the cell is allocated or free, and can consult its value to dynamically cast to an AllocatedCell or FreeCell.
We use pointer arithmetic to calculate the size of a given cell’s data to avoid
another word of space overhead, so the next_cell_raw
pointer must always point
just after this cell’s data. But, because of that restriction, we can’t use a
null pointer as the sentinel for the end of the doubly-linked-list. Therefore,
we use the second low bit of the next_cell_raw
pointer to distinguish whether
the data pointed to by next_cell_raw
(after the appropriate masking) is a
valid cell, or is garbage memory.
An AllocatedCell
is a CellHeader
followed by data that is allocated.
A FreeCell
is a CellHeader
followed by data that is not in use, and from
which we recycle a word for the next link in the free list.
Each of AllocatedCell
and FreeCell
have methods that make sense only when
the cell is allocated or free, respectively, and maintain the invariants
required for cells of their state. For example, the method for transforming a
FreeCell
into an AllocatedCell
ensures that the IS_ALLOCATED
bit gets set,
and the method for transforming an AllocatedCell
into a FreeCell
unsets that
bit.
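Putting that together, here is a sketch of the cell types and tag bits. The IS_ALLOCATED name and the next_cell_raw pointer come from the discussion above; the other field and constant names (and the exact representation) are illustrative rather than copied from wee_alloc:

```rust
use core::cell::Cell;

// Tag bits stored in the low bits of `next_cell_raw` (cells are always at
// least word-aligned, so these bits are otherwise unused).
const IS_ALLOCATED: usize = 0b01;
const NEXT_CELL_IS_INVALID: usize = 0b10;
const TAG_MASK: usize = IS_ALLOCATED | NEXT_CELL_IS_INVALID;

#[repr(C)]
struct CellHeader {
    /// Tagged pointer to the next adjacent cell. It always points just past
    /// this cell's data, so a cell's size is the distance from its data to
    /// this pointer.
    next_cell_raw: Cell<usize>,
    /// Pointer to the previous adjacent cell.
    prev_cell_raw: Cell<*const CellHeader>,
}

#[repr(C)]
struct AllocatedCell {
    header: CellHeader,
    // ...the allocated data follows the header in memory.
}

#[repr(C)]
struct FreeCell {
    header: CellHeader,
    /// A word recycled from the unused data, linking the free list.
    next_free_raw: Cell<*const FreeCell>,
}

impl CellHeader {
    fn is_allocated(&self) -> bool {
        self.next_cell_raw.get() & IS_ALLOCATED != 0
    }

    fn next_cell_is_invalid(&self) -> bool {
        self.next_cell_raw.get() & NEXT_CELL_IS_INVALID != 0
    }

    /// The next adjacent cell, with the tag bits masked off. Only valid to
    /// dereference when `next_cell_is_invalid` returns false.
    fn next_cell(&self) -> *const CellHeader {
        (self.next_cell_raw.get() & !TAG_MASK) as *const CellHeader
    }
}
```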
Implementing Allocation
Let’s begin by looking at first fit allocation without any refilling of the free list when no available block can satisfy the allocation request. Given the head of a free list, we search for the first block that can fit the requested allocation. Upon finding a suitable block, we determine whether to split it in two or use it as is. If we don’t find a suitable block, we return an error.
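To make the shape of that search concrete, here is a toy model of the first-fit walk over a deliberately simplified free list of sized blocks; the Block type, MIN_BLOCK_SIZE constant, and first_fit function are illustrative only, whereas the real implementation works on FreeCells, returns a pointer into the block, and consults an allocation policy for the minimum split size:

```rust
/// A deliberately simplified free-list node: just a size and a link.
struct Block {
    size: usize,
    next: Option<Box<Block>>,
}

/// Hypothetical minimum size worth keeping as a separate block after a split.
const MIN_BLOCK_SIZE: usize = 2;

/// First-fit allocation over the toy free list: returns the size actually
/// handed out, or an error if no block is large enough.
fn first_fit(list: &mut Option<Box<Block>>, request: usize) -> Result<usize, ()> {
    // Peek at the current block's size, if there is one.
    let size = match list {
        None => return Err(()), // exhausted the list; caller must refill it
        Some(block) => block.size,
    };

    if size >= request + MIN_BLOCK_SIZE {
        // Room to split: shrink this block in place and hand out `request`.
        list.as_mut().unwrap().size -= request;
        Ok(request)
    } else if size >= request {
        // A fit, but too small to split: unlink the whole block and use it.
        let block = list.take().unwrap();
        *list = block.next;
        Ok(size)
    } else {
        // Too small: keep walking the list.
        first_fit(&mut list.as_mut().unwrap().next, request)
    }
}
```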
Splitting a cell in two occurs when a cell has room for both the requested
allocation and for another adjacent cell afterwards that is no smaller than some
minimum block size. We use the &AllocPolicy
trait object to configure this
minimum block size, among other things, for different size classes without the
code duplication that monomorphization creates. If there is room to split, then
we insert the newly split cell into the free list, remove the current cell, and
fix up the doubly-linked list of adjacent cells in the headers.
Refilling a free list when there is not a suitable block already in it is
easy. For the main free list, we allocate new pages directly from the
WebAssembly engine with the alloc_pages
function we defined earlier. For a
size class’s free list, we allocate a (relatively) large block from the main
free list. This logic is encapsulated in the two different AllocPolicy
implementations, and the AllocPolicy::new_cell_for_free_list
method.
To allocate with a fallback to refill the free list, we do just that: attempt a first fit allocation; if that fails, refill the free list by pushing a new cell onto its front, and then try a first fit allocation once again.
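That fallback logic is small enough to show in a simplified, self-contained form; the function name and closure parameters here are illustrative stand-ins for wee_alloc’s free list head and &AllocPolicy arguments:

```rust
/// Attempt a first-fit allocation; if that fails, refill the free list (by
/// pushing a new cell onto its front) and try exactly once more.
fn alloc_with_refill<T>(
    request: usize,
    first_fit: &mut impl FnMut(usize) -> Option<T>,
    refill_free_list: &mut impl FnMut(usize) -> Result<(), ()>,
) -> Result<T, ()> {
    if let Some(allocation) = first_fit(request) {
        return Ok(allocation);
    }
    // Nothing suitable in the free list: get a fresh cell from the policy
    // (new pages for the main list, a large block for a size class's list).
    refill_free_list(request)?;
    first_fit(request).ok_or(())
}
```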
But where do we get the free list heads from? The WeeAlloc
structure holds the
head of the main free list, and if size classes are enabled, the size classes’
free list heads.
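Sketched out, with a stand-in Exclusive type in place of imp::Exclusive and a made-up size class count, the structure looks something like this:

```rust
use core::cell::UnsafeCell;

/// Stand-ins for wee_alloc's internal types.
struct FreeCell;
struct Exclusive<T>(UnsafeCell<T>);

/// Hypothetical number of size classes, one per word-sized allocation size.
const SIZE_CLASS_COUNT: usize = 256;

struct WeeAlloc {
    /// Head of the main, fallback free list.
    head: Exclusive<*const FreeCell>,
    /// Heads of the per-size-class free lists (only present when the size
    /// classes feature is enabled), indexed by allocation size in words.
    size_classes: [Exclusive<*const FreeCell>; SIZE_CLASS_COUNT],
}
```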
As you can see, every free list head is wrapped in an imp::Exclusive.
The imp module contains target-specific implementation code and comes in two flavors: imp_wasm32.rs and imp_unix.rs. The alloc_pages function we saw earlier is defined in imp_wasm32.rs. There is another alloc_pages function that uses mmap inside imp_unix.rs. The imp::Exclusive wrapper type guarantees exclusive access to its inner value. For WebAssembly, this is a no-op, since SharedArrayBuffers aren’t shipping and there is no shared-data threading. For unix systems, this protects the inner value in a pthread mutex, and is similar to std::sync::Mutex but provides a FnOnce interface rather than an RAII guard.
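A minimal sketch of that interface’s shape (single-threaded, like the wasm32 flavor) might look like the following; the method name is illustrative, and a unix implementation would additionally hold a pthread mutex around the call:

```rust
use core::cell::UnsafeCell;

pub struct Exclusive<T> {
    inner: UnsafeCell<T>,
}

impl<T> Exclusive<T> {
    pub const fn new(value: T) -> Self {
        Exclusive { inner: UnsafeCell::new(value) }
    }

    /// Run `f` with exclusive access to the inner value. On wasm32 this is
    /// effectively a plain call; a unix implementation would lock a mutex
    /// for the duration of `f`.
    ///
    /// Unsafe because the caller must guarantee there is no re-entrant use.
    pub unsafe fn with_exclusive_access<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        f(&mut *self.inner.get())
    }
}
```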
If size classes are not enabled, we always use the main free list head and the LargeAllocPolicy. If size classes are enabled, we try to get the appropriate size class’s free list head, and if that works, then we use the SizeClassAllocPolicy with it. If there is no size class for the requested allocation size, then we fall back to the main free list and the LargeAllocPolicy.
Finally, all that is left is to tie everything together to implement the alloc method for the Alloc trait.
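The Alloc trait was still unstable at the time, so here is a sketch of the entry point’s shape written against the stable GlobalAlloc trait instead, with the internals stubbed out; the real method wires together the policy selection and first-fit-with-refill logic described above:

```rust
// A sketch only: `WeeAlloc` and its helper are stubbed so the shape is clear.
use core::alloc::{GlobalAlloc, Layout};
use core::ptr;

struct WeeAlloc;

impl WeeAlloc {
    /// Stand-in for "pick the size class's free list and policy (or the main
    /// free list and LargeAllocPolicy), then do first-fit with refill".
    unsafe fn alloc_impl(&self, _layout: Layout) -> Result<*mut u8, ()> {
        unimplemented!()
    }
}

unsafe impl GlobalAlloc for WeeAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.alloc_impl(layout).unwrap_or(ptr::null_mut())
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        // Covered in the next section.
    }
}
```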
Implementing Deallocation
Deallocation either merges the just-freed block with one of its adjacent neighbors, if they are also free, or it pushes the block onto the front of the free list.
If we are reinserting a block into a size class’s free list, however, it doesn’t
make sense to merge blocks. Because these free lists are always servicing
allocations of a single size, we would just end up re-splitting the merged block
back exactly as it is split now. There is no benefit to splitting and merging
and splitting again. Therefore, we have the AllocPolicy
inform us whether
merging is desirable or not.
First, let’s examine deallocation without the details of merging. We get the
appropriate free list and allocation policy, and conjure up a reference to the
AllocatedCell
that sits just before the data being freed. Then (assuming we
didn’t merge into another block that is already in the free list) we push the
block onto the front of the free list.
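In the same toy free-list model used for the allocation sketch earlier, the non-merging path is just a push onto the front of the list:

```rust
/// The same deliberately simplified free-list node as in the allocation sketch.
struct Block {
    size: usize,
    next: Option<Box<Block>>,
}

/// Deallocation without merging: an O(1) push of the freed block onto the
/// front of the free list.
fn push_front(list: &mut Option<Box<Block>>, freed_size: usize) {
    let freed = Box::new(Block {
        size: freed_size,
        next: list.take(),
    });
    *list = Some(freed);
}
```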
When merging cells, the adjacent neighbor(s) we are merging into are also free, and therefore are already inside the free list. Because our free list is singly-linked, rather than doubly-linked, we can’t arbitrarily splice in new elements when we have a handle to an element that is already in the free list. This causes some hiccups.
Merging with the previous adjacent cell is still easy: it is already in the
free list, and we aren’t changing the location of the CellHeader
, so folding
this cell into it is all that needs to be done. The free list can be left alone.
Merging with the next adjacent cell is a little harder. It is already in the
free list, but we need to splice it out from the free list, since its header
will become invalid after consolidation, and it is this cell’s header that
needs to be in the free list. But, because the free list is singly-linked, we
don’t have access to the pointer pointing to the soon-to-be-invalid header, and
therefore can’t update that pointer to point to the new cell header. So instead
we have a delayed consolidation scheme. We insert this cell just after the next
adjacent cell in the free list, and set the next adjacent cell’s
NEXT_FREE_CELL_CAN_MERGE
bit.
Then, the next time that we walk the free list for allocation, the bit will be checked and the consolidation will happen at that time. This means that the walk_free_list routine described earlier is incomplete: the complete version also checks the NEXT_FREE_CELL_CAN_MERGE bit on each cell it visits, and performs the delayed consolidation before considering that cell for allocation.
On the other hand, if both the previous and next adjacent cells are free, we are faced with a dilemma. We cannot merge all of the previous, current, and next cells together because our singly-linked free list doesn’t allow for that kind of arbitrary appending and splicing in O(1) time. Instead, we use a heuristic to choose whether to merge with the previous or next adjacent cell. We could choose whichever neighbor cell is smaller or larger, but we don’t. Right now, we prefer the previous adjacent cell because we can greedily consolidate with it immediately, whereas consolidation with the next adjacent cell must be delayed, as explained above.
If we made the minimum allocation size two words, then we would have room for a doubly-linked free list, and could support consolidating the previous, current, and next free cell neighbors. We could also remove the delayed consolidation scheme, which would further simplify a bunch of code. But it would mean effectively three words of overhead for single-word heap allocations. It isn’t clear to me whether that trade-off is worth it. To make an informed decision, we’d need a corpus of allocations and frees made by typical WebAssembly applications.
Conclusion
wee_alloc
is a work-in-progress prototype, but it already meets its goal of
being small.
However, it is certainly lacking in other areas. Sergey Pepyakin tried bootstrapping rustc with wee_alloc enabled as the global allocator, and it is a couple orders of magnitude slower than bootstrapping rustc with its default jemalloc. On the one hand, bootstrapping rustc isn’t exactly the scenario wee_alloc was designed for, and it isn’t surprising that this new, unoptimized prototype is slower than the mature, production-grade jemalloc. Furthermore, I doubt that we can compete with jemalloc on speed without regressing code size. But even so, I think wee_alloc is slower than it should be, and I suspect that there are some low-hanging fruit waiting to be plucked.
I would love, love, love some help building wee_alloc
— it’s a lot
of fun! Are you interested in code golfing the smallest .wasm
binaries? Do you
like nitty-gritty profiling and digging into performance? Is wee_alloc
missing
some obviously-better technique that you’re familiar with? Want to help Rust be
the number one choice for compiling to WebAssembly? Fork the wee_alloc
repository on GitHub and then come join us in #rust-wasm
on
irc.mozilla.org
and introduce yourself!
Thanks to Mike Cooper and David Keeler for reading early drafts and providing valuable feedback.
0 This will be unnecessary soon, since LLVM’s linker, lld, is gaining support for WebAssembly, and already supports this functionality. The Rust toolchain will also start using lld and then we won’t need wasm-gc anymore. ↩
1 Right now there is a single linear memory, but it is expected that more address spaces will come. They would be useful for, for example, referencing garbage-collected JavaScript or DOM objects directly from WebAssembly code. ↩
2 The current_memory and grow_memory instructions will likely be renamed to mem.size and mem.grow. ↩