New Stack Maps for Wasmtime and Cranelift
Note: This post was originally published on the Bytecode Alliance blog.
As part of implementing the WebAssembly garbage collection proposal in Wasmtime, which is an ongoing process, we’ve overhauled the stack map infrastructure in Cranelift. This post will explain what stack maps are, why we needed to change them, and how the new stack maps work.
Wasmtime is a lightweight WebAssembly runtime that is fast, secure, and
standards-compliant. Cranelift is its compiler backend, which aims to strike a
balance between code quality and compile time, while maintaining exceptionally
high standards for correctness. The garbage collection proposal for
WebAssembly extends the WebAssembly language with runtime-managed
references to struct
s and array
s.
Background: Garbage Collection, Safepoints, and Stack Maps
In the garbage collection (GC) literature, we call the program that allocates objects and adds or removes edges between them the mutator. It mutates the heap graph.
As the mutator runs, allocating objects and filling up the GC heap, the runtime must eventually perform a GC. But the runtime can’t collect garbage at any arbitrary moment: the mutator might be in the middle of manipulating some objects, keeping them in its stack frame or in registers, and effectively hiding them from the runtime. The runtime must know about all such objects, or else it might reclaim an object that it believes is garbage, but which the mutator is still using, leading to use-after-free bugs. This is where safepoints and stack maps come in.
A safepoint is a program point in the mutator (that is, an instruction in the
compiled WebAssembly program) where it is safe for the runtime to collect
garbage. When the mutator calls out to a runtime routine, for example, that
call
instruction would typically be a safepoint, under the assumption that the
runtime might need to GC while running the routine.
The mutator’s compiler does two things to make a safepoint safe, in the presence of objects that the mutator is still actively working with:
-
Insert instructions to move the objects from (say) volatile registers that a
call
safepoint might clobber to well-known locations in the stack frame. -
Emit stack map records that describe, for each safepoint in the mutator program, the locations where to find those still-in-active-use objects inside the stack frame.
While collecting garbage, the runtime can walk the mutator’s stack frames and, using those stack maps, identify all of the objects that the mutator is still manipulating. No more potential use-after-collected bugs!
But there is a final twist: if the GC is a moving collector, for example a generational or compacting collector, then an object may be relocated in memory during collection. If an object is moved, then all references to the object must also be updated to refer to the new location. Failure to update a reference leads to use-after-move bugs like accessing newly-allocated objects that happened to be placed in the original object’s old location. So, in the presence of a moving collector, the runtime must additionally update the mutator’s stack frame, redirecting each reference noted in the stack map so that it now points to its object’s new memory location. The compiler must, in turn, insert reloads of each GC-managed reference from its stack map location that it was spilled to — that is, from the location where the runtime potentially overwrote the old reference with a new, redirected reference — and back into a register so that the mutator may continue to manipulate the object.
Cranelift’s Old Approach: Stack Maps During Register Allocation
Cranelift previously generated stack maps during register allocation. Our register allocator naturally tracks the live range of each value in order to determine whether the same hardware register may or may not be reused for two values. At each safepoint, it would insert spills for each of the live values that are managed by the GC, and record those spilled locations in a stack map for that program point. This approach is straightforward on the surface, but led to a number of subtle downsides and complications.
First, because register allocation is so late in the compiler pipeline, the previous system involved tracking GC-managed references through almost the entire compiler: from the WebAssembly frontend, through the ISA-agnostic mid-end, through the ISA-specific backends, and into register allocation. The only part of the pipeline that didn’t have to preserve knowledge of GC-managed references was the final instruction encoding and binary emission. In general, the fewer invariants we must maintain, the simpler our compiler’s implementation. If we could avoid precisely tracking GC references throughout most of Cranelift, we could expect less complexity and, in general, less complexity would mean fewer bugs. The complexity of the coordination required between the runtime and the compiler to correctly implement garbage collection and stack maps has bitten us before.
Second, to make tracking GC-managed values easier, we distinguished them from plain pointers and integers with dedicated types in Cranelift’s intermediate representation (CLIF). This succeeded in making them easier to track, but led to foiling instruction selection rules and mid-end peephole rewrites that otherwise should have applied. For example, checking if a pointer is null involves simply comparing it to zero, the same way we compare any other integer to zero:
v456 = icmp_imm eq v123, 0
Checking whether GC reference was null, on the other hand, required a dedicated
is_null
instruction, despite lowering to identical machine code when it
appears in isolation:
v456 = is_null v123
This doesn’t seem like a big deal at first glance, beyond being forced to define
additional, duplicate instructions that waste our opcode space and inflate our
instruction set. However, we have a bunch of instruction selection rules for
integer comparisons followed by conditional branches, for example, that let us
avoid materializing flags on x86_64
, yielding smaller and faster machine code
in those contexts. However, because is_null
is a different instruction, these
rules’ patterns don’t match, and their optimizations are missed. The result is
that we must choose between generating subpar code or duplicating the relevant
set of instruction selection rules to also match against an is_null
instruction followed by a conditional branch.
Third, in addition to missed optimizations, we were also seeing misoptimizations: incorrect transformations that broke the input program. From its point of view, our mid-end was performing perfectly legal optimizations. But the mid-end can’t reason about effects that it can’t see, and it couldn’t see the spills and reloads of GC references associated with a safepoint because they weren’t inserted yet. This lack of visibility became an issue when we were, for example, accessing the same field of an object before and after a safepoint:
;; `v0` is a reference to a GC-managed object.
v0 = ...
;; Bitcast it to an `i64` so we can do pointer
;; arithmetic on it.
v1 = bitcast.i64 v0
;; Get the address of the field we want to
;; access.
v2 = iadd_imm v1, 8
;; Write `11` into the object field.
v3 = iconst.i32 11
store.i32 v2, v3
;; Call some function! This might trigger a GC
;; and therefore must be a safepoint.
call $f()
;; Write to same field as above, but with the
;; value `22` this time.
v4 = bitcast.i64 v0
v5 = iadd_imm v4, 8
v6 = iconst.i32 22
store.i32 v5, v6
Our mid-end performs a number of ISA-agnostic optimizations, including global value numbering (GVN) (as part of its e-graphs pass) to eliminate duplicate computations. When the mid-end processed the CLIF snippet above, it would determine that the second field-address computation was redundant, and we could instead reuse the result of the first field computation, yielding the following optimized CLIF:
v0 = ...
v1 = bitcast.i64 v0
v2 = iadd_imm v1, 8
v3 = iconst.i32 11
store.i32 v2, v3
call $f()
v6 = iconst.i32 22
;; Reusing the field address we previously
;; computed above.
store.i32 v2, v6
To reason about why this transformation is invalid — despite only
deduplicating pure, non-effectful, and otherwise-identical instructions —
we must consider context that is invisible to the mid-end: the safepoint spills
and reloads that will be inserted later by the register allocator. The second
field-address computation is not actually reusing v0
as its base, it is using
reload(v0)
as the base, but that reload
operation is hidden from the
mid-end. If the runtime collected garbage during $f
’s execution and moved the
object that v0
references, then in this optimized CLIF, the second store
is
writing to the object’s old, vacated location, which is a use-after-move
vulnerability.
We began to address these sorts of bugs in a whack-a-mole fashion, e.g. by marking bitcasts of reference types as side-effectful, but ultimately we’d rather not play this whack-a-mole game at all, reacting to bugs as we find them, never knowing how many are still lurking. Instead, we’d much prefer that the mid-end simply had the visibility it needs to optimize correctly in the first place.
A final quirk of the old system was that GC references were always represented
as pointers at the machine level. You could only use 64-bit references on 64-bit
ISAs and 32-bit references on 32-bit ISAs, and you definitely could not do
NaN
boxing. This was an annoyance for us in practice because
we’ve designed Wasmtime’s GC implementation such that our GC references are
always 32-bit integers, even on 64-bit ISAs. This design choice lets us leverage
the same sandboxing and virtual-memory techniques to elide bounds checks that
Wasmtime already implements for WebAssembly linear memories. It lets us benefit
from smaller objects and denser cache locality; because of that, it is a fairly
common technique. But,
despite storing GC references in 32 bits on the heap, Cranelift’s old system
forced us to extend those references to 64 bits when we were manipulating them
on the stack or in registers when targeting a 64-bit ISA.
The New Approach: User Stack Maps
Given the complications of Cranelift’s previous system for tracking GC references across safepoints and emitting stack maps, we decided to implement a new approach. Instead of generating safepoint spills, reloads, and stack maps at the end of the compiler pipeline, the user is responsible for producing CLIF that already contains those spills and reloads, as well as annotating each safepoint with the virtual stack slots live GC references were spilled into. The mid-end, backends, and register allocator have no extra responsibilities other than forwarding these annotations through to binary emission, at which point we can translate the virtual stack slots into precise stack-pointer offsets and emit stack maps. We dubbed this new approach “user stack maps” because their correctness largely falls upon the Cranelift user, rather than the core of Cranelift itself.
User stack maps mean that we no longer need to precisely track GC references
throughout the whole compiler pipeline. This simplification allows us us to
remove the distinct reference types from CLIF, as well as their related
instructions, like is_null
. This means no more missed optimizations or
duplicated instruction selection rules.
With user stack maps, we no longer have to worry about safepoint-specific miscompilations in the mid-end. Let’s revisit the previous miscompilation example, where we wrote to a GC object’s field before and after a safepoint. This is what the initial CLIF looks like with user stack maps, where the CLIF producer has already inserted the spills and reloads for the safepoint:
;; Set the field to `11`
v0 = ...
v1 = iadd_imm v0, 8
v2 = iconst.i32 11
store.i32 v1, v2
;; NEW: Spill `v0` to stack slot `ss0`
;; before the safepoint.
v3 = stack_addr ss0
store.i64 notrap v3, v0
;; NEW: stack map annotation on the safepoint.
call $f(), stack_map = [i64 @ ss0]
;; NEW: Reload what was `v0` from the
;; stack slot. Don't reuse `v0` because
;; the referenced object could have
;; moved.
v4 = stack_addr ss0
v5 = load.i64 notrap v4
;; Set the field to `22`.
v6 = iadd_imm v5, 8
v7 = iconst.i32 22
store.i32 v6, v7
Looking at the updated CLIF, it is immediately obvious to the mid-end that
reusing the initial field-address computation for the second store is invalid.
The base of the second field-address computation is v5
, not v0
, and it
cannot prove that v5
and v0
are actually the same value; indeed, if the
referenced object moved in memory then they will not be the same! We’ve
avoided the previous misoptimization by construction.
Cranelift already must preserve the effects of accessing memory, and a safepoint’s spills and reloads are no different: they are explicitly present and modeled in the intermediate representation as loads and stores, the same as any other memory access. In the world of user stack maps, supporting safepoint spills and reloads is not an additional responsibility nor additional implementation complexity for the compiler.
User stack maps actually unlock more potential optimizations than what used to be possible. Our alias analysis pass, which performs redundant load elimination and store-to-load forwarding, can now see and operate upon safepoint-related spills and reloads. And once again, Cranelift’s surface area that is critical for correctness has not grown; the alias analysis pass must already preserve the effects of accessing memory.
While user stack maps are a big simplification for most of the compiler pipeline, they do shift some of that old complexity into the frontend. The mid-end, backends, and register allocator no longer need to precisely track GC references, and insert spills and reloads at safepoints, but now the frontend must. We believe that we’ve ended up at a design point with less overall complexity.
We, the Cranelift developers, couldn’t leave users (i.e. we, the Wasmtime
developers) completely on their own, without any tools to help produce correct
safepoints in CLIF. Cranelift users, including Wasmtime, generally use the
cranelift-frontend
crate to produce the initial CLIF that Cranelift takes as
input to compilation. cranelift-frontend
helps users build CLIF from their
source language, handles SSA construction, and etc…; it is an appropriate
place to help with user stack maps as well. While they are creating CLIF with
cranelift-frontend
, users can now mark any value as requiring inclusion in
stack maps, and cranelift-frontend
will automatically insert spills and
reloads of that value around safepoints, as well as annotate those safepoints
with stack map entries that record the spill locations for the runtime.
To fulfill these new obligations, the cranelift-frontend
crate performs a
simple liveness analysis for GC references, allocates and assigns stack slots
for the GC references, and then inserts spills and reloads. If this sounds
similar to register allocation, that’s because it is. The important difference
is that the frontend performs this work only for GC references, not for the
majority of non-GC-reference values in the program. When a program does not use
any GC-managed objects, which is the vast majority of WebAssembly programs
today, no overhead is imposed.
Our liveness analysis flows backwards, from uses (which mark values live) to definitions (which remove values from the live set) in the data-flow graph and from successor blocks to predecessor blocks in the control-flow graph.
We compute two live sets for each block:
-
The live-in set, which is the set of values that are live when control enters the block.
-
The live-out set, which is the set of values that are live when control exits the block.
A block’s live-out set is the union of its successors’ live-in sets. A block’s live-in set is the set of values that are still live after the block’s instructions have been processed, where uses mark values live and definitions remove them from the set.
live_in(block) = union(live_out(s) for s in successors(block))
live_out(block) = live_in(block) + uses(block) - defs(block)
Whenever we update a block’s live-in set, we must reprocess all of its predecessors, because those predecessors’ live-out sets depend on this block’s live-in set. Processing continues until the live sets stop changing and we’ve reached a fixed-point. We implement this with the classic worklist approach.
Once we have the results of the liveness analysis in hand, we do a final pass over the CLIF, assigning stack slots and inserting spills and reloads. This is also implemented as a backwards pass. Whenever the pass finds a use of a GC reference that is live across any safepoint, it allocates the reference’s stack slot if it doesn’t already have one, and replaces the use with a load from that stack slot. Upon reaching a safepoint, the pass adds user stack map entries for each of the GC references that are live across the safepoint. Finally, when the pass encounters a GC reference’s definition, it appends a spill of the reference to its associated stack slot to the definition. Then, knowing that the value will never be seen again, since we are processing SSA values in backwards order, the GC reference’s stack slot is returned to a free list and made available for any new GC references the pass has yet to encounter.
Astute readers may have noticed that replacing every use with a fresh reload may result in many redundant loads. This is not a concern because, as previously noted, our mid-end’s alias analysis can already clean up and remove these redundant loads. Separation of concerns between compiler passes lets us decrease overall complexity.
Conclusion
We’ve moved stack map generation and safepoint spills and reloads from the end of the compiler pipeline and into the frontend. It simplifies the compiler and avoids misoptimizations and missed optimizations by relying on our existing correctness invariants rather than imposing new ones to maintain. It also unlocks efficient sandboxing of the GC heap using the same techniques we use for WebAssembly’s linear memories.
We’re making steady progress implementing the WebAssembly garbage collection proposal in Wasmtime. It is not complete or ready for general use yet, but it is coming soon.
Want to start hacking on Wasmtime or Cranelift? Check out our contributor documentation!
Thanks to Trevor Elliott for partnering with me on implementing this stack maps overhaul and to Chris Fallin for design feedback, brainstorming, code review, and providing feedback on an early draft of this blog post.