Writing a Test Case Generator for a Programming Language
Generating pseudo-random WebAssembly programs
Maxime Chevalier-Boisvert requested resources for learning about fuzzing programming language implementations on Twitter:
I’d like to learn about fuzzing, specifically fuzzing programming language implementations. Do you have reading materials you would recommend, blog posts, papers, books or even recorded talks?
Maxime received many replies linking to informative papers, blog posts, and lectures. John Regehr suggested writing a simple generative fuzzer for the programming language.
A generative fuzzer combines a test case generator with the system under test (e.g. your compiler), generating new test cases and feeding them into the system:
I realized that many people might not know what it takes to write their own generative fuzzer, so this blog post shows one aspect of it: implementing a test case generator.
Our test case generator will generate WebAssembly programs. While WebAssembly has its own quirks — it’s a binary format and is generally a compilation target rather than a source language — it is a small and simple language. The techniques we use when generating WebAssembly should transfer to generating the programming language of your choice.
If you want to skip the exposition and jump head first into the code, here is the repository for our final test case generator.
Table of Contents
- What is a Test Case Generator?
- Getting Set Up
- Translating Grammars into Generators
- Generating the Type Section
- Generating the Import Section
- Generating the Code Section
- Using the Test Case Generator
- Conclusion
What is a Test Case Generator?
Test case generators generate test cases. These test cases are always within the test domain: no cycles are wasted on invalid inputs, such as source text that fails to parse. Compare this to mutation-based fuzzing, where existing seed inputs are mutated to produce new inputs. In general, nothing guarantees that the new, mutated input is still within the test domain: the mutation may have introduced a syntax error. This property, that generated inputs are always within the test domain, is generative fuzzing’s main advantage and the test case generator’s main responsibility.
A test case generator should, additionally, support every feature of its target programming language. You won't discover a bug in your compiler's handling of `switch` statements if the test case generator doesn't support generating `switch` statements. Pushing this idea even further, the test case generator should uniformly sample from the test domain. If the test case generator can technically generate `switch` statements but the probability of doing so is nearly zero, then you likely still won't find that bug. However, uniformly sampling from the infinite set of all programs that can be written in a particular programming language is nontrivial and an area of active research.
A test case generator should, finally, be fast. The faster we can generate test cases, the faster we will discover bugs. If the generator is too slow, we can blow our time budget, failing to find those bugs at all.
Getting Set Up
First, we create a new crate with `cargo`. We'll name this crate `wasm-smith`, giving a little nod to Csmith, the popular C program generator.

Second, we add the `arbitrary` crate as a dependency:
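In `Cargo.toml`, that looks something like this (the version number is illustrative; the `derive` feature is what lets us `#[derive(Arbitrary)]` later):

```toml
[dependencies]
arbitrary = { version = "0.4", features = ["derive"] }
```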
The `arbitrary` crate helps us generate structured data from arbitrary bytes. It is typically used in combination with libFuzzer to translate the raw bytes given to us by libFuzzer into something that the system you're testing can process. For example, a color conversion library might use `arbitrary` to turn the raw fuzzer-provided bytes into `Rgb` or `Hsl` color types. We will use it in a similar way for this project, translating raw bytes given to us by libFuzzer into semantically valid WebAssembly modules.

The `arbitrary` crate's main export is the `Arbitrary` trait:
It takes an `Unstructured`, which is a helpful wrapper around a byte slice, and returns an instance of the type for which it is implemented.
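To make the shape concrete, here is a simplified, self-contained sketch of that API. Note this is a stand-in: the real `arbitrary` crate's trait returns a `Result`, takes a lifetime parameter in recent versions, and its `Unstructured` has many more helper methods.

```rust
// Simplified stand-in for the `arbitrary` crate's core API.
pub struct Unstructured<'a> {
    data: &'a [u8],
}

impl<'a> Unstructured<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Unstructured { data }
    }

    // Consume one byte of raw input, or yield 0 once the input runs dry.
    pub fn byte(&mut self) -> u8 {
        match self.data.split_first() {
            Some((first, rest)) => {
                self.data = rest;
                *first
            }
            None => 0,
        }
    }
}

// The trait: build an instance of `Self` out of raw bytes.
pub trait Arbitrary: Sized {
    fn arbitrary(u: &mut Unstructured) -> Self;
}

impl Arbitrary for bool {
    fn arbitrary(u: &mut Unstructured) -> Self {
        u.byte() & 1 == 1
    }
}
```

The key property: each generated value is a deterministic function of the input bytes, so replaying the same bytes reproduces the same value.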
For our `wasm-smith` crate, we define a `Module` type that represents our pseudo-random WebAssembly modules, and then we implement the `Arbitrary` trait for it:

Before we fill in that `todo!()`, let's take a moment to settle on a design for what the implementation will look like.
Translating Grammars into Generators
Writing a generator is remarkably similar to hand-writing a recursive descent parser, so if you’ve done that before, then you should feel right at home. For example, given this grammar production (borrowed and lightly edited from the C++ name mangling grammar):
<class-enum-type> ::= Ts <name>
| Tu <name>
| Te <name>
A recursive descent parser will, almost mechanically, translate the production into something like this:
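For instance, a hand-written parse function for that production might look like the following sketch, where `ClassEnumType` and `parse_name` are hypothetical stand-ins (with `parse_name` reduced to "take the rest of the input"):

```rust
enum ClassEnumType {
    Struct(String), // Ts <name>
    Union(String),  // Tu <name>
    Enum(String),   // Te <name>
}

fn parse_class_enum_type(input: &mut &str) -> Option<ClassEnumType> {
    // Peek at the input to decide which right-hand side to parse.
    if let Some(rest) = input.strip_prefix("Ts") {
        *input = rest;
        Some(ClassEnumType::Struct(parse_name(input)?))
    } else if let Some(rest) = input.strip_prefix("Tu") {
        *input = rest;
        Some(ClassEnumType::Union(parse_name(input)?))
    } else if let Some(rest) = input.strip_prefix("Te") {
        *input = rest;
        Some(ClassEnumType::Enum(parse_name(input)?))
    } else {
        None
    }
}

// Hypothetical <name> parser: here it just consumes the rest of the input.
fn parse_name(input: &mut &str) -> Option<String> {
    if input.is_empty() {
        return None;
    }
    let name = input.to_string();
    *input = "";
    Some(name)
}
```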
Our generator will do something similar, except instead of peeking at the input string to decide which right-hand side of the production to parse, we will make a pseudo-random choice to generate one of those potential right-hand sides.
We could use a random number generator directly to make these choices, but this has two problems:
- We give up determinism unless we are careful to control the RNG's seed and reuse the same RNG everywhere, threading it through all of our functions as a parameter. Determinism is extremely important for reproducing test failures! It's definitely possible to do these things, but can occasionally be a little annoying.
- More importantly, using an RNG precludes a mature fuzzing engine, like libFuzzer, from guiding our test case generation based on code coverage and other insights.
Instead, we use a raw input byte slice given to us by libFuzzer or AFL as a sequence of predetermined choices.[0] This lets the fuzzer guide our test case generation, and gives us test case reduction “for free” since we can ask the fuzzer to reduce the raw input sequence, rather than write a domain-specific test case reducer. This comes as a relief because writing a reducer that understands WebAssembly is easily as much effort as writing the generator itself.
Here is the same C++ mangling example from above, but translated from a parser into a generator, using `Unstructured`:
Once again, this is mostly mechanical.
This pattern will generate syntactically correct test cases that can be parsed successfully but which likely contain a plethora of type errors, calls to undefined functions, etc. We’ve set out to generate semantically correct test cases that pass type checking and will exercise more than just the language implementation’s frontend.
Our final pattern maintains some extra information about the program we've generated thus far, so that we can consult that information when generating new forms. This extra information might include which names are in scope, the types of each variable, etc. We consult that information while dynamically building up thunks for every valid option we could generate. Once we have enumerated every option, we ask the `Unstructured` to choose one of them, and finally we call the chosen thunk to generate the form.
Here is an example of using this pattern for generating integer expressions, where an integer expression is either a constant integer, an arithmetic operation, a use of an integer variable, or a call of a function that returns an integer:
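Here is a self-contained sketch of that pattern. `Scope`, the helper generators, and the stand-in `Unstructured` are all hypothetical simplifications; a real generator might use boxed closures as thunks where captured state is needed:

```rust
struct Unstructured<'a> {
    data: &'a [u8],
}

impl<'a> Unstructured<'a> {
    fn new(data: &'a [u8]) -> Self {
        Unstructured { data }
    }

    fn byte(&mut self) -> u8 {
        match self.data.split_first() {
            Some((first, rest)) => {
                self.data = rest;
                *first
            }
            None => 0,
        }
    }
}

// What we know about the program generated so far.
struct Scope {
    int_vars: Vec<String>,
    int_funcs: Vec<String>,
}

type Thunk = fn(&mut Unstructured, &Scope, &mut String);

fn gen_int_expr(u: &mut Unstructured, scope: &Scope, out: &mut String) {
    // Constants and arithmetic are always valid choices...
    let mut options: Vec<Thunk> = vec![gen_const as Thunk, gen_arith];
    // ...but variable uses and calls are only valid when something
    // suitable is in scope.
    if !scope.int_vars.is_empty() {
        options.push(gen_var_use);
    }
    if !scope.int_funcs.is_empty() {
        options.push(gen_call);
    }
    // Let the raw input choose among the valid options.
    let choice = u.byte() as usize % options.len();
    options[choice](u, scope, out);
}

fn gen_const(u: &mut Unstructured, _: &Scope, out: &mut String) {
    out.push_str(&(u.byte() as i32).to_string());
}

fn gen_arith(u: &mut Unstructured, scope: &Scope, out: &mut String) {
    // Recursion terminates because each step consumes input, and an
    // exhausted input always selects `gen_const`.
    gen_int_expr(u, scope, out);
    out.push_str(" + ");
    gen_int_expr(u, scope, out);
}

fn gen_var_use(u: &mut Unstructured, scope: &Scope, out: &mut String) {
    let i = u.byte() as usize % scope.int_vars.len();
    out.push_str(&scope.int_vars[i]);
}

fn gen_call(u: &mut Unstructured, scope: &Scope, out: &mut String) {
    let i = u.byte() as usize % scope.int_funcs.len();
    out.push_str(&scope.int_funcs[i]);
    out.push_str("()");
}
```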
Finally, it is worth noting that, similar to how parser generators take a grammar and generate a parser, there are tools that will take a grammar and generate a test case generator (and even Rust implementations of those tools).
Generating the Type Section
Now we’re ready to continue generating WebAssembly!
The first section in a WebAssembly module is the type section. It declares the function type signatures used in the module and it has the following grammar:
<typesec> ::= 0x01 <size:u32> <num_funcs:u32> <functype>*
<functype> ::= <num_params:u32>
<valtype>*
<num_results:u32>
<valtype>*
<valtype> ::= 0x7F # i32
| 0x7E # i64
| 0x7D # f32
| 0x7C # f64
This is not context free: `size` is the size of the whole section in bytes, `num_funcs` is the number of following `<functype>`s, `num_params` defines how many parameter `<valtype>`s a function type has, and `num_results` defines how many result `<valtype>`s it has. But none of this comes into play until we actually serialize the module into bytes. Until then, a type section is a sequence of function type signatures, and these signatures don't actually need access to any scope information, so we can just derive the `Arbitrary` trait directly for them:
And since a module has a type section, let's hook this up into our `Module` definition and its `Arbitrary` implementation:
Generating the Import Section
The import section is the first section where we will need to look at what we’ve previously generated when generating new items. An import brings an external function, table, memory, or global into scope. When importing a function, we need to specify the function’s type via an index into the types section. This means we can only generate a function import if our types section is non-empty.
The structure definitions and `Arbitrary` implementations for table and global types are straightforward, and nothing different from what we saw with the type section, so I've elided them here. The one thing to note is that the largest memory a Wasm module can declare is 4GiB, so the `Arbitrary` implementation for memory types must take that into account:
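A sketch of that constraint, using stand-in types and expressing sizes in 64 KiB Wasm pages (65,536 pages is 4 GiB):

```rust
struct Unstructured<'a> {
    data: &'a [u8],
}

impl<'a> Unstructured<'a> {
    fn new(data: &'a [u8]) -> Self {
        Unstructured { data }
    }

    fn byte(&mut self) -> u8 {
        match self.data.split_first() {
            Some((first, rest)) => {
                self.data = rest;
                *first
            }
            None => 0,
        }
    }

    fn u32(&mut self) -> u32 {
        u32::from_le_bytes([self.byte(), self.byte(), self.byte(), self.byte()])
    }
}

// Wasm memories are measured in 64 KiB pages; 65,536 pages = 4 GiB.
const MAX_PAGES: u32 = 65_536;

struct MemoryType {
    minimum: u32,
    maximum: Option<u32>,
}

fn arbitrary_memory_type(u: &mut Unstructured) -> MemoryType {
    // Clamp the minimum size to the 4 GiB limit.
    let minimum = u.u32() % (MAX_PAGES + 1);
    // Optionally pick a maximum between `minimum` and the limit.
    let maximum = if u.byte() & 1 == 1 {
        Some(minimum + u.u32() % (MAX_PAGES - minimum + 1))
    } else {
        None
    };
    MemoryType { minimum, maximum }
}
```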
When generating an arbitrary import, we dynamically build up a list of functions, one for each semantically valid choice we can make. Then we will use the next bit of raw input from `Unstructured` to choose one of those functions and call it to create an import. We keep creating more imports while the raw input tells us to — we read a `bool` from the `Unstructured` and stop adding imports once we read `false`.
Hooking this up to `Module` and its `Arbitrary` implementation is, once again, straightforward:
Generating the Code Section
There are a few more sections before the code section, which contains function bodies, but their implementation is similar to what we’ve already seen with the type and import sections, so we’ll skip them.
WebAssembly is a typed stack-based language, and has structured control flow. There is the operand stack, where values are pushed and popped, and there is the control stack, which keeps track of active control frames (i.e. `if`s, `loop`s, and `block`s) and their labels that can be jumped to. Whether an instruction is valid depends on the types on the operand stack and, if the instruction is a control instruction, the labels on the control stack. Therefore, we will explicitly model these stacks while generating instructions. Choosing which instruction to generate will consult the contents of these stacks, and, once a choice is made, generating a new instruction will update them.
Let’s begin with the basic definitions for value types and control frames:
We model the control stack as a vector of `Control`s, and the operand stack as a vector of `Option<ValType>`. The `Option` is introduced because unreachable code produces values of any type, and these are modeled as `None`.
In order to generate a function body, we'll also need to know the function's parameter and result types, and the local variables available. We'll collect these things and our stacks in a `CodeBuilder` type. To avoid thrashing the memory allocator, we factor the operand and control stacks out into a separate structure that is reused across the creation of each function body. This structure also contains the vector of options that we build up every time we generate an instruction.
This is basically the same information that WebAssembly’s validation algorithm requires. Our generator is similar to a WebAssembly validator for the same reasons that a generator that emits syntactically correct test cases is similar to a recursive descent parser.
Because there are so many instructions, we won’t write the validation checks and the thunks that generate the instructions inline like we did for previous sections. Instead we will write pairs of functions: the first checks whether a given instruction would be valid to generate in the current context, and the second generates that instruction and updates the context.
Despite this small change in code organization, we will use these function pairs to accomplish the same task as before: dynamically building up a list of functions, one for each instruction that would be valid in the current context, and then choosing one of them and calling it to generate an instruction.
i32.add
Let’s start with an easy example: generating the i32.add
instruction. An
i32.add
is valid when there are two i32
s on the operand stack. There are
many instructions that are valid to generate when there are two i32
s on the
operand stack, so rather than naming the predicate function something like
i32_add_is_valid
, we’ll name it i32_i32_on_stack
.
i32.add
pops the i32
s and pushes their sum. The function that generates an
i32.add
needs to do the same to our model of the operand stack:
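A sketch of this predicate/generator pair, written as free functions over a bare operand stack (the real crate hangs these off its `CodeBuilder`):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum ValType { I32, I64, F32, F64 }

fn is_i32(op: &Option<ValType>) -> bool {
    // `None` models an unreachable-code value, which satisfies any type.
    op.is_none() || *op == Some(ValType::I32)
}

// Valid whenever the top two operands are i32s. Many instructions share
// this predicate, hence the general name.
fn i32_i32_on_stack(operands: &[Option<ValType>]) -> bool {
    let n = operands.len();
    n >= 2 && is_i32(&operands[n - 1]) && is_i32(&operands[n - 2])
}

// Pop the two i32 operands and push their i32 sum.
fn i32_add(operands: &mut Vec<Option<ValType>>, insts: &mut Vec<String>) {
    operands.pop();
    operands.pop();
    operands.push(Some(ValType::I32));
    insts.push("i32.add".to_string());
}
```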
f64.const
Here is another simple example. An `f64.const` instruction is always valid. It has an `f64` immediate and pushes that `f64` onto the operand stack:
if
Things start getting more interesting when we consider instructions that introduce new control frames. An `if` instruction requires an `i32` on top of the operand stack, which determines whether its truth-y or false-y code path is executed. It can also take additional operand stack parameters, making them available to further instructions within its control frame. And while it can take stack parameters, we only require an `i32` on the stack because we can always define `if`s with empty stack parameters.
else
An `else` instruction is only valid when the top control frame was introduced by an `if` and the types on (the accessible portion of) the operand stack are exactly the top control frame's result types.

The `else` instruction resets the operand stack to its state at the moment that the `if` was introduced. It also changes the control kind to `ControlKind::Block` so that we don't generate instruction sequences with two `else`s for the same `if`, like `if ... else ... else ... end`.
end
An `end` instruction finishes a control sequence that was introduced by an `if`, `else`, `block`, or `loop` instruction. Similar to an `else`, it is only valid when we have exactly the current control frame's result types on the operand stack. The final thing to check is that if the top control frame was introduced by an `if`, and the `if` does not leave the stack as it was upon entering (i.e. the parameter and result types are not equal), then we can't generate an `end`. This scenario requires that we generate an `else` first, since executing no instructions when the `if`'s truth-y path is not taken would leave the stack in a type-mismatched state.
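The `end` predicate can be sketched like this (stand-in types; the real crate's version lives on its `CodeBuilder`):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum ValType { I32, I64, F32, F64 }

#[derive(Clone, Copy, PartialEq, Debug)]
enum ControlKind { Block, If, Loop }

struct Control {
    kind: ControlKind,
    params: Vec<ValType>,
    results: Vec<ValType>,
    // Operand-stack height when this frame was entered.
    height: usize,
}

fn end_valid(controls: &[Control], operands: &[Option<ValType>]) -> bool {
    let frame = match controls.last() {
        Some(frame) => frame,
        None => return false,
    };
    // An `if` whose parameter and result types differ needs an `else`
    // first: if the truth-y path were skipped at runtime, the stack
    // would otherwise end up type-mismatched.
    if frame.kind == ControlKind::If && frame.params != frame.results {
        return false;
    }
    // The accessible operands must be exactly the frame's result types
    // (`None` is an unreachable-code value and satisfies any type).
    let accessible = &operands[frame.height..];
    accessible.len() == frame.results.len()
        && accessible
            .iter()
            .zip(&frame.results)
            .all(|(op, ty)| op.is_none() || *op == Some(*ty))
}
```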
call
The final instruction we will examine is `call`. Whether a `call` is valid depends on the callee function's type signature, and therefore whether we can generate a `call` depends on whether any of the module's functions have a type signature compatible with the current operand stack.

To generate a `call` instruction, we choose which function we are calling and update the operand stack accordingly:
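A sketch of both halves, with stand-in types: a function type is compatible when its parameters sit on top of the operand stack, and generating the call pops the parameters and pushes the results:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum ValType { I32, I64, F32, F64 }

struct FuncType {
    params: Vec<ValType>,
    results: Vec<ValType>,
}

// A function is callable when its parameter types sit on top of the
// operand stack (`None` is an unreachable-code value of any type).
fn callable(ty: &FuncType, operands: &[Option<ValType>]) -> bool {
    let n = operands.len();
    ty.params.len() <= n
        && operands[n - ty.params.len()..]
            .iter()
            .zip(&ty.params)
            .all(|(op, param)| op.is_none() || *op == Some(*param))
}

// Pop the parameters, push the results, and emit the instruction.
fn gen_call(
    func_index: u32,
    ty: &FuncType,
    operands: &mut Vec<Option<ValType>>,
    insts: &mut Vec<String>,
) {
    for _ in &ty.params {
        operands.pop();
    }
    for result in &ty.results {
        operands.push(Some(*result));
    }
    insts.push(format!("call {}", func_index));
}
```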
Generating Instruction Sequences
Currently, the core WebAssembly specification defines 189 instructions, and I've implemented support for all of them in `wasm-smith`. That's too many to reproduce here, but you can check them out if you're curious. We'll now turn our attention to using the pairs of functions we've defined for generating individual instructions to generate the sequences of instructions that make up a whole function body.
We keep generating new instructions until either there aren't any control frames left (meaning that we've returned from the function) or the `Unstructured` tells us to stop. To generate one instruction, we filter `OPTIONS` down to just those options that are valid within the current context, ask the `Unstructured` to choose one of them, and call the chosen function to generate the instruction. This new instruction is added to our function body and we repeat the process.
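The shape of that loop can be sketched as follows. Everything here is a heavily pared-down stand-in: the `Builder` only counts open frames, and two toy instructions stand in for all 189:

```rust
struct Unstructured<'a> {
    data: &'a [u8],
}

impl<'a> Unstructured<'a> {
    fn new(data: &'a [u8]) -> Self {
        Unstructured { data }
    }

    fn byte(&mut self) -> u8 {
        match self.data.split_first() {
            Some((first, rest)) => {
                self.data = rest;
                *first
            }
            None => 0,
        }
    }
}

// Pared-down builder: we only track how many control frames are open.
struct Builder {
    controls: usize,
}

type Validator = fn(&Builder) -> bool;
type Generator = fn(&mut Unstructured, &mut Builder, &mut Vec<String>);

// Toy instructions: `nop` is always valid, `end` needs an open frame.
fn nop_valid(_: &Builder) -> bool { true }
fn nop(_: &mut Unstructured, _: &mut Builder, insts: &mut Vec<String>) {
    insts.push("nop".to_string());
}
fn end_valid(b: &Builder) -> bool { b.controls > 0 }
fn end(_: &mut Unstructured, b: &mut Builder, insts: &mut Vec<String>) {
    b.controls -= 1;
    insts.push("end".to_string());
}

const OPTIONS: &[(Validator, Generator)] = &[(nop_valid, nop), (end_valid, end)];

fn arbitrary_instructions(u: &mut Unstructured, b: &mut Builder) -> Vec<String> {
    let mut insts = Vec::new();
    // Generate until every frame is closed or the raw input says stop.
    while b.controls > 0 && u.byte() & 1 == 1 {
        // Filter OPTIONS down to what is valid right now...
        let valid: Vec<_> = OPTIONS.iter().filter(|(is_valid, _)| is_valid(b)).collect();
        // ...and let the raw input choose one to generate.
        let (_, generate) = valid[u.byte() as usize % valid.len()];
        generate(u, b, &mut insts);
    }
    insts
}
```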
If the `Unstructured` tells us to stop generating instructions before we've exited all control frames, then the instruction sequence most likely is not at a valid stopping point. Either there are unfinished `loop`s, `if`s, and `block`s, or we have the wrong types on the operand stack for a return. The `end_active_control_frames` method resolves these conflicts. When it can, it leaves the operands on the stack as a control frame's results or a function's return values. When the operand stack doesn't align with the control frame's results, it inserts an `unreachable` instruction. This instruction makes the sequence validate, but will trigger a trap when executed.
The `CodeBuilderAllocations` are created in the `arbitrary_code` method, which generates the module's code section. The code section contains the bodies of the functions locally defined in the module. For each function, it calls `arbitrary_func_body`, which creates locals for the function and a `CodeBuilder` that reuses the allocations and generates the instruction sequence for the function.
Finally, here is the full `Arbitrary` implementation for `Module`, including calls to each of the methods that generate the sections whose descriptions we elided in this blog post:
That’s everything we need to generate arbitrary, valid Wasm modules!
Using the Test Case Generator
Now we are ready to use our shiny, new Wasm test case generator with a fuzzer! We'll use the `libfuzzer-sys` crate and `cargo fuzz`. Our new fuzz target accepts an arbitrary `Module` as input, serializes it into bytes, and passes those bytes into the `wasmparser` crate's validator. Finally, it asserts that the module is valid. If this assertion fails, then either the test case generator or the `wasmparser` crate has a bug.

We can run this fuzz target with `cargo fuzz`:
Initially, I used this fuzz target to test my generator, making sure that it was actually generating valid Wasm modules. Then it stopped finding bugs in my generator, and soon after that it started finding bugs in `wasmparser`'s validation implementation. It has already found five unique validation bugs! And it isn't like `wasmparser` has particularly low-hanging fruit — we already fuzz `wasmparser` in a variety of different ways 24/7 on OSS-Fuzz.
Conclusion
Writing a test case generator for a programming language isn’t magic. It isn’t very difficult once you know the pattern to follow. And once you have the test case generator written, it can poke deep into your compiler, past the parser, helping you find a bunch of hidden bugs!
`wasm-smith` is already quite promising and I intend to keep developing and maintaining it. Next I want to define a bunch of fuzz targets that use `wasm-smith` for Wasmtime and its ecosystem of crates and tools. I also want to add support for swarm testing, toggling various Wasm proposals (such as multi-value and reference types) on and off, and add a mode that guarantees termination of its generated programs.
*Update: support for swarm testing and ensuring termination now exists. Supporting new Wasm proposals is ongoing, but you can already toggle some of them on or off.*
Thanks for reading!
[0] It should be noted that this technique, while it enables using libFuzzer and AFL with our generator, does not require their use. We can still use a purely random approach by filling a buffer with a random number generator and then using that buffer as our input sequence of predetermined choices. This would, for example, let us fuzz with our test case generator on platforms that aren't supported by AFL or libFuzzer. ↩