A Structure-Aware Fuzzing Experiment

Structure-aware fuzzing can better exercise the system under test (SUT) by crafting inputs in the format expected by the SUT, rather than throwing pseudorandom bytes against it. That is, it avoids “shallow” inputs that the SUT will reject early (for example, syntactically invalid source text when fuzzing a programming language’s compiler) and only produces inputs that go “deep” into the SUT (e.g. programs that type-check and exercise the mid-end optimizer and backend code generator). The Rust fuzzing ecosystem is largely built around cargo-fuzz and the libfuzzer-sys crate, which provides two methods for structure-aware fuzzing:

Generating structured inputs from scratch with the arbitrary crate
Mutating existing inputs from the fuzzer’s corpus in a structure-aware manner, thereby producing new structured inputs, via the fuzz_mutator! hook

While the two methods are not technically mutually exclusive, combining the two can be difficult and engineering resources are finite. So:

If we are only implementing one approach, is generation or mutation better?

To help answer this question, I implemented structure-aware generation and mutation of guaranteed-valid WebAssembly (Wasm) instruction sequences. This task is small enough to be easily understandable but large enough and real enough to (hopefully) be representative and applicable to other domains, or, at the very least, interesting.¹ To evaluate their effectiveness, I used Wasmtime as the SUT, libfuzzer-sys as the fuzzing engine driving everything, and then compared code coverage over time when using mutation-based fuzzing versus generation-based fuzzing.

Additionally, there are many ways we can generate pseudorandom WebAssembly instruction sequences. In this experiment, I’ve evaluated three methods:

Unconstrained instruction sequence generation followed by a fixup pass to ensure validity
Generating valid instructions in a forwards, bottom-up manner (from operands to operators)
Generating valid instructions in a backwards, top-down manner (from operators to operands)

In contrast, while there are surely many ways to mutate a given WebAssembly instruction sequence into a new, valid instruction sequence, I’ve only implemented one method: perform an arbitrary instruction insertion, deletion, or replacement, producing a new but probably-invalid instruction sequence, and then run the same fixup pass mentioned previously to ensure validity. This is the direct mutation-based equivalent of the first generation-based method.

Before continuing further, I want to disclose that I am the author of wasm-smith and mutatis, and a maintainer of Wasmtime, arbitrary, libfuzzer-sys, and cargo-fuzz. That is, while I am familiar with Wasm, fuzzing, fuzzing Wasm, and both the arbitrary and mutatis crates, I may also be propagating my own biases into these implementations.

Background

Generation-Based and Mutation-Based Fuzzing

A generation-based fuzzer uses a generator to create a pseudo-random test cases from scratch, feeds these into the system under test, and reports any failures to the user:

fn generation_based_fuzzing<T>(
    // A test-case generator.
    generator: impl Fn() -> T,
    // A function to run the system under test with a
    // generated test case, returning a result that
    // describes whether the run was successful or
    // not.
    run_system_under_test: impl Fn(&T) -> FuzzResult,
) {
    loop {
        // Generate an input.
        let input = generator();

        // Run the input through the system under test.
        let result = run_system_under_test(&input);

        // If the system crashed, panicked, failed an
        // assertion, violated an invariant, or etc...
        // then report that to the user.
        if let Err(failure) = result {
            report_to_user(&input, failure);
        }
    }
}

On the other hand, mutation-based fuzzers are given an initial corpus of inputs and create new inputs by mutating existing corpus members. They run each new input through the SUT, report failures the same as before, and if the new input was “interesting” (for example, exercised new code paths in the SUT that weren’t previously covered in any other input’s execution) then the new input is added into the corpus for use in future test iterations:

fn mutation_based_fuzzing<T>(
    // A corpus of test cases.
    corpus: &mut Corpus<T>,
    // A function to pseudo-randomly mutate an existing
    // input into a new input.
    mutate: impl Fn(&T) -> T,
    // A function to run an input in the system under
    // test, returning a result that describes whether
    // the run was successful or not.
    run_system_under_test: impl Fn(&T) -> FuzzResult,
) {
    loop {
        // Choose an old test case from the corpus.
        let old_input = corpus.choose_one();

        // Pseudo-randomly mutate that old test case,
        // creating a new one.
        let input = mutate(old_input);

        // Run the input through the system under test.
        let result = run_system_under_test(&input);

        // If the system crashed, panicked, failed an
        // assertion, violated an invariant, or etc...
        // then report that to the user.
        if let Err(failure) = result {
            report_to_user(&input, failure);
        }

        // If the input was interesting, for example if
        // it executed previously-unknown code paths,
        // then add it into the corpus for use in a
        // future iteration.
        if result.input_was_interesting() {
            corpus.insert(input);
        }
    }
}

The two approaches are not mutually exclusive and hybrid generation- and mutation-based fuzzers exist.

More resources:

Structure-Aware Fuzzing

Structure-unaware fuzzing will generate pseudorandom byte sequences and pass them directly to the SUT. If the SUT expects some sort of structured input, e.g. the source text for a programming language, it is likely that these byte sequences are invalid and will be rejected early by the SUT’s frontend. For example, when fuzzing a compiler, the input is rejected as syntactically invalid by the parser or rejected as semantically invalid by the type checker. This can be useful when hardening a tokenizer, parser, or type checker, but is less useful when hunting for misoptimization in the mid-end or bad instruction encoding in the backend because the inputs are unlikely to make it that far through the compiler’s pipeline.

Structure-aware fuzzing will produce inputs that match the SUT’s expected input format. Returning to the compiler-fuzzing example, structure-aware fuzzing lets us generate valid programs for the compiler, so we can exercise more of the mid-end and backend, rather than just the frontend.

Structure-aware fuzzing is often generation-based: for example using grammar-based fuzzing to generate pseudorandom strings from a given language grammar or language-specific tools like csmith and wasm-smith that generate C and WebAssembly programs respectively. But structure-aware fuzzing can also be mutation-based: libFuzzer’s custom mutator example implements a structure-aware mutator for zlib-compressed strings, where the raw input is decompressed, the decompressed data is mutated, and then the mutated data is recompressed to provide the new raw input. The mutator is aware of the SUT’s zlib-compressed input structure.

More resources:

The `arbitrary` Crate

The arbitrary crate helps Rust developers write custom structure-aware generators for fuzzing. It provides building blocks and abstractions for translating a raw byte sequence (usually from a fuzzing engine) into a structured type, effectively interpreting the raw bytes as a “DNA string” or set of predetermined choices for its decision tree. The library also provides a derive(Arbitrary) macro to automatically implement its functionality for a given type.

Because arbitrary is effectively implemented by combining decision trees, it is extremely easy to create imbalanced trees and unintentionally bias the distribution of generated test cases.

The `mutatis` Crate

The mutatis crate is, at a high-level, performing the same role for authoring structure-aware mutators that arbitrary plays for generators. That is, it provides Rust developers with abstractions and combinators for creating custom structure-aware mutators. It also provides a derive(Mutate) macro to automatically implement its functionality for a given type.

mutatis is designed to resist bias via a two-phase design: first, it enumerates all of the candidate mutations that could be applied to a test case, and only afterwards chooses a particular random mutation from the candidate set to actually apply.

WebAssembly

WebAssembly is a virtual instruction set designed to be safe, portable, and fast. It is a stack machine where an instruction’s operands are popped off a stack during execution and results pushed. It has sandboxed linear memories, global variables, and local variables (the latter two effectively being two kinds of virtual registers). The following instruction sequence computes a * 3 and stores the result into memory at address p:

;; []
local.get $p
;; [p]
local.get $a
;; [p, a]
i32.const 3
;; [p, a, 3]
i32.mul
;; [p, a*3]
i32.store
;; []

Generator and Mutator Implementation

The range of all three generators and the mutator is the same universe of WebAssembly programs. They are all implemented on top of the same Module and Inst types, and, given enough time, none is capable of producing an instruction sequence that another cannot. This helps ensure that our comparison is apples-to-apples. However, due to their different implementation techniques, they do produce different distributions of WebAssembly programs within that universe, and produce test cases at different speeds from one another, which ultimately affects how efficiently they exercise the SUT.

All of the generators are built on top of the arbitrary crate. The mutator is built on top of the mutatis crate.

The Module type is our structured fuzzing input. It describes a WebAssembly module containing a variable number of linear memories, a variable number and type of globals, and one function with a variable number and type of parameters and results and a variable instruction sequence:

/// A WebAssembly module of the shape:
///
///     (module
///       (memory ...)
///       (memory ...)
///       ...
///
///       (global ...)
///       (global ...)
///       ...
///
///       (func (export "run") (param ...) (result ...)
///         ...
///       )
///     )
pub struct Module {
    num_memories: u32,
    globals: Vec<Global>,
    param_types: Vec<ValType>,
    result_types: Vec<ValType>,
    instructions: Vec<Inst>,
}

The Inst type is an enum of all the WebAssembly instructions the implementations support, which is all of the integer, float, SIMD, memory, local, and global instructions. Control-flow, threading, table, and GC instructions are not supported. Here is a subset of Inst’s definition:

/// A WebAssembly instruction.
pub enum Inst {
    Drop,
    LocalGet(u32),
    GlobalGet(u32),

    // ...

    I32Const(i32),
    I32Add,
    I32Sub,
    I32Mul,

    // ...

    I64Const(i64),
    I64Add,
    I64Sub,
    I64Mul,

    // ...

    F32Const(f32),
    F32Add,
    F32Sub,
    F32Mul,

    // ...

    F64Const(f64),
    F64Add,
    F64Sub,
    F64Mul,

    // ...

    I32WrapI64,
    I64ExtendI32S,
    I64ExtendI32U,

    // ...

    V128Const(i128),
    I8x16Add,
    I8x16Sub,

    // ...

    I32Load(u32),
    I64Load(u32),

    // ...

    I32Store(u32),
    I64Store(u32),

    // ...

    MemorySize(u32),
    MemoryGrow(u32),
}

There is an Inst::operand_types method that returns the types that the instruction pops from the stack, and an Inst::result_type method that returns the type of the value that the instruction pushes onto the stack, if any. Finally, the Module::to_wasm_binary method encodes the module into WebAssembly’s binary format, so it can be fed into Wasmtime. These methods are used, directly or indirectly, in every generator and mutator implementation.

`arb`

The arb generator leverages derive(arbitrary::Arbitrary) on our structured input types to generate a pseudorandom instance of Module, unconstrained by validity. The module’s instruction sequence is almost certainly not valid at this point: it likely is missing operands for instructions, producing more results than the function’s signature describes, producing results of types that don’t match the function signature, accessing globals and locals that don’t exist, etc… Having produced an instance of Module, it next calls the Module::fixup method to mutate the Module so that it is valid.

The fixup method works by abstractly interpreting the instruction sequence to track the types of each value on the stack at every program point. Whenever an instruction’s operand types don’t match the types on top of the stack, it generates dummy values of the correct type. When the instructions produce more values than the function’s signature proscribes, it emits drop instructions.

impl Module {
    pub fn fixup(&mut self, mut make_value: impl FnMut() -> i64) {
        // ...

        // The fixed-up instructions.
        let mut fixed = Vec::with_capacity(
            self.instructions.len(),
        );

        // The types on the stack at any given program
        // point. Similar to the Wasm spec's appendix's
        // validation algorithm.
        let mut stack: Vec<ValType> = Vec::new();

        for inst in mem::take(&mut self.instructions) {
            // Special-case `drop` because it is
            // polymorphic.
            if matches!(inst, Inst::Drop) {
                if stack.is_empty() {
                    fixed.push(
                        ValType::I32.make_const(make_value()),
                    );
                } else {
                    stack.pop();
                }
                fixed.push(inst);
                continue;
            }

            // First clamp entity indices to valid
            // ranges.
            let Some(inst) = self.fixup_inst_immediates(
                &mut make_value,
                has_mutable_global,
                inst,
            ) else {
                continue
            };

            // Then make sure that the stack has
            // operands of the correct types for this
            // instruction.
            self.fixup_stack(
                &mut make_value,
                &mut fixed,
                &mut stack,
                &inst,
            );

            // Finally, apply the effects to the stack.
            let len_operands = inst.operand_types(
                &self.globals,
            ).len();
            stack.truncate(stack.len() - len_operands);
            stack.extend(inst.result_type(
                &self.param_types,
                &self.globals,
            ));

            fixed.push(inst);
        }

        // ...

        self.instructions = fixed;
    }

    fn fixup_stack(
        &mut self,
        mut make_value: impl FnMut() -> i64,
        fixed: &mut Vec<Inst>,
        stack: &mut Vec<ValType>,
        inst: &Inst,
    ) {
        let needed = inst.operand_types(&self.globals);
        let n = needed.len();

        if stack.len() >= n {
            if (0..n).all(|i| {
                stack[stack.len() - n + i] == needed[i]
            }) {
                // All needed operands are on the stack.
                return;
            }
        } else {
            if stack.iter().enumerate().all(|(i, ty)| {
                *ty == needed[i]
            }) {
                // A prefix of needed operands are on the
                // stack; make constants for the tail that
                // are missing.
                for ty in &needed[stack.len()..] {
                    fixed.push(ty.make_const(make_value()));
                    stack.push(*ty);
                }
                return;
            }
        }

        // Otherwise, just make constants for all the
        // needed operands.
        for ty in needed {
            fixed.push(ty.make_const(make_value()));
            stack.push(*ty);
        }
    }

    // ...
}

The fixup method also makes sure that for all instructions that have an immediate referencing some entity, the referenced entity is valid. For example, for a local.get $l instruction, it ensures that local $l actually exists or else rewrites the local to one that does exist.

impl Module {
    // ...

    fn fixup_inst_immediates(
        &mut self,
        mut make_value: impl FnMut() -> i64,
        has_mutable_global: bool,
        mut inst: Inst,
    ) -> Option<Inst> {
        match &mut inst {
            Inst::LocalGet(l) => *l %= self.param_types.len() as u32,

            // ...

            Inst::I32Load(m)
            | Inst::I64Load(m)
            | Inst::F32Load(m)
            | Inst::F64Load(m)
            | Inst::V128Load(m) => {
                if self.num_memories == 0 {
                    return None;
                }
                *m %= self.num_memories;
            }

            // ...

            _ => {}
        }

        Some(inst)
    }
}

After calling fixup, the arb generator invokes Module::to_wasm_binary to get the encoded Wasm program.

`bottom_up`

The bottom_up generator also uses abstract interpretation to track the types of values on the stack. It generates instructions in forwards order, from operands to operators. It begins with an empty stack, filters candidate instructions down to just those that would be valid given the types currently on the stack, randomly chooses one, updates the stack types accordingly, and repeats the process. This is the same approach that wasm-smith uses. After generating instructions this way, it then makes sure that the final types on the stack match the function signature’s results, similar to the end of fixup.

impl Module {
    pub fn bottom_up(u: &mut Unstructured<'_>) -> Result<Self> {
        // ...

        let max_insts = u.int_in_range(1..=MAX_INSTS)?;
        let mut instructions = Vec::new();
        let mut stack: Vec<ValType> = Vec::new();

        for _ in 0..max_insts {
            if stack == result_types && u.ratio(3, 4)? {
                break;
            }

            // Choose a random instruction whose operand
            // types match those currently on the stack.
            let inst = choose_inst_bottom_up(
                u,
                &stack,
                &param_types,
                &globals,
                num_memories,
            )?;

            // Apply this instruction's effects to the
            // stack.
            apply_inst(
                &inst,
                &mut stack,
                &param_types,
                &globals,
            );
            instructions.push(inst);
        }

        // ...

        Ok(Module {
            param_types,
            result_types,
            globals,
            num_memories,
            instructions,
        })
    }
}

fn choose_inst_bottom_up(
    u: &mut Unstructured<'_>,
    stack: &[ValType],
    param_types: &[ValType],
    globals: &[Global],
    num_memories: u32,
) -> Result<Inst> {
    // Build up all the valid candidate instructions.
    let mut candidates: Vec<Inst> = Vec::new();

    // Producers are always okay: [] -> [t]
    candidates.push(Inst::I32Const(0));
    candidates.push(Inst::I64Const(0));
    candidates.push(Inst::F32Const(0.0));
    candidates.push(Inst::F64Const(0.0));
    candidates.push(Inst::V128Const(0));
    if !param_types.is_empty() {
        candidates.push(Inst::LocalGet(0));
    }

    // ...

    let top = stack.last().copied();
    let second = stack.get(stack.len() - 2).copied();

    // Drop needs 1 operand of any type: [t] -> []
    if top.is_some() {
        candidates.push(Inst::Drop);
    }

    // i32 unary: [i32] -> [...]
    if top == Some(I32) {
        candidates.push(Inst::I32Clz);
        candidates.push(Inst::I32Ctz);
        candidates.push(Inst::I32Popcnt);
        // ...
    }

    // i64 unary: [i64] -> [...]
    if top == Some(I64) {
        candidates.push(Inst::I64Clz);
        candidates.push(Inst::I64Ctz);
        candidates.push(Inst::I64Popcnt);
        // ...
    }

    // ...

    // i32 binary: [i32 i32] -> [...]
    if top == Some(I32) && second == Some(I32) {
        candidates.push(Inst::I32Add);
        candidates.push(Inst::I32Sub);
        candidates.push(Inst::I32Mul);
        // ...
    }

    // i64 binary: [i64 i64] -> [...]
    if top == Some(I64) && second == Some(I64) {
        candidates.push(Inst::I64Add);
        candidates.push(Inst::I64Sub);
        candidates.push(Inst::I64Mul);
        // ...
    }

    // ...

    // Choose a random instruction from the
    // candidates.
    let mut inst = *u.choose(&candidates)?;

    // If the instruction has immediates, generate
    // them here, as they were hard-coded during
    // candidate selection.
    match &mut inst {
        Inst::I32Const(v) => *v = u.arbitrary()?,
        Inst::I64Const(v) => *v = u.arbitrary()?,
        // ...
        Inst::GlobalGet(g) => {
            *g = u.int_in_range(0..=(globals.len() as u32 - 1))?;
        }
        // ...
        Inst::I32Load(m)
        | Inst::I64Load(m)
        | Inst::F32Load(m)
        | Inst::F64Load(m)
        | Inst::V128Load(m)
        | Inst::I32Store(m)
        | Inst::I64Store(m)
        | Inst::F32Store(m)
        | Inst::F64Store(m)
        | Inst::V128Store(m)
        | Inst::MemorySize(m)
        | Inst::MemoryGrow(m) => {
            *m = u.int_in_range(0..=(num_memories - 1))?;
        }
        _ => {}
    }

    Ok(inst)
}

After constructing a Module via bottom_up, we don’t need to call fixup because the module is already valid by construction, so all that’s left is invoking Module::to_wasm_binary to get the encoded Wasm program.

`top_down`

The top_down generator is very similar to bottom_up, but instead of generating instructions forwards, from operands to operators, it generates them backwards, from operators to operands. Instead of maintaining a stack of the types of values generated thus far by the instruction sequence prefix, it maintains a stack of the types of values expected by the instruction sequence suffix. This is the approach that rgfuzz by Park, Kim, and Yun takes.²

impl Module {
    pub fn top_down(
        u: &mut Unstructured<'_>,
    ) -> Result<Self> {
        // ...

        let max_insts = u.int_in_range(1..=MAX_INSTS)?;
        let mut instructions = Vec::new();
        let mut needed = result_types.clone();
        for _ in 0..max_insts {
            if needed.is_empty() && u.ratio(3, 4)? {
                break;
            }

            // Choose a random instruction in a
            // top-down manner.
            let inst = choose_inst_top_down(
                u,
                needed.last().copied(),
                &param_types,
                &globals,
                num_memories,
            )?;

            // Pop the result type from `needed`, if
            // any, as it's been satisfied.
            let ty = inst.result_type(
                &param_types,
                &globals,
            );
            if ty == needed.last().copied() {
                needed.pop();
            }

            // Add operand type demands.
            match &inst {
                Inst::Drop => {
                    // `drop` is polymorphic; choose
                    // a random type.
                    needed.push(u.arbitrary()?);
                }
                Inst::GlobalSet(g) => {
                    needed.push(globals[*g as usize].ty);
                }
                _ => {
                    needed.extend_from_slice(
                        inst.operand_types(&globals),
                    );
                }
            }

            instructions.push(inst);
        }

        // Fill remaining needed types with
        // constants.
        for ty in needed.iter().rev() {
            instructions.push(
                ty.make_const(u.arbitrary()?),
            );
        }

        // Instructions were generated backwards, so
        // reverse.
        instructions.reverse();

        Ok(Module {
            param_types,
            result_types,
            globals,
            num_memories,
            instructions: prefix,
        })
    }
}

fn choose_inst_top_down(
    u: &mut Unstructured<'_>,
    target_ty: Option<ValType>,
    param_types: &[ValType],
    globals: &[Global],
    num_memories: u32,
) -> Result<Inst> {
    let mut candidates: Vec<Inst> = Vec::new();
    match target_ty {
        Some(I32) => {
            candidates.push(Inst::I32Const(0));
            candidates.push(Inst::I32Add);
            candidates.push(Inst::I32Sub);
            candidates.push(Inst::I32Mul);
            // ...
        }
        Some(I64) => {
            candidates.push(Inst::I64Const(0));
            candidates.push(Inst::I64Add);
            candidates.push(Inst::I64Sub);
            candidates.push(Inst::I64Mul);
            // ...
        }
        Some(F32) => {
            candidates.push(Inst::F32Const(0.0));
            candidates.push(Inst::F32Add);
            candidates.push(Inst::F32Sub);
            candidates.push(Inst::F32Mul);
            // ...
        }
        // ...
        None => {
            // Nothing needed. `drop`, `global.set`, and
            // stores add demand.
            candidates.push(Inst::Drop);
            if globals.iter().any(|g| g.mutable) {
                candidates.push(Inst::GlobalSet(0));
            }
            if num_memories > 0 {
                candidates.push(Inst::I32Store(0));
                // ...
            }
        }
    }

    let mut inst = *u.choose(&candidates)?;

    // If the instruction has immediates, generate
    // them here, as they were hard-coded during
    // candidate selection. Same as `bottom_up`.
    match &mut inst {
        // ...
    }

    Ok(inst)
}

Similar to bottom_up, after we’ve constructed a Module via top_down, we don’t need to call fixup because the module is already valid by construction. All that’s left is invoking Module::to_wasm_binary to get the encoded Wasm program.

`mutate`

mutate is, as the name implies, a mutator rather than a generator. It is the direct equivalent of the arb generator, but for mutation: it uses derive(mutatis::Mutate) on Module and Inst to automatically generate custom mutators for these types, rather than authoring them by hand. After producing a new Module by mutating an old Module, that new Module probably represents an invalid Wasm program, in the same way that derive(arbitrary::Arbitrary) produces Modules that are probably invalid. And mutate also uses the same approach that arb does to resolve this problem: the fixup method.

But first, a mutator-specific wrinkle is that fuzz_mutator! gives us a mutable byte slice to mutate, not a Module. We address this gap by deriving the serde crate’s Serialize and Deserialize traits on Module and Inst, deserializing a Module from the mutable byte slice, mutating that deserialized Module with mutatis, and then reserializing it back into the mutable byte slice. We use the postcard crate here, but could just as easily use bincode, JSON, or protobuf.

use libfuzzer_sys::{fuzz_mutator, fuzz_target, fuzzer_mutate};

fuzz_mutator!(|
    data: &mut [u8],
    size: usize,
    max_size: usize,
    seed: u32,
| {
    // With probability of about 1/8, use default
    // mutator.
    if seed.count_ones() % 8 == 0 {
        return fuzzer_mutate(data, size, max_size);
    }

    // Try to decode using postcard; fallback to
    // default input on failure.
    let mut module: Module =
        postcard::from_bytes(&data[..size])
            .ok()
            .unwrap_or_default();

    // Mutate with `mutatis`.
    let mut session = mutatis::Session::new()
        .seed(seed.into())
        .shrink(max_size < size);
    if session.mutate(&mut module).is_ok() {
        if let Ok(encoded) = postcard::to_slice(
            &module,
            data,
        ) {
            return encoded.len();
        }
    }

    // Fallback to the default libfuzzer mutator if
    // serialization or mutation fails because, for
    // example, `data` doesn't have enough capacity.
    fuzzer_mutate(data, size, max_size)
});

Finally, the fuzz target itself deserializes the Module from the raw bytes, calls fixup, encodes it to a Wasm binary via Module::to_wasm_binary, and then passes that into Wasmtime.

fuzz_target!(|data: &[u8]| {
    let Ok(mut module) = postcard::from_bytes::<Module>(data) else {
        return;
    };
    module.fixup(|| 0);
    let wasm = module.to_wasm_binary();

    // ...
});

Benchmarking

Methodology

We pair each of our generators and mutator with libfuzzer-sys and feed the resulting test cases into Wasmtime. All fuzzers start with an empty corpus.

The most important metric for a fuzzer is its bug-finding ability, but that can be difficult to measure directly. For example, Wasmtime is actively fuzzed 24/7 with more-complete fuzzers than those implemented here, so, as expected, I have not found any bugs via these benchmarks. Therefore, instead of reporting a found-bugs count, the benchmark harness reports two alternative metrics:

Coverage over time: Coverage is the cumulative code paths exercised by the fuzzer. A fuzzer cannot find bugs in code paths it does not cover. This is the most important metric reported.
Executions over time: An execution is one iteration of the fuzzing loop. This is basically measuring how fast the fuzzer can produce test cases. All else being equal, more executions is better, but all else is rarely equal. It is easy to generate poor test cases very quickly: just return an empty sequence of Wasm instructions every time. Unfortunately, that exclusively leads to useless executions. Therefore, this metric is really only useful when comparing two implementations of the same algorithm, and I’ve omitted its results in the next section.

Additionally, I report results for both 24 hours of fuzzing and 5 minutes of fuzzing. The expected behavior of long-term fuzzing, e.g. 24/7 fuzzing in OSS-Fuzz, can be extrapolated from the 24-hour results. The 5-minute results show the expected behavior of short-term fuzzing, e.g. when using mutatis::check or arbtest.

Discussion of short-term fuzzing is somewhat rare, so I feel its motivation deserves explanation. I find short-term fuzzing useful in the following scenarios, for example:

Running a quick fuzzing session locally, to catch bugs that avoid detection in the traditional unit- and integration-test suites, before opening a pull request.
Running some quick fuzzing in CI before allowing a pull request to merge, for similar reasons.

That is, short-term fuzzing is useful for the same reasons and in the same scenarios as property-based testing.³

As recommended in Evaluating Fuzz Testing by Klees, Ruef, Cooper, Wei, and Hicks and adopted in Fuzz Bench: An Open Fuzzer Benchmarking Platform and Service by Metzman, Szekeres, Simon, Sprabery, and Arya, the benchmark harness tests the statistical significance of its results with a Mann-Whitney U-test. The harness performs 20 trials per fuzzer, the same number of trials as Fuzz Bench.

Results

24 Hours of Fuzzing

arb has 1.00 ± 0.00 times more coverage than bottom_up (p = 0.01)
mutate has 1.01 ± 0.00 times more coverage than arb (p = 0.00)
top_down has 1.00 ± 0.00 times more coverage than arb (p = 0.00)
mutate has 1.02 ± 0.00 times more coverage than bottom_up (p = 0.00)
top_down has 1.01 ± 0.00 times more coverage than bottom_up (p = 0.00)
mutate has 1.01 ± 0.00 times more coverage than top_down (p = 0.00)

5 Minutes of Fuzzing

bottom_up has 1.01 ± 0.01 times more coverage than arb (p = 0.04)
mutate has 1.47 ± 0.02 times more coverage than arb (p = 0.00)
top_down has 1.06 ± 0.02 times more coverage than arb (p = 0.00)
mutate has 1.45 ± 0.01 times more coverage than bottom_up (p = 0.00)
top_down has 1.05 ± 0.02 times more coverage than bottom_up (p = 0.00)
mutate has 1.38 ± 0.02 times more coverage than top_down (p = 0.00)

Conclusion

The mutate fuzzer performs best. It vastly outperforms all the others at 5 minutes of fuzzing (36-49% more coverage), and while the rest narrow that gap after 24 hours of fuzzing, mutate maintains its lead (1-2% more coverage).

The comparison between arb and mutate is as apples-to-apples of a comparison as it gets between idiomatic test-case generation and mutation in Rust: derive(Arbitrary) and derive(Mutate). They use the same fixup method to ensure that the resulting Wasm instructions are valid. The fuzzer built with mutatis and test-case mutation provides better coverage over time than the fuzzer built with arbitrary and test-case generation. When writing structure-aware fuzzers, I used to reach for arbitrary; in the future, I will reach for mutatis instead.

The top_down fuzzer performs second-best, and is best of the generation-based fuzzers. This aligns with results from the rgfuzz paper, which found that top-down Wasm instruction generation resulted in better instruction diversity than bottom-up generation. This result is intuitive, they point out, because Wasm instructions tend to have more operands than results, which means that more candidates are filtered out from consideration when generating instructions in forward order from operands to results (bottom-up) than when generating them in backward order from results to operands (top-down).

Subjectively, none of the approaches feel significantly more-complicated nor easier to implement than the others. All approaches require a stack of types, representing the generated Wasm’s operand stack, at some point in their implementation. Some require it during instruction generation (top_down and bottom_up) while others require it during fixup (mutate and arb). Adding support for new Wasm instructions is roughly the same in all of them: add a new variant to enum Inst and define its operand and result types. top_down and bottom_up additionally require adding a line for the new instruction in their choose_inst_{top_down,bottom_up} functions, but this could be avoided with some targeted macro_rules! sugar.

The fixup method fixes instructions in a forwards order; as future work, it would be interesting to implement a backwards_fixup method that fixes instructions in a backwards order and see if mutate and backwards_fixup outperforms the current mutate and forwards fixup the same way that backwards generation (top_down) outperforms forwards generation (bottom_up).

fixup makes an attempt to reuse stack operands when it can, rather than synthesize dummy constants or drop already-computed values, but the attempt is somewhat half-hearted. Dropping operands introduces dead code, which is not very interesting for exercising deep into the compiler pipeline. Dummy constants are not that interesting either. Therefore, another potential line of follow-up work would be to investigate ways to maximize operand reuse and minimize drops and dummy constants inserted while ensuring validity. That could include storing values to memory or globals instead of droping them when possible. It could even include liberating ourselves from the stack-focused paradigm we’ve had thus far.

WebAssembly is a stack-based language and so it is natural that our approaches have focused on producing stack-y code. But, in practice, optimizing WebAssembly compilers like Wasmtime’s use a static single-assignment intermediate representation, and erase the operand stack early in their compilation pipelines. Therefore, from these compilers’ point of view, the following two WebAssembly snippets are identical:

;; `x = a + (b * c)` in a "stack-y" encoding and
;; without temporary locals.
local.get $a
local.get $b
local.get $c
i32.mul
i32.add
local.set $x

;; `x = a + (b * c)` in a "non-stack-y" encoding
;; that uses temporary locals for every operation.
;;
;; Equivalent of
;;
;;     temp0 = b * c
;;     temp1 = a + temp0
;;     x = temp1
local.get $b
local.get $c
i32.mul
local.set $temp0
local.get $a
local.get $temp0
i32.add
local.set $temp1
local.get $temp1
local.set $x

Producing code that uses many temporaries in this manner might be easier than code that doesn’t, but, more importantly, it may enable better reuse of already-computed subexpressions, emit less dead code, and ultimately produce more interesting data-flow graphs that better exercise the deep innards of the compiler.

A final vein of interesting follow-up work to mine would be comparing arbitrary-based generators and mutatis-based mutators for structured inputs that are not programming languages and when the SUT we are fuzzing is not a compiler. Do we see these same results when, for example, producing PNG images to fuzz an image-transformation library?

Here is the source code for this experiment, including the three generators, one mutator, raw benchmark data, and benchmarking harness. The README includes instructions on running the benchmarks yourself.

WebAssembly’s stack-based instructions encode an expression tree — local.get $a; local.get $b; local.get $c; i32.add; i32.mul is isomorphic to a * (b + c) — so the experiment should be relevant and applicable to any other generator or mutator for a programming language with expressions, even if it might not appear so at first glance. ↩
Ignoring its rule-guided bit, which is orthogonal and could be applied to bottom_up as well. ↩
Structure-aware fuzzing and property-based testing are basically the same: convergent evolution from different communities. ↩