Speed Without Wizardry
Vyacheslav Egorov, who goes by mraleph
on the Web, wrote a response
to my article “Oxidizing Source Maps with Rust and WebAssembly”
titled “Maybe you don’t need Rust and WASM to speed up your JS”.
The “Oxidizing” article recounts my experience integrating Rust
(compiled to WebAssembly) into the source-map
JavaScript
library. Although the JavaScript implementation was originally
authored in idiomatic JavaScript style, as we profiled and implemented speed
improvements, the code became hard to read and maintain. With Rust and its
zero-cost abstractions, we found that there was no trade-off between performance
and clean code.
mraleph
is an established expert on JavaScript performance.
He specializes in Just-In-Time (JIT) compilers, and has contributed to Google’s
V8 JavaScript engine. Guided by profiling the JIT’s internals and emitted code,
he ultimately brought the JavaScript implementation’s performance to parity with
the Rust and WebAssembly implementation:
Here is where we started:
```
$ d8 bench-shell-bindings.js
...
[Stats samples: 5, total: 24050 ms, mean: 4810 ms, stddev: 155.91063145276527 ms]

$ sm bench-shell-bindings.js
...
[Stats samples: 7, total: 22925 ms, mean: 3275 ms, stddev: 269.5999093306804 ms]
```
and this is where we are finishing
```
$ d8 bench-shell-bindings.js
...
[Stats samples: 22, total: 25158 ms, mean: 1143.5454545454545 ms, stddev: 16.59358125226469 ms]

$ sm bench-shell-bindings.js
...
[Stats samples: 31, total: 25247 ms, mean: 814.4193548387096 ms, stddev: 5.591064299397745 ms]
```
This is a factor of 4 improvement!
The optimizations he implemented roughly fall into three buckets:
- Algorithmic improvements.
- Avoiding allocations to reduce garbage collections.
- Staying on the JIT’s happy paths to gain compiler optimizations, and avoiding the performance cliffs that come with falling off those paths.
His article is well-measured and a great read. I particularly like it for three reasons:
- It is educational to peek into his profiling process, and he shares JavaScript performance techniques that are just plain fun to read about.
- The opportunities for algorithmic improvements that he noticed helped speed
up the Rust and WebAssembly implementation another 3x over what we had
previously seen. Thanks to his suggestions, the current version of the
source-map
library is now 5.3x faster than the original JavaScript in Chrome, 10.8x faster in Firefox, and 9.4x faster in Safari.
- He perfectly demonstrates one of the points my “Oxidizing” article was making: with Rust and WebAssembly we have reliable performance without the wizard-level shenanigans that are required to get the same performance in JavaScript.
Before continuing, I suggest that you go back and read the two articles if you haven’t already.
Staying on the JIT’s Happy Paths
First, mraleph
noticed that some sorting comparator functions had an arity of
three, the last parameter being optional, but were only ever actually invoked
with two arguments in hot code paths. In V8, this arity mismatch results in an
extra trampoline function in the emitted machine code, and fixing the arity
improved performance:
Just by fixing the arity mismatch we improved benchmark mean on V8 by 14% from 4774 ms to 4123 ms. If we profile the benchmark again we will discover that ArgumentsAdaptorTrampoline has completely disappeared from it. Why was it there in the first place? It turns out that ArgumentsAdaptorTrampoline is V8’s mechanism for coping with JavaScript’s variadic calling convention: you can call a function that has 3 parameters with 2 arguments - in which case the third parameter will be filled with undefined. V8 does this by creating a new frame on the stack, copying arguments down and then invoking the target function.
Impressive results for such a tiny change! But that’s also part of the problem:
the reasons for JavaScript’s good or bad performance don’t readily reveal
themselves to readers. Worse yet, what is fine in one engine’s JIT might be a
performance cliff in another engine’s JIT. This is the case with this arity
mismatch, as mraleph
notes:
For what it is worth, argument adaptation overhead seems to be highly V8-specific. When I benchmark my change against SpiderMonkey, I don’t see any significant performance improvement from matching the arity.
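To make the contrast concrete, here is a minimal Rust sketch (my own hypothetical example, not code from either article): a comparator’s signature is checked at compile time, so this class of arity mismatch is a type error rather than a silent trampoline in the emitted machine code.

```rust
use std::cmp::Ordering;

// A minimal, hypothetical sketch: `sort_by` requires a comparator
// taking exactly two arguments, so an arity mismatch is caught at
// compile time instead of costing an adaptor frame at run time.
fn main() {
    let mut mappings = vec![(2, 7), (1, 9), (1, 3)];
    mappings.sort_by(|a: &(u32, u32), b: &(u32, u32)| -> Ordering {
        a.0.cmp(&b.0).then(a.1.cmp(&b.1))
    });
    assert_eq!(mappings, vec![(1, 3), (1, 9), (2, 7)]);
}
```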
Next, mraleph
noticed that the generic sorting routine was not being
monomorphized, and the comparator functions passed to it were not being inlined.
To mold the JavaScript code into something that would match the JIT’s heuristics
for monomorphization and inlining, he stringified the functions to get their
source text and then re-evaluated that source text with new Function(...). This causes the JavaScript engine to create distinct functions
with their own type inference records. Therefore, from the JIT’s point of view,
the types passed to the new functions are not intermingled with those passed to
the originals, and the JIT will monomorphize them in the machine code it emits.
Stringifying a function to get its source text and re-evaluating it is a very clever trick, but it is not idiomatic JavaScript style. This introduces more convolutions to the codebase for the sake of performance, and maintainability suffers. Furthermore, although the code might match this JIT’s heuristics for inlining and monomorphization today, those heuristics may change in the future, and other JITs may use different heuristics altogether.
With Rust and WebAssembly, we have reliable speed without clever tricks.
WebAssembly is designed to perform well without relying on heuristic-based
optimizations, avoiding the performance cliffs that come if code doesn’t meet
those heuristics. It is expected that the compiler emitting the WebAssembly (in
this case rustc
and LLVM) already has sophisticated optimization
infrastructure, that the engine is receiving WebAssembly code that has already
had optimization passes applied, and that the WebAssembly is close to its final
form.
Rust lets us talk about monomorphization and inlining directly, without
circuitous incantations. We can annotate functions with #[inline], #[inline(always)], and #[inline(never)] attributes to explicitly communicate
our inlining suggestions to the compiler. With generic functions, we can choose
between monomorphization via type parameters and dynamic virtual calls via trait
objects.
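As a sketch of what that looks like (hypothetical code, not from the source-map library), the same sorting routine can be written either way:

```rust
use std::cmp::Ordering;

// Hypothetical comparator, not from the `source-map` library.
#[inline]
fn by_generated_location(a: &(u32, u32), b: &(u32, u32)) -> Ordering {
    a.0.cmp(&b.0).then(a.1.cmp(&b.1))
}

// Type parameter `F`: the compiler monomorphizes a specialized copy
// of this function for each concrete comparator, which can then be
// inlined into the sort.
fn sort_generic<T, F: Fn(&T, &T) -> Ordering>(items: &mut [T], cmp: F) {
    items.sort_by(|a, b| cmp(a, b));
}

// Trait object: a single copy of this function, with each comparison
// made through a dynamic virtual call.
fn sort_dyn<T>(items: &mut [T], cmp: &dyn Fn(&T, &T) -> Ordering) {
    items.sort_by(|a, b| cmp(a, b));
}

fn main() {
    let mut mappings = vec![(2, 7), (1, 9), (1, 3)];
    sort_generic(&mut mappings, by_generated_location);
    sort_dyn(&mut mappings, &by_generated_location);
}
```

Either choice is written into the signature, so it holds regardless of which engine or compiler version runs the code, rather than depending on whether a heuristic fires.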
Avoiding Allocation and Reducing Garbage Collection
The original JavaScript implementation parsed each mapping into a JavaScript Object, which must be allocated by the garbage collector. To reduce GC pressure during parsing, mraleph modified the parser so that, rather than allocating Objects, it uses a linear typed array buffer and a pointer into that buffer as a bump allocator. It writes parsed mappings into the typed array, doubling the typed array’s size whenever it fills up.
Effectively, this technique is like writing C code in JavaScript, but without even the abstractions that C provides. We can’t define any structs because that implies a GC allocation in JavaScript. In some cases, JITs can optimize away such allocations, but (once again) that depends on unreliable heuristics, and JIT engines vary in their effectiveness at removing the allocations.

Since we can’t define any structs to name the records and their members in the typed array, we must manually write, in mraleph’s words, “verbose and error prone code” to read and write memory[pointer + static_offset_of_member].
In contrast to JavaScript, safely avoiding garbage collection is Rust’s bread
and butter. Not only can we name a struct without boxing it or requiring a GC, we don’t have to sacrifice any of Rust’s abstractions to
do so. In fact, Rust’s ownership, lifetimes, and borrowing let us comfortably
avoid allocations and copies in ways that would cause severe headaches even in C
or C++.
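For illustration, here is a minimal sketch (hypothetical types, not the library’s actual code) of naming the record while still getting a flat, GC-free buffer:

```rust
// Hypothetical types for illustration, not the library's actual code.
struct Mapping {
    generated_line: u32,
    generated_column: u32,
    original_line: u32,
    original_column: u32,
}

fn main() {
    // The mappings live inline in the `Vec`'s single heap buffer:
    // no garbage collector, no per-mapping allocation, and the buffer
    // grows by doubling, much like the typed-array trick.
    let mut mappings: Vec<Mapping> = Vec::new();
    mappings.push(Mapping {
        generated_line: 1,
        generated_column: 3,
        original_line: 1,
        original_column: 1,
    });

    // Fields are read by name, not memory[pointer + static_offset_of_member].
    assert_eq!(mappings[0].generated_column, 3);
}
```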
Algorithmic Improvements
The final group of optimizations that mraleph implemented consisted of two algorithmic improvements that avoid sorting all of the mappings at once, and instead sort only subsequences of mappings at a time.
First, mraleph
noticed that the encoding for mappings means that they are
already sorted by generated line number as we parse them. It follows that, if we
wish to sort mappings by generated location (i.e. generated line and column,
breaking ties with original location), we only need to sort within each
generated line’s subsequence to sort the entire sequence by generated location.
Second, when sorting by original location, we can bucket mappings by their source file, and then we only need to sort one source file’s bucket at a time. Furthermore, we can lazily sort each source file’s bucket.
These algorithmic improvements aren’t providing fundamentally smaller big-O running times. In the worst case, there is only one generated line and only one original source file, which means that the subsequences we sort are actually the full sequence in both cases. And this isn’t a terribly rare worst case either: if you minify a single JavaScript file, you’ll likely create this scenario. But in the large Scala.js source map that is an input to the benchmark, and I suspect many other source maps found in the wild, both of these optimizations pay off.
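Here is a rough sketch of the first of those ideas (simplified, hypothetical code rather than the library’s implementation): because mappings arrive already grouped by generated line, sorting each line’s contiguous run leaves the whole sequence sorted by generated location.

```rust
// A simplified, hypothetical sketch, not the library's implementation.
struct Mapping {
    generated_line: u32,
    generated_column: u32,
}

// Mappings are parsed in order of generated line, so each line's
// mappings form a contiguous run. Sorting each run by column (the
// real comparator also breaks ties with original location) leaves
// the whole sequence sorted by generated location, without one big
// sort over all of the mappings.
fn sort_by_generated_location(mappings: &mut [Mapping]) {
    let mut start = 0;
    while start < mappings.len() {
        let line = mappings[start].generated_line;
        let mut end = start;
        while end < mappings.len() && mappings[end].generated_line == line {
            end += 1;
        }
        mappings[start..end].sort_by_key(|m| m.generated_column);
        start = end;
    }
}

fn main() {
    let mut ms = vec![
        Mapping { generated_line: 1, generated_column: 9 },
        Mapping { generated_line: 1, generated_column: 3 },
        Mapping { generated_line: 2, generated_column: 7 },
    ];
    sort_by_generated_location(&mut ms);
    assert_eq!(ms[0].generated_column, 3);
}
```

The second idea, bucketing by source file and sorting each bucket lazily, follows the same pattern: each bucket is sorted independently, and only when it is first queried.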
I implemented both subsequence sorting optimizations for the current Rust and
WebAssembly version of the source-map
library, and then measured the “setting
a breakpoint for the first time” benchmark once more, using the same methods
and setup as in the “Oxidizing” article. Below are the results;
recall that lower values are better.
Speed improved another ~3x over what we had previously seen. The current
version of the source-map
library is now 5.3x faster than the original
JavaScript in Chrome, 10.8x faster in Firefox, and 9.4x faster in Safari!
(mraleph
did not publish his complete set of changes, so I could not
directly compare his improved JavaScript implementation’s speeds in this
graph.)
Conclusion
Most of the improvements that mraleph
implemented are desirable regardless of
the programming language that is our medium. Excessive allocation rates make any
garbage collector (or malloc and free implementation) a bottleneck.
Monomorphization and inlining are crucial to eking out performance in both Rust
and JavaScript. Algorithms transcend programming languages.
But a distinction between JavaScript and Rust+WebAssembly emerges when we consider the effort required to attain inlining and monomorphization, or to avoid allocations. Rust lets us explicitly state our desires to the compiler, we can rely on the optimizations occurring, and Rust’s natural idioms guide us towards fast code, so we don’t have to be performance wizards to get fast code. In JavaScript, on the other hand, we must communicate with oblique incantations to match each JavaScript engine’s JIT’s heuristics.
mraleph
concludes with a bit of sound engineering advice:
Obviously each developer and each team are free to choose between spending N rigorous hours profiling and reading and thinking about their JavaScript code, or to spend M hours rewriting their stuff in a language X.
I couldn’t agree more! It is important to remember that engineering choices have trade-offs, and it is equally important to be cognizant of the choices in the first place.
We chose to rewrite a portion of our library in Rust and WebAssembly not just for the speed ups, although those are certainly nice, but also for maintainability and to get rid of the kludges added to gain JIT optimizations. We wanted to return to clean code and clearly expressed intent. It has been a great success.
Rewriting the whole library from scratch would not have been tenable either. We avoided that because Rust’s small runtime and freedom from a garbage collector mean that incremental adoption is both possible and practical. We were able to surgically replace just the hottest code paths with Rust and WebAssembly, leaving the rest of the library’s code in place.
Finally, I’d like to re-broadcast mraleph’s points about profiling:
Profiling and fine grained performance tracking in various shapes and forms is the best way to stay on top of the performance. It allows you to localize hot-spots in your code and also reveals potential issues in the underlying runtime. For this particular reason don’t shy away from using low-level profiling tools like perf - “friendly” tools might not be telling you the whole story because they hide the lower level from you.

Different performance problems require different approaches to profiling and visualizing collected profiles. Make sure to familiarize yourself with a wide spectrum of available tools.
This is sage advice, and his article is an outstanding example of letting a profiler lead your investigations.