Demangling C++ Symbols
...in Rust
I wrote a Rust crate to demangle C++ symbols: cpp_demangle
. Find it
on crates.io and github.
C++ symbols are mangled? Why?
Yes.
Linkers only support C identifiers for symbol names. They don’t have any knowledge of C++’s namespaces, monomorphization of a template function, overloaded functions, etc. That means that the C++ compiler needs to generate C identifier compatible symbols for C++ constructs. This process is called “mangling”, the resulting symbol is a “mangled symbol”, and reconstructing the original C++ name is “demangling”.
Let’s take an overloaded function as a concrete example. Say there are two
overloads: one that takes a char*
and another that takes an int
:
Although overloaded
is a valid C identifier that the linker could understand,
we can’t name both versions of the function overloaded
because we need to
differentiate them from each other for the linker.
We can see mangling in action if we run nm
on the object file:
$ nm -j overloaded.o
__Z10overloadedPc
__Z10overloadedi
There is some prefixed gunk and then the types of parameters got mangled
themselves and appended to the end of the function name: Pc
for a pointer to a
char
, and i
for int
.
Here is a more complex example:
This time the mangled symbols are a bit less human readable, and its a bit less obvious where some of the mangled bits even came from:
$ nm -j complex.o
__Z11instantiatev
__ZN3foo3BarIPcE11some_methodEPS2_S3_S3_
__ZN3foo3BarIiE11some_methodEPS1_S2_S2_
Do compilers have to agree on the mangling format?
If they want code compiled from one compiler to inter-operate with code compiled from another compiler (or even different versions of the same compiler), then yes, they need to agree on names.
These days, almost every C++ compiler uses the Itanium C++ ABI’s name mangling rules. The notable exception is MSVC, which uses a completely different format.
We’ve been looking at the Itanium C++ ABI style mangled names, and that’s what
cpp_demangle
supports.
Why demangle C++ symbols in Rust?
A few reasons:
-
I wanted to demangle some C++ symbols for
gimli
’saddr2line
clone. But I also didn’t want to introduce complexity into the build system for some old C code, nor a dependency on a system library outside of the Rust ecosystem, nor spawn ac++filt
subprocess. -
Tom Tromey, a GNU hacker and buddy of mine, mentioned that historically, the canonical C++ demangler in
libiberty
(used byc++filt
andgdb
) has had tons of classic C bugs: use-after-free, out-of-bounds array accesses, etc, and that it falls over immediately when faced with a fuzzer. In fact, there were so many of these issues thatgdb
went so far as to install a signal handler to catchSIGSEGV
s during demangling. It “recovered” from the segfaults bylongjmp
ing out of the signal handler and printing a warning message before moving along and pretending that nothing happened. My ears perked up. Those are the exact kinds of things Rust protects us from at compile time! A robust alternative might actually be a boon not just for the Rust community, but everybody who wants to demangle C++ symbols. -
Finally, I was looking for something to kill a little bit of free time.
What I didn’t anticipate was how complicated parsing mangled C++ symbols actually is! The second bullet point should have been a hint. I hadn’t looked to deeply into how C++ symbols are mangled, and I expected simple replacement rules. But if that was the case, why would the canonical demangler have had so many bugs?
What makes demangling C++ symbols difficult?
Maybe I’ve oversold how complicated it is a little bit: it isn’t terribly difficult, its just more difficult than I expected it was going to be. That said, here are some things I didn’t expect.
The grammar is pretty huge, and has to allow encoding all of C++’s
expressions, e.g. because of decltype
. And its not just the grammar that’s
huge, the symbols themselves are too. Here is a pretty big mangled C++ symbol
from SpiderMonkey, Firefox’s JavaScript engine:
__ZN7mozilla6detail21VariantImplementationIhLm3EJNS_5TupleIJPN2js12NativeObjectEP8JSObjectNS3_19CrossCompartmentKey18DebuggerObjectKindEEEEEE5matchIRZNS8_14applyToWrappedIN12_GLOBAL__N_128NeedsSweepUnbarrieredFunctorEEEDTclfp_scPS7_LDnEEET_E14WrappedMatcherNS_7VariantIJS7_P8JSStringNS2_IJS5_P8JSScriptEEESA_EEEEEDTcldtfp_5matchcldtfp0_2asISA_EEEEOSI_RT0_
That’s 355 bytes!
And here is its demangled form – even bigger, with a whopping 909 bytes!
decltype(fp.match(fp0.as<mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> >())) mozilla::detail::VariantImplementation<unsigned char, 3ul, mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> >::match<decltype(fp(static_cast<JSObject**>(std::nullptr_t))) js::CrossCompartmentKey::applyToWrapped<(anonymous namespace)::NeedsSweepUnbarrieredFunctor>((anonymous namespace)::NeedsSweepUnbarrieredFunctor)::WrappedMatcher&, mozilla::Variant<JSObject*, JSString*, mozilla::Tuple<js::NativeObject*, JSScript*>, mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> > >((anonymous namespace)::NeedsSweepUnbarrieredFunctor&&, mozilla::Variant<JSObject*, JSString*, mozilla::Tuple<js::NativeObject*, JSScript*>, mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> >&)
The mangled symbol is smaller than the demangled form because the name mangling algorithm also specifies a symbol compression algorithm:
To minimize the length of external names, we use two mechanisms, a substitution encoding to eliminate repetition of name components, and abbreviations for certain common names. Each non-terminal in the grammar above for which
<substitution>
appears on the right-hand side is both a source of future substitutions and a candidate for being substituted.
After the first appearance of a type in the mangled symbol, all subsequent
references to it must actually be back references pointing to the first
appearance. The back references are encoded as S i _
where
i is the i + 1th substitutable AST node in an in-order walk of
the AST (and S_
is the 0th).
We have to be careful not to mess up the order in which we populate the
substitutions table. If we got it wrong, then all the back references will point
to the wrong component, and the final output of demangling will be
garbage. cpp_demangle
uses a handwritten recursive descent parser. This makes
it easy to build the substitutions table as we parse the mangled symbol,
avoiding a second back reference resolution pass over the AST after
parsing. Except there’s a catch: the grammar contains left-recursion, which will
cause naive recursive descent parsers to infinitely recurse and blow the
stack. Can we transform the grammar to remove the left-recursion?
Nope. Although we would parse the exact same set of mangled symbols, the
resulting parse tree would be different, invalidating our substitutions table
and back references! Instead, we resign ourselves to peeking ahead a character
and switching on it.
The compression also means that the demangled symbol can be exponentially larger than the mangled symbol in the worst case! Here is an example that makes it plain:
The mangled lets_get_exponential
symbol is 91 bytes long:
__Z20lets_get_exponentialP4PairIS_IS_IS_IS_IS_IS_IS_IS_IiiES0_ES1_ES2_ES3_ES4_ES5_ES6_ES7_E
The demangled lets_get_exponential
symbol is 5902 bytes long; too big to
include here, but you can run that symbol through c++filt
to verify it
yourself.
The final thing to watch out for with compression and back references is malicious input that creates cycles with the references. If we aren’t checking for and defending against cycles, we could hang indefinitely in the best case, or more likely infinitely recurse and blow the stack.
What’s the implementation status?
The cpp_demangle
crate can parse every mangled C++ symbol I’ve thrown at it,
and it can demangle them too. However, it doesn’t always match libiberty
’s C++
demangler’s formatting character-for-character. I’m currently in the process of
getting all of libiberty
’s C++ demangling tests passing.
Additionally, I’ve been running American Fuzzy Lop (with afl.rs) on
cpp_demangle
overnight. It found a panic involving unhandled integer overflow,
which I fixed. Since then, AFL hasn’t triggered any panics, and its never been
able to find a crash (thanks Rust!) so I think cpp_demangle
is fairly solid
and robust.
If you aren’t too particular about formatting, then cpp_demangle
is ready for
you right now. If you do care more about formatting, it will be ready for you
soon.
What’s next?
- Finish getting all the tests in
libiberty
’s test suite passing, socpp_demangle
is character-for-character identical withlibiberty
’s C++ demangler. - Add a C API, so that non-Rust projects can use
cpp_demangle
. - Add benchmarks and compare performance with
libiberty
’s C++ demangler. Make surecpp_demangle
is faster ;-) - A couple other things listed in the github issues.
That’s it! See you later, and happy demangling!
Thanks to Tom Tromey for sharing GNU war stories with me, and providing feedback on an early draft of this blog post.