Demangling C++ Symbols

I wrote a Rust crate to demangle C++ symbols: cpp_demangle. Find it on crates.io and github.

C++ symbols are mangled? Why?

Yes.

Linkers only support C identifiers for symbol names. They don’t have any knowledge of C++’s namespaces, monomorphization of a template function, overloaded functions, etc. That means that the C++ compiler needs to generate C identifier compatible symbols for C++ constructs. This process is called “mangling”, the resulting symbol is a “mangled symbol”, and reconstructing the original C++ name is “demangling”.

Let’s take an overloaded function as a concrete example. Say there are two overloads: one that takes a char* and another that takes an int:

void overloaded(char* string) { /* ... */ }
void overloaded(int n) { /* ... */ }

Although overloaded is a valid C identifier that the linker could understand, we can’t name both versions of the function overloaded because we need to differentiate them from each other for the linker.

We can see mangling in action if we run nm on the object file:

$ nm -j overloaded.o
__Z10overloadedPc
__Z10overloadedi

There is some prefixed gunk and then the types of parameters got mangled themselves and appended to the end of the function name: Pc for a pointer to a char, and i for int.

Here is a more complex example:

namespace foo {
    template <typename T>
    struct Bar {
        void some_method(Bar<T>* one, Bar<T>* two, Bar<T>* three) {
            /* ... */
        }
    };
}

void instantiate() {
    // Instantiate a Foo<int> and trigger a Foo<int>::some_method
    // monomorphization.
    foo::Bar<int> foo_bar_int;
    foo_bar_int.some_method(&foo_bar_int, &foo_bar_int, &foo_bar_int);

    // And do the same for Foo<char*>.
    foo::Bar<char*> foo_bar_char;
    foo_bar_char.some_method(&foo_bar_char, &foo_bar_char, &foo_bar_char);
}

This time the mangled symbols are a bit less human readable, and its a bit less obvious where some of the mangled bits even came from:

$ nm -j complex.o
__Z11instantiatev
__ZN3foo3BarIPcE11some_methodEPS2_S3_S3_
__ZN3foo3BarIiE11some_methodEPS1_S2_S2_

Do compilers have to agree on the mangling format?

If they want code compiled from one compiler to inter-operate with code compiled from another compiler (or even different versions of the same compiler), then yes, they need to agree on names.

These days, almost every C++ compiler uses the Itanium C++ ABI’s name mangling rules. The notable exception is MSVC, which uses a completely different format.

We’ve been looking at the Itanium C++ ABI style mangled names, and that’s what cpp_demangle supports.

Why demangle C++ symbols in Rust?

A few reasons:

I wanted to demangle some C++ symbols for gimli’s addr2line clone. But I also didn’t want to introduce complexity into the build system for some old C code, nor a dependency on a system library outside of the Rust ecosystem, nor spawn a c++filt subprocess.
Tom Tromey, a GNU hacker and buddy of mine, mentioned that historically, the canonical C++ demangler in libiberty (used by c++filt and gdb) has had tons of classic C bugs: use-after-free, out-of-bounds array accesses, etc, and that it falls over immediately when faced with a fuzzer. In fact, there were so many of these issues that gdb went so far as to install a signal handler to catch SIGSEGVs during demangling. It “recovered” from the segfaults by longjmping out of the signal handler and printing a warning message before moving along and pretending that nothing happened. My ears perked up. Those are the exact kinds of things Rust protects us from at compile time! A robust alternative might actually be a boon not just for the Rust community, but everybody who wants to demangle C++ symbols.
Finally, I was looking for something to kill a little bit of free time.

What I didn’t anticipate was how complicated parsing mangled C++ symbols actually is! The second bullet point should have been a hint. I hadn’t looked to deeply into how C++ symbols are mangled, and I expected simple replacement rules. But if that was the case, why would the canonical demangler have had so many bugs?

What makes demangling C++ symbols difficult?

Maybe I’ve oversold how complicated it is a little bit: it isn’t terribly difficult, its just more difficult than I expected it was going to be. That said, here are some things I didn’t expect.

The grammar is pretty huge, and has to allow encoding all of C++’s expressions, e.g. because of decltype. And its not just the grammar that’s huge, the symbols themselves are too. Here is a pretty big mangled C++ symbol from SpiderMonkey, Firefox’s JavaScript engine:

__ZN7mozilla6detail21VariantImplementationIhLm3EJNS_5TupleIJPN2js12NativeObjectEP8JSObjectNS3_19CrossCompartmentKey18DebuggerObjectKindEEEEEE5matchIRZNS8_14applyToWrappedIN12_GLOBAL__N_128NeedsSweepUnbarrieredFunctorEEEDTclfp_scPS7_LDnEEET_E14WrappedMatcherNS_7VariantIJS7_P8JSStringNS2_IJS5_P8JSScriptEEESA_EEEEEDTcldtfp_5matchcldtfp0_2asISA_EEEEOSI_RT0_

That’s 355 bytes!

And here is its demangled form – even bigger, with a whopping 909 bytes!

decltype(fp.match(fp0.as<mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> >())) mozilla::detail::VariantImplementation<unsigned char, 3ul, mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> >::match<decltype(fp(static_cast<JSObject**>(std::nullptr_t))) js::CrossCompartmentKey::applyToWrapped<(anonymous namespace)::NeedsSweepUnbarrieredFunctor>((anonymous namespace)::NeedsSweepUnbarrieredFunctor)::WrappedMatcher&, mozilla::Variant<JSObject*, JSString*, mozilla::Tuple<js::NativeObject*, JSScript*>, mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> > >((anonymous namespace)::NeedsSweepUnbarrieredFunctor&&, mozilla::Variant<JSObject*, JSString*, mozilla::Tuple<js::NativeObject*, JSScript*>, mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> >&)

The mangled symbol is smaller than the demangled form because the name mangling algorithm also specifies a symbol compression algorithm:

To minimize the length of external names, we use two mechanisms, a substitution encoding to eliminate repetition of name components, and abbreviations for certain common names. Each non-terminal in the grammar above for which <substitution> appears on the right-hand side is both a source of future substitutions and a candidate for being substituted.

After the first appearance of a type in the mangled symbol, all subsequent references to it must actually be back references pointing to the first appearance. The back references are encoded as S i _ where i is the i + 1^th substitutable AST node in an in-order walk of the AST (and S_ is the 0^th).

We have to be careful not to mess up the order in which we populate the substitutions table. If we got it wrong, then all the back references will point to the wrong component, and the final output of demangling will be garbage. cpp_demangle uses a handwritten recursive descent parser. This makes it easy to build the substitutions table as we parse the mangled symbol, avoiding a second back reference resolution pass over the AST after parsing. Except there’s a catch: the grammar contains left-recursion, which will cause naive recursive descent parsers to infinitely recurse and blow the stack. Can we transform the grammar to remove the left-recursion? Nope. Although we would parse the exact same set of mangled symbols, the resulting parse tree would be different, invalidating our substitutions table and back references! Instead, we resign ourselves to peeking ahead a character and switching on it.

The compression also means that the demangled symbol can be exponentially larger than the mangled symbol in the worst case! Here is an example that makes it plain:

template <typename T, typename U>
struct Pair {
    T t;
    U u;
};

using Pair2 = Pair<int, int>;
using Pair4 = Pair<Pair2, Pair2>;
using Pair8 = Pair<Pair4, Pair4>;
using Pair16 = Pair<Pair8, Pair8>;
using Pair32 = Pair<Pair16, Pair16>;
using Pair64 = Pair<Pair32, Pair32>;
using Pair128 = Pair<Pair64, Pair64>;
using Pair256 = Pair<Pair128, Pair128>;
using Pair512 = Pair<Pair256, Pair256>;

void lets_get_exponential(Pair512* p) {
    /* ... */
}

The mangled lets_get_exponential symbol is 91 bytes long:

__Z20lets_get_exponentialP4PairIS_IS_IS_IS_IS_IS_IS_IS_IiiES0_ES1_ES2_ES3_ES4_ES5_ES6_ES7_E

The demangled lets_get_exponential symbol is 5902 bytes long; too big to include here, but you can run that symbol through c++filt to verify it yourself.

The final thing to watch out for with compression and back references is malicious input that creates cycles with the references. If we aren’t checking for and defending against cycles, we could hang indefinitely in the best case, or more likely infinitely recurse and blow the stack.

What’s the implementation status?

The cpp_demangle crate can parse every mangled C++ symbol I’ve thrown at it, and it can demangle them too. However, it doesn’t always match libiberty’s C++ demangler’s formatting character-for-character. I’m currently in the process of getting all of libiberty’s C++ demangling tests passing.

Additionally, I’ve been running American Fuzzy Lop (with afl.rs) on cpp_demangle overnight. It found a panic involving unhandled integer overflow, which I fixed. Since then, AFL hasn’t triggered any panics, and its never been able to find a crash (thanks Rust!) so I think cpp_demangle is fairly solid and robust.

If you aren’t too particular about formatting, then cpp_demangle is ready for you right now. If you do care more about formatting, it will be ready for you soon.

What’s next?

Finish getting all the tests in libiberty’s test suite passing, so cpp_demangle is character-for-character identical with libiberty’s C++ demangler.
Add a C API, so that non-Rust projects can use cpp_demangle.
Add benchmarks and compare performance with libiberty’s C++ demangler. Make sure cpp_demangle is faster ;-)
A couple other things listed in the github issues.

That’s it! See you later, and happy demangling!

Thanks to Tom Tromey for sharing GNU war stories with me, and providing feedback on an early draft of this blog post.