Introducing Stringleton
Stringleton is a very fast string interning library.
String interning: The technique of representing all strings which are equal by a pointer or ID that is unique to the contents of that strings, such that O(n) string equality check becomes a O(1) pointer equality check.
The defining characteristic of Stringleton is the sym!(...)
macro. It allows
the creation of interned strings (symbols) from literals in the code with
practically zero overhead.
The current state of things
Normally, creating an interned string from a literal in code involves the following steps:
- Acquire a global lock on some symbol registry.
- Determine if the string is already present in the registry.
- If so, return the existing ID or pointer.
- Otherwise, allocate the backing storage for the symbol and insert it in the registry.
All of this means that creating symbols in code is something you don't want to do in inner loops, or you may want to avoid it if you are scared of "death by a thousand cuts". In particular, the need to acquire a global lock or other synchronization primitives every time is quite the impediment.
How Stringleton does it
Instead, Stringleton uses a novel approach with some cooperation from the linker:
- Every call site of
sym!(...)
is statically registered at link time, usinglinkme::distributed_slice
. This creates a small "symbol table" in the.bss
section of the resulting binary. - When the process starts, all entries in the symbol table are "reconciled",
effectively performing steps 2 through 4 of the above (using the
ctor
crate). - When actually reaching the
sym!(...)
call site, it compiles into a single memory load.
The real kicker here is that the memory load at the call site does not even
need to be atomic. The reason this is safe and sound is that static initializer
functions (ctor
) are guaranteed to execute before main()
, and therefore
before any threads can possibly have started. In other words, by the time
main()
starts executing, the symbol table is already fully initialized, and it
is never modified again in the lifetime of the process.
The caveat is, of course, that using sym!(...)
in another static initializer
function is not safe.
Generated code
Consider this function:
fn get_symbol() -> Symbol {
sym!("Hello, World!")
}
This compiles into a single load instruction. Using cargo disasm
on x86-64
(Linux):
get_symbol:
8bf0 mov rax, qword ptr [rip + 0x52471]
8bf7 ret
For the uninitiated, that's a plain non-atomic memory load. Most notably, there are no function calls or locks acquired.
Life before main()
, a diatribe
The general assumption in Rust is that nothing happens before main()
. This
isn't actually true, and it's worth considering what the nature is of that
assumption.
Static initializer functions (or static constructors) fall in a weird kind of
grey area when it comes to safety. The big question is: Is safe code in
main()
allowed to rely on the invariant that static initializers have run
before main()
?
The answer currently seems to be a resounding maybe. It would be a useful rule,
as this crate demonstrates, but at the same time, static constructors is a
linker feature that may not be guaranteed to exist everywhere. In C, they
require nonstandard attributes. In C++ they are guaranteed to execute before
main()
, but esoteric configurations may forbid it.
Multiple things in the standard library do not work inside static
constructors, but some things do: alloc
and atomic synchronization primitives
like OnceLock
. It is unclear whether the things that don't work are guaranteed
to always fail in safe ways.
All #[ctor]
static initializer functions should be fundamentally considered
unsafe
, but not in the normal sense, because it isn't about the function's
preconditions (the function itself may be perfectly safe), but rather about
whether the function relies on the implicit precondition that static
initializers have run. In other words, the semantics are inverted: The function
may be unsafe to call in a static initializer, even though it only invokes safe
APIs, because those safe APIs may rely on other static initializers having
finished.
In effect, the "unsafe" part of a static initializer function is the fact that
it is being called before main()
, and that any function may rely on static
initializers having actually completed.
Before Edition 2024, Rust had no way of expressing this inversion, but now we
have "unsafe attributes". An example of this is no_mangle
, which is now
spelled #[unsafe(no_mangle)]
. The reason it matters for no_mangle
is that
defining a global symbol with a particular name may shadow other symbols with
the same name (on some platforms). It can effectively override some arbitrary
function, given the right linker flags, and that is deeply unsafe for obvious
reasons. On Linux and macOS, there are even official ways to do this: The
LD_PRELOAD
environment variable explicitly allows overriding symbols from an
executable with symbols from a dynamic library, which is another example of a
thing that "happens before main()
".
The ctor
crate does not yet require #[unsafe(ctor)]
, but it hopefully
will.