Porting libyaml to Safe Rust: Some Thoughts
The following is a loose collection of thoughts that came up while porting unsafe-libyaml to safe, idiomatic Rust, which may be useful to someone else who is considering "rewriting it in Rust", as the ancient journeyman prayer goes.
Some context: YAML is a commonly used data serialization format, similar to JSON, and the primary way that Rust programmers interact with it is through serde_yaml, a serializer/deserializer implementation targeting YAML through the very popular Serde framework. serde_yaml is backed by an almost verbatim machine translation of the C libyaml library into unsafe Rust, called unsafe-libyaml. The translation was made with c2rust, with only minor fixes applied manually.
In my case, I was in a situation where I needed to interact with YAML in a way where serde_yaml came up short. Specifically I needed input location markers outside of error messages to support diagnostics and debugging in the context of a game, where the AI behavior trees are defined in YAML files.
But the unsafe libyaml API is pretty unwieldy to use from Rust, even when translated to Rust code, because it behaves exactly like the C code, so it is no different from calling out to an actual C library over FFI. So I - perhaps foolishly, in a flash of manic youthful optimism - decided to embark on the journey of porting unsafe-libyaml to safe Rust, function by function, line by line. It took a little over a week.
The result is available on GitHub.
Update 2024-02-09: libyaml-safer has now been published to crates.io as well.
What are we dealing with?
libyaml consists of two main components:
- YAML parser, which consumes a stream of bytes and produces "events" representing the structure of the document.
- YAML emitter, which consumes a stream of events and produces a stream of bytes.
The byte streams, both input and output, can be in different encodings, one of UTF-8, UTF-16 little-endian, or UTF-16 big-endian. Most, but not all, Unicode code points are supported - notably, control characters are not supported.
Events in libyaml are a linear representation of the logical structure of a document, so higher level than parser tokens, but lower level than mappings, sequences, and scalar values. Events are things like "begin mapping", "begin mapping key", "scalar", etc.
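For a feel of what that stream looks like, here is a small self-contained model - the enum and its variant names are illustrative stand-ins, not the actual libyaml or libyaml-safer types - of the events produced for a one-line document:

// A minimal model of the events produced for the document `greeting: Hello`.
#[derive(Debug)]
enum Event {
    StreamStart,
    DocumentStart,
    MappingStart,
    Scalar(&'static str),
    MappingEnd,
    DocumentEnd,
    StreamEnd,
}

fn main() {
    let events = [
        Event::StreamStart,
        Event::DocumentStart,
        Event::MappingStart,
        Event::Scalar("greeting"), // mapping key
        Event::Scalar("Hello"),    // mapping value
        Event::MappingEnd,
        Event::DocumentEnd,
        Event::StreamEnd,
    ];
    for event in &events {
        println!("{event:?}");
    }
}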
Libyaml also contains a high-level abstraction, yaml_document_t, which can represent any YAML document. The parser can load a stream of events into a document, and the emitter can produce a stream of events from a document.
Eldritch Horrors, or just C
libyaml is squarely a C library, and reading it from the point of view of a Rust programmer, there is a beautiful insanity to it all. It hails from a simpler time, when concepts like ownership, mutability, and the fabric of reality were up for debate.
It isn't "wrong", just different. Unnecessarily difficult in many ways, but in other ways refreshingly straightforward, so long as you can keep the cosmic horror only just barely out of sight.
String Theory
YAML is a human-readable format, and any parser for it must deal with a lot of strings. Thankfully, libyaml does the sane thing and defines its own string type - which almost every single C library in existence does, as do many larger C++ libraries - and sometimes they don't even contain very many bugs. I swear the history of the world would have looked very, very different if our venerable forebears had extended their otherwise infinite benevolence to the inclusion of a sane string type in the standard C library, but again I digress.
This is the string abstraction in libyaml:
struct yaml_string_t {
    char* start;
    char* end;
    char* pointer; // <-- ???
};
The start and end fields seem straightforward: a pointer to the beginning of the string and a pointer to one-past-the-end of the string. So far so good. But what about pointer? In some places in the code it is used as the actual end of the string (so end is actually the end of the allocated capacity of the string), and in others it is used as the actual start of the string, so start is the beginning of the underlying allocation, and pointer is used to iterate over bytes or substrings of the string.
What's going on here?
It turns out that this data structure serves multiple different purposes simultaneously: it is &str, String, and std::slice::Iter, all in one. Sometimes it is even the std::str::Chars iterator, which is used to iterate over full UTF-8 codepoints within the string. The only way to determine which abstraction it is currently emulating is to look at the context.
Thankfully - libyaml being a well-structured C project - there are plenty of context clues. When the STRING_DEL() macro is being used, the caller assumes ownership. When the STRING_ASSIGN() macro is being used, the string is being borrowed and/or used as an iterator.
The interesting takeaway is this: for all of the different roles that yaml_string_t takes on, it turns out there is an abstraction in the Rust standard library that matches each use case precisely. At every point in the libyaml source code where strings are used, the use case maps directly onto one of the many ways of dealing with strings in Rust. Phew!
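As a rough sketch of that correspondence - plain standard-library types, nothing libyaml-specific - the roles played by yaml_string_t map onto something like this:

fn main() {
    // The owning role (the one implied by STRING_DEL()): an owned, growable string.
    let owned: String = String::from("key: value");

    // The borrowing role (the one implied by STRING_ASSIGN()): a borrowed view.
    let borrowed: &str = &owned;

    // The byte-cursor role of `pointer`: iterate over the bytes of the string.
    let mut bytes: std::slice::Iter<'_, u8> = borrowed.as_bytes().iter();
    assert_eq!(bytes.next(), Some(&b'k'));

    // The codepoint-cursor role of `pointer`: iterate over full UTF-8 codepoints.
    let mut chars: std::str::Chars<'_> = borrowed.chars();
    assert_eq!(chars.next(), Some('k'));
}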
Joyful aside: There is a similar type in libyaml, called yaml_buffer_t. It is slightly more complicated than strings, because it acts essentially as a ring buffer, and it turned out to map almost directly onto a VecDeque from the standard library. Hooray!
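For illustration, here is a tiny sketch of that ring-buffer usage with a plain VecDeque - nothing here is libyaml-specific:

use std::collections::VecDeque;

fn main() {
    let mut buffer: VecDeque<u8> = VecDeque::new();

    // Refill the back of the buffer with bytes from the input...
    buffer.extend(b"key: value\n".iter().copied());

    // ...and consume bytes from the front as they are decoded.
    while let Some(byte) = buffer.pop_front() {
        // A real parser would feed `byte` to the tokenizer here.
        let _ = byte;
    }
}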
Self-referential types - Oh my!
YAML syntax can represent the same thing in multiple ways. For example, sequences and mappings can be represented in both "flow style" and "block style", and strings can be single-quoted, double-quoted, unquoted, block-quoted, etc.
flow_style_mapping: { a: "Hello", b: "World" }
block_style_mapping:
  a: Hello
  b: World
flow_style_sequence: [1, 2, 3]
block_style_sequence:
  - 1
  - 2
  - 3
The libyaml emitter wants to choose formatting for the output that is both nice and correct. Not just for aesthetics, but also because some things actually require specific styles - for example, escape sequences are only supported in double-quoted strings, and some Unicode characters need to be represented as escape sequences.
The emitter works by pushing "events" to a queue. Events are things like "begin mapping", "begin mapping key", "begin mapping value", "end mapping", and so on, and the emitter writes to the output when it has enough information to determine the appropriate style.
This is implemented in libyaml as an "analysis" step that happens for each event before the corresponding YAML is written to the output.
static int
yaml_emitter_analyze_event(yaml_emitter_t *emitter,
        yaml_event_t *event)
{
    emitter->anchor_data.anchor = NULL;
    emitter->anchor_data.anchor_length = 0;
    emitter->tag_data.handle = NULL;
    emitter->tag_data.handle_length = 0;
    emitter->tag_data.suffix = NULL;
    emitter->tag_data.suffix_length = 0;
    emitter->scalar_data.value = NULL;
    emitter->scalar_data.length = 0;

    switch (event->type)
    {
        // ...
    }
    // ...
}
The analysis step populates the emitter->anchor_data, emitter->tag_data, and emitter->scalar_data fields depending on the event type. Those fields that can be NULL are C strings, and they refer to the strings contained within the event. But the event comes from a queue owned by yaml_emitter_t. This is great design in the C library, because it makes the analysis step extremely cheap - no strings are duplicated, only "borrowed".
So how do we express the analysis fields in safe Rust? Blindly porting the C approach to Rust would result in yaml_emitter_t becoming self-referential, since the analysis pointers would either point into the event queue or into an event that was just popped from the queue, making it impossible to represent that lifetime on yaml_emitter_t itself.
Key Insight: Every time yaml_emitter_analyze_event() is called, it resets any previous analysis results. It is called exactly once per event. Aha! What if the results of the analysis were a return value from yaml_emitter_analyze_event(), instead of fields modified on the emitter?
#[derive(Default)]
struct Analysis<'a> {
    pub anchor: Option<AnchorAnalysis<'a>>,
    pub tag: Option<TagAnalysis<'a>>,
    pub scalar: Option<ScalarAnalysis<'a>>,
}

struct AnchorAnalysis<'a> {
    pub anchor: &'a str,
    pub alias: bool,
}

struct TagAnalysis<'a> {
    pub handle: &'a str,
    pub suffix: &'a str,
}

struct ScalarAnalysis<'a> {
    /// The scalar value.
    pub value: &'a str,
    /// Does the scalar contain line breaks?
    pub multiline: bool,
    /// Can the scalar be expressed in the flow plain style?
    pub flow_plain_allowed: bool,
    /// Can the scalar be expressed in the block plain style?
    pub block_plain_allowed: bool,
    /// Can the scalar be expressed in the single quoted style?
    pub single_quoted_allowed: bool,
    /// Can the scalar be expressed in the literal or folded styles?
    pub block_allowed: bool,
    /// The output style.
    pub style: ScalarStyle,
}

fn yaml_emitter_analyze_event<'a>(
    emitter: &yaml_emitter_t,
    event: &'a yaml_event_t,
) -> Analysis<'a> {
    match event.type_ {
        // ...
    }
}
Wonderful! Now the Analysis<'a> struct just needs to be passed as an extra parameter into any emitter function that needs the analysis. Thankfully no public API function on the emitter relies on analysis having happened previously, so threading that argument through the emitter code is fairly trivial, and guaranteed bug-free by the lifetime annotations, in the sense that it is not possible for any part of the emitter to use the analysis of one event during the emission of another event.
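Here is a minimal, self-contained model of that relationship - the types are simplified stand-ins, not the real libyaml-safer definitions - showing how the lifetime ties the analysis to the event it was computed from:

struct Event {
    scalar: String,
}

struct Analysis<'a> {
    value: &'a str,
}

// The analysis borrows from the event, and the signature says so.
fn analyze(event: &Event) -> Analysis<'_> {
    Analysis { value: &event.scalar }
}

// Downstream emitter steps receive the analysis as an extra parameter.
fn emit(analysis: &Analysis<'_>) {
    println!("emitting {}", analysis.value);
}

fn main() {
    let event = Event { scalar: String::from("Hello") };
    let analysis = analyze(&event);
    emit(&analysis);
    // The borrow checker will not let `analysis` outlive `event`, so no emitter
    // step can accidentally use the analysis of one event while emitting another.
}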
On the one hand, this change is something that we're forced to do by Rust, because it would not be possible to make the borrow checker happy without it. On the other hand, the Rust compiler actually nudges us to choose a better design than the original.
Why is this beneficial? Why do we accept such blatant tyranny by the cruel and cold-hearted borrow checker?
Well, see, reading the original source code for libyaml was hard. Upon seeing the anchor_data, tag_data, and scalar_data fields of yaml_emitter_t, the first thing any reader or maintainer has to do is figure out how those pointers are used and when they are valid. This required a whole bunch of forensics, which would have to happen again and again every time someone new looked at the code, and "someone new" also includes "future you".
The original design from the C code could have been ported verbatim to Rust. But it would require a lot of unsafe code, which would have made it very obvious that some cognitive overhead was being incurred. By choosing a design that works in safe Rust, that cognitive overhead has been moved to a statically verified invariant that is plainly visible in the source code. The new signature of yaml_emitter_analyze_event plainly documents that Analysis<'a> contains references that come from a yaml_event_t with lifetime 'a.
Error handling
One of my goals with libyaml-safer was to provide near perfect error fidelity compared with libyaml.
Mercifully, C does not have exceptions. They are easily among the top 20 of humanity's mistakes, and don't even come for me because I will die on this hill.
Instead, C code usually reports errors by returning an error code, and this is already a primitive version of what Rust does. The flow around error handling is surprisingly similar between well-structured C code and idiomatic Rust code. Rust has the "try" operator ? to make things easier, and one of the important affordances of unsafe-libyaml (Rust) over libyaml (C) is that integer return codes are replaced with a bespoke Success struct. Replacing this with Result turned out to be pretty trivial.
But wait a minute... We can't just use the ? operator straight away. It checks whether an error occurred and returns early from the function if it did, but C does not have RAII, so resources also need to be released before returning. This is the one remaining valid use case for goto in the C language, and it is often actually the most sustainable way to deal with errors. More civilized languages (such as Rust) don't even have this misfeature.
void* stupid_c() {
    void* some_memory = malloc(1024);
    if (some_function(some_memory) != ERROR) {
        // ...
    } else {
        goto cleanup;
    }
    if (some_other_function(some_memory) != ERROR) {
        return some_memory;
    }
cleanup:
    // We're returning an error, so we need to free what we allocated.
    free(some_memory);
    return NULL;
}
Thus, in order to start leveraging the error handling features of Rust, all things that own memory must first be converted to RAII objects, such as String, Vec, VecDeque, etc.
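For contrast with the C sketch above, here is what the same shape looks like once ownership is handled by RAII types - the helper functions and the error type are stand-ins, not real libyaml-safer APIs:

struct Error;

fn some_function(_buf: &mut [u8]) -> Result<(), Error> {
    Ok(())
}

fn some_other_function(_buf: &mut [u8]) -> Result<(), Error> {
    Ok(())
}

fn not_so_stupid_rust() -> Result<Vec<u8>, Error> {
    // The Vec owns its allocation; it is freed automatically on every return path.
    let mut some_memory = vec![0u8; 1024];
    some_function(&mut some_memory)?;       // early return on error, no goto needed
    some_other_function(&mut some_memory)?; // ditto
    Ok(some_memory)
}

fn main() {
    let _ = not_so_stupid_rust();
}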
After that, there is the problem of actually reporting the error.
Rust Results are rich: they can contain an error value of your choice, so you can provide the best possible message for the users, giving them all the context they need. C does not have this option built in, and most libraries opt for simply returning an error code, and then sometimes allowing the programmer to obtain more information about the error after the fact.
In libyaml, this looks like so:
struct yaml_parser_t {
    yaml_error_type_t error;
    const char* problem;
    yaml_mark_t problem_mark;
    const char* context;
    yaml_mark_t context_mark;
    // ...
};
When an error occurs, the relevant function (such as yaml_parser_parse()) sets the error fields (yaml_set_parser_error()) and returns an integer indicating that an error occurred. The user is then expected to access these fields to determine what the error was and where in the input it occurred.
In Rust, we don't want a special "error" state for the parser. Ideally, we want to be able to return a self-contained error instead, which can then have all the usual niceties, like an implementation of std::fmt::Display.
Note also that the reported error is not necessarily a "parse error". Lots of errors can occur that aren't typically considered syntax errors: I/O read errors, invalid Unicode, tokenization errors, etc.
Let's try with some enums:
enum ParserError {
    Problem {
        problem: &'static str,
        problem_mark: Mark,
        context: &'static str,
        context_mark: Mark,
    },
    Scanner(ScannerError),
}

enum ScannerError {
    Problem {
        problem: &'static str,
        problem_mark: Mark,
        context: &'static str,
        context_mark: Mark,
    },
    Reader(ReaderError),
}

enum ReaderError {
    Problem {
        problem: &'static str,
        offset: usize,
        value: i32,
    },
    Io(std::io::Error),
}
So nice, perfect fidelity and high-quality error reporting. Except... On a 64-bit architecture, the size of the top-level ParserError enum is 88 bytes, mostly due to the Mark type, which indicates a location in the input, and which looks like this:
struct Mark {
    pub line: u64,
    pub column: u64,
    pub index: u64,
}
That's 24 bytes per mark, and each error typically needs two marks.
Why is the size a problem?
- Even the minimal Result<(), ParserError> can never be returned in a register.
- Every function that encounters a Result<T, ParserError> and wants to return Result<U, ParserError> needs to re-wrap the return value, no matter if an error occurred or not, leading to a lot of copying of large structs in stack memory.
- While LLVM will inline calls to memcpy() below 128 bytes, instructions to copy these values are still generated.
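To put rough numbers on this, here is a small self-contained check - it uses simplified copies of the types above rather than the exact definitions, so the exact enum size may differ, but the shape of the problem is the same:

use std::mem::size_of;

struct Mark {
    line: u64,
    column: u64,
    index: u64,
}

enum ParserError {
    Problem {
        problem: &'static str,
        problem_mark: Mark,
        context: &'static str,
        context_mark: Mark,
    },
    Io(std::io::Error),
}

fn main() {
    println!("Mark:                          {} bytes", size_of::<Mark>());
    println!("Result<(), ParserError>:       {} bytes", size_of::<Result<(), ParserError>>());
    println!("Result<(), Box<ParserError>>:  {} bytes", size_of::<Result<(), Box<ParserError>>>());
}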
The parser makes a huge number of tiny calls, and each of them can fail. Every time it needs a new token, it might also need a new character, meaning it might also need another byte from the input buffer, meaning it might also need to perform a read() operation from the input, which can fail with std::io::Error, or with a Unicode decoding error, or with a tokenization error, and finally a parser error.
There are a couple of options for moving forward:
1. Mimic the libyaml style of errors, and set error codes on the parser object itself. This gives a much less ergonomic API, or would introduce a lifetime parameter on all return values, which isn't very nice.
2. Codegolf the above enums to reduce the size. Not much can be won without losing fidelity, but it could probably get down to about 64 bytes.
3. Wrap the error in a Box, reducing the size to only 8 bytes on 64-bit platforms. The heap allocation isn't great, but it would only happen in error scenarios, so it's trading some efficiency in the error path for a faster "happy" path.
4. Accept that this is our life now and move on.
All compromises, either in usability or performance.
I benchmarked options (3) and (4) against each other in the happy path, and there wasn't much difference. But it still feels wrong, and committing to an "inefficient" public API where the user can see the inner workings of the error enums is a semver hazard, so for now I've chosen an API that supports both implementations, with option (3), the boxed error, under the hood.
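To illustrate option (3), here is a hedged sketch of the boxed-error pattern - the variant names are illustrative, not the actual libyaml-safer error type - in which the public type stays pointer-sized and the big enum becomes a private implementation detail:

use std::fmt;

pub struct Error(Box<ErrorImpl>);

enum ErrorImpl {
    Parser { problem: &'static str, offset: usize },
    Io(std::io::Error),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &*self.0 {
            ErrorImpl::Parser { problem, offset } => write!(f, "{problem} at byte offset {offset}"),
            ErrorImpl::Io(err) => write!(f, "i/o error: {err}"),
        }
    }
}

fn main() {
    // In practice, the niche in Box keeps Result<(), Error> pointer-sized.
    println!("Result<(), Error>: {} bytes", std::mem::size_of::<Result<(), Error>>());
}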
Zooming out, it's worth noting that serde_yaml::Error has actually chosen (3), a boxed error, which is an indication that I am not the only person in the world who considers this an acceptable tradeoff.
Freebies
Rust, being a younger language, has a much richer standard library. This means that several things that are implemented manually in libyaml can be replaced with standard facilities.
- UTF-8 and UTF-16 encoding/decoding. These aren't particularly complicated, but the Rust standard library is likely to be better optimized (for example, using SIMD when available).
- Rather than user-provided unsafe callbacks, input and output can be handled by the std::io::Read and std::io::Write traits. This gives users the ability to plug anything in there, such as tty streams, network streams, etc., with no extra effort.
- The parser requires some input buffering to perform correct decoding, but libyaml is practically forced to do this by itself, even when the input is already buffered by the user. By using the std::io::BufRead trait, an entire intermediate buffer was eliminated from the parser, meaning that if the data is already coming from an in-memory buffer, the unnecessary copy is elided.
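As a sketch of what that buys - the function below is a stand-in, not the actual libyaml-safer API - any std::io::BufRead can be consumed chunk by chunk without an intermediate buffer, and an in-memory byte slice already is a BufRead:

use std::io::BufRead;

// Hypothetical helper: consume a BufRead without copying into an intermediate buffer.
fn count_bytes<R: BufRead>(mut input: R) -> std::io::Result<usize> {
    let mut total = 0;
    loop {
        let chunk = input.fill_buf()?; // borrow the reader's internal buffer directly
        if chunk.is_empty() {
            break;
        }
        let len = chunk.len();
        // A real parser would tokenize `chunk` here.
        total += len;
        input.consume(len);
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let document = b"key: value\n";
    // A byte slice implements BufRead, so no extra buffering happens at all.
    let n = count_bytes(&document[..])?;
    println!("consumed {n} bytes");
    Ok(())
}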
Testing
unsafe-libyaml is automatically tested against the official YAML test suite. Having this comprehensive test suite was an absolute necessity when porting to safe Rust, and I would not have dared to embark on this endeavor without it.
For every change to the code, I was able to run the full test suite and verify that everything was as it should be. And even though the code was ported with a true dedication to matching the original semantics, and even the general "shape" of the API, many things could go wrong - particularly around flow control and errors. See, the use of gotos is emulated by c2rust using complicated loop/match/break constructs, all of which went away completely, replaced by Rust's "try" operator, ?. Even trivial changes are an opportunity for bugs to sneak in, and even more so when there are hundreds of them.
Even though it is much easier to write correct code in safe Rust than in C, logic bugs are obviously just as possible, particularly when performing very large scale refactoring.
Performance
In order to compare the performance of unsafe-libyaml and the safe Rust port, I found a relatively large YAML document (~700 KiB) and built a very simple benchmark using Criterion.
In short:
| | unsafe-libyaml | libyaml-safer |
|---|---|---|
| Parse 700 KiB document | 38.258 ms | 37.447 ms |
| Emit 700 KiB document | 18.491 ms | 17.549 ms |
Hold for applause.
These benchmarks were run on an AMD Ryzen 9 3950X 16-Core CPU (3.5 GHz) running Windows 11.
Analysis
- Important: While I do consistently get slightly faster times for libyaml-safer, I don't consider this benchmark to be conclusive in any way.
- What these numbers do indicate is that the port to safe, idiomatic Rust does not produce meaningfully slower code.
- The justification for "unsafe" C and C++ libraries is often that some performance is left on the table with safe Rust, primarily due to bounds checking. On modern hardware this justification has always seemed a bit suspicious to me, and I think this is another piece of (circumstantial) evidence that it often just doesn't matter in practice.
- It's certainly possible to build things where bounds checking makes a difference. This just isn't that.
- If libyaml-safer is consistently faster for all workloads, here are some possible explanations:
  - C libyaml needs to assert on a couple of invariants (such as pointers not being NULL), and unsafe-libyaml performs these assertions even in release mode. Some of these invariants are statically checked in safe Rust, where &T can never be "NULL".
  - Due to the use of std::io::BufRead in the parser, the parser performs fewer in-memory copies.
- Even if this benchmark is completely bogus, the performance of libyaml-safer is still well within tolerance for what is acceptable, considering the massive increase in maintainability and safety.
- Zero effort was made to optimize things, other than just not choosing obviously slower safe alternatives.
- One optimization that might make sense is to implement more efficient lookahead in the parser. Currently, there are many places where the scanner and parser perform many redundant bounds checks on the input buffer.
Conclusion
This was a lot of fun.
I think that to me, and to a lot of software engineers who came from a C++ background, using Rust is not just about nebulous promises of safety and correctness.
It's about bringing the joy back into programming.
We all like to design pretty, user-friendly APIs; we all like to produce things that just feel high quality. Many of us don't get to do that, because the realities of our industry often do not permit it. And so: does it spark joy?
Coding in Rust feels like having your cake and eating it too. When people are productive in Rust, it is because many things that used to be hard - things that cause tears, blood, and burnout - are no longer as hard. The result is not just safer and easier, it is also just as performant. So please - join us.
Footnotes
- In this post I have preserved type names and function names from unsafe-libyaml, but as part of porting to idiomatic Rust, I also renamed all types and enums following idiomatic Rust conventions.