There is room for imperfect fault isolation

Date: 2022-12-06

In a discussion about the merits of non-volatile memory from a software perspective, Bryan Cantrill made a comment which I think is worth digging into:

In many ways, the gnarliest bug I've ever been involved in debugging was a kernel data corruption bug that managed to leap the fireline into ZFS. We had a couple of instances where that wild kernel data corruption had corrupted a buffer that was on its way out to disk... In software, we don't actually keep auxiliary data structures to allow us to repair our state in-memory, but we're gonna need to do that in a world that's all non-volatile.

Whether an OS kernel's interface to persistent storage is a conventional one, or something that looks like RAM, it still has privileged access to all the same functionality and is equally susceptible to memory corruption. Yet the argument made here is that something about the nature of the NVRAM interface makes it more likely for memory corruption within the kernel to escalate into a more severe problem.

I think this is both true and really uncomfortable to contemplate. The data corruption described here was probably a result of the dreaded Undefined Behavior. Conventionally, the way we teach and reason about Undefined Behavior avoids thinking too hard about exactly what the consequences will be: "The compiler doesn't make any guarantees about what will happen, so you have to assume that anything could happen. You can't rely on anything useful happening, so it is best to just never perform undefined behavior, and you'll never have to think about it." In this view, the way to make software more robust is to use tools, practices, and languages that reduce the risk of a program performing undefined behavior by mistake.

But this doesn't explain the issue at hand: the reason we might expect applications that use NVRAM for persistent storage to be less robust to memory corruption is that a wild pointer can affect persistent data by landing anywhere within the region of the address space mapped as storage. With conventional storage, the corruption would need to hit a smaller target: an (often short-lived) buffer holding data that is about to be written to persistent storage, or possibly other memory locations holding the size of that buffer or the position on disk to write it to. The same risk evaluation applies in user space to programs in unsafe languages that use memory-mapped file I/O, especially since it is common practice to read these kinds of files by type punning directly with C structs, which makes it easy to skip careful validation.
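To make the "bigger target" concrete, here is a minimal user-space sketch of the mapped-storage case (not from the original discussion; the ledger.db file and the record layout are made up). With a MAP_SHARED file mapping, any stray store that lands anywhere in the mapped window eventually becomes persistent, with no explicit write call standing between the bug and the disk:

```c
/*
 * Minimal sketch of the wider target described above. Hypothetical setup:
 * "ledger.db" must already exist and be at least 4096 bytes, otherwise the
 * mapped accesses below would fault with SIGBUS.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Common (and risky) practice: overlay a C struct directly on the file. */
struct record {
    uint64_t id;
    uint64_t balance;
};

int main(void) {
    int fd = open("ledger.db", O_RDWR);   /* hypothetical data file */
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 4096;
    struct record *recs = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    if (recs == MAP_FAILED) { perror("mmap"); return 1; }

    /* With MAP_SHARED, any stray store that lands inside [recs, recs + len)
     * is eventually written back to disk; no write() call stands in between.
     * A wild pointer elsewhere in the program only has to hit this window. */
    uint64_t *wild = (uint64_t *)((char *)recs + 1232);  /* simulated stray write */
    *wild = 0xDEADBEEFULL;

    msync(recs, len, MS_SYNC);   /* the corruption is now persistent */
    munmap(recs, len);
    close(fd);
    return 0;
}
```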

This argument exposes the nuance within undefined behavior: the standard says that anything could happen upon access to an invalid pointer, but in practice some anythings are much worse than others. There is a spectrum of robustness, even when you can no longer rely on your abstractions to make any hard-and-fast guarantees.

As someone who falls more on the theoretical side, I don't like the prospect of having to make these considerations when designing software. When I see an interface that has some kind of undefined condition, I would always hope that we can use better tools, practices, and languages to remove the possibility of getting into this condition in the first place. Alternatively, we could wrap the whole thing in a sandbox at runtime, to at least restrict the scope of what problems the undefined behavior could cause.

What makes me uncomfortable is the idea of having to arbitrate between anythings. I don't want to have to think about what my software does in situations where my normal mental model tells me it's irreparably broken. It doesn't help that a lot of tooling is also built with the assumption that we don't care about handling these cases.

But just because it makes me uncomfortable doesn't make it wrong. A sufficiently complex program, no matter how carefully written, will inevitably reach cases where it fails to uphold its own expectations. You might be able to prevent 99% of undefined behavior with static or language-level solutions, but the burden of doing this rapidly approaches infinity as you try to move that percentage closer to 100. Therefore, in domains where robustness is absolutely critical, we need to acknowledge as a community that "abstinence-only undefined behavior education" is not enough.

(I want to emphasize that when I say "undefined behavior" here, I don't only mean at the language level. It's also possible for libraries and applications to get into situations where "all bets are off" because they can no longer trust their own invariants. We can think of undefined behavior in C and friends to be a situation where the program does something that prevents the abstraction below it — the language — from maintaining its invariants.)
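As a small illustration of that above-the-language flavor of "all bets are off" (the arrays here are made up): the C library's bsearch requires a sorted array, and once that ordering invariant is broken, say by stray writes elsewhere, its answer is meaningless; it may well report a key as missing even though the value is sitting right there in the array:

```c
/* The arrays below are made up; the point is only that bsearch's contract
 * (a sorted array) is an invariant the library cannot check for you. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int sorted[]    = {1, 3, 5, 7, 9};
    int corrupted[] = {7, 9, 5, 3, 1};   /* same values, ordering invariant broken */
    int key = 7;

    int *hit = bsearch(&key, sorted, 5, sizeof sorted[0], cmp_int);
    printf("sorted:    %s\n", hit ? "found" : "not found");   /* found */

    hit = bsearch(&key, corrupted, 5, sizeof corrupted[0], cmp_int);
    /* Likely prints "not found" even though 7 is right there at index 0;
     * the exact outcome depends on the implementation, i.e. all bets are off. */
    printf("corrupted: %s\n", hit ? "found" : "not found");
    return 0;
}
```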

Case studies

A few more interesting examples, not all directly related, that inform my thinking on this:

What does imperfect fault isolation in software actually look like?

I don't have a whole lot of useful thoughts about this. It's not what I originally planned this post to be about. But I will say this much: if a program's state can be divided into different subsystems, it gets fault isolation between those subsystems insofar as it avoids having code in one subsystem depend on invariants of data belonging to another. I have a potential hot take, which is that advice like "parse, don't validate" and "use the type system to enforce invariants" advises the opposite: introducing more coupling between the invariants, and therefore the correctness, of different parts of the program. This kind of design even gets labeled as "elegant".
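Here is a rough sketch of the coupling I mean (the queue structure and function names are hypothetical). A consumer that trusts another subsystem's invariant turns any corruption of that invariant into its own memory-safety bug; a consumer that re-checks at its boundary can contain the fault and fail cleanly:

```c
/* The queue layout and function names here are hypothetical. */
#include <stdio.h>
#include <string.h>

#define CAP 64

struct queue {              /* owned by subsystem A */
    size_t len;             /* A's invariant: len <= CAP */
    char   data[CAP];
};

/* Coupled: assumes A's invariant holds. If q->len has been corrupted,
 * the failure surfaces here, in a different subsystem, as an
 * out-of-bounds memcpy. */
size_t drain_trusting(const struct queue *q, char *out) {
    memcpy(out, q->data, q->len);
    return q->len;
}

/* Isolated: treats A's data as untrusted at the boundary and fails cleanly. */
int drain_defensive(const struct queue *q, char *out, size_t *out_len) {
    if (q->len > CAP)       /* invariant violated upstream: contain it here */
        return -1;
    memcpy(out, q->data, q->len);
    *out_len = q->len;
    return 0;
}

int main(void) {
    struct queue q = { .len = 5, .data = "hello" };
    char out[CAP];
    size_t n;

    if (drain_defensive(&q, out, &n) == 0)
        printf("drained %zu bytes\n", n);

    q.len = 1000;           /* simulate corruption of A's invariant */
    if (drain_defensive(&q, out, &n) != 0)
        printf("refused corrupted queue\n");
    /* drain_trusting(&q, out) would now read far past the end of q.data. */
    return 0;
}
```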