Navigating Hyrum's Law: A Case Study on Restartable Sequences and TCMalloc

Last updated: 2026-05-02 16:03:35 Intermediate

Overview

Hyrum's Law, a principle in software engineering, states that with enough users of an interface, every observable behavior of the system will eventually be depended upon by somebody, regardless of what the documented contract promises. This tutorial examines a real-world example from the Linux kernel community: the tension between restartable sequences (rseq) interface updates and Google's TCMalloc library. The kernel's no-regressions rule forces developers to balance API evolution with backward compatibility, even when the documented API is preserved. By understanding this case, you'll learn how to anticipate and manage Hyrum's Law in your own systems.

We'll cover the fundamentals of restartable sequences, how TCMalloc inadvertently violated the API, and the steps taken to resolve the conflict. This guide is designed for kernel developers, systems programmers, and anyone interested in API design and compatibility.

Prerequisites

  • Basic understanding of Linux kernel architecture and system calls
  • Familiarity with C programming and memory allocators (e.g., malloc)
  • Knowledge of thread synchronization concepts (e.g., atomic operations, lock-free programming)
  • Access to a Linux development environment for testing (optional but recommended)

Step-by-Step Instructions

1. Understand Restartable Sequences (rseq)

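Restartable sequences are a Linux kernel feature that allows user-space code to define short critical sections that are aborted and can be restarted if the thread is preempted, migrated to another CPU, or interrupted by signal delivery. They provide a mechanism for efficient, per-CPU data structures without locks.

  • API basics: The rseq system call registers a per-thread struct rseq area with the kernel, together with a signature value that must precede every abort handler. Each critical section is described by a struct rseq_cs descriptor holding its start address (start_ip), the length up to the commit point (post_commit_offset), and the abort handler address (abort_ip).
  • Observable behavior: If a thread is preempted, migrated, or receives a signal while inside a restartable sequence, the kernel redirects it to the sequence's abort handler, which typically retries from the start. The final commit store therefore never runs after an interruption, which makes the sequence atomic with respect to other threads on the same CPU.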
  • Documented contract: The official API specifies that the kernel will not modify thread-local data outside the sequence’s restart mechanism.
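
The abort decision described above can be modeled in a few lines of user-space C. This is a minimal sketch, not kernel code: the struct mirrors the field names of the kernel's struct rseq_cs descriptor, while the demo_ prefixed names and the helper functions are assumptions made for illustration. It shows the range check that decides whether an interrupted thread resumes where it was or is sent to the abort handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the critical-section descriptor.
 * Field names follow the kernel's struct rseq_cs; the demo_ type is hypothetical. */
struct demo_rseq_cs {
    uint32_t version;
    uint32_t flags;
    uint64_t start_ip;           /* first instruction of the sequence */
    uint64_t post_commit_offset; /* length of the sequence up to the commit point */
    uint64_t abort_ip;           /* where to resume if the sequence is interrupted */
};

/* True if ip lies inside [start_ip, start_ip + post_commit_offset).
 * Unsigned wraparound makes addresses below start_ip fail the test. */
static bool demo_ip_in_critical_section(const struct demo_rseq_cs *cs, uint64_t ip)
{
    assert(cs != NULL);
    return ip - cs->start_ip < cs->post_commit_offset;
}

/* Address at which the thread resumes after an interruption:
 * the abort handler if it was inside the sequence, otherwise unchanged. */
static uint64_t demo_resume_ip(const struct demo_rseq_cs *cs, uint64_t ip)
{
    if (cs != NULL && demo_ip_in_critical_section(cs, ip))
        return cs->abort_ip;
    return ip;
}
```

With a descriptor covering 0x1000 to 0x101f, an interruption at 0x1010 resumes at the abort handler, while an interruption at 0x1020 (past the commit point) resumes in place.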

2. Examine TCMalloc's Use of rseq

Google's TCMalloc (Thread-Caching Malloc) uses restartable sequences to implement fast per-CPU memory caches. It relies on the rseq mechanism to update per-CPU cache pointers safely without locks.

  • Implementation detail: TCMalloc's code assumed that the kernel would not alter the thread's rseq structure outside the restartable sequence's execution.
  • Violation: In practice, TCMalloc depended on an undocumented behavior: the kernel clearing a specific flag on every context switch, whether or not a critical section was active. This flag was part of the rseq structure, but its clearing was not part of the guaranteed contract.
  • Why it happened: Programmers often rely on observed behavior rather than strictly documented contracts—a manifestation of Hyrum's Law.

3. Identify the Conflict with Kernel 6.19

In kernel version 6.19, developers optimized the rseq implementation to improve performance for legitimate use cases. This change intentionally maintained the documented API but inadvertently broke TCMalloc because the library relied on the flag-clearing behavior.

  • Performance improvement: The patch reduced overhead by skipping a conditional branch when the rseq critical section was not active.
  • Breakage: TCMalloc’s code, which checked that flag after a restart, began to malfunction, causing memory corruption and crashes.
  • Impact: Applications using TCMalloc (e.g., many Google services) experienced regressions.
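
The breakage can be reproduced in a toy user-space model. The sketch below is purely illustrative: demo_thread, demo_context_switch, and the single flag field are hypothetical stand-ins, not the real kernel structures or TCMalloc code. It assumes a library that sets a flag and later treats "flag cleared" as proof that a context switch happened, which only holds while the kernel clears the flag unconditionally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Two kernel behaviors being compared (hypothetical model). */
enum demo_kernel { DEMO_KERNEL_OLD, DEMO_KERNEL_OPTIMIZED };

/* Minimal per-thread state for the model. */
struct demo_thread {
    bool in_critical_section;
    uint32_t flag; /* stands in for the flag the library depends on */
};

/* Model of the context-switch path.
 * OLD: always clears the flag.
 * OPTIMIZED: exits early when no critical section is active, leaving the flag set. */
static void demo_context_switch(struct demo_thread *t, enum demo_kernel k)
{
    assert(t != NULL);
    if (k == DEMO_KERNEL_OPTIMIZED && !t->in_critical_section)
        return;
    t->flag = 0;
}

/* Model of the library's undocumented assumption: it sets the flag, and
 * concludes that a context switch happened only if the flag was cleared. */
static bool demo_library_detects_switch(enum demo_kernel k)
{
    struct demo_thread t = { .in_critical_section = false, .flag = 1 };
    demo_context_switch(&t, k); /* a switch outside any critical section */
    return t.flag == 0;
}
```

Under the old model the library notices the switch; under the optimized model the same switch goes unnoticed, so the library would keep using state it should have discarded. Both kernels satisfy the documented contract in this model, which is exactly the situation Hyrum's Law describes.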

4. Apply the Kernel's No-Regressions Rule

The Linux kernel community adheres to a strict no-regressions policy: changes must not break existing user-space applications. This rule forced developers to accommodate TCMalloc's behavior, even though it was technically a violation of the API.

  • Option A: Revert the patch – This would restore the old behavior that TCMalloc depends on, but sacrifice the performance improvement.
  • Option B: Update TCMalloc – The library could be fixed so that it no longer depends on the undocumented behavior, but this would take time, would require cooperation with Google, and would not help binaries that are already deployed.
  • Option C: Add a compatibility flag – The kernel could provide a mechanism to opt in to the old behavior for known problematic libraries.

5. Implement a Solution

After discussion, the kernel developers opted for a two-phase approach:

  1. Short-term fix: Revert the performance optimization in a separate commit, restoring the old behavior to avoid regressions.
  2. Long-term fix: Work with Google to update TCMalloc to adhere strictly to the documented API. Add a kernel warning when a library uses the now-deprecated behavior.
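
The warning in the long-term fix is usually emitted only once, so that a hot path does not flood the log. In the kernel this is what helpers such as pr_warn_once() are for; the user-space sketch below imitates the pattern with a C11 atomic flag. The function name and message text are assumptions for illustration, not the wording of any actual kernel patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Set on the first warning; later callers see it already set. */
static atomic_flag demo_warned = ATOMIC_FLAG_INIT;

/* Print a deprecation warning the first time it is called.
 * Returns true if this call printed the warning, false if it was suppressed. */
static bool demo_warn_deprecated_once(const char *who)
{
    assert(who != NULL);
    if (atomic_flag_test_and_set(&demo_warned))
        return false; /* someone already warned */
    fprintf(stderr, "%s relies on deprecated rseq behavior\n", who);
    return true;
}
```

The first call reports the offending library and returns true; every later call returns false without printing, even when several threads race to report.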

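Code example (conceptual; rseq_state, RSEQ_STATE_INACTIVE, and rseq_flags are illustrative names, not actual kernel identifiers):

// Optimized path in 6.19: the early exit leaves the flag untouched
if (current->rseq_state == RSEQ_STATE_INACTIVE)
    return; // no critical section active, flag remains set

// Compatibility patch: clear the flag before the early exit, as older kernels did
if (current->rseq_state == RSEQ_STATE_INACTIVE) {
    current->rseq_flags = 0; // restore the behavior TCMalloc depends on
    return;
}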

Common Mistakes

  • Assuming API compliance: Never assume that all consumers of an interface follow documented contracts. Always test against real-world usage.
  • Ignoring Hyrum's Law: Every observable behavior, even accidental, can become a dependency. Plan for this by minimizing undocumented side effects.
  • Neglecting backward compatibility: When evolving APIs, provide transition periods and deprecation warnings. The kernel's no-regressions rule is a good model.
  • Over-optimizing prematurely: Performance improvements that change observable behavior—even subtly—can lead to regressions. Validate against a wide range of user-space libraries.

Summary

Hyrum's Law predicts that, given enough users, any observable behavior becomes a de facto contract. The restartable sequences and TCMalloc case illustrates how users of even a carefully documented API can be broken by changes that preserve the written contract but alter unwritten expectations. By following the steps outlined here (understanding the API, identifying hidden dependencies, applying compatibility fixes, and communicating with downstream consumers), you can navigate such challenges. Always anticipate that your system's observable behavior, no matter how trivial, may be depended upon.