faultbox

Fault injection for distributed systems.
Intercept syscalls and protocol messages to test how your services behave under failure.


Four layers, one spec

Most chaos and fault tools operate at one layer. Faultbox composes four — so a single .star spec can model the failure modes integration tests can't reach.

Syscall: write=deny("EIO")
Disk failure, ENOSPC, EMFILE, partial writes: the OS-level modes you can't induce from above.

Protocol (request): drop every 3rd /Get
Retry policies, circuit breakers, idempotency. Tests the resilience code most teams write but never exercise.

Protocol (response): HTTP 200 → 503
Status-code handling, parser robustness, fallback behavior on degraded responses.

Mock service: delay 800 ms in mock OAuth
Token-refresh, deadline propagation, without spinning up real auth infra.
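To make the "drop every 3rd request" row concrete, here is a minimal plain-Python model of that fault and of the retry code it is meant to exercise. It is not Faultbox's API: DropEveryNth and with_retry are hypothetical names invented for this sketch, and the injected failure is modeled as a ConnectionError.

```python
class DropEveryNth:
    """Wrap a callable and fail every n-th call (hypothetical fault model)."""

    def __init__(self, handler, n=3):
        self.handler = handler
        self.n = n
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls % self.n == 0:
            # Simulate the request being dropped before it reaches the handler.
            raise ConnectionError(f"injected drop on call {self.calls}")
        return self.handler(*args, **kwargs)


def with_retry(call, attempts=2):
    """Invoke `call` up to `attempts` times, retrying only on ConnectionError."""
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except ConnectionError as err:
            last_error = err
    raise last_error


get = DropEveryNth(lambda: "ok", n=3)
results = [with_retry(get) for _ in range(4)]
```

With two attempts per request the caller never observes the drop; with attempts=1 the third request fails, which is the gap such a fault is meant to expose.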

Where Faultbox fits — vs integration tests, load tests, prod chaos →

faultbox.star

api = service("api", binary="./api", http="localhost:8080")
db  = service("db",  binary="./db",  tcp="localhost:5432")

def test_write_failure(t):
    fault(db, write=deny("EIO"))
    resp = api.http.post("/orders", json={"item": "widget"})
    assert_eq(resp.status, 503, "API should return 503 when DB fails")

$ faultbox test faultbox.star

PASS  test_write_failure  (0.42s)
  ✓ fault(db, write=deny("EIO"))
  ✓ POST /orders → 503
  ✓ assert_eq(resp.status, 503)

Install

curl -fsSL https://faultbox.io/install.sh | sh

The script detects your platform, downloads the latest release, and verifies its checksum. Or build from source.

Why Faultbox

Syscall-level injection

Deny, delay, or hold any syscall via seccomp-notify. No eBPF, no ptrace, no code changes. Faultbox automatically expands syscall families — write covers write, writev, pwrite64.
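Family expansion can be pictured as a lookup from one rule name to several concrete syscalls. The sketch below is an illustrative Python model, not Faultbox's implementation: only the write entry is taken from the text above, and SYSCALL_FAMILIES, expand_family, and expand_rules are hypothetical names.

```python
# Hypothetical family table; only the "write" entry comes from the text above.
SYSCALL_FAMILIES = {
    "write": ("write", "writev", "pwrite64"),
}


def expand_family(name):
    """Return every syscall covered by a rule named `name`.

    Names without a known family expand to themselves.
    """
    return SYSCALL_FAMILIES.get(name, (name,))


def expand_rules(rules):
    """Expand {family: action} into {syscall: action}."""
    expanded = {}
    for family, action in rules.items():
        for syscall in expand_family(family):
            expanded[syscall] = action
    return expanded


rules = expand_rules({"write": "EIO"})
```

The point of expanding up front is that a service calling writev instead of write still hits the same fault, so a spec does not have to know which variant the runtime or libc happens to use.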

Protocol-level injection

Inject faults at the protocol level for HTTP, HTTP/2, gRPC, Postgres, MySQL, Redis, Kafka, NATS, MongoDB, Cassandra, ClickHouse, AMQP, Memcached, TCP, and UDP. Target specific queries, paths, topics, or CQL statements via a transparent proxy.

Deterministic exploration

hold() and release() control syscall ordering across services. --explore mode walks all interleavings automatically. Seed-based replay for reproducible failures.
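What "walking all interleavings" means can be sketched in a few lines of Python. This is a conceptual model only, under the assumption that each service contributes an ordered list of held syscalls; interleavings and pick_schedule are hypothetical names, and Faultbox's actual exploration strategy may differ.

```python
import random


def interleavings(sequences):
    """Yield every merge of `sequences` that keeps each sequence's own order."""
    # Drop exhausted sequences so the recursion terminates.
    pending = [seq for seq in sequences if seq]
    if not pending:
        yield []
        return
    for i, seq in enumerate(pending):
        head, rest = seq[0], seq[1:]
        remaining = pending[:i] + [rest] + pending[i + 1:]
        for tail in interleavings(remaining):
            yield [head] + tail


def pick_schedule(sequences, seed):
    """Pick one interleaving reproducibly: the same seed gives the same schedule."""
    schedules = list(interleavings(sequences))
    return random.Random(seed).choice(schedules)


api_calls = ["api:write", "api:fsync"]
db_calls = ["db:write"]
all_schedules = list(interleavings([api_calls, db_calls]))
```

Here the two api syscalls and one db syscall yield three schedules. The count grows combinatorially with more held syscalls, which is why a seed that pins one schedule is useful for replaying a failure.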

Starlark specs

Topology, faults, and assertions in one .star file. No YAML. No separate config language. The spec is executable code.

Two modes

Run local binaries with binary= or real infrastructure (Postgres, Redis, Kafka) in Docker containers with image=.

Event log & traces

Every intercepted syscall is recorded with a vector-clock timestamp. Temporal assertions: assert_eventually(), assert_never(), assert_within(). ShiViz visualization support.
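For readers new to vector clocks, the following Python sketch shows the standard rules that let an event log order events across services. It illustrates the general technique, not Faultbox's log format; tick, merge, happened_before, and concurrent are names chosen for this example.

```python
def tick(clock, service):
    """Return a copy of `clock` with `service`'s counter advanced by one."""
    updated = dict(clock)
    updated[service] = updated.get(service, 0) + 1
    return updated


def merge(local, received):
    """Element-wise maximum of two clocks (dicts of service -> counter)."""
    keys = local.keys() | received.keys()
    return {k: max(local.get(k, 0), received.get(k, 0)) for k in keys}


def happened_before(a, b):
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    keys = a.keys() | b.keys()
    no_greater = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    some_less = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return no_greater and some_less


def concurrent(a, b):
    """True if the two events differ and neither causally precedes the other."""
    keys = a.keys() | b.keys()
    differ = any(a.get(k, 0) != b.get(k, 0) for k in keys)
    return differ and not happened_before(a, b) and not happened_before(b, a)


api_send = tick({}, "api")                   # api issues a request
db_recv = tick(merge({}, api_send), "db")    # db receives it, then records its own event
api_later = tick(api_send, "api")            # api does unrelated work afterwards
```

In this example api_send precedes db_recv, while api_later and db_recv are concurrent: neither could have influenced the other, so any ordering between them in a trace is incidental.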

Recipe library

load("@faultbox/recipes/mongodb.star", "mongodb") — curated failure wrappers ship embedded in the binary. Examples: mongodb.disk_full() · cassandra.unavailable() · http2.rate_limited(). Canonical error text, zero name collisions. Browse with faultbox recipes list.

How it works

1. Write a spec: define topology, faults, and assertions in a single .star file.
2. Start services: the runtime launches binaries or containers and installs seccomp filters.
3. Intercept syscalls: the kernel pauses processes on target syscalls and asks Faultbox what to do.
4. Inject & assert: deny, delay, or hold syscalls, then verify your service handles it.

Powered by seccomp-notify: no ptrace, no eBPF, no code instrumentation. Faults are applied at the syscall boundary, so the target process sees only an ordinary syscall error or delay.
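The decision step in this flow (the kernel pauses a syscall and asks what to do) can be modeled as a pure function from a syscall name to a verdict. The Python below is a conceptual sketch only: Rule, decide, and the verdict tuples are hypothetical, and the real exchange happens over a seccomp notification file descriptor rather than in Python.

```python
import errno
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical fault rule: deny with an errno, or delay by some milliseconds."""
    action: str           # "deny" or "delay"
    errno_name: str = ""  # e.g. "EIO", used when action == "deny"
    delay_ms: int = 0     # used when action == "delay"


def decide(syscall, rules):
    """Return the verdict for one paused syscall as a (kind, value) tuple."""
    rule = rules.get(syscall)
    if rule is None:
        # No fault configured: let the syscall run normally.
        return ("continue", 0)
    if rule.action == "deny":
        # Syscalls report failure to the caller as a negative errno.
        return ("error", -getattr(errno, rule.errno_name))
    if rule.action == "delay":
        return ("continue_after_ms", rule.delay_ms)
    raise ValueError(f"unknown action: {rule.action}")


rules = {
    "write": Rule(action="deny", errno_name="EIO"),
    "connect": Rule(action="delay", delay_ms=800),
}
```

Keeping the verdict a pure function of the rule set is one way such a design could stay replayable: the same rules and the same syscall order would produce the same faults.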

Supported protocols

HTTP, HTTP/2, gRPC, PostgreSQL, MySQL, Redis, Kafka, NATS, MongoDB, Cassandra, ClickHouse, AMQP, Memcached, TCP, UDP

Built for LLM agents

LLM agents write code. But who tests what happens when the database crashes, the network drops, or the disk fills up? Faultbox closes the loop.

Agent writes code

Your LLM agent builds a microservice. It writes handlers, connects to Postgres, adds Redis caching.

Faultbox generates tests

One command from docker-compose. Every dependency gets fault scenarios — disk failures, network drops, slow queries.

faultbox init --from-compose

Structured feedback

JSON output with diagnostics: "write fault fired 3 times but service returned 200 — missing error handling in the persist path."

Agent fixes the bug

The agent reads the diagnostic, finds the code, adds error handling. Runs tests again. All pass. Commits with confidence.

MCP native: Built-in MCP server with 6 tools. Claude, Cursor, and any MCP client connect directly.

One command setup: faultbox init --claude creates slash commands and MCP config. Zero configuration.

Actionable diagnostics: Not just "test failed", but structured hints that tell the agent exactly what to fix and where.

Every LLM agent writing microservices needs to answer one question:
"What happens when things break?"

Faultbox is that answer.
