faultbox
Fault injection for distributed systems.
Intercept syscalls and protocol messages to test how your services behave
under failure.
Four layers, one spec
Most chaos and fault tools operate at one layer. Faultbox composes four,
so a single .star spec can model the failure modes
that integration tests can't reach.
Syscall: write=deny("EIO"). Disk failure, ENOSPC, EMFILE, partial writes: the OS-level failure modes you can't induce from above.
Protocol, request: drop every 3rd /Get. Exercises retry policies, circuit breakers, and idempotency: the resilience code most teams write but never test.
Protocol, response: HTTP 200 → 503. Tests status-code handling, parser robustness, and fallback behavior on degraded responses.
Mock service: delay 800 ms in a mock OAuth service. Tests token refresh and deadline propagation without spinning up real auth infrastructure.
Where Faultbox fits: compared with integration tests, load tests, and production chaos.
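One way to picture the "drop every 3rd /Get" rule is a per-path counter kept by the proxy. The Python sketch below illustrates only that decision logic; it is not Faultbox's implementation, and the class EveryNthDrop and its parameters are hypothetical.

```python
class EveryNthDrop:
    """Hypothetical rule (not Faultbox's API): drop every n-th request to `path`."""

    def __init__(self, path: str, n: int) -> None:
        if n < 1:
            raise ValueError("n must be >= 1")
        self.path = path
        self.n = n
        self.seen = 0  # matching requests observed so far

    def should_drop(self, path: str) -> bool:
        # Non-matching requests pass through and do not advance the counter.
        if path != self.path:
            return False
        self.seen += 1
        return self.seen % self.n == 0


rule = EveryNthDrop("/Get", 3)
decisions = [rule.should_drop(p) for p in ["/Get", "/Put", "/Get", "/Get", "/Get"]]
print(decisions)  # [False, False, False, True, False]
```

Because the counter is the only state in this sketch, the same request sequence always drops the same requests, which keeps a retry bug reproducible from run to run.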
# faultbox.star
api = service("api", binary="./api", http="localhost:8080")
db = service("db", binary="./db", tcp="localhost:5432")

def test_write_failure(t):
    fault(db, write=deny("EIO"))
    resp = api.http.post("/orders", json={"item": "widget"})
    assert_eq(resp.status, 503, "API should return 503 when DB fails")

$ faultbox test faultbox.star
PASS test_write_failure (0.42s)
  ✓ fault(db, write=deny("EIO"))
  ✓ POST /orders → 503
  ✓ assert_eq(resp.status, 503)

Install
curl -fsSL https://faultbox.io/install.sh | sh
Detects your platform, downloads the latest release, and verifies its checksum. Or build from source.
Why Faultbox
Syscall-level injection
Deny, delay, or hold any syscall via seccomp-notify. No eBPF, no
ptrace, no code changes. Faultbox automatically expands syscall
families: write covers write, writev, and pwrite64.
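Family expansion amounts to a lookup from a family name to the concrete syscalls it covers. In the Python sketch below only the write entry comes from the text above; the table SYSCALL_FAMILIES and the function expand are hypothetical, and names without an entry pass through unchanged.

```python
# Hypothetical family table: only the "write" entry is taken from the docs above.
SYSCALL_FAMILIES: dict[str, tuple[str, ...]] = {
    "write": ("write", "writev", "pwrite64"),
}


def expand(names: list[str]) -> list[str]:
    """Expand family names into concrete syscalls, keeping order and dropping duplicates."""
    out: list[str] = []
    for name in names:
        # Unknown names are treated as a single concrete syscall.
        for syscall in SYSCALL_FAMILIES.get(name, (name,)):
            if syscall not in out:
                out.append(syscall)
    return out


print(expand(["write"]))            # ['write', 'writev', 'pwrite64']
print(expand(["openat", "write"]))  # ['openat', 'write', 'writev', 'pwrite64']
```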
Protocol-level injection
Inject faults at the protocol level for HTTP, HTTP/2, gRPC, Postgres, MySQL, Redis, Kafka, NATS, MongoDB, Cassandra, ClickHouse, AMQP, Memcached, TCP, and UDP. A transparent proxy lets you target specific queries, paths, topics, or CQL statements.
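For plain HTTP/1.1, a response fault such as HTTP 200 → 503 comes down to rewriting the status line as the response passes through the proxy. This Python sketch assumes raw HTTP/1.x bytes and a single target path; the helpers rewrite_status and maybe_inject are hypothetical, and the sketch says nothing about how Faultbox handles HTTP/2, gRPC, or the database protocols.

```python
def rewrite_status(response: bytes, new_code: int, reason: str) -> bytes:
    """Replace the status line of a raw HTTP/1.x response; headers and body are kept."""
    head, sep, rest = response.partition(b"\r\n")
    if not sep:
        raise ValueError("no status line terminator found")
    version, _, _ = head.partition(b" ")
    if not version.startswith(b"HTTP/1."):
        raise ValueError("not an HTTP/1.x status line")
    status_line = version + b" " + str(new_code).encode() + b" " + reason.encode("ascii")
    return status_line + b"\r\n" + rest


def maybe_inject(path: str, response: bytes, target_path: str) -> bytes:
    # Only responses to the targeted path are degraded; everything else passes through.
    if path == target_path:
        return rewrite_status(response, 503, "Service Unavailable")
    return response


ok = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi"
print(maybe_inject("/orders", ok, "/orders"))
# b'HTTP/1.1 503 Service Unavailable\r\nContent-Length: 2\r\n\r\nhi'
```

Headers and body are left untouched here; a fuller fault would usually also replace the body and adjust Content-Length to match.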
Deterministic exploration
hold() and release() control syscall ordering
across services. --explore mode walks all interleavings
automatically, and seed-based replay makes failures reproducible.
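A toy model of what exploring interleavings means: if two services each perform an ordered list of syscalls, every order-preserving merge of those lists is one possible schedule. The Python sketch below enumerates them and uses a seeded random.Random to pick one reproducibly. The function names and event labels are hypothetical; this shows the general idea, not Faultbox's exploration algorithm.

```python
import random
from typing import Iterator


def interleavings(a: list[str], b: list[str]) -> Iterator[list[str]]:
    """Yield every merge of a and b that preserves each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def seeded_schedule(a: list[str], b: list[str], seed: int) -> list[str]:
    """Pick one interleaving; the same seed always yields the same schedule."""
    rng = random.Random(seed)
    return rng.choice(list(interleavings(a, b)))


api_calls = ["api:write", "api:fsync"]  # hypothetical event labels
db_calls = ["db:read"]
for order in interleavings(api_calls, db_calls):
    print(order)
# ['api:write', 'api:fsync', 'db:read']
# ['api:write', 'db:read', 'api:fsync']
# ['db:read', 'api:write', 'api:fsync']
```

With lists of length m and n there are (m+n)!/(m!·n!) interleavings, so exhaustive exploration grows quickly; a fixed seed pins down one schedule so a failing run can be replayed.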
Starlark specs
Topology, faults, and assertions in one .star file. No
YAML. No separate config language. The spec is executable code.
Two modes
Run local binaries with binary= or real infrastructure
(Postgres, Redis, Kafka) in Docker containers with image=.
Event log & traces
Every intercepted syscall is recorded with a vector clock. Temporal
assertions: assert_eventually(),
assert_never(), assert_within(). Supports ShiViz
visualization.
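Vector clocks are what let a log spanning several services say whether one event causally preceded another. The Python sketch below implements the standard vector-clock rules: tick on a local event, element-wise max on receive, and happens-before as component-wise ≤ with at least one strict <. The node names are hypothetical, and this is not Faultbox's event-log format.

```python
from typing import Dict

VectorClock = Dict[str, int]


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Local event on `node`: increment its own entry."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(local: VectorClock, received: VectorClock, node: str) -> VectorClock:
    """Receive: element-wise max of both clocks, then tick the receiver."""
    merged = dict(local)
    for k, v in received.items():
        merged[k] = max(merged.get(k, 0), v)
    return tick(merged, node)


def happens_before(a: VectorClock, b: VectorClock) -> bool:
    """True iff a <= b in every entry and a < b in at least one (missing = 0)."""
    keys = set(a) | set(b)
    le = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    lt = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return le and lt


a1 = tick({}, "api")      # api sends a request: {'api': 1}
d1 = merge({}, a1, "db")  # db receives it:      {'api': 1, 'db': 1}
a2 = tick(a1, "api")      # unrelated api work:  {'api': 2}
print(happens_before(a1, d1))                            # True
print(happens_before(a2, d1) or happens_before(d1, a2))  # False (concurrent)
```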
Recipe library
load("@faultbox/recipes/mongodb.star", "mongodb") —
curated failure wrappers ship embedded in the binary. Examples:
mongodb.disk_full() ·
cassandra.unavailable() ·
http2.rate_limited(). Canonical error text, zero
name collisions. Browse with faultbox recipes list.
How it works
Powered by seccomp-notify: no ptrace, no eBPF, no code instrumentation. Syscall faults are injected in the kernel and are invisible to the target process.
Built for LLM agents
LLM agents write code. But who tests what happens when the database crashes, the network drops, or the disk fills up? Faultbox closes the loop.
Your LLM agent builds a microservice. It writes handlers, connects to Postgres, adds Redis caching.
One command from docker-compose: faultbox init --from-compose. Every dependency gets fault scenarios: disk failures, network drops, slow queries.
JSON output with diagnostics: "write fault fired 3 times but service returned 200 — missing error handling in the persist path."
The agent reads the diagnostic, finds the code, adds error handling. Runs tests again. All pass. Commits with confidence.
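A diagnostic like the quoted "write fault fired 3 times but service returned 200" can be derived by comparing how often a fault fired with what the service reported. The Python sketch below assumes a hypothetical JSON report shape ({"status": ..., "faults": [{"syscall": ..., "fired": ...}]}); Faultbox's real output schema is not described here, so treat the field names and the diagnose function as placeholders.

```python
import json


def diagnose(report_json: str) -> list[str]:
    """Flag faults that fired while the service still reported success.

    The report shape is hypothetical, not Faultbox's actual JSON output.
    """
    report = json.loads(report_json)
    findings = []
    for fault in report["faults"]:
        # A fault that fired but left a 2xx status suggests a swallowed error.
        if fault["fired"] > 0 and 200 <= report["status"] < 300:
            findings.append(
                f'{fault["syscall"]} fault fired {fault["fired"]} times '
                f'but service returned {report["status"]}'
            )
    return findings


report = '{"status": 200, "faults": [{"syscall": "write", "fired": 3}]}'
print(diagnose(report))  # ['write fault fired 3 times but service returned 200']
```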
faultbox init --claude creates slash commands and MCP config. Zero configuration.
Every LLM agent writing microservices needs to answer one question:
"What happens when things break?"
Faultbox is that answer.
LLM Integration Guide