Syscall-Level Fault Injection

Intercept any syscall via Linux's seccomp-notify and decide — allow, deny, or delay. No eBPF, no ptrace, no code changes. Works on any binary: Go, Rust, Java, Python, C.

Syscall families

write automatically covers write, writev, pwrite64. Think in operations, not syscall numbers.
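
One way to picture a syscall family is a lookup table from an operation name to the concrete syscalls it covers. The sketch below is illustrative Python, not Faultbox's implementation. Only the write row comes from the text above; the names SYSCALL_FAMILIES and expand_family are hypothetical.

```python
# Illustrative only: a family maps one operation name to several syscalls.
# The "write" row mirrors the text above; the table and helper are hypothetical.
SYSCALL_FAMILIES = {
    "write": ("write", "writev", "pwrite64"),
}


def expand_family(operation):
    """Return the concrete syscalls covered by an operation name.

    Names without a family entry are treated as a single syscall.
    """
    return SYSCALL_FAMILIES.get(operation, (operation,))
```

With this table, a rule written against "write" would also apply to writev and pwrite64 without the spec author listing them.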

Path targeting

Fault only writes to /data/*.wal — stdout, TCP, and other writes are unaffected. Powered by fd→path resolution via /proc.
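
The fd→path step can be sketched in a few lines of Python. This is a minimal illustration assuming a Linux /proc filesystem and shell-style globbing, not Faultbox's actual matcher; resolve_fd, path_matches, and should_fault are hypothetical names.

```python
import fnmatch
import os


def resolve_fd(pid, fd):
    """Resolve a file descriptor to a path via the /proc symlink (Linux only)."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")


def path_matches(path, pattern):
    """Shell-style glob match. Note that fnmatch lets '*' match '/' as well."""
    return fnmatch.fnmatchcase(path, pattern)


def should_fault(pid, fd, pattern, resolve=resolve_fd):
    """Fault only when the fd resolves to a path matching the pattern."""
    try:
        path = resolve(pid, fd)
    except OSError:
        return False  # fd vanished or /proc unavailable: leave the call alone
    return path_matches(path, pattern)
```

On Linux a socket fd resolves to a string such as socket:[12345], which does not match /data/*.wal, so under this sketch TCP writes pass through untouched.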

Probabilistic & triggered

deny("EIO", probability="30%") or deny("EIO", trigger="after=5") — intermittent failures and trigger-on-Nth-call.
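
The per-call decision behind these two options can be modeled as below. This is a hedged Python sketch: the DenyRule class and parse_probability helper are hypothetical, and it assumes after=N lets the first N matching calls through and denies from call N+1 onward, which may not match the real trigger semantics.

```python
import random


def parse_probability(text):
    """Convert a percentage string such as "30%" into a fraction (0.3)."""
    return float(text.rstrip("%")) / 100.0


class DenyRule:
    """Decides per call whether to inject the error.

    Assumption: after=N passes the first N matching calls and denies
    from call N+1 onward. The real trigger semantics may differ.
    """

    def __init__(self, errno_name, probability=1.0, after=0, seed=0):
        self.errno_name = errno_name
        self.probability = probability
        self.after = after
        self.calls = 0
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def should_deny(self):
        self.calls += 1
        if self.calls <= self.after:
            return False
        return self.rng.random() < self.probability
```

Seeding the random source is what lets an intermittent failure be reproduced: two rules built with the same seed make the same sequence of decisions.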

Protocol-Level Fault Injection

Inject faults at the protocol level via transparent proxy. Target specific HTTP paths, SQL queries, Redis commands, or Kafka topics — without touching the network stack.

HTTP, HTTP/2, gRPC, PostgreSQL, MySQL, Redis, Kafka, NATS, MongoDB, Cassandra, ClickHouse, AMQP, Memcached, TCP, UDP

Unified API

fault(service) = syscall level. fault(service.interface) = protocol level. Same builtin, different dispatch.

Response rewriting

Return HTTP 503 for POST /orders, inject Postgres query errors, drop Kafka messages on specific topics.

Deterministic Exploration

Control syscall ordering across services with hold() and release(). Explore all possible interleavings automatically with --explore. Replay any failure with its seed.

Parallel execution

parallel(fn1, fn2) runs operations concurrently. Faultbox controls which syscall proceeds first.

Seed replay

Every test run has a seed. Failed? Replay with --seed 42 for identical interleaving — deterministic debugging.
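
The idea behind seed replay is that the seed is the only source of randomness in the scheduling decision. A minimal Python sketch, assuming the scheduler picks a release order for a set of held operations (release_order is a hypothetical name, and this is not Faultbox's scheduler):

```python
import random


def release_order(held, seed):
    """Pick an order in which to release held operations, driven only by the seed."""
    rng = random.Random(seed)
    order = list(held)  # copy so the caller's list is not mutated
    rng.shuffle(order)
    return order
```

Calling release_order twice with the same operations and the same seed returns the same order, which is the property a replay depends on.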

Exhaustive mode

--explore=all tries every permutation (K! orderings for K concurrent operations). --explore=sample randomly samples orderings, trading completeness for speed.
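
The difference between the two modes comes down to enumeration versus sampling. The following Python sketch shows both under simple assumptions; all_orderings and sample_orderings are hypothetical helpers, not Faultbox internals.

```python
import itertools
import random


def all_orderings(ops):
    """Exhaustive mode: every permutation of the K operations (K! in total)."""
    return list(itertools.permutations(ops))


def sample_orderings(ops, n, seed=0):
    """Sampling mode: n random orderings (repeats possible), seeded for replay."""
    rng = random.Random(seed)
    return [tuple(rng.sample(ops, len(ops))) for _ in range(n)]
```

The factorial grows quickly: 4 operations give 24 orderings, while 8 give 40,320, which is why sampling becomes attractive as the number of concurrent operations rises.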

Starlark Specs

Topology, faults, and assertions in one .star file. Starlark is a Python dialect — if you know Python, you know Starlark. No YAML, no separate config language. The spec is executable code.

Service declarations

service(), interface(), depends_on, healthcheck — declare topology as code.

Assertions

assert_eq(), assert_eventually(), assert_never(), assert_before() — value checks and temporal properties on the syscall trace.

Scenarios & generation

Register happy paths with scenario(), then faultbox generate creates failure tests automatically. As of v0.13.0, faultbox generate is deprecated in favor of faultbox plan --suggest.

Event Log & Traces

Every intercepted syscall is recorded with vector clocks, service attribution, and file paths. Assert on internal behavior, not just inputs and outputs.
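
Vector clocks are what allow ordering questions to be answered across services. The standard comparison is sketched below in Python; the dict-of-counters representation and the function names are assumptions for illustration, not the format Faultbox records.

```python
def happened_before(a, b):
    """True if vector clock a causally precedes vector clock b.

    a precedes b when every component of a is <= the matching component
    of b and at least one is strictly smaller. Missing entries count as 0.
    """
    services = set(a) | set(b)
    no_greater = all(a.get(s, 0) <= b.get(s, 0) for s in services)
    some_smaller = any(a.get(s, 0) < b.get(s, 0) for s in services)
    return no_greater and some_smaller


def concurrent(a, b):
    """For two distinct events: neither causally precedes the other."""
    return not happened_before(a, b) and not happened_before(b, a)
```

For example, happened_before({"db": 2, "api": 1}, {"db": 2, "api": 3}) is True, while clocks {"db": 1} and {"api": 1} are concurrent because neither dominates the other.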

Temporal assertions

"The WAL write happened before the response" — assert_before() proves ordering guarantees.

ShiViz visualization

--shiviz trace.shiviz produces a space-time diagram with causal arrows between services.

Normalized traces

Capture before/after a refactor. faultbox diff shows exactly what behavioral changes you introduced.

Binary & Container Modes

Run local binaries for fast development, or real infrastructure in Docker containers for integration testing. Same spec, same assertions, same faults.

Binary mode

binary="./my-service" — fork+exec with seccomp filter. Fastest iteration, no Docker needed.

Container mode

image="postgres:16" — Docker containers with faultbox-shim entrypoint. Test against real Postgres, Redis, Kafka.

Monitors & Network Partitions

Define safety invariants as monitors that run on every syscall event. Simulate network partitions between specific services.

Monitors

Callbacks that fire on every matching event — fail immediately if an invariant is violated.
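
A monitor of this kind can be modeled as a filter plus a check applied to an event stream. The Python sketch below is illustrative only: the event dict shape, run_monitor, and InvariantViolation are hypothetical and do not reflect Faultbox's monitor API.

```python
class InvariantViolation(Exception):
    """Raised as soon as a monitor's check fails."""


def run_monitor(events, matches, check):
    """Feed events to a monitor and stop at the first violated invariant.

    events:  iterable of event dicts (shape is hypothetical)
    matches: predicate selecting the events the monitor cares about
    check:   predicate that must hold for every selected event
    Returns the number of matching events seen if no violation occurs.
    """
    seen = 0
    for event in events:
        if not matches(event):
            continue
        seen += 1
        if not check(event):
            raise InvariantViolation(f"invariant violated by {event!r}")
    return seen
```

Raising on the first failing event, rather than collecting failures for later, mirrors the fail-immediately behavior described above.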

Partitions

partition(orders, inventory, run=scenario) — bidirectional network split, other connectivity intact.
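
The "bidirectional, everything else intact" semantics can be captured with a set of unordered pairs. This Python sketch is a model of the behavior described above, not Faultbox's networking code; the Partitions class and its methods are hypothetical.

```python
class Partitions:
    """Tracks bidirectional network splits between pairs of services."""

    def __init__(self):
        self._cut = set()

    def partition(self, a, b):
        # frozenset makes the pair unordered, so the split applies both ways
        self._cut.add(frozenset((a, b)))

    def heal(self, a, b):
        self._cut.discard(frozenset((a, b)))

    def can_reach(self, src, dst):
        return frozenset((src, dst)) not in self._cut
```

After partition("orders", "inventory"), neither service can reach the other, while a third service such as a hypothetical "payments" remains reachable from both.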

Named Operations

Group related syscalls into logical operations. Fault "persist" instead of "write + fsync". Path filters target specific files.

Semantic faults

ops={"persist": op(syscalls=["write","fsync"], path="*.wal")} — fault the WAL persist operation, not individual syscalls.

Recipe Library

Curated, protocol-specific failure wrappers ship embedded in the faultbox binary. Load them via the @faultbox/ prefix — no filesystem setup, no network fetch. Each recipe encodes a canonical error message, status code, or incident pattern drawn from real postmortems.

Embedded stdlib

load("@faultbox/recipes/mongodb.star", "mongodb") — works from any project, no recipes/ directory needed. Ships with every binary.

Namespace structs

mongodb.disk_full() and postgres.disk_full() coexist. One import per protocol, zero name collisions.

Canonical error text

Say cassandra.unavailable() instead of remembering "Cannot achieve consistency level QUORUM". Recipes stay in sync with real driver behavior.

CLI discovery

faultbox recipes list and recipes show <name> — browse the catalog without reading source.

LLM-First Design

New in v0.2.0

Faultbox is designed for both human engineers and LLM agents. Structured JSON output, MCP server, Claude Code integration — everything an agent needs for an autonomous code → test → fix loop.

MCP server

faultbox mcp — 6 tools for Claude, Cursor, and any MCP client. Run tests, generate specs, analyze failures natively.

Structured output

--format json — machine-parseable results with fault info, syscall summary, and actionable diagnostics.

Claude Code commands

faultbox init --claude — slash commands (/fault-test, /fault-generate, /fault-diagnose) and auto-MCP config.

From docker-compose

faultbox init --from-compose — zero-effort spec generation. Detects protocols, wires dependencies, generates happy-path tests.

Diagnostics

Not just "test failed" — structured hints like "write fault fired but service returned 200 — missing error handling in persist path."

Docker & CI

ghcr.io/faultbox/faultbox image + GitHub Action for automated fault testing on every PR.

Recent Releases

v0.13.0 (current)

Five RFCs ship together.

RFC-040 (Determinism Levels): the new determinism() top-level builtin declares L0/L1 + strict mode; the runtime emits unmediated_io events when the SUT performs I/O Faultbox can observe but isn't mediating (clock_gettime, getrandom, DNS to a non-Faultbox resolver, connect() to an undeclared address). Strict mode (the default) fails the test on the first untolerated leak.

RFC-041 (Temporal Properties): eventually(p), always(p, between=), await_event(matcher), await_stable(quiescence_window=), and a rewritten state-machine monitor(name, on=, state_init=, update=, check=) plus a declarative test(name, body=, expect=, timeout=, terminate_when=) builtin. The test lifecycle gains a three-valued verdict (PASS / FAIL / INCONCLUSIVE) with CLI exit code 3 reserved for inconclusive-only runs.

RFC-042 (Exploration Plan): faultbox plan subcommand, plan.json in every bundle, the report's Plan tab, coverage analysis (--coverage), rule-based --suggest, --check-cost --max-instances N CI gate, and the body-re-execution engine that turns named choose("name", [opts]) axes (RFC-043 §5.2), syscall-level probability fan-out (max_fires=N, mode="exhaustive"), and parallel(..., interleavings=) orderings into multi-leaf test executions. Each leaf carries a stable LeafID through TestResult, the bundle manifest, and the HTML report's tests table.

RFC-043 (Non-deterministic Operators): four small Starlark primitives: choose, nondet, halt(reason) with a new halted outcome, and assume(predicate) / test(assume=) with per-leaf evaluation, AST denylist sandbox, and predicate Starlark errors mapping to Result="error".

RFC-044 (Spec Language Simplification): withdraws RFC-013 (param(), superseded by choose()) and RFC-002 (domain(), service()/interface() proved sufficient); unifies the three fan-out axis kinds under one NonDeterministicChoice interface; collapses event sources under observe.stdout/observe.stderr and decoders under decoder("name", ...); deprecates faultbox generate in favor of faultbox plan --suggest.

Two new tutorial chapters (Part 4: Safety & Verification) cover the operator and fan-out vocabulary end-to-end. Full repo go test -race ./... green; Lima sweep 21/21 PASS across the same 6 integration spec suites as v0.12.29.
v0.12.29

Remote services (RFC-036). New service(remote=...) kwarg points Faultbox at an externally-running endpoint (typically a real pod in a customer's k8s dev cluster) without launching a process. The proxy datapath from RFC-024 dials the remote upstream and the SUT reaches it through the proxy unchanged, so every protocol-level fault (response(), error(), slow(), gRPC method targeting, SQL matchers) keeps working.

Process-level kwargs (seed=, reset=, reuse=, volumes=, ports=, args=, seccomp=, observe=, ops=) and syscall-level faults are rejected at spec load with explicit error messages naming the offending kwarg and pointing at protocol faults or mock_service().

Composes with RFC-038 tls=tls_cert(...) for TLS-required upstreams (the auto-generated proxy cert covers 127.0.0.1 so SUT-side verification works against the env-rewritten loopback addr). New typed remotes(...) value for services whose interfaces live on different hosts. New @faultbox/discovery/k8s.star stdlib helper exposing k8s.service, k8s.endpoint, and k8s.local: pure string sugar over <name>.<namespace>.svc.cluster.local, no runtime k8s client.

The .fb bundle's env.json records every (service, interface, host, protocol) tuple from a remote-using run; faultbox replay warns when replaying such a bundle and points at RFC-037 (the open companion design RFC for the offline-replay determinism story). Cluster connectivity is the user's responsibility (Telepresence connect, kubectl port-forward, in-cluster execution, or VPN) and is documented in the new Connectivity guide.

49 new tests across spec-load validation (32), runtime/proxy lifecycle (10 incl. TLS×remote interop), bundle round-trip (2), replay warning (2), and string-grep doc gates (3). Full repo go test ./... green; go vet ./... clean; Lima make demo-container 4/4 PASS (proxy datapath refactor confirmed non-regressive against the seccomp + Postgres + Redis path).
v0.12.28

TLS-aware proxy (RFC-038) + proxy traffic observability (RFC-034) + container fault paths (RFC-035). Twelve patch versions consolidated into one release.

Headline: declare interface(..., tls=tls_cert(...)) on a service interface and the proxy terminates TLS at its listener and re-establishes TLS dialing the upstream; all protocol-aware fault rules (http.error(path=...), grpc.error(method=...), kafka.drop(topic=...), redis.error(key=...)) keep firing on the plaintext between the two TLS legs. Six plugins ship migrated this release (http, http2, gRPC, Kafka, Redis, TCP); the remaining 8 (postgres, mysql, mongodb, cassandra, clickhouse, memcached, nats, amqp) tracked in RFC-039; declarations against unmigrated plugins emit a proxy_tls_pending event.

The tls_cert(...) builtin is kwargs-only with full spec-load validation (cert/key pairing, file existence, CA PEM parse, insecure=True + ca= exclusion); empty tls_cert() auto-generates a self-signed proxy cert in memory for dev/test. Two TLS plumbing patterns: listener wrap-and-dial via proxy.ListenTLS(serverCfg) + proxy.Dial(ctx, target, clientCfg) (http, http2, kafka, redis, tcp) and framework credentials via grpc.Creds(credentials.NewTLS(...)) for gRPC.

Also: RFC-034 connection lifecycle observability: proxy_conn_open, proxy_conn_close (with byte counts and reason classification), proxy_handshake_complete, and proxy_stall (1Hz watchdog, 5s warn / 30s extend tiers), wired into 13 of 15 plugins. RFC-035 container-consumer fault paths on Linux Docker. New stderr() event source. Container-mode observe=[stdout(...)] now works for container services. Race fix on tcp.go::handle.

New internal/proxy/{http,grpc,kafka,redis,tcp}_tls_test.go suites (~30 tests). Full repo go test ./... -race green; Lima sweep 21/21 PASS.
v0.12.16

Report UX overhaul. Driven by inDrive Freight triage feedback on the v0.12.15.x customer report; entirely scoped to internal/report (no bundle format or spec-language changes).

Causal links now follow cause, not chronology: findCausalAncestors switched from vector-clock partial order (which on real bundles routinely had only the lifecycle events with complete clocks, so spaghetti pointed at service_ready instead of proxy_fault_applied) to seq-based strict precedence, restricted to faults / violations / errored steps. Hovering an ordinary success step now draws zero lines.

New timeline filter bar above every Event Trace block: three presets plus free-text search across event type / headline / fields. Compact (the default) hides framework lifecycle chatter; Anchors only strips everything except cause-relevant events; All events is the historical default.

proxy_fault_applied / proxy_fault_removed are now first-class fault markers (red, not default-blue) and survive Phase 3 downsampling. Per-test "Faults applied" section pairs proxy_fault_applied with proxy_fault_removed (one row per assumption: service · protocol · interface · assumption · seq window).

Recent block in the Assertion drill-down interleaves fault events between captured step rows by seq, with a fade-and-expand cap (220px max-height + mask-image gradient + Show full assertion toggle). Group-members table on folded markers (paginated 100/page, sticky header, scrollable) replaces the one-line "collapsed run" hint, so runs hiding a 5xx among 99 successes are now legible.

Fullscreen toggle on the test details modal. STDOUT JSON renders as a 2-column key/value table with dot-path flattening for nested objects. Source block falls back to fault_matrix(...) for matrix-generated test names and surfaces three jump links (scenario, fault, matrix call site).

Plus tooltip vertical-text fix, detail panel summary no longer truncates, folded marker click routes through markerEvBySeq so the Group-members table actually appears.
v0.12.15.2

Proxy goroutine context-rooting (Finding K). Customer (Freight) verified v0.12.15.1's redis RESP3 fix landed clean: cold-start path green end-to-end (smoke PASS in 16.3 s). The failure moved to the reuse path: cell 1 of the dbmatrix passes; cells 2–18 all fail identically with error connect to db: invalid connection (go-sql-driver ErrBadConn) or read: connection reset by peer (go-redis).

Root cause: Manager.EnsureProxy rooted each proxy's Accept goroutine at the caller's ctx; preStartProxies runs under RunTest's per-test testCtx which cancels via defer cancel() at end of test. At end of cell 1 that cancellation took down the goroutine while the listener fd stayed bound (only Stop() closes it) and the cached m.proxies[key] entry stayed in place. Cells 2..N saw proxy_active(reused) but nobody was Accept()-ing, so the kernel completed the TCP handshake and then RST-ed.

v0.12.15.2 roots the proxy's pCtx at context.Background(); StopAll/StopService still drive explicit teardown. Why this surfaced now: v0.12.13's reuse fix kept containers AND proxies alive across cells, exposing the latent ctx-rooting bug. Lima sweep 21/21 PASS; new TestManagerEnsureProxy_SurvivesCallerCtxCancel regression test guards the path.