Features — Faultbox

Syscall-Level Fault Injection

Intercept any syscall via Linux's seccomp-notify and decide — allow, deny, or delay. No eBPF, no ptrace, no code changes. Works on any binary: Go, Rust, Java, Python, C.

Syscall families

write automatically covers write, writev, pwrite64. Think in operations, not syscall numbers.
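As a rough illustration of the family idea, the Python sketch below expands an operation name into the concrete syscalls it stands for. Only the write family is taken from the text above; the `SYSCALL_FAMILIES` table and the `expand_family` helper are hypothetical names for illustration, not Faultbox internals.

```python
# Hypothetical family table: only the "write" entry comes from the text above.
SYSCALL_FAMILIES = {
    "write": ["write", "writev", "pwrite64"],
}


def expand_family(operation):
    """Return the concrete syscalls an operation name covers.

    Names without a family entry are treated as a single syscall.
    """
    return list(SYSCALL_FAMILIES.get(operation, [operation]))


print(expand_family("write"))  # ['write', 'writev', 'pwrite64']
print(expand_family("fsync"))  # ['fsync']
```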

Path targeting

Fault only writes to /data/*.wal — stdout, TCP, and other writes are unaffected. Powered by fd→path resolution via /proc.
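A minimal Python sketch of how path targeting could work, assuming the file descriptor is resolved by reading the `/proc/<pid>/fd/<fd>` symlink and the result is matched against a glob. The helper names `resolve_fd` and `should_fault` are hypothetical, and the glob semantics Faultbox actually uses may differ from Python's `fnmatch`.

```python
import fnmatch
import os


def resolve_fd(pid, fd):
    """Resolve a file descriptor to its target via /proc (Linux only).

    Regular files resolve to a path; sockets and pipes resolve to
    pseudo-names such as "socket:[12345]" or "pipe:[678]".
    """
    return os.readlink(f"/proc/{pid}/fd/{fd}")


def should_fault(target, pattern="/data/*.wal"):
    """Return True when the resolved target matches the glob pattern.

    Note: fnmatch's "*" also matches "/", which may not be how
    Faultbox interprets the pattern.
    """
    return fnmatch.fnmatchcase(target, pattern)


print(should_fault("/data/orders.wal"))  # True
print(should_fault("/dev/pts/0"))        # False
print(should_fault("socket:[12345]"))    # False
```

Because stdout and TCP sockets resolve to targets that do not match the pattern, writes to them would pass through untouched in this model.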

Probabilistic & triggered

deny("EIO", probability="30%") or deny("EIO", trigger="after=5") — intermittent failures and trigger-on-Nth-call.
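The decision logic behind these two options can be modeled in a few lines of Python. This is a toy sketch, not Faultbox's implementation: the `DenyRule` class is hypothetical, and it assumes `after=N` means "let the first N matching calls through, then start faulting", which may not match the real trigger semantics.

```python
import random


class DenyRule:
    """Toy model of a deny rule with a probability and an after=N trigger."""

    def __init__(self, errno_name, probability=1.0, after=0, seed=0):
        self.errno_name = errno_name
        self.probability = probability
        self.after = after  # assumption: number of calls to let through first
        self.calls = 0
        self.rng = random.Random(seed)  # seeded, so decisions are repeatable

    def decide(self):
        """Return the errno name to inject, or None to allow the call."""
        self.calls += 1
        if self.calls <= self.after:
            return None
        if self.rng.random() < self.probability:
            return self.errno_name
        return None


triggered = DenyRule("EIO", after=5)
print([triggered.decide() for _ in range(7)])
# [None, None, None, None, None, 'EIO', 'EIO']
```

Seeding the random source is what makes an intermittent fault reproducible: two rules built with the same seed and probability make the same sequence of decisions.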

Tutorial: Fault Injection · Spec Reference: Faults · Error Code Reference

Protocol-Level Fault Injection

Inject faults at the protocol level via transparent proxy. Target specific HTTP paths, SQL queries, Redis commands, or Kafka topics — without touching the network stack.

HTTP, HTTP/2, gRPC, PostgreSQL, MySQL, Redis, Kafka, NATS, MongoDB, Cassandra, ClickHouse, AMQP, Memcached, TCP, UDP

Unified API

fault(service) = syscall level. fault(service.interface) = protocol level. Same builtin, different dispatch.

Response rewriting

Return HTTP 503 for POST /orders, inject Postgres query errors, drop Kafka messages on specific topics.

Tutorial: HTTP & Redis Faults · Tutorial: Database Faults · Spec Reference: Protocol Faults

Deterministic Exploration

Control syscall ordering across services with hold() and release(). Explore all possible interleavings automatically with --explore. Replay any failure with its seed.

Parallel execution

parallel(fn1, fn2) runs operations concurrently. Faultbox controls which syscall proceeds first.

Seed replay

Every test run has a seed. Failed? Replay with --seed 42 for identical interleaving — deterministic debugging.

Exhaustive mode

--explore=all tries every permutation (K! orderings of K parallel operations). --explore=sample randomly samples orderings, trading completeness for speed.
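To make the K! figure concrete, here is a small Python sketch of exhaustive versus sampled exploration. It is an illustration under assumptions, not the Faultbox scheduler: the operation names and the `all_orderings` / `sample_orderings` helpers are hypothetical.

```python
import itertools
import math
import random


def all_orderings(ops):
    """Every permutation of the operations: K! orderings for K ops."""
    return list(itertools.permutations(ops))


def sample_orderings(ops, count, seed):
    """Draw `count` random orderings; the same seed yields the same draws."""
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        order = list(ops)
        rng.shuffle(order)
        samples.append(tuple(order))
    return samples


ops = ["orders.write", "inventory.write", "payments.connect"]
print(len(all_orderings(ops)), math.factorial(len(ops)))  # 6 6
print(sample_orderings(ops, 2, seed=42) == sample_orderings(ops, 2, seed=42))  # True
```

The factorial growth is why sampling matters: 3 operations give 6 orderings, but 10 already give 3,628,800. A fixed seed keeps a sampled run replayable.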

Tutorial: Concurrency · Spec Reference: Concurrency

Starlark Specs

Topology, faults, and assertions in one .star file. Starlark is a Python dialect — if you know Python, you know Starlark. No YAML, no separate config language. The spec is executable code.

Service declarations

service(), interface(), depends_on, healthcheck — declare topology as code.

Assertions

assert_eq(), assert_eventually(), assert_never(), assert_before() — value checks and temporal properties on the syscall trace.

Scenarios & generation

Tutorial: First Test · Spec Language Reference · CLI Reference

Event Log & Traces

Every intercepted syscall is recorded with vector clocks, service attribution, and file paths. Assert on internal behavior, not just inputs and outputs.
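For readers new to vector clocks, the standard comparison rule can be sketched in Python as follows. The clock layout (a dict from service name to counter) and the example values are assumptions for illustration; the source does not describe how Faultbox stores clocks.

```python
def happens_before(a, b):
    """True if clock `a` causally precedes clock `b`.

    Clocks are dicts mapping service name -> counter; missing entries
    count as 0. `a` precedes `b` when every component of `a` is <= the
    matching component of `b` and at least one is strictly smaller.
    """
    keys = set(a) | set(b)
    no_greater = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    strictly_less = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return no_greater and strictly_less


def concurrent(a, b):
    """True if neither of two distinct events causally precedes the other."""
    return not happens_before(a, b) and not happens_before(b, a)


wal_write = {"orders": 2}
response = {"orders": 3, "gateway": 1}
unrelated = {"inventory": 1}

print(happens_before(wal_write, response))  # True
print(happens_before(response, wal_write))  # False
print(concurrent(wal_write, unrelated))     # True
```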

Temporal assertions

"The WAL write happened before the response" — assert_before() proves ordering guarantees.
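The ordering check in that example can be modeled over a recorded trace. The Python sketch below is a simplification under stated assumptions: it treats the trace as a single totally ordered list, the event fields are invented for illustration, and `occurs_before` is a hypothetical helper rather than the signature of `assert_before()`.

```python
def first_index(trace, predicate):
    """Index of the first event satisfying the predicate, or None."""
    for i, event in enumerate(trace):
        if predicate(event):
            return i
    return None


def occurs_before(trace, first, then):
    """True if an event matching `first` appears earlier than one matching `then`.

    Uses the first match of each predicate; the check fails if either
    kind of event never occurs.
    """
    i = first_index(trace, first)
    j = first_index(trace, then)
    return i is not None and j is not None and i < j


# Hypothetical trace: field names are illustrative only.
trace = [
    {"service": "orders", "syscall": "write", "path": "/data/orders.wal"},
    {"service": "orders", "syscall": "fsync", "path": "/data/orders.wal"},
    {"service": "orders", "syscall": "sendto", "path": None},
]


def is_wal_write(event):
    return event["syscall"] == "write" and (event["path"] or "").endswith(".wal")


def is_response(event):
    return event["syscall"] == "sendto"


print(occurs_before(trace, is_wal_write, is_response))  # True
print(occurs_before(trace, is_response, is_wal_write))  # False
```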

ShiViz visualization

--shiviz trace.shiviz produces a space-time diagram with causal arrows between services.

Normalized traces

Capture before/after a refactor. faultbox diff shows exactly what behavioral changes you introduced.
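One way to picture trace normalization and diffing is the Python sketch below. It is not how `faultbox diff` is implemented: the event fields, the choice of which fields count as run-specific (`VOLATILE_FIELDS`), and both helper functions are assumptions made for illustration.

```python
import difflib

# Assumed run-specific fields that would differ between otherwise equal runs.
VOLATILE_FIELDS = {"pid", "timestamp", "fd"}


def normalize(trace):
    """Render each event as a stable line, dropping run-specific fields."""
    lines = []
    for event in trace:
        stable = [(k, v) for k, v in sorted(event.items()) if k not in VOLATILE_FIELDS]
        lines.append(" ".join(f"{k}={v}" for k, v in stable))
    return lines


def diff_traces(before, after):
    """Return only the added (+) and removed (-) lines between two traces."""
    delta = difflib.unified_diff(
        normalize(before), normalize(after),
        fromfile="before", tofile="after", lineterm="", n=0,
    )
    return [
        line for line in delta
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


before = [
    {"pid": 101, "service": "orders", "syscall": "write", "path": "/data/orders.wal"},
    {"pid": 101, "service": "orders", "syscall": "fsync", "path": "/data/orders.wal"},
]
after = [
    {"pid": 202, "service": "orders", "syscall": "write", "path": "/data/orders.wal"},
]

print(diff_traces(before, after))
# ['-path=/data/orders.wal service=orders syscall=fsync']
```

In this made-up example the refactor dropped the fsync, and the changed pid does not show up as noise because it is stripped before comparison.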

Tutorial: Traces & Assertions · Spec Reference: Assertions

Binary & Container Modes

Run local binaries for fast development, or real infrastructure in Docker containers for integration testing. Same spec, same assertions, same faults.

Binary mode

binary="./my-service" — fork+exec with seccomp filter. Fastest iteration, no Docker needed.

Container mode

image="postgres:16" — Docker containers with faultbox-shim entrypoint. Test against real Postgres, Redis, Kafka.

Tutorial: Containers · Spec Reference: Topology

Monitors & Network Partitions

Define safety invariants as monitors that run on every syscall event. Simulate network partitions between specific services.

Monitors

Callbacks that fire on every matching event — fail immediately if an invariant is violated.
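The fail-fast behavior described above can be sketched in Python. This is a stateless toy model with hypothetical names (`Monitor`, `InvariantViolation`, `run_monitor`) and invented event fields; it does not mirror the signature of Faultbox's `monitor()` builtin.

```python
class InvariantViolation(Exception):
    """Raised as soon as a monitor's check fails."""


class Monitor:
    def __init__(self, name, matches, check):
        self.name = name
        self.matches = matches  # predicate: which events this monitor sees
        self.check = check      # invariant: must hold for every matching event

    def on_event(self, event):
        if self.matches(event) and not self.check(event):
            raise InvariantViolation(f"{self.name} violated by {event}")


def run_monitor(monitor, trace):
    """Feed events in order; return the index of the first violation, or None."""
    for i, event in enumerate(trace):
        try:
            monitor.on_event(event)
        except InvariantViolation:
            return i
    return None


no_tmp_writes = Monitor(
    "no_tmp_writes",
    matches=lambda e: e["syscall"] == "write",
    check=lambda e: not e["path"].startswith("/tmp/"),
)

trace = [
    {"syscall": "write", "path": "/data/orders.wal"},
    {"syscall": "write", "path": "/tmp/scratch"},
    {"syscall": "write", "path": "/data/orders.wal"},
]
print(run_monitor(no_tmp_writes, trace))  # 1
```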

Partitions

partition(orders, inventory, run=scenario) — bidirectional network split, other connectivity intact.
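"Bidirectional, other connectivity intact" can be modeled by storing each cut as an unordered pair. The Python sketch below is a hypothetical model (the `Network` class and the service names are illustrative), not how Faultbox enforces partitions.

```python
class Network:
    """Toy connectivity model: a set of unordered service pairs that are cut."""

    def __init__(self):
        self._cut = set()

    def partition(self, a, b):
        # frozenset makes (a, b) and (b, a) the same entry: the split is bidirectional.
        self._cut.add(frozenset((a, b)))

    def heal(self, a, b):
        self._cut.discard(frozenset((a, b)))

    def can_connect(self, src, dst):
        return frozenset((src, dst)) not in self._cut


net = Network()
net.partition("orders", "inventory")
print(net.can_connect("orders", "inventory"))  # False
print(net.can_connect("inventory", "orders"))  # False
print(net.can_connect("orders", "payments"))   # True
```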

Tutorial: From Tests to Domains · Spec Reference: Monitors

Named Operations

Group related syscalls into logical operations. Fault "persist" instead of "write + fsync". Path filters target specific files.

Semantic faults

ops={"persist": op(syscalls=["write","fsync"], path="*.wal")} — fault the WAL persist operation, not individual syscalls.
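A rough Python model of how an intercepted syscall might be mapped back to a named operation. The `NamedOp` dataclass and `op_for` helper are hypothetical stand-ins for `op()`, and matching the path with Python's `fnmatch` is an assumption about the filter semantics.

```python
import fnmatch
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedOp:
    syscalls: tuple
    path: str = "*"

    def matches(self, syscall, path):
        """True when both the syscall and the path filter match."""
        return syscall in self.syscalls and fnmatch.fnmatchcase(path, self.path)


# Mirrors the "persist" example above; the structure is illustrative only.
OPS = {"persist": NamedOp(syscalls=("write", "fsync"), path="*.wal")}


def op_for(syscall, path):
    """Name of the first operation matching this event, or None."""
    for name, op in OPS.items():
        if op.matches(syscall, path):
            return name
    return None


print(op_for("write", "/data/orders.wal"))  # persist
print(op_for("fsync", "/data/orders.wal"))  # persist
print(op_for("write", "/var/log/app.log"))  # None
```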

Tutorial: Named Operations · Spec Reference: op()

Recipe Library

Curated, protocol-specific failure wrappers ship embedded in the faultbox binary. Load them via the @faultbox/ prefix — no filesystem setup, no network fetch. Each recipe encodes a canonical error message, status code, or incident pattern drawn from real postmortems.

Embedded stdlib

load("@faultbox/recipes/mongodb.star", "mongodb") — works from any project, no recipes/ directory needed. Ships with every binary.

Namespace structs

mongodb.disk_full() and postgres.disk_full() coexist. One import per protocol, zero name collisions.

Canonical error text

Say cassandra.unavailable() instead of remembering "Cannot achieve consistency level QUORUM". Recipes stay in sync with real driver behavior.

CLI discovery

faultbox recipes list and faultbox recipes show <name>: browse the catalog without reading source.

Spec Reference: Importing Recipes · CLI Reference: faultbox recipes

LLM-First Design

New in v0.2.0

Faultbox is designed for both human engineers and LLM agents. Structured JSON output, MCP server, Claude Code integration — everything an agent needs for an autonomous code → test → fix loop.

MCP server

faultbox mcp — 6 tools for Claude, Cursor, and any MCP client. Run tests, generate specs, analyze failures natively.

Structured output

--format json — machine-parseable results with fault info, syscall summary, and actionable diagnostics.

Claude Code commands

faultbox init --claude — slash commands (/fault-test, /fault-generate, /fault-diagnose) and auto-MCP config.

From docker-compose

faultbox init --from-compose — zero-effort spec generation. Detects protocols, wires dependencies, generates happy-path tests.

Diagnostics

Not just "test failed" — structured hints like "write fault fired but service returned 200 — missing error handling in persist path."

Docker & CI

ghcr.io/faultbox/faultbox image + GitHub Action for automated fault testing on every PR.

Tutorial: LLM Agents & MCP · CLI Reference: faultbox mcp

Recent Releases

Version	Highlights
v0.13.0 (current)	Five RFCs ship together. RFC-040 (Determinism Levels): the new `determinism()` top-level builtin declares L0/L1 + strict mode; the runtime emits `unmediated_io` events when the SUT performs I/O Faultbox can observe but isn't mediating (`clock_gettime`, `getrandom`, DNS to a non-Faultbox resolver, `connect()` to an undeclared address). Strict mode (the default) fails the test on the first untolerated leak. RFC-041 (Temporal Properties): `eventually(p)`, `always(p, between=)`, `await_event(matcher)`, `await_stable(quiescence_window=)`, and a rewritten state-machine `monitor(name, on=, state_init=, update=, check=)` plus a declarative `test(name, body=, expect=, timeout=, terminate_when=)` builtin. The test lifecycle gains a three-valued verdict (PASS / FAIL / INCONCLUSIVE) with CLI exit code 3 reserved for inconclusive-only runs. RFC-042 (Exploration Plan): `faultbox plan` subcommand, `plan.json` in every bundle, the report's Plan tab, coverage analysis (`--coverage`), rule-based `--suggest`, `--check-cost --max-instances N` CI gate, and the body-re-execution engine that turns named `choose("name", [opts])` axes (RFC-043 §5.2), syscall-level probability fan-out (`max_fires=N`, `mode="exhaustive"`), and `parallel(..., interleavings=)` orderings into multi-leaf test executions. Each leaf carries a stable `LeafID` through `TestResult`, the bundle manifest, and the HTML report's tests table. RFC-043 (Non-deterministic Operators): four small Starlark primitives: `choose`, `nondet`, `halt(reason)` with a new `halted` outcome, and `assume(predicate)` / `test(assume=)` with per-leaf evaluation, AST denylist sandbox, and predicate Starlark errors mapping to `Result="error"`. RFC-044 (Spec Language Simplification): withdraws RFC-013 (`param()`, superseded by `choose()`) and RFC-002 (`domain()`, `service()`/`interface()` proved sufficient); unifies the three fan-out axis kinds under one `NonDeterministicChoice` interface; collapses event sources under `observe.stdout`/`observe.stderr` and decoders under `decoder("name", ...)`; deprecates `faultbox generate` in favor of `faultbox plan --suggest`. Two new tutorial chapters (Part 4: Safety & Verification) cover the operator and fan-out vocabulary end-to-end. Full repo `go test -race ./...` green; Lima sweep 21/21 PASS across the same 6 integration spec suites as v0.12.29.
v0.12.29	Remote services (RFC-036). New `service(remote=...)` kwarg points Faultbox at an externally-running endpoint — typically a real pod in a customer's k8s dev cluster — without launching a process. The proxy datapath from RFC-024 dials the remote upstream and the SUT reaches it through the proxy unchanged, so every protocol-level fault (`response()`, `error()`, `slow()`, gRPC method targeting, SQL matchers) keeps working. Process-level kwargs (`seed=`, `reset=`, `reuse=`, `volumes=`, `ports=`, `args=`, `seccomp=`, `observe=`, `ops=`) and syscall-level faults are rejected at spec load with explicit error messages naming the offending kwarg and pointing at protocol faults or `mock_service()`. Composes with RFC-038 `tls=tls_cert(...)` for TLS-required upstreams (the auto-generated proxy cert covers `127.0.0.1` so SUT-side verification works against the env-rewritten loopback addr). New typed `remotes(...)` value for services whose interfaces live on different hosts. New `@faultbox/discovery/k8s.star` stdlib helper exposing `k8s.service`, `k8s.endpoint`, and `k8s.local` — pure string sugar over `<name>.<namespace>.svc.cluster.local`, no runtime k8s client. The `.fb` bundle's `env.json` records every `(service, interface, host, protocol)` tuple from a remote-using run; `faultbox replay` warns when replaying such a bundle and points at RFC-037 (the open companion design RFC for the offline-replay determinism story). Cluster connectivity is the user's responsibility — Telepresence connect, kubectl port-forward, in-cluster execution, or VPN — documented in the new Connectivity guide. 49 new tests across spec-load validation (32), runtime/proxy lifecycle (10 incl. TLS×remote interop), bundle round-trip (2), replay warning (2), and string-grep doc gates (3). Full repo `go test ./...` green; `go vet ./...` clean; Lima `make demo-container` 4/4 PASS (proxy datapath refactor confirmed non-regressive against the seccomp + Postgres + Redis path).
v0.12.28	TLS-aware proxy (RFC-038) + proxy traffic observability (RFC-034) + container fault paths (RFC-035). Twelve patch versions consolidated into one release. Headline: declare `interface(..., tls=tls_cert(...))` on a service interface and the proxy terminates TLS at its listener and re-establishes TLS dialing the upstream — all protocol-aware fault rules (`http.error(path=...)`, `grpc.error(method=...)`, `kafka.drop(topic=...)`, `redis.error(key=...)`) keep firing on the plaintext between the two TLS legs. Six plugins ship migrated this release (http, http2, gRPC, Kafka, Redis, TCP); the remaining 8 (postgres, mysql, mongodb, cassandra, clickhouse, memcached, nats, amqp) tracked in RFC-039 — declarations against unmigrated plugins emit a `proxy_tls_pending` event. The `tls_cert(...)` builtin is kwargs-only with full spec-load validation (cert/key pairing, file existence, CA PEM parse, `insecure=True` + `ca=` exclusion); empty `tls_cert()` auto-generates a self-signed proxy cert in memory for dev/test. Two TLS plumbing patterns: listener wrap-and-dial via `proxy.ListenTLS(serverCfg)` + `proxy.Dial(ctx, target, clientCfg)` (http, http2, kafka, redis, tcp) and framework credentials via `grpc.Creds(credentials.NewTLS(...))` for gRPC. Also: RFC-034 connection lifecycle observability — `proxy_conn_open`, `proxy_conn_close` (with byte counts and reason classification), `proxy_handshake_complete`, `proxy_stall` (1Hz watchdog, 5s warn / 30s extend tiers) — wired into 13 of 15 plugins. RFC-035 container-consumer fault paths on Linux Docker. New `stderr()` event source. Container-mode `observe=[stdout(...)]` now works for container services. Race fix on tcp.go::handle. New `internal/proxy/{http,grpc,kafka,redis,tcp}_tls_test.go` suites (~30 tests). Full repo `go test ./... -race` green; Lima sweep 21/21 PASS.
v0.12.16	Report UX overhaul. Driven by inDrive Freight triage feedback on the v0.12.15.x customer report; entirely scoped to `internal/report` (no bundle format or spec-language changes). Causal links now follow cause, not chronology: `findCausalAncestors` switched from vector-clock partial order (which on real bundles routinely had only the lifecycle events with complete clocks, so spaghetti pointed at `service_ready` instead of `proxy_fault_applied`) to seq-based strict precedence, restricted to faults / violations / errored steps. Hovering an ordinary success step now draws zero lines. New timeline filter bar above every Event Trace block: three presets (Compact, the default, hides framework lifecycle chatter; Anchors only strips everything except cause-relevant events; All events is the historical default) plus free-text search across event type / headline / fields. `proxy_fault_applied` / `proxy_fault_removed` are now first-class fault markers (red, not default-blue) and survive Phase 3 downsampling. Per-test "Faults applied" section pairs `proxy_fault_applied` with `proxy_fault_removed` (one row per assumption: service · protocol · interface · assumption · seq window). Recent block in the Assertion drill-down interleaves fault events between captured step rows by seq, with a fade-and-expand cap (220px max-height + mask-image gradient + `Show full assertion` toggle). Group-members table on folded markers (paginated 100/page, sticky header, scrollable) replaces the one-line "collapsed run" hint; runs hiding a 5xx among 99 successes are now legible. Fullscreen toggle on the test details modal. STDOUT JSON renders as a 2-column key/value table with dot-path flattening for nested objects. Source block falls back to `fault_matrix(...)` for matrix-generated test names and surfaces three jump links (scenario, fault, matrix call site). Plus: a tooltip vertical-text fix, the detail panel summary no longer truncates, and folded marker clicks route through `markerEvBySeq` so the Group-members table actually appears.
v0.12.15.2	Proxy goroutine context-rooting (Finding K). Customer (Freight) verified v0.12.15.1's redis RESP3 fix landed clean — cold-start path green end-to-end (smoke PASS in 16.3 s). The failure moved to the reuse path: cell 1 of the dbmatrix passes; cells 2–18 all fail identically with `error connect to db: invalid connection` (go-sql-driver `ErrBadConn`) or `read: connection reset by peer` (go-redis). Root cause: `Manager.EnsureProxy` rooted each proxy's Accept goroutine at the caller's ctx; `preStartProxies` runs under `RunTest`'s per-test `testCtx` which cancels via `defer cancel()` at end of test — at end of cell 1 that cancellation took down the goroutine while the listener fd stayed bound (only `Stop()` closes it) and the cached `m.proxies[key]` entry stayed in place. Cells 2..N saw `proxy_active(reused)` but nobody was `Accept()`-ing, so the kernel completed the TCP handshake and then RST-ed. v0.12.15.2 roots the proxy's `pCtx` at `context.Background()`; `StopAll/StopService` still drive explicit teardown. Why this surfaced now: v0.12.13's reuse fix kept containers AND proxies alive across cells, exposing the latent ctx-rooting bug. Lima sweep 21/21 PASS; new `TestManagerEnsureProxy_SurvivesCallerCtxCancel` regression test guards the path.

All releases on GitHub →