Features
v0.9.1 · Apache 2.0 · Linux (macOS via Lima)
Syscall-Level Fault Injection
Intercept any syscall via Linux's seccomp-notify and decide — allow, deny, or delay. No eBPF, no ptrace, no code changes. Works on any binary: Go, Rust, Java, Python, C.
A fault on write automatically covers write, writev, and
pwrite64. Think in operations, not syscall numbers.
Fault only writes to /data/*.wal — stdout, TCP, and other writes
are unaffected. Powered by fd→path resolution via /proc.
deny("EIO", probability="30%") or deny("EIO", trigger="after=5") —
intermittent failures and trigger-on-Nth-call.
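The three ideas above (syscall families, path filters, and trigger-on-Nth-call) can be modelled in a few lines of plain Python. This is only an illustrative sketch, not Faultbox's implementation: `FAMILIES`, `DenyRule`, and `decide` are hypothetical names, and it assumes `after=5` means "let the first five matching calls through, then deny".

```python
import errno
from fnmatch import fnmatch

# Hypothetical family table: one operation name covers several syscalls.
FAMILIES = {"write": {"write", "writev", "pwrite64"}}


class DenyRule:
    """Deny an operation on matching paths once `after` matching calls have passed."""

    def __init__(self, op, path_glob, err, after=0):
        self.syscalls = FAMILIES.get(op, {op})
        self.path_glob = path_glob
        self.err = err
        self.after = after
        self.seen = 0  # number of matching calls observed so far

    def decide(self, syscall, path):
        # Calls outside the family or outside the path filter are never touched.
        if syscall not in self.syscalls or not fnmatch(path, self.path_glob):
            return "allow"
        self.seen += 1
        if self.seen > self.after:
            return "deny:" + errno.errorcode[self.err]
        return "allow"


rule = DenyRule("write", "/data/*.wal", errno.EIO, after=5)
```

With this rule, five `pwrite64` calls to `/data/a.wal` are allowed and later ones are denied with `EIO`, while a `write` to `/dev/stdout` never matches the path filter.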
Protocol-Level Fault Injection
Inject faults at the protocol level via transparent proxy. Target specific HTTP paths, SQL queries, Redis commands, or Kafka topics — without touching the network stack.
fault(service) = syscall level. fault(service.interface) = protocol
level. Same builtin, different dispatch.
Return HTTP 503 for POST /orders, inject Postgres query errors,
drop Kafka messages on specific topics.
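A protocol-level fault can be pictured as a rule table that a proxy consults before forwarding a request. The sketch below is a simplified Python model under that assumption; `HttpFaultRule` and `proxy_response` are hypothetical names, not part of Faultbox.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class HttpFaultRule:
    method: str      # HTTP method the rule applies to
    path_glob: str   # glob matched against the request path
    status: int      # status code returned instead of the upstream's


def proxy_response(rules, method, path, upstream_status):
    """Return the faulted status if a rule matches, otherwise the upstream's status."""
    for rule in rules:
        if rule.method == method and fnmatch(path, rule.path_glob):
            return rule.status
    return upstream_status


rules = [HttpFaultRule("POST", "/orders", 503)]
```

Only `POST /orders` is affected here; a `GET /orders` or a `POST /health` passes through with the upstream's status.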
Deterministic Exploration
Control syscall ordering across services with hold() and
release(). Explore all possible interleavings automatically
with --explore. Replay any failure with its seed.
parallel(fn1, fn2) runs operations concurrently.
Faultbox controls which syscall proceeds first.
Every test run has a seed. Failed? Replay with --seed 42 for
identical interleaving — deterministic debugging.
--explore=all tries every permutation (K! orderings for K operations).
--explore=sample randomly samples for faster coverage.
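To see why exhaustive exploration grows as K! and why a seed makes a sampled run replayable, here is a small Python sketch using only the standard library. It is not Faultbox's scheduler; the event names and both helper functions are made up for illustration.

```python
import itertools
import math
import random


def all_orderings(events):
    """Every permutation of the events: K! orderings for K distinct events."""
    return list(itertools.permutations(events))


def sample_orderings(events, n, seed):
    """Draw n random orderings; the same seed always yields the same sequence."""
    rng = random.Random(seed)
    orderings = []
    for _ in range(n):
        order = list(events)
        rng.shuffle(order)
        orderings.append(tuple(order))
    return orderings


events = ["orders.write", "inventory.write", "payments.connect"]
```

Three events give 3! = 6 orderings in the exhaustive case, and calling `sample_orderings(events, 4, seed=42)` twice returns identical lists, which is the property that makes a failing interleaving reproducible.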
Starlark Specs
Topology, faults, and assertions in one .star file.
Starlark is a Python dialect — if you know Python, you know Starlark.
No YAML, no separate config language. The spec is executable code.
service(), interface(), depends_on,
healthcheck — declare topology as code.
assert_eq(), assert_eventually(),
assert_never(), assert_before() — value checks
and temporal properties on the syscall trace.
Register happy paths with scenario(), then
faultbox generate creates failure tests automatically
(deprecated as of v0.13.0 in favor of faultbox plan --suggest).
Event Log & Traces
Every intercepted syscall is recorded with vector clocks, service attribution, and file paths. Assert on internal behavior, not just inputs and outputs.
"The WAL write happened before the response" —
assert_before() proves ordering guarantees.
--shiviz trace.shiviz produces a space-time diagram
with causal arrows between services.
Capture before/after a refactor. faultbox diff shows
exactly what behavioral changes you introduced.
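Ordering assertions over vector clocks rest on the standard happens-before comparison: event A precedes event B if every component of A's clock is less than or equal to B's and at least one is strictly smaller. The Python sketch below shows that comparison on hypothetical clocks; it is not Faultbox's assert_before(), and the service names are invented.

```python
def happened_before(a, b):
    """True if vector clock `a` causally precedes vector clock `b`."""
    services = set(a) | set(b)
    # Missing entries count as 0: that service had not ticked yet.
    no_greater = all(a.get(s, 0) <= b.get(s, 0) for s in services)
    some_smaller = any(a.get(s, 0) < b.get(s, 0) for s in services)
    return no_greater and some_smaller


# Hypothetical clocks attached to three recorded events.
wal_write = {"db": 1}
response = {"db": 1, "api": 2}
cache_hit = {"cache": 1}
```

Here `wal_write` precedes `response`, while `wal_write` and `cache_hit` are concurrent: neither precedes the other, so an ordering assertion between them cannot be established from the clocks alone.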
Binary & Container Modes
Run local binaries for fast development, or real infrastructure in Docker containers for integration testing. Same spec, same assertions, same faults.
binary="./my-service" — fork+exec with seccomp filter.
Fastest iteration, no Docker needed.
image="postgres:16" — Docker containers with faultbox-shim
entrypoint. Test against real Postgres, Redis, Kafka.
Monitors & Network Partitions
Define safety invariants as monitors that run on every syscall event. Simulate network partitions between specific services.
Callbacks that fire on every matching event — fail immediately if an invariant is violated.
partition(orders, inventory, run=scenario) — bidirectional
network split, other connectivity intact.
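Both ideas in this section can be modelled compactly: a partition is an unordered pair of services that may not talk to each other, and a monitor is a callback invoked on every event that raises on the first violation. The following Python sketch assumes that model; `Network`, `run_monitor`, and the write-after-close invariant are hypothetical examples rather than Faultbox APIs.

```python
class Network:
    """Tracks which service pairs are currently partitioned."""

    def __init__(self):
        self._cut = set()

    def partition(self, a, b):
        # A frozenset makes the split bidirectional: {a, b} == {b, a}.
        self._cut.add(frozenset((a, b)))

    def can_connect(self, src, dst):
        return frozenset((src, dst)) not in self._cut


class InvariantViolation(Exception):
    pass


def no_write_after_close(event, closed_paths):
    """Monitor callback: fail as soon as a write targets a closed path."""
    if event["syscall"] == "close":
        closed_paths.add(event["path"])
    elif event["syscall"] == "write" and event["path"] in closed_paths:
        raise InvariantViolation("write after close: " + event["path"])


def run_monitor(monitor, events):
    closed_paths = set()
    for event in events:
        monitor(event, closed_paths)  # raises on the first violating event


net = Network()
net.partition("orders", "inventory")
```

After the partition, `orders` and `inventory` cannot reach each other in either direction, but `orders` can still reach any other service.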
Named Operations
Group related syscalls into logical operations. Fault "persist" instead of "write + fsync". Path filters target specific files.
ops={"persist": op(syscalls=["write","fsync"], path="*.wal")} —
fault the WAL persist operation, not individual syscalls.
Recipe Library
Curated, protocol-specific failure wrappers ship embedded in the
faultbox binary. Load them via the @faultbox/
prefix — no filesystem setup, no network fetch. Each recipe encodes a
canonical error message, status code, or incident pattern drawn from
real postmortems.
load("@faultbox/recipes/mongodb.star", "mongodb") —
works from any project, no recipes/ directory needed.
Ships with every binary.
mongodb.disk_full() and postgres.disk_full()
coexist. One import per protocol, zero name collisions.
Say cassandra.unavailable() instead of remembering
"Cannot achieve consistency level QUORUM". Recipes stay in sync
with real driver behavior.
faultbox recipes list and recipes show <name>
— browse the catalog without reading source.
LLM-First Design
New in v0.2.0
Faultbox is designed for both human engineers and LLM agents. Structured JSON output, MCP server, Claude Code integration — everything an agent needs for an autonomous code → test → fix loop.
faultbox mcp — 6 tools for Claude, Cursor, and any MCP client.
Run tests, generate specs, analyze failures natively.
--format json — machine-parseable results with fault info,
syscall summary, and actionable diagnostics.
faultbox init --claude — slash commands (/fault-test,
/fault-generate, /fault-diagnose) and auto-MCP config.
faultbox init --from-compose — zero-effort spec generation.
Detects protocols, wires dependencies, generates happy-path tests.
Not just "test failed" — structured hints like "write fault fired but service returned 200 — missing error handling in persist path."
ghcr.io/faultbox/faultbox image + GitHub Action for
automated fault testing on every PR.
Recent Releases
| Version | Highlights |
|---|---|
| v0.13.0 (current) | Five RFCs ship together. RFC-040 (Determinism Levels): the new determinism() top-level builtin declares L0/L1 + strict mode; the runtime emits unmediated_io events when the SUT performs I/O Faultbox can observe but isn't mediating (clock_gettime, getrandom, DNS to a non-Faultbox resolver, connect() to an undeclared address). Strict mode (the default) fails the test on the first untolerated leak. RFC-041 (Temporal Properties): eventually(p), always(p, between=), await_event(matcher), await_stable(quiescence_window=), and a rewritten state-machine monitor(name, on=, state_init=, update=, check=) plus a declarative test(name, body=, expect=, timeout=, terminate_when=) builtin. The test lifecycle gains a three-valued verdict (PASS / FAIL / INCONCLUSIVE) with CLI exit code 3 reserved for inconclusive-only runs. RFC-042 (Exploration Plan): faultbox plan subcommand, plan.json in every bundle, the report's Plan tab, coverage analysis (--coverage), rule-based --suggest, --check-cost --max-instances N CI gate, and the body-re-execution engine that turns named choose("name", [opts]) axes (RFC-043 §5.2), syscall-level probability fan-out (max_fires=N, mode="exhaustive"), and parallel(..., interleavings=) orderings into multi-leaf test executions. Each leaf carries a stable LeafID through TestResult, the bundle manifest, and the HTML report's tests table. RFC-043 (Non-deterministic Operators): four small Starlark primitives: choose, nondet, halt(reason) with a new halted outcome, and assume(predicate) / test(assume=) with per-leaf evaluation, AST denylist sandbox, and predicate Starlark errors mapping to Result="error". RFC-044 (Spec Language Simplification): withdraws RFC-013 (param(), superseded by choose()) and RFC-002 (domain(), service()/interface() proved sufficient); unifies the three fan-out axis kinds under one NonDeterministicChoice interface; collapses event sources under observe.stdout/observe.stderr and decoders under decoder("name", ...); deprecates faultbox generate in favor of faultbox plan --suggest. Two new tutorial chapters (Part 4: Safety & Verification) cover the operator and fan-out vocabulary end-to-end. Full repo go test -race ./... green; Lima sweep 21/21 PASS across the same 6 integration spec suites as v0.12.29. |
| v0.12.29 | Remote services (RFC-036). New service(remote=...) kwarg points Faultbox at an externally-running endpoint — typically a real pod in a customer's k8s dev cluster — without launching a process. The proxy datapath from RFC-024 dials the remote upstream and the SUT reaches it through the proxy unchanged, so every protocol-level fault (response(), error(), slow(), gRPC method targeting, SQL matchers) keeps working. Process-level kwargs (seed=, reset=, reuse=, volumes=, ports=, args=, seccomp=, observe=, ops=) and syscall-level faults are rejected at spec load with explicit error messages naming the offending kwarg and pointing at protocol faults or mock_service(). Composes with RFC-038 tls=tls_cert(...) for TLS-required upstreams (the auto-generated proxy cert covers 127.0.0.1 so SUT-side verification works against the env-rewritten loopback addr). New typed remotes(...) value for services whose interfaces live on different hosts. New @faultbox/discovery/k8s.star stdlib helper exposing k8s.service, k8s.endpoint, and k8s.local — pure string sugar over <name>.<namespace>.svc.cluster.local, no runtime k8s client. The .fb bundle's env.json records every (service, interface, host, protocol) tuple from a remote-using run; faultbox replay warns when replaying such a bundle and points at RFC-037 (the open companion design RFC for the offline-replay determinism story). Cluster connectivity is the user's responsibility — Telepresence connect, kubectl port-forward, in-cluster execution, or VPN — documented in the new Connectivity guide. 49 new tests across spec-load validation (32), runtime/proxy lifecycle (10 incl. TLS×remote interop), bundle round-trip (2), replay warning (2), and string-grep doc gates (3). Full repo go test ./... green; go vet ./... clean; Lima make demo-container 4/4 PASS (proxy datapath refactor confirmed non-regressive against the seccomp + Postgres + Redis path). |
| v0.12.28 | TLS-aware proxy (RFC-038) + proxy traffic observability (RFC-034) + container fault paths (RFC-035). Twelve patch versions consolidated into one release. Headline: declare interface(..., tls=tls_cert(...)) on a service interface and the proxy terminates TLS at its listener and re-establishes TLS dialing the upstream — all protocol-aware fault rules (http.error(path=...), grpc.error(method=...), kafka.drop(topic=...), redis.error(key=...)) keep firing on the plaintext between the two TLS legs. Six plugins ship migrated this release (http, http2, gRPC, Kafka, Redis, TCP); the remaining 8 (postgres, mysql, mongodb, cassandra, clickhouse, memcached, nats, amqp) tracked in RFC-039 — declarations against unmigrated plugins emit a proxy_tls_pending event. The tls_cert(...) builtin is kwargs-only with full spec-load validation (cert/key pairing, file existence, CA PEM parse, insecure=True + ca= exclusion); empty tls_cert() auto-generates a self-signed proxy cert in memory for dev/test. Two TLS plumbing patterns: listener wrap-and-dial via proxy.ListenTLS(serverCfg) + proxy.Dial(ctx, target, clientCfg) (http, http2, kafka, redis, tcp) and framework credentials via grpc.Creds(credentials.NewTLS(...)) for gRPC. Also: RFC-034 connection lifecycle observability — proxy_conn_open, proxy_conn_close (with byte counts and reason classification), proxy_handshake_complete, proxy_stall (1Hz watchdog, 5s warn / 30s extend tiers) — wired into 13 of 15 plugins. RFC-035 container-consumer fault paths on Linux Docker. New stderr() event source. Container-mode observe=[stdout(...)] now works for container services. Race fix on tcp.go::handle. New internal/proxy/{http,grpc,kafka,redis,tcp}_tls_test.go suites (~30 tests). Full repo go test ./... -race green; Lima sweep 21/21 PASS. |
| v0.12.16 | Report UX overhaul. Driven by inDrive Freight triage feedback on the v0.12.15.x customer report; entirely scoped to internal/report (no bundle format or spec-language changes). Causal links now follow cause, not chronology: findCausalAncestors switched from vector-clock partial order (which on real bundles routinely had only the lifecycle events with complete clocks, so spaghetti pointed at service_ready instead of proxy_fault_applied) to seq-based strict precedence, restricted to faults / violations / errored steps. Hovering an ordinary success step now draws zero lines. New timeline filter bar above every Event Trace block: three presets (Compact default, hides framework lifecycle chatter; Anchors only strips everything except cause-relevant events; All events historical default) plus free-text search across event type / headline / fields. proxy_fault_applied / proxy_fault_removed are now first-class fault markers (red, not default-blue) and survive Phase 3 downsampling. Per-test "Faults applied" section pairs proxy_fault_applied with proxy_fault_removed (one row per assumption: service · protocol · interface · assumption · seq window). Recent block in the Assertion drill-down interleaves fault events between captured step rows by seq, with a fade-and-expand cap (220px max-height + mask-image gradient + Show full assertion toggle). Group-members table on folded markers (paginated 100/page, sticky header, scrollable) replaces the one-line "collapsed run" hint; runs hiding a 5xx among 99 successes are now legible. Fullscreen toggle on the test details modal. STDOUT JSON renders as a 2-column key/value table with dot-path flattening for nested objects. Source block falls back to fault_matrix(...) for matrix-generated test names and surfaces three jump links (scenario, fault, matrix call site). Plus tooltip vertical-text fix, detail panel summary no longer truncates, folded marker click routes through markerEvBySeq so the Group-members table actually appears. |
| v0.12.15.2 | Proxy goroutine context-rooting (Finding K). Customer (Freight) verified v0.12.15.1's redis RESP3 fix landed clean — cold-start path green end-to-end (smoke PASS in 16.3 s). The failure moved to the reuse path: cell 1 of the dbmatrix passes; cells 2–18 all fail identically with error connect to db: invalid connection (go-sql-driver ErrBadConn) or read: connection reset by peer (go-redis). Root cause: Manager.EnsureProxy rooted each proxy's Accept goroutine at the caller's ctx; preStartProxies runs under RunTest's per-test testCtx which cancels via defer cancel() at end of test — at end of cell 1 that cancellation took down the goroutine while the listener fd stayed bound (only Stop() closes it) and the cached m.proxies[key] entry stayed in place. Cells 2..N saw proxy_active(reused) but nobody was Accept()-ing, so the kernel completed the TCP handshake and then RST-ed. v0.12.15.2 roots the proxy's pCtx at context.Background(); StopAll/StopService still drive explicit teardown. Why this surfaced now: v0.12.13's reuse fix kept containers AND proxies alive across cells, exposing the latent ctx-rooting bug. Lima sweep 21/21 PASS; new TestManagerEnsureProxy_SurvivesCallerCtxCancel regression test guards the path. |