Troubleshooting Playbook

Twelve failure modes that each tend to consume half a day of engineering time on a Faultbox onboarding, with the diagnostic shortcut for each. Pulled from the v0.9.x customer reports; most of these were hours-of-Slack-DM situations.

1. “My fault rule installed but nothing fired”

Symptom: faultbox test reports green; --format json shows "hits": 0 for the rule. Since v0.9.7 the terminal also surfaces Zero-traffic faults (N): rule installed, matched no syscalls.

Most-likely causes (ranked):

SUT cached the connection — the upstream connect happened before your fault window opened, and the SUT reused the open socket. Fault write or read instead of connect.
Wrong syscall name — you faulted sendto but the SUT uses plain write (stream socket). Use the family canonical: write=deny(…) expands to write/writev/pwrite64.
Fault window too narrow — the scenario didn’t drive any upstream traffic during the lambda body. Use fault_start/fault_stop to span more of the test, or move the inducing call inside the run= lambda.

Diagnostic: see seccomp-cheatsheet.md for the Go-op-to-syscall table; the v0.9.4 fault_zero_traffic event generally points directly at one of the three causes above.

2. “Test passes locally but fails on Lima/CI”

Symptom: faultbox test is green on the host but red in CI or in a fresh Lima VM.

Most-likely causes:

Stale binary — Faultbox uses cached binaries in bin/faultbox if they exist. Delete and rebuild: rm bin/faultbox && make build.
Different Go toolchain — the version that compiled the SUT affects which syscalls it emits. Check faultbox inspect run-*.fb env.json for go_toolchain on both hosts.
Different image digests — mysql:8.0.32 resolves to different bytes over time. Pin via faultbox.lock (v0.10.0+); meanwhile, pin in your spec: image="mysql@sha256:…".
Different kernel — seccomp behaviour can shift across major kernel versions (rare, but real). Check env.json for kernel.

Diagnostic: faultbox inspect <bundle>.fb shows everything environment-related the run captured. Diff two bundles’ env.json to spot the drift.
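
Once env.json has been pulled out of each bundle, the comparison can be scripted. The following is a minimal sketch in plain Python, not a Faultbox feature: the diff_env helper is hypothetical, the key names go_toolchain and kernel come from the causes above, and the sample values are made up for illustration.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} so drift is easy to list."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def diff_env(local_env, ci_env):
    """Return {key: (local_value, ci_value)} for every key that differs."""
    a, b = flatten(local_env), flatten(ci_env)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }

# Made-up contents; in practice load each bundle's env.json from disk.
local_env = json.loads('{"go_toolchain": "go1.22.1", "kernel": "6.5.0"}')
ci_env = json.loads('{"go_toolchain": "go1.21.6", "kernel": "6.5.0"}')

for key, (local_value, ci_value) in diff_env(local_env, ci_env).items():
    print(f"{key}: local={local_value!r} ci={ci_value!r}")
```

Keys that exist in only one file show up with None on the other side, which is often the most interesting kind of drift.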

3. “TCP healthcheck says ready but app rejects requests”

Symptom: healthcheck=tcp("localhost:5432") returns success, then the first request against the service fails with connection refused or protocol error.

Cause: tcp() only proves a port is bound. Postgres/MySQL etc. have to handshake the protocol layer before accepting queries — Docker’s port-forwarder accepts TCP connections before the app behind it is ready.
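
The gap between "port is bound" and "app is ready" can be reproduced without Docker. This is a sketch in plain Python, independent of Faultbox: a socket that listens but never calls accept() stands in for a service whose protocol layer is not up yet, and tcp_ready is a hypothetical helper that does what a bare TCP check can do.

```python
import socket

# A listening socket that nobody ever accept()s from stands in for a
# service whose port is bound but whose protocol layer is not up yet.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

def tcp_ready(host, port, timeout=1.0):
    """All a bare TCP check can observe: did connect() succeed?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The kernel completes the TCP handshake on its own, so the check
# succeeds even though no application code has touched the connection.
ready = tcp_ready("127.0.0.1", port)
print("tcp check passed:", ready)
server.close()
```

Nothing in the sketch speaks the service's protocol, which is exactly the blind spot described above.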

Fix: replace tcp(...) with a protocol-aware check:

# http() healthchecks already require a 2xx, so an app /healthz that
# checks its own DB connection is the most reliable readiness gate:
healthcheck = http("http://localhost:8080/healthz")

For a bare SQL service with no app in front, tcp() is the only built-in healthcheck that applies, because Faultbox has no query-based healthcheck builtin. Give the container extra settle time, or have the SUT retry its first query.

This was the #5 hour-burner on the inDrive PoC.

4. “Container starts but truck-api can’t reach it”

Symptom: service container is running (you can see it in docker ps), but the SUT container’s request to it times out.

Cause: until v0.9.6 the proxy ran on the host’s loopback only — a container couldn’t reach 127.0.0.1:<port> on the host from inside its netns. v0.9.6 added host.docker.internal via Docker’s ExtraHosts, plus auto-rewriting in buildContainerEnv.

Fix: upgrade to v0.9.6+. If you’re stuck on v0.9.5 or earlier, set env={"OTHER_SVC_ADDR": other.main.internal_addr} so the SUT uses the container DNS name (matching container-mode’s pre-RFC-024 behaviour).

5. “JWT-protected request returns 401 even with my mock token”

Symptom: SUT rejects every token your jwt.server() mints.

Most-likely causes:

Wrong claim name — your middleware expects uid but you sent user_id (this was 8h on the inDrive PoC). Check what the SUT actually validates.
Audience mismatch — middleware demands aud="api.example.com" but you didn’t set it. Add aud to the claims dict.
Token expired — iat/exp claims are seconds-since-epoch. If you set them once and re-ran a day later, they’re stale. Either omit exp (some middlewares don’t enforce it) or compute now + 3600 in your driver.
JWKS endpoint unreachable from the SUT — the SUT fetches the JWKS over HTTP. If the issuer URL doesn’t resolve from inside the SUT’s netns, no signature verifies. Check OIDC_JWKS_URL actually works from the SUT.
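
For the expired-token case, computing iat and exp at call time avoids stale values. A minimal sketch in plain Python, assuming your driver can build the claims dict before handing it to the token minter; fresh_claims is a hypothetical helper and the claim values are placeholders, so how the dict reaches jwt.server() depends on your spec.

```python
import time

def fresh_claims(base_claims, ttl_seconds=3600, now=None):
    """Return a copy of base_claims with iat/exp computed at call time."""
    issued_at = int(time.time()) if now is None else int(now)
    claims = dict(base_claims)          # leave the caller's dict untouched
    claims["iat"] = issued_at           # seconds since epoch
    claims["exp"] = issued_at + ttl_seconds
    return claims

# Placeholder claim values; use whatever your SUT's middleware validates.
claims = fresh_claims({"uid": "42", "aud": "api.example.com"})
print(claims)
```

The optional now parameter exists only so the helper can be tested with a fixed clock.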

Diagnostic: enable the SUT’s auth-middleware DEBUG logs and look for the actual rejection reason. Most middlewares log “claim X missing” or “kid not found in JWKS.”

6. “fault_matrix runs but my expectation never fires”

Symptom: every cell of a fault_matrix passes, even ones that shouldn’t.

Cause: default_expect= accepts callables but doesn’t fail unless the callable raises. A returning-truthy lambda is read as “pass.”

Fix: use the v0.9.8 explicit predicates:

fault_matrix(
    scenarios      = [...],
    faults         = [...],
    default_expect = expect_success(),  # explicit, raises on violation
)

Or for ad-hoc cases, use assert_true inside the lambda:

default_expect = lambda r: assert_true(r != None, "scenario hung"),
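
To see why a lambda that only returns a value can never fail a cell, here is a toy model in plain Python. It is not Faultbox code: run_cell and require are hypothetical stand-ins that mimic the behaviour described in the cause above, where the callable's return value is ignored and only a raised error counts as a failure.

```python
def run_cell(result, expect):
    """Toy runner: a cell fails only if the expectation callable raises."""
    try:
        expect(result)              # return value is deliberately ignored
    except AssertionError as err:
        return f"fail: {err}"
    return "pass"

def require(condition, message):
    """Stand-in for an assertion helper: raises on violation."""
    if not condition:
        raise AssertionError(message)

hung = None  # stands in for a scenario that produced no result

# Evaluates to False but never raises, so the toy runner reports a pass.
print(run_cell(hung, lambda r: r is not None))

# Raises on violation, so the failure becomes visible.
print(run_cell(hung, lambda r: require(r is not None, "scenario hung")))
```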

7. “Lima VM hangs on `make demo`”

Symptom: make demo runs the binary in the Lima VM but the test never completes.

Most-likely causes:

VM out of resources — check limactl shell faultbox-dev free -h. seccomp-notify is memory-light but Docker daemon + your containers add up.
Stale Docker network — docker network rm faultbox-net then re-run.
Stale containers from a previous test — docker ps -a | grep faultbox- and docker rm -f anything reusable. The runtime cleans up on success but a panic mid-test can orphan containers.

If the hang is reproducible: kill it with Ctrl-C, then faultbox inspect run-*.fb — the partial bundle usually shows which service hadn’t reached service_ready yet.

8. “Spec loads then immediately errors with `unknown keyword 'X'`”

Symptom: error: load test.star: fault_assumption() unexpected keyword 'foo'

Cause: undocumented kwargs are a parse error since v0.9.7. You either typed a kwarg name wrong or you’re on a version that predates a feature you saw in docs/examples.

Fix:

Double-check the kwarg name in spec-language.md.
Check your version: faultbox --version. Compare against the feature’s “shipped in vX.Y.Z” callout in the docs.
Bump if needed: brew upgrade faultbox or download from GitHub releases.

9. “Bundle says faultbox 0.9.7 but I have 0.9.8 installed”

Symptom: faultbox inspect run-*.fb warns bundle was produced by faultbox 0.9.7; current is 0.9.8.

Cause: bundle was generated by an older binary; you’ve since upgraded. Reading the bundle still works (inspect/report never refuse on minor-drift). For byte-identical replay, install the producer version.

faultbox replay (v0.10.0+) refuses on major version drift only (0.x → 1.x) — minor/patch drift warns and proceeds. See the bundles.md version-compat table for the full matrix.
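
The drift rule stated above (refuse on major drift, warn on minor or patch drift) can be written down as a small decision function. This is a sketch of the rule as described in this section, not the actual implementation; replay_decision and its return strings are hypothetical, and the bundles.md table remains the authority for edge cases.

```python
def parse_version(text):
    """Parse "0.10.1" or "v0.10.1" into (major, minor, patch) integers."""
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)

def replay_decision(producer, current):
    """Apply the stated rule: major drift refuses, other drift warns."""
    p, c = parse_version(producer), parse_version(current)
    if p[0] != c[0]:
        return "refuse"
    if p != c:
        return "warn"
    return "ok"

for producer, current in [("0.10.0", "0.10.0"), ("0.10.0", "0.11.2"), ("0.10.0", "1.0.0")]:
    print(producer, "->", current, ":", replay_decision(producer, current))
```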

10. “Container DNS works from the test driver but not from the SUT”

Symptom: db.main.query(...) from the test body works, but the SUT inside its container errors on dial tcp: lookup db: no such host.

Cause: test-body requests run from the test driver (host process) which uses localhost:<HostPort> to reach the container. The SUT inside its container needs to use the Docker DNS name (db) over the faultbox-net bridge.

Fix: pass the right address into the SUT’s env:

api = service("api", image = "myapi:latest",
    env = {
        "DB_HOST": db.main.internal_addr,  # → "db:5432" inside container
    },
)

Use .internal_addr for service-to-service references in container mode. .addr returns the host-port form, which only the test driver can reach.

11. “Host-binary SUT can’t connect to a Docker DB upstream”

Symptom: truck-api (a host binary) times out at the healthcheck stage while trying to connect to a Docker db service. Trace shows the proxy started cleanly. Spec wires SUT env from db.main.internal_addr.rsplit(":").

Cause: internal_addr returns the container DNS name ("db:3306") which the host-binary process can’t resolve. Worse, the auto-substitution that rewrites real addrs to proxy addrs only matches the literal substring db:3306 in env values — so rsplit(":") decomposition silently breaks it. The SUT ends up dialing db:3306 or 127.0.0.1:3306 (the unmapped container-internal port) and times out.
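
The silent breakage is easier to see in a toy model. The sketch below is plain Python and only imitates the literal-substring matching described above; rewrite_env is a hypothetical stand-in for the auto-substitution, and the proxy address is made up.

```python
def rewrite_env(env, real_addr, proxy_addr):
    """Toy model: swap the real address for the proxy address wherever
    the literal substring appears in an env value."""
    return {key: value.replace(real_addr, proxy_addr) for key, value in env.items()}

real, proxy = "db:3306", "127.0.0.1:41234"   # proxy port is made up

# Whole address in one value: the substring matches and gets rewritten.
whole = {"MYSQL_ADDR": real}
print(rewrite_env(whole, real, proxy))

# Host and port split apart: neither value contains "db:3306" any more,
# so nothing is rewritten and the SUT keeps the unreachable address.
host, port = real.rsplit(":", 1)
split = {"MYSQL_HOST": host, "MYSQL_PORT": port}
print(rewrite_env(split, real, proxy))
```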

Fix: use iface.proxy_addr / proxy_host / proxy_port instead. These are late-bound — they return placeholders at spec-load and resolve to the real proxy listener at test-execution:

api = service("truck-api", "/usr/local/bin/truck-api",
    interface("public", "http", 9000),
    env = {
        "MYSQL_HOST": db.main.proxy_host,
        "MYSQL_PORT": db.main.proxy_port,
        "MYSQL_DSN":  "user:pass@tcp(" + db.main.proxy_addr + ")/appdb",
    },
)

Don’t .split() or .rsplit() proxy_addr — operations run at spec-load where it’s still a placeholder. Use the separate proxy_host / proxy_port attributes when you need the parts.
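
Why concatenation is safe but splitting is not can be shown with a toy placeholder. This sketch is plain Python under loud assumptions: the placeholder format and the resolve helper are invented for illustration and say nothing about how Faultbox represents late-bound values internally.

```python
PLACEHOLDER = "${proxy_addr:db.main}"    # invented placeholder format

def resolve(value, bindings):
    """Toy late binding: replace each intact placeholder with its real value."""
    for placeholder, real in bindings.items():
        value = value.replace(placeholder, real)
    return value

bindings = {PLACEHOLDER: "127.0.0.1:41234"}   # known only at test execution

# Concatenation keeps the placeholder intact, so it resolves later.
dsn = "user:pass@tcp(" + PLACEHOLDER + ")/appdb"
print(resolve(dsn, bindings))

# Splitting at load time cuts the placeholder itself in two; neither
# half matches any binding, so both stay unresolved garbage.
host, port = PLACEHOLDER.rsplit(":", 1)
print(repr(resolve(host, bindings)), repr(resolve(port, bindings)))
```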

See recipes.md → Wiring SUTs to the proxy for more context. Fixed in v0.12.12 (RFC-033).

12. “Service exited before becoming ready” / missing-binary launch

Symptom (binary mode): the test fails fast with service "truck-api" exited before becoming ready: exec /tmp/truck-api: no such file or directory.

Cause: the target binary path in your spec doesn’t exist, isn’t executable, or wasn’t built for the VM’s architecture. Before v0.13.0 this surfaced as a misleading context deadline exceeded a full healthcheck-timeout later (often 60s), with exit_code=0 in the session log — the exec failure was invisible. v0.13.0 resolves and verifies the target before signaling readiness, so the launch now fails immediately and names the path that couldn’t be exec’d.

Fix:

Confirm the path in your service(...) declaration exists in the VM and is executable: make env-exec CMD='ls -l /tmp/truck-api'.
Rebuild for the VM arch if it’s a cross-compile (GOOS=linux GOARCH=arm64). A host-built (darwin) binary copied into the VM produces an exec format error here.
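
When it is unclear which platform a binary was built for, the first bytes of the file settle it. The sketch below is plain Python, not a Faultbox command: describe_binary is a hypothetical helper that reads the ELF e_machine field (offset 18) and recognises the common Mach-O magic numbers, and the demo header is synthetic.

```python
import struct

ELF_MACHINES = {0x3E: "x86-64", 0xB7: "arm64"}   # EM_X86_64, EM_AARCH64
MACHO_MAGICS = {
    b"\xcf\xfa\xed\xfe",   # 64-bit Mach-O as stored on disk (little-endian)
    b"\xce\xfa\xed\xfe",   # 32-bit Mach-O as stored on disk (little-endian)
    b"\xca\xfe\xba\xbe",   # universal (fat) binary
}

def describe_binary(header):
    """Classify an executable from its first 20 bytes."""
    if header[:4] in MACHO_MAGICS:
        return "mach-o (darwin build, will not exec on Linux)"
    if header[:4] != b"\x7fELF" or len(header) < 20:
        return "unknown"
    # e_ident[5] gives the byte order: 1 = little-endian, 2 = big-endian.
    order = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(order + "H", header, 18)   # e_machine
    return "elf " + ELF_MACHINES.get(machine, f"machine 0x{machine:x}")

# Synthetic 64-bit little-endian ELF header with e_machine = AArch64.
demo = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 2, 0xB7)
print(describe_binary(demo))

# Against a real file (path taken from the example above):
# with open("/tmp/truck-api", "rb") as f:
#     print(describe_binary(f.read(20)))
```

An "elf" result for the wrong architecture, or any "mach-o" result, means the binary needs rebuilding for the VM.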

Related, for container mode: if the faultbox-shim binary is missing alongside faultbox, container services never reach service_ready. Build and install both with make install-lima (see README → Build from source). The shim is the container entrypoint; without it the container can’t start.
