Chapter 4: Traces & Assertions
Duration: 25 minutes Prerequisites: Chapter 0 (Setup) completed
Goals & Purpose
In chapters 2-3 you tested inputs and outputs: “I sent this request, I got this response.” But distributed systems bugs live in the middle — in the sequence of internal operations between request and response.
Consider: your API returns 200, but did it actually write to the WAL? Or did it skip the write and respond from cache? Both return 200, but one is correct and the other is a data loss bug.
The key insight: to verify distributed systems correctly, you need to assert on what happened inside, not just what came out. Faultbox records every intercepted syscall as an event — giving you a complete audit trail of the system’s internal behavior.
This chapter teaches you to:
- Query the syscall trace — “did this operation happen?”
- Assert temporal properties — “A happened before B”
- Prove absence — “this operation did NOT happen”
- Visualize causality — see which service did what, and when
After this chapter, you’ll write tests that don’t just check outputs — they verify the mechanism that produced them.
The demo system
Chapters 4-6 use the full demo: two services that form a simple order processing pipeline:
orders (HTTP :8080) ──→ inventory (TCP :5432 + WAL file)
- orders — HTTP API that receives order requests and forwards them to inventory
- inventory — TCP service that manages stock and writes a
write-ahead log (WAL) to
/tmp/inventory.wal
Create traces-test.star in the project root:
# Linux (native): BIN = "bin"
# macOS (Lima): BIN = "bin/linux"
BIN = "bin/linux"
inventory = service("inventory", BIN + "/inventory-svc",
interface("main", "tcp", 5432),
env = {"PORT": "5432", "WAL_PATH": "/tmp/inventory.wal"},
healthcheck = tcp("localhost:5432"),
)
orders = service("orders", BIN + "/order-svc",
interface("public", "http", 8080),
env = {"PORT": "8080", "INVENTORY_ADDR": inventory.main.addr},
depends_on = [inventory],
healthcheck = http("localhost:8080/health"),
)
You’ll add all test functions from this chapter to this file.
Verify the topology works with a quick happy-path test — add to traces-test.star:
def test_happy_path():
resp = orders.post(path="/orders", body='{"sku":"widget","qty":1}')
assert_eq(resp.status, 200)
Run it: Linux:
faultbox test traces-test.star --test happy_path
macOS (Lima):
make lima-run CMD="faultbox test traces-test.star --test happy_path"
--- PASS: test_happy_path (200ms, seed=0) ---
1 passed, 0 failed
Good — the topology works. Now add the trace assertion tests below.
The event log
Every intercepted syscall is recorded with:
- Sequence number — global ordering
- Timestamp — when it happened
- Service name — which component
- Syscall, PID, decision — what was intercepted and what faultbox decided
- File path — for file syscalls
- Vector clock — logical time per service (for causality)
Even allow decisions are recorded. The trace is a complete record.
trace() — observe without faulting
By default, Faultbox only records events for syscalls that have a
seccomp filter installed (via fault() or trace()). If you want to
observe what a service does without injecting any fault, use trace():
def test_happy_path_observed():
"""Watch what inventory does during a normal order — no faults."""
def scenario():
resp = orders.post(path="/orders", body='{"sku":"widget","qty":1}')
assert_eq(resp.status, 200)
# We can now see what syscalls inventory made.
assert_eventually(service="inventory", syscall="write", path="*.wal")
trace(inventory, syscalls=["write", "openat", "fsync"], run=scenario)
Run it: Linux:
faultbox test traces-test.star --test happy_path_observed
macOS (Lima):
make lima-run CMD="faultbox test traces-test.star --test happy_path_observed"
--- PASS: test_happy_path_observed (200ms, seed=0) ---
syscall trace (4 events):
trace rule on inventory: write=trace → filter:[write,writev,pwrite64], ...
trace() vs fault(): Both install seccomp filters. fault() modifies
syscall outcomes (deny, delay). trace() allows everything but records it.
Use trace() when you want to assert on internal behavior of a healthy
system — no faults, just observation.
There’s also an imperative form:
def test_observe_then_fault():
# Phase 1: observe healthy behavior.
trace_start(inventory, syscalls=["write", "fsync"])
resp = orders.post(path="/orders", body='{"sku":"widget","qty":1}')
assert_eq(resp.status, 200)
assert_eventually(service="inventory", syscall="write", path="*.wal")
trace_stop(inventory)
# Phase 2: now fault and verify error handling.
def break_it():
resp = orders.post(path="/orders", body='{"sku":"widget","qty":1}')
assert_true(resp.status != 200, "should fail under fault")
fault(inventory, write=deny("EIO"), run=break_it)
assert_eventually — “this happened”
The most common temporal assertion: verify something occurred.
def test_wal_written():
"""Slow writes — but WAL write still happened."""
def scenario():
resp = orders.post(path="/orders", body='{"sku":"widget","qty":1}')
assert_eq(resp.status, 200)
# Prove: the inventory service wrote specifically to the WAL file.
assert_eventually(service="inventory", syscall="write", path="*.wal")
fault(inventory, write=delay("100ms", label="slow WAL"), run=scenario)
The path="*.wal" filter checks not just “a write happened” but “a write
to a .wal file happened.” Without the path check, you’d also match writes
to stdout or TCP sockets — which don’t prove durability.
All filter parameters support glob patterns: path="*.wal", path="/tmp/*",
decision="deny*". For more complex conditions, use where=lambda:
# Lambda for conditions that globs can't express:
assert_eventually(
where=lambda e: e.service == "inventory"
and e.syscall == "write"
and int(e.fields.get("size", "0")) > 1024
)
Why a fault here? Trace assertions query the syscall event log. Events are only recorded for syscalls with a seccomp filter installed.
fault(inventory, write=delay(...))installs filters for[write,writev,pwrite64]— so only write-family events appear in the trace. The fd→path resolution (via/proc/PID/fd) gives us the file path for each write, making the path filter possible.
Run it: Linux:
faultbox test traces-test.star --test wal_written
macOS (Lima):
make lima-run CMD="faultbox test traces-test.star --test wal_written"
--- PASS: test_wal_written (500ms, seed=0) ---
syscall trace (4 events):
#8 inventory write delay(100ms) [slow WAL] (+100ms)
fault rule on inventory: write=delay(100ms) → filter:[write,writev,pwrite64] label="slow WAL"
Why this matters: the API returned 200, but assert_eventually proves
the write actually happened. Without this, you’re only testing the
happy-path HTTP response — not the durability guarantee.
assert_never — “this didn’t happen”
Prove that something did NOT occur — equally important for correctness:
def test_no_write_when_unreachable():
"""When inventory is unreachable, no denied connect should go through."""
def scenario():
resp = orders.post(path="/orders", body='{"sku":"widget","qty":1}')
assert_eq(resp.status, 503)
# Prove: every connect from orders was denied.
assert_never(
service="orders",
syscall="connect",
decision="allow",
)
fault(orders, connect=deny("ECONNREFUSED", label="inventory unreachable"), run=scenario)
Why not
assert_never(service="inventory", syscall="write")? Inventory has background writes (startup logs, healthcheck responses) that are unrelated to order processing. Since we can’t path-filter writes (see Chapter 1 callout), we assert on the cause (connect denied) rather than the effect (no writes).
Run it: Linux:
faultbox test traces-test.star --test no_write_when_unreachable
macOS (Lima):
make lima-run CMD="faultbox test traces-test.star --test no_write_when_unreachable"
--- PASS: test_no_write_when_unreachable (200ms, seed=0) ---
fault rule on orders: connect=deny(ECONNREFUSED) → filter:[connect]
#8 orders connect deny(connection refused)
Why this matters: assert_never with decision="allow" proves that
no connection succeeded. If any did, inventory might have received a
partial request — a potential data inconsistency bug.
assert_before — ordering guarantees
Verify that events happened in the correct order:
def test_delay_then_deny():
"""Slow writes come before the fsync denial."""
def scenario():
resp = orders.post(path="/orders", body='{"sku":"widget","qty":1}')
assert_true(resp.status != 200, "expected failure on fsync deny")
assert_before(
first={"service": "inventory", "syscall": "write"},
then={"service": "inventory", "decision": "deny*"},
)
fault(inventory, write=delay("100ms", label="slow WAL"), fsync=deny("EIO", label="sync failure"), run=scenario)
Note: Both fault keywords are in a single
fault()call, so both filters are installed together.assert_beforeverifies that the delayed write happened before the denied fsync — confirming the WAL write attempt preceded the sync failure.
Run it: Linux:
faultbox test traces-test.star --test delay_then_deny
macOS (Lima):
make lima-run CMD="faultbox test traces-test.star --test delay_then_deny"
--- PASS: test_delay_then_deny (350ms, seed=0) ---
assert_before: inventory.write → inventory.fsync(deny) ✓
fault rule on inventory: write=delay(100ms), fsync=deny(EIO)
Why this matters: if the write happens before the WAL open, the system
has a durability bug — it confirmed before persisting. assert_before catches
this regardless of timing.
Filter parameters
All temporal assertions accept the same filters:
| Parameter | Matches | Supports glob |
|---|---|---|
service | Service name | No |
syscall | Syscall name | No |
path | File path | No |
decision | Fault decision | Yes: "deny*" matches any denial |
Glob examples: decision="deny*" matches "deny(EIO)", "deny(ENOSPC)", etc.
events() — query for custom logic
Get matching events as a list for advanced assertions:
def test_write_count():
"""Count how many writes inventory makes for a single order."""
def scenario():
resp = orders.post(path="/orders", body='{"sku":"widget","qty":1}')
assert_eq(resp.status, 200)
writes = events(service="inventory", syscall="write")
print("inventory writes per order:", len(writes))
assert_true(len(writes) >= 1, "expected at least 1 WAL write")
assert_true(len(writes) < 10, "too many writes for a single order")
fault(inventory, write=delay("10ms"), run=scenario)
events() returns a list of matching events. You can filter by any
combination of service, syscall, decision, path — then use
len(), loop, or inspect individual fields.
Run it: Linux:
faultbox test traces-test.star --test write_count
macOS (Lima):
make lima-run CMD="faultbox test traces-test.star --test write_count"
--- PASS: test_write_count (250ms, seed=0) ---
inventory writes per order: 3
fault rule on inventory: write=delay(10ms) → filter:[write,writev,pwrite64]
Trace output formats
JSON trace — for programmatic analysis
Linux:
faultbox test traces-test.star --test wal_written --output trace.json
macOS (Lima):
make lima-run CMD="faultbox test traces-test.star --test wal_written --output trace.json"
Structured JSON with every event, PObserve-compatible fields, vector clocks, and replay commands for failed tests.
ShiViz — for visual causality
Linux:
faultbox test traces-test.star --test wal_written --shiviz trace.shiviz
macOS (Lima):
make lima-run CMD="faultbox test traces-test.star --test wal_written --shiviz trace.shiviz"
Open at https://bestchai.bitbucket.io/shiviz/ to see a space-time diagram with arrows between services. Each service gets its own swimlane, plus a “test” swimlane representing the test driver (your step calls).
What you’ll see in the diagram:
- Causal arrows between swimlanes — e.g., test sends a request to orders, orders connects to inventory
- Fault labels on events —
[slow WAL],[sync failure]— so you can identify which fault rule affected each syscall - Paths and latency — file paths for I/O syscalls,
(+100ms)for delays - VIOLATION markers — if a test fails, a
VIOLATION [test_name] reasonevent appears at the failure point in the “test” swimlane
Normalized trace — for determinism verification
The problem: you fix a bug, and the tests pass. But did the fix change the system’s behavior in unexpected ways? Maybe the fix reordered internal operations, or removed a retry, or changed which syscalls get called. The tests still pass, but the system behaves differently — and you won’t know until production.
The solution: capture a normalized trace (stripped of timestamps, PIDs, and other non-deterministic fields) before and after your change. If the traces are identical, the behavior is unchanged. If they differ, the diff shows exactly what changed.
Step 1: Capture a baseline trace: Linux:
faultbox test traces-test.star --test wal_written --normalize trace-before.norm
macOS (Lima):
make lima-run CMD="faultbox test traces-test.star --test wal_written --normalize trace-before.norm"
Step 2: Make a small code change. Open demo/inventory-svc/main.go,
find the walAppend function (~line 151), and add a duplicate write
right after the existing one:
In demo/inventory-svc/main.go, find the walAppend function and
add a duplicate write after the existing WriteString:
if _, err := f.WriteString(entry); err != nil { // ← existing line
return fmt.Errorf("write: %w", err)
}
f.WriteString(entry) // ← add this line (double-write bug)
Rebuild and capture again: Linux:
make build
macOS (Lima):
make lima-build
Step 3: Compare: Linux:
faultbox diff trace-before.norm trace-after.norm
macOS (Lima):
make lima-run CMD="faultbox diff trace-before.norm trace-after.norm"
If the system is deterministic (same seed + same binary), the output is:
traces are identical
If the behavior changed (the double-write you just added), you see:
traces differ (12 vs 13 lines):
line 8:
run1: write allow /tmp/inventory.wal
run2: write allow /tmp/inventory.wal
line 9:
run1: fsync allow
run2: write allow /tmp/inventory.wal ← extra write (your bug)
The diff shows exactly one extra write to the WAL — the double-write
you added. Revert your change (git checkout demo/inventory-svc/main.go)
and rebuild to restore the original behavior.
When to use this:
- Before/after a refactor — prove no behavioral change
- In CI — capture a baseline trace, fail if a PR changes it unexpectedly
- Debugging flaky tests — run twice with the same seed, diff to find non-determinism
What you learned
assert_eventually()proves something happened in the syscall traceassert_never()proves something did NOT happenassert_before()proves ordering between eventsevents()returns matching events for custom logic- Trace output: JSON (programmatic), ShiViz (visual), normalized (determinism)
The mental model: don’t just test inputs and outputs. Test the mechanism: “the WAL was written before the response,” “no data was persisted for a failed operation,” “retries happened fewer than N times.”
What’s next
Your tests verify behavior under specific scenarios. But distributed systems bugs often depend on timing — which request arrives first, which write completes first. The same code can work in one ordering and fail in another.
Chapter 5 introduces parallel() — running operations concurrently and
exploring different interleavings to find timing-dependent bugs.
Exercises
-
WAL ordering: Write a test that places an order and asserts:
- The WAL was opened (
assert_eventuallywith openat + path) - The WAL was written to (
assert_eventuallywith write) - The open happened before the write (
assert_before)
- The WAL was opened (
-
Clean happy path: Write a happy-path test and assert zero denied syscalls:
denied = events(decision="deny*") assert_eq(len(denied), 0, "no syscalls should be denied") -
Connection count: Write a test that counts
connectsyscalls from the orders service. Is it 1 per request? More? Why? -
ShiViz visualization: Run with
--shiviz trace.shivizand open the file in ShiViz. Find thevector_clockarrows. What does a merge arrow between two services represent?