Fault Testing Methodology

How to think about fault testing for distributed systems. This guide gives you the mental model and process — not which API to call, but how to approach your system systematically.

The core question

Every distributed system makes assumptions about its environment:

  • “The database is available”
  • “Writes succeed”
  • “The network is fast”
  • “Messages are delivered exactly once”

Fault testing asks: what happens when each assumption breaks?

Not “does the system crash?” — that’s the easy case. The hard question is: does the system behave correctly when things go wrong? Does it return useful errors? Does it preserve data integrity? Does it recover?

The three layers

Fault testing has three layers. In the domain-centric model, each layer is a separate artifact that composes with the others:

Layer 1: Scenarios (what the system does)

Define your critical user flows as probes — functions that exercise the system and return an observable result:

def create_order():
    return api.post(path="/orders", body='{"item":"widget","qty":1}')

scenario(create_order)

def health_check():
    return api.get(path="/health")

scenario(health_check)

Why this comes first: if the happy path is broken, fault tests are meaningless. Scenarios run as tests on their own — test_create_order verifies the system works before you inject failures.

How much: one scenario per critical user flow. Not exhaustive input testing — that’s unit tests. Focus on: “does the complete flow work end-to-end?”

Layer 2: Fault assumptions (what can go wrong)

Name each failure mode as a fault assumption — a reusable, composable definition of what breaks:

db_down = fault_assumption("db_down",
    target = db,
    connect = deny("ECONNREFUSED"),
)

db_slow = fault_assumption("db_slow",
    target = db,
    connect = delay("3s"),
)

disk_full = fault_assumption("disk_full",
    target = db,
    write = deny("ENOSPC"),
)

How to think about it: for each dependency, ask three questions:

  1. What if it’s down? (connect refused)
  2. What if it’s slow? (delay)
  3. What if it errors? (I/O error, disk full)

That gives you 3 fault assumptions per dependency — a good starting point.
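Applied mechanically, the three questions turn a dependency list into a baseline fault set. A plain-Python sketch of the enumeration (not the Faultbox DSL; the names and descriptions are illustrative):

```python
# Sketch: enumerate baseline failure modes for each dependency.
# The three questions map to three failure modes apiece.
FAILURE_MODES = [
    ("down", "connect refused"),
    ("slow", "delay"),
    ("errors", "I/O error or disk full"),
]

def enumerate_fault_assumptions(dependencies):
    """Return (name, description) pairs: one per dependency x failure mode."""
    assumptions = []
    for dep in dependencies:
        for mode, description in FAILURE_MODES:
            assumptions.append((f"{dep}_{mode}", description))
    return assumptions

baseline = enumerate_fault_assumptions(["db", "cache", "auth"])
# 3 dependencies x 3 failure modes = 9 starting fault assumptions
```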

Layer 3: Invariants (what must always hold)

Define properties that must hold regardless of which scenario runs under which fault. Attach them to fault assumptions as monitors:

def no_negative_stock(event):
    if event.get("stock") and int(event["stock"]) < 0:
        fail("stock went negative: " + event["stock"])

stock_invariant = monitor(no_negative_stock, service="inventory")

# Attach to every fault assumption where this matters:
db_down = fault_assumption("db_down",
    target = db,
    connect = deny("ECONNREFUSED"),
    monitors = [stock_invariant],
)

Why this is the hardest layer: a test checks one scenario under one fault. An invariant must hold across ALL combinations — including ones you haven’t thought of. The monitor fires automatically in every matrix cell that uses the fault assumption.

Examples of invariants:

  • Money is never created or destroyed (financial systems)
  • An order confirmed to the user is always persisted in the database
  • No duplicate messages are published to the event bus
  • A service that returns 200 has actually committed the data
  • Failed operations leave no partial state (atomicity)
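The first invariant can be made concrete with a plain-Python sketch (independent of the monitor API; the event shape is an assumption for illustration). Money conservation means every debit is matched by an equal credit, so the net change across all accounts is zero:

```python
def check_money_conserved(events):
    """Invariant: across all transfer events, the net change is zero.
    A violation means money was created or destroyed."""
    net = 0
    for event in events:
        if event["type"] == "debit":
            net -= event["amount"]
        elif event["type"] == "credit":
            net += event["amount"]
    if net != 0:
        raise AssertionError(f"money created or destroyed: net change {net}")

# A transfer of 100 seen as a matched debit/credit pair passes:
check_money_conserved([
    {"type": "debit", "amount": 100},
    {"type": "credit", "amount": 100},
])
```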

Composing the layers: the fault matrix

The three layers compose into a matrix — the cross-product of scenarios and fault assumptions:

fault_matrix(
    scenarios = [create_order, health_check],
    faults = [db_down, db_slow, disk_full],
    default_expect = lambda r: assert_true(r != None),
    overrides = {
        (create_order, db_down): lambda r: (
            assert_eq(r.status, 503),
            assert_true("database" in r.body.lower()),
        ),
        (create_order, db_slow): lambda r: (
            assert_true(r.status in [200, 504]),
            assert_true(r.duration_ms < 5000, "should timeout, not hang"),
        ),
    },
)

2 scenarios × 3 faults = 6 tests from one declaration. Each cell inherits monitors from its fault assumption. Overrides specify expected behavior where it matters.
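The expansion itself is just a cross-product with an override lookup. A plain-Python model of the composition (not the fault_matrix implementation; names are illustrative):

```python
from itertools import product

def expand_matrix(scenarios, faults, default_expect, overrides=None, exclude=None):
    """Cross-product of scenarios and faults; each cell gets its
    override expectation if one exists, else the default."""
    overrides = overrides or {}
    exclude = set(exclude or [])
    cells = {}
    for scenario, fault in product(scenarios, faults):
        if (scenario, fault) in exclude:
            continue
        cells[(scenario, fault)] = overrides.get((scenario, fault), default_expect)
    return cells

smoke = lambda r: r is not None           # default: survived
strict = lambda r: r == 503               # override: correct error
cells = expand_matrix(
    scenarios=["create_order", "health_check"],
    faults=["db_down", "db_slow", "disk_full"],
    default_expect=smoke,
    overrides={("create_order", "db_down"): strict},
)
# 2 x 3 = 6 cells; one carries the override, five the smoke default
```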

The dependency matrix

The most practical technique for finding what to test.

Step 1: List your services and dependencies

┌──────────┬────────────────┬──────────────┐
│ Service  │ Dependencies   │ Protocol     │
├──────────┼────────────────┼──────────────┤
│ api      │ db, cache, auth│ HTTP         │
│ worker   │ db, kafka, s3  │ HTTP (admin) │
│ auth     │ db             │ gRPC         │
│ db       │ disk           │ Postgres     │
│ cache    │ (memory only)  │ Redis        │
│ kafka    │ disk           │ Kafka        │
└──────────┴────────────────┴──────────────┘

Step 2: For each dependency, enumerate failure modes as fault assumptions

# api → db
api_db_down     = fault_assumption("api_db_down",     target=db,    connect=deny("ECONNREFUSED"))
api_db_slow     = fault_assumption("api_db_slow",     target=db,    write=delay("3s"))
api_db_diskfull = fault_assumption("api_db_diskfull", target=db,    write=deny("ENOSPC"))

# api → cache
api_cache_down  = fault_assumption("api_cache_down",  target=cache, connect=deny("ECONNREFUSED"))
api_cache_slow  = fault_assumption("api_cache_slow",  target=cache, read=delay("2s"))

# api → auth
api_auth_down   = fault_assumption("api_auth_down",   target=auth,  connect=deny("ECONNREFUSED"))

Step 3: Prioritize by impact

Not all failures are equal. Prioritize by:

  1. Data loss risk — can this failure cause data to be lost or corrupted?
  2. User impact — does the user see an error, or does the system silently fail?
  3. Recovery difficulty — can the system recover automatically, or does it need manual intervention?
  4. Frequency in production — how often does this actually happen?

Start with the top-left corner: high data-loss risk + high frequency.
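One way to make the prioritization concrete is a weighted score. A plain-Python sketch; the weights and the 0-3 scales are assumptions to adapt to your own risk profile:

```python
def priority_score(mode):
    """Rank failure modes: data-loss risk and production frequency
    dominate; user impact and recovery difficulty break ties."""
    return (
        4 * mode["data_loss_risk"]    # 0-3: can data be lost or corrupted?
        + 3 * mode["frequency"]       # 0-3: how often in production?
        + 2 * mode["user_impact"]     # 0-3: visible error vs silent failure
        + 1 * mode["recovery_cost"]   # 0-3: automatic vs manual recovery
    )

modes = [
    {"name": "db_diskfull", "data_loss_risk": 3, "frequency": 1, "user_impact": 3, "recovery_cost": 3},
    {"name": "cache_down",  "data_loss_risk": 0, "frequency": 2, "user_impact": 1, "recovery_cost": 0},
    {"name": "db_slow",     "data_loss_risk": 1, "frequency": 3, "user_impact": 2, "recovery_cost": 1},
]
ranked = sorted(modes, key=priority_score, reverse=True)
# db_diskfull (score 24) outranks db_slow (18) and cache_down (8)
```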

Step 4: Compose into a matrix

fault_matrix(
    scenarios = [create_order, list_orders, health_check],
    faults = [api_db_down, api_db_slow, api_db_diskfull, api_cache_down],
    default_expect = lambda r: assert_true(r != None),
    overrides = {
        (create_order, api_db_down):     lambda r: assert_eq(r.status, 503),
        (create_order, api_db_diskfull): lambda r: assert_eq(r.status, 503),
        (create_order, api_db_slow):     lambda r: assert_true(r.duration_ms < 5000),
    },
    exclude = [
        (health_check, api_db_diskfull),  # health check doesn't write
    ],
)

3 scenarios × 4 faults - 1 excluded = 11 tests, each with targeted expectations or smoke-test defaults.

Syscall vs protocol: decision framework

┌─────────────────────────────────┬────────────────────────────────────────────────┬───────────────────────────────────┐
│ I want to test…                 │ Use                                            │ Why                               │
├─────────────────────────────────┼────────────────────────────────────────────────┼───────────────────────────────────┤
│ “DB is completely down”         │ connect = deny("ECONNREFUSED")                 │ Syscall: blocks all connections   │
│ “This specific SQL query fails” │ rules = [error(query="INSERT INTO orders*")]   │ Protocol: targets one query       │
│ “Disk is full”                  │ write = deny("ENOSPC")                         │ Syscall: affects all writes       │
│ “Slow reads from one table”     │ rules = [delay(query="SELECT * FROM orders*")] │ Protocol: targets one query       │
│ “Redis SET fails but GET works” │ rules = [error(command="SET")]                 │ Protocol: targets one command     │
│ “Total network partition”       │ partition(api, db) (standalone test)           │ Syscall: blocks connect both ways │
│ “HTTP 429 from upstream”        │ rules = [response(status=429)]                 │ Protocol: specific HTTP response  │
│ “Kafka message loss”            │ rules = [drop(topic="orders")]                 │ Protocol: drops specific messages │
└─────────────────────────────────┴────────────────────────────────────────────────┴───────────────────────────────────┘

Rule of thumb:

  • Start with syscall faults in fault_assumption() for “is X down?” and “is the disk broken?” — they’re simpler and catch broad categories
  • Add protocol faults for “does this specific query/path/command fail correctly?” — they’re more precise
  • Use both via composition when you need precision on some paths and broad coverage on others

Process: from zero to covered

Week 1: Foundation

  1. Write a Faultbox spec with your topology (service(), depends_on, healthcheck)
  2. Write scenario probes for your 3-5 most critical user flows
  3. Run them — fix any issues

Week 2: Fault assumptions + matrix

  1. Build the dependency matrix (services × dependencies × failure modes)
  2. Create fault_assumption() for the top 10 highest-impact failure modes
  3. Compose a fault_matrix() — discover missing error handling, fix it

Week 3: Invariants

  1. Identify 3-5 invariants your system must maintain
  2. Write monitor() definitions and attach to fault assumptions
  3. Re-run the matrix — monitors catch issues across all cells automatically

Week 4: Concurrency & edge cases

  1. Add parallel() in scenarios for concurrent operations
  2. Run --explore=all on critical tests to find timing-dependent bugs
  3. Add protocol-level faults (via rules=) for precise targeting

Ongoing

  • When you add a new service: add it to the topology, write scenario + top 3 fault assumptions
  • When you fix a production incident: create a fault assumption that reproduces it
  • When you add a new dependency: add its failure modes to the matrix

How to know you’re done

You’re never “done” — but you have good coverage when:

  1. Every dependency has at least “down” and “slow” fault assumptions — the minimum
  2. Your top 3 invariants are monitors on fault assumptions — they run on every matrix cell
  3. Production incidents are reproducible — every past outage has a corresponding fault assumption
  4. The team defines fault assumptions for new features — it’s part of the development process, not a separate activity

The metric: the number of fault assumptions defined as a fraction of the possible failure modes in your dependency matrix. 50% is good. 80% is excellent. 100% is unnecessary: returns diminish once the high-impact failures are covered.
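The metric is simple enough to track automatically. A plain-Python sketch using the thresholds above (the function name and shape are illustrative):

```python
def fault_coverage(defined, possible):
    """Ratio of fault assumptions defined to failure modes enumerated
    in the dependency matrix, with a rough verdict."""
    ratio = defined / possible
    if ratio >= 0.8:
        return ratio, "excellent"
    if ratio >= 0.5:
        return ratio, "good"
    return ratio, "keep going"

# e.g. 6 dependencies x 3 failure modes = 18 possible; 10 defined so far
ratio, verdict = fault_coverage(defined=10, possible=18)
# ratio ~0.56 -> "good"
```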

Best practices

Always verify the fault fired

A common mistake: the fault assumption targets a syscall the service never calls. The test passes — but only because the fault was never exercised.

Bad: passes silently even if the fault never fired:

fault_scenario("order_db_write_fail",
    scenario = create_order,
    faults = db_write_fail,
    # No expect → smoke test → passes if scenario completes
)

Good: verify the fault actually intercepted a syscall:

fault_scenario("order_db_write_fail",
    scenario = create_order,
    faults = db_write_fail,
    expect = lambda r: (
        # 1. The service returned an error (not silent success)
        assert_true(r.status >= 400, "should fail when DB can't write"),
        # 2. The fault actually fired (not bypassed)
        assert_eventually(type="syscall", service="db", decision="deny*"),
    ),
)

assert_eventually(decision="deny*") confirms a syscall was intercepted and denied. If the service never attempted a write, this fails — telling you the test didn’t exercise the intended code path.
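The check this performs can be modeled in plain Python: scan the recorded events for at least one syscall that matched the fault's deny decision. The event shape and helper name are assumptions for illustration:

```python
from fnmatch import fnmatch

def fault_fired(events, service, decision_pattern="deny*"):
    """True if any recorded syscall event for the service matched the
    deny decision, i.e. the fault was actually exercised."""
    return any(
        e["type"] == "syscall"
        and e["service"] == service
        and fnmatch(e["decision"], decision_pattern)
        for e in events
    )

events = [
    {"type": "syscall", "service": "db", "decision": "allow"},
    {"type": "syscall", "service": "db", "decision": "deny:ENOSPC"},
]
# The deny event proves the write fault intercepted a syscall;
# a log with only "allow" decisions means the fault never fired.
```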

Monitors vs expect: complementary, not redundant

They check different things:

┌───────────────────────┬────────────────────────────────────┬───────────────────────────────────────┐
│ Tool                  │ What it checks                     │ When it fires                         │
├───────────────────────┼────────────────────────────────────┼───────────────────────────────────────┤
│ Monitor on assumption │ “IF X happens THEN Y must be true” │ On every matching event, in real-time │
│ Expect on scenario    │ “The result must look like this”   │ After the scenario completes          │
└───────────────────────┴────────────────────────────────────┴───────────────────────────────────────┘

Monitor example: “If a Kafka event was published, a DB row must exist.” This is a real-time invariant — it only fires when a Kafka event actually appears. If the service never publishes (because the fault prevented it), the monitor stays silent. That’s correct — the invariant wasn’t violated.

def no_orphan_events(event):
    if event["type"] == "topic" and event.get("topic") == "order-created":
        db_rows = events(where=lambda e:
            e.type == "wal" and e.data.get("action") == "INSERT")
        if len(db_rows) == 0:
            fail("Kafka event published without DB insert")

orphan_check = monitor(no_orphan_events)

Expect example: “The service must not publish a Kafka event at all.” This is a post-hoc assertion — it checks what happened (or didn’t happen) after the scenario completes.

fault_scenario("order_db_write_fail",
    scenario = create_order,
    faults = db_write_fail,
    expect = lambda r: (
        assert_true(r.status >= 400),
        # Verify: no Kafka event was published
        assert_never(where=lambda e:
            e.type == "topic" and e.data.get("topic") == "order-created"),
        # Verify: the fault actually fired
        assert_eventually(type="syscall", service="db", decision="deny*"),
    ),
)

Use both together:

  • The monitor catches the violation if it happens (real-time safety net)
  • The expect verifies the observable outcome is correct (post-hoc check)
  • assert_eventually(decision="deny*") confirms the fault was exercised

Name fault assumptions by failure mode, not by test

Bad: names tied to specific tests:

test1_fault = fault_assumption("test1_fault", target=db, connect=deny("ECONNREFUSED"))
test2_fault = fault_assumption("test2_fault", target=db, write=deny("EIO"))

Good: names describe the failure mode:

db_down = fault_assumption("db_down", target=db, connect=deny("ECONNREFUSED"))
disk_io_error = fault_assumption("disk_io_error", target=db, write=deny("EIO"))

Good names appear in the matrix report and make results readable:

                 │ db_down      │ disk_io_error
─────────────────┼──────────────┼──────────────
create_order     │ PASS (210ms) │ FAIL

Anti-patterns

“Test everything at once” — don’t inject 5 faults simultaneously on day 1. Start with one fault assumption, one scenario. Understand the behavior. Then combine via composition.

“Only check that it doesn’t crash” — if all your matrix cells use default_expect = lambda r: assert_true(r != None), you’re testing that the system survives — not that it behaves correctly. Add overrides with specific expectations for critical cells.

“100% fault coverage” — not every failure mode matters. An fsync fault on a service that never calls fsync is noise. Focus on failures that happen in production.

“Faults in unit tests” — Faultbox tests integration between services. If you’re testing a single function’s error handling, a Go test with a mock is simpler. Use Faultbox when you need to verify behavior across service boundaries.

“Fix the test, not the code” — when a fault matrix cell fails, the instinct is to adjust the override. Resist it. The failing cell is showing you a real bug — the service doesn’t handle this failure correctly. Fix the service, then the cell passes.