Choosing Fault Levels: Syscall vs Protocol
Faultbox operates at two levels. This guide helps you decide which to use and when to combine them.
Two levels, one API
# Syscall level: affects ALL writes by the db process
disk_error = fault_assumption("disk_error",
    target = db,
    write = deny("EIO"),
)

# Protocol level: affects only INSERT queries to the orders table
insert_fail = fault_assumption("insert_fail",
    target = db.pg,
    rules = [error(query="INSERT INTO orders*", message="disk full")],
)
The target determines the level:
- Service (db) → syscall level (seccomp-notify)
- Interface reference (db.pg) → protocol level (transparent proxy)
Both are fault_assumption() values — composable, reusable, nameable.
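As a rough mental model only (this is not Faultbox's implementation), the dispatch on target type can be sketched in plain Python. The `Service` and `InterfaceRef` classes and the `fault_level` function are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical stand-in for a whole service such as `db`."""
    name: str


@dataclass(frozen=True)
class InterfaceRef:
    """Hypothetical stand-in for one interface of a service, such as `db.pg`."""
    service: Service
    protocol: str


def fault_level(target) -> str:
    """Return the level at which a fault on this target would apply."""
    if isinstance(target, InterfaceRef):
        return "protocol"  # one interface: faults are applied by the proxy
    if isinstance(target, Service):
        return "syscall"   # whole process: faults intercept syscalls
    raise TypeError(f"unsupported target: {target!r}")


db = Service("db")
db_pg = InterfaceRef(db, "postgres")
print(fault_level(db))     # syscall
print(fault_level(db_pg))  # protocol
```

The point of the sketch is that the fault definition itself never names a level; the level follows from what the target is.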
When to use syscall faults
Syscall faults simulate infrastructure failures — the kind that affect everything a service does, not just specific operations.
| Scenario | Fault | What it simulates |
|---|---|---|
| Server disk dies | write=deny("EIO") | Every write fails |
| Disk fills up | write=deny("ENOSPC") | No space for any write |
| Network cable unplugged | connect=deny("ECONNREFUSED") | Can’t reach anything |
| Network is slow | connect=delay("2s") | Every connection takes 2s |
| Total partition | partition(svc_a, svc_b) | Bidirectional network split |
Strengths:
- Works on ANY binary — no protocol support needed
- Catches unexpected write paths (logging, temp files, metrics)
- Simulates real infrastructure failures accurately
- Simple: one line tests a broad category
Weaknesses:
- Coarse: write=deny("EIO") blocks everything (stdout, TCP, files)
- Can’t target specific queries, paths, or commands
- May break service health (can’t respond to healthchecks under write fault)
Best for: “is the infrastructure broken?” questions.
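From the point of view of the service under test, a syscall fault surfaces as an ordinary OS error. The sketch below shows, in Python, how application code might react to the error codes used in the table. The `classify_os_error` function and its policy names are hypothetical and only illustrate the kind of handling these faults exercise.

```python
import errno


def classify_os_error(exc: OSError) -> str:
    """Map an errno to a coarse handling decision (illustrative policy only)."""
    if exc.errno == errno.ENOSPC:
        return "shed-writes"   # disk full: stop accepting new writes
    if exc.errno == errno.EIO:
        return "fail-fast"     # I/O error: storage cannot be trusted
    if exc.errno == errno.ECONNREFUSED:
        return "retry-later"   # dependency unreachable: back off and retry
    return "unknown"


# Simulate what the service observes when a write is denied with EIO.
try:
    raise OSError(errno.EIO, "Input/output error")
except OSError as exc:
    decision = classify_os_error(exc)

print(decision)  # fail-fast
```

Because the fault hits every write, a handler like this is reached from all write paths, including ones such as logging that are easy to forget.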
When to use protocol faults
Protocol faults simulate application-level failures — one operation fails while the rest of the service works normally.
| Scenario | Fault | What it simulates |
|---|---|---|
| One SQL query fails | error(query="INSERT*") | DB rejects a specific insert |
| HTTP upstream returns 429 | response(path="/api/*", status=429) | Rate limiting |
| Kafka message dropped | drop(topic="orders") | Message loss on one topic |
| Redis SET fails | error(command="SET") | Write to cache fails |
| Slow specific endpoint | delay(path="/search*", delay="2s") | One endpoint is slow |
Strengths:
- Precise: target specific queries, paths, commands, topics
- Realistic: real services fail at the query level, not the disk level
- Service stays healthy — healthchecks and other operations work normally
- Tests error handling for specific code paths
Weaknesses:
- Only works for supported protocols (HTTP, Postgres, Redis, Kafka, etc.)
- Proxy adds latency (usually <1ms, but measurable)
- Can’t simulate low-level failures (disk corruption, kernel panics)
Best for: “does this specific operation handle errors correctly?” questions.
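To get a feel for what a pattern such as `INSERT INTO orders*` selects, here is a small Python sketch using the standard library's `fnmatch` module. It assumes the trailing `*` behaves like a shell glob; Faultbox's actual matching rules (case sensitivity, whitespace handling) may differ, so treat this as an approximation.

```python
from fnmatch import fnmatchcase

# Assumption: the rule pattern is a case-sensitive, shell-style glob.
PATTERN = "INSERT INTO orders*"

queries = [
    "INSERT INTO orders (id, total) VALUES (1, 9.99)",
    "INSERT INTO users (id) VALUES (7)",
    "SELECT * FROM orders",
]

# Only queries matching the pattern would receive the injected error.
matched = [q for q in queries if fnmatchcase(q, PATTERN)]
print(matched)
```

Under glob semantics the same pattern would also match a query against a table named `orders_archive`, so a prefix pattern can select more than intended.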
Decision table
| Question | Level | Example |
|---|---|---|
| “What if the DB server is completely down?” | Syscall | connect=deny("ECONNREFUSED") |
| “What if this INSERT query fails?” | Protocol | error(query="INSERT INTO orders*") |
| “What if the disk is full?” | Syscall | write=deny("ENOSPC") |
| “What if this HTTP endpoint returns 500?” | Protocol | response(path="/api/v1/orders", status=500) |
| “What if the network is slow?” | Syscall | connect=delay("2s") |
| “What if this one Kafka topic drops messages?” | Protocol | drop(topic="order-events") |
| “What if Redis SET fails but GET works?” | Protocol | error(command="SET") |
| “What if two services can’t talk to each other?” | Syscall | partition(api, db) |
| “What if the WAL fsync fails?” | Syscall | fsync=deny("EIO") |
| “What if the gRPC method returns UNAVAILABLE?” | Protocol | error(method="/orders.OrderService/Create") |
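The rule of thumb behind the table can be written down as a tiny Python helper. This is a simplification for illustration: `choose_level` and its two parameters are hypothetical, and real decisions may weigh more factors.

```python
def choose_level(targets_one_operation: bool, protocol_supported: bool) -> str:
    """Pick a fault level following the rule of thumb in the decision table."""
    # Protocol faults need both a specific operation to aim at and a
    # protocol the proxy understands; otherwise fall back to syscalls.
    if targets_one_operation and protocol_supported:
        return "protocol"
    return "syscall"


print(choose_level(targets_one_operation=True, protocol_supported=True))
print(choose_level(targets_one_operation=False, protocol_supported=True))
```

The fallback reflects the trade-off described earlier: syscall faults work on any binary, while protocol faults require a supported protocol.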
Combining both levels
Use composition to combine syscall and protocol faults into a single assumption:
# Syscall-level: DB disk is slow
db_slow = fault_assumption("db_slow",
    target = db,
    write = delay("500ms"),
)

# Protocol-level: upstream rate limits POST requests
rate_limited = fault_assumption("rate_limited",
    target = api.http,
    rules = [response(method="POST", path="/orders*", status=429)],
)

# Combined: both active simultaneously
degraded = fault_assumption("degraded",
    faults = [db_slow, rate_limited],
    description = "upstream rate-limited AND local disk slow",
)

# Use in a matrix alongside individual faults
fault_matrix(
    scenarios = [create_order, get_order],
    faults = [db_slow, rate_limited, degraded],
    overrides = {
        (create_order, rate_limited): lambda r: assert_eq(r.status, 429),
        (create_order, degraded): lambda r: assert_eq(r.status, 429),
        (get_order, db_slow): lambda r: assert_true(r.duration_ms > 400),
    },
)
When to combine:
- Testing graceful degradation (some operations fail, others slow)
- Testing cascading failures (upstream error + local resource issue)
- Testing that partial failures don’t corrupt state
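The cross-product behind the matrix can be illustrated in plain Python. This is a toy model, not how Faultbox runs tests: scenarios and faults are represented by their names as strings, expectations are shown as text, and the `"default"` label for cells without an override is an assumption made for the sketch.

```python
from itertools import product

scenarios = ["create_order", "get_order"]
faults = ["db_slow", "rate_limited", "degraded"]

# Cells with a specific expectation; every other cell uses the default.
overrides = {
    ("create_order", "rate_limited"): "status == 429",
    ("create_order", "degraded"): "status == 429",
    ("get_order", "db_slow"): "duration_ms > 400",
}

# One cell per (scenario, fault) pair, with its override if one exists.
cells = [
    (s, f, overrides.get((s, f), "default"))
    for s, f in product(scenarios, faults)
]

for scenario, fault, expectation in cells:
    print(f"{scenario:13} x {fault:13} -> {expectation}")
```

Two scenarios and three faults yield six cells, of which three carry an explicit expectation. Adding one composed assumption such as `degraded` adds a full column to the matrix, so composition grows coverage with little extra specification.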
Progression for a new project
1. Start with syscall-level fault_assumption(): connect=deny and write=deny for each dependency. This covers the broad “is X broken?” category.
2. Add protocol-level assumptions for critical paths: specific queries, endpoints, or commands where you need precise error handling.
3. Compose into a fault_matrix(): a cross-product of scenarios × assumptions. Add overrides for expected behavior.
4. Combine levels via faults=[syscall_assumption, protocol_assumption] for realistic degradation scenarios.
Most projects get 80% of the value from step 1 alone.