Choosing Fault Levels: Syscall vs Protocol

Faultbox operates at two levels. This guide helps you decide which to use and when to combine them.

Two levels, one API

# Syscall level: affects ALL writes by the db process
disk_error = fault_assumption("disk_error",
    target = db,
    write = deny("EIO"),
)

# Protocol level: affects only INSERT queries to the orders table
insert_fail = fault_assumption("insert_fail",
    target = db.pg,
    rules = [error(query="INSERT INTO orders*", message="disk full")],
)

The target determines the level:

Service (db) → syscall level (seccomp-notify)
Interface reference (db.pg) → protocol level (transparent proxy)

Both are fault_assumption() values — composable, reusable, nameable.
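
The target rule above can be pictured as a simple type dispatch. The following is a minimal Python sketch of that rule only, not Faultbox's implementation; the `Service` and `InterfaceRef` classes and the `fault_level` function are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical stand-in for a service such as `db`."""
    name: str


@dataclass(frozen=True)
class InterfaceRef:
    """Hypothetical stand-in for an interface reference such as `db.pg`."""
    service: Service
    interface: str


def fault_level(target) -> str:
    """Return the fault level that a target selects, per the rule above."""
    if isinstance(target, InterfaceRef):
        return "protocol"  # a single interface: handled by the proxy
    if isinstance(target, Service):
        return "syscall"   # the whole process: handled via seccomp-notify
    raise TypeError(f"unsupported fault target: {target!r}")


db = Service("db")
db_pg = InterfaceRef(db, "pg")

print(fault_level(db))     # syscall
print(fault_level(db_pg))  # protocol
```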

When to use syscall faults

Syscall faults simulate infrastructure failures — the kind that affect everything a service does, not just specific operations.

Scenario	Fault	What it simulates
Server disk dies	`write=deny("EIO")`	Every write fails
Disk fills up	`write=deny("ENOSPC")`	No space for any write
Network cable unplugged	`connect=deny("ECONNREFUSED")`	Can’t reach anything
Network is slow	`connect=delay("2s")`	Every connection attempt is delayed by 2s
Total partition	`partition(svc_a, svc_b)`	Bidirectional network split

Strengths:

Works on ANY binary — no protocol support needed
Catches unexpected write paths (logging, temp files, metrics)
Simulates real infrastructure failures accurately
Simple: one line tests a broad category

Weaknesses:

Coarse: write=deny("EIO") fails every write the process makes, whether to stdout, TCP sockets, or files
Can’t target specific queries, paths, or commands
May make the service look unhealthy (under a write fault it can’t even respond to healthchecks)

Best for: “is the infrastructure broken?” questions.
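
From the service's point of view, a denied syscall should look like an ordinary failed call carrying the configured errno. The sketch below shows the kind of error handling such a fault exercises. It is plain Python with no Faultbox involved: `faulty_write` is a hypothetical stand-in for a write that fails with a given errno, and `save_order` is hypothetical application code.

```python
import errno
import os


def faulty_write(code):
    """Build a stand-in for write() that always fails with errno `code`."""
    def write(data: bytes) -> int:
        raise OSError(code, os.strerror(code))
    return write


def save_order(payload: bytes, write) -> str:
    """Hypothetical application code: classify a failed write by errno."""
    try:
        write(payload)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return "disk-full"  # e.g. shed load, alert, stop accepting writes
        if exc.errno == errno.EIO:
            return "io-error"   # e.g. fail over, mark the volume as bad
        raise                   # anything else is unexpected: propagate
    return "saved"


print(save_order(b"order-1", faulty_write(errno.EIO)))     # io-error
print(save_order(b"order-1", faulty_write(errno.ENOSPC)))  # disk-full
print(save_order(b"order-1", lambda data: len(data)))      # saved
```

A syscall-level fault checks whether every write path in the service has handling like this, including the ones nobody remembered to test.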

When to use protocol faults

Protocol faults simulate application-level failures — one operation fails while the rest of the service works normally.

Scenario	Fault	What it simulates
One SQL query fails	`error(query="INSERT*")`	DB rejects a specific insert
HTTP upstream returns 429	`response(path="/api/*", status=429)`	Rate limiting
Kafka message dropped	`drop(topic="orders")`	Message loss on one topic
Redis SET fails	`error(command="SET")`	Write to cache fails
Slow specific endpoint	`delay(path="/search*", delay="2s")`	One endpoint is slow

Strengths:

Precise: target specific queries, paths, commands, topics
Realistic: many application failures surface as a single failing query or request, not as a dead disk
Service stays healthy — healthchecks and other operations work normally
Tests error handling for specific code paths

Weaknesses:

Only works for supported protocols (HTTP, Postgres, Redis, Kafka, etc.)
Proxy adds latency (usually <1ms, but measurable)
Can’t simulate low-level failures (disk corruption, kernel panics)

Best for: “does this specific operation handle errors correctly?” questions.
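
To make the precision concrete, here is a small Python sketch of how a pattern such as "INSERT INTO orders*" could select queries. It assumes the `*` behaves like a shell-style glob, which this guide does not confirm; `first_matching_rule` and the `(pattern, message)` rule tuples are hypothetical and are not Faultbox's matching code.

```python
from fnmatch import fnmatchcase


def first_matching_rule(rules, query):
    """Return the error message of the first rule whose pattern matches."""
    for pattern, message in rules:
        if fnmatchcase(query, pattern):
            return message
    return None  # no rule matched: the query passes through untouched


# One rule, mirroring the insert_fail example at the top of this guide.
rules = [("INSERT INTO orders*", "disk full")]

queries = [
    "INSERT INTO orders (id) VALUES (1)",
    "SELECT * FROM orders",
    "INSERT INTO users (id) VALUES (1)",
    "INSERT INTO orders_archive (id) VALUES (1)",
]
for query in queries:
    print(query, "->", first_matching_rule(rules, query))
```

Under the glob assumption, the trailing `*` also matches the `orders_archive` insert, so a pattern may need a trailing space (for example "INSERT INTO orders *") to stay on one table.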

Decision table

Question	Level	Example
“What if the DB server is completely down?”	Syscall	`connect=deny("ECONNREFUSED")`
“What if this INSERT query fails?”	Protocol	`error(query="INSERT INTO orders*")`
“What if the disk is full?”	Syscall	`write=deny("ENOSPC")`
“What if this HTTP endpoint returns 500?”	Protocol	`response(path="/api/v1/orders", status=500)`
“What if the network is slow?”	Syscall	`connect=delay("2s")`
“What if this one Kafka topic drops messages?”	Protocol	`drop(topic="order-events")`
“What if Redis SET fails but GET works?”	Protocol	`error(command="SET")`
“What if two services can’t talk to each other?”	Syscall	`partition(api, db)`
“What if the WAL fsync fails?”	Syscall	`fsync=deny("EIO")`
“What if the gRPC method returns UNAVAILABLE?”	Protocol	`error(method="/orders.OrderService/Create")`

Combining both levels

Use composition to combine syscall and protocol faults into a single assumption:

# Syscall-level: DB disk is slow
db_slow = fault_assumption("db_slow",
    target = db,
    write = delay("500ms"),
)

# Protocol-level: upstream rate limits POST requests
rate_limited = fault_assumption("rate_limited",
    target = api.http,
    rules = [response(method="POST", path="/orders*", status=429)],
)

# Combined: both active simultaneously
degraded = fault_assumption("degraded",
    faults = [db_slow, rate_limited],
    description = "upstream rate-limited AND local disk slow",
)

# Use in a matrix alongside individual faults
fault_matrix(
    scenarios = [create_order, get_order],
    faults = [db_slow, rate_limited, degraded],
    overrides = {
        (create_order, rate_limited): lambda r: assert_eq(r.status, 429),
        (create_order, degraded): lambda r: assert_eq(r.status, 429),
        (get_order, db_slow): lambda r: assert_true(r.duration_ms > 400),
    },
)

When to combine:

Testing graceful degradation (some operations fail, others slow)
Testing cascading failures (upstream error + local resource issue)
Testing that partial failures don’t corrupt state
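
The fault_matrix() call above pairs every scenario with every fault and applies an override where one is defined. The Python sketch below illustrates that expansion. It is not how Faultbox implements the matrix: `expand_matrix` is a hypothetical helper, scenarios and faults are plain strings, and "default expectations" is a placeholder for whatever a scenario checks when no override applies.

```python
from itertools import product


def expand_matrix(scenarios, faults, overrides, default_check):
    """Pair every scenario with every fault and pick the check to run."""
    cells = []
    for scenario, fault in product(scenarios, faults):
        check = overrides.get((scenario, fault), default_check)
        cells.append((scenario, fault, check))
    return cells


scenarios = ["create_order", "get_order"]
faults = ["db_slow", "rate_limited", "degraded"]
overrides = {
    ("create_order", "rate_limited"): "expect status 429",
    ("create_order", "degraded"): "expect status 429",
    ("get_order", "db_slow"): "expect duration > 400 ms",
}

cells = expand_matrix(scenarios, faults, overrides, "default expectations")
for scenario, fault, check in cells:
    print(f"{scenario:13} x {fault:13} -> {check}")
```

Two scenarios and three faults give six cells. Three use an override and the other three fall back to the default, which is why a combined fault like degraded is worth listing next to its parts.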

Progression for a new project

1. Start with syscall-level fault_assumption(): connect=deny and write=deny for each dependency. This covers the broad “is X broken?” category.
2. Add protocol-level assumptions for critical paths: specific queries, endpoints, or commands where you need precise error handling.
3. Compose into a fault_matrix(), the cross-product of scenarios × assumptions. Add overrides for expected behavior.
4. Combine levels via faults=[syscall_assumption, protocol_assumption] for realistic degradation scenarios.

Most projects get 80% of the value from step 1 alone.