Choosing Fault Levels: Syscall vs Protocol

Faultbox operates at two levels. This guide helps you decide which to use and when to combine them.

Two levels, one API

# Syscall level: affects ALL writes by the db process
disk_error = fault_assumption("disk_error",
    target = db,
    write = deny("EIO"),
)

# Protocol level: affects only INSERT queries to the orders table
insert_fail = fault_assumption("insert_fail",
    target = db.pg,
    rules = [error(query="INSERT INTO orders*", message="disk full")],
)

The target determines the level:

  • Service (db) → syscall level (seccomp-notify)
  • Interface reference (db.pg) → protocol level (transparent proxy)

Both are fault_assumption() values — composable, reusable, nameable.

When to use syscall faults

Syscall faults simulate infrastructure failures — the kind that affect everything a service does, not just specific operations.

Scenario | Fault | What it simulates
--- | --- | ---
Server disk dies | write=deny("EIO") | Every write fails
Disk fills up | write=deny("ENOSPC") | No space for any write
Network cable unplugged | connect=deny("ECONNREFUSED") | Can’t reach anything
Network is slow | connect=delay("2s") | Every connection takes 2s
Total partition | partition(svc_a, svc_b) | Bidirectional network split

Strengths:

  • Works on ANY binary — no protocol support needed
  • Catches unexpected write paths (logging, temp files, metrics)
  • Simulates real infrastructure failures accurately
  • Simple: one line tests a broad category

Weaknesses:

  • Coarse: write=deny("EIO") blocks stdout, TCP, files — everything
  • Can’t target specific queries, paths, or commands
  • May break service health: a service under a write fault may be unable to respond to healthchecks

Best for: “is the infrastructure broken?” questions.
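
From the application's side, a syscall-level fault surfaces as an ordinary errno on the failing call. The following is a minimal Python sketch, not Faultbox code: `flaky_write` is a hypothetical stand-in that raises the same kind of `OSError` a service would see under write=deny("EIO") or write=deny("ENOSPC"), and `save_record` is an invented example of the error handling such a fault exercises.

```python
import errno


def flaky_write(code):
    """Hypothetical stand-in for a write() that is rejected with errno `code`."""
    def write(data):
        raise OSError(code, "injected fault")
    return write


def save_record(write, data):
    """Try to persist data; map low-level errors to an application-level outcome."""
    try:
        write(data)
        return "ok"
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return "disk_full"   # e.g. stop accepting writes, raise an alert
        if exc.errno == errno.EIO:
            return "io_error"    # e.g. fail the request, mark the store unhealthy
        raise                    # any other errno is unexpected in this sketch


print(save_record(flaky_write(errno.EIO), b"order-1"))
print(save_record(flaky_write(errno.ENOSPC), b"order-1"))
print(save_record(lambda data: None, b"order-1"))
```

Because the fault applies to every write the process makes, the same errno would also reach code paths such as logging or temp-file creation, which is how this level catches unexpected write paths.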

When to use protocol faults

Protocol faults simulate application-level failures — one operation fails while the rest of the service works normally.

Scenario | Fault | What it simulates
--- | --- | ---
One SQL query fails | error(query="INSERT*") | DB rejects a specific insert
HTTP upstream returns 429 | response(path="/api/*", status=429) | Rate limiting
Kafka message dropped | drop(topic="orders") | Message loss on one topic
Redis SET fails | error(command="SET") | Write to cache fails
Slow specific endpoint | delay(path="/search*", delay="2s") | One endpoint is slow

Strengths:

  • Precise: target specific queries, paths, commands, topics
  • Realistic: real services fail at the query level, not the disk level
  • Service stays healthy — healthchecks and other operations work normally
  • Tests error handling for specific code paths

Weaknesses:

  • Only works for supported protocols (HTTP, Postgres, Redis, Kafka, etc.)
  • Proxy adds latency (usually <1ms, but measurable)
  • Can’t simulate low-level failures (disk corruption, kernel panics)

Best for: “does this specific operation handle errors correctly?” questions.
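
The trailing `*` in patterns such as "INSERT INTO orders*" reads like a shell-style glob. Assuming that is how rules are matched (the exact matching semantics are not spelled out in this guide), the selectivity of a protocol rule can be sketched in plain Python with the standard `fnmatch` module. `rule_matches` is a hypothetical helper, not part of Faultbox.

```python
from fnmatch import fnmatchcase


def rule_matches(pattern, query):
    """Return True if the query matches a glob-style pattern (assumed semantics)."""
    return fnmatchcase(query, pattern)


queries = [
    "INSERT INTO orders (id, total) VALUES (1, 9.99)",
    "INSERT INTO audit_log (msg) VALUES ('created')",
    "SELECT * FROM orders WHERE id = 1",
]

# Under the assumed glob semantics only the first query is hit by the fault;
# the other two pass through the proxy untouched.
for q in queries:
    print(rule_matches("INSERT INTO orders*", q), q)
```

A broader pattern such as "INSERT*" would match both inserts in this list, so the pattern's specificity decides how much of the service keeps working.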

Decision table

Question | Level | Example
--- | --- | ---
“What if the DB server is completely down?” | Syscall | connect=deny("ECONNREFUSED")
“What if this INSERT query fails?” | Protocol | error(query="INSERT INTO orders*")
“What if the disk is full?” | Syscall | write=deny("ENOSPC")
“What if this HTTP endpoint returns 500?” | Protocol | response(path="/api/v1/orders", status=500)
“What if the network is slow?” | Syscall | connect=delay("2s")
“What if this one Kafka topic drops messages?” | Protocol | drop(topic="order-events")
“What if Redis SET fails but GET works?” | Protocol | error(command="SET")
“What if two services can’t talk to each other?” | Syscall | partition(api, db)
“What if the WAL fsync fails?” | Syscall | fsync=deny("EIO")
“What if the gRPC method returns UNAVAILABLE?” | Protocol | error(method="/orders.OrderService/Create")

Combining both levels

Use composition to combine syscall and protocol faults into a single assumption:

# Syscall-level: DB disk is slow
db_slow = fault_assumption("db_slow",
    target = db,
    write = delay("500ms"),
)

# Protocol-level: upstream rate limits POST requests
rate_limited = fault_assumption("rate_limited",
    target = api.http,
    rules = [response(method="POST", path="/orders*", status=429)],
)

# Combined: both active simultaneously
degraded = fault_assumption("degraded",
    faults = [db_slow, rate_limited],
    description = "upstream rate-limited AND local disk slow",
)

# Use in a matrix alongside individual faults
fault_matrix(
    scenarios = [create_order, get_order],
    faults = [db_slow, rate_limited, degraded],
    overrides = {
        (create_order, rate_limited): lambda r: assert_eq(r.status, 429),
        (create_order, degraded): lambda r: assert_eq(r.status, 429),
        (get_order, db_slow): lambda r: assert_true(r.duration_ms > 400),
    },
)

When to combine:

  • Testing graceful degradation (some operations fail, others slow)
  • Testing cascading failures (upstream error + local resource issue)
  • Testing that partial failures don’t corrupt state
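
One way to picture composition: a combined assumption is a named group whose leaves are the individual syscall-level and protocol-level faults. The sketch below is a hypothetical Python model. The `Assumption` class and the `leaves` helper are invented for illustration and say nothing about how Faultbox represents assumptions internally.

```python
from dataclasses import dataclass, field


@dataclass
class Assumption:
    name: str
    level: str = ""                             # "syscall" or "protocol" for leaves
    faults: list = field(default_factory=list)  # children of a combined assumption


def leaves(assumption):
    """Flatten a (possibly nested) assumption into its leaf faults, in order."""
    if not assumption.faults:
        return [assumption]
    result = []
    for child in assumption.faults:
        result.extend(leaves(child))
    return result


db_slow = Assumption("db_slow", level="syscall")
rate_limited = Assumption("rate_limited", level="protocol")
degraded = Assumption("degraded", faults=[db_slow, rate_limited])

# Lists every leaf fault that is active while "degraded" is applied.
print([(a.name, a.level) for a in leaves(degraded)])
```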

Progression for a new project

  1. Start with syscall-level fault_assumption() values: connect=deny and write=deny for each dependency. This covers the broad “is X broken?” category.

  2. Add protocol-level assumptions for critical paths — specific queries, endpoints, or commands where you need precise error handling.

  3. Compose into a fault_matrix() — cross-product of scenarios × assumptions. Add overrides for expected behavior.

  4. Combine levels via faults=[syscall_assumption, protocol_assumption] for realistic degradation scenarios.
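
The cross-product in step 3 can be pictured with a short Python sketch. It assumes that fault_matrix() runs every scenario under every assumption and that a cell without an override falls back to some default check. The string names, the `expand` helper, and the fallback behavior are illustrative assumptions, not the Faultbox implementation.

```python
from itertools import product

scenarios = ["create_order", "get_order"]
faults = ["db_slow", "rate_limited", "degraded"]

# Cells with a specific expectation; every other cell uses the default.
overrides = {
    ("create_order", "rate_limited"): "expect status 429",
    ("create_order", "degraded"): "expect status 429",
    ("get_order", "db_slow"): "expect duration > 400 ms",
}


def expand(scenarios, faults, overrides, default="default check"):
    """Yield one (scenario, fault, expectation) row per matrix cell."""
    for scenario, fault in product(scenarios, faults):
        yield scenario, fault, overrides.get((scenario, fault), default)


cells = list(expand(scenarios, faults, overrides))
for scenario, fault, expectation in cells:
    print(f"{scenario:13} x {fault:13} -> {expectation}")
```

Two scenarios and three assumptions give six cells, three of which carry an override. The matrix grows multiplicatively, which is one reason to add protocol-level assumptions only for the critical paths.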

Most projects get 80% of the value from step 1 alone.