Recipes
Curated, protocol-specific failure helpers that ship embedded in the
faultbox binary. Each recipe wraps the core fault primitives
(response, error, delay, drop) with the canonical error
message, status code, or body shape that a real server emits — so
specs say “disk full” instead of remembering
"assertion: 10334 disk full".
Why recipes exist
Writing a realistic fault today requires knowing the exact error shape the real service would emit:
# What the user wants to say:
rules = [error(query="INSERT*", message="disk full")]
# What the real Postgres would say (if you match on SQLSTATE):
rules = [error(query="INSERT*",
message='ERROR: could not extend file "base/...": No space left on device (SQLSTATE 53100)')]
Most users write the short version on their first attempt. The SUT’s error handling might match on SQLSTATE codes or specific substrings — so the injected fault passes, but the production fault would fail. The test gives false confidence.
Recipes bridge the gap: they encode the canonical shape of each failure once, in the stdlib, so your specs stay readable and your faults stay realistic.
See RFC-018 for the design and RFC-019 for how recipes reach your specs.
How to use them
Recipes ship embedded in the faultbox binary. Load them via the
@faultbox/ prefix — no local recipes/ directory needed:
load("@faultbox/recipes/mongodb.star", "mongodb")
load("@faultbox/recipes/cassandra.star", "cassandra")
load("@faultbox/recipes/clickhouse.star", "clickhouse")
broken = fault_assumption("broken",
target = db.main,
rules = [
mongodb.disk_full(collection = "orders"),
cassandra.unavailable(),
clickhouse.too_many_parts(),
],
)
The namespace struct pattern
Each recipe file exports one struct named after the protocol
(RFC-018). This prevents name collisions when you load recipes for
multiple protocols — mongodb.disk_full and postgres.disk_full
coexist naturally:
load("@faultbox/recipes/mongodb.star", "mongodb")
load("@faultbox/recipes/postgres.star", "postgres")
rules = [
mongodb.disk_full(collection = "orders"),
postgres.disk_full(), # same recipe name, different namespace
]
One import per protocol, clean call sites, zero collisions.
CLI discovery
Browse the catalog from the command line, no source checkout needed:
$ faultbox recipes list
Available stdlib recipes (load via @faultbox/recipes/<name>.star):
amqp
cassandra
clickhouse
grpc
http
http2
kafka
memcached
mongodb
mysql
nats
postgres
redis
udp
$ faultbox recipes show mongodb
# prints the full mongodb.star source
Wiring SUTs to the proxy
Recipes apply rules at the proxy layer. For a fault to fire, the SUT must
actually dial the proxy — not the real upstream. For a host-binary SUT
talking to a Docker upstream, this means using iface.proxy_addr (or
proxy_host / proxy_port) in the SUT’s env, not internal_addr:
db = service("db",
interface("main", "mysql", 3306),
image = "mysql:8",
reuse = True,
)
api = service("truck-api", "/usr/local/bin/truck-api",
interface("public", "http", 9000),
env = {
"MYSQL_HOST": db.main.proxy_host, # → "127.0.0.1"
"MYSQL_PORT": db.main.proxy_port, # → "36643" (auto-assigned)
"MYSQL_DSN": "user:pass@tcp(" + db.main.proxy_addr + ")/appdb",
},
)
Why not internal_addr? It returns db:3306 (the Docker DNS name), which
the host-binary SUT can’t resolve. Manual rsplit(":") decomposition
breaks the late-bound substitution and silently produces an unroutable
address. See spec-language.md → InterfaceRef
for the full attribute reference.
For container-to-container topologies (every service runs in Docker),
internal_addr is still the right choice — Docker’s internal DNS handles
resolution and the proxy substitution catches the literal addr in env values.
Shipped recipes
All recipes shipped in the current release. Each bullet describes what the recipe injects at the proxy level.
@faultbox/recipes/http.star
HTTP/1.x twin of @faultbox/recipes/http2.star — same status-code semantics, with connection_drop in place of HTTP/2’s stream_reset.
| Recipe | What it simulates |
|---|---|
http.rate_limited(path="/*") | HTTP 429 with Retry-After body — client back-off + jitter tests |
http.server_error(path="/*") | HTTP 500 — generic internal error |
http.service_unavailable(path="/*") | HTTP 503 — retryable per HTTP semantics |
http.gateway_timeout(path="/*") | HTTP 504 — upstream timeout at an intermediary |
http.slow_endpoint(path="/*", duration="3s") | Fixed-latency injection — tests client read-timeout |
http.maintenance_window(path="/*") | 503 with long Retry-After — LB “we’re deploying” response |
http.connection_drop(path="/*") | TCP close mid-request — keep-alive pool eviction tests |
http.flaky(path="/*", probability="20%") | Probabilistic 500s — retry-policy and exponential-backoff tests |
http.unauthorized(path="/*") | HTTP 401 — token-expiry + refresh-flow tests |
http.forbidden(path="/*") | HTTP 403 — authorization-failure paths |
@faultbox/recipes/postgres.star
| Recipe | What it simulates |
|---|---|
postgres.deadlock(query="*") | SQLSTATE 40P01 — deadlock detected; victim transaction aborted. Triggers retry or crash in no-retry paths. |
postgres.lock_not_available(query="*") | SQLSTATE 55P03 — lock_timeout / statement_timeout canceled statement |
postgres.serialization_failure(query="*") | SQLSTATE 40001 — concurrent update invalidated snapshot; SERIALIZABLE / REPEATABLE READ only |
postgres.too_many_connections() | SQLSTATE 53300 — max_connections saturated; surfaces pool-init error-swallowing bugs |
postgres.read_only_transaction(query="INSERT*") | SQLSTATE 25006 — writes routed to a hot-standby / read-replica |
postgres.disk_full(query="INSERT*") | SQLSTATE 53100 — “No space left on device” during file extension |
postgres.admin_shutdown(query="*") | SQLSTATE 57P01 — server shutting down; pools must evict rather than retry |
postgres.connection_failure(query="*") | Drop connection mid-query — drivers surface SQLSTATE 08006 |
postgres.slow_query(duration="3s", query="*") | Delays any statement — tests statement_timeout + context deadlines |
postgres.slow_writes(duration="3s", query="INSERT*") | Delays writes only |
@faultbox/recipes/redis.star
| Recipe | What it simulates |
|---|---|
redis.oom(key="*") | ”OOM command not allowed…” — maxmemory reached on writes |
redis.cluster_down(key="*") | ”CLUSTERDOWN The cluster is down” — quorum lost |
redis.loading(key="*") | ”LOADING Redis is loading…” — server replaying RDB/AOF after restart |
redis.readonly_replica(key="*") | ”READONLY You can’t write against a read only replica.” |
redis.busy(key="*") | ”BUSY Redis is busy running a script.” — Lua script blocking |
redis.noauth(key="*") | ”NOAUTH Authentication required.” — server restarted into authenticated mode |
redis.wrongtype(key="*") | ”WRONGTYPE Operation against a key holding the wrong kind of value” |
redis.slow_command(duration="3s", key="*") | Delays every command — tests pool timeout cascade |
redis.connection_drop(key="*") | Connection close mid-command — pool reconnect path |
@faultbox/recipes/mysql.star
| Recipe | What it simulates |
|---|---|
mysql.deadlock(query="*") | ER_LOCK_DEADLOCK (1213) — circular row-lock wait. Triggers retry or crash in no-retry paths. |
mysql.lock_wait_timeout(query="*") | ER_LOCK_WAIT_TIMEOUT (1205) — innodb_lock_wait_timeout exceeded |
mysql.too_many_connections() | ER_CON_COUNT_ERROR (1040) — surfaces nil-pointer bugs in pool init paths that don’t check for connect errors |
mysql.read_only_replica(query="INSERT*") | ER_OPTION_PREVENTS_STATEMENT (1290) — writes routed to a replica |
mysql.disk_full(query="INSERT*") | ER_RECORD_FILE_FULL (1114) — “The table is full” |
mysql.gone_away(query="*") | Drop connection mid-query — driver sees classic “MySQL server has gone away” |
mysql.slow_query(duration="3s", query="*") | Delays any statement — tests client query timeouts |
mysql.slow_writes(duration="3s", query="INSERT*") | Delays writes only |
@faultbox/recipes/grpc.star
| Recipe | What it simulates |
|---|---|
grpc.unavailable(method="*") | Code 14 (UNAVAILABLE) — most retried gRPC error; transient server outage |
grpc.deadline_exceeded(method="*") | Code 4 (DEADLINE_EXCEEDED) — per-call deadline missed |
grpc.resource_exhausted(method="*") | Code 8 (RESOURCE_EXHAUSTED) — quota / rate limit / inflight cap |
grpc.unauthenticated(method="*") | Code 16 (UNAUTHENTICATED) — missing/invalid/expired credentials |
grpc.permission_denied(method="*") | Code 7 (PERMISSION_DENIED) — identity known but unauthorized |
grpc.internal(method="*") | Code 13 (INTERNAL) — generic server-side failure; non-retryable |
grpc.not_found(method="*") | Code 5 (NOT_FOUND) — target resource doesn’t exist |
grpc.aborted(method="*") | Code 10 (ABORTED) — transactional/optimistic-concurrency conflict |
grpc.slow_method(method="*", duration="3s") | Delays RPC response — tests client deadline propagation |
grpc.connection_drop(method="*") | TCP close mid-call — resolver + subchannel reconnect paths |
@faultbox/recipes/kafka.star
| Recipe | What it simulates |
|---|---|
kafka.not_leader_for_partition(topic="*") | Error 6 — produced to a broker no longer the partition leader |
kafka.rebalancing(topic="*") | Error 27 (REBALANCE_IN_PROGRESS) — simplified trigger for consumer rebalance handlers |
kafka.offset_out_of_range(topic="*") | Error 1 — consumer offset past log head / before retention cutoff |
kafka.message_too_large(topic="*") | Error 10 — produce payload exceeds message.max.bytes |
kafka.coordinator_not_available(topic="*") | Error 15 — consumer-group coordinator unavailable; exposes shutdown-order bugs |
kafka.broker_overloaded(topic="*") | Request quota exceeded — tests client back-pressure |
kafka.slow_produce(duration="3s", topic="*") | Delays produce requests — tests linger.ms + batching |
kafka.connection_drop(topic="*") | TCP drop mid-request — forces driver reconnect |
@faultbox/recipes/mongodb.star
| Recipe | What it simulates |
|---|---|
mongodb.disk_full(collection="*") | Full data disk on insert — assertion: 10334 disk full |
mongodb.auth_failed() | SASL authentication rejection |
mongodb.replica_unavailable(collection="*") | Write concern failure; no primary available during election |
mongodb.slow_query(collection="*", duration="3s") | Delays find() — tests client read-timeout and retry |
mongodb.slow_writes(collection="*", duration="3s") | Delays insert — tests write-timeout + transaction rollback |
mongodb.connection_drop(collection="*", op="*") | Closes connection mid-command — triggers driver reconnect path |
mongodb.duplicate_key_error(collection="*") | Unique-index violation on insert (E11000) |
mongodb.write_conflict(collection="*") | Transient transaction error — drivers retry per protocol |
@faultbox/recipes/http2.star
| Recipe | What it simulates |
|---|---|
http2.rate_limited(path="/*") | HTTP 429 with Retry-After |
http2.server_error(path="/*") | HTTP 500 — generic internal error |
http2.service_unavailable(path="/*") | HTTP 503 — retryable |
http2.gateway_timeout(path="/*") | HTTP 504 — upstream timeout |
http2.slow_endpoint(path="/*", duration="3s") | Fixed-latency injection |
http2.maintenance_window(path="/*") | 503 with Retry-After — typical LB “we’re deploying” response |
http2.stream_reset(path="/*") | RST_STREAM via drop |
http2.flaky(path="/*", probability="20%") | Probabilistic 500s — retry tests |
http2.unauthorized(path="/*") | HTTP 401 — auth / token-refresh tests |
http2.forbidden(path="/*") | HTTP 403 — authorization failures |
@faultbox/recipes/cassandra.star
| Recipe | What it simulates |
|---|---|
cassandra.write_timeout(query="INSERT*") | Coordinator-level write timeout |
cassandra.read_timeout(query="SELECT*") | Read timeout — driver retries |
cassandra.unavailable(query="*") | Insufficient replicas for consistency level |
cassandra.overloaded(query="*") | OverloadedException — driver tries different coordinator |
cassandra.slow_reads(duration="3s") | Delays SELECTs — tests speculative execution |
cassandra.slow_writes(duration="3s") | Delays INSERT/UPDATE/DELETE |
cassandra.connection_drop(query="*") | Connection reset mid-statement |
cassandra.schema_mismatch(query="*") | Stale schema version — drivers refresh cache |
@faultbox/recipes/nats.star
| Recipe | What it simulates |
|---|---|
nats.slow_consumer(subject=">") | ”Slow Consumer Detected” — server dropped messages the subscriber couldn’t drain |
nats.no_responders(subject=">") | ”503 No Responders” — request-reply sent to a subject with zero subscribers |
nats.max_payload(subject=">") | ”Maximum Payload Exceeded” — publish larger than max_payload |
nats.authorization_violation(subject=">") | Authorization violation — credential/account-level denial |
nats.permissions_violation(subject=">") | Permissions violation — subject-level denial |
nats.stale_connection(subject=">") | ”Stale Connection” — server-side keep-alive failure; forces reconnect |
nats.slow_delivery(duration="3s", subject=">") | Delays message delivery — consumer processing deadlines |
nats.connection_drop(subject=">") | TCP close mid-stream — server-list failover test |
@faultbox/recipes/memcached.star
| Recipe | What it simulates |
|---|---|
memcached.server_error(command="*", key="*") | SERVER_ERROR out of memory — classic under-provisioned cache |
memcached.client_error(command="*", key="*") | CLIENT_ERROR — protocol error (non-retryable) |
memcached.not_stored(command="add", key="*") | NOT_STORED — add-finds-existing or replace-finds-missing |
memcached.exists(command="cas", key="*") | EXISTS — stale CAS token; optimistic-concurrency retry path |
memcached.item_too_large(command="set", key="*") | SERVER_ERROR object too large — -I/item_size_max exceeded |
memcached.busy(command="*", key="*") | SERVER_ERROR busy — slab reassignment / LRU maintenance |
memcached.slow_command(duration="3s", command="*", key="*") | Delays every command — read-timeout tests |
memcached.connection_drop(command="*", key="*") | TCP close mid-command — pool evict + reconnect backoff |
@faultbox/recipes/clickhouse.star
| Recipe | What it simulates |
|---|---|
clickhouse.too_many_parts(query="INSERT*") | Insert rate exceeds merge rate — drivers back off |
clickhouse.memory_limit(query="SELECT*") | Query exceeds memory quota (code 241) |
clickhouse.table_not_exists(query="*") | Missing table (code 60) |
clickhouse.readonly_mode(query="INSERT*") | Server refuses writes during maintenance |
clickhouse.slow_analytics(duration="5s") | Delays SELECTs — dashboard / ETL timeout tests |
clickhouse.slow_ingest(duration="3s") | Delays INSERTs — producer back-pressure tests |
clickhouse.connection_drop(query="*") | HTTP connection reset mid-query |
clickhouse.replica_stale(query="SELECT*") | Replica too far behind leader |
@faultbox/recipes/amqp.star
| Recipe | What it simulates |
|---|---|
amqp.channel_error(routing_key="#") | Channel-level soft error — channel closed, connection alive |
amqp.connection_error(routing_key="#") | CONNECTION_FORCED — hard error, full reconnect + redeclare |
amqp.resource_locked(routing_key="#") | RESOURCE_LOCKED — exclusive-consumer contention |
amqp.access_refused(routing_key="#") | ACCESS_REFUSED — vhost or exchange-level permission denial |
amqp.precondition_failed(routing_key="#") | PRECONDITION_FAILED — queue redeclared with different args |
amqp.publish_nack(routing_key="#") | Publisher-confirm nack — broker refused the publish |
amqp.broker_unavailable(routing_key="#") | Broker unreachable — rolling restart / cluster failover |
amqp.slow_publish(duration="3s", routing_key="#") | Delays publishes — confirm-deadline + back-pressure tests |
amqp.connection_drop(routing_key="#") | TCP close mid-frame — reconnect + topology-redeclare path |
@faultbox/recipes/udp.star
| Recipe | What it simulates |
|---|---|
udp.packet_loss(probability="100%") | Datagram drops (default 100% blackout) |
udp.dns_flap(probability="50%") | Aggressive 50% loss typical of unreliable DNS |
udp.metrics_slow(duration="1s") | Delays datagrams — StatsD / metrics slow-path |
udp.jitter(duration="100ms") | Fixed per-packet delay — congestion simulation |
udp.blackhole() | Drops every datagram — total UDP partition |
User-authored recipes
The @faultbox/ prefix is reserved for the stdlib. Your own recipes
live on the filesystem and follow the same namespace struct pattern:
# my-company/recipes/checkout.star
checkout = struct(
# post_q2_race simulates the specific race that took us down in Q2.
post_q2_race = lambda: [
delay(path = "/checkout", delay = "800ms"),
error(path = "/inventory/reserve", status = 409),
],
)
Load them via a relative path:
load("@faultbox/recipes/mongodb.star", "mongodb") # stdlib
load("./recipes/checkout.star", "checkout") # your project
rules = [mongodb.disk_full(), checkout.post_q2_race()]
Forking a stdlib recipe
To customize a stdlib recipe, copy its source into your project and load from there:
$ faultbox recipes show mongodb > recipes/mongodb-custom.star
$ # edit recipes/mongodb-custom.star as needed
load("./recipes/mongodb-custom.star", "mongodb")
Contributing to the stdlib
Recipes are data, not code — no Go changes, just Starlark. To add a new one:
- Open the relevant
recipes/<protocol>.starfile in the Faultbox repo. - Add a new field to the existing struct (not a top-level
def). - The field must return a
ProxyFaultDef(output ofresponse(),error(),delay(), ordrop()). - Use sensible defaults — the zero-arg call should do something useful.
- Add a one-line comment above describing the real-world failure.
- Update this catalog and
recipes/README.mdin the repo.
See RFC-018 for the stability contract and what recipes must not do (no I/O, no control flow, no nested loads).