Design Document: Rule-Based Failure Scenario Generator

Problem

Users write happy-path tests but manually add failure scenarios one by one. This requires knowing which syscalls to target, which errnos to use, and which dependencies can fail. Most engineers don’t think in syscalls — they miss failure modes that are obvious in hindsight.

Worse, no generator can guess which API calls to make or which status codes a service should return under failure; only the user knows that.

Goal: The user describes how the system works when everything is fine (scenario()). The generator takes that happy path and systematically wraps it in each fault from the failure matrix, producing mutations, not inventions.

Core Model: Happy Path Mutation

User writes:    scenario(fn)     → "this is how my system works"
Generator:      mutate(scenario) → "here's everything that can go wrong"

The generator never invents API calls or assertions. It takes the user’s exact happy path body and wraps it in fault scopes. The user then reviews each mutation and adds assertions for the behavior they want.

The scenario() Builtin

# faultbox.star

db = service("db", ..., interface("main", "postgres", 5432))
cache = service("cache", ..., interface("main", "redis", 6379))
api = service("api", ..., depends_on=[db, cache])

# Happy path — describes how the system works when healthy.
# Registered with scenario() so the generator knows to mutate it.
def order_flow():
    api.post(path="/orders", body='{"sku": "widget", "qty": 1}')
    api.post(path="/payments", body='{"order_id": 1, "amount": 100}')
    resp = api.get(path="/orders/1")
    assert_eq(resp.data["status"], "paid")

scenario(order_flow)

# You can register multiple scenarios:
def health_check():
    resp = api.get(path="/health")
    assert_eq(resp.status, 200)

scenario(health_check)

scenario(fn) does two things:

  1. Registers the function as a happy path for the generator
  2. Runs it as a test (same as test_* functions) — so happy paths are always verified

What the Generator Produces

faultbox generate faultbox.star --output failures.star

For each registered scenario() × each dependency × each failure mode, the generator produces a test that runs the exact same body under fault:

# failures.star — generated by: faultbox generate faultbox.star
load("faultbox.star", "api", "db", "cache", "order_flow", "health_check")

# --- order_flow × db failures ---

def test_gen_order_flow_db_down():
    """order_flow with db connection refused."""
    fault(api, connect=deny("ECONNREFUSED", label="db down"), run=order_flow)

def test_gen_order_flow_db_slow():
    """order_flow with db writes delayed 5s."""
    fault(db, write=delay("5s", label="db slow"), run=order_flow)

def test_gen_order_flow_db_disk_full():
    """order_flow with db disk full."""
    fault(db, write=deny("ENOSPC", label="disk full"), run=order_flow)

def test_gen_order_flow_db_io_error():
    """order_flow with db disk I/O error."""
    fault(db, write=deny("EIO", label="disk I/O error"), run=order_flow)

def test_gen_order_flow_db_fsync_failure():
    """order_flow with db fsync failure."""
    fault(db, fsync=deny("EIO", label="fsync failure"), run=order_flow)

def test_gen_order_flow_db_connection_reset():
    """order_flow with db dropping mid-request."""
    fault(api, read=deny("ECONNRESET", label="db connection reset"), run=order_flow)

def test_gen_order_flow_db_partition():
    """order_flow with network partition between api and db."""
    partition(api, db, run=order_flow)

# --- order_flow × cache failures ---

def test_gen_order_flow_cache_down():
    """order_flow with cache connection refused."""
    fault(api, connect=deny("ECONNREFUSED", label="cache down"), run=order_flow)

def test_gen_order_flow_cache_slow():
    """order_flow with cache delayed 5s."""
    fault(cache, write=delay("5s", label="cache slow"), run=order_flow)

# --- health_check × db failures ---

def test_gen_health_check_db_down():
    """health_check with db connection refused."""
    fault(api, connect=deny("ECONNREFUSED", label="db down"), run=health_check)

# ... etc

Key properties:

  • No generated assertions — the happy path’s own assertions will either pass (system handles the fault) or fail (system doesn’t). The user reviews failures and decides what’s correct behavior.
  • Happy path body preserved exactly: run=order_flow passes the original function, with the same API calls in the same sequence.
  • Each mutation = one fault scope — simple, isolated, debuggable.

User Workflow

1. Write happy paths with scenario()
2. faultbox generate → produces mutations
3. Run mutations → see which fail
4. For each failure:
   a. Expected failure → adjust assertion (change assert_eq to assert_true(status >= 500))
   b. Unexpected failure → found a bug, fix the code
   c. Irrelevant → delete the generated test
5. Commit the kept mutations alongside happy paths
6. Regenerate when topology changes (new dependency, new scenario)

CLI

# Generate faults for all scenarios (one file per scenario):
faultbox generate faultbox.star
# → order_flow.faults.star
# → health_check.faults.star

# Generate faults for specific scenario:
faultbox generate faultbox.star --scenario order_flow
# → order_flow.faults.star

# Override output name:
faultbox generate faultbox.star --scenario order_flow --output custom.star

# Filter by dependency or category:
faultbox generate faultbox.star --service db
faultbox generate faultbox.star --category network

# Dry run — list mutations without writing files:
faultbox generate faultbox.star --dry-run
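The flags above could be parsed with Go's standard flag package. This is a minimal sketch, not the actual generateCmd: the genFlags struct and parseGenerateArgs function are hypothetical names. One detail worth noting is that the flag package stops parsing at the first non-flag argument, so the sketch takes the source file first and parses the rest.

```go
package main

import (
	"flag"
	"fmt"
)

// genFlags mirrors the options listed above; the struct itself is hypothetical.
type genFlags struct {
	Source   string
	Scenario string
	Service  string
	Category string
	Output   string
	DryRun   bool
}

// parseGenerateArgs parses the arguments that follow "faultbox generate".
// The source file comes first; everything after it is parsed as flags.
func parseGenerateArgs(args []string) (genFlags, error) {
	var f genFlags
	if len(args) == 0 {
		return f, fmt.Errorf("usage: faultbox generate <file.star> [flags]")
	}
	f.Source = args[0]

	fs := flag.NewFlagSet("generate", flag.ContinueOnError)
	fs.StringVar(&f.Scenario, "scenario", "", "only mutate this scenario")
	fs.StringVar(&f.Service, "service", "", "only fault this dependency")
	fs.StringVar(&f.Category, "category", "", "only this fault category")
	fs.StringVar(&f.Output, "output", "", "write all mutations to one file")
	fs.BoolVar(&f.DryRun, "dry-run", false, "list mutations without writing files")
	if err := fs.Parse(args[1:]); err != nil {
		return f, err
	}
	return f, nil
}

func main() {
	f, err := parseGenerateArgs([]string{"faultbox.star", "--scenario", "order_flow", "--dry-run"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", f)
}
```

The flag package treats one and two leading dashes as equivalent, so --dry-run and -dry-run both work with this sketch.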

Default output naming

When --output is not specified, the generator writes one file per registered scenario using the convention:

<scenario_name>.faults.star

Scenario function   Output file
order_flow          order_flow.faults.star
health_check        health_check.faults.star
checkout_flow       checkout_flow.faults.star

When --output is specified, all mutations go into a single file. When --scenario is specified without --output, only that scenario’s file is written.
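The naming rules can be sketched as a small function. This is an illustration under assumptions: outputFiles is a hypothetical helper, and an empty string stands for a flag that was not given.

```go
package main

import (
	"fmt"
	"sort"
)

// outputFiles maps each output filename to the scenarios written into it.
// scenarioFilter and output correspond to --scenario and --output.
func outputFiles(scenarios []string, scenarioFilter, output string) map[string][]string {
	files := map[string][]string{}
	for _, s := range scenarios {
		if scenarioFilter != "" && s != scenarioFilter {
			continue // --scenario restricts generation to one scenario
		}
		name := s + ".faults.star" // default: one file per scenario
		if output != "" {
			name = output // --output collects all mutations in a single file
		}
		files[name] = append(files[name], s)
	}
	return files
}

func main() {
	files := outputFiles([]string{"order_flow", "health_check"}, "", "")
	names := make([]string, 0, len(files))
	for n := range files {
		names = append(names, n)
	}
	sort.Strings(names) // map iteration order is random, so sort for stable output
	for _, n := range names {
		fmt.Println(n, files[n])
	}
}
```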

Failure Matrix

For each dependency edge, the generator applies these fault templates:

Network failures (all protocols)

Fault               Errno / action   Label                    Severity
connect refused     ECONNREFUSED     <dep> down               critical
connect slow        delay 5s         <dep> slow               high
connection reset    ECONNRESET       <dep> connection reset   high
network partition   partition()      <dep> partitioned        critical

Disk failures (services with storage)

Fault             Errno    Label                  Severity
write I/O error   EIO      disk I/O error         critical
disk full         ENOSPC   disk full              high
fsync failure     EIO      fsync failure          critical
read-only FS      EROFS    read-only filesystem   medium

Protocol-specific failures

Protocol   Extra faults
postgres   read delay (query timeout)
redis      read delay (cache timeout)
kafka      write delay (publish backpressure)
http       read delay (slow upstream response)
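One way to hold these tables in code is as plain data plus a selector. The sketch below is hypothetical: FaultTemplate, templatesFor, the hasStorage parameter, and the "protocol" category name are assumptions. The tables give no severity for protocol-specific faults, so that field is left empty.

```go
package main

import "fmt"

// FaultTemplate is a hypothetical row type for the failure-matrix tables.
type FaultTemplate struct {
	Fault    string // row name from the tables
	Category string // "network", "disk", or (assumed) "protocol"
	Action   string // errno, delay, or partition()
	Severity string // empty where the tables give none
}

var networkFaults = []FaultTemplate{
	{"connect refused", "network", "ECONNREFUSED", "critical"},
	{"connect slow", "network", "delay 5s", "high"},
	{"connection reset", "network", "ECONNRESET", "high"},
	{"network partition", "network", "partition()", "critical"},
}

var diskFaults = []FaultTemplate{
	{"write I/O error", "disk", "EIO", "critical"},
	{"disk full", "disk", "ENOSPC", "high"},
	{"fsync failure", "disk", "EIO", "critical"},
	{"read-only FS", "disk", "EROFS", "medium"},
}

var protocolFaults = map[string]FaultTemplate{
	"postgres": {"query timeout", "protocol", "read delay", ""},
	"redis":    {"cache timeout", "protocol", "read delay", ""},
	"kafka":    {"publish backpressure", "protocol", "write delay", ""},
	"http":     {"slow upstream response", "protocol", "read delay", ""},
}

// templatesFor returns the templates that apply to one dependency edge:
// network faults always, disk faults only for services with storage,
// plus the protocol-specific extra if the protocol is known.
func templatesFor(protocol string, hasStorage bool) []FaultTemplate {
	out := append([]FaultTemplate{}, networkFaults...)
	if hasStorage {
		out = append(out, diskFaults...)
	}
	if extra, ok := protocolFaults[protocol]; ok {
		out = append(out, extra)
	}
	return out
}

func main() {
	for _, t := range templatesFor("postgres", true) {
		fmt.Printf("%-24s %-9s %-13s %s\n", t.Fault, t.Category, t.Action, t.Severity)
	}
}
```

How the generator decides that a service has storage is not specified here; the boolean parameter simply stands in for that decision.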

Technical Implementation

Architecture

cmd/faultbox/main.go
    └── generateCmd(args)

            ├── Load .star file (star.New + LoadFile)

            ├── internal/generate/analyzer.go
            │   └── Analyze(rt) → Analysis
            │       ├── services, interfaces, protocols
            │       ├── dependency edges
            │       ├── registered scenarios (from scenario() calls)
            │       └── existing fault coverage

            ├── internal/generate/matrix.go
            │   └── BuildMatrix(analysis) → []Mutation
            │       ├── for each scenario × each edge × each fault
            │       └── deduplication against existing faults

            ├── internal/generate/codegen.go
            │   └── Generate(mutations, analysis, opts) → string
            │       ├── load() header
            │       ├── test function per mutation
            │       └── comments and labels

            └── Output (stdout / file)

New Package: internal/generate/

analyzer.go:

type Analysis struct {
    Services   []ServiceInfo
    Edges      []DependencyEdge
    Scenarios  []ScenarioInfo     // from scenario() registrations
    Covered    []CoveredFault     // already tested fault combinations
}

type ScenarioInfo struct {
    Name     string   // function name (e.g., "order_flow")
    VarName  string   // Starlark variable name
    Services []string // services referenced in the body
}

type DependencyEdge struct {
    From     string  // service that depends
    To       string  // service depended on
    Via      string  // "depends_on", "env"
    Protocol string  // protocol of the target interface
}

func Analyze(rt *star.Runtime) (*Analysis, error)

matrix.go:

type Mutation struct {
    Name          string // test_gen_<scenario>_<target>_<fault>
    Scenario      string // scenario function name
    Category      string // "network", "disk"
    Description   string
    FaultTarget   string // service to fault
    Syscall       string // "connect", "write", "fsync"
    Action        string // "deny", "delay"
    Errno         string // "ECONNREFUSED", "EIO", ""
    Delay         string // "5s", ""
    Label         string // human-readable
    UsePartition  bool   // use partition() instead of fault()
    PartitionA    string // first service for partition
    PartitionB    string // second service for partition
    Severity      string
}

func BuildMatrix(analysis *Analysis) []Mutation
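The body of BuildMatrix is essentially a triple loop over scenarios, edges, and fault templates. The following is a trimmed-down sketch, not the real implementation: the types are reduced to the fields used, the template type and its OnCaller flag are hypothetical, and the rule "only follow edges of services the scenario references" is an assumption. OnCaller reflects the generated examples, where connect and read faults are injected on the calling service and write faults on the dependency.

```go
package main

import "fmt"

type ScenarioInfo struct {
	Name     string
	Services []string // services referenced in the body
}

type DependencyEdge struct{ From, To string }

type Mutation struct {
	Name, Scenario, FaultTarget, Syscall, Action, Errno, Delay string
}

// template is a hypothetical, reduced fault template.
type template struct {
	Slug, Syscall, Action, Errno, Delay string
	OnCaller                            bool // inject on edge.From instead of edge.To
}

var templates = []template{
	{"down", "connect", "deny", "ECONNREFUSED", "", true},
	{"slow", "write", "delay", "", "5s", false},
	{"connection_reset", "read", "deny", "ECONNRESET", "", true},
}

// buildMatrix expands scenario × edge × template into named mutations.
func buildMatrix(scenarios []ScenarioInfo, edges []DependencyEdge) []Mutation {
	var out []Mutation
	for _, sc := range scenarios {
		used := map[string]bool{}
		for _, s := range sc.Services {
			used[s] = true
		}
		for _, e := range edges {
			if !used[e.From] {
				continue // assumption: skip edges the scenario never exercises
			}
			for _, t := range templates {
				target := e.To
				if t.OnCaller {
					target = e.From
				}
				out = append(out, Mutation{
					Name:        fmt.Sprintf("test_gen_%s_%s_%s", sc.Name, e.To, t.Slug),
					Scenario:    sc.Name,
					FaultTarget: target,
					Syscall:     t.Syscall,
					Action:      t.Action,
					Errno:       t.Errno,
					Delay:       t.Delay,
				})
			}
		}
	}
	return out
}

func main() {
	scenarios := []ScenarioInfo{{Name: "order_flow", Services: []string{"api"}}}
	edges := []DependencyEdge{{"api", "db"}, {"api", "cache"}}
	for _, m := range buildMatrix(scenarios, edges) {
		fmt.Println(m.Name, "->", m.FaultTarget, m.Syscall)
	}
}
```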

codegen.go:

type GenerateOpts struct {
    Scenario  string // filter to one scenario
    Service   string // filter to one dependency
    Category  string // filter to one category
    DryRun    bool
    Source    string // source .star filename (for load())
}

func Generate(mutations []Mutation, analysis *Analysis, opts GenerateOpts) string
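Rendering a single mutation into a Starlark test function might look like the sketch below. renderTest is a hypothetical helper and Mutation is reduced to the fields it needs. It uses Go's %q verb for string literals, which matches Starlark's double-quoted form for plain ASCII values like the errnos and labels shown in this document; other inputs would need a proper Starlark quoting routine.

```go
package main

import (
	"fmt"
	"strings"
)

type Mutation struct {
	Name, Scenario, Description              string
	FaultTarget, Syscall, Action             string
	Errno, Delay, Label                      string
	UsePartition                             bool
	PartitionA, PartitionB                   string
}

// renderTest emits one generated test: a def line, a docstring, and a
// single fault() or partition() call that runs the original scenario.
func renderTest(m Mutation) string {
	var b strings.Builder
	fmt.Fprintf(&b, "def %s():\n", m.Name)
	fmt.Fprintf(&b, "    \"\"\"%s\"\"\"\n", m.Description)
	if m.UsePartition {
		fmt.Fprintf(&b, "    partition(%s, %s, run=%s)\n", m.PartitionA, m.PartitionB, m.Scenario)
		return b.String()
	}
	arg := m.Errno
	if m.Action == "delay" {
		arg = m.Delay // delay() takes a duration instead of an errno
	}
	fmt.Fprintf(&b, "    fault(%s, %s=%s(%q, label=%q), run=%s)\n",
		m.FaultTarget, m.Syscall, m.Action, arg, m.Label, m.Scenario)
	return b.String()
}

func main() {
	fmt.Print(renderTest(Mutation{
		Name:        "test_gen_order_flow_db_disk_full",
		Scenario:    "order_flow",
		Description: "order_flow with db disk full.",
		FaultTarget: "db",
		Syscall:     "write",
		Action:      "deny",
		Errno:       "ENOSPC",
		Label:       "disk full",
	}))
}
```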

scenario() Builtin Implementation

In internal/star/builtins.go:

func (rt *Runtime) builtinScenario(thread *starlark.Thread, fn *starlark.Builtin,
    args starlark.Tuple, kwargs []starlark.Tuple) (starlark.Value, error) {

    if len(args) != 1 {
        return nil, fmt.Errorf("scenario() requires exactly one callable")
    }
    callable, ok := args[0].(starlark.Callable)
    if !ok {
        return nil, fmt.Errorf("scenario() argument must be a callable")
    }

    // Register as scenario for the generator.
    rt.scenarios = append(rt.scenarios, ScenarioRegistration{
        Name: callable.Name(),
        Fn:   callable,
    })

    // Also register as a test (happy path should always run).
    rt.registerTest("test_"+callable.Name(), callable)

    return starlark.None, nil
}

The generator reads rt.scenarios to know which functions to mutate.

Scenario Extraction for Codegen

The generator needs the scenario function name (not its body) because the generated code uses run=order_flow — calling the original function. The load() statement imports both service variables and scenario functions.

load("faultbox.star", "api", "db", "cache", "order_flow", "health_check")
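Building that header is a matter of quoting the source filename followed by every imported name. A minimal sketch, with loadHeader as a hypothetical helper:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// loadHeader builds the load() statement: the source file first, then the
// service variables, then the scenario functions, each as a quoted string.
func loadHeader(source string, services, scenarios []string) string {
	quoted := []string{strconv.Quote(source)}
	for _, n := range services {
		quoted = append(quoted, strconv.Quote(n))
	}
	for _, n := range scenarios {
		quoted = append(quoted, strconv.Quote(n))
	}
	return "load(" + strings.Join(quoted, ", ") + ")"
}

func main() {
	fmt.Println(loadHeader(
		"faultbox.star",
		[]string{"api", "db", "cache"},
		[]string{"order_flow", "health_check"},
	))
}
```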

Deduplication

The generator scans existing .star source for fault() calls to avoid generating duplicates:

// If source already contains:
//   fault(db, write=deny("EIO"), run=order_flow)
// Then don't generate:
//   test_gen_order_flow_db_io_error

Matching is by: scenario name + fault target + syscall + errno.
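The matching key can be modeled as a comparable struct used as a map key. In the sketch below, faultKey and coveredFaults are hypothetical names, and scanning the source with a regular expression is an assumption made for brevity; a real implementation might walk the Starlark syntax tree instead. The pattern only recognizes single-line fault() calls with one syscall argument.

```go
package main

import (
	"fmt"
	"regexp"
)

// faultKey holds the four fields used for matching.
type faultKey struct{ Scenario, Target, Syscall, Errno string }

// faultCall matches e.g. fault(db, write=deny("EIO", label="x"), run=order_flow).
// Groups: 1 target, 2 syscall, 3 action, 4 first string argument, 5 scenario.
var faultCall = regexp.MustCompile(
	`fault\((\w+),\s*(\w+)=(\w+)\("([^"]*)"[^)]*\),\s*run=(\w+)\)`)

// coveredFaults collects the keys of all fault() calls found in src.
func coveredFaults(src string) map[faultKey]bool {
	covered := map[faultKey]bool{}
	for _, m := range faultCall.FindAllStringSubmatch(src, -1) {
		k := faultKey{Scenario: m[5], Target: m[1], Syscall: m[2]}
		if m[3] == "deny" {
			k.Errno = m[4] // delay() has no errno, so the field stays empty
		}
		covered[k] = true
	}
	return covered
}

func main() {
	src := `
def test_db_io_error():
    fault(db, write=deny("EIO"), run=order_flow)
`
	covered := coveredFaults(src)
	fmt.Println(covered[faultKey{"order_flow", "db", "write", "EIO"}])
	fmt.Println(covered[faultKey{"order_flow", "db", "fsync", "EIO"}])
}
```

Because the syscall is part of the key, an existing write=deny("EIO") test does not suppress the fsync=deny("EIO") mutation, even though both use the same errno.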

Use Cases

1. New Project Bootstrap

faultbox init --name api --port 8080 ./api-svc --output faultbox.star
# Edit faultbox.star: add dependencies, write scenario()
faultbox generate faultbox.star
# → order_flow.faults.star
faultbox test order_flow.faults.star
# Review failures, add assertions, commit

2. New Dependency Added

# Added Redis cache to the topology
faultbox generate faultbox.star --service cache
# → order_flow.faults.star (regenerated with cache mutations added)

3. Post-Incident

# Incident: payment gateway timeout cascaded to checkout
grep -A3 "slow" order_flow.faults.star
# Find: test_gen_order_flow_gateway_slow was generated but never kept
# Uncomment it, add assertions, commit

4. Coverage Review

faultbox generate faultbox.star --dry-run
# Output:
#   order_flow × db: 7 mutations (3 already covered)
#   order_flow × cache: 4 mutations (0 already covered)
#   health_check × db: 3 mutations (0 already covered)
#   Total: 14 mutations, 11 new
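The dry-run summary is a grouped count over the mutation list. A sketch of how those numbers could be computed, assuming each mutation carries a Covered flag set during deduplication (that flag, summaryRow, and summarize are hypothetical and not part of the Mutation struct defined earlier):

```go
package main

import "fmt"

type Mutation struct {
	Scenario string
	Target   string
	Covered  bool // assumed: set when an equivalent fault already exists
}

type summaryRow struct {
	Scenario, Target string
	Total, Covered   int
}

// summarize groups mutations by scenario and target in first-seen order
// and returns the per-group rows, the overall total, and the count of new ones.
func summarize(ms []Mutation) (rows []summaryRow, total, fresh int) {
	index := map[[2]string]int{}
	for _, m := range ms {
		key := [2]string{m.Scenario, m.Target}
		i, ok := index[key]
		if !ok {
			i = len(rows)
			index[key] = i
			rows = append(rows, summaryRow{Scenario: m.Scenario, Target: m.Target})
		}
		rows[i].Total++
		total++
		if m.Covered {
			rows[i].Covered++
		} else {
			fresh++
		}
	}
	return rows, total, fresh
}

func main() {
	ms := []Mutation{
		{"order_flow", "db", true},
		{"order_flow", "db", false},
		{"order_flow", "cache", false},
	}
	rows, total, fresh := summarize(ms)
	for _, r := range rows {
		fmt.Printf("  %s × %s: %d mutations (%d already covered)\n",
			r.Scenario, r.Target, r.Total, r.Covered)
	}
	fmt.Printf("  Total: %d mutations, %d new\n", total, fresh)
}
```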

File Structure

project/
├── faultbox.star                  # topology + happy paths (hand-written)
├── order_flow.faults.star         # generated, then curated
├── health_check.faults.star       # generated, then curated
└── ...

faultbox.star:

# Topology + happy paths (hand-written, committed)
db = service("db", ...)
api = service("api", ..., depends_on=[db])

def order_flow():
    api.post(path="/orders", body='{"sku": "widget"}')
    resp = api.get(path="/orders/1")
    assert_eq(resp.data["status"], "created")

scenario(order_flow)

order_flow.faults.star:

# Generated by: faultbox generate faultbox.star
load("faultbox.star", "api", "db", "order_flow")

def test_gen_order_flow_db_down():
    """order_flow with db down."""
    fault(api, connect=deny("ECONNREFUSED", label="db down"), run=order_flow)

def test_gen_order_flow_db_slow():
    """order_flow with db slow."""
    fault(db, write=delay("5s", label="db slow"), run=order_flow)

# ... more mutations

Running the tests:

# Run happy paths only:
faultbox test faultbox.star

# Run faults for one scenario:
faultbox test order_flow.faults.star

# Run everything:
faultbox test faultbox.star order_flow.faults.star health_check.faults.star

Rollout Plan

  1. scenario() builtin — register happy paths, run as tests
  2. load() support in Starlark runtime — prerequisite for separate files
  3. internal/generate/analyzer.go — topology + scenario extraction
  4. internal/generate/matrix.go — failure matrix from edges × faults
  5. internal/generate/codegen.go — Starlark output with load() header
  6. cmd/faultbox/main.go: generate subcommand
  7. Tests — unit tests for analyzer, matrix, codegen
  8. Docs — CLI reference, tutorial chapter