Design Document: Rule-Based Failure Scenario Generator

Problem

Users write happy-path tests but manually add failure scenarios one by one. This requires knowing which syscalls to target, which errnos to use, and which dependencies can fail. Most engineers don’t think in syscalls — they miss failure modes that are obvious in hindsight.

Worse, a generator cannot close this gap by itself: it cannot guess which API calls to make or which status codes a service should return under failure. Only the user knows that.

Goal: The user describes how the system works when everything is fine, using scenario(). The generator takes that happy path and systematically wraps it in each fault from its failure matrix, producing mutations, not inventions.

Core Model: Happy Path Mutation

User writes:    scenario(fn)     → "this is how my system works"
Generator:      mutate(scenario) → "here's everything that can go wrong"

The generator never invents API calls or assertions. It takes the user’s exact happy path body and wraps it in fault scopes. The user then reviews each mutation and adds assertions for the behavior they want.

The `scenario()` Builtin

# faultbox.star

db = service("db", ..., interface("main", "postgres", 5432))
cache = service("cache", ..., interface("main", "redis", 6379))
api = service("api", ..., depends_on=[db, cache])

# Happy path — describes how the system works when healthy.
# Registered with scenario() so the generator knows to mutate it.
def order_flow():
    api.post(path="/orders", body='{"sku": "widget", "qty": 1}')
    api.post(path="/payments", body='{"order_id": 1, "amount": 100}')
    resp = api.get(path="/orders/1")
    assert_eq(resp.data["status"], "paid")

scenario(order_flow)

# You can register multiple scenarios:
def health_check():
    resp = api.get(path="/health")
    assert_eq(resp.status, 200)

scenario(health_check)

scenario(fn) does two things:

1. Registers the function as a happy path for the generator
2. Runs it as a test (same as test_* functions), so happy paths are always verified

What the Generator Produces

faultbox generate faultbox.star --output failures.star

For each registered scenario() × each dependency × each failure mode, the generator produces a test that runs the exact same body under fault:

# failures.star — generated by: faultbox generate faultbox.star
load("faultbox.star", "api", "db", "cache", "order_flow", "health_check")

# --- order_flow × db failures ---

def test_gen_order_flow_db_down():
    """order_flow with db connection refused."""
    fault(api, connect=deny("ECONNREFUSED", label="db down"), run=order_flow)

def test_gen_order_flow_db_slow():
    """order_flow with db writes delayed 5s."""
    fault(db, write=delay("5s", label="db slow"), run=order_flow)

def test_gen_order_flow_db_disk_full():
    """order_flow with db disk full."""
    fault(db, write=deny("ENOSPC", label="disk full"), run=order_flow)

def test_gen_order_flow_db_io_error():
    """order_flow with db disk I/O error."""
    fault(db, write=deny("EIO", label="disk I/O error"), run=order_flow)

def test_gen_order_flow_db_fsync_failure():
    """order_flow with db fsync failure."""
    fault(db, fsync=deny("EIO", label="fsync failure"), run=order_flow)

def test_gen_order_flow_db_connection_reset():
    """order_flow with db dropping mid-request."""
    fault(api, read=deny("ECONNRESET", label="db connection reset"), run=order_flow)

def test_gen_order_flow_db_partition():
    """order_flow with network partition between api and db."""
    partition(api, db, run=order_flow)

# --- order_flow × cache failures ---

def test_gen_order_flow_cache_down():
    """order_flow with cache connection refused."""
    fault(api, connect=deny("ECONNREFUSED", label="cache down"), run=order_flow)

def test_gen_order_flow_cache_slow():
    """order_flow with cache delayed 5s."""
    fault(cache, write=delay("5s", label="cache slow"), run=order_flow)

# --- health_check × db failures ---

def test_gen_health_check_db_down():
    """health_check with db connection refused."""
    fault(api, connect=deny("ECONNREFUSED", label="db down"), run=health_check)

# ... etc

Key properties:

No generated assertions — the happy path’s own assertions will either pass (system handles the fault) or fail (system doesn’t). The user reviews failures and decides what’s correct behavior.
Happy path body preserved exactly — run=order_flow passes the original function, same API calls, same sequence.
Each mutation = one fault scope — simple, isolated, debuggable.

User Workflow

1. Write happy paths with scenario()
2. faultbox generate → produces mutations
3. Run mutations → see which fail
4. For each failure:
   a. Expected failure → adjust assertion (change assert_eq to assert_true(resp.status >= 500))
   b. Unexpected failure → found a bug, fix the code
   c. Irrelevant → delete the generated test
5. Commit the kept mutations alongside happy paths
6. Regenerate when topology changes (new dependency, new scenario)

CLI

# Generate faults for all scenarios (one file per scenario):
faultbox generate faultbox.star
# → order_flow.faults.star
# → health_check.faults.star

# Generate faults for specific scenario:
faultbox generate faultbox.star --scenario order_flow
# → order_flow.faults.star

# Override output name:
faultbox generate faultbox.star --scenario order_flow --output custom.star

# Filter by dependency or category:
faultbox generate faultbox.star --service db
faultbox generate faultbox.star --category network

# Dry run — list mutations without writing files:
faultbox generate faultbox.star --dry-run

Default output naming

When --output is not specified, the generator writes one file per registered scenario using the convention:

<scenario_name>.faults.star

Scenario function	Output file
`order_flow`	`order_flow.faults.star`
`health_check`	`health_check.faults.star`
`checkout_flow`	`checkout_flow.faults.star`

When --output is specified, all mutations go into a single file. When --scenario is specified without --output, only that scenario’s file is written.
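
The naming rules above can be sketched as a small planning function. This is only an illustration under stated assumptions: `planOutputs` and its parameters are hypothetical names, with `only` standing in for --scenario and `output` for --output.

```go
package main

import (
	"fmt"
	"sort"
)

// planOutputs maps each output file to the scenarios whose mutations it
// would contain. Hypothetical helper: "only" mirrors --scenario and
// "output" mirrors --output; an empty string means the flag was not given.
func planOutputs(scenarios []string, only, output string) map[string][]string {
	plan := map[string][]string{}
	for _, sc := range scenarios {
		if only != "" && sc != only {
			continue // filtered out by --scenario
		}
		file := output
		if file == "" {
			file = sc + ".faults.star" // default: <scenario_name>.faults.star
		}
		plan[file] = append(plan[file], sc)
	}
	return plan
}

func main() {
	plan := planOutputs([]string{"order_flow", "health_check"}, "", "")
	files := make([]string, 0, len(plan))
	for f := range plan {
		files = append(files, f)
	}
	sort.Strings(files) // map iteration order is random, so sort for display
	for _, f := range files {
		fmt.Println(f, plan[f])
	}
}
```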

Failure Matrix

For each dependency edge, the generator applies these fault templates:

Network failures (all protocols)

Fault	Errno or action	Label	Severity
connect refused	ECONNREFUSED	`<dep> down`	critical
connect slow	delay 5s	`<dep> slow`	high
connection reset	ECONNRESET	`<dep> connection reset`	high
network partition	partition()	`<dep> partitioned`	critical

Disk failures (services with storage)

Fault	Errno	Label	Severity
write I/O error	EIO	`disk I/O error`	critical
disk full	ENOSPC	`disk full`	high
fsync failure	EIO	`fsync failure`	critical
read-only FS	EROFS	`read-only filesystem`	medium

Protocol-specific failures

Protocol	Extra faults
postgres	read delay (query timeout)
redis	read delay (cache timeout)
kafka	write delay (publish backpressure)
http	read delay (slow upstream response)
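
One way to read the three tables above is as plain data plus a selection function. The sketch below assumes hypothetical names (`FaultTemplate`, `templatesFor`) and a `hasStorage` flag; the design does not say how the generator decides that a service has storage, nor does it give a delay duration or severity for the protocol-specific faults, so those are left empty.

```go
package main

import (
	"fmt"
	"strings"
)

// FaultTemplate mirrors one row of the failure matrix (hypothetical type).
type FaultTemplate struct {
	Fault    string // e.g. "connect refused"
	Action   string // errno, "delay 5s", or "partition()"
	Label    string // may contain the placeholder "<dep>"
	Severity string // empty where the design gives none
}

var networkFaults = []FaultTemplate{
	{"connect refused", "ECONNREFUSED", "<dep> down", "critical"},
	{"connect slow", "delay 5s", "<dep> slow", "high"},
	{"connection reset", "ECONNRESET", "<dep> connection reset", "high"},
	{"network partition", "partition()", "<dep> partitioned", "critical"},
}

var diskFaults = []FaultTemplate{
	{"write I/O error", "EIO", "disk I/O error", "critical"},
	{"disk full", "ENOSPC", "disk full", "high"},
	{"fsync failure", "EIO", "fsync failure", "critical"},
	{"read-only FS", "EROFS", "read-only filesystem", "medium"},
}

// protocolFaults holds the extra fault listed for each protocol.
var protocolFaults = map[string]FaultTemplate{
	"postgres": {"read delay", "delay", "query timeout", ""},
	"redis":    {"read delay", "delay", "cache timeout", ""},
	"kafka":    {"write delay", "delay", "publish backpressure", ""},
	"http":     {"read delay", "delay", "slow upstream response", ""},
}

// templatesFor returns the templates that apply to one dependency, with
// "<dep>" in each label replaced by the dependency's name.
func templatesFor(dep, protocol string, hasStorage bool) []FaultTemplate {
	out := append([]FaultTemplate{}, networkFaults...) // copy, keep globals intact
	if hasStorage {
		out = append(out, diskFaults...)
	}
	if extra, ok := protocolFaults[protocol]; ok {
		out = append(out, extra)
	}
	for i := range out {
		out[i].Label = strings.ReplaceAll(out[i].Label, "<dep>", dep)
	}
	return out
}

func main() {
	for _, t := range templatesFor("db", "postgres", true) {
		fmt.Printf("%-18s %-13s %s\n", t.Fault, t.Action, t.Label)
	}
}
```

Keeping the matrix as data rather than code would make it straightforward to filter by category or to add a protocol later.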

Technical Implementation

Architecture

cmd/faultbox/main.go
    └── generateCmd(args)
            │
            ├── Load .star file (star.New + LoadFile)
            │
            ├── internal/generate/analyzer.go
            │   └── Analyze(rt) → Analysis
            │       ├── services, interfaces, protocols
            │       ├── dependency edges
            │       ├── registered scenarios (from scenario() calls)
            │       └── existing fault coverage
            │
            ├── internal/generate/matrix.go
            │   └── BuildMatrix(analysis) → []Mutation
            │       ├── for each scenario × each edge × each fault
            │       └── deduplication against existing faults
            │
            ├── internal/generate/codegen.go
            │   └── Generate(mutations, analysis, opts) → string
            │       ├── load() header
            │       ├── test function per mutation
            │       └── comments and labels
            │
            └── Output (stdout / file)

New Package: `internal/generate/`

analyzer.go:

type Analysis struct {
    Services   []ServiceInfo
    Edges      []DependencyEdge
    Scenarios  []ScenarioInfo     // from scenario() registrations
    Covered    []CoveredFault     // already tested fault combinations
}

type ScenarioInfo struct {
    Name     string   // function name (e.g., "order_flow")
    VarName  string   // Starlark variable name
    Services []string // services referenced in the body
}

type DependencyEdge struct {
    From     string  // service that depends
    To       string  // service depended on
    Via      string  // "depends_on", "env"
    Protocol string  // protocol of the target interface
}

func Analyze(rt *star.Runtime) (*Analysis, error)
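
As an illustration of the edge-extraction part of Analyze, here is a minimal sketch that derives DependencyEdge values from depends_on declarations. The field layout of `ServiceInfo` and the `Interface` type are assumptions (the design names ServiceInfo but does not list its fields), as is the choice to take the protocol from the target's first interface.

```go
package main

import "fmt"

// Interface and ServiceInfo are hypothetical shapes for this sketch.
type Interface struct {
	Name, Protocol string
	Port           int
}

type ServiceInfo struct {
	Name       string
	Interfaces []Interface
	DependsOn  []string
}

type DependencyEdge struct {
	From, To, Via, Protocol string
}

// edgesFromDependsOn emits one edge per depends_on entry. Assumption: the
// edge's protocol is that of the target service's first interface.
func edgesFromDependsOn(services []ServiceInfo) []DependencyEdge {
	byName := make(map[string]ServiceInfo, len(services))
	for _, s := range services {
		byName[s.Name] = s
	}
	var edges []DependencyEdge
	for _, s := range services {
		for _, dep := range s.DependsOn {
			target, ok := byName[dep]
			if !ok {
				continue // unknown dependency; a real analyzer would likely report it
			}
			proto := ""
			if len(target.Interfaces) > 0 {
				proto = target.Interfaces[0].Protocol
			}
			edges = append(edges, DependencyEdge{
				From: s.Name, To: dep, Via: "depends_on", Protocol: proto,
			})
		}
	}
	return edges
}

func main() {
	services := []ServiceInfo{
		{Name: "db", Interfaces: []Interface{{"main", "postgres", 5432}}},
		{Name: "cache", Interfaces: []Interface{{"main", "redis", 6379}}},
		{Name: "api", DependsOn: []string{"db", "cache"}},
	}
	for _, e := range edgesFromDependsOn(services) {
		fmt.Printf("%s -> %s (%s, via %s)\n", e.From, e.To, e.Protocol, e.Via)
	}
}
```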

matrix.go:

type Mutation struct {
    Name          string // test_gen_<scenario>_<target>_<fault>
    Scenario      string // scenario function name
    Category      string // "network", "disk"
    Description   string
    FaultTarget   string // service to fault
    Syscall       string // "connect", "write", "fsync"
    Action        string // "deny", "delay"
    Errno         string // "ECONNREFUSED", "EIO", ""
    Delay         string // "5s", ""
    Label         string // human-readable
    UsePartition  bool   // use partition() instead of fault()
    PartitionA    string // first service for partition
    PartitionB    string // second service for partition
    Severity      string
}

func BuildMatrix(analysis *Analysis) []Mutation
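
The core of BuildMatrix is a three-way cross product. The following sketch uses trimmed-down, hypothetical types (`Edge`, `faultSpec`, and a reduced `Mutation`) rather than the full structs above. The `OnCaller` flag is an inference from the generated examples, where connect and read faults are applied to the depending service (api) while write and fsync faults are applied to the dependency itself (db).

```go
package main

import "fmt"

type Edge struct{ From, To string }

// faultSpec is a hypothetical, reduced fault template.
type faultSpec struct {
	Slug     string // suffix of the generated test name
	Syscall  string
	Action   string // "deny" or "delay"
	Errno    string // set for deny faults
	Delay    string // set for delay faults
	OnCaller bool   // true: fault the depending service; false: fault the dependency
}

// Mutation keeps only the fields this sketch needs.
type Mutation struct {
	Name, Scenario, FaultTarget, Syscall, Action, Errno, Delay string
}

// buildMatrix crosses every scenario with every edge and every fault.
func buildMatrix(scenarios []string, edges []Edge, faults []faultSpec) []Mutation {
	var out []Mutation
	for _, sc := range scenarios {
		for _, e := range edges {
			for _, f := range faults {
				target := e.To
				if f.OnCaller {
					target = e.From
				}
				out = append(out, Mutation{
					// test_gen_<scenario>_<target>_<fault>
					Name:        fmt.Sprintf("test_gen_%s_%s_%s", sc, e.To, f.Slug),
					Scenario:    sc,
					FaultTarget: target,
					Syscall:     f.Syscall,
					Action:      f.Action,
					Errno:       f.Errno,
					Delay:       f.Delay,
				})
			}
		}
	}
	return out
}

func main() {
	scenarios := []string{"order_flow", "health_check"}
	edges := []Edge{{"api", "db"}, {"api", "cache"}}
	faults := []faultSpec{
		{Slug: "down", Syscall: "connect", Action: "deny", Errno: "ECONNREFUSED", OnCaller: true},
		{Slug: "slow", Syscall: "write", Action: "delay", Delay: "5s"},
	}
	for _, m := range buildMatrix(scenarios, edges, faults) {
		fmt.Println(m.Name, "-> fault on", m.FaultTarget)
	}
}
```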

codegen.go:

type GenerateOpts struct {
    Scenario  string // filter to one scenario
    Service   string // filter to one dependency
    Category  string // filter to one category
    DryRun    bool
    Source    string // source .star filename (for load())
}

func Generate(mutations []Mutation, analysis *Analysis, opts GenerateOpts) string
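
To make the codegen step concrete, here is a minimal sketch of rendering the load() header and one test function per mutation, in the shape of the generated files shown earlier. `renderLoad` and `renderTest` are hypothetical helper names. The sketch quotes strings with Go's %q verb, which is assumed to be acceptable as Starlark string syntax for plain ASCII names and labels.

```go
package main

import (
	"fmt"
	"strings"
)

// Mutation carries the fields of the design's Mutation struct that
// rendering needs.
type Mutation struct {
	Name, Scenario, Description string
	FaultTarget, Syscall        string
	Action, Errno, Delay, Label string
	UsePartition                bool
	PartitionA, PartitionB      string
}

// renderLoad produces the load() header importing services and scenarios.
func renderLoad(source string, names []string) string {
	parts := []string{fmt.Sprintf("%q", source)}
	for _, n := range names {
		parts = append(parts, fmt.Sprintf("%q", n))
	}
	return "load(" + strings.Join(parts, ", ") + ")\n"
}

// renderTest produces one generated test function as Starlark source.
func renderTest(m Mutation) string {
	var b strings.Builder
	fmt.Fprintf(&b, "def %s():\n", m.Name)
	fmt.Fprintf(&b, "    \"\"\"%s\"\"\"\n", m.Description)
	if m.UsePartition {
		fmt.Fprintf(&b, "    partition(%s, %s, run=%s)\n", m.PartitionA, m.PartitionB, m.Scenario)
		return b.String()
	}
	arg := m.Errno // deny faults take an errno
	if m.Action == "delay" {
		arg = m.Delay // delay faults take a duration
	}
	fmt.Fprintf(&b, "    fault(%s, %s=%s(%q, label=%q), run=%s)\n",
		m.FaultTarget, m.Syscall, m.Action, arg, m.Label, m.Scenario)
	return b.String()
}

func main() {
	fmt.Print(renderLoad("faultbox.star", []string{"api", "db", "order_flow"}))
	fmt.Println()
	fmt.Print(renderTest(Mutation{
		Name:        "test_gen_order_flow_db_down",
		Scenario:    "order_flow",
		Description: "order_flow with db connection refused.",
		FaultTarget: "api",
		Syscall:     "connect",
		Action:      "deny",
		Errno:       "ECONNREFUSED",
		Label:       "db down",
	}))
}
```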

`scenario()` Builtin Implementation

In internal/star/builtins.go:

func (rt *Runtime) builtinScenario(thread *starlark.Thread, fn *starlark.Builtin,
    args starlark.Tuple, kwargs []starlark.Tuple) (starlark.Value, error) {

    // scenario() takes exactly one positional argument and no keyword arguments.
    if len(args) != 1 || len(kwargs) != 0 {
        return nil, fmt.Errorf("scenario() takes exactly one positional argument (a callable)")
    }
    callable, ok := args[0].(starlark.Callable)
    if !ok {
        return nil, fmt.Errorf("scenario() argument must be a callable, got %s", args[0].Type())
    }

    // Register as scenario for the generator.
    rt.scenarios = append(rt.scenarios, ScenarioRegistration{
        Name: callable.Name(),
        Fn:   callable,
    })

    // Also register as a test (happy path should always run).
    rt.registerTest("test_"+callable.Name(), callable)

    return starlark.None, nil
}

The generator reads rt.scenarios to know which functions to mutate.

Scenario Extraction for Codegen

The generator needs the scenario function name (not its body) because the generated code uses run=order_flow — calling the original function. The load() statement imports both service variables and scenario functions.

load("faultbox.star", "api", "db", "cache", "order_flow", "health_check")

Deduplication

The generator scans existing .star source for fault() calls to avoid generating duplicates:

// If source already contains:
//   fault(db, write=deny("EIO"), run=order_flow)
// Then don't generate:
//   test_gen_order_flow_db_io_error

Matching is by: scenario name + fault target + syscall + errno.
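
The matching rule can be sketched as a comparable key used to filter the matrix. `faultKey` and `filterNew` are hypothetical names, and how the covered set is extracted from existing .star source is out of scope here. One thing this sketch makes visible: delay faults have an empty errno, so under the stated rule two delays on the same syscall with different durations would share a key.

```go
package main

import "fmt"

// faultKey is the dedup key: scenario name + fault target + syscall + errno.
type faultKey struct {
	Scenario, Target, Syscall, Errno string
}

// Mutation keeps only the fields needed for matching.
type Mutation struct {
	Name, Scenario, FaultTarget, Syscall, Errno string
}

// filterNew drops mutations whose key is already covered by a
// hand-written fault() call.
func filterNew(all []Mutation, covered map[faultKey]bool) []Mutation {
	var out []Mutation
	for _, m := range all {
		k := faultKey{m.Scenario, m.FaultTarget, m.Syscall, m.Errno}
		if !covered[k] {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	// Covered by an existing: fault(db, write=deny("EIO"), run=order_flow)
	covered := map[faultKey]bool{
		{"order_flow", "db", "write", "EIO"}: true,
	}
	all := []Mutation{
		{"test_gen_order_flow_db_io_error", "order_flow", "db", "write", "EIO"},
		{"test_gen_order_flow_db_disk_full", "order_flow", "db", "write", "ENOSPC"},
	}
	for _, m := range filterNew(all, covered) {
		fmt.Println("generate:", m.Name)
	}
}
```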

Use Cases

1. New Project Bootstrap

faultbox init --name api --port 8080 ./api-svc --output faultbox.star
# Edit faultbox.star: add dependencies, write scenario()
faultbox generate faultbox.star
# → order_flow.faults.star
faultbox test order_flow.faults.star
# Review failures, add assertions, commit

2. New Dependency Added

# Added Redis cache to the topology
faultbox generate faultbox.star --service cache
# → order_flow.faults.star (regenerated with cache mutations added)

3. Post-Incident

# Incident: payment gateway timeout cascaded to checkout
grep -A3 "slow" order_flow.faults.star
# Find: test_gen_order_flow_gateway_slow was generated but never kept
# Uncomment it, add assertions, commit

4. Coverage Review

faultbox generate faultbox.star --dry-run
# Output:
#   order_flow × db: 7 mutations (3 already covered)
#   order_flow × cache: 4 mutations (0 already covered)
#   health_check × db: 3 mutations (0 already covered)
#   Total: 14 mutations, 11 new

File Structure

project/
├── faultbox.star                  # topology + happy paths (hand-written)
├── order_flow.faults.star         # generated, then curated
├── health_check.faults.star       # generated, then curated
└── ...

# faultbox.star — topology + happy paths (hand-written, committed)
db = service("db", ...)
api = service("api", ..., depends_on=[db])

def order_flow():
    api.post(path="/orders", body='{"sku": "widget"}')
    resp = api.get(path="/orders/1")
    assert_eq(resp.data["status"], "created")

scenario(order_flow)

# order_flow.faults.star — generated by: faultbox generate faultbox.star
load("faultbox.star", "api", "db", "order_flow")

def test_gen_order_flow_db_down():
    """order_flow with db down."""
    fault(api, connect=deny("ECONNREFUSED", label="db down"), run=order_flow)

def test_gen_order_flow_db_slow():
    """order_flow with db slow."""
    fault(db, write=delay("5s", label="db slow"), run=order_flow)

# ... more mutations

# Run happy paths only:
faultbox test faultbox.star

# Run faults for one scenario:
faultbox test order_flow.faults.star

# Run everything:
faultbox test faultbox.star order_flow.faults.star health_check.faults.star

Rollout Plan

1. scenario() builtin: register happy paths, run as tests
2. load() support in Starlark runtime: prerequisite for separate files
3. internal/generate/analyzer.go: topology + scenario extraction
4. internal/generate/matrix.go: failure matrix from edges × faults
5. internal/generate/codegen.go: Starlark output with load() header
6. cmd/faultbox/main.go: generate subcommand
7. Tests: unit tests for analyzer, matrix, codegen
8. Docs: CLI reference, tutorial chapter