Reliability
Exit Codes (machine surface)
sf headless is the current machine-surface command. These codes describe the
non-interactive runner and are independent from output format: text, one JSON
result, and streaming JSONL use the same completion semantics.
| Code | Meaning |
|---|---|
| 0 | Success — unit or session completed cleanly |
| 1 | Error or timeout |
| 10 | Blocked — LLM called an interactive tool that requires user input; parent must respond or abort |
| 11 | Cancelled — SIGINT or SIGTERM received |
| 12 | Reload — agent requested restart-with-resume on the same session |
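
A parent process that wraps the runner could branch on these codes roughly as follows. This is a minimal sketch: the numeric codes come from the table above, but the action names and the `actionForExitCode` helper are hypothetical, and treating unlisted codes as failures is an assumption rather than documented behavior.

```typescript
// Actions a wrapping parent might take; these names are illustrative only.
type ParentAction =
  | "done"
  | "fail"
  | "respond-or-abort"
  | "stop"
  | "restart-with-resume";

// Exit codes copied from the table above.
const EXIT_ACTIONS: Record<number, ParentAction> = {
  0: "done",
  1: "fail",
  10: "respond-or-abort",
  11: "stop",
  12: "restart-with-resume",
};

// Assumption: any code not in the table is handled like a generic error.
function actionForExitCode(code: number): ParentAction {
  return EXIT_ACTIONS[code] ?? "fail";
}
```

Keeping the mapping in one table makes it easy to audit against the documented codes when they change.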
Failure Modes and Recovery
Process crash mid-unit
Detection: Lock file in .sf/ is present on next launch; RPC child process is gone.
Recovery path (src/resources/extensions/sf/auto-recovery.ts):
- Read the surviving session JSONL from ~/.sf/sessions/<session-id>/
- Synthesize a recovery briefing from every tool call recorded on disk
- Resume the LLM mid-unit with the briefing as context — no state is lost
- If the session JSONL is unreadable, fall back to starting the unit fresh
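
The recovery steps above can be sketched as a pure function over the JSONL text. The record shape (`type`, `tool`, `summary`) and the `buildRecoveryBriefing` name are assumptions made for illustration; the real session schema and the logic in auto-recovery.ts are not shown in this document. For simplicity the sketch treats any unparseable line as an unreadable log.

```typescript
// Hypothetical record shape; the actual session JSONL schema may differ.
interface SessionRecord {
  type: string;
  tool?: string;
  summary?: string;
}

// Returns a briefing, or null when the log is unreadable so the caller
// can fall back to starting the unit fresh.
function buildRecoveryBriefing(jsonl: string): string | null {
  const toolCalls: SessionRecord[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      return null; // corrupt line: treat the whole log as unreadable
    }
    if (typeof parsed !== "object" || parsed === null) return null;
    const record = parsed as SessionRecord;
    if (record.type === "tool_call") toolCalls.push(record);
  }
  const steps = toolCalls.map(
    (r, i) => `${i + 1}. ${r.tool ?? "unknown tool"}: ${r.summary ?? "(no summary)"}`
  );
  return ["Recovered after a crash. Tool calls already recorded:", ...steps].join("\n");
}
```

A more forgiving variant might tolerate a truncated final line, since a crash can interrupt a write partway through.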
Timeout
Detection: Machine-surface parent receives no heartbeat within HEADLESS_HEARTBEAT_INTERVAL_MS (60 000 ms), or the unit wall-clock exceeds the configured timeout.
Recovery path: auto-timeout-recovery.ts writes a timeout summary, marks the unit needs_fix, and advances the loop. The parent exits with code 1 unless --max-restarts allows a retry.
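
The two detection conditions can be expressed as a single check. This is a sketch under stated assumptions: the constant name and value come from the text above, while `UnitClock`, `checkTimeout`, and the field names are hypothetical and do not reflect the actual implementation.

```typescript
// Value quoted in the detection rule above.
const HEADLESS_HEARTBEAT_INTERVAL_MS = 60_000;

type TimeoutReason = "heartbeat" | "wall-clock" | null;

// Hypothetical bookkeeping a parent might keep per unit.
interface UnitClock {
  startedAtMs: number;
  lastHeartbeatAtMs: number;
  unitTimeoutMs: number; // the configured unit timeout
}

// Returns why the unit timed out, or null if it is still within limits.
function checkTimeout(clock: UnitClock, nowMs: number): TimeoutReason {
  if (nowMs - clock.lastHeartbeatAtMs > HEADLESS_HEARTBEAT_INTERVAL_MS) {
    return "heartbeat";
  }
  if (nowMs - clock.startedAtMs > clock.unitTimeoutMs) {
    return "wall-clock";
  }
  return null;
}
```

Passing `nowMs` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.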
Stuck detection (repeating-pattern loops)
Detection (src/resources/extensions/sf/auto-stuck-detection.ts): Sliding-window analysis over the last ~10 unit results. If the same A→B→A→B pattern repeats, the loop is classified as stuck.
Recovery path: Retry once with a deep diagnostic prompt that shows the pattern. If still stuck, stop and surface the exact expected file for human inspection. Stuck state persists across session restarts.
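
One way to detect an A→B→A→B loop is to measure how long the most recent results have been alternating between two values. The sketch below is an illustration, not the logic in auto-stuck-detection.ts: representing each unit result as a string key, the window size of 10, and the minimum alternation length of 4 are all assumptions.

```typescript
// Length of the alternating (period-2) run at the end of the list.
// Returns 0 when the last two entries are equal or there are fewer than two.
function alternationLength(recent: string[]): number {
  const n = recent.length;
  if (n < 2 || recent[n - 1] === recent[n - 2]) return 0;
  let len = 2;
  // Walk backwards while each entry matches the one two positions later.
  for (let i = n - 3; i >= 0 && recent[i] === recent[i + 2]; i--) {
    len++;
  }
  return len;
}

// Assumed defaults: a window of 10 results, stuck once A, B, A, B is seen.
function isStuck(history: string[], windowSize = 10, minLength = 4): boolean {
  return alternationLength(history.slice(-windowSize)) >= minLength;
}
```

For example, `isStuck(["X", "A", "B", "A", "B"])` reports a loop, while a run of identical results does not, since that is a different failure shape from the alternation described above.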
Provider API errors (transient)
Detection: bootstrap/provider-error-resume.ts intercepts 429, 500, 503 responses.
Recovery path: Exponential backoff; re-queue the unit. If a provider is consistently unavailable, route to the configured fallback model.
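
The retry decision for a transient provider error might look like the following. Only the three status codes come from the text above; the base delay, the cap, the attempt count that triggers the fallback model, and the `decideRetry` name are assumptions chosen for illustration.

```typescript
// Statuses listed in the detection rule above.
const RETRYABLE_STATUS: readonly number[] = [429, 500, 503];

interface RetryDecision {
  retry: boolean;
  delayMs: number;
  useFallback: boolean;
}

// attempt is zero-based. baseMs, capMs and fallbackAfter are illustrative
// defaults, not values taken from the project.
function decideRetry(
  status: number,
  attempt: number,
  baseMs = 1_000,
  capMs = 60_000,
  fallbackAfter = 5
): RetryDecision {
  if (RETRYABLE_STATUS.indexOf(status) === -1) {
    return { retry: false, delayMs: 0, useFallback: false };
  }
  // Double the delay on each attempt, up to the cap.
  const delayMs = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return { retry: true, delayMs, useFallback: attempt >= fallbackAfter };
}
```

A production version would typically add random jitter to the delay so that several runners do not retry in lockstep.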
Verification gate failures
Detection: auto-verification.ts runs lint/test after each task; non-zero exit = failure.
Recovery path: Auto-retry the task up to 2× with the agent receiving full command output as context. After 2 failures the task is marked needs_fix and the loop advances with a warning.
Budget ceiling hit
Detection: auto-budget.ts tracks cumulative dollar cost; emits warnings at 75%, 80%, 90%, and halts at 100%.
Recovery path: Auto-mode pauses; user must explicitly approve resumption. The current unit is not retried.
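
The threshold behavior can be sketched as a function from cumulative spend to a status. The percentages are the ones listed above; the `BudgetStatus` type and `budgetStatus` function are hypothetical, and reporting only the highest crossed warning level is an assumption.

```typescript
// Warning levels listed above, as fractions of the ceiling.
const WARNING_THRESHOLDS: readonly number[] = [0.75, 0.8, 0.9];

type BudgetStatus =
  | { kind: "ok" }
  | { kind: "warning"; threshold: number }
  | { kind: "halt" };

function budgetStatus(spentUsd: number, ceilingUsd: number): BudgetStatus {
  if (ceilingUsd <= 0) throw new RangeError("ceilingUsd must be positive");
  const ratio = spentUsd / ceilingUsd;
  if (ratio >= 1) return { kind: "halt" };
  // Report the highest warning threshold that has been crossed.
  let crossed: number | null = null;
  for (const t of WARNING_THRESHOLDS) {
    if (ratio >= t) crossed = t;
  }
  return crossed === null ? { kind: "ok" } : { kind: "warning", threshold: crossed };
}
```

A caller would compare the returned threshold with the last one it reported so that each warning is emitted once rather than on every cost update.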
Restart Loop (machine surface)
Running sf headless autonomous --max-restarts 3 applies exponential backoff between restarts: 5 s → 10 s → 30 s (cap). After the restarts are exhausted, the parent exits with code 1. Each restart resumes through the crash-recovery path described under "Process crash mid-unit" above.
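
The delay schedule can be modelled as a lookup that clamps at the last entry. This sketch encodes the 5 s, 10 s, 30 s sequence quoted above as a table rather than a formula, because the exact formula is not stated; the `nextRestartDelayMs` name and the zero-based `restartIndex` convention are assumptions.

```typescript
// Delay schedule quoted above; later restarts stay at the 30 s cap.
const RESTART_DELAYS_MS: readonly number[] = [5_000, 10_000, 30_000];

// restartIndex is zero-based. Returns null once restarts are exhausted,
// at which point the parent would exit with code 1.
function nextRestartDelayMs(restartIndex: number, maxRestarts: number): number | null {
  if (restartIndex >= maxRestarts) return null;
  const i = Math.max(0, Math.min(restartIndex, RESTART_DELAYS_MS.length - 1));
  return RESTART_DELAYS_MS[i];
}
```

With `--max-restarts 3`, the three restarts would wait 5 s, 10 s, and 30 s in turn, and a fourth failure would end the run.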
Observability
| Signal | Location |
|---|---|
| Structured trace | .sf/traces/trace-<timestamp>.json — full session span tree with tokens, cost, duration |
| Event audit log | .sf/event-log.jsonl — every unit completion, tool call, decision save (v2 format) |
| Desktop notifications | OS-native; configurable via preferences (notifications.*) |
| Stderr progress | Human-readable machine-surface progress goes to stderr; stdout carries the batch JSON result for --output-format json or JSONL events for --output-format stream-json |
| Heartbeat | Emitted every 60 s to detect hung parent/child communication |
Release Checks
Before shipping a build:
just test # full unit test suite
just smoke-test # sf --version, sf --help, sf --print
just typecheck # tsc extensions, no emit
just lint # eslint