singularity-forge/docs/RELIABILITY.md

3.8 KiB
Raw Permalink Blame History

Reliability

Exit Codes (machine surface)

sf headless is the current machine-surface command. These codes describe the non-interactive runner and are independent from output format: text, one JSON result, and streaming JSONL use the same completion semantics.

Code Meaning
0 Success — unit or session completed cleanly
1 Error or timeout
10 Blocked — LLM called an interactive tool that requires user input; parent must respond or abort
11 Cancelled — SIGINT or SIGTERM received
12 Reload — agent requested restart-with-resume on the same session

Failure Modes and Recovery

Process crash mid-unit

Detection: Lock file in .sf/ is present on next launch; RPC child process is gone.

Recovery path (src/resources/extensions/sf/auto-recovery.ts):

  1. Read the surviving session JSONL from ~/.sf/sessions/<session-id>/
  2. Synthesize a recovery briefing from every tool call recorded on disk
  3. Resume the LLM mid-unit with the briefing as context — no state is lost
  4. If the session JSONL is unreadable, fall back to starting the unit fresh

Timeout

Detection: Machine-surface parent receives no heartbeat within HEADLESS_HEARTBEAT_INTERVAL_MS (60 000 ms), or the unit wall-clock exceeds the configured timeout.

Recovery path: auto-timeout-recovery.ts writes a timeout summary, marks the unit needs_fix, and advances the loop. The parent exits with code 1 unless --max-restarts allows a retry.

Stuck detection (repeating-pattern loops)

Detection (src/resources/extensions/sf/auto-stuck-detection.ts): Sliding-window analysis over the last ~10 unit results. If the same A→B→A→B pattern repeats, the loop is classified as stuck.

Recovery path: Retry once with a deep diagnostic prompt that shows the pattern. If still stuck, stop and surface the exact expected file for human inspection. Stuck state persists across session restarts.

Provider API errors (transient)

Detection: bootstrap/provider-error-resume.ts intercepts 429, 500, 503 responses.

Recovery path: Exponential backoff; re-queue the unit. If a provider is consistently unavailable, route to the configured fallback model.

Verification gate failures

Detection: auto-verification.ts runs lint/test after each task; non-zero exit = failure.

Recovery path: Auto-retry the task up to 2× with the agent receiving full command output as context. After 2 failures the task is marked needs_fix and the loop advances with a warning.

Budget ceiling hit

Detection: auto-budget.ts tracks cumulative dollar cost; emits warnings at 75%, 80%, 90%, and halts at 100%.

Recovery path: Auto-mode pauses; user must explicitly approve resumption. The current unit is not retried.

Restart Loop (machine surface)

sf headless autonomous --max-restarts 3 applies exponential backoff: 5 s → 10 s → 30 s (cap). After exhausting restarts the parent exits with code 1. Each restart resumes via crash recovery above.

Observability

Signal Location
Structured trace .sf/traces/trace-<timestamp>.json — full session span tree with tokens, cost, duration
Event audit log .sf/event-log.jsonl — every unit completion, tool call, decision save (v2 format)
Desktop notifications OS-native; configurable via preferences (notifications.*)
Stderr progress Human-readable machine-surface progress goes to stderr; stdout carries the batch JSON result for --output-format json or JSONL events for --output-format stream-json
Heartbeat Emitted every 60 s to detect hung parent/child communication

Release Checks

Before shipping a build:

just test          # full unit test suite
just smoke-test    # sf --version, sf --help, sf --print
just typecheck     # tsc extensions, no emit
just lint          # eslint