# sf v3 Build Plan A practical cut of the 56 NEW items in `SPEC.md` into tiers. Not every spec item is worth building for v3 — some were polish from late-stage adversarial review iterations and only matter at scale or in deployments we don't have. This document is the answer to: **what should we actually ship for v3?** ## Strategic frame — 2026-05 We are already on a strong base: Forge is the product, UOK is the kernel, and core work is gated by purpose-driven TDD plus the eight PDD fields. The goal of this build plan is not to turn SF into a generic CLI coder. The goal is to sharpen Forge's autonomous single-repo execution while borrowing the best ideas from adjacent systems. This file is a **planning document**, not a verified implementation ledger. An item can be mapped here and still be open, partial, or only folded into milestone planning. Close-out still requires code evidence, tests, and milestone artifacts that prove the behavior exists in the repo. Use external comparisons to sharpen, not to steer identity: - **Claude Code / Codex** — interaction and execution ergonomics - **Aider / gsd-2** — direct execution and repo work loop - **Plandex** — workflow decomposition and staged progress - **ACE Coder** — future multi-repo and large-scale convergence patterns, not the near-term product path for Forge The end state is not "SF plus a pile of borrowed references." The end state is that proven workflow, execution, and reliability patterns are absorbed into Forge and UOK as first-party behavior. ## High-level milestone sequence 1. **Stabilize the core.** Keep UOK, purpose-driven TDD, the eight PDD fields, and repo-local state/evidence as the non-negotiable base. 2. **Sharpen single-repo execution.** Port the highest-value correctness and workflow ideas from pi-mono, gsd-2, and adjacent CLI systems where they improve Forge without changing its product identity. 3. **Deepen autonomous reliability.** Improve evidence capture, recovery, verification, and self-improvement loops inside the single-repo boundary. 4. **Polish product surfaces.** Make the autonomous workflow legible in TUI, CLI, and docs without introducing separate planning semantics. 5. **Absorb and converge deliberately.** Fold proven external patterns into Forge/UOK as native behavior, and keep interfaces/concepts compatible with ACE Coder where useful, while letting Forge and ACE grow from their different starting points. --- ## Tier 0 — Pi-mono ports (sf: do these FIRST) Pi-mono (`badlogic/pi-mono`) has shipped 4 releases (v0.70.3 → v0.70.6) since our last vendor sync. These should be picked up before other v3 work because: - They're security/correctness fixes for code we already use. - They land cleanly (no namespace divergence — `packages/pi-*` were vendored from pi-mono with same paths and type names). - Skipping them means dragging known bugs into v3 work. Order: **security first → real bugs → infra → features**. | Order | Pi-mono fix | Why | Status | Reference | |---|---|---|---|---| | 1 | **HTML export: escape image data + session metadata** | Security — crafted session content could inject markup in exported HTML | ✅ `701ec8fb8` + dist `92c6d933c` | PRs #3819, #3883 | | 2 | **Empty `tools` array fix for providers that reject** | Correctness bug — some providers reject the call | ✅ `58b1d7c60` | PR #3650 | | 3 | **Anthropic SSE: ignore unknown proxy events** | Correctness bug — proxies emit OpenAI-style `done` events | **DEFERRED** — fix doesn't apply directly. Pi-mono moved off the SDK to a custom SSE parser (3 commits: `4b926a30a` + `e58d631c8` + `3e7ffff18`); we still use `client.messages.stream()` from `@anthropic-ai/sdk`. To get this protection we'd need to port the entire pi-mono custom-SSE refactor (~200 LOC). Real engineering effort, separate item. | issue #3708 | | 4 | **Long local-LLM SSE timeout (5-min undici cutoff)** | Correctness bug — local Ollama / LM Studio over 5 min die with UND_ERR_BODY_TIMEOUT | ✅ `d0907b6d8` | issue #3715 | | 5 | **Bedrock inference profile normalization** | Bedrock prompt-caching + adaptive-thinking checks fail on inference profile ARNs | ✅ `7c487bb60` | PR #3527 | | 6 | **Symlinked packages/resources/skills/sessions dedup** | Selectors and loaders show duplicates when paths are symlinked | TODO | PR #3818 | | 7 | **`ctx.ui.setWorkingVisible()` extension API** | Lets extensions hide the built-in working-loader row; useful for autopilot UX | TODO | issue #3674 | | 8 | **Cloudflare Workers AI provider** | New provider option (`CLOUDFLARE_API_KEY`/`CLOUDFLARE_ACCOUNT_ID`) | TODO | PR #3851 | | 9 | **Azure Cognitive Services endpoint** | Azure OpenAI Responses base URL support | TODO | PR #3799 | | **NEW** | **Port pi-mono custom Anthropic SSE parsing (replaces SDK)** | Address #3 properly: own the SSE parser like pi-mono, then unknown-event filter applies. Multi-commit refactor. | TODO | `4b926a30a` + `e58d631c8` + `3e7ffff18` | **Process for each:** read the pi-mono commit, port the fix to our `packages/pi-*` (cherry-pick should work cleanly here — same namespace as upstream); commit with `port(pi-mono): (refs )` style. **Skip from pi-mono** (not applicable to us): - `pi update --self`, `pi.dev` update endpoint, Windows self-update — we vendor; no pi-binary auto-update path - Bun startup / sandbox `/proc/self/environ` fixes — we run on Node, not Bun - Packaged session selector import — our dist layout differs --- ## Tier 0.5 — gsd-2 high-value manual ports (after Tier 0) `gsd-build/gsd-2` has 4,589 commits we're missing. Cherry-pick **fails** on virtually all of them because of our namespace divergence (`gsd_*` → `sf_*` rename, `extensions/gsd/` → `extensions/sf/` rename, prior pi-mono direct cherry-picks). These have to be **manually ported** — read the commit, write equivalent code against our paths and naming. Process for each: 1. Read the commit at `gsd-build/gsd-2` (we have it as `upstream/main`). 2. Find the equivalent file(s) in our `extensions/sf/` tree. 3. Apply the fix manually with `gsd_*` → `sf_*` and `.gsd/` → `.sf/` translations. 4. Commit with `port(gsd-2): (refs )` style. **Critical fixes worth porting** (limit to security + correctness; skip parallel-evolution churn): | Order | gsd-2 fix | Why | gsd-2 SHA | |---|---|---|---| | 1 | **`fix(safety): persist bash evidence at tool_call` (close mid-unit re-dispatch race)** | Real race condition; bash tool calls can lose evidence between dispatch and re-dispatch | `da7dd56e7` (PR #5056 → #5058) | | 2 | **`fix(security): harden project-controlled surfaces`** | We have a partial cherry-pick at `66ff949c1`; supersede with the full fix | `65ca5aa2e` | | 3 | **`fix(search): narrow native web_search injection`** | Only inject web_search context when the provider accepts it | `4370bedf3` | | 4 | **`fix(gsd): self-heal symlinked .sf staging`** (path-translated) | Data-loss prevention — when the staging dir is a symlink that's broken or points outside expected scope, detect and self-heal instead of silently writing to wrong location. Path-translate `.gsd/` → `.sf/` in the port; the substance is symlink-resilience, not the path string. | `9340f1e9b` (#4423) | | 5 | **`fix(knowledge): scope + budget milestone KNOWLEDGE injection`** | Prevents milestone-scope knowledge from blowing the context budget | `58d3d4d6c` (#4721) | | 6 | **MCP server stdout-buffer deadlock** | Not applicable — SF no longer ships an MCP server package. Do not port unless a future accepted ADR reintroduces an SF-owned MCP server. | N/A | | 7 | **`fix(agent-session): guard synthetic agent_end transitions`** | Session-transition race when agent_end was synthesised | `71114fccf` | | 8 | **`fix(agent-session): skip idle wait after agent_end`** | Idle wait was burning time on a session that was already ending | `6d7e4ccb5` | | 9 | **`Fix agent_end session switch handoff`** | Session handoff during agent_end could drop the next session | `c162c44bf` | | 10 | **`Fix session transition during agent_end`** | Companion to the above | `e3bd04551` | | 11 | **`fix(claude-code-cli): persist Always Allow for non-Bash tools`** | Always-Allow grants didn't persist for non-Bash tools | `a88baeae9` (PR #5096) | **Normal-value features worth porting** (not critical, but real): | Order | gsd-2 feature | Why | Effort | gsd-2 SHA(s) | |---|---|---|---|---| | 12 | **`/gsd eval-review` (slim, like product-audit)** | New milestone-end evaluation review command + frontmatter schema. We don't have it. Slim port pattern: prompt + tool + workflow template; skip parallel rewrites of dispatch/prompts. | 2 hrs | `979487735` `6971f4333` `a2f8f0e08` `83bcb054c` `a686d22cb` (+11 polish commits) | | 13 | **Workflow state machine hardening (5 commits as a unit)** | `harden workflow state transitions`, `persist workflow retry and summary state`, `fail closed on unreadable milestone summaries`, `restore slice dependency fallback`. Reliability of long auto runs. | 2 hrs | `f2377eedd` `b9a1c6743` `153fb328a` `381ccdef5` `371b2eb31` (PR #4758) | | 14 | **Proactive rate limiting via `min_request_interval_ms`** | Self-throttle to avoid 429s — model-side rate-limit data is observability-only (per SPEC.md §19.6); this is the per-dispatch knob. | 1 hr | `f980929f1` `73bc4d2f1` (PR #5007) | | 15 | **Per-call token telemetry (opt-in)** | pi-coding-agent gains opt-in per-call token telemetry hooks. Useful for cost dashboards. | 0.5 hr | `b4d4725ad` (PR #5023) | | 16 | **Worktree TUI commands (`worktree {list,merge,clean,remove}`)** | Adds these to the TUI dispatcher. We may have parts of this; check before porting. | 1 hr | `2361ceeb1` (PR #5055) | | 17 | **Doctor check for orphan milestone directories** | Diagnostic — flags `.sf/active/` artifacts whose milestones are gone. Aligns with SPEC.md C-24 startup cleanup. | 0.5 hr | `420354f99` (PR #4998) | **Skip from gsd-2** (parallel evolution; we have own implementations): - `auto-dispatch.ts`, `auto-prompts.ts`, `benchmark-selector.ts` rewrites — we have these and ours are richer (e.g. our benchmark-selector has more eval types). - UnitContextManifest / Composer rewrite (~15 commits, PRs #4782 / #4924 / #4925 / #4926) — major architectural refactor that conflicts heavily; revisit during v3 §3 schema reconciliation. - xiaomi/minimax/product-audit features — already ported in commits `ae0bbe32f`, `2eebeccb9`, `a8cf2cd94`. - All headless UX, prompt edits (DeepWiki/Context7), Serena hints, and global MCP loading — already addressed in our session (commits `c41912ff5`, `dff0df5fd`); we have own equivalents. **See `UPSTREAM_CHERRY_PICK_CANDIDATES.md`** for the full audit (all 4,589 commits surveyed; this Tier 0.5 list is the 17 worth porting — 11 critical + 6 normal value). --- ## Tier 1+ active follow-ups (after Tier 0 lands) These came up during recent ports and refactor passes — tracked here so they don't get lost. | Follow-up | Why | Tier | Effort | |---|---|---|---| | **Minimax search tests** | Search agent ported the feature but explicitly skipped tests because bunker's tests don't match our preferences/provider export shape. Need: `getMiniMaxSearchApiKey()` priority order, `resolveSearchProvider()` returning "minimax", `/search-provider minimax` CLI behavior, no-key error messages, `executeMiniMaxSearch` request shape. | 1 | 0.5 day | | **Product-audit phase machine wire-up** | Slim port (commit `a8cf2cd94`) shipped the prompt + `sf_product_audit` tool + workflow template, but doesn't yet dispatch into PhaseMerge or PhaseComplete. The tool is callable; the phase doesn't auto-fire. | 2 | 0.5 day | | **Headless assistant-text preview** | Headless UX commit (`dff0df5fd`) covered notification spam, categorization, and phase/status tag distinction. The fourth bunker improvement — separating `assistantTextBuffer` from `thinkingBuffer` and flushing both as concise previews on tool-execution-start / message-end — was deferred because it's a meatier change in `headless.ts`. | 2 | 0.5 day | | **Search provider registry refactor** | Adding minimax took 9 files because the provider list is duplicated across `provider.ts` (type + VALID_PREFERENCES), `native-search.ts`, `command-search-provider.ts` (CLI), `tool-search.ts` + `tool-llm-context.ts` (two separate execute paths!), `preferences-types.ts`, `preferences-validation.ts`, manifest, docs. A single `SearchProviderRegistry` array would let everything iterate. | 2 | 3-5 days | | **Pi-mono SDK sync** | We pull from pi-mono directly (separate from gsd-2 sync stance). Periodically check `pi-mono/main` for SDK improvements worth taking. The remote is set up; cadence is not. | 3 | recurring | | **Caveman input-side compression** (manual) | Caveman skill installed (output compression, ~75% fewer agent tokens). Input side — sf's own prompts (`execute-task.md`, `discuss.md`, `plan-*.md`, etc.) — is verbose: 10-step instruction lists, `runtimeContext`, `memoriesSection`, `taskPlanInline`, `slicePlanExcerpt`. Manually rewrite the heaviest sections in caveman style (preserve intent + nuance, drop fluff). Test against current to confirm no quality regression. | 2 | 1-2 days | | **Runtime input preprocessor** (caveman-compress) | Add a transformation step in dispatch that pipes sf's rendered prompt through `caveman-compress` (sub-skill in juliusbrussee/caveman repo, ~46% input-token reduction) before LLM call. Only enable when a `terse_prompts: true` preference is set. Adds a layer that can drift from authored intent — needs a comparison harness. | 3 | 3-4 days | | **Full swarm chat for `subagent` tool** | Round-robin debate mode now exists as `subagent({ mode: "debate", rounds: N, tasks: [...] })`, so adversarial reviewers can engage prior-round arguments. Remaining work is Option C from [ADR-011](docs/dev/ADR-011-swarm-chat-and-debate-mode.md): full inbox-based swarm chat after the persistent-agent layer (SPEC §17–18) lands. | 3 | ~3 weeks (depends on persistent-agent layer) | | **Singularity Knowledge + Agent Platform (Go re-platform)** | Re-platform Singularity Memory from Python+FastAPI+Postgres+vchord to Go on Charm: charm-server patterns for auth/identity, fantasy as agent runtime, same Postgres+vchord for retrieval, exact wire-contract preserved. Load-bearing for cross-instance knowledge federation AND future central persistent agents (sf SPEC §17). See [ADR-014](docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md) and [`singularity-memory/MIGRATION.md`](https://github.com/singularity-ng/singularity-memory/blob/main/MIGRATION.md). | 1 | ~12 weeks across phases | | **Wire sf to Singularity Memory remote-mode** | sf-side: change `memory-store.ts` provider chain from local-SQLite-only to remote-Singularity-Memory → embedded → local-only fallback. Once wired, ~80% of the "should sf instances interlink?" question (ADR-012) is answered for free. Depends on the platform itself being live. | 1 | 1 week post-platform | | **Judge calibration + eval runner service** | Documentation-only for now. When implemented, keep SF core in TS for repo profiling and `.sf/sf.db` run ledgers, but build model-judge execution/calibration as a Go/Charm service using `fantasy`/`catwalk`, with durable false-positive/false-negative lessons retained into Singularity Memory. See [repo-native-harness-architecture.md](docs/dev/repo-native-harness-architecture.md#judge-rig). | 2 | ~2-3 weeks after Singularity Memory remote-mode | | **sf-worker SSH host** | Build the Go-based SSH worker host for distributed execution (SPEC §22, NEW): `wish` + `xpty`/`conpty` + `promwish`. Orchestrator dispatches over SSH; worker spawns the agent in a real pty per attempt; Prometheus metrics for free. See [ADR-013](docs/dev/ADR-013-network-and-remote-execution.md). | 2 | ~3 weeks | | **Charm TUI client (`sf-tui`)** | Build a new Go-based TUI client on `pony` + `ultraviolet` + `bubbles` + `lipgloss` + `glamour` + `huh` + `harmonica` + `x/mosaic`. Talks to sf daemon over RPC. Two-stage replacement of `pi-tui`: ship parallel as `sf --tui=charm`, reach parity, flip default, delete `pi-tui` (sheds ~10k LOC of TS from sf core). See [ADR-017](docs/dev/ADR-017-charm-tui-client.md). | 2 | ~12-16 weeks across stages | | **Flight recorder** (`x/vcr`) | Frame-accurate session recording for sf auto-loop dispatches. Go service using `charmbracelet/x/vcr`. Records to `.sf/recordings/{unit-id}.vcr`; `sf replay ` opens TUI player. Frame-level redaction parity with `event-log.jsonl`. See [ADR-015](docs/dev/ADR-015-flight-recorder.md). | 3 | ~3 weeks | | **Multi-instance federation (other surfaces)** | Federated benchmarks, federated persistent agents, cross-repo unit graph — all deferred. Decide ride-Singularity-Memory vs separate service for benchmarks after §16 lands and we observe duplicated discovery cost. Cross-repo orch is out-of-scope for sf (meta-coordinator territory). Federated agents wait until concrete pain shows up. See [ADR-012](docs/dev/ADR-012-multi-instance-federation.md). | 3 | depends on which surface — re-scope after Singularity Memory lands | It is opinionated. Each item has a tier and a one-line rationale. Reorder freely. --- ## Upstream stance **sf is a fork.** We do not periodically sync from `gsd-build/gsd-2`. We tried (see attempt log in `UPSTREAM_CHERRY_PICK_CANDIDATES.md`). The conflicts run deep because of three structural choices that are intentional and won't be reverted: - We renamed `gsd_*` tool names → `sf_*` (`421fccd89`). - We renamed `@sf-run/*` → `@singularity-forge/*` package scope (`f92ee8d64`). - We've cherry-picked tool fixes from `pi-mono` upstream directly (`f153521c2`), which addresses some bugs that `gsd-2` fixed differently. Pretending we still track gsd-2 means weeks of merge work for diminishing return. Better to: - **Treat `gsd-build/gsd-2` upstream as an intelligence source.** We read it. We hand-port fixes when one specifically bites us. `UPSTREAM_CHERRY_PICK_CANDIDATES.md` is a reference list of what's available, not an action plan. - **Pull from `pi-mono` directly for SDK improvements.** We've already been doing this; continue. - **Track our own roadmap** via `SPEC.md` and this file. If a specific upstream fix matters (e.g. a CVE, a bug we hit), port it manually and credit upstream in the commit message. Don't try to sync the whole tree. --- ## Tier 1 — ESSENTIAL (block v3 ship) These resolve real product or correctness gaps. v3 isn't v3 without them. ### 1.1 Vault secret resolver **Spec:** § 24, C-38, C-83. **What:** `vault://secret/path#field` URI resolver, replacing any plaintext provider keys in current config. Auth chain: `VAULT_TOKEN` → `~/.vault-token` → AppRole. **Why essential:** sf is a real tool used against real models with real billing. Plaintext keys in config files are a security regression we should not ship past. **Effort:** 1–2 days. `pi-ai` config layer adds a resolver. ### 1.2 Singularity Memory integration decision + execution **Spec:** § 16, § 24, C-94, C-95, K-01 through K-06. **What:** Decide whether sm replaces sf's existing memory layer, layers on top, or stays absent — then execute. The repo at `singularity-ng/singularity-memory` exists; integrating means replacing or augmenting `memory-store.ts`, `memory-extractor.ts`, `memory-relations.ts`, `tools/memory-tools.ts`, `bootstrap/memory-tools.ts`. **Why essential:** the spec leans heavily on sm (anti-patterns, two-bank recall, cross-tool sharing). Either commit to it or rewrite §16 to match what sf actually has. **Recommended path:** **keep sf's local memory as a hot cache + use sm as durable cross-tool store**. This is the layered model — sf's local memory becomes the operational fast-path; sm holds long-term cross-session, cross-project, cross-tool memories. **Effort:** 1–2 weeks for the integration; 1 day to decide. ### 1.3 Schema reconciliation: `units` vs `milestones`/`slices`/`tasks` **Spec:** § 3.1. **What:** sf has 3 tables, spec has 1 with a `type` column. Either: - **(a)** Migrate sf to single `units` table (data migration; touches many files). - **(b)** Update spec to 3-table model (no code change; spec rewrite). **Recommended path:** **(b) — keep what sf has.** The 3-table shape is more granular and integrates with `decisions`, `requirements`, `artifacts`, `assessments`, `replan_history` which have rich schemas of their own. Forcing them into one `units` table loses information. **Effort:** 2–3 days for spec rewrite, 0 days code. ### 1.4 Config schema alignment **Spec:** § 14.2, C-25, C-26, C-73. **What:** `config-overlay.ts` exposes whatever keys sf has today. Spec specifies `context_compact_at`, `context_hard_limit`, `unit_timeout`, `unit_timeout_by_phase`, `max_agents_by_phase`, `turn_input_required`, `worktree_mode`, `tool_abort_grace`, `max_turns_per_attempt`, `hot_cache_turns`, etc. Add missing keys with defaults; document each. **Why essential:** users can't tune behavior they can't configure. Spec promises configurability that doesn't exist yet. **Effort:** 3–5 days. Add keys, plumb through, write doctor checks. --- ## Tier 2 — STRONG (ship with v3 if possible, otherwise v3.1) Real value-add. Defer is allowed but disappointing. ### 2.1 Persistent agents v1 (basic, no messaging) **Spec:** § 17, A-01, A-02, A-03, A-04, A-09, A-10. **Defer:** A-05, A-06, A-07, A-08 (messaging) to v3.1. **What:** named agents with their own memory blocks, system prompt, message history, durable across sessions. `core_memory_append` / `core_memory_replace` tools. `/sf agent run|reset|delete|inspect` commands. **Why strong:** the persistent-agent pattern was the main draw from Letta and a recurring user interest throughout this spec process. Shipping basic persistent agents in v3 unlocks the architecture; messaging can come in v3.1. **Effort:** 2 weeks for basic; +1–2 weeks for messaging. ### 2.2 Doc-sync sub-step **Spec:** § 10.5, C-20, C-45, C-68. **What:** at the end of the last code-mutating phase (Merge or, for spike workflows, Execute), run a `fast`-tier dispatch to check whether `ARCHITECTURE.md`/`CONVENTIONS.md`/`STACK.md` need updates and propose a diff for user approval. **Why strong:** project docs rotting is the most predictable failure mode of long autopilot runs. Catching it costs ~5 minutes per merge. **Effort:** 3–5 days. ### 2.3 Intent chapters **Spec:** § 19.4, C-34. **What:** spans grouped into named "what was the agent trying to do" chapters. Inferred from phase transitions or agent-declared via `chapter_open(name)`. Used for crash-resume context and Hindsight recall. **Why strong:** crash-resume reconstruction is currently weak. Chapters give the resumed agent a coherent "what was I doing" header instead of replaying raw tool calls. **Effort:** 1 week. ### 2.4 PhaseReview 3-pass review **Spec:** § 13.3, C-39, C-63. **What:** establish-context pass (single fast dispatch) → parallel chunked review (per-file, ≤300 lines each, standard tier) → synthesis pass. **Why strong:** the current single-pass review on large diffs is known to gloss the tail. The 3-pass shape catches more. **Effort:** 1 week. ### 2.5 `turn_status` marker **Spec:** § 5.4.1, C-81. **What:** parse `complete|blocked|giving_up` from end of agent output. `blocked` triggers `SignalPause`; `giving_up` transitions to `PhaseReassess` immediately. **Why strong:** a per-turn semantic checkpoint between transport-success and phase-boundary. Currently the harness has no way to know "the agent thinks it's stuck" except by waiting for stuck-loop timeout. **Effort:** 2–3 days. ### 2.6 `last_error` cap **Spec:** § 7.3, C-74. **What:** truncate `last_error` to 4 KB head+tail; full payload to `.sf/active/{unit-id}/last-error-full.txt`. Agent reads the file if needed. **Why strong:** lint output / traceback dumps can blow the prompt. Current behaviour is "inject and pray." **Effort:** 1 day. ### 2.7 Cost stored as integer micro-USD **Spec:** C-69. **What:** rename `cost_usd REAL` → `cost_micro_usd INTEGER` in `runs`, `benchmark_results`. Float drift on accumulated costs is real over thousands of runs. **Why strong:** small change, real correctness improvement, easier reasoning about totals. **Effort:** 1 day with the migration. --- ## Tier 3 — NICE (v3.1 or v3.2) Worth building, just not blocking. Ship after Tier 2 if calendar allows. | Item | Spec | One-line | |---|---|---| | Inter-agent messaging | § 18, A-05..A-08 | send_message + inbox + wait_for_reply + handoff. Builds on Tier 2.1 persistent agents. ~1–2 weeks. | | Workflow content pinning | § 4.5, C-71 | SHA-256 hash of template content stored per unit; in-flight units use pinned content. Defends against operator editing the template mid-run. ~3 days. | | Trace `_meta` record | § 19.3, C-79 | First line of each daily JSONL is a schema-version record. Forward-compatible. ~1 day. | | `runs` table | § 3.1, C-48, C-49, C-59 | Unifies unit_attempt and agent_run history. sf has `audit_events` already; either repurpose or add a new view. Decision required. ~1 week. | | `pending_retain` queue | § 16.1, C-51 | Sm retain failures queue locally and retry with backoff. Required if and only if sm is integrated (Tier 1.2). | | Capability-tag handoff | § 18.4, C-82, C-90 | `handoff("capability:go,testing", ...)` resolves to any matching agent. Adds `agent_capabilities` index. Builds on Tier 2.1 + Tier 3 inter-agent messaging. ~3 days. | | `agent_run` budget + termination | § 17.5, C-54, C-65 | When does an agent run end? (inbox drained / explicit stop / budget hard-limit / supervisor signal / timeout). Compaction preserves wake message. ~1 week. | --- ## Tier 4 — DEFER (only if a deployment actually demands it) Spec sections that landed during late-stage adversarial review and only matter at scale or in specific deployments. | Item | Spec | Why deferred | |---|---|---| | SSH worker extension | § 22, C-64, C-75, E-02 | Real for fleet deployments (bunker, inference-fabric scaling). Not real for daily-driver development. Build when a user actually needs to dispatch to a remote box. | | HTTP API auth | § 19.5, C-77 | Only needed if the HTTP API ships. SF currently supports MCP as a client surface only, not as an SF workflow server. | | `trace_index` SQL | § 19.3.1, C-80 | Forensics over JSONL is fine until grep gets slow. Build the index when you have months of trace files, not before. | | PhaseUAT | § 4.6, C-53, C-76 | Only matters for "release" workflows where humans sign off before merge. Add when needed. | | Multi-orchestrator atomic claim | C-47 | The single-process `run.lock` is sufficient. The atomic UPDATE pattern matters when two orchestrators race against the same DB; sf doesn't deploy that way today. | | `specs.check` JSDoc CI | C-37 | Useful but not blocking. Add when JSDoc rot becomes a real issue. | --- ## Tier 5 — DROP from spec These crept in during adversarial review iterations and don't earn their keep. | Item | Spec | Why drop | |---|---|---| | Cost-`per_1k_micro_usd` field type rename | C-69 (partial) | If we accept `cost_micro_usd` for runs (Tier 2.7), the `benchmark_results.cost_per_1k_micro_usd` rename is internally consistent — but the user-facing pricing model that benchmark uses already varies per provider; the integer-micro-USD constraint there is over-engineered. Keep `REAL` for benchmark, integer for runs. | | `runs` snap_ columns (`unit_id_snap`, `agent_name_snap`) | C-59 | If we use soft-delete (`archived_at`) and never hard-delete, snapshots are unnecessary. Drop the columns. | | `workflow_pins` content snapshot table | C-71 | If we just hash the file at first dispatch and store the hash on the unit (`units.workflow_hash`), we don't need a separate pins table. The hash is enough; the content can be re-read from disk. Simplify. | | `agent_capabilities` separate indexed table | C-90 | At fleet sizes <100 agents, the JSON-array-LIKE scan is fine. Add the index when you have a measurement showing it's slow. | --- ## Suggested v3 milestone breakdown **v3.0 — ship target: ~6–8 weeks** - Tier 1.1 Vault (1–2d) - Tier 1.2 sm integration, layered model (2 weeks) - Tier 1.3 spec schema rewrite to 3-table (3d) - Tier 1.4 config alignment (1 week) - Tier 2.2 doc-sync (1 week) - Tier 2.5 turn_status marker (3d) - Tier 2.6 last_error cap (1d) - Tier 2.7 cost_micro_usd (1d) That's **~5 weeks of work** for the must-haves. **v3.1 — ~4 weeks after v3.0** - Tier 2.1 persistent agents v1 (2 weeks) - Tier 2.3 intent chapters (1 week) - Tier 2.4 PhaseReview 3-pass (1 week) **v3.2 — when ready** - Tier 3 items as appetite allows. --- ## Decisions needed before starting v3.0 1. **sm: replace, layer, or keep?** Recommended: layer (sf local cache + sm durable). 2. **Schema: migrate to single `units` or update spec to 3-table?** Recommended: update spec. 3. **Persistent agents in v3.0 or v3.1?** Recommended: v3.1 — too much new surface to land alongside Tier 1 + 2. 4. **Does any deployment actually need SSH workers in v3.x?** If not, drop §22 from spec entirely; re-add when needed.