sf v3 Build Plan
A practical cut of the 56 NEW items in SPEC.md into tiers. Not every spec item is worth building for v3 — some were polish from late-stage adversarial review iterations and only matter at scale or in deployments we don't have.
This document is the answer to: what should we actually ship for v3?
Strategic frame — 2026-05
We are already on a strong base: Forge is the product, UOK is the kernel, and core work is gated by purpose-driven TDD plus the eight PDD fields. The goal of this build plan is not to turn SF into a generic CLI coder. The goal is to sharpen Forge's autonomous single-repo execution while borrowing the best ideas from adjacent systems.
This file is a planning document, not a verified implementation ledger. An item can be mapped here and still be open, partial, or only folded into milestone planning. Close-out still requires code evidence, tests, and milestone artifacts that prove the behavior exists in the repo.
Use external comparisons to sharpen, not to steer identity:
- Claude Code / Codex — interaction and execution ergonomics
- Aider / gsd-2 — direct execution and repo work loop
- Plandex — workflow decomposition and staged progress
- ACE Coder — future multi-repo and large-scale convergence patterns, not the near-term product path for Forge
The end state is not "SF plus a pile of borrowed references." The end state is that proven workflow, execution, and reliability patterns are absorbed into Forge and UOK as first-party behavior.
High-level milestone sequence
- Stabilize the core. Keep UOK, purpose-driven TDD, the eight PDD fields, and repo-local state/evidence as the non-negotiable base.
- Sharpen single-repo execution. Port the highest-value correctness and workflow ideas from pi-mono, gsd-2, and adjacent CLI systems where they improve Forge without changing its product identity.
- Deepen autonomous reliability. Improve evidence capture, recovery, verification, and self-improvement loops inside the single-repo boundary.
- Polish product surfaces. Make the autonomous workflow legible in TUI, CLI, and docs without introducing separate planning semantics.
- Absorb and converge deliberately. Fold proven external patterns into Forge/UOK as native behavior, and keep interfaces/concepts compatible with ACE Coder where useful, while letting Forge and ACE grow from their different starting points.
Tier 0 — Pi-mono ports (sf: do these FIRST)
Pi-mono (badlogic/pi-mono) has shipped 4 releases (v0.70.3 → v0.70.6) since our last vendor sync. These should be picked up before other v3 work because:
- They're security/correctness fixes for code we already use.
- They land cleanly (no namespace divergence — `packages/pi-*` were vendored from pi-mono with the same paths and type names).
- Skipping them means dragging known bugs into v3 work.
Order: security first → real bugs → infra → features.
| Order | Pi-mono fix | Why | Status | Reference |
|---|---|---|---|---|
| 1 | HTML export: escape image data + session metadata | Security — crafted session content could inject markup in exported HTML | ✅ 701ec8fb8 + dist 92c6d933c | PRs #3819, #3883 |
| 2 | Empty tools array fix for providers that reject | Correctness bug — some providers reject the call | ✅ 58b1d7c60 | PR #3650 |
| 3 | Anthropic SSE: ignore unknown proxy events | Correctness bug — proxies emit OpenAI-style done events | DEFERRED — fix doesn't apply directly. Pi-mono moved off the SDK to a custom SSE parser (3 commits: 4b926a30a + e58d631c8 + 3e7ffff18); we still use `client.messages.stream()` from `@anthropic-ai/sdk`. To get this protection we'd need to port the entire pi-mono custom-SSE refactor (~200 LOC). Real engineering effort, separate item. | issue #3708 |
| 4 | Long local-LLM SSE timeout (5-min undici cutoff) | Correctness bug — local Ollama / LM Studio runs over 5 min die with UND_ERR_BODY_TIMEOUT | ✅ d0907b6d8 | issue #3715 |
| 5 | Bedrock inference profile normalization | Bedrock prompt-caching + adaptive-thinking checks fail on inference profile ARNs | ✅ 7c487bb60 | PR #3527 |
| 6 | Symlinked packages/resources/skills/sessions dedup | Selectors and loaders show duplicates when paths are symlinked | TODO | PR #3818 |
| 7 | `ctx.ui.setWorkingVisible()` extension API | Lets extensions hide the built-in working-loader row; useful for autopilot UX | TODO | issue #3674 |
| 8 | Cloudflare Workers AI provider | New provider option (`CLOUDFLARE_API_KEY`/`CLOUDFLARE_ACCOUNT_ID`) | TODO | PR #3851 |
| 9 | Azure Cognitive Services endpoint | Azure OpenAI Responses base URL support | TODO | PR #3799 |
| NEW | Port pi-mono custom Anthropic SSE parsing (replaces SDK) | Addresses #3 properly: own the SSE parser like pi-mono, then the unknown-event filter applies. Multi-commit refactor. | TODO | 4b926a30a + e58d631c8 + 3e7ffff18 |
Process for each: read the pi-mono commit; port the fix to our `packages/pi-*` (cherry-pick should work cleanly here — same namespace as upstream); commit in `port(pi-mono): <description> (refs <pi-mono SHA>)` style.
Skip from pi-mono (not applicable to us):
- `pi update --self`, `pi.dev` update endpoint, Windows self-update — we vendor; no pi-binary auto-update path
- Bun startup / sandbox `/proc/self/environ` fixes — we run on Node, not Bun
- Packaged session selector import — our dist layout differs
Tier 0.5 — gsd-2 high-value manual ports (after Tier 0)
gsd-build/gsd-2 has 4,589 commits we're missing. Cherry-pick fails on virtually all of them because of our namespace divergence (gsd_* → sf_* rename, extensions/gsd/ → extensions/sf/ rename, prior pi-mono direct cherry-picks). These have to be manually ported — read the commit, write equivalent code against our paths and naming.
Process for each:
- Read the commit at `gsd-build/gsd-2` (we have it as `upstream/main`).
- Find the equivalent file(s) in our `extensions/sf/` tree.
- Apply the fix manually with `gsd_*` → `sf_*` and `.gsd/` → `.sf/` translations.
- Commit in `port(gsd-2): <description> (refs <gsd-2 SHA>)` style.
Critical fixes worth porting (limit to security + correctness; skip parallel-evolution churn):
| Order | gsd-2 fix | Why | gsd-2 SHA |
|---|---|---|---|
| 1 | fix(safety): persist bash evidence at tool_call (close mid-unit re-dispatch race) | Real race condition; bash tool calls can lose evidence between dispatch and re-dispatch | da7dd56e7 (PR #5056 → #5058) |
| 2 | fix(security): harden project-controlled surfaces | We have a partial cherry-pick at 66ff949c1; supersede with the full fix | 65ca5aa2e |
| 3 | fix(search): narrow native web_search injection | Only inject web_search context when the provider accepts it | 4370bedf3 |
| 4 | fix(gsd): self-heal symlinked .sf staging (path-translated) | Data-loss prevention — when the staging dir is a symlink that's broken or points outside expected scope, detect and self-heal instead of silently writing to the wrong location. Path-translate `.gsd/` → `.sf/` in the port; the substance is symlink-resilience, not the path string. | 9340f1e9b (#4423) |
| 5 | fix(knowledge): scope + budget milestone KNOWLEDGE injection | Prevents milestone-scope knowledge from blowing the context budget | 58d3d4d6c (#4721) |
| 6 | MCP server stdout-buffer deadlock | Not applicable — SF no longer ships an MCP server package. Do not port unless a future accepted ADR reintroduces an SF-owned MCP server. | N/A |
| 7 | fix(agent-session): guard synthetic agent_end transitions | Session-transition race when agent_end was synthesised | 71114fccf |
| 8 | fix(agent-session): skip idle wait after agent_end | Idle wait was burning time on a session that was already ending | 6d7e4ccb5 |
| 9 | Fix agent_end session switch handoff | Session handoff during agent_end could drop the next session | c162c44bf |
| 10 | Fix session transition during agent_end | Companion to the above | e3bd04551 |
| 11 | fix(claude-code-cli): persist Always Allow for non-Bash tools | Always-Allow grants didn't persist for non-Bash tools | a88baeae9 (PR #5096) |
Normal-value features worth porting (not critical, but real):
| Order | gsd-2 feature | Why | Effort | gsd-2 SHA(s) |
|---|---|---|---|---|
| 12 | `/gsd eval-review` (slim, like product-audit) | New milestone-end evaluation review command + frontmatter schema. We don't have it. Slim port pattern: prompt + tool + workflow template; skip parallel rewrites of dispatch/prompts. | 2 hrs | 979487735 6971f4333 a2f8f0e08 83bcb054c a686d22cb (+11 polish commits) |
| 13 | Workflow state machine hardening (5 commits as a unit) | Harden workflow state transitions, persist workflow retry and summary state, fail closed on unreadable milestone summaries, restore slice dependency fallback. Reliability of long auto runs. | 2 hrs | f2377eedd b9a1c6743 153fb328a 381ccdef5 371b2eb31 (PR #4758) |
| 14 | Proactive rate limiting via `min_request_interval_ms` | Self-throttle to avoid 429s — model-side rate-limit data is observability-only (per SPEC.md §19.6); this is the per-dispatch knob. | 1 hr | f980929f1 73bc4d2f1 (PR #5007) |
| 15 | Per-call token telemetry (opt-in) | pi-coding-agent gains opt-in per-call token telemetry hooks. Useful for cost dashboards. | 0.5 hr | b4d4725ad (PR #5023) |
| 16 | Worktree TUI commands (`worktree {list,merge,clean,remove}`) | Adds these to the TUI dispatcher. We may have parts of this; check before porting. | 1 hr | 2361ceeb1 (PR #5055) |
| 17 | Doctor check for orphan milestone directories | Diagnostic — flags `.sf/active/` artifacts whose milestones are gone. Aligns with SPEC.md C-24 startup cleanup. | 0.5 hr | 420354f99 (PR #4998) |
Skip from gsd-2 (parallel evolution; we have our own implementations):
- `auto-dispatch.ts`, `auto-prompts.ts`, `benchmark-selector.ts` rewrites — we have these and ours are richer (e.g. our benchmark-selector has more eval types).
- UnitContextManifest / Composer rewrite (~15 commits, PRs #4782 / #4924 / #4925 / #4926) — major architectural refactor that conflicts heavily; revisit during v3 §3 schema reconciliation.
- xiaomi/minimax/product-audit features — already ported in commits ae0bbe32f, 2eebeccb9, a8cf2cd94.
- All headless UX, prompt edits (DeepWiki/Context7), Serena hints, and global MCP loading — already addressed in our session (commits c41912ff5, dff0df5fd); we have our own equivalents.
See UPSTREAM_CHERRY_PICK_CANDIDATES.md for the full audit (all 4,589 commits surveyed; this Tier 0.5 list is the 17 worth porting — 11 critical + 6 normal value).
Tier 1+ active follow-ups (after Tier 0 lands)
These came up during recent ports and refactor passes — tracked here so they don't get lost.
| Follow-up | Why | Tier | Effort |
|---|---|---|---|
| Minimax search tests | Search agent ported the feature but explicitly skipped tests because bunker's tests don't match our preferences/provider export shape. Need: `getMiniMaxSearchApiKey()` priority order, `resolveSearchProvider()` returning `"minimax"`, `/search-provider minimax` CLI behavior, no-key error messages, `executeMiniMaxSearch` request shape. | 1 | 0.5 day |
| Product-audit phase machine wire-up | Slim port (commit a8cf2cd94) shipped the prompt + `sf_product_audit` tool + workflow template, but doesn't yet dispatch into PhaseMerge or PhaseComplete. The tool is callable; the phase doesn't auto-fire. | 2 | 0.5 day |
| Headless assistant-text preview | Headless UX commit (dff0df5fd) covered notification spam, categorization, and phase/status tag distinction. The fourth bunker improvement — separating `assistantTextBuffer` from `thinkingBuffer` and flushing both as concise previews on tool-execution-start / message-end — was deferred because it's a meatier change in `headless.ts`. | 2 | 0.5 day |
| Search provider registry refactor | Adding minimax took 9 files because the provider list is duplicated across `provider.ts` (type + VALID_PREFERENCES), `native-search.ts`, `command-search-provider.ts` (CLI), `tool-search.ts` + `tool-llm-context.ts` (two separate execute paths!), `preferences-types.ts`, `preferences-validation.ts`, manifest, and docs. A single SearchProviderRegistry array would let everything iterate. | 2 | 3–5 days |
| Pi-mono SDK sync | We pull from pi-mono directly (separate from the gsd-2 sync stance). Periodically check pi-mono/main for SDK improvements worth taking. The remote is set up; the cadence is not. | 3 | recurring |
| Caveman input-side compression (manual) | Caveman skill installed (output compression, ~75% fewer agent tokens). The input side — sf's own prompts (`execute-task.md`, `discuss.md`, `plan-*.md`, etc.) — is verbose: 10-step instruction lists, runtimeContext, memoriesSection, taskPlanInline, slicePlanExcerpt. Manually rewrite the heaviest sections in caveman style (preserve intent + nuance, drop fluff). Test against current to confirm no quality regression. | 2 | 1–2 days |
| Runtime input preprocessor (caveman-compress) | Add a transformation step in dispatch that pipes sf's rendered prompt through caveman-compress (sub-skill in the juliusbrussee/caveman repo, ~46% input-token reduction) before the LLM call. Only enable when a `terse_prompts: true` preference is set. Adds a layer that can drift from authored intent — needs a comparison harness. | 3 | 3–4 days |
| Full swarm chat for subagent tool | Round-robin debate mode now exists as `subagent({ mode: "debate", rounds: N, tasks: [...] })`, so adversarial reviewers can engage prior-round arguments. Remaining work is Option C from ADR-011: full inbox-based swarm chat after the persistent-agent layer (SPEC §17–18) lands. | 3 | ~3 weeks (depends on persistent-agent layer) |
| Singularity Knowledge + Agent Platform (Go re-platform) | Re-platform Singularity Memory from Python+FastAPI+Postgres+vchord to Go on Charm: charm-server patterns for auth/identity, fantasy as agent runtime, same Postgres+vchord for retrieval, exact wire-contract preserved. Load-bearing for cross-instance knowledge federation AND future central persistent agents (sf SPEC §17). See ADR-014 and singularity-memory/MIGRATION.md. | 1 | ~12 weeks across phases |
| Wire sf to Singularity Memory remote-mode | sf-side: change the `memory-store.ts` provider chain from local-SQLite-only to remote-Singularity-Memory → embedded → local-only fallback. Once wired, ~80% of the "should sf instances interlink?" question (ADR-012) is answered for free. Depends on the platform itself being live. | 1 | 1 week post-platform |
| Judge calibration + eval runner service | Documentation-only for now. When implemented, keep SF core in TS for repo profiling and `.sf/sf.db` run ledgers, but build model-judge execution/calibration as a Go/Charm service using fantasy/catwalk, with durable false-positive/false-negative lessons retained into Singularity Memory. See repo-native-harness-architecture.md. | 2 | ~2–3 weeks after Singularity Memory remote-mode |
| sf-worker SSH host | Build the Go-based SSH worker host for distributed execution (SPEC §22, NEW): wish + xpty/conpty + promwish. Orchestrator dispatches over SSH; the worker spawns the agent in a real pty per attempt; Prometheus metrics for free. See ADR-013. | 2 | ~3 weeks |
| Charm TUI client (sf-tui) | Build a new Go-based TUI client on pony + ultraviolet + bubbles + lipgloss + glamour + huh + harmonica + x/mosaic. Talks to the sf daemon over RPC. Two-stage replacement of pi-tui: ship in parallel as `sf --tui=charm`, reach parity, flip the default, delete pi-tui (sheds ~10k LOC of TS from sf core). See ADR-017. | 2 | ~12–16 weeks across stages |
| Flight recorder (x/vcr) | Frame-accurate session recording for sf auto-loop dispatches. Go service using charmbracelet/x/vcr. Records to `.sf/recordings/{unit-id}.vcr`; `sf replay <unit-id>` opens a TUI player. Frame-level redaction parity with event-log.jsonl. See ADR-015. | 3 | ~3 weeks |
| Multi-instance federation (other surfaces) | Federated benchmarks, federated persistent agents, cross-repo unit graph — all deferred. Decide ride-Singularity-Memory vs separate service for benchmarks after §16 lands and we observe duplicated discovery cost. Cross-repo orchestration is out of scope for sf (meta-coordinator territory). Federated agents wait until concrete pain shows up. See ADR-012. | 3 | depends on which surface — re-scope after Singularity Memory lands |
This plan is opinionated: each item has a tier and a one-line rationale. Reorder freely.
Upstream stance
sf is a fork. We do not periodically sync from gsd-build/gsd-2.
We tried (see attempt log in UPSTREAM_CHERRY_PICK_CANDIDATES.md). The conflicts run deep because of three structural choices that are intentional and won't be reverted:
- We renamed `gsd_*` tool names → `sf_*` (421fccd89).
- We renamed the `@sf-run/*` → `@singularity-forge/*` package scope (f92ee8d64).
- We've cherry-picked tool fixes from `pi-mono` upstream directly (f153521c2), which addresses some bugs that `gsd-2` fixed differently.
Pretending we still track gsd-2 means weeks of merge work for diminishing returns. Better to:
- Treat `gsd-build/gsd-2` upstream as an intelligence source. We read it. We hand-port fixes when one specifically bites us. `UPSTREAM_CHERRY_PICK_CANDIDATES.md` is a reference list of what's available, not an action plan.
- Pull from `pi-mono` directly for SDK improvements. We've already been doing this; continue.
- Track our own roadmap via `SPEC.md` and this file.
If a specific upstream fix matters (e.g. a CVE, a bug we hit), port it manually and credit upstream in the commit message. Don't try to sync the whole tree.
Tier 1 — ESSENTIAL (block v3 ship)
These resolve real product or correctness gaps. v3 isn't v3 without them.
1.1 Vault secret resolver
Spec: § 24, C-38, C-83.
What: vault://secret/path#field URI resolver, replacing any plaintext provider keys in current config. Auth chain: VAULT_TOKEN → ~/.vault-token → AppRole.
Why essential: sf is a real tool used against real models with real billing. Plaintext keys in config files are a security regression we should not ship past.
Effort: 1–2 days. pi-ai config layer adds a resolver.
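A minimal sketch of the resolver shape, assuming only what the spec states (the `vault://secret/path#field` URI form and the VAULT_TOKEN → `~/.vault-token` → AppRole chain); function names are illustrative, not the shipped API, and the AppRole leg is elided because it needs a network call:

```typescript
import { readFileSync, existsSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

interface VaultRef {
  path: string;  // e.g. "secret/path"
  field: string; // e.g. "field"
}

// Parse "vault://secret/path#field" into its components; null for non-vault values,
// so plaintext config keys can be detected (and rejected) by the caller.
function parseVaultUri(uri: string): VaultRef | null {
  const m = /^vault:\/\/([^#]+)#(.+)$/.exec(uri);
  return m ? { path: m[1], field: m[2] } : null;
}

// Auth chain: VAULT_TOKEN env var → ~/.vault-token file → (AppRole login, elided).
function resolveVaultToken(env: NodeJS.ProcessEnv = process.env): string | null {
  if (env.VAULT_TOKEN) return env.VAULT_TOKEN;
  const tokenFile = join(homedir(), ".vault-token");
  if (existsSync(tokenFile)) return readFileSync(tokenFile, "utf8").trim();
  // AppRole fallback would POST role_id/secret_id to Vault's login endpoint.
  return null;
}
```

The `null` return from `parseVaultUri` is the hook for the "no plaintext keys" policy: config loading can refuse any provider-key value that doesn't parse as a vault URI.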
1.2 Singularity Memory integration decision + execution
Spec: § 16, § 24, C-94, C-95, K-01 through K-06.
What: Decide whether sm replaces sf's existing memory layer, layers on top, or stays absent — then execute. The repo at singularity-ng/singularity-memory exists; integrating means replacing or augmenting memory-store.ts, memory-extractor.ts, memory-relations.ts, tools/memory-tools.ts, bootstrap/memory-tools.ts.
Why essential: the spec leans heavily on sm (anti-patterns, two-bank recall, cross-tool sharing). Either commit to it or rewrite §16 to match what sf actually has.
Recommended path: keep sf's local memory as a hot cache + use sm as durable cross-tool store. This is the layered model — sf's local memory becomes the operational fast-path; sm holds long-term cross-session, cross-project, cross-tool memories.
Effort: 1–2 weeks for the integration; 1 day to decide.
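The layered model above can be sketched as a two-store composite: reads prefer the durable sm store and fall back to local; writes always hit the local hot cache and reach sm best-effort. Interface and function names here are illustrative (the real stores in `memory-store.ts` would be async and richer):

```typescript
interface MemoryStore {
  recall(query: string): string[];
  retain(entry: string): void;
}

// Compose a durable remote store (sm) over a local hot cache.
function layeredMemory(remote: MemoryStore | null, local: MemoryStore): MemoryStore {
  return {
    recall(query) {
      if (remote) {
        try { return remote.recall(query); } catch { /* remote down: fall through */ }
      }
      return local.recall(query);
    },
    retain(entry) {
      local.retain(entry); // the hot cache always gets the write
      if (remote) {
        try { remote.retain(entry); } catch { /* a real impl would queue for retry */ }
      }
    },
  };
}
```

The catch-and-queue branch in `retain` is exactly where the Tier 3 `pending_retain` queue would plug in.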
1.3 Schema reconciliation: units vs milestones/slices/tasks
Spec: § 3.1.
What: sf has 3 tables, spec has 1 with a type column. Either:
- (a) Migrate sf to a single `units` table (data migration; touches many files).
- (b) Update the spec to the 3-table model (no code change; spec rewrite).

Recommended path: (b) — keep what sf has. The 3-table shape is more granular and integrates with `decisions`, `requirements`, `artifacts`, `assessments`, and `replan_history`, which have rich schemas of their own. Forcing them into one `units` table loses information.
Effort: 2–3 days for spec rewrite, 0 days code.
1.4 Config schema alignment
Spec: § 14.2, C-25, C-26, C-73.
What: config-overlay.ts exposes whatever keys sf has today. Spec specifies context_compact_at, context_hard_limit, unit_timeout, unit_timeout_by_phase, max_agents_by_phase, turn_input_required, worktree_mode, tool_abort_grace, max_turns_per_attempt, hot_cache_turns, etc. Add missing keys with defaults; document each.
Why essential: users can't tune behavior they can't configure. Spec promises configurability that doesn't exist yet.
Effort: 3–5 days. Add keys, plumb through, write doctor checks.
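The overlay shape can be sketched as a defaults object plus a shallow merge. The key names come from the spec list above; the default values below are placeholders for illustration only, not the defaults SPEC.md §14.2 specifies:

```typescript
// Placeholder defaults — illustrative values, not the spec'd ones.
const configDefaults = {
  context_compact_at: 0.8,       // fraction of the context window before compaction
  context_hard_limit: 200_000,   // tokens
  unit_timeout: 3_600,           // seconds
  unit_timeout_by_phase: {} as Record<string, number>,
  max_agents_by_phase: {} as Record<string, number>,
  turn_input_required: false,
  worktree_mode: "auto",
  tool_abort_grace: 5,           // seconds
  max_turns_per_attempt: 50,
  hot_cache_turns: 10,
};

type SfConfig = typeof configDefaults;

// Overlay user config on defaults — the config-overlay pattern in miniature.
function withDefaults(user: Partial<SfConfig>): SfConfig {
  return { ...configDefaults, ...user };
}
```

Having every key present with a default is also what makes the doctor checks cheap: doctor can iterate the defaults object and flag unknown or mistyped user keys.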
Tier 2 — STRONG (ship with v3 if possible, otherwise v3.1)
Real value-add. Defer is allowed but disappointing.
2.1 Persistent agents v1 (basic, no messaging)
Spec: § 17, A-01, A-02, A-03, A-04, A-09, A-10. Defer: A-05, A-06, A-07, A-08 (messaging) to v3.1.
What: named agents with their own memory blocks, system prompt, message history, durable across sessions. core_memory_append / core_memory_replace tools. /sf agent run|reset|delete|inspect commands.
Why strong: the persistent-agent pattern was the main draw from Letta and a recurring user interest throughout this spec process. Shipping basic persistent agents in v3 unlocks the architecture; messaging can come in v3.1.
Effort: 2 weeks for basic; +1–2 weeks for messaging.
2.2 Doc-sync sub-step
Spec: § 10.5, C-20, C-45, C-68.
What: at the end of the last code-mutating phase (Merge or, for spike workflows, Execute), run a fast-tier dispatch to check whether ARCHITECTURE.md/CONVENTIONS.md/STACK.md need updates and propose a diff for user approval.
Why strong: project docs rotting is the most predictable failure mode of long autopilot runs. Catching it costs ~5 minutes per merge.
Effort: 3–5 days.
2.3 Intent chapters
Spec: § 19.4, C-34.
What: spans grouped into named "what was the agent trying to do" chapters. Inferred from phase transitions or agent-declared via chapter_open(name). Used for crash-resume context and Hindsight recall.
Why strong: crash-resume reconstruction is currently weak. Chapters give the resumed agent a coherent "what was I doing" header instead of replaying raw tool calls.
Effort: 1 week.
2.4 PhaseReview 3-pass review
Spec: § 13.3, C-39, C-63.
What: establish-context pass (single fast dispatch) → parallel chunked review (per-file, ≤300 lines each, standard tier) → synthesis pass.
Why strong: the current single-pass review on large diffs is known to gloss the tail. The 3-pass shape catches more.
Effort: 1 week.
2.5 turn_status marker
Spec: § 5.4.1, C-81.
What: parse <turn_status>complete|blocked|giving_up</turn_status> from end of agent output. blocked triggers SignalPause; giving_up transitions to PhaseReassess immediately.
Why strong: a per-turn semantic checkpoint between transport-success and phase-boundary. Currently the harness has no way to know "the agent thinks it's stuck" except by waiting for stuck-loop timeout.
Effort: 2–3 days.
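A minimal parser sketch for the marker, assuming (per the spec wording) that it sits at the end of the agent's output; the function name is illustrative:

```typescript
type TurnStatus = "complete" | "blocked" | "giving_up";

// Parse a trailing <turn_status> marker from agent output; null if absent or malformed,
// which the harness treats as "no semantic checkpoint this turn".
function parseTurnStatus(output: string): TurnStatus | null {
  const m = /<turn_status>(complete|blocked|giving_up)<\/turn_status>$/.exec(output.trim());
  return m ? (m[1] as TurnStatus) : null;
}
```

The dispatcher branch is then mechanical: `"blocked"` → SignalPause, `"giving_up"` → PhaseReassess, `"complete"` / `null` → continue the normal loop.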
2.6 last_error cap
Spec: § 7.3, C-74.
What: truncate last_error to 4 KB head+tail; full payload to .sf/active/{unit-id}/last-error-full.txt. Agent reads the file if needed.
Why strong: lint output / traceback dumps can blow the prompt. Current behaviour is "inject and pray."
Effort: 1 day.
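The head+tail truncation is a few lines; a sketch with an illustrative name (byte-based cut, so it may split a multi-byte character at the seam — acceptable for a log excerpt, since the full payload survives on disk):

```typescript
// Cap an error payload at maxBytes, keeping head + tail and marking the elision.
// The caller writes the full payload to last-error-full.txt before capping.
function capError(full: string, maxBytes = 4096): string {
  const buf = Buffer.from(full, "utf8");
  if (buf.length <= maxBytes) return full;
  const half = Math.floor(maxBytes / 2);
  const head = buf.subarray(0, half).toString("utf8");
  const tail = buf.subarray(buf.length - half).toString("utf8");
  return `${head}\n…[truncated — full payload in last-error-full.txt]…\n${tail}`;
}
```

Head+tail beats head-only here because tracebacks put the root cause at the bottom and lint runners put the summary at the bottom.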
2.7 Cost stored as integer micro-USD
Spec: C-69.
What: rename cost_usd REAL → cost_micro_usd INTEGER in runs, benchmark_results. Float drift on accumulated costs is real over thousands of runs.
Why strong: small change, real correctness improvement, easier reasoning about totals.
Effort: 1 day with the migration.
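The conversion boundary is small: floats exist only at the edges (provider pricing in, display out), integers everywhere costs accumulate. A sketch with illustrative names:

```typescript
// Convert a float USD cost to integer micro-USD once, at record time.
function toMicroUsd(costUsd: number): number {
  return Math.round(costUsd * 1_000_000);
}

// Divide back out only for display; sums over thousands of runs stay exact
// because integer addition has no rounding drift.
function formatUsd(costMicroUsd: number): string {
  return `$${(costMicroUsd / 1_000_000).toFixed(6)}`;
}
```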
Tier 3 — NICE (v3.1 or v3.2)
Worth building, just not blocking. Ship after Tier 2 if calendar allows.
| Item | Spec | One-line |
|---|---|---|
| Inter-agent messaging | § 18, A-05..A-08 | send_message + inbox + wait_for_reply + handoff. Builds on Tier 2.1 persistent agents. ~1–2 weeks. |
| Workflow content pinning | § 4.5, C-71 | SHA-256 hash of template content stored per unit; in-flight units use pinned content. Defends against operator editing the template mid-run. ~3 days. |
| Trace `_meta` record | § 19.3, C-79 | First line of each daily JSONL is a schema-version record. Forward-compatible. ~1 day. |
| `runs` table | § 3.1, C-48, C-49, C-59 | Unifies unit_attempt and agent_run history. sf has `audit_events` already; either repurpose it or add a new view. Decision required. ~1 week. |
| `pending_retain` queue | § 16.1, C-51 | sm retain failures queue locally and retry with backoff. Required if and only if sm is integrated (Tier 1.2). |
| Capability-tag handoff | § 18.4, C-82, C-90 | `handoff("capability:go,testing", ...)` resolves to any matching agent. Adds an agent_capabilities index. Builds on Tier 2.1 + Tier 3 inter-agent messaging. ~3 days. |
| `agent_run` budget + termination | § 17.5, C-54, C-65 | When does an agent run end? (inbox drained / explicit stop / budget hard-limit / supervisor signal / timeout). Compaction preserves the wake message. ~1 week. |
Tier 4 — DEFER (only if a deployment actually demands it)
Spec sections that landed during late-stage adversarial review and only matter at scale or in specific deployments.
| Item | Spec | Why deferred |
|---|---|---|
| SSH worker extension | § 22, C-64, C-75, E-02 | Real for fleet deployments (bunker, inference-fabric scaling). Not real for daily-driver development. Build when a user actually needs to dispatch to a remote box. |
| HTTP API auth | § 19.5, C-77 | Only needed if the HTTP API ships. SF currently supports MCP as a client surface only, not as an SF workflow server. |
| `trace_index` SQL | § 19.3.1, C-80 | Forensics over JSONL is fine until grep gets slow. Build the index when you have months of trace files, not before. |
| PhaseUAT | § 4.6, C-53, C-76 | Only matters for "release" workflows where humans sign off before merge. Add when needed. |
| Multi-orchestrator atomic claim | C-47 | The single-process run.lock is sufficient. The atomic UPDATE pattern matters when two orchestrators race against the same DB; sf doesn't deploy that way today. |
| `specs.check` JSDoc CI | C-37 | Useful but not blocking. Add when JSDoc rot becomes a real issue. |
Tier 5 — DROP from spec
These crept in during adversarial review iterations and don't earn their keep.
| Item | Spec | Why drop |
|---|---|---|
| `cost_per_1k_micro_usd` field type rename | C-69 (partial) | If we accept `cost_micro_usd` for runs (Tier 2.7), the `benchmark_results.cost_per_1k_micro_usd` rename is internally consistent — but the user-facing pricing model that benchmark uses already varies per provider; the integer-micro-USD constraint there is over-engineered. Keep REAL for benchmark, integer for runs. |
| `runs` snap_ columns (`unit_id_snap`, `agent_name_snap`) | C-59 | If we use soft-delete (`archived_at`) and never hard-delete, snapshots are unnecessary. Drop the columns. |
| `workflow_pins` content snapshot table | C-71 | If we just hash the file at first dispatch and store the hash on the unit (`units.workflow_hash`), we don't need a separate pins table. The hash is enough; the content can be re-read from disk. Simplify. |
| `agent_capabilities` separate indexed table | C-90 | At fleet sizes <100 agents, the JSON-array-LIKE scan is fine. Add the index when you have a measurement showing it's slow. |
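The hash-on-the-unit simplification for workflow pinning is one call against Node's built-in crypto module; a sketch (`workflowHash` is an illustrative name):

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the template content at first dispatch; stored as
// units.workflow_hash so in-flight units can detect mid-run template edits.
function workflowHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}
```

On each later dispatch, re-hash the file and compare against the stored hash; a mismatch means the operator edited the template mid-run, and the unit keeps using the content matching its pinned hash (or pauses for review).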
Suggested v3 milestone breakdown
v3.0 — ship target: ~6–8 weeks
- Tier 1.1 Vault (1–2d)
- Tier 1.2 sm integration, layered model (2 weeks)
- Tier 1.3 spec schema rewrite to 3-table (3d)
- Tier 1.4 config alignment (1 week)
- Tier 2.2 doc-sync (1 week)
- Tier 2.5 turn_status marker (3d)
- Tier 2.6 last_error cap (1d)
- Tier 2.7 cost_micro_usd (1d)
That's ~5 weeks of work for the must-haves.
v3.1 — ~4 weeks after v3.0
- Tier 2.1 persistent agents v1 (2 weeks)
- Tier 2.3 intent chapters (1 week)
- Tier 2.4 PhaseReview 3-pass (1 week)
v3.2 — when ready
- Tier 3 items as appetite allows.
Decisions needed before starting v3.0
- sm: replace, layer, or keep? Recommended: layer (sf local cache + sm durable).
- Schema: migrate to a single `units` table or update the spec to 3-table? Recommended: update spec.
- Persistent agents in v3.0 or v3.1? Recommended: v3.1 — too much new surface to land alongside Tier 1 + 2.
- Does any deployment actually need SSH workers in v3.x? If not, drop §22 from spec entirely; re-add when needed.