singularity-forge/BUILD_PLAN.md


sf v3 Build Plan

A practical cut of the 56 NEW items in SPEC.md into tiers. Not every spec item is worth building for v3 — some were polish from late-stage adversarial review iterations and only matter at scale or in deployments we don't have.

This document is the answer to: what should we actually ship for v3?

Strategic frame — 2026-05

We are already on a strong base: Forge is the product, UOK is the kernel, and core work is gated by purpose-driven TDD plus the eight PDD fields. The goal of this build plan is not to turn SF into a generic CLI coder. The goal is to sharpen Forge's autonomous single-repo execution while borrowing the best ideas from adjacent systems.

This file is a planning document, not a verified implementation ledger. An item can be mapped here and still be open, partial, or only folded into milestone planning. Close-out still requires code evidence, tests, and milestone artifacts that prove the behavior exists in the repo.

Use external comparisons to sharpen, not to steer identity:

  • Claude Code / Codex — interaction and execution ergonomics
  • Aider / gsd-2 — direct execution and repo work loop
  • Plandex — workflow decomposition and staged progress
  • ACE Coder — future multi-repo and large-scale convergence patterns, not the near-term product path for Forge

The end state is not "SF plus a pile of borrowed references." The end state is that proven workflow, execution, and reliability patterns are absorbed into Forge and UOK as first-party behavior.

High-level milestone sequence

  1. Stabilize the core. Keep UOK, purpose-driven TDD, the eight PDD fields, and repo-local state/evidence as the non-negotiable base.
  2. Sharpen single-repo execution. Port the highest-value correctness and workflow ideas from pi-mono, gsd-2, and adjacent CLI systems where they improve Forge without changing its product identity.
  3. Deepen autonomous reliability. Improve evidence capture, recovery, verification, and self-improvement loops inside the single-repo boundary.
  4. Polish product surfaces. Make the autonomous workflow legible in TUI, CLI, and docs without introducing separate planning semantics.
  5. Absorb and converge deliberately. Fold proven external patterns into Forge/UOK as native behavior, and keep interfaces/concepts compatible with ACE Coder where useful, while letting Forge and ACE grow from their different starting points.

Tier 0 — Pi-mono ports (sf: do these FIRST)

Pi-mono (badlogic/pi-mono) has shipped 4 releases (v0.70.3 → v0.70.6) since our last vendor sync. These should be picked up before other v3 work because:

  • They're security/correctness fixes for code we already use.
  • They land cleanly (no namespace divergence — packages/pi-* were vendored from pi-mono with same paths and type names).
  • Skipping them means dragging known bugs into v3 work.

Order: security first → real bugs → infra → features.

| Order | Pi-mono fix | Why | Status | Reference |
|---|---|---|---|---|
| 1 | HTML export: escape image data + session metadata | Security — crafted session content could inject markup in exported HTML | 701ec8fb8 + dist 92c6d933c | PRs #3819, #3883 |
| 2 | Empty tools array fix for providers that reject | Correctness bug — some providers reject the call | 58b1d7c60 | PR #3650 |
| 3 | Anthropic SSE: ignore unknown proxy events | Correctness bug — proxies emit OpenAI-style done events | DEFERRED — fix doesn't apply directly. Pi-mono moved off the SDK to a custom SSE parser (3 commits: 4b926a30a + e58d631c8 + 3e7ffff18); we still use client.messages.stream() from @anthropic-ai/sdk. To get this protection we'd need to port the entire pi-mono custom-SSE refactor (~200 LOC). Real engineering effort, separate item. | issue #3708 |
| 4 | Long local-LLM SSE timeout (5-min undici cutoff) | Correctness bug — local Ollama / LM Studio over 5 min die with UND_ERR_BODY_TIMEOUT | d0907b6d8 | issue #3715 |
| 5 | Bedrock inference profile normalization | Bedrock prompt-caching + adaptive-thinking checks fail on inference profile ARNs | 7c487bb60 | PR #3527 |
| 6 | Symlinked packages/resources/skills/sessions dedup | Selectors and loaders show duplicates when paths are symlinked | TODO | PR #3818 |
| 7 | ctx.ui.setWorkingVisible() extension API | Lets extensions hide the built-in working-loader row; useful for autopilot UX | TODO | issue #3674 |
| 8 | Cloudflare Workers AI provider | New provider option (CLOUDFLARE_API_KEY/CLOUDFLARE_ACCOUNT_ID) | TODO | PR #3851 |
| 9 | Azure Cognitive Services endpoint | Azure OpenAI Responses base URL support | TODO | PR #3799 |
| NEW | Port pi-mono custom Anthropic SSE parsing (replaces SDK) | Address #3 properly: own the SSE parser like pi-mono, then the unknown-event filter applies. Multi-commit refactor. | TODO | 4b926a30a + e58d631c8 + 3e7ffff18 |

Process for each: read the pi-mono commit, port the fix to our packages/pi-* (cherry-pick should work cleanly here — same namespace as upstream); commit with port(pi-mono): <description> (refs <pi-mono SHA>) style.

Skip from pi-mono (not applicable to us):

  • pi update --self, pi.dev update endpoint, Windows self-update — we vendor; no pi-binary auto-update path
  • Bun startup / sandbox /proc/self/environ fixes — we run on Node, not Bun
  • Packaged session selector import — our dist layout differs

Tier 0.5 — gsd-2 high-value manual ports (after Tier 0)

gsd-build/gsd-2 has 4,589 commits we're missing. Cherry-pick fails on virtually all of them because of our namespace divergence (gsd_* → sf_* rename, extensions/gsd/ → extensions/sf/ rename, prior pi-mono direct cherry-picks). These have to be manually ported — read the commit, write equivalent code against our paths and naming.

Process for each:

  1. Read the commit at gsd-build/gsd-2 (we have it as upstream/main).
  2. Find the equivalent file(s) in our extensions/sf/ tree.
  3. Apply the fix manually with gsd_* → sf_* and .gsd/ → .sf/ translations.
  4. Commit with port(gsd-2): <description> (refs <gsd-2 SHA>) style.
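Step 3's translations are mechanical enough to script. A sketch of a hypothetical helper (not existing tooling in this repo) that applies the fork's rename pairs to a ported snippet, so a manual port can be diffed against upstream:

```typescript
// Hypothetical helper for manual gsd-2 ports: translate upstream
// identifiers into sf's namespace. Rename pairs mirror the fork's
// structural choices; extend the table as new divergences appear.
const RENAMES: Array<[RegExp, string]> = [
  [/\bgsd_/g, "sf_"],                       // tool names: gsd_* -> sf_*
  [/extensions\/gsd\//g, "extensions/sf/"], // extension tree rename
  [/\.gsd\//g, ".sf/"],                     // repo-local state dir
];

function translateGsdToSf(snippet: string): string {
  // Apply each rename in order over the whole snippet.
  return RENAMES.reduce((text, [from, to]) => text.replace(from, to), snippet);
}
```

This only handles the textual renames; semantic divergence (prior pi-mono cherry-picks, richer local implementations) still needs a human read of the commit.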

Critical fixes worth porting (limit to security + correctness; skip parallel-evolution churn):

| Order | gsd-2 fix | Why | gsd-2 SHA |
|---|---|---|---|
| 1 | fix(safety): persist bash evidence at tool_call (close mid-unit re-dispatch race) | Real race condition; bash tool calls can lose evidence between dispatch and re-dispatch | da7dd56e7 (PR #5056 → #5058) |
| 2 | fix(security): harden project-controlled surfaces | We have a partial cherry-pick at 66ff949c1; supersede with the full fix | 65ca5aa2e |
| 3 | fix(search): narrow native web_search injection | Only inject web_search context when the provider accepts it | 4370bedf3 |
| 4 | fix(gsd): self-heal symlinked .sf staging (path-translated) | Data-loss prevention — when the staging dir is a symlink that's broken or points outside expected scope, detect and self-heal instead of silently writing to the wrong location. Path-translate .gsd/ → .sf/ in the port; the substance is symlink resilience, not the path string. | 9340f1e9b (#4423) |
| 5 | fix(knowledge): scope + budget milestone KNOWLEDGE injection | Prevents milestone-scope knowledge from blowing the context budget | 58d3d4d6c (#4721) |
| 6 | MCP server stdout-buffer deadlock | Not applicable — SF no longer ships an MCP server package. Do not port unless a future accepted ADR reintroduces an SF-owned MCP server. | N/A |
| 7 | fix(agent-session): guard synthetic agent_end transitions | Session-transition race when agent_end was synthesised | 71114fccf |
| 8 | fix(agent-session): skip idle wait after agent_end | Idle wait was burning time on a session that was already ending | 6d7e4ccb5 |
| 9 | Fix agent_end session switch handoff | Session handoff during agent_end could drop the next session | c162c44bf |
| 10 | Fix session transition during agent_end | Companion to the above | e3bd04551 |
| 11 | fix(claude-code-cli): persist Always Allow for non-Bash tools | Always-Allow grants didn't persist for non-Bash tools | a88baeae9 (PR #5096) |

Normal-value features worth porting (not critical, but real):

| Order | gsd-2 feature | Why | Effort | gsd-2 SHA(s) |
|---|---|---|---|---|
| 12 | /gsd eval-review (slim, like product-audit) | New milestone-end evaluation review command + frontmatter schema. We don't have it. Slim port pattern: prompt + tool + workflow template; skip parallel rewrites of dispatch/prompts. | 2 hrs | 979487735 6971f4333 a2f8f0e08 83bcb054c a686d22cb (+11 polish commits) |
| 13 | Workflow state machine hardening (5 commits as a unit) | Harden workflow state transitions, persist workflow retry and summary state, fail closed on unreadable milestone summaries, restore slice dependency fallback. Reliability of long auto runs. | 2 hrs | f2377eedd b9a1c6743 153fb328a 381ccdef5 371b2eb31 (PR #4758) |
| 14 | Proactive rate limiting via min_request_interval_ms | Self-throttle to avoid 429s — model-side rate-limit data is observability-only (per SPEC.md §19.6); this is the per-dispatch knob. | 1 hr | f980929f1 73bc4d2f1 (PR #5007) |
| 15 | Per-call token telemetry (opt-in) | pi-coding-agent gains opt-in per-call token telemetry hooks. Useful for cost dashboards. | 0.5 hr | b4d4725ad (PR #5023) |
| 16 | Worktree TUI commands (worktree {list,merge,clean,remove}) | Adds these to the TUI dispatcher. We may have parts of this; check before porting. | 1 hr | 2361ceeb1 (PR #5055) |
| 17 | Doctor check for orphan milestone directories | Diagnostic — flags .sf/active/ artifacts whose milestones are gone. Aligns with SPEC.md C-24 startup cleanup. | 0.5 hr | 420354f99 (PR #4998) |
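The min_request_interval_ms knob (row 14) amounts to a small per-dispatch throttle. A sketch under assumed names (this is not gsd-2's actual implementation), with an injectable clock so the behavior is testable without real sleeps:

```typescript
// Sketch of a proactive self-throttle keyed on min_request_interval_ms.
// Names are illustrative. The clock is injectable for testability.
function makeThrottle(minIntervalMs: number, now: () => number = Date.now) {
  let lastDispatch = -Infinity; // projected time of the most recent dispatch
  return {
    // Returns how long the caller should wait before dispatching,
    // and records the projected dispatch time.
    reserve(): number {
      const wait = Math.max(0, lastDispatch + minIntervalMs - now());
      lastDispatch = now() + wait;
      return wait;
    },
  };
}
```

Unlike reacting to 429s, this spreads requests out before the provider ever pushes back, which is why the spec treats model-side rate-limit data as observability-only.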

Skip from gsd-2 (parallel evolution; we have own implementations):

  • auto-dispatch.ts, auto-prompts.ts, benchmark-selector.ts rewrites — we have these and ours are richer (e.g. our benchmark-selector has more eval types).
  • UnitContextManifest / Composer rewrite (~15 commits, PRs #4782 / #4924 / #4925 / #4926) — major architectural refactor that conflicts heavily; revisit during v3 §3 schema reconciliation.
  • xiaomi/minimax/product-audit features — already ported in commits ae0bbe32f, 2eebeccb9, a8cf2cd94.
  • All headless UX, prompt edits (DeepWiki/Context7), Serena hints, and global MCP loading — already addressed in our session (commits c41912ff5, dff0df5fd); we have own equivalents.

See UPSTREAM_CHERRY_PICK_CANDIDATES.md for the full audit (all 4,589 commits surveyed; this Tier 0.5 list is the 17 worth porting — 11 critical + 6 normal value).


Tier 1+ active follow-ups (after Tier 0 lands)

These came up during recent ports and refactor passes — tracked here so they don't get lost.

| Follow-up | Why | Tier | Effort |
|---|---|---|---|
| Minimax search tests | Search agent ported the feature but explicitly skipped tests because bunker's tests don't match our preferences/provider export shape. Need: getMiniMaxSearchApiKey() priority order, resolveSearchProvider() returning "minimax", /search-provider minimax CLI behavior, no-key error messages, executeMiniMaxSearch request shape. | 1 | 0.5 day |
| Product-audit phase machine wire-up | Slim port (commit a8cf2cd94) shipped the prompt + sf_product_audit tool + workflow template, but doesn't yet dispatch into PhaseMerge or PhaseComplete. The tool is callable; the phase doesn't auto-fire. | 2 | 0.5 day |
| Headless assistant-text preview | Headless UX commit (dff0df5fd) covered notification spam, categorization, and phase/status tag distinction. The fourth bunker improvement — separating assistantTextBuffer from thinkingBuffer and flushing both as concise previews on tool-execution-start / message-end — was deferred because it's a meatier change in headless.ts. | 2 | 0.5 day |
| Search provider registry refactor | Adding minimax took 9 files because the provider list is duplicated across provider.ts (type + VALID_PREFERENCES), native-search.ts, command-search-provider.ts (CLI), tool-search.ts + tool-llm-context.ts (two separate execute paths!), preferences-types.ts, preferences-validation.ts, manifest, docs. A single SearchProviderRegistry array would let everything iterate. | 2 | 3-5 days |
| Pi-mono SDK sync | We pull from pi-mono directly (separate from the gsd-2 sync stance). Periodically check pi-mono/main for SDK improvements worth taking. The remote is set up; the cadence is not. | 3 | recurring |
| Caveman input-side compression (manual) | Caveman skill installed (output compression, ~75% fewer agent tokens). Input side — sf's own prompts (execute-task.md, discuss.md, plan-*.md, etc.) — is verbose: 10-step instruction lists, runtimeContext, memoriesSection, taskPlanInline, slicePlanExcerpt. Manually rewrite the heaviest sections in caveman style (preserve intent + nuance, drop fluff). Test against current to confirm no quality regression. | 2 | 1-2 days |
| Runtime input preprocessor (caveman-compress) | Add a transformation step in dispatch that pipes sf's rendered prompt through caveman-compress (sub-skill in the juliusbrussee/caveman repo, ~46% input-token reduction) before the LLM call. Only enable when a terse_prompts: true preference is set. Adds a layer that can drift from authored intent — needs a comparison harness. | 3 | 3-4 days |
| Full swarm chat for subagent tool | Round-robin debate mode now exists as subagent({ mode: "debate", rounds: N, tasks: [...] }), so adversarial reviewers can engage prior-round arguments. Remaining work is Option C from ADR-011: full inbox-based swarm chat after the persistent-agent layer (SPEC §17-18) lands. | 3 | ~3 weeks (depends on persistent-agent layer) |
| Singularity Knowledge + Agent Platform (Go re-platform) | Re-platform Singularity Memory from Python+FastAPI+Postgres+vchord to Go on Charm: charm-server patterns for auth/identity, fantasy as agent runtime, same Postgres+vchord for retrieval, exact wire-contract preserved. Load-bearing for cross-instance knowledge federation AND future central persistent agents (sf SPEC §17). See ADR-014 and singularity-memory/MIGRATION.md. | 1 | ~12 weeks across phases |
| Wire sf to Singularity Memory remote-mode | sf-side: change the memory-store.ts provider chain from local-SQLite-only to remote-Singularity-Memory → embedded → local-only fallback. Once wired, ~80% of the "should sf instances interlink?" question (ADR-012) is answered for free. Depends on the platform itself being live. | 1 | 1 week post-platform |
| Judge calibration + eval runner service | Documentation-only for now. When implemented, keep SF core in TS for repo profiling and .sf/sf.db run ledgers, but build model-judge execution/calibration as a Go/Charm service using fantasy/catwalk, with durable false-positive/false-negative lessons retained into Singularity Memory. See repo-native-harness-architecture.md. | 2 | ~2-3 weeks after Singularity Memory remote-mode |
| sf-worker SSH host | Build the Go-based SSH worker host for distributed execution (SPEC §22, NEW): wish + xpty/conpty + promwish. Orchestrator dispatches over SSH; the worker spawns the agent in a real pty per attempt; Prometheus metrics for free. See ADR-013. | 2 | ~3 weeks |
| Charm TUI client (sf-tui) | Build a new Go-based TUI client on pony + ultraviolet + bubbles + lipgloss + glamour + huh + harmonica + x/mosaic. Talks to the sf daemon over RPC. Two-stage replacement of pi-tui: ship parallel as sf --tui=charm, reach parity, flip default, delete pi-tui (sheds ~10k LOC of TS from sf core). See ADR-017. | 2 | ~12-16 weeks across stages |
| Flight recorder (x/vcr) | Frame-accurate session recording for sf auto-loop dispatches. Go service using charmbracelet/x/vcr. Records to .sf/recordings/{unit-id}.vcr; sf replay <unit-id> opens a TUI player. Frame-level redaction parity with event-log.jsonl. See ADR-015. | 3 | ~3 weeks |
| Multi-instance federation (other surfaces) | Federated benchmarks, federated persistent agents, cross-repo unit graph — all deferred. Decide ride-Singularity-Memory vs separate service for benchmarks after §16 lands and we observe duplicated discovery cost. Cross-repo orchestration is out of scope for sf (meta-coordinator territory). Federated agents wait until concrete pain shows up. See ADR-012. | 3 | depends on which surface — re-scope after Singularity Memory lands |
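The "Search provider registry refactor" row argues that one array should replace nine hand-maintained lists. A sketch of what that single source of truth could look like — the entry shape, provider entries, and env-var names here are illustrative assumptions, not the current provider.ts API:

```typescript
// Hypothetical SearchProviderRegistry: one array that the CLI command,
// preferences validation, and both execute paths can all iterate,
// instead of each file hard-coding the provider list.
interface SearchProviderEntry {
  id: string;        // value used by /search-provider and preferences
  apiKeyEnv: string; // env var consulted first for credentials
  native: boolean;   // provider exposes a native web_search tool
}

const SEARCH_PROVIDERS: SearchProviderEntry[] = [
  { id: "minimax", apiKeyEnv: "MINIMAX_API_KEY", native: false },
  { id: "anthropic", apiKeyEnv: "ANTHROPIC_API_KEY", native: true },
];

// Derived views replace today's duplicated hand-maintained lists.
const VALID_PREFERENCES = SEARCH_PROVIDERS.map((p) => p.id);

function resolveSearchProvider(id: string): SearchProviderEntry | undefined {
  return SEARCH_PROVIDERS.find((p) => p.id === id);
}
```

Adding a provider then becomes a one-line change to the array; everything downstream derives from it.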

This list is opinionated. Each item has a tier and a one-line rationale. Reorder freely.


Upstream stance

sf is a fork. We do not periodically sync from gsd-build/gsd-2.

We tried (see attempt log in UPSTREAM_CHERRY_PICK_CANDIDATES.md). The conflicts run deep because of three structural choices that are intentional and won't be reverted:

  • We renamed gsd_* tool names → sf_* (421fccd89).
  • We renamed the @sf-run/* package scope → @singularity-forge/* (f92ee8d64).
  • We've cherry-picked tool fixes from pi-mono upstream directly (f153521c2), which addresses some bugs that gsd-2 fixed differently.

Pretending we still track gsd-2 means weeks of merge work for diminishing return. Better to:

  • Treat gsd-build/gsd-2 upstream as an intelligence source. We read it. We hand-port fixes when one specifically bites us. UPSTREAM_CHERRY_PICK_CANDIDATES.md is a reference list of what's available, not an action plan.
  • Pull from pi-mono directly for SDK improvements. We've already been doing this; continue.
  • Track our own roadmap via SPEC.md and this file.

If a specific upstream fix matters (e.g. a CVE, a bug we hit), port it manually and credit upstream in the commit message. Don't try to sync the whole tree.


Tier 1 — ESSENTIAL (block v3 ship)

These resolve real product or correctness gaps. v3 isn't v3 without them.

1.1 Vault secret resolver

Spec: § 24, C-38, C-83.
What: vault://secret/path#field URI resolver, replacing any plaintext provider keys in current config. Auth chain: VAULT_TOKEN → ~/.vault-token → AppRole.
Why essential: sf is a real tool used against real models with real billing. Plaintext keys in config files are a security regression we should not ship past.
Effort: 1-2 days. pi-ai config layer adds a resolver.
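The URI shape itself is easy to pin down up front. A minimal parse sketch — parsing only; the real resolver would then fetch the secret from Vault via the auth chain above, and the field names here are illustrative:

```typescript
// Minimal parse of the vault://secret/path#field URI shape (§24).
// Anything that doesn't match is treated as a non-vault config value.
interface VaultRef {
  path: string;  // secret path inside Vault, e.g. "secret/providers"
  field: string; // key inside the secret, e.g. "anthropic_api_key"
}

function parseVaultUri(uri: string): VaultRef | null {
  const m = /^vault:\/\/([^#]+)#(.+)$/.exec(uri);
  return m ? { path: m[1], field: m[2] } : null;
}
```

Config loading can call this on every credential-shaped value: a match triggers resolution, a null passes the literal through (and a doctor check can flag literals that look like plaintext keys).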

1.2 Singularity Memory integration decision + execution

Spec: § 16, § 24, C-94, C-95, K-01 through K-06.
What: Decide whether sm replaces sf's existing memory layer, layers on top, or stays absent — then execute. The repo at singularity-ng/singularity-memory exists; integrating means replacing or augmenting memory-store.ts, memory-extractor.ts, memory-relations.ts, tools/memory-tools.ts, bootstrap/memory-tools.ts.
Why essential: the spec leans heavily on sm (anti-patterns, two-bank recall, cross-tool sharing). Either commit to it or rewrite §16 to match what sf actually has.
Recommended path: keep sf's local memory as a hot cache + use sm as durable cross-tool store. This is the layered model — sf's local memory becomes the operational fast-path; sm holds long-term cross-session, cross-project, cross-tool memories.
Effort: 1-2 weeks for the integration; 1 day to decide.
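The layered model is a read-through cache. A sketch of the recall path under that model — store and method shapes here are illustrative, not the actual memory-store.ts API:

```typescript
// Layered recall sketch: sf's local memory is the operational fast-path,
// Singularity Memory (sm) the durable cross-tool store.
interface MemoryStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

function recall(
  key: string,
  local: MemoryStore,   // hot cache: repo-local store
  durable: MemoryStore, // sm: cross-session, cross-project, cross-tool
): string | undefined {
  const hot = local.get(key);
  if (hot !== undefined) return hot;            // fast-path hit
  const cold = durable.get(key);
  if (cold !== undefined) local.set(key, cold); // warm the hot cache
  return cold;                                  // undefined = miss everywhere
}
```

Writes would go the other direction (write-through or queued retain); the pending_retain queue in Tier 3 exists precisely because the durable leg of that write can fail.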

1.3 Schema reconciliation: units vs milestones/slices/tasks

Spec: § 3.1.
What: sf has 3 tables, spec has 1 with a type column. Either:

  • (a) Migrate sf to single units table (data migration; touches many files).
  • (b) Update spec to 3-table model (no code change; spec rewrite).

Recommended path: (b) — keep what sf has. The 3-table shape is more granular and integrates with decisions, requirements, artifacts, assessments, replan_history, which have rich schemas of their own. Forcing them into one units table loses information.
Effort: 2-3 days for spec rewrite, 0 days code.

1.4 Config schema alignment

Spec: § 14.2, C-25, C-26, C-73.
What: config-overlay.ts exposes whatever keys sf has today. Spec specifies context_compact_at, context_hard_limit, unit_timeout, unit_timeout_by_phase, max_agents_by_phase, turn_input_required, worktree_mode, tool_abort_grace, max_turns_per_attempt, hot_cache_turns, etc. Add missing keys with defaults; document each.
Why essential: users can't tune behavior they can't configure. Spec promises configurability that doesn't exist yet.
Effort: 3-5 days. Add keys, plumb through, write doctor checks.


Tier 2 — STRONG (ship with v3 if possible, otherwise v3.1)

Real value-add. Defer is allowed but disappointing.

2.1 Persistent agents v1 (basic, no messaging)

Spec: § 17, A-01, A-02, A-03, A-04, A-09, A-10. Defer: A-05, A-06, A-07, A-08 (messaging) to v3.1.
What: named agents with their own memory blocks, system prompt, message history, durable across sessions. core_memory_append / core_memory_replace tools. /sf agent run|reset|delete|inspect commands.
Why strong: the persistent-agent pattern was the main draw from Letta and a recurring user interest throughout this spec process. Shipping basic persistent agents in v3 unlocks the architecture; messaging can come in v3.1.
Effort: 2 weeks for basic; +1-2 weeks for messaging.

2.2 Doc-sync sub-step

Spec: § 10.5, C-20, C-45, C-68.
What: at the end of the last code-mutating phase (Merge or, for spike workflows, Execute), run a fast-tier dispatch to check whether ARCHITECTURE.md/CONVENTIONS.md/STACK.md need updates and propose a diff for user approval.
Why strong: project docs rotting is the most predictable failure mode of long autopilot runs. Catching it costs ~5 minutes per merge.
Effort: 3-5 days.

2.3 Intent chapters

Spec: § 19.4, C-34.
What: spans grouped into named "what was the agent trying to do" chapters. Inferred from phase transitions or agent-declared via chapter_open(name). Used for crash-resume context and Hindsight recall.
Why strong: crash-resume reconstruction is currently weak. Chapters give the resumed agent a coherent "what was I doing" header instead of replaying raw tool calls.
Effort: 1 week.
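The chapter bookkeeping is small. A sketch under assumed shapes (chapter_open comes from the spec; the class and method names around it are illustrative):

```typescript
// Intent-chapter sketch (§19.4): spans are grouped under a named chapter,
// opened by a phase transition or an explicit chapter_open(name) call.
interface Chapter {
  name: string;
  spanIds: string[];
}

class ChapterLog {
  private chapters: Chapter[] = [];

  chapterOpen(name: string): void {
    this.chapters.push({ name, spanIds: [] });
  }

  recordSpan(spanId: string): void {
    // Spans before any explicit open land in an implicit preamble chapter.
    if (this.chapters.length === 0) this.chapterOpen("(preamble)");
    this.chapters[this.chapters.length - 1].spanIds.push(spanId);
  }

  // Crash-resume header: chapter names answer "what was I doing?"
  resumeHeader(): string {
    return this.chapters.map((c) => c.name).join(" > ");
  }
}
```

On resume, the header plus the open chapter's spans give the agent a coherent narrative instead of a flat replay of tool calls.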

2.4 PhaseReview 3-pass review

Spec: § 13.3, C-39, C-63.
What: establish-context pass (single fast dispatch) → parallel chunked review (per-file, ≤300 lines each, standard tier) → synthesis pass.
Why strong: the current single-pass review on large diffs is known to gloss the tail. The 3-pass shape catches more.
Effort: 1 week.
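The middle pass hinges on the chunking rule: no review dispatch sees more than 300 lines of one file's diff. A sketch, with an illustrative chunk shape:

```typescript
// Chunking step for the parallel review pass: split each file's diff
// into <=300-line chunks so no single dispatch gets an oversized tail.
interface ReviewChunk {
  file: string;
  part: number;    // 1-based chunk index within the file
  lines: string[]; // the diff lines for this chunk
}

function chunkDiff(file: string, diffLines: string[], maxLines = 300): ReviewChunk[] {
  const chunks: ReviewChunk[] = [];
  for (let i = 0; i < diffLines.length; i += maxLines) {
    chunks.push({ file, part: chunks.length + 1, lines: diffLines.slice(i, i + maxLines) });
  }
  return chunks;
}
```

Each chunk then gets its own standard-tier dispatch, seeded with the establish-context pass's summary, and the synthesis pass merges the findings.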

2.5 turn_status marker

Spec: § 5.4.1, C-81.
What: parse <turn_status>complete|blocked|giving_up</turn_status> from end of agent output. blocked triggers SignalPause; giving_up transitions to PhaseReassess immediately.
Why strong: a per-turn semantic checkpoint between transport-success and phase-boundary. Currently the harness has no way to know "the agent thinks it's stuck" except by waiting for stuck-loop timeout.
Effort: 2-3 days.
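Parsing the marker is the easy half; the value restriction is the part worth pinning down. A sketch that accepts only the three spec'd values and treats anything else as absent:

```typescript
// Parse the per-turn status marker (§5.4.1) from the tail of agent output.
// Only the three spec'd values are accepted; an unknown or missing marker
// is treated as no signal, so the harness falls back to existing timeouts.
type TurnStatus = "complete" | "blocked" | "giving_up";

function parseTurnStatus(output: string): TurnStatus | null {
  const m = /<turn_status>(complete|blocked|giving_up)<\/turn_status>\s*$/.exec(output);
  return m ? (m[1] as TurnStatus) : null;
}
```

The dispatcher then maps blocked → SignalPause and giving_up → PhaseReassess; complete just closes the turn normally.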

2.6 last_error cap

Spec: § 7.3, C-74.
What: truncate last_error to 4 KB head+tail; full payload to .sf/active/{unit-id}/last-error-full.txt. Agent reads the file if needed.
Why strong: lint output / traceback dumps can blow the prompt. Current behaviour is "inject and pray."
Effort: 1 day.
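The head+tail shape matters because error summaries live at the start and final assertions at the end; the middle is usually repetitive spam. A sketch — character counts stand in for bytes to keep it simple, and the marker text is illustrative:

```typescript
// Cap last_error at ~4 KB by keeping the head and tail (§7.3) and
// pointing at the full payload on disk for the agent to read if needed.
function capLastError(
  raw: string,
  maxChars = 4096,
  fullPath = ".sf/active/{unit-id}/last-error-full.txt",
): string {
  if (raw.length <= maxChars) return raw; // small errors pass through intact
  const half = Math.floor(maxChars / 2);
  // Head carries the summary; tail carries the final assertion/exit status.
  return `${raw.slice(0, half)}\n[... truncated; full output in ${fullPath} ...]\n${raw.slice(-half)}`;
}
```

The full payload is written to last-error-full.txt before capping, so the truncation marker is always an honest pointer.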

2.7 Cost stored as integer micro-USD

Spec: C-69.
What: rename cost_usd REAL → cost_micro_usd INTEGER in runs, benchmark_results. Float drift on accumulated costs is real over thousands of runs.
Why strong: small change, real correctness improvement, easier reasoning about totals.
Effort: 1 day with the migration.
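The point of the integer representation is that sums stay exact. A sketch of the conversion at the edge (helper names are illustrative):

```typescript
// Convert once at the write boundary, accumulate as integers thereafter.
// Integer micro-USD sums are exact for any realistic run count, whereas
// repeatedly adding float cost_usd values drifts.
const MICRO = 1_000_000;

function toMicroUsd(usd: number): number {
  return Math.round(usd * MICRO); // store in runs.cost_micro_usd INTEGER
}

function formatUsd(microUsd: number): string {
  return `$${(microUsd / MICRO).toFixed(6)}`; // display-time conversion only
}
```

Summing ten thousand runs of $0.0001 each gives exactly 1,000,000 micro-USD; the equivalent float accumulation does not land on exactly 1.0.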


Tier 3 — NICE (v3.1 or v3.2)

Worth building, just not blocking. Ship after Tier 2 if calendar allows.

| Item | Spec | One-line |
|---|---|---|
| Inter-agent messaging | § 18, A-05..A-08 | send_message + inbox + wait_for_reply + handoff. Builds on Tier 2.1 persistent agents. ~1-2 weeks. |
| Workflow content pinning | § 4.5, C-71 | SHA-256 hash of template content stored per unit; in-flight units use pinned content. Defends against the operator editing the template mid-run. ~3 days. |
| Trace _meta record | § 19.3, C-79 | First line of each daily JSONL is a schema-version record. Forward-compatible. ~1 day. |
| runs table | § 3.1, C-48, C-49, C-59 | Unifies unit_attempt and agent_run history. sf has audit_events already; either repurpose or add a new view. Decision required. ~1 week. |
| pending_retain queue | § 16.1, C-51 | sm retain failures queue locally and retry with backoff. Required if and only if sm is integrated (Tier 1.2). |
| Capability-tag handoff | § 18.4, C-82, C-90 | handoff("capability:go,testing", ...) resolves to any matching agent. Adds an agent_capabilities index. Builds on Tier 2.1 + Tier 3 inter-agent messaging. ~3 days. |
| agent_run budget + termination | § 17.5, C-54, C-65 | When does an agent run end? (inbox drained / explicit stop / budget hard-limit / supervisor signal / timeout). Compaction preserves the wake message. ~1 week. |

Tier 4 — DEFER (only if a deployment actually demands it)

Spec sections that landed during late-stage adversarial review and only matter at scale or in specific deployments.

| Item | Spec | Why deferred |
|---|---|---|
| SSH worker extension | § 22, C-64, C-75, E-02 | Real for fleet deployments (bunker, inference-fabric scaling). Not real for daily-driver development. Build when a user actually needs to dispatch to a remote box. |
| HTTP API auth | § 19.5, C-77 | Only needed if the HTTP API ships. SF currently supports MCP as a client surface only, not as an SF workflow server. |
| trace_index SQL | § 19.3.1, C-80 | Forensics over JSONL is fine until grep gets slow. Build the index when you have months of trace files, not before. |
| PhaseUAT | § 4.6, C-53, C-76 | Only matters for "release" workflows where humans sign off before merge. Add when needed. |
| Multi-orchestrator atomic claim | C-47 | The single-process run.lock is sufficient. The atomic UPDATE pattern matters when two orchestrators race against the same DB; sf doesn't deploy that way today. |
| specs.check JSDoc CI | C-37 | Useful but not blocking. Add when JSDoc rot becomes a real issue. |

Tier 5 — DROP from spec

These crept in during adversarial review iterations and don't earn their keep.

| Item | Spec | Why drop |
|---|---|---|
| cost_per_1k_micro_usd field type rename | C-69 (partial) | If we accept cost_micro_usd for runs (Tier 2.7), the benchmark_results.cost_per_1k_micro_usd rename is internally consistent — but the user-facing pricing model that benchmark uses already varies per provider; the integer-micro-USD constraint there is over-engineered. Keep REAL for benchmark, integer for runs. |
| runs snap_ columns (unit_id_snap, agent_name_snap) | C-59 | If we use soft-delete (archived_at) and never hard-delete, snapshots are unnecessary. Drop the columns. |
| workflow_pins content snapshot table | C-71 | If we just hash the file at first dispatch and store the hash on the unit (units.workflow_hash), we don't need a separate pins table. The hash is enough; the content can be re-read from disk. Simplify. |
| agent_capabilities separate indexed table | C-90 | At fleet sizes <100 agents, the JSON-array-LIKE scan is fine. Add the index when you have a measurement showing it's slow. |
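The workflow_pins simplification above reduces to two small functions. A sketch (function names are illustrative; the hash would be stored in a hypothetical units.workflow_hash column):

```typescript
// Simplified pinning (C-71): hash the workflow template at first dispatch,
// store only the digest on the unit, and re-read content from disk later.
import { createHash } from "node:crypto";

function workflowHash(templateContent: string): string {
  return createHash("sha256").update(templateContent, "utf8").digest("hex");
}

// At dispatch time: a changed template yields a different digest, so an
// in-flight unit detects mid-run edits without a separate pins table.
function templateChanged(unitHash: string, currentContent: string): boolean {
  return workflowHash(currentContent) !== unitHash;
}
```

What this gives up relative to a pins table is the ability to keep running against the old content after an edit; it only detects the edit. That trade is the point of the Tier 5 drop.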

Suggested v3 milestone breakdown

v3.0 — ship target: ~6-8 weeks

  • Tier 1.1 Vault (1-2d)
  • Tier 1.2 sm integration, layered model (2 weeks)
  • Tier 1.3 spec schema rewrite to 3-table (3d)
  • Tier 1.4 config alignment (1 week)
  • Tier 2.2 doc-sync (1 week)
  • Tier 2.5 turn_status marker (3d)
  • Tier 2.6 last_error cap (1d)
  • Tier 2.7 cost_micro_usd (1d)

That's ~5 weeks of work for the must-haves.

v3.1 — ~4 weeks after v3.0

  • Tier 2.1 persistent agents v1 (2 weeks)
  • Tier 2.3 intent chapters (1 week)
  • Tier 2.4 PhaseReview 3-pass (1 week)

v3.2 — when ready

  • Tier 3 items as appetite allows.

Decisions needed before starting v3.0

  1. sm: replace, layer, or keep? Recommended: layer (sf local cache + sm durable).
  2. Schema: migrate to single units or update spec to 3-table? Recommended: update spec.
  3. Persistent agents in v3.0 or v3.1? Recommended: v3.1 — too much new surface to land alongside Tier 1 + 2.
  4. Does any deployment actually need SSH workers in v3.x? If not, drop §22 from spec entirely; re-add when needed.