sf snapshot: uncommitted changes after 113m inactivity
parent d3ff8efb22, commit d7c2663ca5. 6 changed files with 2604 additions and 0 deletions.

`docs/plans/DISPATCH_ARCHITECTURE_CONSOLIDATION.md` (new file, +309 lines)

# Dispatch Architecture Consolidation Plan

> **Status**: Draft — for review
> **Author**: Research synthesis from codebase analysis
> **Date**: 2026-05-08

---

## 1. Root Cause Diagnosis — Why the Proliferation Happened

The 5 dispatch mechanisms + 1 message bus are not accidental complexity — each is a response to a genuine gap that appeared at a different time, under different constraints. The structural symptom is that **dispatch, orchestration, and coordination are conflated into one system**, and SF grew new systems rather than extending existing ones when the use cases diverged.

### The Timeline of Divergence

| Era | Mechanism Added | Gap It Filled |
|-----|----------------|---------------|
| Early SF | `subagent tool` | Ad-hoc delegation: "run this agent for this task" |
| Parallel work | `parallel-orchestrator` | "run milestone X in a worktree, independently" — required isolation at the process boundary |
| Slice-level work | `slice-parallel-orchestrator` | Same as above but at finer granularity — duplicate code, not a different concept |
| Autonomous loop | `UOK kernel` | "run the full PDD loop continuously, gated by confidence/risk" |
| Multi-agent messaging | `MessageBus` | "agents need to communicate across turns/sessions" (Letta-style) |
| Surface multiplexing | `Cmux` | "TUI needs multiple visible surfaces for parallel agents" |

### Structural Root Cause

**Single-process thinking drove process-per-unit.** The original SF was a single-agent CLI. When parallelism was needed, the natural answer was `spawn('sf headless')` — a new OS process per milestone. This is correct for isolation but wrong for shared-state coordination. SQLite WAL was bolted on to let workers share a DB, which created the "shared DB with file-based locking" model that all orchestrators now use.

**The UOK kernel was designed as a single-agent loop.** It runs inside the headless process and manages one autonomous run. It does not know about sibling workers, does not coordinate with the parallel orchestrator, and has no model for "I am one of N workers running concurrently."

**MessageBus was designed for persistent agents, but SF doesn't have persistent agents yet.** The Letta-style inbox model is architecturally correct but premature — you need durable named agents before durable named inboxes matter. Today the MessageBus is used for UOK internal observer chains but not for real multi-agent coordination.

**The subagent tool was never designed to integrate with SF's state.** It spawns the `sf` CLI, which is a full TUI/CLI binary. It cannot call SF tools like `complete-task` or `plan-slice` because those are registered in the headless RPC path, not in the subagent's spawned CLI context. The 4 registered tools (subagent, scout, reviewer, reporter) are intentionally narrow to avoid dangerous nested dispatch.

### The Concretion

The proliferation is a **symptom of three missing abstractions**:

1. **No unified "dispatch context"** — subagent, parallel-orchestrator, and UOK each create their own notion of "what am I running and with what environment"
2. **No shared dispatch registry** — there is no single place that tracks "what is currently running" across all parallelism dimensions
3. **No first-class "work unit" concept** — milestone, slice, and task are different tables with different lock semantics, not different states of the same work unit

---

## 2. What Should Stay vs Merge

### Keep (Genuinely Different Needs)

| Mechanism | Reason to Keep |
|-----------|---------------|
| **UOK kernel** | This is the autonomous loop engine. It implements the PDD gate model (confidence/risk/reversibility/blast-radius/cost). Removing it means rewriting autonomous mode from scratch. It should be the *inner loop* of dispatch, not replaced by it. |
| **MessageBus** | A SQLite-backed durable inbox is the right model for cross-turn coordination when agents are long-lived. This is a genuine infrastructure primitive. However: it should be *repurposed*, not extended — it serves UOK diagnostics today and should serve agent handoff tomorrow. |
| **Cmux** | This is surface-layer multiplexing (terminal UI). It belongs in `pi-tui`, not in the dispatch layer. It should be *decoupled* from dispatch entirely — the parallel orchestrator should not know about Cmux grid layouts. |

### Merge (Duplication Without Functional Difference)

| Duplicated | Problem | Resolution |
|------------|---------|-----------|
| `parallel-orchestrator.js` + `slice-parallel-orchestrator.js` | 90% identical code. The only differences are scope (milestone vs slice) and the lock env var name (`SF_MILESTONE_LOCK` vs `SF_SLICE_LOCK`). The conflict detection, worktree management, and worker lifecycle are copy-pasted. | **Merge into a single `WorktreeOrchestrator`** with a `scope` parameter. Share all file overlap detection, worktree lifecycle, and status tracking. |
| **subagent tool's parallel/debate/chain modes** vs **parallel-orchestrator's milestone workers** | Both implement "run multiple things at the same time." The subagent tool does in-process `Promise.all` over spawned `sf` CLIs; the parallel orchestrator does the same over `sf headless` with worktrees. They use different IPC mechanisms and different isolation models. | **The subagent tool should delegate to the unified orchestrator** for multi-agent work, rather than managing its own concurrency pool. The subagent tool keeps single-agent dispatch (its core value) but offloads parallel/debate to the orchestrator layer. |

### Refactor (Same Need, Wrong Implementation)

| Current | Issue | Refactor |
|---------|-------|----------|
| **subagent spawning `sf` CLI** | A full CLI binary with both TUI and headless modes. The subagent is a thin wrapper that spawns a binary, not a dispatch primitive. The 4-tool limitation is a workaround for not having a proper dispatch API. | The subagent should use a **headless RPC client** directly, not spawn `sf`. This allows it to call any SF tool, not just the 4 registered ones. |
| **parallel-orchestrator + slice-parallel using SQLite WAL + file IPC** | Workers coordinate via `sf headless` + session status files + signal files. This is a hand-rolled IPC layer. The status files are "poll the filesystem" coordination — correct but fragile. | Replace with **MessageBus-based coordination**. Workers publish status to MessageBus; the coordinator subscribes. This eliminates file-based IPC and session status polling. |
| **UOK kernel owning the autonomous loop** | The kernel runs inside a headless process. When the parallel orchestrator spawns `sf headless autonomous`, each worker has its own UOK kernel. Coordination between kernels requires external signals. | The UOK kernel should be the **runtime environment** for any autonomous dispatch, not a process-bound concept. The orchestrator manages worktree lifecycle; the kernel manages turn-level execution within each worktree. |

---

## 3. Streamlined Architecture

### The Unified Dispatch Layer

```
┌────────────────────────────────────────────────────────────────────┐
│                     Unified Dispatch API (UDA)                     │
├────────────────────────────────────────────────────────────────────┤
│  dispatch.work({ unit, mode, model, tools, cwd, signal })          │
│  dispatch.batch([{ unit, ... }, { unit, ... }], { strategy })      │
│  dispatch.chain([{ unit, after }, ...])                            │
│  dispatch.debate([{ unit, role }, ...], { rounds })                │
│  dispatch.subscribe(handler)   // events: start, end, error, log   │
│  dispatch.cancel(workId)                                           │
│  dispatch.status() → { active: WorkInfo[] }                        │
└────────────────────────────────────────────────────────────────────┘
```

**Modes**: `isolated` (worktree), `shared` (same process), `rpc` (separate process via headless)

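To make that surface concrete, here is a minimal TypeScript sketch of the same API. Everything below is illustrative: the type names (`WorkRequest`, `DispatchEvent`, `WorkInfo`) and field shapes are assumptions drawn from the diagram, not existing code.

```ts
// Hypothetical sketch of the UDA surface; names mirror the diagram above.
type DispatchMode = 'isolated' | 'shared' | 'rpc';

interface WorkRequest {
  unit: { scope: 'milestone' | 'slice' | 'task'; id: string };
  mode: DispatchMode;
  model?: string;
  tools?: string[];
  cwd?: string;
  signal?: AbortSignal;
}

interface WorkInfo { workId: string; unit: WorkRequest['unit']; startedAt: number }

type DispatchEvent =
  | { type: 'start'; workId: string }
  | { type: 'end'; workId: string; exitCode: number }
  | { type: 'error'; workId: string; error: string }
  | { type: 'log'; workId: string; line: string };

interface Dispatch {
  work(req: WorkRequest): Promise<WorkInfo>;
  batch(reqs: WorkRequest[], opts: { strategy: 'parallel' | 'fifo' }): Promise<WorkInfo[]>;
  chain(steps: Array<WorkRequest & { after?: string }>): Promise<WorkInfo[]>;
  debate(roles: Array<WorkRequest & { role: string }>, opts: { rounds: number }): Promise<WorkInfo[]>;
  subscribe(handler: (ev: DispatchEvent) => void): () => void; // returns an unsubscribe fn
  cancel(workId: string): Promise<void>;
  status(): { active: WorkInfo[] };
}
```
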
### How the Existing Components Map

| Component | Role in Unified Architecture |
|-----------|------------------------------|
| **subagent tool** | Becomes a thin **UDA client** in the TUI. Single-agent dispatch with full SF tool access. Keeps the 4-mode interface (single/parallel/debate/chain) but implemented via UDA, not a spawned CLI. |
| **parallel-orchestrator + slice-parallel** | Merge into **WorktreeOrchestrator** — a UDA backend that manages worktree lifecycle and multi-slot execution. Implements `dispatch.work({ mode: 'isolated' })` for milestone/slice workers. |
| **UOK kernel** | Becomes the **UOK runtime** — a UDA execution mode that wraps any dispatch with the PDD gate model. A `dispatch.work({ unit, runControl: 'autonomous' })` automatically uses the UOK runtime. The kernel is not a separate process; it's the execution strategy. |
| **MessageBus** | Becomes the **UDA event/logging backbone**. All dispatch events (start, end, tool call, error, cost) are published to MessageBus. The parallel orchestrator's file-based IPC is replaced by MessageBus subscriptions. |
| **Cmux** | **Decoupled entirely**. Cmux listens to MessageBus for dispatch events and renders grid layouts accordingly. The dispatch layer does not know about Cmux. |

### The Mental Model: Dispatch Is a Service, Not a Tool

The unified dispatch API is a service (backed by WorktreeOrchestrator + the UOK runtime) that SF agents and tools call. It is not a tool itself and is not registered as one.

```
Agent/Tool                      Dispatch Service
    │                               │
    ├── dispatch.work() ──────────►│  Spawns worktree, runs UOK loop
    │                               │
    │◄──── work.start event ───────┤
    │◄──── work.end event ─────────┤
    │
    ├── dispatch.batch() ─────────►│  Runs N work items in parallel
    │                               │  (via WorktreeOrchestrator)
    │
    ├── dispatch.chain() ─────────►│  Runs N items sequentially, passes
    │                               │  previous output as {previous} input
```

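As a usage sketch, building on the hypothetical `Dispatch` types sketched earlier, a caller that dispatches one milestone and logs lifecycle events might look like this (again, nothing here is an existing API):

```ts
// Usage sketch against the hypothetical Dispatch interface above.
async function runMilestoneWithLogging(dispatch: Dispatch) {
  const unsubscribe = dispatch.subscribe((ev) => {
    if (ev.type === 'start') console.log(`work ${ev.workId} started`);
    if (ev.type === 'end') console.log(`work ${ev.workId} ended (exit ${ev.exitCode})`);
  });
  try {
    // 'isolated' mode: the service provisions a worktree and runs the UOK loop.
    await dispatch.work({ unit: { scope: 'milestone', id: 'M1' }, mode: 'isolated' });
  } finally {
    unsubscribe();
  }
}
```
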
---

## 4. Multi-Dimensional Parallelism

SF needs to run multiple things concurrently at multiple levels:

| Dimension | Example | Current Implementation |
|-----------|---------|----------------------|
| **Unit (milestone/slice)** | Two milestones simultaneously | `parallel-orchestrator` (worktree-per-milestone) |
| **Agent within unit** | Two agents working on the same slice | `subagent parallel mode` (`Promise.all` over spawned CLIs) |
| **Turn within agent** | Agent running the autonomous loop | `UOK kernel` (single-threaded, event loop) |
| **Tool within turn** | Concurrent tool executions | Not supported (single-threaded LLM dispatch) |

### What Should Actually Be Parallel

**The real parallelism need is at the unit level**, not at the agent level. Milestones and slices are the natural parallelism boundary because:

- They have independent file scope (reduced conflict surface)
- They are tracked independently in the DB
- They have independent cost budgets
- They can recover independently from failure

**Agent-level parallelism within a unit** (subagent parallel/debate) is useful for review and research tasks but is not the primary parallelism mode. It should remain, but as a secondary mechanism.

### Proposed Multi-Dimensional Model

```
WorktreeOrchestrator
├── slot[0] → worktree for milestone M1
│     └── UOK kernel running autonomous loop
│           ├── turn[0]: agent dispatch
│           └── turn[1]: agent dispatch (sequential within unit)
├── slot[1] → worktree for milestone M2
│     └── UOK kernel running autonomous loop
└── slot[2] → worktree for slice S1 (within M1)
      └── UOK kernel running autonomous loop
```

**Constraints:**

- Worktrees provide filesystem isolation (required for concurrent file mutations)
- Each worktree runs one UOK kernel (not multiple concurrent kernels per worktree)
- The kernel turn loop is sequential within a worktree (correct — you can't have two LLM turns modifying state simultaneously)
- Tool-level parallelism (e.g., running `grep` and `read` simultaneously) is not needed — the LLM dispatches tools serially

### Concurrency Limits

| Level | Max Concurrent |
|-------|---------------|
| Project (milestones) | `parallel.max_workers` config (default: CPU cores / 2) |
| Milestone (slices) | `parallel.slice_max_workers` config (default: 2) |
| Subagent parallel tasks | `MAX_CONCURRENCY = 4` (currently hardcoded) |

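For illustration, the defaulting rule in this table could be computed as follows. The `resolveMaxWorkers` helper is hypothetical; only the "CPU cores / 2" and "2" defaults come from the table above.

```ts
import * as os from 'node:os';

// Hypothetical helper: resolve the effective worker cap for a dispatch level.
// Defaults follow the table above (CPU cores / 2 for milestones, 2 for slices).
function resolveMaxWorkers(configured: number | undefined, level: 'milestone' | 'slice'): number {
  if (configured !== undefined && configured > 0) return configured;
  if (level === 'milestone') return Math.max(1, Math.floor(os.cpus().length / 2));
  return 2;
}
```
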
---

## 5. DB Access from Subagents

### The Current Constraint

The subagent tool cannot call SF DB tools (`complete-task`, `plan-slice`, etc.) because:

1. It spawns the `sf` CLI, which is a full binary with its own extension registration
2. The spawned CLI does not share the parent process's RPC connection
3. The 4 registered tools (subagent, scout, reviewer, reporter) are intentionally all that's available

This is **correct security isolation**, not a bug. A spawned `sf` CLI with full SF tool access running in a user-specified `cwd` is a significant attack surface.

### The Right Model

**Layer 1 — No direct DB access from subagents (correct, keep it)**

Subagents should not have direct SQLite access. The DB is the source of truth for the primary agent's state; subagents reading it creates consistency hazards.

**Layer 2 — Structured output from subagents (keep and expand)**

Subagents return structured output (via `--mode json` + event stream). The parent agent is responsible for interpreting the output and calling the appropriate DB tools. This is the "subagent as a function" model — it returns data, not mutations.

**Layer 3 — Intention declaration for later commit**

For cases where a subagent needs to propose a state change (e.g., "I found this issue, mark the slice as blocked"), the subagent should return a structured **intention** (e.g., `{ intended_action: "block_slice", slice_id: "S01", reason: "..." }`). The parent agent reviews and commits it via its own DB tools.

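A sketch of what that intention contract could look like in TypeScript. The variant and field names follow the example above; the `db` facade and the `commitIntentions` helper are hypothetical.

```ts
// Hypothetical intention contract; field names follow the example above.
type SubagentIntention =
  | { intended_action: 'block_slice'; slice_id: string; reason: string }
  | { intended_action: 'complete_task'; task_id: string; evidence: string };

// Parent-side review loop (sketch): the parent, not the subagent, performs the write.
async function commitIntentions(
  intentions: SubagentIntention[],
  db: { blockSlice(id: string, reason: string): Promise<void>; completeTask(id: string): Promise<void> },
) {
  for (const intent of intentions) {
    // A review/policy gate would sit here before anything is committed.
    if (intent.intended_action === 'block_slice') {
      await db.blockSlice(intent.slice_id, intent.reason);
    } else {
      await db.completeTask(intent.task_id);
    }
  }
}
```
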
**Layer 4 — Shared WAL for read-your-own-writes consistency (future)**

When the UDA runs subagents in the same process (not a spawned CLI), it can share the DB connection. This enables the subagent to read what the parent just wrote in the same transaction. It requires the subagent to run as a headless RPC client, not a spawned CLI.

### Recommendation

**Keep the current constraint for spawned-CLI subagents.** The 4-tool limit is a security boundary, not a limitation to be fixed.

**Add a new subagent mode** — `dispatch.work({ mode: 'rpc' })` — where the subagent runs as an RPC client in the same process, gaining access to all SF tools. This is the headless equivalent of the subagent tool. Use it for internal SF workflows (e.g., "dispatch a review subagent that calls `complete-task`").

---

## 6. Naming — The Mental Model

The current names reflect implementation history, not user intent. Here is what they should be:

### Current → Proposed

| Current | Problem | Proposed | Rationale |
|---------|---------|----------|-----------|
| `subagent` tool | "subagent" implies a lesser agent, not a dispatch primitive | `dispatch` tool (in TUI) | The tool *is* the dispatch API surface |
| `parallel-orchestrator` | "orchestrator" is vague; doesn't convey worktree isolation | `worktree-pool` or `worktree-scheduler` | Conveys the resource model |
| `slice-parallel-orchestrator` | Duplicate of above | Merge into `worktree-pool` | See section 2 |
| `UOK kernel` | "kernel" implies OS-level; "UOK" is jargon | `autonomous-runtime`, or keep `UOK` if we accept the acronym | "UOK" means "unit-of-work kernel" internally; acceptable if documented |
| `MessageBus` | Generic; doesn't convey durability | Keep `MessageBus` | It is a bus pattern, and the name is accurate |
| `Cmux` | "cmux" is an implementation detail of terminal multiplexing | `surface-grid` | User-facing concept: "show agents in a grid" |

### The Unified Naming Hierarchy

```
dispatch — The high-level API and TUI tool name
├── work()      — Run a single unit (milestone/slice/task)
├── batch()     — Run multiple units in parallel (worktree pool)
├── chain()     — Run units sequentially, passing output
├── debate()    — Run units as adversarial roles
└── subscribe() — Listen to dispatch events

worktree-pool      — The backend that manages worktree lifecycle
autonomous-runtime — The PDD-gated execution loop (UOK kernel)
MessageBus         — Durable inter-agent messaging
```

---

## 7. Implementation Priority

This is a large refactor. The work should be sequenced to avoid breaking the current system while building the new one underneath.

### Phase 1 — Foundation (Weeks 1-3)

**Goal**: Establish the UDA backbone without changing existing behavior.

| Task | Why | Risk |
|------|-----|------|
| Extract a minimal `dispatch-worktree` module from `parallel-orchestrator.js` that just manages worktree lifecycle (create/remove/heartbeat) | Worktree management is the most isolated piece and the easiest to extract first | Low |
| Add MessageBus subscriptions to `dispatch-worktree` for worker status (replacing session status file polling) | MessageBus already exists; this just redirects the existing file-based IPC | Low |
| Create a `dispatch-chain` module that takes an array of `{ unit, afterId }` and runs them sequentially, passing output | Reuses the worktree pool; no new parallelism semantics | Low |
| **Do NOT change the subagent tool or parallel-orchestrator yet** | These must keep working while the foundation is laid | — |

### Phase 2 — Merge (Weeks 4-6)

**Goal**: Eliminate duplication; keep external behavior identical.

| Task | Why | Risk |
|------|-----|------|
| Merge `slice-parallel-orchestrator.js` into `dispatch-worktree` behind a `scope: 'slice'` parameter | 90% code duplication; this is a pure refactor | Medium |
| Replace `parallel-orchestrator`'s file-based IPC with MessageBus subscriptions | Changes the coordination mechanism but not the external API | Medium |
| Add `dispatch-batch()` that calls `dispatch-worktree` for N units | Reuses the same worktree pool; just adds the batch interface | Low |
| Verify all existing parallel orchestrator tests still pass | Regression protection | Low |

### Phase 3 — Subagent RPC Mode (Weeks 7-8)

**Goal**: The subagent gains headless RPC access without spawning the CLI.

| Task | Why | Risk |
|------|-----|------|
| Add `dispatch.rpc()` — spawn a headless RPC client (not a CLI) for a subagent | The 4-tool limitation goes away when the subagent is an RPC client | Medium |
| Wire `subagent({ mode: 'rpc' })` to use `dispatch.rpc()` | The subagent keeps its 4-mode interface; the implementation changes | Medium |
| Ensure subagent RPC mode cannot access tools the parent mode doesn't permit | The security boundary must be preserved | Medium |

### Phase 4 — UOK as Execution Mode (Weeks 9-10)

**Goal**: The UOK kernel becomes a dispatch execution mode, not a separate process.

| Task | Why | Risk |
|------|-----|------|
| Refactor `runAutoLoopWithUok` to be `dispatch.autonomous()` — a UDA execution mode | The autonomous loop becomes a configuration of dispatch, not a separate entry point | Medium |
| `sf headless autonomous` calls `dispatch.batch()` with the UOK runtime per slot | The headless binary becomes a thin launcher for the dispatch service | Medium |
| Remove the notion of the "UOK kernel" as a separate coordination entity | The kernel is an execution context; coordination is dispatch's job | Medium |

### Phase 5 — Cmux Decoupling (Week 11)

**Goal**: Cmux becomes a MessageBus subscriber, not a dispatch-aware component.

| Task | Why | Risk |
|------|-----|------|
| Make Cmux grid layout creation driven by MessageBus events, not by dispatch calling Cmux directly | Dispatch should not know about the terminal surface implementation | Low |
| Remove `cmuxSplitsEnabled` from the subagent tool | This is the concrete coupling point — dispatch knows about Cmux grid layouts | Low |

### Phase 6 — Naming Cleanup (Week 12)

**Goal**: Rename things to match the mental model once the refactor is stable.

| Task | Why | Risk |
|------|-----|------|
| Rename the `subagent` tool to `dispatch` in the TUI (keep `subagent` as an alias) | User-facing naming should match the mental model | Low |
| Rename the `parallel-orchestrator` file to `worktree-pool.js` | Internal naming | Low |
| Document the architecture in `ARCHITECTURE.md` | The current dispatch docs are scattered | Low |

---

## Summary

The 5 dispatch mechanisms + 1 message bus represent 3 genuinely different needs (the UOK autonomous loop, worktree-based isolation, durable inter-agent messaging) and 3 duplications (parallel-orchestrator + slice-parallel-orchestrator; subagent parallel mode + parallel-orchestrator; Cmux tight coupling). The root cause is that dispatch, orchestration, and coordination evolved separately rather than being designed as layers of one system.

**The plan is to:**

1. Merge `parallel-orchestrator` + `slice-parallel-orchestrator` into a single `WorktreePool`
2. Make the subagent an RPC client of a unified `Dispatch` service, not a spawned CLI
3. Make UOK an execution *mode* of the dispatch service, not a separate process
4. Make MessageBus the event backbone, replacing all file-based IPC
5. Decouple Cmux from dispatch entirely (it subscribes to MessageBus)
6. Sequence the refactor so existing behavior is preserved at each step

---

`docs/plans/DISPATCH_ORCHESTRATION_PLAN.md` (new file, +435 lines)

# Dispatch/Orchestration Architecture — Consolidation Plan

**Author:** Research synthesis
**Date:** 2026-05-08
**Status:** Draft — for review and promotion

---

## 1. Root Cause Diagnosis

The 5 dispatch mechanisms + 1 message bus grew to fill genuine gaps at different stages, but the structural symptom is a **missing abstraction layer**: there is no unified concept that separates "what to run" (controller) from "how to run" (mechanism).

### Timeline of divergence

| Era | Mechanism | Gap filled |
|-----|-----------|-----------|
| Early SF | `subagent tool` (`extensions/subagent/index.js`) | Ad-hoc delegation: "run this agent for this task" from within a session |
| Parallel work | `parallel-orchestrator.js` | "run milestone X in a worktree, independently" — process isolation via `spawn('sf headless')` |
| Slice-level work | `slice-parallel-orchestrator.js` | Same as above but at slice granularity — **~80% copy-paste of parallel-orchestrator** |
| Autonomous loop | `UOK kernel` (`uok/kernel.js`) | "run the full PDD loop continuously, gated by confidence/risk" |
| Multi-agent messaging | `MessageBus` (`uok/message-bus.js`) | "agents need to communicate across turns/sessions" (Letta-style) |
| Surface multiplexing | `Cmux` (`cmux/index.js`) | "TUI needs multiple visible surfaces for parallel agents" |

### Three structural problems

**1. Single-process thinking drove process-per-unit.**
SF was originally a single-agent CLI. When parallelism was needed, the natural answer was `spawn('sf headless')` — a new OS process per milestone. This is correct for filesystem isolation but requires bolted-on coordination (SQLite WAL, file-based IPC, session status polling).

**2. The UOK kernel was designed as a single-agent loop.**
It runs inside a headless process and manages one autonomous run. It does not know about sibling workers spawned by parallel-orchestrator, does not coordinate with it, and has no model for "I am one of N workers running concurrently."

**3. MessageBus was designed for persistent agents SF doesn't have yet.**
The Letta-style inbox model is architecturally correct but premature — you need durable named agents before durable named inboxes matter. Today MessageBus is used for UOK internal observer chains but not for real multi-agent coordination between workers and the coordinator.

### The concretion

The proliferation is a **missing abstraction problem** at three levels:

1. **No unified "dispatch context"** — subagent, parallel-orchestrator, and UOK each create their own notion of "what am I running and with what environment"
2. **No shared dispatch registry** — no single place that tracks "what is currently running" across all parallelism dimensions
3. **No first-class "work unit" concept** — milestone, slice, and task are different tables with different lock semantics, not different states of the same work unit

---

## 2. What Should Stay vs Merge

### Stay (genuinely different needs)

| Mechanism | Reason to Keep |
|-----------|---------------|
| **UOK kernel** | The autonomous loop engine implementing the PDD gate model (confidence/risk/reversibility/blast-radius/cost). This is the **controller** — it decides what to run, not how. Removing it means rewriting autonomous mode from scratch. |
| **MessageBus** | A SQLite-backed durable inbox is the right model for cross-turn coordination. This is genuine infrastructure. However: it should be **repurposed** — it serves UOK diagnostics today and should serve agent handoff when persistent agents land (v3.1 per BUILD_PLAN.md). |
| **Cmux** | Terminal UI surface multiplexing. Belongs in `pi-tui`, not the dispatch layer. Should be **decoupled** from dispatch entirely — the parallel orchestrator should not know about Cmux grid layouts. |
| **Execution graph** (`uok/execution-graph.js`) | File-conflict DAG that computes which milestones/slices can run in parallel. This is the **constraint solver** — it stays separate from the dispatch mechanism. |

### Merge (duplication without functional difference)

| Duplicated | Problem | Resolution |
|------------|---------|-----------|
| `parallel-orchestrator.js` + `slice-parallel-orchestrator.js` | ~80% identical. Only diffs: scope (milestone vs slice), lock env vars, status file naming. The slice orchestrator additionally calls `slice-parallel-conflict.ts` for file overlap filtering. | **Merge into a single `WorktreeOrchestrator`** parameterized by `{ scope: 'milestone' \| 'slice', milestoneId, sliceId? }`. Conflict filtering already lives in `slice-parallel-conflict.ts` — call it from the merged class. |
| **subagent tool's parallel/debate/chain modes** vs **parallel-orchestrator's milestone workers** | Both implement "run multiple things at the same time." The subagent tool does in-process `Promise.all` over spawned `sf` CLIs; parallel-orchestrator does the same with worktrees. Different IPC mechanisms, different isolation models. | **The subagent keeps single-agent dispatch** (its core value). For multi-agent work, the subagent should delegate to the unified orchestrator rather than managing its own concurrency pool. |

### Refactor (same need, wrong implementation)

| Current | Issue | Refactor |
|---------|-------|----------|
| **subagent spawning `sf` CLI** | A thin wrapper spawning a full binary. Only 4 tools are registered, as a workaround for not having a proper dispatch API. | The subagent should use a **headless RPC client** directly, not spawn `sf`. This enables calling any SF tool, not just the 4 registered ones. |
| **parallel/slice orchestrator using SQLite WAL + file IPC** | Hand-rolled IPC via session status files + signal files. "Poll the filesystem" coordination — correct but fragile. | Replace with **MessageBus-based coordination**. Workers publish status to MessageBus; the coordinator subscribes. |
| **UOK kernel owning the autonomous loop** | Runs inside a headless process. When the parallel orchestrator spawns `sf headless autonomous`, each worker has its own UOK kernel with no coordination between kernels. | The UOK kernel should be the **runtime environment** for any autonomous dispatch, not a process-bound concept. |

---

## 3. Streamlined Architecture

### Three-tier dispatch model

```
┌──────────────────────────────────────────────────────────────────┐
│ UOK Kernel (controller)                                          │
│ Decides WHAT to run next; enforces PDD gates, policy, parity     │
│ - Phase machine: Discuss → Plan → Execute → Merge → Complete     │
│ - Calls WorktreeOrchestrator.dispatch() to execute               │
└────────────────────────────┬─────────────────────────────────────┘
                             │ DispatchEnvelope { scope, unitId, ... }
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│ WorktreeOrchestrator (mechanism)                                 │
│ Decides HOW to run: worktree lifecycle, process registry, budget │
│ - Worktree pool (git worktree per milestone/slice)               │
│ - Process registry (child_process per worker)                    │
│ - Cost accumulator (NDJSON parsing from worker stdout)           │
│ - File-intent tracker (parallel-intent.js)                       │
│ - MessageBus integration per worker (AgentInbox)                 │
└────────────────────────────┬─────────────────────────────────────┘
                             │ spawns
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│ Worker (execution unit)                                          │
│ `sf headless --json autonomous` in a worktree                    │
│ - Owns SQLite WAL connection to project DB                       │
│ - Has AgentInbox for MessageBus delivery                         │
│ - Emits NDJSON events consumed by WorktreeOrchestrator           │
└──────────────────────────────────────────────────────────────────┘
```

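For concreteness, a sketch of the envelope passed between the tiers. Only `scope` and `unitId` appear in the diagram; the remaining fields are assumptions drawn from the orchestrator's listed responsibilities.

```ts
// Hypothetical DispatchEnvelope: `scope` and `unitId` come from the diagram;
// the other fields are illustrative assumptions.
interface DispatchEnvelope {
  scope: 'milestone' | 'slice';
  unitId: string;
  worktreePath?: string;   // assigned by WorktreeOrchestrator
  budgetCeiling?: number;  // enforced by the cost accumulator
  inboxId?: string;        // AgentInbox name for MessageBus delivery
}
```
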
### How existing components map

| Component | Role in Unified Architecture |
|-----------|------------------------------|
| **subagent tool** | Thin UDA client in the TUI. Single-agent dispatch with full SF tool access. Keeps the 4-mode interface (single/parallel/debate/chain) but implemented via UDA, not a spawned CLI. |
| **parallel-orchestrator + slice-parallel** | Merge into **WorktreeOrchestrator** — a UDA backend managing worktree lifecycle and multi-slot execution. |
| **UOK kernel** | Becomes the **autonomous-runtime** — a UDA execution mode wrapping any dispatch with the PDD gate model. `dispatch.work({ unit, runControl: 'autonomous' })` automatically uses the autonomous-runtime. |
| **MessageBus** | Becomes the **UDA event/logging backbone**. All dispatch events (start, end, error, cost) are published to MessageBus. File-based IPC is replaced by MessageBus subscriptions. |
| **Cmux** | **Decoupled entirely**. Cmux listens to MessageBus for dispatch events and renders grid layouts. The dispatch layer does not know about Cmux. |

### WorktreeOrchestrator interface (proposed)

```ts
// File: src/resources/extensions/sf/worktree-orchestrator.js

interface DispatchOptions {
  scope: 'milestone' | 'slice';
  milestoneId: string;
  sliceId?: string;
  basePath: string;
  maxWorkers?: number;
  budgetCeiling?: number;
  workerTimeoutMs?: number;
  shellWrapper?: string[];
  useExecutionGraph?: boolean;
}

declare class WorktreeOrchestrator {
  // Returns eligible units filtered by execution-graph conflicts
  prepare(opts: DispatchOptions): Promise<PrepareResult>;

  // Start workers for given unit IDs
  start(ids: string[], opts: DispatchOptions): Promise<StartResult>;

  // Stop all or specific workers
  stop(ids?: string[]): Promise<void>;

  // Pause/resume workers via MessageBus
  pause(ids?: string[]): void;
  resume(ids?: string[]): void;

  // Read current state (for dashboard)
  getStatus(): DispatchStatus;

  // Shared MessageBus instance
  readonly bus: MessageBus;

  // Budget tracking
  totalCost(): number;
  isBudgetExceeded(): boolean;
}
```

### How the UOK kernel uses WorktreeOrchestrator

Today, `uok/kernel.js` runs the autonomous loop and calls into tools that spawn agents. The parallel orchestrator is started separately by the TUI dashboard or the headless command. After unification:

1. The UOK kernel initializes a `WorktreeOrchestrator` at autonomous loop start
2. UOK calls `orchestrator.start(eligibleMilestoneIds)` for parallel milestones
3. Workers emit NDJSON events → the orchestrator parses cost → updates the budget
4. Workers emit completion → the UOK kernel processes post-unit staging
5. Workers receive messages via their `AgentInbox` (MessageBus integration)
6. `orchestrator.stop()` is called on autonomous loop exit

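A sketch of that kernel-side wiring against the proposed interface above. The `eligibleIds` and `active` fields are assumed shapes for `PrepareResult` and `DispatchStatus`, and the polling loop is illustrative only.

```ts
// Hypothetical kernel-side wiring; field names on PrepareResult/DispatchStatus are assumed.
async function runAutonomousLoop(orchestrator: WorktreeOrchestrator, opts: DispatchOptions) {
  const { eligibleIds } = await orchestrator.prepare(opts); // execution-graph filtered
  await orchestrator.start(eligibleIds, opts);
  try {
    // Budget is enforced continuously as NDJSON cost events arrive from workers.
    while (orchestrator.getStatus().active.length > 0) {
      if (orchestrator.isBudgetExceeded()) {
        await orchestrator.stop(); // hard ceiling
        break;
      }
      await new Promise((resolve) => setTimeout(resolve, 1_000));
    }
  } finally {
    await orchestrator.stop(); // always release worktrees on loop exit
  }
}
```
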
---

## 4. Multi-Dimensional Parallelism

### Current axes of parallelism

| Axis | Mechanism | Status |
|------|-----------|--------|
| **Inter-project** | Multiple `sf` invocations | ✅ not SF's concern |
| **Inter-milestone** | parallel-orchestrator + worktrees | ✅ implemented |
| **Inter-slice** | slice-parallel-orchestrator + worktrees | ✅ implemented |
| **Inter-task** (in-process) | subagent `parallel` mode | ✅ `mapWithConcurrencyLimit` |
| **Inter-agent** (debate/chain) | subagent `debate`/`chain` mode | ✅ implemented |
| **Terminal-level** | Cmux grid layout for parallel agents | ✅ implemented |

### What "true concurrency" means

The current architecture already achieves **true process-level concurrency** via worktrees and separate `sf headless` processes. The shared SQLite WAL allows concurrent readers with a single writer.

**What is missing is coordinated dispatch, not more parallelism axes:**

- The execution graph (`uok/execution-graph.js`) already computes file-conflict relationships
- `selectConflictFreeBatch` picks a conflict-free subset for parallel dispatch
- But this is only wired into parallel-orchestrator, not into slice-parallel or the UOK autonomous loop's dispatch decisions

### Proposed coordination model

```
Execution Graph (file-conflict DAG)
    │
    ├── selectConflictFreeBatch() ──► WorktreeOrchestrator.start()
    │                                   Workers run in parallel
    │                                   Each worker has an AgentInbox
    │
UOK kernel
    │
    ├── reads unit readiness from DB
    ├── calls WorktreeOrchestrator.start(milestoneIds)
    └── calls WorktreeOrchestrator.start(sliceIds) for intra-milestone parallelism
```

**Debate mode** (subagent tool): runs multiple agents within a single process using `mapWithConcurrencyLimit`. This is **not** true process-level parallelism, but it is correct for LLM-based debate, where shared context and a single conversation transcript are needed.

**Chain mode**: purely sequential — each step's output feeds into the next step's prompt.

### Concurrency limits

| Level | Max Concurrent |
|-------|---------------|
| Project (milestones) | `parallel.max_workers` config (default: CPU cores / 2) |
| Milestone (slices) | `parallel.slice_max_workers` config (default: 2) |
| Subagent parallel tasks | `MAX_CONCURRENCY = 4` (hardcoded in `subagent/index.js`) |

---

## 5. DB Access from Subagents

### The current constraint is intentional

The subagent tool **cannot** call `complete-task` or `plan-slice` because:

1. Only 4 tools are registered in the subagent extension manifest
2. The subagent is meant to be a **task executor**, not a **state mutator**

This is **correct security isolation**, not a bug. A spawned `sf` CLI with full SF tool access running in a user-specified `cwd` is a significant attack surface.

### The right model: two-tier DB access

```
Coordinator (UOK kernel) ──► project .sf/sf.db (WAL mode)
                               milestone/slice state
                               task execution ledger

Subagent (sf process)    ──► ~/.sf/sf.db (global)
                               memories, preferences
                               agent-level state
                             ✗ project .sf/sf.db (write)
```

The subagent can **read** project state via **prompt injection** (system context assembly already does this). It writes only to global state.

### If a subagent needs to record a finding

1. The subagent writes to its output (stdout/file)
2. The coordinator reads and processes the output
3. The coordinator calls DB tools

This is the Letta pattern — agents return results, the orchestrator decides what to persist.

### Architectural backing for the constraint

```ts
// In the subagent tool — formalize the access contract
const SUBAGENT_DB_ACCESS = {
  read: ['project_context'],   // via prompt injection only
  write: ['~/.sf/sf.db'],      // global state only
  prohibited: ['project .sf/sf.db write operations']
};
```

The extension manifest's `tools[]` array currently enforces this by omission. A more explicit model would declare the access contract formally, making it auditable.

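For illustration, a hypothetical enforcement helper built on that contract; nothing like `assertSubagentWrite` exists today, and the check shown is only a sketch of what "formal and auditable" could mean.

```ts
// Hypothetical enforcement helper built on the SUBAGENT_DB_ACCESS contract above.
// Intended call site: just before any DB write attempted from a subagent context.
function assertSubagentWrite(targetDbPath: string): void {
  const allowed: string[] = SUBAGENT_DB_ACCESS.write;
  if (!allowed.includes(targetDbPath)) {
    // Reject project-DB writes (and anything else outside the grant) with an
    // auditable error rather than a silent failure.
    throw new Error(`subagent write to '${targetDbPath}' is outside the access contract`);
  }
}

// Example: assertSubagentWrite('~/.sf/sf.db') passes;
// assertSubagentWrite('.sf/sf.db') throws.
```
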
### Future: RPC-mode subagent

**Keep the current constraint for spawned-CLI subagents.**

**Add a new subagent mode** — `dispatch.work({ mode: 'rpc' })` — where the subagent runs as an RPC client in the same process, gaining access to all SF tools. This is the headless equivalent of the subagent tool. Use it for internal SF workflows (e.g., "dispatch a review subagent that calls `complete-task`").

---

## 6. Naming — The Mental Model

### Current → Proposed

| Current | Problem | Proposed | Rationale |
|---------|---------|----------|-----------|
| `subagent` tool | "subagent" implies a lesser agent | `dispatch` tool (in TUI) | The tool *is* the dispatch API surface |
| `parallel-orchestrator` | "orchestrator" is vague; doesn't convey worktree isolation | `milestone-dispatcher` | Conveys scope + role |
| `slice-parallel-orchestrator` | Duplicate of above | `slice-dispatcher` (merged into WorktreeOrchestrator) | See section 2 |
| `WorktreeOrchestrator` (new) | — | `worktree-orchestrator` | Backend that manages worktree lifecycle |
| `UOK kernel` | "kernel" implies OS-level; "UOK" is jargon | `autonomous-runtime` | PDD-gated execution loop |
| `MessageBus` | Generic | Keep as-is | It *is* a bus pattern. Keep it. |
| `Cmux` | "cmux" is an implementation detail | `surface-grid` | User-facing: "show agents in a grid" |

### The mental model hierarchy

```
dispatch — The high-level API and TUI tool name
├── work()      — Run a single unit (milestone/slice/task)
├── batch()     — Run multiple units in parallel (worktree pool)
├── chain()     — Run units sequentially, passing output
├── debate()    — Run units as adversarial roles
└── subscribe() — Listen to dispatch events (MessageBus)

worktree-orchestrator — Backend: worktree lifecycle + process registry
autonomous-runtime    — PDD gate model (UOK kernel, renamed)
MessageBus            — Durable inter-agent messaging (keeps name)
surface-grid          — Cmux decoupled from dispatch
```

**Why "kernel" is the right metaphor for UOK (keep it internally):**

A kernel manages resources and enforces policy; it doesn't do the work itself. The UOK kernel evaluates confidence/risk gates, manages parity reporting, and decides when to proceed — but it delegates execution to WorktreeOrchestrator. The name fits.

---

## 7. Implementation Priority

### Phase 1 — Merge the two orchestrators (Lowest risk, highest clarity)

**1.1 — Extract `WorktreeOrchestrator` from parallel-orchestrator + slice-parallel**

Create a new `dispatch-layer.js` that merges the ~80% shared logic, parameterized by `{ scope: 'milestone' | 'slice' }`. The slice orchestrator's conflict-filtering logic (`filterConflictingSlices` in `slice-parallel-conflict.ts`) already lives separately — call it from the merged class.

**Files touched:**
- New: `src/resources/extensions/sf/dispatch-layer.js`
- Refactor: `parallel-orchestrator.js` → thin wrapper calling dispatch-layer
- Refactor: `slice-parallel-orchestrator.js` → thin wrapper calling dispatch-layer

**Test:** Both the `/parallel` command and slice-level parallelism continue to work identically. The dashboard continues to show correct worker states.

**Effort:** ~1 week. Pure refactor, no behavior change.

### Phase 2 — Wire MessageBus into WorktreeOrchestrator

**2.1 — Add an AgentInbox to each worker**

Every `sf headless` worker opens a `MessageBus` inbox named after its milestone/slice ID. The coordinator can send messages to workers (pause, resume, report status).

**2.2 — Replace file-based IPC with MessageBus**

Replace `session-status-io.js` polling and `sendSignal` file-based signals with MessageBus `send()`. File-based signals remain as a crash-recovery fallback.

**Files touched:**
- `dispatch-layer.js` (new)
- `session-status-io.js` (add a MessageBus-backed path)
- Worker bootstrap in both orchestrators

**Test:** Workers respond to coordinator pause/resume messages delivered via MessageBus.

**Effort:** ~3 days.

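A worker-side sketch of that subscription. The `inbox` and `worker` shapes here are assumptions for illustration, not the existing MessageBus API.

```ts
// Worker-side sketch: react to coordinator control messages from the inbox.
// The inbox/worker interfaces are assumed shapes, not the real API.
type ControlMessage = { kind: 'pause' } | { kind: 'resume' } | { kind: 'report_status' };

function attachControlInbox(
  inbox: { onMessage(handler: (msg: ControlMessage) => void): void },
  worker: { pause(): void; resume(): void; reportStatus(): void },
) {
  inbox.onMessage((msg) => {
    switch (msg.kind) {
      case 'pause': worker.pause(); break;           // stop starting new turns
      case 'resume': worker.resume(); break;         // continue the turn loop
      case 'report_status': worker.reportStatus();   // publish status back on the bus
    }
  });
}
```
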
### Phase 3 — Subagent RPC Mode

**3.1 — Add `dispatch.rpc()` — spawn a headless RPC client (not a CLI)**

The 4-tool limitation goes away when the subagent is an RPC client. The subagent keeps its 4-mode interface; the implementation changes.

**3.2 — Ensure subagent RPC mode cannot access tools the parent mode doesn't permit**

The security boundary must be preserved. This is where the access contract from section 5 gets enforced.

**Files touched:**
- `extensions/subagent/index.js` (add RPC mode path)
- `extensions/subagent/rpc-client.js` (new)

**Test:** A subagent with `mode: 'rpc'` can call `complete-task` and other SF tools.

**Effort:** ~1 week.

### Phase 4 — UOK Kernel Adopts WorktreeOrchestrator

**4.1 — Replace direct parallel-orchestrator calls with WorktreeOrchestrator**

The autonomous loop's parallel dispatch path (`analyzeParallelEligibility` → `startParallel`) goes through WorktreeOrchestrator instead of calling parallel-orchestrator directly.

**4.2 — UOK reads worker status from WorktreeOrchestrator**

The dashboard refresh reads from `orchestrator.getStatus()` instead of directly from parallel-orchestrator's state.

**Files touched:**
- `uok/kernel.js` (import WorktreeOrchestrator)
- `parallel-orchestrator.js` (becomes a wrapper or is removed)

**Test:** Autonomous mode with parallel milestones works identically to current behavior.

**Effort:** ~3 days.

### Phase 5 — Cmux Decoupling

**5.1 — Make Cmux grid layout driven by MessageBus events**

Dispatch should not call Cmux directly. Cmux subscribes to MessageBus dispatch events and creates/destroys grid surfaces accordingly.

**5.2 — Remove `cmuxSplitsEnabled` from the subagent tool**

This is the concrete coupling point — dispatch knows about Cmux grid layouts. Remove it; let Cmux manage its own surface allocation based on dispatch events.

**Files touched:**
- `cmux/index.js` (add a MessageBus subscriber)
- `extensions/subagent/index.js` (remove cmuxSplitsEnabled)

**Effort:** ~2 days.

### Phase 6 — Naming Cleanup

**6.1 — Rename `dispatch-layer.js` → `worktree-orchestrator.js`**
**6.2 — Rename the parallel-orchestrator wrapper → `milestone-dispatcher.js`**
**6.3 — Rename the slice-parallel-orchestrator wrapper → `slice-dispatcher.js`**
**6.4 — Update all import references**

**Effort:** ~1 day.

### Phase 7 — Document the Architecture

**7.1 — Update `ARCHITECTURE.md`**

Add a section on the unified dispatch architecture. The current dispatch docs are scattered across inline comments and session-status-io.

**Effort:** ~1 day.

---

## Summary

The 5 dispatch mechanisms + 1 message bus represent **3 genuinely different needs** (the UOK autonomous loop, worktree-based isolation, durable inter-agent messaging) and **3 duplications** (parallel-orchestrator + slice-parallel-orchestrator; subagent parallel mode + parallel-orchestrator; Cmux tight coupling).

**The plan:**

1. **Merge** `parallel-orchestrator` + `slice-parallel-orchestrator` into `WorktreeOrchestrator`
2. **Wire** MessageBus into WorktreeOrchestrator — workers become reachable via durable messaging
3. **UOK kernel** becomes the controller that calls WorktreeOrchestrator, not a parallel system
4. **Subagent tool** stays separate — it's ad-hoc in-session delegation, with an optional RPC mode for internal workflows
5. **Cmux** becomes a MessageBus subscriber, decoupled from dispatch
6. **DB access model** is already correct: spawned subagents cannot write to the project DB; workers dispatched via WorktreeOrchestrator can

The `adversarial_partner`/`adversarial_combatant`/`adversarial_architect` fields already in the DB are **planning ceremony fields** (Letta-inspired), not dispatch mechanism fields. They belong in the PDD planning layer, not in the dispatch layer.

**Total effort estimate:** 3-4 weeks across 7 phases, sequenced to preserve existing behavior at each step.

---

`docs/plans/SCOPED_DELEGATION_TOKENS_DEFENSE.md` (new file, +226 lines)

# Defense of Scoped Delegation Tokens for Subagent DB Access

## Position

Subagents need bounded, auditable write access to milestone/slice/task state. The current hard wall—zero DB access for delegated agents—is an overcorrection that breaks real parallelism workflows without providing meaningful safety guarantees. Scoped delegation tokens are the correct middle ground: they give subagents just enough authority to record their findings without exposing the full SF tool surface or the planning authority that belongs to the parent agent.

---

## 1. Why the Hard Wall Breaks Real Use Cases

### The Verification Evidence Problem

Consider the canonical parallel verification pattern:

```
Parent dispatches 3 reviewer subagents in parallel:
  - Requirements Coverage reviewer   → checks requirement_coverage completeness
  - Cross-Slice Integration reviewer → checks API/interface consistency
  - Acceptance Criteria reviewer     → checks UAT coverage

Each subagent runs verification commands and produces a verdict.
```

Today, **those verdicts cannot be written to the DB by the subagents**. The evidence table (`verification_evidence`) is write-protected. The parent must:

1. Wait for all three subagents to complete
2. Receive their output strings
3. Re-parse the output
4. Issue `record_verification_evidence` calls itself

This destroys the parallelism benefit. The subagents ran concurrently, but the parent is now a synchronous bottleneck that has to re-verify what the subagents already verified. More critically: if the parent's context window evicts the subagent output before it can be recorded, the verification evidence is **permanently lost**—not because the subagent failed, but because the recording channel was blocked.

### The Blocker Discovery Problem

A scout subagent dispatched to explore a codebase dependency risk discovers that a slice is blocked by an upstream schema migration that hasn't happened yet. The parent needs this information to set `slice.status = "blocked"` and record the blocker in `slices` or `tasks.blocker_discovered`.

With the hard wall, the scout returns a string: `"Slice S02 blocked: depends on users_v2 migration which doesn't exist yet"`. The parent must then:

1. Parse this string (natural language, unreliable)
2. Issue its own DB update

The information existed in the subagent's context at the moment of discovery. The subagent has the correct identity (`milestone_id`, `slice_id`) in its prompt. Forcing the parent to re-interpose creates a brittle translation layer and makes the subagent's finding second-hand rather than authoritative.

### The Async/Hanging Problem

With the hard wall, the parent must remain alive for the entire duration of any subagent dispatch to receive and record findings. If the parent process is interrupted (user cancels, context overflow, crash), subagent findings in flight are lost. The subagent did the work; the recording failed because of process lifecycle, not because of any safety check.

Scoped tokens survive the subagent's process lifetime: the subagent writes to the DB directly, and the parent's only job is to synthesize the final outcome. A parent crash after subagent completion doesn't lose the subagent's recorded evidence.

---

## 2. How Scoped Tokens Work Concretely

### Token Anatomy

A scoped delegation token is not a raw SQL connection or an admin API key. It is a **bounded operation grant** with four components:

```
Token {
  scope: {
    milestone_id?: string,   // null = all accessible milestones
    slice_id?: string,       // null = all slices in milestone
    task_id?: string,        // null = all tasks in slice
  },
  operations: [
    "record_verification_evidence",  // append-only evidence table
    "update_task_status",            // set task status to: completed | blocked
    "append_milestone_evidence",     // append-only audit trail
    "append_slice_evidence",         // append-only audit trail
  ],
  parent_fingerprint: string,  // HMAC of parent envelope for audit
  expires_at: ISO8601,         // token TTL = subagent expected lifetime
}
```

### Issuance

Tokens are issued by the parent at dispatch time, embedded in the subagent's environment (via the existing `SF_PARENT_*` inheritance mechanism):

```
SF_DELEGATION_TOKEN="scope=milestone:S01/slice:S02/task:*;ops=record_verification_evidence,update_task_status,append_slice_evidence;fingerprint=abc123;exp=2026-05-08T12:30:00Z"
```

The token is **not** a secret. It is a structured grant that any process can inspect. Enforcement is done server-side in `sf-db.js`: every write operation validates the token's scope and operations list before executing.

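To show how little machinery the token needs, here is a hypothetical parser for this encoding. The `DelegationToken` shape mirrors the anatomy above; no such parser exists in the codebase today.

```ts
// Hypothetical parser for the SF_DELEGATION_TOKEN encoding shown above.
// Format: "scope=...;ops=a,b,c;fingerprint=...;exp=ISO8601"
interface DelegationToken {
  scope: string;            // e.g. "milestone:S01/slice:S02/task:*"
  operations: string[];
  parentFingerprint: string;
  expiresAt: string;
}

function parseDelegationToken(raw: string): DelegationToken {
  const fields = new Map(raw.split(';').map((part) => {
    const i = part.indexOf('=');
    return [part.slice(0, i), part.slice(i + 1)] as const;
  }));
  return {
    scope: fields.get('scope') ?? '',
    operations: (fields.get('ops') ?? '').split(',').filter(Boolean),
    parentFingerprint: fields.get('fingerprint') ?? '',
    expiresAt: fields.get('exp') ?? '',
  };
}
```
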
### Operation Validation (in sf-db.js)

```javascript
// Before any write in a subagent context:
function validateDelegationToken(token, operation, scope) {
  if (!token) return { ok: false, reason: "No delegation token" };
  if (new Date(token.expires_at).getTime() < Date.now()) {
    return { ok: false, reason: "Token expired" };
  }
  if (!token.operations.includes(operation)) {
    return { ok: false, reason: `Operation ${operation} not in token grant` };
  }
  // Scope check: a null field in the grant means "any"; otherwise the
  // token's milestone/slice/task must match the write target exactly.
  for (const key of ["milestone_id", "slice_id", "task_id"]) {
    if (token.scope[key] != null && token.scope[key] !== scope[key]) {
      return { ok: false, reason: `Scope mismatch on ${key}` };
    }
  }
  return { ok: true };
}
```

### Attack Surface vs. Current Model

| Surface | Current (Hard Wall) | Scoped Tokens |
|---|---|---|
| Subagent can write to milestones table | ❌ No | ❌ No |
| Subagent can write to slices table | ❌ No | ❌ No |
| Subagent can write to tasks table | ❌ No | Only via explicit `update_task_status` grant |
| Subagent can write verification_evidence | ❌ No | ✅ Via `record_verification_evidence` |
| Subagent can read full DB | ❌ No | ❌ No (read path unchanged) |
| Subagent can access SF tools (bash, edit, etc.) | ✅ Via inheritance envelope | ✅ Via inheritance envelope |
| Subagent can bypass permission profile | ❌ Blocked by inheritance | ❌ Blocked by inheritance |
| Subagent can use blocked providers | ❌ Blocked by inheritance | ❌ Blocked by inheritance |

The attack surface is **strictly smaller than the parent's surface**. The subagent cannot do anything the parent couldn't do—but it can record its own findings directly. Compare to a system where subagents get full tool access: scoped tokens limit what can be written to the DB even if the subagent is compromised.

### What Happens Without a Token

If a subagent process is somehow tricked into making a DB write without a valid token (e.g., a bug in the dispatch layer, or a man-in-the-middle on the IPC channel), the write is rejected. The rejection is logged with the parent fingerprint for audit. This is a better failure mode than the current hard wall, which fails **silently and completely**—the evidence is lost with no record that it was even attempted.

---

|
||||
|
||||
## 3. Response to "Subagents Shouldn't Mutate Project State"
|
||||
|
||||
This objection conflates two distinct concepts: **authority** and **causality**.
|
||||
|
||||
### Authority Is Not the Issue
|
||||
|
||||
No one is proposing that a subagent should be able to reprioritize milestones, change slice goals, or override task verification contracts. Scoped tokens do not grant planning authority. The parent retains full control over:
|
||||
- Milestone sequencing and status transitions
|
||||
- Slice goal changes or deletion
|
||||
- Task dependencies, blockers, and escalation decisions
|
||||
|
||||
A subagent recording `verification_evidence` is not making a planning decision. It is recording a **factual observation**: "I ran command X on task T, it exited with code Y, my verdict is Z." The parent then synthesizes these observations into planning decisions. The subagent cannot set a milestone to complete—that requires the parent's judgment.
|
||||
|
||||
### Causality Is the Real Issue

When a researcher subagent discovers a blocker, the discovery happened **inside the subagent's context**. Forcing the parent to re-discover or re-interpret the finding introduces a lossy translation step:

1. Subagent discovers: "Slice S02 blocked by missing `users_v2` table"
2. Parent receives: natural language string
3. Parent interprets: "this means I should set slice S02 status to blocked"
4. Parent writes: `UPDATE slices SET status = 'blocked' WHERE id = 'S02'`

Step 3 is where information is lost or distorted. The subagent had precise context (the exact SQL error, the migration file that should exist, the timeline). The parent has a string summary. Scoped tokens let step 4 happen directly from step 1, preserving precision.

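Under scoped tokens, steps 1 and 4 collapse into direct calls from the subagent's own context. A sketch, assuming a hypothetical `sfDelegated` client minted from the token (the client name and exact error text are illustrative):

```js
// Hypothetical delegated client; names are illustrative, not existing SF APIs.
// The scout records the blocker with full precision at the moment of discovery:
await sfDelegated.updateTaskStatus({
  taskId: "T01",
  status: "blocked",
  blockerDiscovered: true,
});
await sfDelegated.appendSliceEvidence({
  milestoneId: "M01",
  sliceId: "S02",
  evidenceType: "blocker",
  content: "Missing `users_v2` table: SQLITE_ERROR 'no such table: users_v2'; expected migration not present",
});
```
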
### The "Pure Worker" Model Is Internally Inconsistent
|
||||
|
||||
The opposing view says subagents should return output to the parent for the parent to record. But consider:
|
||||
|
||||
- A `record_verification_evidence` call **is** the subagent returning its output—just to the DB instead of to a string in the parent context.
|
||||
- The "pure worker" model works for compute tasks (subagent computes, parent uses result). It breaks for **observational tasks** (subagent observes, finding is the observation itself, not a computation on it).
|
||||
- Verification evidence is observational. The subagent's verdict **is** the record. Having the parent transcribe it is redundant and lossy.
|
||||
|
||||
### Safety of the Mutation Is What Matters, Not Its Existence

The relevant question is not "can subagents mutate state?" but "can they mutate state unsafely?" Scoped tokens are designed so the answer to the second question is no:

- **Bounded scope**: the token constrains writes to specific (milestone, slice, task) tuples
- **Operation whitelist**: only append-only and specific status transitions are allowed
- **Append-only by default**: evidence tables are append-only (no UPDATE or DELETE)
- **Parent fingerprint audit**: every write is tagged with who dispatched it and under what constraints
- **TTL expiry**: tokens auto-expire so a stray subagent process can't write indefinitely

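For concreteness, a delegation token carrying these properties might look like the following; the field names are illustrative, not a settled schema:

```ts
// Illustrative token shape; field names are assumptions, not a settled schema.
interface DelegationToken {
  parent_fingerprint: string;  // audit: who dispatched this subagent
  scope: {
    milestone_id: string;      // e.g. "M01"
    slice_id?: string;         // e.g. "S01"
    task_id?: string;          // e.g. "T01"
  };
  operations: string[];        // whitelist, e.g. ["record_verification_evidence"]
  expires_at: string;          // ISO timestamp; TTL expiry
}
```
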
---

## 4. Minimum Surface Area

The minimum viable surface for scoped tokens covers three operational patterns that currently force synchronous parent re-interposition:

### A. `record_verification_evidence` (Append-only)

**Schema**: `verification_evidence(task_id, slice_id, milestone_id, command, exit_code, verdict, duration_ms, created_at)`

**Grant**: `scope.milestone_id = "M01", scope.slice_id = "S01", scope.task_id = "T01", operations = ["record_verification_evidence"]`

**Why minimum**: This is the highest-value, lowest-risk operation. It is strictly append-only (no UPDATE/DELETE). It records what the subagent already did. The parent cannot re-run the verification without re-executing the subagent's work. Evidence is the primary artifact of verification-phase subagents.

**Constraints enforced** (a sketch of the enforcement path follows this list):

- Only `INSERT` (no UPDATE or DELETE on evidence rows)
- Exit code and verdict are bounded enums
- Bounded by task identity in the grant

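A sketch of enforcing grant A, assuming a better-sqlite3-style handle on the project DB; the verdict enum values are assumptions, the columns follow the schema above:

```js
// Sketch only: enforces append-only + bounded-enum constraints for grant A.
// `db` is assumed to be a better-sqlite3-style handle; VERDICTS is an assumed enum.
const VERDICTS = new Set(["pass", "fail", "inconclusive"]);

function recordVerificationEvidence(db, token, row) {
  const check = validateDelegationToken(token, "record_verification_evidence", row);
  if (!check.ok) throw new Error(check.reason);
  if (!VERDICTS.has(row.verdict)) throw new Error(`Invalid verdict: ${row.verdict}`);
  if (!Number.isInteger(row.exit_code)) throw new Error("exit_code must be an integer");
  // INSERT only — there is deliberately no UPDATE or DELETE path for evidence rows.
  db.prepare(
    `INSERT INTO verification_evidence
       (task_id, slice_id, milestone_id, command, exit_code, verdict, duration_ms, created_at)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?)`
  ).run(row.task_id, row.slice_id, row.milestone_id, row.command,
        row.exit_code, row.verdict, row.duration_ms, new Date().toISOString());
}
```
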
### B. `update_task_status` (Status-only, narrow)

**Schema**: `tasks` table, `status` column only

**Allowed transitions**:

- `pending` → `completed` (subagent finished work)
- `pending` → `blocked` (subagent discovered a blocker; sets `blocker_discovered = 1`)

**Not allowed**: No direct transition to `pending`, no status flips on other tasks, no mutation of `title`, `goal`, `verification_type`, or any other column.

**Why minimum**: A worker subagent that completes a task should be able to mark it complete. A scout subagent that finds a blocker should be able to record it. These are the two status changes that subagents legitimately produce. All other status transitions (re-opening, escalating, deferring) remain parent-only.

**Constraints enforced** (a sketch of the transition guard follows this list):

- Only the `status` column writable (and `blocker_discovered` as a companion flag)
- Only the specific task in the grant scope
- No mutation of other columns

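The transition whitelist is small enough to encode directly; a sketch (function and table names are illustrative):

```ts
// Sketch of the grant-B transition guard; names are illustrative.
const ALLOWED_TRANSITIONS: Record<string, string[]> = {
  pending: ["completed", "blocked"], // the only subagent-legal transitions
};

function checkStatusTransition(current: string, next: string): void {
  const allowed = ALLOWED_TRANSITIONS[current] ?? [];
  if (!allowed.includes(next)) {
    throw new Error(`Transition ${current} -> ${next} is parent-only`);
  }
}
```
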
### C. `append_slice_evidence` / `append_milestone_evidence` (Append-only audit trail)

**Schema**: `slice_evidence(milestone_id, slice_id, evidence_type, content, recorded_at)`, `milestone_evidence(milestone_id, evidence_type, content, recorded_at)`

**Grant**: `scope.milestone_id = "M01", operations = ["append_slice_evidence", "append_milestone_evidence"]`

**Why minimum**: These are the existing evidence tables (Tier 1.3 spec). They are append-only audit trails. A subagent that discovers relevant architectural context, a blocking assumption, or an unexpected constraint should be able to record it as an evidence entry without requiring the parent to parse and re-record it.

**Constraints enforced**:

- Only `INSERT` (no UPDATE or DELETE)
- `evidence_type` is a bounded enum (`finding`, `blocker`, `note`, `decision_record`)
- Content size bounded by DB column limits

---

## Summary

| Property | Hard Wall | Scoped Tokens |
|---|---|---|
| Parallel verification evidence recording | ❌ Lost or re-verified | ✅ Direct append |
| Scout blocker discovery | ❌ String parsing required | ✅ Direct status + evidence |
| Parent crash resilience | ❌ Evidence lost in flight | ✅ Subagent writes survive |
| Planning authority | Parent retains all | Parent retains all |
| DB write exposure | Zero | Bounded to 3 operations |
| Audit trail | Incomplete | Fingerprinted per token |
| Migration path from current model | N/A | Additive, no existing code broken |

The scoped token model is not a security risk amplification—it is a security risk **reclassification**. The current hard wall does not prevent evidence loss; it only makes it silent. Scoped tokens make evidence recording possible while keeping planning authority with the parent and exposing only bounded, auditable, append-mostly operations to subagents.

The correct objection is not "subagents shouldn't mutate state" but "subagents shouldn't have unbounded mutation authority." Scoped tokens are the mechanism that draws the bound.

589
docs/plans/UNIFIED_DISPATCH_V2.md
Normal file
@@ -0,0 +1,589 @@
# Unified Dispatch v2 — Architecture Plan

**Author:** Research synthesis
**Date:** 2026-05-08
**Status:** Draft — for review
**Scope:** Answer the 6 unified-dispatch questions with specific, opinionated positions backed by code references.

---

## The Unified Vision

SF should support a single dispatch system where ALL of these coexist and compose:

1. **Full-tool agents** — workers with all SF tools + full project DB access (today's parallel-orchestrator workers)
2. **Constrained subagents** — the current subagent tool (4 tools, no project DB writes)
3. **MessageBus-coordinated agents** — agents with AgentInbox, communicating via MessageBus (durable inbox, not file-based IPC)
4. **Coordinators on MessageBus too** — UOK kernel publishes to workers via MessageBus, workers reply via MessageBus
5. **All in parallel/debate/chain** — the subagent tool's 4 modes apply to ALL of the above
6. **Shared SQLite WAL** — all agents that need project state share the same DB
7. **Optional MessageBus inbox for subagents** — subagents can opt in to receive coordinator messages

The dispatch layer is ONE system parameterized by four dimensions:

```
dispatch(opts)
├── isolation:    'full'         ← all SF tools + project DB WAL
│                 'constrained'  ← 4 tools + ~/.sf/sf.db only (subagent)
├── coordination: 'standalone'   ← no MessageBus, no coordinator messaging
│                 'managed'      ← AgentInbox + MessageBus-enabled
├── scope:        'milestone' | 'slice' | 'task' | 'inline'
└── mode:         'single' | 'parallel' | 'debate' | 'chain'
```

---

## Q1. Unified Interface — The `dispatch()` API

### Current State

Three separate dispatch mechanisms with no shared interface:

| Component | Interface | Backing |
|-----------|-----------|---------|
| `parallel-orchestrator.js` | `startParallel(basePath, milestoneIds, prefs)` | worktree pool + child_process |
| `slice-parallel-orchestrator.js` | `startSliceParallel(basePath, milestoneId, eligibleSlices, opts)` | same, different scope |
| `subagent/index.js` | `executeSubagentInvocation({defaultCwd, agents, params, signal, ...})` | spawn `sf` CLI |
| `uok/kernel.js` | `runAutoLoopWithUok(args)` — owns autonomous loop | owns controller + mechanism |

### Proposed API

A single `DispatchService` class (formerly `WorktreeOrchestrator`) with a typed `DispatchOptions` interface:

```ts
// File: src/resources/extensions/sf/dispatch/service.js

export interface DispatchOptions {
  // ── Isolation (what tools + DB access) ──────────────────────────
  isolation: 'full' | 'constrained';

  // ── Coordination (messaging model) ──────────────────────────────
  coordination: 'standalone' | 'managed';

  // ── Scope (work unit type) ────────────────────────────────────
  scope: 'milestone' | 'slice' | 'task' | 'inline';

  // ── Unit identity ──────────────────────────────────────────────
  milestoneId?: string;
  sliceId?: string;
  taskId?: string;          // future: task-level dispatch
  basePath: string;

  // ── Execution mode ─────────────────────────────────────────────
  mode: 'single' | 'parallel' | 'debate' | 'chain';

  // ── Capacity ──────────────────────────────────────────────────
  maxWorkers?: number;      // default: parallel.max_workers config
  budgetCeiling?: number;   // default: parallel.budget_ceiling config
  workerTimeoutMs?: number;

  // ── Execution graph (file-conflict DAG) ────────────────────────
  useExecutionGraph?: boolean; // default: true

  // ── Subagent-specific ──────────────────────────────────────────
  // Only valid when isolation === 'constrained'
  agentScope?: 'user' | 'project' | 'both';
  parentTrace?: string;     // audit context injected into task prompts
  useMessageBus?: boolean;  // give subagent an AgentInbox
}
```

### Core API Surface

```ts
class DispatchService {
  // ── Lifecycle ─────────────────────────────────────────────────
  constructor(opts: DispatchOptions);

  // Prepare: run eligibility analysis + execution graph filtering
  // Returns { eligible, conflicts, skipped } without starting workers
  async prepare(): Promise<PrepareResult>;

  // Start workers for given unit IDs
  async start(unitIds: string[]): Promise<StartResult>;

  // Stop all or specific workers
  async stop(unitIds?: string[]): Promise<void>;

  // Pause/resume workers (via MessageBus when coordination === 'managed')
  pause(unitIds?: string[]): void;
  resume(unitIds?: string[]): void;

  // ── Observation ───────────────────────────────────────────────
  // Returns current state snapshot for dashboard
  getStatus(): DispatchStatus;

  // Subscribe to dispatch events (wraps MessageBus)
  subscribe(handler: DispatchEventHandler): UnsubscribeFn;

  // ── Budget ────────────────────────────────────────────────────
  totalCost(): number;
  isBudgetExceeded(): boolean;

  // ── Shared infrastructure ─────────────────────────────────────
  readonly bus: MessageBus; // shared bus when coordination === 'managed'
}
```

### How the 4 Dimensions Compose

| isolation | coordination | scope | mode | What happens |
|-----------|-------------|-------|------|-------------|
| `'full'` | `'standalone'` | `'milestone'` | `'parallel'` | Current parallel-orchestrator behavior |
| `'full'` | `'standalone'` | `'slice'` | `'parallel'` | Current slice-parallel behavior |
| `'full'` | `'managed'` | `'milestone'` | `'parallel'` | Workers have AgentInbox; coordinator sends pause/resume via MessageBus |
| `'constrained'` | `'standalone'` | `'inline'` | `'single'` | Current subagent single mode |
| `'constrained'` | `'standalone'` | `'inline'` | `'parallel'` | Current subagent parallel mode |
| `'constrained'` | `'standalone'` | `'inline'` | `'debate'` | Current subagent debate mode |
| `'constrained'` | `'standalone'` | `'inline'` | `'chain'` | Current subagent chain mode |
| `'constrained'` | `'managed'` | `'inline'` | `'single'` | Subagent with AgentInbox (opt-in); coordinator can message it |
| `'full'` | `'managed'` | `'milestone'` | `'debate'` | Full-tool debate: multiple milestone workers with MessageBus |
| `'full'` | `'managed'` | `'milestone'` | `'chain'` | Full-tool chain: milestone workers run sequentially via MessageBus handoff |

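Two points in that space, written as calls against the proposed API; the option values come from the table above, and the `basePath` value is illustrative:

```ts
// Today's parallel-orchestrator behavior, expressed in the proposed API:
const milestoneRun = new DispatchService({
  isolation: 'full',
  coordination: 'standalone',
  scope: 'milestone',
  mode: 'parallel',
  basePath: '/path/to/repo', // illustrative
});
await milestoneRun.start(['M01', 'M02']);

// A managed subagent the coordinator can message (opt-in inbox):
const managedSubagent = new DispatchService({
  isolation: 'constrained',
  coordination: 'managed',
  scope: 'inline',
  mode: 'single',
  basePath: '/path/to/repo', // illustrative
  useMessageBus: true,
});
```
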
### Subagent Tool as DispatchService Client

The `subagent` tool becomes a thin **client** of `DispatchService`:

```
subagent tool
│
├── isolation: 'constrained'
├── coordination: params.useMessageBus ? 'managed' : 'standalone'
├── scope: 'inline'
├── mode: params.mode (single/parallel/debate/chain)
│
└── Calls DispatchService instead of managing its own spawn pool
```

This eliminates the ~1000 LOC of concurrency management in `subagent/index.js` (`mapWithConcurrencyLimit`, `runSingleAgent`, `runSingleAgentInCmuxSplit`, `spawn` boilerplate) and replaces it with a single `dispatch.start()` call.

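A sketch of what the tool handler reduces to. The parameter plumbing is an assumption — in particular, how inline task payloads map onto `start()` is an open design question, and the mapping below is only a placeholder:

```ts
// Sketch: the subagent tool reduced to a DispatchService client.
// `params` mirrors the existing tool's parameter names; the unit-ID mapping
// for inline scope is a placeholder, not a settled contract.
async function executeSubagentInvocation({ defaultCwd, params }) {
  const dispatch = new DispatchService({
    isolation: 'constrained',
    coordination: params.useMessageBus ? 'managed' : 'standalone',
    scope: 'inline',
    mode: params.mode, // 'single' | 'parallel' | 'debate' | 'chain'
    basePath: defaultCwd,
    agentScope: params.agentScope,
    parentTrace: params.parentTrace,
  });
  // One call replaces the tool's spawn pool and concurrency management.
  return dispatch.start(params.tasks?.map((t) => t.agent) ?? [params.agent]);
}
```
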
---

## Q2. MessageBus as the Backbone

### Current State

MessageBus (`uok/message-bus.js`) is wired **only to UOK kernel internal observer chains**:

- UOK kernel creates a `MessageBus` instance in `runAutoLoopWithUok`
- `createTurnObserver` (`uok/loop-adapter.js`) subscribes to UOK events
- The parallel orchestrator and slice-parallel orchestrator use **file-based IPC** exclusively:
  - `session-status-io.js`: poll `.sf/parallel/sessions/*.json` every refresh cycle
  - `sendSignal(basePath, mid, "pause"|"resume"|"stop")`: write signal files that workers check on next dispatch

### The Gap

File-based IPC is correct for crash recovery (workers persist state to disk and survive coordinator restarts), but it has two weaknesses:

1. **No durable coordinator → worker messaging**: When a coordinator restarts, it re-reads session files to restore state, but workers don't know the coordinator restarted unless they poll. Workers check for signals on each dispatch turn — correct but ~1-2 second latency.

2. **No worker → coordinator messaging**: Workers emit cost via NDJSON stdout, but there's no inbox model for workers to send structured messages back to the coordinator.

### Proposed: MessageBus Replaces File-Based IPC for Live Coordination

```
Current (file-based):
  Coordinator ──signal file──────────► Worker
  Worker ────session status file─────► Coordinator (polled)

Proposed (MessageBus):
  Coordinator ──MessageBus.send()────► Worker AgentInbox
  Worker ────MessageBus.send()───────► Coordinator Inbox
  (File-based IPC stays as crash-recovery fallback)
```

### Implementation

The `DispatchService` owns a single `MessageBus` instance per basePath. Each worker gets an `AgentInbox` named after its unit ID (e.g., `milestone:M01`, `slice:S01:02`).

**Coordinator → Worker messages** (pause, resume, stop, status request):

```ts
// In DispatchService
this.bus.send('coordinator', `worker:${unitId}`, { type: 'control', action }, metadata);
```

**Worker → Coordinator messages** (unit started, unit completed, error, cost update):

```ts
// In worker bootstrap (sf headless entry point)
const bus = new MessageBus(basePath);
const inbox = bus.getInbox(`worker:${unitId}`); // receives coordinator control messages
bus.send(`worker:${unitId}`, 'coordinator', { type: 'unit_started' });
```

**File-based fallback remains**: `session-status-io.js` is NOT removed. Workers still write session status files. The coordinator still reads them on startup for crash recovery. MessageBus adds *durable live coordination* on top.

### Should ALL Coordination Flow Through MessageBus?

**Yes, for live coordination between a running coordinator and its workers.**

The UOK kernel itself becomes a coordinator that uses MessageBus. When `runAutoLoopWithUok` initializes `DispatchService`, it passes `coordination: 'managed'`. The UOK kernel then receives worker events via the shared bus rather than polling session files.

**File-based IPC stays for crash recovery** — when a coordinator dies and restarts, it reads session status files to adopt surviving workers. MessageBus state does not survive coordinator restarts (inboxes are in-memory, backed by SQLite messages). This is the right split: MessageBus for live coordination, file-based for durability.

**What replaces file-based IPC for subagent coordination?** Subagents spawned with `isolation: 'constrained'` and `coordination: 'standalone'` use the current model (spawn `sf` CLI, parse NDJSON stdout). When `coordination: 'managed'`, subagents get an `AgentInbox` and the coordinator can send them pause/resume messages.

---

## Q3. DB Access Matrix

### Current State

| Dispatch configuration | DB access | Mechanism |
|-----------------------|-----------|-----------|
| Parallel-orchestrator workers | Full project `.sf/sf.db` WAL | Workers open `.sf/sf.db` in worktree via `syncSfStateToWorktree` |
| Slice-parallel-orchestrator workers | Full project `.sf/sf.db` WAL | Same as above |
| Subagent (spawned `sf` CLI) | Global `~/.sf/sf.db` only; NO project DB | Spawned process has own SQLite connection |
| UOK kernel (autonomous loop) | Full project `.sf/sf.db` WAL | Runs in project context |
| Cmux | None | Terminal surface only |

### The Constraint Is Intentional

The subagent's 4-tool limit is **correct security isolation**, not a limitation to be fixed:

- A spawned `sf` CLI with project DB write access running in a user-specified `cwd` is a significant attack surface
- Subagents should return **structured output**, not mutate state directly
- The coordinator (UOK kernel or parent agent) is responsible for interpreting subagent output and calling DB tools

### Proposed DB Access Matrix (Unified Model)

```
┌──────────────────────────────────────────────────────────────────┐
│ isolation: 'full', coordination: 'managed'                       │
│   Workers: milestone/slice agents spawned via DispatchService    │
│   DB: project .sf/sf.db (WAL) — full read/write                  │
│   AgentInbox: yes                                                │
├──────────────────────────────────────────────────────────────────┤
│ isolation: 'constrained', coordination: 'standalone'             │
│   Subagent: current subagent tool (spawned sf CLI)               │
│   DB: ~/.sf/sf.db (global) read/write;                           │
│       project .sf/sf.db: read via prompt injection only          │
│   AgentInbox: no                                                 │
├──────────────────────────────────────────────────────────────────┤
│ isolation: 'constrained', coordination: 'managed'                │
│   Subagent with opt-in messaging                                 │
│   DB: same as above + MessageBus inbox for coordinator messages  │
│   AgentInbox: yes (injected via prompt context)                  │
├──────────────────────────────────────────────────────────────────┤
│ isolation: 'full', coordination: 'standalone'                    │
│   Workers without MessageBus (legacy standalone mode)            │
│   DB: project .sf/sf.db (WAL) — full read/write                  │
│   AgentInbox: no                                                 │
└──────────────────────────────────────────────────────────────────┘
```

### Key Rule

**`isolation: 'full'` = project DB WAL access. `isolation: 'constrained'` = no project DB writes.**

DB access is determined solely by `isolation`, not by `scope` or `mode`. A slice-scope worker with `isolation: 'full'` has the same DB access as a milestone-scope worker — correct, since they both represent the primary agent running project work.

### Subagent Output Contract

When a constrained subagent needs to record something in project state, the contract is:

1. Subagent returns structured output (via NDJSON `message_end` events)
2. Coordinator parses and calls the appropriate DB tool (`complete-task`, `block-slice`, etc.)
3. Subagent never writes to project DB directly

This mirrors the Letta agent pattern: agents return results, the orchestrator persists.

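A sketch of step 2 on the coordinator side, assuming line-delimited JSON on the child's stdout and the `message_end` event type named above; the payload fields beyond `type` are assumptions:

```ts
import readline from "node:readline";

// Sketch: coordinator parses subagent NDJSON output, then persists via its
// own DB tools. Only the `message_end` event name is taken from the contract;
// the rest of the payload shape is illustrative.
function watchSubagentOutput(child, onFinal) {
  const rl = readline.createInterface({ input: child.stdout });
  rl.on("line", (line) => {
    let event;
    try { event = JSON.parse(line); } catch { return; } // skip non-JSON lines
    if (event.type === "message_end") {
      onFinal(event); // coordinator interprets, then calls complete-task / block-slice etc.
    }
  });
}
```
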
---

## Q4. Coordinator Pattern — Debate and Chain on MessageBus

### Current State

**Subagent debate/chain** (`subagent/index.js`):

- Debate: `mapWithConcurrencyLimit` runs N agents per round sequentially within a single process; each agent sees the prior round's transcript
- Chain: sequential `runSingleAgent` calls, each feeding its output into the next step's prompt
- The coordinator is **in-process** — it's the `subagent/index.js` call stack

**Parallel orchestrator** (`parallel-orchestrator.js`):

- Coordinator is **out-of-process** — it's the TUI/headless process that spawned milestone workers
- No MessageBus — coordinator and workers communicate via session status files and NDJSON stdout
- Workers run `sf headless --json autonomous` in worktrees

### How Debate and Chain Work with Coordinators on MessageBus

The coordinator is always the **dispatching agent** (UOK kernel or subagent tool). The key question is whether the coordinator is in-process or out-of-process.

#### Debate Mode with Full-Tool Workers (Milestone-Level)

```
Coordinator (UOK kernel, coordination: 'managed')
│
├── Round 1: bus.broadcast('coordinator', [worker:M1a, worker:M1b], {type: 'debate', round: 1, topic, prompt})
├── Each worker replies via their AgentInbox with their position
├── Coordinator collects all replies, builds transcript
│
├── Round 2: bus.broadcast('coordinator', [worker:M1a, worker:M1b], {type: 'debate', round: 2, transcript})
├── ...
│
└── Round N: coordinator issues final verdict
```

This is **true process-level parallelism** — workers are separate `sf headless` processes in worktrees, each with full project DB access. The coordinator sequences rounds via MessageBus.

#### Chain Mode with Full-Tool Workers

```
Coordinator (UOK kernel, coordination: 'managed')
│
├── Step 1: bus.send('coordinator', 'worker:M1a', {type: 'chain', step: 1, after: null})
│            Worker M1a produces output
├── Coordinator collects output
│
├── Step 2: bus.send('coordinator', 'worker:M1b', {type: 'chain', step: 2, after: output_from_M1a})
│            Worker M1b produces output
├── ...
```

The coordinator controls sequencing — it waits for each step's output before dispatching the next. Workers can run in different worktrees or the same worktree depending on file-conflict constraints.

#### Debate Mode with Constrained Subagents (Current Behavior)

The current subagent debate mode runs in-process via `mapWithConcurrencyLimit`. This is correct for constrained subagents because:

- They're short-lived, spawned per debate round
- They don't need project DB access
- In-process is faster (no process spawn overhead per round)

**This does NOT change** for constrained subagents. The coordinator stays in-process.

#### Chain Mode with Constrained Subagents (Current Behavior)

Current subagent chain mode is sequential `runSingleAgent` calls in the same process. **This does NOT change** for constrained subagents.

### When Does the Coordinator Become a MessageBus Agent?

**Only when `coordination: 'managed'` and `isolation: 'full'`** (full-tool workers).

The coordinator (UOK kernel) gets its own `AgentInbox` on the MessageBus:

```ts
// In DispatchService
const coordinatorInbox = this.bus.getInbox('coordinator');
```

Workers send messages to `coordinator`; coordinator sends to `worker:${unitId}`.

**For constrained subagents** (`isolation: 'constrained'`), the coordinator is always in-process. They don't use MessageBus unless `coordination: 'managed'` is explicitly set — in which case the subagent tool creates an `AgentInbox` for the spawned subagent process and the coordinator (subagent tool's process) can send it messages.

### Summary

| Mode | isolation | Coordinator location | MessageBus role |
|------|-----------|---------------------|----------------|
| `parallel` | `'full'` | Out-of-process (UOK kernel) | Workers reachable via AgentInbox |
| `debate` | `'full'` | Out-of-process (UOK kernel) | Rounds sequenced via broadcast |
| `chain` | `'full'` | Out-of-process (UOK kernel) | Sequential handoff via send/reply |
| `single` | `'full'` | Out-of-process (UOK kernel) | Worker has AgentInbox |
| `parallel` | `'constrained'` | In-process (subagent tool) | Optional AgentInbox if opt-in |
| `debate` | `'constrained'` | In-process (subagent tool) | Not MessageBus (in-process) |
| `chain` | `'constrained'` | In-process (subagent tool) | Not MessageBus (in-process) |
| `single` | `'constrained'` | In-process (subagent tool) | Optional AgentInbox if opt-in |

---

## Q5. Migration — From Today's Siloed Mechanisms to Unified System

### The Constraint: Don't Break Existing Workflows

SF has active users relying on:

- `sf parallel <milestone-id>` — the parallel orchestrator dashboard
- `sf headless autonomous` — the UOK kernel autonomous loop
- `sf` subagent tool with all 4 modes — used inside TUI/headless sessions
- Slice-level parallelism inside milestones

**Migration must be additive and backward-compatible at each step.**

### Migration Path: 6 Phases

#### Phase 1 — Merge Parallel + Slice Orchestrators (Week 1)

**Risk: Low | Behavior: identical**

Extract the ~80% shared logic from `parallel-orchestrator.js` + `slice-parallel-orchestrator.js` into a single `WorktreeOrchestrator` class parameterized by `{ scope: 'milestone' | 'slice' }`.

**Before:**

```
parallel-orchestrator.js       (~800 LOC)
slice-parallel-orchestrator.js (~450 LOC)
```

**After:**

```
worktree-orchestrator.js (~900 LOC merged)
├── Both orchestrators become thin wrappers calling WorktreeOrchestrator
└── slice-parallel-conflict.ts stays as the constraint solver
```

**Files touched:**

- New: `src/resources/extensions/sf/worktree-orchestrator.js`
- Refactor: `parallel-orchestrator.js` → thin wrapper
- Refactor: `slice-parallel-orchestrator.js` → thin wrapper
- All callers of `startParallel` / `startSliceParallel` continue to work

**Verification:** parallel dashboard and slice-level parallelism work identically. Zero behavior change.

#### Phase 2 — Extract DispatchService API (Week 2)

**Risk: Low | Behavior: identical**

Create the `DispatchService` class with the `DispatchOptions` interface. Wrap `WorktreeOrchestrator` internally. The parallel orchestrator wrapper becomes a `DispatchService` client.

```ts
// New file: src/resources/extensions/sf/dispatch/service.js
export class DispatchService {
  constructor(opts: DispatchOptions) { ... }
  async prepare(): Promise<PrepareResult> { return this.orchestrator.prepare(...); }
  async start(unitIds: string[]): Promise<StartResult> { ... }
  ...
}
```

**Files touched:**

- New: `src/resources/extensions/sf/dispatch/service.js`
- New: `src/resources/extensions/sf/dispatch/types.js`
- Parallel orchestrator wrapper updated to call `DispatchService`
- Slice parallel orchestrator wrapper updated to call `DispatchService`

**Verification:** all existing dispatch paths (parallel, slice-parallel) work via the new API.

#### Phase 3 — Wire MessageBus into DispatchService (Week 3)

**Risk: Medium | Behavior: additive**

Add `MessageBus` to `DispatchService` and give each worker an `AgentInbox` when `coordination: 'managed'`. File-based IPC (`session-status-io.js`) stays as fallback.

**New behavior (opt-in):**

```ts
const dispatch = new DispatchService({
  isolation: 'full',
  coordination: 'managed', // NEW: workers get AgentInbox
  scope: 'milestone',
  mode: 'parallel',
  ...
});
```

**Files touched:**

- `src/resources/extensions/sf/dispatch/service.js` — add MessageBus integration
- `src/resources/extensions/sf/worktree-orchestrator.js` — add worker AgentInbox creation
- Worker bootstrap in `spawnWorker` — open MessageBus inbox after fork

**Verification:** workers respond to `dispatch.pause()` / `dispatch.resume()` via MessageBus. File-based fallback still works.

#### Phase 4 — Subagent Tool Uses DispatchService (Week 4)

**Risk: Medium | Behavior: constrained subagent modes unchanged**

Replace the subagent tool's internal spawn pool with `DispatchService({ isolation: 'constrained', scope: 'inline' })`. For now, use `coordination: 'standalone'` — no MessageBus for subagents yet.

**Files touched:**

- `src/resources/extensions/subagent/index.js` — replace concurrency management with `DispatchService` calls
- Estimated: ~600 LOC removed (spawn management, `mapWithConcurrencyLimit`, `runSingleAgent`, etc.)

**Verification:** all 4 subagent modes (single/parallel/debate/chain) work identically. The implementation changes, the user experience doesn't.

#### Phase 5 — UOK Kernel Adopts DispatchService (Week 5)

**Risk: Medium | Behavior: UOK autonomous loop uses unified API**

Refactor `runAutoLoopWithUok` to use `DispatchService` instead of calling `startParallel` / `slice-parallel` directly.

```ts
// Before (in kernel.js):
const { started, errors } = await startParallel(basePath, milestoneIds, prefs);

// After:
const dispatch = new DispatchService({
  isolation: 'full',
  coordination: 'managed',
  scope: 'milestone',
  mode: 'parallel',
  basePath,
  ...
});
await dispatch.start(eligibleMilestoneIds);
```

**Files touched:**

- `src/resources/extensions/sf/uok/kernel.js` — use DispatchService
- Remove `startParallel` / `startSliceParallel` exports (or keep as legacy wrappers)

**Verification:** `sf headless autonomous` works identically. Workers appear in dashboard.

#### Phase 6 — Subagent Optional MessageBus Inbox (Week 6)

**Risk: Low | Behavior: opt-in, additive**

Allow the subagent tool to pass `useMessageBus: true`, giving the spawned subagent an `AgentInbox` that the coordinator can message.

**Files touched:**

- `src/resources/extensions/subagent/index.js` — inject `useMessageBus` into DispatchService opts
- `src/resources/extensions/sf/dispatch/service.js` — handle `isolation: 'constrained', coordination: 'managed'`

**Verification:** subagent with `useMessageBus: true` can receive pause/resume from coordinator.

---

## Q6. Implementation Order — Build First, Second, Third

### Priority Rationale

**Highest value first:**

1. **Phase 1 (merge)** — Eliminates the duplication between the two orchestrators. Pure refactor, no new behavior. Clarifies the worktree pool as a single concept. Sets the foundation for all subsequent changes.

2. **Phase 2 (API extraction)** — Codifies the `DispatchOptions` interface before any new dispatch paths are added. Forces the 4-dimension model to be explicit and typed. New code immediately benefits from the API.

3. **Phase 3 (MessageBus)** — Adds durable coordination on top of the merged worktree pool. This is the key differentiator for the "unified" vision — workers become reachable via durable messaging. File-based IPC stays as crash recovery.

4. **Phase 4 (subagent → DispatchService)** — Removes ~600 LOC of duplicate concurrency management from subagent. Makes subagent a client of the unified API. Opens the door for subagents to opt into MessageBus coordination.

5. **Phase 5 (UOK → DispatchService)** — Makes the UOK kernel a `DispatchService` client. This is the most impactful migration: the autonomous loop and the parallel orchestrator now share the same dispatch machinery.

6. **Phase 6 (subagent MessageBus)** — Final piece of the unified vision: subagents with MessageBus inboxes. Lowest risk (opt-in, additive) but completes the composition story.

### What NOT to Build Yet

- **Task-level dispatch** (`scope: 'task'`): Not needed yet. Milestone and slice are the primary parallelism boundaries. Task dispatch would require the unit-runtime layer (`uok/unit-runtime.js`) to be more mature.

- **Nested dispatch** (subagent spawning subagent): The current security boundary (constrained isolation = no project DB writes) prevents dangerous nested dispatch. Don't remove this constraint.

- **Persistent agents** (Letta-style): MessageBus is the right primitive, but SF doesn't have persistent named agents yet. Don't build agent registry/lifecycle management until there's a concrete use case.

- **Cmux decoupling**: Lower priority. Cmux grid layout is a UI concern. The dispatch layer doesn't need to know about it.

### The Order in Summary

```
Week 1: Phase 1 — Merge parallel + slice orchestrators → WorktreeOrchestrator
Week 2: Phase 2 — Extract DispatchService API (DispatchOptions interface)
Week 3: Phase 3 — Wire MessageBus into DispatchService (coordination: 'managed')
Week 4: Phase 4 — Subagent tool becomes DispatchService client
Week 5: Phase 5 — UOK kernel uses DispatchService
Week 6: Phase 6 — Subagent optional MessageBus inbox
```

---

## Summary of Positions

| Question | Position |
|----------|----------|
| **Unified interface** | Single `DispatchService` class with `DispatchOptions { isolation, coordination, scope, mode }`. Four typed dimensions, not separate mechanisms. |
| **MessageBus as backbone** | YES for live coordinator↔worker messaging. File-based IPC (`session-status-io.js`) stays as crash-recovery fallback. All live coordination flows through MessageBus when `coordination: 'managed'`. |
| **DB access matrix** | `isolation: 'full'` = project DB WAL. `isolation: 'constrained'` = ~/.sf/sf.db only, no project writes. Scope and mode don't affect DB access. |
| **Coordinator on MessageBus** | YES for `isolation: 'full', coordination: 'managed'`. UOK kernel becomes a DispatchService client with an AgentInbox. Workers reply via MessageBus. Debate/chain run as sequential rounds over MessageBus broadcast. Constrained subagents stay in-process for debate/chain. |
| **Migration** | 6 additive phases. Merge first (lowest risk), API extraction second, MessageBus wiring third, subagent adoption fourth, UOK migration fifth, subagent MessageBus opt-in sixth. Zero behavior change until Phase 4. |
| **Implementation order** | Phase 1 → Phase 2 → Phase 3 → Phase 4 → Phase 5 → Phase 6. Highest-value/lowest-risk items first. Don't build task-level dispatch, nested dispatch, persistent agents, or Cmux decoupling yet. |

---

## Key File References

| File | Role in Unified System |
|------|----------------------|
| `src/resources/extensions/sf/parallel-orchestrator.js` | Merged into `worktree-orchestrator.js` |
| `src/resources/extensions/sf/slice-parallel-orchestrator.js` | Merged into `worktree-orchestrator.js` |
| `src/resources/extensions/sf/worktree-orchestrator.js` | **NEW** — merged orchestration engine |
| `src/resources/extensions/sf/dispatch/service.js` | **NEW** — `DispatchService` class |
| `src/resources/extensions/sf/dispatch/types.js` | **NEW** — `DispatchOptions` and related types |
| `src/resources/extensions/sf/uok/message-bus.js` | MessageBus + AgentInbox (already exists) |
| `src/resources/extensions/sf/uok/kernel.js` | UOK kernel (becomes DispatchService client) |
| `src/resources/extensions/sf/uok/execution-graph.js` | Constraint solver (stays separate) |
| `src/resources/extensions/sf/uok/dispatch-envelope.js` | What-to-dispatch contract (already exists) |
| `src/resources/extensions/sf/session-status-io.js` | File-based IPC fallback (stays) |
| `src/resources/extensions/subagent/index.js` | Subagent tool (becomes DispatchService client) |
| `src/resources/extensions/sf/slice-parallel-conflict.js` | Slice conflict checker (stays) |

666
docs/plans/UNIFIED_DISPATCH_V2_PLAN.md
Normal file
@@ -0,0 +1,666 @@
# Unified Dispatch v2 — Qwen Plan

**Author:** Architecture research
**Date:** 2026-05-08
**Status:** Structured plan for review
**Supersedes:** `DISPATCH_ARCHITECTURE_CONSOLIDATION.md`, `dispatch-orchestration-architecture.md`, `DISPATCH_ORCHESTRATION_PLAN.md`

---

## The Unified Vision

SF should support a single dispatch system where ALL of these coexist and compose:

1. **Full-tool agents** — workers with all SF tools + full project DB access (today's parallel-orchestrator workers)
2. **Constrained subagents** — the current subagent tool (4 tools, no project DB writes)
3. **MessageBus-coordinated agents** — agents with AgentInbox, communicating via MessageBus (durable inbox, not file-based IPC)
4. **Coordinators on MessageBus too** — UOK kernel publishes to workers via MessageBus, workers reply via MessageBus
5. **All in parallel/debate/chain** — the subagent tool's 4 modes apply to ALL of the above
6. **Shared SQLite WAL** — all agents that need project state share the same DB
7. **Optional MessageBus inbox for subagents** — subagents can opt in to receive coordinator messages

The dispatch layer is **ONE system** parameterized by four dimensions:

| Dimension | Values | Meaning |
|-----------|--------|---------|
| `isolation` | `'full'` \| `'constrained'` | Full: all SF tools + project DB writes. Constrained: 4 tools, no project DB writes. |
| `coordination` | `'standalone'` \| `'managed'` | Standalone: no MessageBus. Managed: has AgentInbox, coordinator can message. |
| `scope` | `'milestone'` \| `'slice'` \| `'task'` \| `'inline'` | The work unit hierarchy. |
| `mode` | `'single'` \| `'parallel'` \| `'debate'` \| `'chain'` | How many agents run and in what relationship. |

---

## Current State Map

| Mechanism | Isolation | Coordination | Scope | Mode | Key Files |
|-----------|-----------|--------------|-------|------|-----------|
| `parallel-orchestrator.js` | `full` | `standalone` (file-based IPC) | `milestone` | `parallel` | `src/resources/extensions/sf/parallel-orchestrator.js` |
| `slice-parallel-orchestrator.js` | `full` | `standalone` (file-based IPC) | `slice` | `parallel` | `src/resources/extensions/sf/slice-parallel-orchestrator.js` |
| Subagent tool (`extensions/subagent/index.js`) | `constrained` (4 tools) | `standalone` | `inline` | `single/parallel/debate/chain` | `src/resources/extensions/subagent/index.js` |
| UOK kernel (`uok/kernel.js`) | N/A (runs in-process) | `standalone` (no MessageBus) | `milestone` | `single` (autonomous loop) | `src/resources/extensions/sf/uok/kernel.js` |
| MessageBus (`uok/message-bus.js`) | N/A | N/A | N/A | N/A | `src/resources/extensions/sf/uok/message-bus.js` |

---

## Q1: Unified `dispatch()` API

### Design Principle

One function, four parameters. Every dispatch configuration is a point in the 4D parameter space.

```ts
// File: src/resources/extensions/sf/dispatch-layer.js (proposed)

export interface DispatchOptions {
  // ── Isolation ───────────────────────────────────────────────────────────
  // 'full': all SF tools, project DB read/write (milestone workers, slice workers)
  // 'constrained': 4 tools only, no project DB writes (subagent, scout, reviewer, reporter)
  isolation: 'full' | 'constrained';

  // ── Coordination ───────────────────────────────────────────────────────
  // 'standalone': no MessageBus, no coordinator messaging
  // 'managed': AgentInbox per worker, coordinator can send pause/resume/stop/status messages
  coordination: 'standalone' | 'managed';

  // ── Scope ──────────────────────────────────────────────────────────────
  // 'milestone': git worktree per milestone, SF_MILESTONE_LOCK set
  // 'slice': git worktree per slice within a milestone, SF_SLICE_LOCK + SF_MILESTONE_LOCK set
  // 'task': no worktree, runs in same process or short-lived subprocess
  // 'inline': in-process agent call (subagent single mode, no worktree)
  scope: 'milestone' | 'slice' | 'task' | 'inline';

  // ── Mode ───────────────────────────────────────────────────────────────
  // 'single': run one agent
  // 'parallel': run N agents concurrently (up to maxConcurrency)
  // 'debate': bounded adversarial rounds, each agent sees prior rounds' output
  // 'chain': sequential, each step's output feeds into next step as {previous}
  mode: 'single' | 'parallel' | 'debate' | 'chain';

  // ── Common fields ──────────────────────────────────────────────────────
  unitId: string;              // milestoneId, sliceId, or taskId
  milestoneId?: string;        // required when scope='slice'
  agent: string | string[];    // agent name(s); array for parallel/debate/chain
  task?: string;               // task description (single mode)
  tasks?: TaskItem[];          // task list (parallel/debate mode)
  chain?: ChainItem[];         // chain steps (chain mode)
  basePath: string;
  maxConcurrency?: number;     // default: 4 for parallel/debate, 1 for chain
  budgetCeiling?: number;
  workerTimeoutMs?: number;
  shellWrapper?: string[];
  useExecutionGraph?: boolean; // use file-conflict DAG to filter parallel set
  modelOverride?: string;
  parentTrace?: string;        // audit context injected for review subagents

  // ── MessageBus routing (when coordination='managed') ──────────────────
  coordinatorId?: string;      // agentId of the coordinator (for routing replies)
}

export interface TaskItem {
  agent: string;
  task: string;
  cwd?: string;
  model?: string;
  parentTrace?: string;
}

export interface ChainItem {
  agent: string;
  task: string; // may contain {previous} placeholder
  cwd?: string;
  model?: string;
  parentTrace?: string;
}

export class DispatchLayer {
  readonly bus: MessageBus; // shared bus for all dispatches

  constructor(basePath: string, busOptions?: MessageBusOptions);

  // ── Core dispatch ────────────────────────────────────────────────────────
  async dispatch(opts: DispatchOptions): Promise<DispatchResult>;

  // ── Batch helpers ───────────────────────────────────────────────────────
  async dispatchMilestones(milestoneIds: string[], opts: Partial<DispatchOptions>): Promise<StartResult>;
  async dispatchSlices(milestoneId: string, sliceIds: string[], opts: Partial<DispatchOptions>): Promise<StartResult>;

  // ── Lifecycle ───────────────────────────────────────────────────────────
  async stop(workIds?: string[]): Promise<void>;
  pause(workIds?: string[]): void;  // via MessageBus when managed
  resume(workIds?: string[]): void; // via MessageBus when managed

  // ── State ───────────────────────────────────────────────────────────────
  getStatus(): DispatchStatus;
  totalCost(): number;
  isBudgetExceeded(): boolean;

  // ── Event subscription ─────────────────────────────────────────────────
  subscribe(handler: DispatchEventHandler): UnsubscribeFn;
}
```

### Parameter Matrix

| isolation | coordination | scope | mode | Current equivalent | DB access |
|-----------|---------------|-------|------|-------------------|-----------|
| `full` | `managed` | `milestone` | `parallel` | `parallel-orchestrator.js` | project DB read/write |
| `full` | `managed` | `slice` | `parallel` | `slice-parallel-orchestrator.js` | project DB read/write |
| `constrained` | `standalone` | `inline` | `single` | subagent single mode | no project DB |
| `constrained` | `standalone` | `inline` | `parallel` | subagent parallel mode | no project DB |
| `constrained` | `standalone` | `inline` | `debate` | subagent debate mode | no project DB |
| `constrained` | `standalone` | `inline` | `chain` | subagent chain mode | no project DB |
| `constrained` | `managed` | `inline` | `single` | **new: managed subagent** | no project DB |
| `full` | `managed` | `inline` | `single` | **new: headless autonomous** | project DB read/write |

---

## Q2: MessageBus as the Backbone

### Answer: Yes, ALL coordinator → worker and worker → coordinator communication flows through MessageBus.

The file-based IPC (`session-status-io.js`, `sendSignal`) becomes a **crash-recovery fallback only**, not the primary path.

### What MessageBus replaces today

| Current mechanism | Replaced by | Where defined |
|-------------------|-------------|---------------|
| `session-status-io.js` — write/read session status files | `MessageBus.send()` to `worker/<id>/status` | `dispatch-layer.js` |
| `sendSignal(basePath, mid, "pause\|resume\|stop")` | `MessageBus.send(coordinatorId, workerId, "pause")` | `dispatch-layer.js` |
| `consumeSignal(basePath, mid)` | Worker polls `AgentInbox.list()` | Worker bootstrap |
| `parallel-intent.js` — CoordinationStore for file intent | `MessageBus.send()` to `coordinator/file-intent` | `dispatch-layer.js` |

### Worker inbox naming convention

```
Workers get AgentInbox named:      dispatch:<scope>:<unitId>
  e.g., dispatch:milestone:M01
  e.g., dispatch:slice:M01/S01

Coordinator gets AgentInbox named: dispatch:coordinator:<runId>
```

### MessageBus event taxonomy

```ts
// Worker → Coordinator messages (via worker's AgentInbox addressed to coordinatorId)
type WorkerStatusMessage =
  | { type: 'worker.started', milestoneId: string, sliceId?: string, pid: number, worktreePath: string }
  | { type: 'worker.heartbeat', milestoneId: string, cost: number, currentUnit?: string }
  | { type: 'worker.completed', milestoneId: string, exitCode: number, totalCost: number }
  | { type: 'worker.error', milestoneId: string, error: string }
  | { type: 'worker.paused', milestoneId: string }
  | { type: 'worker.resumed', milestoneId: string };

// Coordinator → Worker messages (via coordinator's AgentInbox addressed to workerId)
type CoordinatorCommandMessage =
  | { type: 'coordinator.pause' }
  | { type: 'coordinator.resume' }
  | { type: 'coordinator.stop' }
  | { type: 'coordinator.status_request' };

// Broadcast messages (coordinator → all workers)
type BroadcastMessage =
  | { type: 'coordinator.budget_exceeded', ceiling: number }
  | { type: 'coordinator.sibling_failed', triggeringWorkerId: string };
```

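Since these message types form discriminated unions, coordinator-side handling narrows on `type`. A sketch; `trackCost` is an illustrative helper, not an existing SF function:

```ts
// Sketch: coordinator-side handling of worker messages via type narrowing.
function handleWorkerMessage(msg: WorkerStatusMessage): void {
  switch (msg.type) {
    case 'worker.started':
      console.log(`worker ${msg.milestoneId} up (pid ${msg.pid}) in ${msg.worktreePath}`);
      break;
    case 'worker.heartbeat':
      trackCost(msg.milestoneId, msg.cost); // trackCost: illustrative budget tracker
      break;
    case 'worker.completed':
      console.log(`worker ${msg.milestoneId} done, exit ${msg.exitCode}, cost ${msg.totalCost}`);
      break;
    case 'worker.error':
      console.error(`worker ${msg.milestoneId} failed: ${msg.error}`);
      break;
    case 'worker.paused':
    case 'worker.resumed':
      break; // dashboard-only state changes
  }
}
```
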
### Worker bootstrap changes

Today, a worker (`sf headless --json autonomous` in a worktree) has no MessageBus integration. It writes status files that the coordinator polls. After Q2:

```ts
// In the worker bootstrap (inside the sf headless autonomous process).
// The worker opens its own bus handle; the coordinator's DispatchLayer is not
// reachable across the process boundary.
const workerId = `dispatch:${scope}:${unitId}`;
const bus = new MessageBus(basePath);
const inbox = bus.getInbox(workerId);

// On each dispatch tick, check the inbox for coordinator messages
const messages = inbox.list(true /* unreadOnly */);
for (const msg of messages) {
  if (msg.body.type === 'coordinator.stop') {
    // graceful shutdown
    break;
  }
  inbox.markRead(msg.id);
}
```

### File-based IPC as fallback

`session-status-io.js` stays for **crash recovery**: if a coordinator restarts and workers are still running, the coordinator reads `session-status.json` files to restore state. This is already implemented and correct — it just stops being the *primary* coordination path.

### Code reference for existing file-based IPC

- `session-status-io.js:writeSessionStatus()` — atomic JSON write to `.sf/parallel/<mid>.status.json`
- `session-status-io.js:sendSignal()` — atomic JSON write to `.sf/parallel/<mid>.signal.json`
- `parallel-orchestrator.js:refreshWorkerStatuses()` — polls all status files every dashboard refresh cycle
- `parallel-orchestrator.js:processWorkerLine()` — parses NDJSON from worker stdout, updates status

---

## Q3: DB Access Matrix

### The single-writer invariant

`sf-db.js` enforces that **only this file** issues write SQL (INSERT/UPDATE/DELETE) against `.sf/sf.db`. All other modules must call typed wrappers exported there. This is checked in CI by `tests/single-writer-invariant.test.ts`.

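A sketch of one such typed wrapper, assuming better-sqlite3-style bindings; the function name and signature are illustrative, not an existing `sf-db.js` export:

```ts
import type Database from "better-sqlite3";

// Illustrative typed wrapper living in sf-db.js: callers never touch SQL.
type TaskStatus = "pending" | "completed" | "blocked";

export function setTaskStatus(db: Database, taskId: string, status: TaskStatus): void {
  // The UPDATE lives here and only here, keeping the single-writer invariant checkable.
  db.prepare(`UPDATE tasks SET status = ? WHERE id = ?`).run(status, taskId);
}
```
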
### DB access per dispatch configuration

| Dispatch config | Project DB (.sf/sf.db) | Global DB (~/.sf/sf.db) | Notes |
|----------------|------------------------|--------------------------|-------|
| `isolation:full, scope:milestone` | **read/write** | read | Workers open WAL connection to project DB via `syncSfStateToWorktree` |
| `isolation:full, scope:slice` | **read/write** | read | Same as above |
| `isolation:full, scope:inline` | **read/write** | read | `sf headless autonomous` running in same process (UOK kernel's autonomous mode) |
| `isolation:constrained, scope:inline` | **read via prompt injection only** | read/write | Subagent spawns `sf` CLI which opens `~/.sf/sf.db` only; project DB accessed via prompt context injection |
| `isolation:constrained, scope:task` | **none** | read/write | Ephemeral task dispatch (future) |

### How constrained isolation is enforced

The subagent tool (`extensions/subagent/index.js`) spawns `sf` CLI as a **separate OS process**:

```ts
// subagent/index.js — runSingleAgent()
const child = spawn(launchSpec.command, launchSpec.args, {
  cwd: cwd ?? defaultCwd,
  env: launchSpec.env, // inherits parent env but NOT the RPC connection
  shell: false,
  stdio: ["ignore", "pipe", "pipe"],
});
```

The spawned `sf` CLI opens its **own** SQLite connection — by default `~/.sf/sf.db` (global). Project DB access happens only through **prompt injection** (`system-context.js` assembles project context into the system prompt).

The 4-tool registry (`subagent`, `await_subagent`, `cancel_subagent`, background job tools) is enforced by the extension manifest's `tools[]` array, not by DB permissions.

### The access contract for constrained dispatch

```ts
// In dispatch-layer.js — formalize the access contract
const DISPATCH_DB_ACCESS = {
  full: {
    read: ['project .sf/sf.db', 'global ~/.sf/sf.db'],
    write: ['project .sf/sf.db', 'global ~/.sf/sf.db'],
  },
  constrained: {
    read: ['project context via prompt injection only'],
    write: ['global ~/.sf/sf.db only'],
  }
} as const;
```

### Future: RPC-mode subagent (constrained + managed)

Today: `isolation:constrained, coordination:standalone`

New mode: `isolation:constrained, coordination:managed`

A subagent with `coordination:'managed'` gets an `AgentInbox` but still cannot write to the project DB. This enables long-running subagents to receive pause/stop messages from the coordinator without gaining DB write access.

---

## Q4: Coordinator Pattern with Coordinators on MessageBus

### Answer: Coordinators ARE MessageBus agents. Debate and chain work differently.

The coordinator is not a separate process — it is a **role** that a MessageBus agent plays when it initiates dispatch and monitors replies.

### Two coordinator patterns

**Pattern A — UOK Kernel as Coordinator (full-tool, managed, milestone/slice scope)**

The UOK kernel initializes a `DispatchLayer` with `coordination:'managed'`. Each worker has an `AgentInbox` named `dispatch:milestone:<mid>`. The kernel subscribes to all worker inboxes via the shared `MessageBus`.

```
UOK Kernel (coordinatorId: "dispatch:coordinator:<runId>")
│
├── MessageBus.send("dispatch:coordinator:<runId>", "dispatch:milestone:M01", { type: 'coordinator.status_request' })
├── MessageBus.send("dispatch:coordinator:<runId>", "dispatch:milestone:M02", { type: 'coordinator.status_request' })
│
│   Workers each poll their AgentInbox:
│     AgentInbox("dispatch:milestone:M01").list()
│     AgentInbox("dispatch:milestone:M02").list()
│   ...and reply to the coordinator with worker.* messages
│   (worker.started, worker.heartbeat, worker.completed)
│
├── UOK kernel processes milestone completions
└── Calls dispatchLayer.stop() on autonomous loop exit
```

**Pattern B — Subagent Tool as Coordinator (constrained, standalone or managed, inline scope)**

The subagent tool itself is the coordinator for its `parallel`, `debate`, and `chain` modes. It does **not** use MessageBus for coordination in `standalone` mode — it uses `Promise.all` + in-process event streaming.

In `managed` mode (new), the subagent tool would also have an `AgentInbox` so the parent TUI session can send it pause/stop messages.

### How debate mode works (subagent tool, NOT using MessageBus for agent coordination)

The subagent tool's debate mode (`subagent/index.js:executeSubagentInvocation()`, line 320) runs multiple agents **sequentially within a single process** using `mapWithConcurrencyLimit(MAX_CONCURRENCY=4)`:

```ts
// subagent/index.js — debate mode
for (let round = 1; round <= rounds; round++) {
  for (let i = 0; i < batchTasks.length; i++) {
    // buildDebatePrompt() injects prior round transcripts
    const prompt = buildDebatePrompt(task, round, transcriptEntries.join("\n\n"));
    const result = await runSingleAgent(..., prompt, ...);
    debateResults[(round - 1) * batchTasks.length + i] = result;
    transcriptEntries.push(formatResult(result));
  }
}
```

This is **not** MessageBus-based because:

1. Agents in a debate share a **single conversation transcript** — they must run sequentially and pass state through the coordinator's memory
2. True process-level parallelism would require separate conversation contexts, which breaks the shared-transcript model
3. The coordinator IS the single-agent orchestrator that sequences rounds and injects transcripts
### How chain mode works (subagent tool)

Chain mode (`subagent/index.js:executeSubagentInvocation()`, line 220) runs steps **sequentially**, passing output as `{previous}`:

```ts
let previousOutput = "";
for (let i = 0; i < params.chain.length; i++) {
  const step = params.chain[i];
  const taskWithContext = step.task.replace(/\{previous\}/g, previousOutput);
  const result = await runSingleAgent(..., taskWithContext, ...);
  results.push(result);
  previousOutput = getFinalOutput(result.messages);
}
```

**Chain does NOT need MessageBus** — it's purely sequential, and the coordinator (subagent tool) holds state in memory.
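For context, a chain invocation might look like the following; the parameter shape is illustrative, mirroring the `params.chain` structure in the loop above rather than quoting the real schema:

```ts
// Hypothetical invocation shape for chain mode.
const params = {
  mode: 'chain',
  chain: [
    { task: 'List every TODO comment under src/' },
    { task: 'Group these TODOs by subsystem:\n{previous}' },
    { task: 'Draft a cleanup plan from this grouping:\n{previous}' },
  ],
};
```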
### When MessageBus IS used for coordinator ↔ workers

| Mode | MessageBus needed? | Why |
|------|--------------------|-----|
| `single` (inline) | No | One agent, no coordination needed |
| `parallel` (inline, subagent) | No | Coordinator uses `Promise.all`, in-process |
| `parallel` (milestone/slice, WorktreeOrchestrator) | **Yes** | Workers are separate processes; coordinator needs durable signaling |
| `debate` | No | Sequential rounds with in-memory transcript; process-level parallelism defeats shared context |
| `chain` | No | Purely sequential with in-memory `{previous}` injection |
| `managed` subagent | **Yes** | Parent TUI needs to send pause/stop to long-running subagent |
### The coordinator IS a MessageBus agent

```ts
// In DispatchLayer constructor
this.coordinatorInbox = this.bus.getOrCreateInbox(`dispatch:coordinator:${runId}`);

// When a worker wants to send to the coordinator:
this.bus.send(workerId, coordinatorId, { type: 'worker.completed', ... });

// When coordinator wants to send to a worker:
this.bus.send(coordinatorId, workerId, { type: 'coordinator.pause' });
```
---
## Q5: Migration from Today to Unified System

### Principle: Never break existing workflows. Build the new system alongside the old.

### Migration strategy: Strangler Fig

The old dispatch mechanisms are replaced one at a time, with the unified `DispatchLayer` absorbing their responsibilities. External behavior (CLI flags, command handlers, dashboard output) stays identical throughout.
### Phase 1 — Extract DispatchLayer (week 1-2)

**Goal**: Create `dispatch-layer.js` without changing behavior.

```ts
// New: src/resources/extensions/sf/dispatch-layer.js
// Internal implementation: delegates to existing parallel-orchestrator.js
// Public API: the DispatchOptions interface above

// parallel-orchestrator.js becomes a thin wrapper:
export async function startParallel(basePath, milestoneIds, prefs) {
  const layer = new DispatchLayer(basePath);
  return layer.dispatchMilestones(milestoneIds, {
    isolation: 'full',
    coordination: 'standalone',
    scope: 'milestone',
    mode: 'parallel',
    ...prefs,
  });
}
```

**Files touched**:
- New: `src/resources/extensions/sf/dispatch-layer.js` (~600 LOC, merged from both orchestrators)
- `parallel-orchestrator.js` — refactored to delegate to DispatchLayer
- `slice-parallel-orchestrator.js` — refactored to delegate to DispatchLayer

**Test**: All parallel and slice-parallel tests pass. Dashboard shows same worker states.
### Phase 2 — Wire MessageBus for coordinator → worker signaling (week 2-3)

**Goal**: Workers get `AgentInbox`, coordinator sends pause/resume/stop via MessageBus.

```ts
// In dispatch-layer.js — start() method
async start(ids: string[], opts: DispatchOptions) {
  // For each worker, create AgentInbox
  for (const id of ids) {
    const workerId = `dispatch:${opts.scope}:${id}`;
    this.bus.getOrCreateInbox(workerId);
    // ... spawn worker with MessageBus integration
  }
}

// Worker bootstrap (sf headless autonomous process)
// On each dispatch tick:
const inbox = dispatchLayer.bus.getInbox(workerId);
for (const msg of inbox.list(true)) {
  handleCoordinatorMessage(msg.body);
  inbox.markRead(msg.id);
}
```

**Files touched**:
- `dispatch-layer.js` — add MessageBus send on start/pause/resume/stop
- Worker NDJSON bootstrap in `parallel-orchestrator.js` and `slice-parallel-orchestrator.js` — add inbox polling loop

**Test**: Workers respond to MessageBus pause/resume messages. File-based IPC (`session-status-io.js`) still works as crash-recovery fallback.
### Phase 3 — Subagent gets optional MessageBus inbox (week 3)

**Goal**: Subagents can opt in to `coordination:'managed'` for long-running tasks.

```ts
// In subagent tool params — new field
const SubagentParams = Type.Object({
  // ... existing fields ...
  managed: Type.Optional(Type.Boolean({
    description: 'Give this subagent a MessageBus AgentInbox so the coordinator can send pause/stop messages.',
    default: false,
  })),
});
```

**Files touched**:
- `extensions/subagent/index.js` — add `managed` parameter
- When `managed: true`, spawn `sf headless` (not `sf` CLI) so it can receive MessageBus messages

**Test**: Long-running subagent receives coordinator pause message via MessageBus.
### Phase 4 — UOK kernel adopts DispatchLayer (week 4)

**Goal**: UOK kernel calls `DispatchLayer` instead of directly managing parallel workers.

Today: `uok/kernel.js` calls `parallel-orchestrator.js` via separate import.

After: `uok/kernel.js` calls `DispatchLayer`, which owns the worktree pool.

**Files touched**:
- `uok/kernel.js` — replace `import { startParallel } from '../parallel-orchestrator.js'` with `import { DispatchLayer }`
- `dispatch-layer.js` — add `useExecutionGraph` integration so UOK kernel's dispatch decisions use the file-conflict DAG

**Test**: `sf headless autonomous` with parallel milestones works identically to current behavior.
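A minimal before/after sketch of the kernel call site, reusing the Phase 1 wrapper names (`startParallel`, `dispatchMilestones`) rather than quoting `uok/kernel.js` verbatim; `basePath`, `eligibleMilestoneIds`, and `prefs` are assumed inputs:

```ts
// Before (today): kernel drives the orchestrator directly.
import { startParallel } from '../parallel-orchestrator.js';
await startParallel(basePath, eligibleMilestoneIds, prefs);

// After (Phase 4): kernel hands the same request to the unified layer.
import { DispatchLayer } from '../dispatch-layer.js';
const layer = new DispatchLayer(basePath);
await layer.dispatchMilestones(eligibleMilestoneIds, {
  isolation: 'full',
  coordination: 'managed',  // kernel subscribes to worker inboxes
  scope: 'milestone',
  mode: 'parallel',
  useExecutionGraph: true,  // file-conflict DAG gates the batch
});
```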
### Phase 5 — Deprecate file-based IPC paths (week 5-6)

**Goal**: MessageBus becomes primary, file-based IPC becomes pure fallback.

After Phase 2, the file-based status files (`session-status-io.js`) are still written by workers for crash recovery. After Phase 5, the coordinator **stops reading** status files on the primary path — it only reads them on startup if MessageBus has no worker state.

**Files touched**:
- `dispatch-layer.js` — change `refreshWorkerStatuses()` to read from MessageBus first, falling back to `session-status-io.js` only if no MessageBus state is found
- `session-status-io.js` — keep as crash-recovery-only, mark as `@deprecated`

**Test**: Crash recovery still works. Primary path uses MessageBus.
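A sketch of that fallback order, in the class-excerpt style of the Phase 2 snippet; `WorkerStatus`, `readBusStatuses()`, and `readStatusFiles()` are hypothetical names standing in for internals the plan does not pin down:

```ts
// Hypothetical Phase 5 shape of refreshWorkerStatuses() (class excerpt).
async refreshWorkerStatuses(): Promise<Map<string, WorkerStatus>> {
  // Primary path: durable worker status messages on the MessageBus.
  const fromBus = await this.readBusStatuses();
  if (fromBus.size > 0) return fromBus;

  // Fallback (startup / crash recovery only): legacy status files.
  return this.readStatusFiles(); // session-status-io.js, @deprecated
}
```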
### Phase 6 — Subagent RPC mode (week 6-7)

**Goal**: Constrained subagents can gain full tool access by running as a headless RPC client.

Today, a constrained subagent spawns the `sf` CLI (the full binary). The new mode spawns `sf headless` as an RPC client.

```ts
// In dispatch-layer.js
async dispatch(opts: DispatchOptions): Promise<DispatchResult> {
  if (opts.isolation === 'constrained' && opts.rpcMode) {
    return this.dispatchAsRpcClient(opts); // calls sf headless RPC
  }
  // ...
}
```

**Files touched**:
- `extensions/subagent/index.js` — add `rpcMode: true` path
- New: `extensions/subagent/rpc-client.js`

**Test**: Subagent with `rpcMode: true` can call `complete-task`.
### Phase 7 — Naming cleanup (week 7)

**Goal**: Reflect the unified model in file names.

- `dispatch-layer.js` → `worktree-orchestrator.js` (or keep `dispatch-layer.js` if preferred — the name is already clear)
- Update all import paths

---
## Q6: Implementation Order

### The correct build sequence (not the same as migration order)

The key constraint: **the subagent tool must stay stable** throughout, because it is the primary user-facing tool. The UOK kernel and parallel orchestrator are internal — they can change more aggressively.

### Order:

```
1. Extract DispatchLayer (week 1-2)
   → No behavior change. Parallel/slice orchestrators delegate to it.
   → Test: all existing parallel/slice tests pass.

2. Subagent RPC mode (week 3)
   → Most impactful user-facing improvement.
   → Subagent with RPC mode can call complete-task and other SF tools.
   → Isolated from the rest of the refactor — subagent is its own path.

3. MessageBus wiring in DispatchLayer (week 4)
   → Coordinator → workers via MessageBus, not file IPC.
   → Worker bootstrap gets inbox polling.
   → File-based IPC becomes fallback only.

4. UOK kernel adopts DispatchLayer (week 5)
   → UOK kernel is internal. Breaks less if changed late.
   → After this, the UOK kernel is the coordinator for autonomous mode.

5. Subagent managed mode (week 6)
   → Optional MessageBus inbox for subagents.
   → Parent TUI can pause/stop long-running subagents.

6. Deprecate file-based IPC (week 7)
   → MessageBus becomes the only primary path.
   → session-status-io.js kept for crash recovery only.

7. Naming cleanup + Cmux decoupling (week 8)
   → Remove cmuxSplitsEnabled coupling from subagent tool.
   → Cmux subscribes to MessageBus dispatch events.
```
### Why this order

1. **Subagent RPC mode first** — highest user impact with lowest risk. Subagent is a separate dispatch path; changes don't affect the parallel orchestrator or UOK kernel.
2. **MessageBus wiring before UOK kernel adoption** — the UOK kernel is the most complex consumer. We want MessageBus as the backbone *before* we hook the UOK kernel to it.
3. **UOK kernel adoption late** — it's internal infrastructure. Changing it last means we've already validated the DispatchLayer API in production-like conditions (parallel orchestrator + subagent).
4. **Cmux decoupling last** — it's UI, not dispatch. It can follow once the dispatch architecture is stable.

---
## Key Code References

### Parallel Orchestrator
- `src/resources/extensions/sf/parallel-orchestrator.js` — 820 lines, manages milestone workers
- `startParallel()` — spawns `sf headless --json autonomous` in worktrees
- `spawnWorker()` — sets `SF_MILESTONE_LOCK`, `SF_PROJECT_ROOT`, `SF_PARALLEL_WORKER` env vars
- `processWorkerLine()` — parses NDJSON, extracts cost from `message_end` events
- `refreshWorkerStatuses()` — polls `session-status-io.js` for worker state

### Slice Parallel Orchestrator
- `src/resources/extensions/sf/slice-parallel-orchestrator.js` — 90% identical to parallel-orchestrator.js
- Key diff: sets `SF_SLICE_LOCK` + `SF_MILESTONE_LOCK` env vars
- Calls `filterConflictingSlices()` from `slice-parallel-conflict.ts`

### Session Status (File-based IPC)
- `src/resources/extensions/sf/session-status-io.js` — 150 lines
- `writeSessionStatus()` — atomic write to `.sf/parallel/<mid>.status.json`
- `sendSignal()` / `consumeSignal()` — pause/resume/stop via `.sf/parallel/<mid>.signal.json`

### Subagent Tool
- `src/resources/extensions/subagent/index.js` — 2700 lines
- `runSingleAgent()` — spawns `sf` CLI, parses NDJSON events
- `executeSubagentInvocation()` — handles single/parallel/debate/chain modes
- `mapWithConcurrencyLimit()` — in-process concurrency for parallel/debate modes
- No DB tools registered (only 4 tools in extension manifest)

### Subagent Inheritance (DB Access Contract)
- `src/resources/extensions/sf/subagent-inheritance.js` — 220 lines
- `buildSubagentInheritanceEnvelope()` — captures parent mode for subagent dispatch
- `validateSubagentDispatch()` — rejects subagents that bypass provider allowlists

### MessageBus
- `src/resources/extensions/sf/uok/message-bus.js` — 280 lines
- `MessageBus.send()` — SQLite-backed durable send to AgentInbox
- `AgentInbox` — per-agent durable inbox with TTL and retention

### UOK Kernel
- `src/resources/extensions/sf/uok/kernel.js` — 220 lines
- `runAutoLoopWithUok()` — the autonomous loop entry point
- Currently calls `parallel-orchestrator.js` separately, not through a unified dispatch layer

### Execution Graph (Constraint Solver)
- `src/resources/extensions/sf/uok/execution-graph.js`
- `selectConflictFreeBatch()` — picks a conflict-free parallel subset from the file-overlap DAG
- Already used by parallel-orchestrator; should be used by slice-parallel and the UOK kernel

### DB Schema
- `src/resources/extensions/sf/sf-db.js` — single-writer invariant
- `milestones` table — `id, title, status, created_at, ...`
- `slices` table — `milestone_id, id, title, status, ...`
- `tasks` table — `milestone_id, slice_id, id, status, ...`
- `milestone_specs`, `slice_specs`, `task_specs` — immutable spec records
- `milestone_evidence`, `slice_evidence`, `task_evidence` — append-only audit trail

### Parallel Intent (File Claim Registry)
- `src/resources/extensions/sf/parallel-intent.js` — 170 lines
- `declareIntent()` — worker announces file intent before editing
- Uses `UokCoordinationStore` (Redis-like on SQLite) for TTL-based claims

---
## Summary

**Q1 — Unified API**: One `dispatch()` function with four parameters: `isolation × coordination × scope × mode`. The current 5 mechanisms collapse into one `DispatchLayer` class.

**Q2 — MessageBus backbone**: YES. All coordinator ↔ worker communication flows through MessageBus. File-based IPC (`session-status-io.js`) becomes a crash-recovery fallback only.

**Q3 — DB access matrix**: `isolation:full` → project DB read/write. `isolation:constrained` → no project DB writes, reads via prompt injection only. Global DB always accessible. Enforced by the process boundary (spawned CLI) and the extension manifest's tools array.

**Q4 — Coordinator pattern**: Coordinators ARE MessageBus agents. The UOK kernel gets a `DispatchLayer` coordinator inbox. Debate/chain modes do NOT use MessageBus — they are sequential in-memory coordination. Subagent parallel mode is also in-process (`Promise.all`). MessageBus is for **cross-process** coordination only.

**Q5 — Migration**: Strangler Fig pattern. Extract `DispatchLayer` first (no behavior change). Then wire MessageBus. Then the UOK kernel adopts it. Subagent RPC mode is independent and can ship first.

**Q6 — Implementation order**:
1. Subagent RPC mode (highest impact, lowest risk)
2. Extract `DispatchLayer` (foundational, no behavior change)
3. Wire MessageBus into `DispatchLayer`
4. UOK kernel adopts `DispatchLayer`
5. Subagent managed mode
6. Deprecate file-based IPC
7. Cmux decoupling + naming cleanup

**Total: ~8 weeks**, sequenced to never break existing workflows.
379
docs/plans/dispatch-orchestration-architecture.md
Normal file
# Dispatch/Orchestration Architecture — Consolidation Plan

**Author:** Research synthesis
**Date:** 2026-05-08
**Status:** Draft — for review and promotion

---
## 1. Root Cause Diagnosis — Why Did This Proliferation Happen?

The five dispatch mechanisms plus one message bus grew to fill genuine gaps, not from poor design. But the structural symptom is the same as in every system that accumulates dispatch primitives without a unifying abstraction: **there is no single concept that unifies them.**

Each addition was driven by a real gap at a different time:

| Mechanism | Gap filled | Structural symptom |
|---|---|---|
| **subagent tool** (`extensions/subagent/index.js`) | Ad-hoc delegation from within a TUI/headless session | First-class spawning of a full CLI process via `spawn()`; only 4 tools registered; no DB tools |
| **parallel-orchestrator** (`parallel-orchestrator.js`) | True parallel milestone execution with git worktree isolation | Mirrors subagent's `spawn` pattern but at milestone scope, with session status files, cost accumulation, and file-intent tracking |
| **slice-parallel-orchestrator** (`slice-parallel-orchestrator.js`) | Slice-level parallelism within a milestone | Copy-paste of parallel-orchestrator with the scope changed; ~90% identical code |
| **UOK kernel** (`uok/kernel.js`) | Deterministic autonomous loop with gates, observability, parity reporting | Grew into the central orchestration engine but does not subsume the dispatch primitives below it |
| **MessageBus** (`uok/message-bus.js`) | Durable SQLite-backed inter-agent messaging for multi-agent coordination | Modeled on Letta's SQLite-backed messaging; lives in UOK but is not wired into the subagent or parallel-orchestrator dispatch paths |
| **Cmux** (`cmux/index.js`) | RPC multiplexing and terminal surface integration | Orthogonal to dispatch — a UI/surface concern, not an orchestration concern |
### The Concretion

**Three missing abstractions drove the proliferation** (a sketch of the third follows this list):

1. **No unified "dispatch context"** — subagent, parallel-orchestrator, and UOK each create their own notion of "what am I running and with what environment." The result is three different spawn patterns, three different ways of tracking cost, and no shared vocabulary.

2. **No shared dispatch registry** — there is no single place that tracks "what is currently running" across all parallelism dimensions. The parallel orchestrator tracks milestone workers via session status files; the slice-parallel orchestrator tracks slice workers separately; subagent tracks spawned processes in a `Set`. These are not unified.

3. **No first-class "work unit" concept** — milestone, slice, and task are different tables with different lock semantics, not different states of the same work unit. This is why the slice-parallel orchestrator had to be a near-total copy of the milestone orchestrator rather than a parameterization.
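A minimal sketch of what a first-class work unit could look like; the type and field names are hypothetical, not current SF code, and only the lock env-var names are taken from the orchestrators described above:

```ts
// Hypothetical unification of milestones, slices, and tasks.
interface WorkUnit {
  scope: 'milestone' | 'slice' | 'task';
  id: string;
  parentId?: string;                    // slice → milestone, task → slice
  status: 'pending' | 'running' | 'done' | 'failed';
  lockEnvVars: Record<string, string>;  // e.g. SF_MILESTONE_LOCK / SF_SLICE_LOCK
}

// One orchestrator parameterized by scope, instead of one copy per table.
```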
**The UOK kernel was designed as a single-agent loop.** It runs inside the headless process and manages one autonomous run. It does not know about sibling workers, does not coordinate with the parallel orchestrator, and does not have a model for "I am one of N workers running concurrently."

**Subagent tool was never designed to integrate with SF's state.** It spawns the `sf` CLI, which is a full binary with its own extension registration. It cannot call SF tools like `complete-task` or `plan-slice` because those are registered in the headless RPC path, not in the subagent's spawned CLI context. The 4 registered tools are intentionally narrow to avoid dangerous nested dispatch.

**MessageBus was designed for persistent agents, but SF doesn't have persistent agents yet.** The Letta-style inbox model is architecturally correct but premature — you need durable named agents before durable named inboxes matter. Today the MessageBus is used for UOK internal observer chains but not for real multi-agent coordination.
### The `adversarial_partner/combatant/architect` Fields

These DB fields (in the `slices` table, `sf-db.js`) are **planning ceremony fields**, not dispatch mechanism fields. They belong in the PDD planning layer and are rendered in `markdown-renderer.js` and `workflow-projections.js` as "Partner Review", "Combatant Review", and "Architect Review" sections in slice output. They have nothing to do with the dispatch layer — they are populated by planning tools, not by dispatch.

---
## 2. What Should Stay vs Merge

### Stay (genuinely different concerns)

| Mechanism | Reason to Keep |
|-----------|----------------|
| **subagent tool** (`extensions/subagent/index.js`) | Ad-hoc in-session delegation. The 4-tool surface (`subagent`, `await_subagent`, `cancel_subagent`, and the `/subagent` command) is the right interface for human-in-the-loop or autonomous session agents that need to spin up a helper without leaving their context. The restriction to only those 4 tools is intentional and correct. The subagent spawns `sf --mode json` (not `sf headless`), which is correct for its shorter-lived, interactive nature. |
| **UOK kernel** (`uok/kernel.js`, `uok/index.js`) | The deterministic autonomous loop with gate evaluation, parity reporting, audit envelopes, and run-control policy. This is the **controller** in the architecture sense. It decides what to run next; it does not implement how to run it. The `runAutoLoopWithUok` function is correctly scoped. |
| **MessageBus** (`uok/message-bus.js`) | Durable SQLite-backed inter-agent messaging. The `send`, `broadcast`, `sendOnce`, `getConversation`, and `AgentInbox` primitives are genuinely useful for multi-agent coordination. The Letta-style design is sound. The problem is that it is not wired into the dispatch path — agents spawned by subagent or parallel-orchestrator cannot use it. |
| **Cmux** (`cmux/index.js`) | RPC multiplexing and terminal surface integration. Orthogonal to dispatch — a UI/surface concern, not an orchestration concern. Correctly scoped as a UI/shell concern. |
| **Execution graph** (`uok/execution-graph.js`) | The file-conflict DAG that computes which milestones/slices can run in parallel. This is the **constraint solver** — it knows about file overlaps but not about process lifecycle. |
| **CoordinationStore** (`uok/coordination-store.js`) | Redis-like primitives (TTL KV, streams, lease-based queues) on SQLite. The right building block for durable background coordination without a server process. |

### Merge (duplication with no semantic difference)

| Duplicated | Problem | Resolution |
|------------|---------|------------|
| `parallel-orchestrator.js` + `slice-parallel-orchestrator.js` | ~90% identical code. The only meaningful differences: scope (milestone vs slice), lock env vars (`SF_MILESTONE_LOCK` vs `SF_SLICE_LOCK` + `SF_MILESTONE_LOCK`), and status file naming (`milestoneId` vs `milestoneId/sliceId`). The conflict detection, worktree management, worker lifecycle, NDJSON parsing, and cost tracking are copy-pasted. | **Merge into a single `WorktreeOrchestrator` class** parameterized by `{ scope: 'milestone' \| 'slice', milestoneId, sliceId? }`. The conflict-filtering logic already lives in `slice-parallel-conflict.ts` and `selectConflictFreeBatch` in `execution-graph.js` — these stay separate as the constraint layer. |

### Refactor (same need, wrong implementation)

| Current | Issue | Refactor |
|---------|-------|----------|
| **subagent spawning `sf` CLI** | The subagent tool spawns the `sf` CLI as a full binary. The 4-tool limitation is enforced by not registering other tools, not by a principled access model. | Keep spawning the `sf` CLI for security isolation, but formalize the access contract explicitly. See section 5. |
| **parallel-orchestrator + slice-parallel using file-based IPC** | Workers coordinate via `session-status-io.js` (filesystem polling) and `sendSignal`. This is a hand-rolled IPC layer. The filesystem polling is correct but fragile. | Replace with MessageBus-based coordination. Workers publish status to MessageBus; the coordinator subscribes. See section 3. |

---
## 3. Streamlined Architecture — The Unified Dispatch Layer

### Three-tier conceptual model

```
┌───────────────────────────────────────────────────────────────┐
│ UOK Kernel (controller)                                       │
│ Decides WHAT to run next; enforces gates, policy, parity      │
│ - Phase machine: Discuss → Plan → Execute → Merge → Complete  │
│ - Calls DispatchLayer.dispatch() to execute                   │
└─────────────────────────────┬─────────────────────────────────┘
                              │ DispatchEnvelope { scope, unitId, ... }
                              ▼
┌───────────────────────────────────────────────────────────────┐
│ DispatchLayer (mechanism)                                     │
│ Decides HOW to run: worktree? process? in-process?            │
│ - Worktree pool (git worktree per milestone/slice)            │
│ - Process registry (child_process per worker)                 │
│ - Budget accumulator (cost tracking via NDJSON parsing)       │
│ - File-intent tracker (parallel-intent.js)                    │
│ - AgentInbox per worker (MessageBus integration)              │
└─────────────────────────────┬─────────────────────────────────┘
                              │ spawns
                              ▼
┌───────────────────────────────────────────────────────────────┐
│ Worker (execution unit)                                       │
│ `sf headless --json autonomous` in a worktree                 │
│ - Owns SQLite WAL connection to project DB                    │
│ - Has AgentInbox for MessageBus delivery                      │
│ - Emits NDJSON events consumed by DispatchLayer               │
└───────────────────────────────────────────────────────────────┘
```
### Subagent tool relationship to DispatchLayer

The subagent tool and DispatchLayer serve **different dispatch scopes**:

- **subagent tool**: in-session, ad-hoc, short-lived. The subagent is a separate `sf` CLI process spawned from within a running session, and its output is returned to the caller synchronously. It is **not** managed by the DispatchLayer's worktree pool or budget tracking. It spawns `sf --mode json` (not `sf headless`), which is correct for its interactive nature.

- **DispatchLayer**: autonomous, long-running, milestone/slice scoped. Workers are spawned and tracked by DispatchLayer; they emit cost events back to the layer; they share the project DB via WAL.

These two paths should remain separate but use the **same worker bootstrap** (`sf headless --json autonomous`).
### DispatchLayer interface (proposed)

```ts
// lives in: src/resources/extensions/sf/dispatch-layer.js

interface DispatchOptions {
  scope: 'milestone' | 'slice';
  milestoneId: string;
  sliceId?: string;
  basePath: string;
  maxWorkers?: number;
  budgetCeiling?: number;
  workerTimeoutMs?: number;
  shellWrapper?: string[];
  useExecutionGraph?: boolean;
}

class DispatchLayer {
  // Returns eligible units filtered by execution-graph conflicts
  async prepare(opts: DispatchOptions): Promise<PrepareResult>;

  // Start workers for given unit IDs
  async start(ids: string[], opts: DispatchOptions): Promise<StartResult>;

  // Stop all or specific workers
  async stop(ids?: string[]): Promise<void>;

  // Pause/resume
  pause(ids?: string[]): void;
  resume(ids?: string[]): void;

  // Read current state (for dashboard)
  getStatus(): DispatchStatus;

  // Shared MessageBus instance
  readonly bus: MessageBus;

  // Budget
  totalCost(): number;
  isBudgetExceeded(): boolean;
}
```
### How UOK kernel uses DispatchLayer

Today, `uok/kernel.js` runs the autonomous loop and calls into tools like `execute_task` which eventually spawn agents. The parallel orchestrator is started separately by the TUI dashboard or headless command. After unification (a sketch of the resulting loop follows this list):

1. UOK kernel initializes `DispatchLayer` at autonomous loop start
2. UOK calls `dispatchLayer.start(eligibleMilestoneIds)` for parallel milestones
3. Workers emit NDJSON events → DispatchLayer parses cost → updates budget
4. Workers emit completion → UOK kernel processes post-unit staging
5. Workers can receive messages via their `AgentInbox` (MessageBus integration)
6. `DispatchLayer.stop()` is called on autonomous loop exit
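A compressed sketch of that loop against the proposed interface; gate evaluation and error handling are elided, and `eligibleMilestoneIds`, `opts`, `sleep`, and the `running` status field are assumed inputs and names rather than confirmed API:

```ts
// Hypothetical autonomous-loop skeleton using the proposed DispatchLayer API.
const layer = new DispatchLayer(basePath);
await layer.start(eligibleMilestoneIds, opts); // opts: DispatchOptions

while (layer.getStatus().running > 0) {        // field name assumed
  if (layer.isBudgetExceeded()) break;         // hard budget gate
  await sleep(1000);                           // dashboard tick
}

await layer.stop();                            // cleanup on loop exit
```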
---
## 4. Multi-Dimensional Parallelism

### The axes of parallelism

| Axis | Mechanism | Status |
|---|---|---|
| **Inter-project** | Multiple `sf` invocations (manual or CI) | ✅ not SF's concern |
| **Inter-milestone** | DispatchLayer + worktrees | ✅ currently via parallel-orchestrator |
| **Inter-slice** | DispatchLayer + worktrees | ✅ currently via slice-parallel-orchestrator |
| **Inter-task** (in-process) | subagent `parallel` mode | ✅ implemented (`mapWithConcurrencyLimit`) |
| **Inter-agent** (debate/chain) | subagent `debate`/`chain` mode | ✅ implemented |
| **Terminal-level** | Cmux grid layout for parallel agents | ✅ implemented |
### What "true concurrency" means

The current architecture already achieves true process-level concurrency via worktrees and separate `sf headless` processes. The shared SQLite WAL means all workers can read the same DB concurrently — WAL allows concurrent readers with a single writer.

**What is missing is not more parallelism axes but coordinated dispatch:**

- The execution graph (`uok/execution-graph.js`) already computes file-conflict relationships between milestones and slices
- `selectConflictFreeBatch` picks a conflict-free subset for parallel dispatch
- But this is only wired into parallel-orchestrator, not into the slice-parallel path or the UOK autonomous loop's dispatch decisions
### Proposed coordination model

The execution graph is the **source of truth for parallelism constraints**. The DispatchLayer is the **enforcer**. The UOK kernel is the **policy layer**:

```
Execution Graph (file-conflict DAG)
│
├── selectConflictFreeBatch() ──► DispatchLayer.start()
│                                 Workers run in parallel
│                                 Each worker has AgentInbox
│
UOK kernel
│
├── reads unit readiness from DB
├── calls DispatchLayer.start(milestoneIds)
└── calls DispatchLayer.start(sliceIds) for intra-milestone parallelism
```

**Debate mode** (subagent tool): runs multiple agents sequentially within a single process using `mapWithConcurrencyLimit`. This is **not** true process-level parallelism, but it is correct for LLM-based debate, where shared context and a single conversation transcript are needed. The Cmux grid layout provides terminal-level parallelism for these agents via split panes.

**Chain mode**: purely sequential — each step's output feeds into the next step's prompt. No parallelism needed here.

---
## 5. DB Access from Subagents

### The current model

Subagents spawn the `sf` CLI as a **separate process** with its own environment. The inheritance envelope (`subagent-inheritance.js`) propagates preferences, but the subagent's `sf` process opens its own SQLite connection to `~/.sf/sf.db` (global state) or `.sf/sf.db` (project state). This is **correct isolation** — a subagent should not write to the project DB directly.

### The constraint is intentional

The subagent tool **cannot** call `complete-task` or `plan-slice` — not because those tools don't exist in the subagent's tool registry, but because:

1. Only 4 tools are registered in the subagent extension manifest (`subagent`, `await_subagent`, `cancel_subagent`, and the `/subagent` command)
2. The subagent is meant to be a **task executor**, not a **state mutator**

If a subagent could call `complete-task`, it could mark tasks done without the coordinator's knowledge, corrupting the UOK state machine.
### The right model: two-tier DB access

```
Coordinator (UOK kernel) ──► project .sf/sf.db (WAL mode)
                             milestone/slice state
                             task execution ledger

Subagent (sf process)    ──► ~/.sf/sf.db (global)
                             memories, preferences
                             agent-level state
                             ✗ project .sf/sf.db
```

The subagent can read from the project DB for context (via system prompt injection), but writes only to global state. The `inheritanceEnvelope` already controls what context the subagent receives.

**Exception**: The `sf` CLI that runs as a DispatchLayer worker (`sf headless --json autonomous`) is a different mode — it IS the coordinator for its worktree's scope and SHOULD write to the project DB. This is already how it works (workers open `.sf/sf.db` in the worktree, which syncs from the project root via `syncSfStateToWorktree`).
### What subagents CAN do with the DB

- Read project state via **prompt injection** (system context assembly already does this)
- Write to global `~/.sf/sf.db` for their own memories and preferences
- **NOT** write to the project `.sf/sf.db`

If a subagent needs to record a finding that the coordinator should see, the right pattern is (sketched below):

1. Subagent writes to its output (stdout/file)
2. Coordinator reads and processes the output
3. Coordinator calls DB tools

This is the same pattern as Letta agents — agents return results, the orchestrator decides what to persist.
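A schematic of that hand-off; `runSubagent`, `parseFindings`, and `recordEvidence` are hypothetical names standing in for the real subagent invocation and DB tools:

```ts
// Hypothetical coordinator-side persistence of a subagent finding.
const output = await runSubagent({ task: 'Scan for unused exports' });

// The subagent only returned text — it never touched .sf/sf.db.
const findings = parseFindings(output.stdout);

// The coordinator, which owns project DB access, decides what to persist.
for (const finding of findings) {
  await recordEvidence({ scope: 'task', body: finding }); // DB tool call
}
```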
### Architectural backing for the constraint

The "no DB tools for subagents" constraint should be backed by a **principled access model**, not just "we didn't register those tools." Proposed:

```ts
// In subagent tool — formalize the access contract
const SUBAGENT_DB_ACCESS = {
  read: ['project_context'],    // via prompt injection only
  write: ['~/.sf/sf.db'],       // global state only
  prohibited: ['project .sf/sf.db write operations'],
};
```

The extension manifest's `tools[]` array currently enforces this by omission. A more explicit model would declare the access contract formally, making it auditable.
---
## 6. Naming — What Should the Mental Model Be?

The names are confusing because they mix three different layers of abstraction. Proposed renaming:

| Current name | Proposed name | Reason |
|---|---|---|
| `parallel-orchestrator.js` | `milestone-dispatcher.js` | Describes scope + role |
| `slice-parallel-orchestrator.js` | `slice-dispatcher.js` | Scope + role; merges into unified DispatchLayer |
| `DispatchLayer` (new) | `dispatch-layer.js` | The unified class |
| `uok/kernel.js` | keep as-is | Kernel is the right metaphor for the controller |
| `MessageBus` | keep as-is | Standard pattern name |
| `Cmux` | keep as-is | Product name for terminal multiplexing |
| `subagent tool` | keep as-is | The user-facing tool name |

**Mental model:**

- **Controller** = UOK kernel (deterministic policy, what to run)
- **Dispatcher** = DispatchLayer (mechanism, how to run)
- **Workers** = `sf headless` processes in worktrees (the doing)
- **Inbox** = AgentInbox per worker (message receiving)
- **Bus** = MessageBus (durable inter-agent messaging)
- **Subagent tool** = in-session ad-hoc delegation (separate from the DispatchLayer path)

The confusion arises because "orchestrator" suggests it controls both what and how. In a clean architecture, the orchestrator is the controller (what) and the dispatcher is the mechanism (how). Today, parallel-orchestrator does both, which is why it feels heavyweight and why slice-parallel-orchestrator had to be cloned to change scope.
---
## 7. Implementation Priority

### Phase 1: Eliminate duplication (lowest risk, highest clarity)

**1.1 — Merge parallel-orchestrator + slice-parallel-orchestrator**

Extract shared logic into a `DispatchLayer` class parameterized by scope. The slice orchestrator's conflict-filtering logic (`filterConflictingSlices`) already lives in `slice-parallel-conflict.ts` and stays there. The merged `dispatch-layer.js` calls it.

Test: both the `/parallel` command and slice-level parallelism continue to work identically. The parallel orchestrator dashboard continues to show milestone workers; slice-level parallelism shows slice workers.

File: new `src/resources/extensions/sf/dispatch-layer.js` (~400 LOC merged from both orchestrators).
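A sketch of the scope parameterization that makes the merge possible; the helper shape is hypothetical, and only the env-var names are taken from the existing orchestrators:

```ts
// Hypothetical core of the merged layer: one spawn path, scope-keyed locks.
type DispatchScope = 'milestone' | 'slice';

function lockEnv(scope: DispatchScope, milestoneId: string, sliceId?: string) {
  return scope === 'milestone'
    ? { SF_MILESTONE_LOCK: milestoneId }
    : { SF_MILESTONE_LOCK: milestoneId, SF_SLICE_LOCK: sliceId! };
}

// Both former orchestrators reduce to calls like:
//   spawnWorker({ env: { ...lockEnv('milestone', 'M01'), SF_PARALLEL_WORKER: '1' } })
//   spawnWorker({ env: { ...lockEnv('slice', 'M01', 'S02'), SF_PARALLEL_WORKER: '1' } })
```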
### Phase 2: Wire MessageBus into DispatchLayer

**2.1 — Add AgentInbox to each worker**

Every `sf headless` worker opens a `MessageBus` inbox named after its milestone/slice ID. The coordinator can send messages to workers (e.g., "pause", "resume", "report status").

**2.2 — Use MessageBus for coordinator → worker signaling**

Replace file-based IPC signals (`session-status-io.js`, `sendSignal`) with MessageBus `send()`. The file-based signals remain as a fallback for crash recovery; MessageBus gives durable at-least-once delivery.

Test: workers respond to coordinator pause/resume messages delivered via MessageBus instead of, or in addition to, file signals.
### Phase 3: UOK kernel adopts DispatchLayer

**3.1 — Replace direct parallel-orchestrator calls with DispatchLayer**

The autonomous loop's parallel dispatch path (`analyzeParallelEligibility` → `startParallel`) goes through DispatchLayer instead of calling parallel-orchestrator directly.

**3.2 — UOK reads worker status from DispatchLayer**

Dashboard refresh reads from `dispatchLayer.getStatus()` instead of directly from parallel-orchestrator's state.

File changes: `uok/kernel.js` imports `DispatchLayer`; `parallel-orchestrator.js` becomes a thin wrapper (or is removed if no other callers remain).
### Phase 4: Subagent tool gets optional MessageBus inbox

**4.1 — Allow subagent workers to opt in to MessageBus**

A subagent spawned with `useMessageBus: true` in its params gets an `AgentInbox` injected into its prompt context. This enables the subagent to receive coordinator messages during long-running tasks.

**Constraint**: the subagent still cannot write to the project DB. MessageBus read access does not change this.

Test: a long-running subagent receives a pause message from the coordinator via MessageBus.
### Phase 5: Naming cleanup (cosmetic but reduces confusion)

**5.1 — Rename `parallel-orchestrator.js` → `milestone-dispatcher.js`**
**5.2 — Rename `slice-parallel-orchestrator.js` → `slice-dispatcher.js`**

Update all import references.

**5.3 — Trim `uok/index.js` exports**

The `uok/index.js` barrel re-exports ~60 symbols from ~30 sub-modules. Some exports (e.g., skill functions, model policy functions) are used only by specific tools and do not belong in an orchestration kernel export. Move non-orchestration exports (skills, model policy, etc.) to their own barrels or remove them from the UOK public API.
---
## Summary

The 5 dispatch mechanisms + 1 message bus represent 3 genuinely different needs (UOK autonomous loop, worktree-based isolation, durable inter-agent messaging) and 2 duplications (parallel-orchestrator + slice-parallel-orchestrator; file-based IPC doing what MessageBus should do). The root cause is that dispatch, orchestration, and coordination evolved separately rather than being designed as layers of one system.

**The plan is to:**

1. Merge `parallel-orchestrator` + `slice-parallel-orchestrator` into a single `DispatchLayer` class
2. Wire MessageBus into DispatchLayer so workers become reachable via durable messaging (replacing file-based IPC)
3. UOK kernel becomes the controller that calls DispatchLayer, not a parallel system
4. Subagent tool stays separate — it's ad-hoc in-session delegation, not autonomous dispatch; formalize its DB access contract
5. Cmux stays orthogonal — it's surface integration, not dispatch

The DB access model is already correct: subagents run in their own process with their own DB connection and cannot write to the project state. Workers (dispatched via DispatchLayer) are the project's own agents and do have project DB write access.

The `adversarial_partner`/`adversarial_combatant`/`adversarial_architect` fields are **planning ceremony fields** (Letta-inspired) that belong in the PDD planning layer (slice/milestone planning), not in the dispatch layer. They are populated by planning tools and rendered in slice output. The dispatch layer should remain purely about "how to run" — worktree lifecycle, process management, cost tracking, and message delivery.