singularity-forge/docs/user-docs/parallel-orchestration.md
2026-05-08 01:34:07 +02:00


# Parallel Milestone Orchestration
Run multiple milestones simultaneously in isolated git worktrees. Each milestone gets its own worker process, its own branch, and its own context window — while a coordinator tracks progress, enforces budgets, and keeps everything in sync.
> **Status:** Disabled by default (`parallel.enabled: false`). Opt-in only, with no impact on existing users.
## Quick Start
1. Enable parallel mode in your preferences:
```yaml
---
parallel:
  enabled: true
  max_workers: 2
---
```
2. Start parallel execution:
```
/parallel start
```
SF scans your milestones, checks dependencies and file overlap, shows an eligibility report, and spawns workers for eligible milestones.
3. Monitor progress:
```
/parallel status
```
4. Stop when done:
```
/parallel stop
```
## How It Works
### Architecture
```
┌─────────────────────────────────────────────────────────┐
│  Coordinator (your SF session)                          │
│                                                         │
│  Responsibilities:                                      │
│  - Eligibility analysis (deps + file overlap)           │
│  - Worker spawning and lifecycle                        │
│  - Budget tracking across all workers                   │
│  - Signal dispatch (pause/resume/stop)                  │
│  - Session status monitoring                            │
│  - Merge reconciliation                                 │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│  │ Worker 1 │    │ Worker 2 │    │ Worker 3 │  ...      │
│  │   M001   │    │   M003   │    │   M005   │           │
│  └──────────┘    └──────────┘    └──────────┘           │
│       │               │               │                 │
│       ▼               ▼               ▼                 │
│  .sf/worktrees/  .sf/worktrees/  .sf/worktrees/         │
│  M001/           M003/           M005/                  │
│  (milestone/     (milestone/     (milestone/            │
│   M001 branch)    M003 branch)    M005 branch)          │
└─────────────────────────────────────────────────────────┘
```
### Worker Isolation
Each worker is a separate `sf` process with complete isolation:
| Resource | Isolation Method |
|----------|-----------------|
| **Filesystem** | Git worktree — each worker has its own checkout |
| **Git branch** | `milestone/<MID>` — one branch per milestone |
| **State derivation** | `SF_MILESTONE_LOCK` env var — `deriveState()` only sees the assigned milestone |
| **Context window** | Separate process — each worker has its own agent sessions |
| **Metrics** | Each worktree has its own `.sf/metrics.json` |
| **Crash recovery** | Each worktree has its own `.sf/auto.lock` |
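
The table above implies that every per-worker resource can be derived from the milestone ID alone. A minimal sketch of that mapping follows. The function name `workerLayout`, the `WorkerLayout` type, and the choice to store the milestone ID as the value of `SF_MILESTONE_LOCK` are assumptions for illustration, not SF's actual internals.

```typescript
import * as path from "node:path";

// Hypothetical description of one worker's isolated resources.
interface WorkerLayout {
  worktreeDir: string; // the worker's own checkout
  branch: string; // one branch per milestone
  metricsFile: string; // per-worktree metrics
  lockFile: string; // per-worktree crash-recovery lock
  env: Record<string, string>;
}

function workerLayout(projectRoot: string, milestoneId: string): WorkerLayout {
  // path.posix keeps the output identical on every platform for this example.
  const worktreeDir = path.posix.join(projectRoot, ".sf", "worktrees", milestoneId);
  return {
    worktreeDir,
    branch: `milestone/${milestoneId}`,
    metricsFile: path.posix.join(worktreeDir, ".sf", "metrics.json"),
    lockFile: path.posix.join(worktreeDir, ".sf", "auto.lock"),
    // Assumption: the lock variable carries the milestone ID as its value.
    env: { SF_MILESTONE_LOCK: milestoneId },
  };
}

const layout = workerLayout("/repo", "M002");
console.log(layout.worktreeDir, layout.branch);
```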
### Coordination
Workers and the coordinator communicate through file-based IPC:
- **Session status files** (`.sf/parallel/<MID>.status.json`) — workers write heartbeats, the coordinator reads them
- **Signal files** (`.sf/parallel/<MID>.signal.json`) — coordinator writes signals, workers consume them
- **Atomic writes** — write-to-temp + rename prevents partial reads
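
The write-to-temp-then-rename pattern can be sketched as follows. This is an illustration under stated assumptions, not SF's implementation: the `WorkerStatus` fields and the helper names `writeJsonAtomic` and `readStatus` are hypothetical, and the real status schema is not documented here.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical shape of a status file.
interface WorkerStatus {
  milestoneId: string;
  pid: number;
  heartbeatAt: number; // epoch milliseconds
}

// Write JSON to a temp file in the same directory, then rename it over the target.
// A rename within one filesystem replaces the target in a single step, so a reader
// sees either the old file or the new one, never a half-written file.
function writeJsonAtomic(filePath: string, data: unknown): void {
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(data));
  fs.renameSync(tmpPath, filePath);
}

// Read a status file, returning null if it is missing or unreadable.
function readStatus(filePath: string): WorkerStatus | null {
  try {
    return JSON.parse(fs.readFileSync(filePath, "utf8")) as WorkerStatus;
  } catch {
    return null;
  }
}

// Demo in a throwaway directory rather than a real .sf/parallel/ folder.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "sf-parallel-"));
const statusPath = path.join(dir, "M002.status.json");
writeJsonAtomic(statusPath, { milestoneId: "M002", pid: process.pid, heartbeatAt: Date.now() });
const status = readStatus(statusPath);
```

Keeping the temp file in the same directory as the target matters: a rename across filesystems is not atomic.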
## Eligibility Analysis
Before starting parallel execution, SF checks which milestones can safely run concurrently.
### Rules
1. **Not complete** — Finished milestones are skipped
2. **Dependencies satisfied** — All `dependsOn` entries must have status `complete`
3. **File overlap check** — Milestones touching the same files get a warning (but are still eligible)
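
The three rules can be expressed as one pass over the milestone list. The sketch below is an assumption-laden illustration: the `Milestone` shape (other than `dependsOn`), the `files` field, and the function name `analyzeEligibility` are hypothetical, and how SF actually determines which files a milestone touches is not covered here.

```typescript
// Hypothetical milestone shape; only `dependsOn` comes from the rules above.
interface Milestone {
  id: string;
  status: "pending" | "active" | "complete";
  dependsOn: string[];
  files: string[]; // files the milestone is expected to touch
}

interface EligibilityReport {
  eligible: string[];
  ineligible: { id: string; reason: string }[];
  overlaps: { a: string; b: string; files: string[] }[];
}

function analyzeEligibility(milestones: Milestone[]): EligibilityReport {
  const done = new Set(milestones.filter((m) => m.status === "complete").map((m) => m.id));
  const report: EligibilityReport = { eligible: [], ineligible: [], overlaps: [] };

  for (const m of milestones) {
    const blockers = m.dependsOn.filter((dep) => !done.has(dep));
    if (m.status === "complete") {
      report.ineligible.push({ id: m.id, reason: "Already complete." });
    } else if (blockers.length > 0) {
      report.ineligible.push({
        id: m.id,
        reason: `Blocked by incomplete dependencies: ${blockers.join(", ")}.`,
      });
    } else {
      report.eligible.push(m.id);
    }
  }

  // Overlap is a warning only: compare each pair of eligible milestones once.
  const eligible = milestones.filter((m) => report.eligible.includes(m.id));
  eligible.forEach((a, i) => {
    for (const b of eligible.slice(i + 1)) {
      const shared = a.files.filter((f) => b.files.includes(f));
      if (shared.length > 0) report.overlaps.push({ a: a.id, b: b.id, files: shared });
    }
  });
  return report;
}

const report = analyzeEligibility([
  { id: "M001", status: "complete", dependsOn: [], files: ["src/types.ts"] },
  { id: "M002", status: "pending", dependsOn: ["M001"], files: ["src/types.ts", "src/middleware.ts", "src/auth.ts"] },
  { id: "M003", status: "pending", dependsOn: ["M001"], files: ["src/types.ts", "src/middleware.ts", "src/dashboard.tsx"] },
  { id: "M004", status: "pending", dependsOn: ["M002"], files: ["src/api.ts"] },
]);
console.log(JSON.stringify(report, null, 2));
```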
### Example Report
```
# Parallel Eligibility Report
## Eligible for Parallel Execution (2)
- **M002** — Auth System
All dependencies satisfied.
- **M003** — Dashboard UI
All dependencies satisfied.
## Ineligible (2)
- **M001** — Core Types
Already complete.
- **M004** — API Integration
Blocked by incomplete dependencies: M002.
## File Overlap Warnings (1)
- **M002** <-> **M003** — 2 shared file(s):
- `src/types.ts`
- `src/middleware.ts`
```
File overlaps are warnings, not blockers. Both milestones work in separate worktrees, so they won't interfere at the filesystem level. Conflicts are detected and resolved during merge.
## Configuration
Add to `~/.sf/PREFERENCES.md` or `.sf/PREFERENCES.md`:
```yaml
---
parallel:
  enabled: false                   # Master toggle (default: false)
  max_workers: 2                   # Concurrent workers (1-4, default: 2)
  budget_ceiling: 50.00            # Aggregate cost limit in dollars (optional)
  merge_strategy: "per-milestone"  # When to merge: "per-slice" or "per-milestone"
  auto_merge: "confirm"            # "auto", "confirm", or "manual"
---
```
### Configuration Reference
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `enabled` | boolean | `false` | Master toggle. Must be `true` for `/parallel` commands to work. |
| `max_workers` | number (1-4) | `2` | Maximum concurrent worker processes. Higher values use more memory and API budget. |
| `budget_ceiling` | number | none | Aggregate cost ceiling in USD across all workers. When reached, no new units are dispatched. |
| `merge_strategy` | `"per-slice"` or `"per-milestone"` | `"per-milestone"` | When worktree changes merge back to main. Per-milestone waits for the full milestone to complete. |
| `auto_merge` | `"auto"`, `"confirm"`, `"manual"` | `"confirm"` | How merge-back is handled. `confirm` prompts before merging. `manual` requires explicit `/parallel merge`. |
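
The defaults in the table can be modelled as a typed object merged with user overrides. This is a sketch only: the names `ParallelConfig` and `resolveParallelConfig` are hypothetical, and clamping `max_workers` into the 1-4 range is an assumption (the real loader may reject out-of-range values instead).

```typescript
type MergeStrategy = "per-slice" | "per-milestone";
type AutoMerge = "auto" | "confirm" | "manual";

interface ParallelConfig {
  enabled: boolean;
  max_workers: number;
  budget_ceiling?: number; // USD; omitted means no ceiling
  merge_strategy: MergeStrategy;
  auto_merge: AutoMerge;
}

// Defaults as listed in the configuration reference.
const DEFAULTS: ParallelConfig = {
  enabled: false,
  max_workers: 2,
  merge_strategy: "per-milestone",
  auto_merge: "confirm",
};

// Merge user-supplied values over the defaults.
// Assumption: out-of-range worker counts are clamped rather than rejected.
function resolveParallelConfig(user: Partial<ParallelConfig> = {}): ParallelConfig {
  const merged: ParallelConfig = { ...DEFAULTS, ...user };
  merged.max_workers = Math.min(4, Math.max(1, Math.round(merged.max_workers)));
  return merged;
}

console.log(resolveParallelConfig({ enabled: true, max_workers: 9 }));
```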
## Commands
| Command | Description |
|---------|-------------|
| `/parallel start` | Analyze eligibility, confirm, and start workers |
| `/parallel status` | Show all workers with state, units completed, and cost |
| `/parallel stop` | Stop all workers (sends SIGTERM) |
| `/parallel stop M002` | Stop a specific milestone's worker |
| `/parallel pause` | Pause all workers (finish current unit, then wait) |
| `/parallel pause M002` | Pause a specific worker |
| `/parallel resume` | Resume all paused workers |
| `/parallel resume M002` | Resume a specific worker |
| `/parallel merge` | Merge all completed milestones back to main |
| `/parallel merge M002` | Merge a specific milestone back to main |
## Signal Lifecycle
The coordinator communicates with workers through signals:
```
Coordinator                   Worker
    │                            │
    ├── sendSignal("pause") ──→  │
    │                            ├── consumeSignal()
    │                            ├── pauseAuto()
    │                            │   (finish current unit, wait)
    │                            │
    ├── sendSignal("resume") ─→  │
    │                            ├── consumeSignal()
    │                            ├── resume dispatch loop
    │                            │
    ├── sendSignal("stop") ───→  │
    │   + SIGTERM ────────────→  │
    │                            ├── consumeSignal() or SIGTERM handler
    │                            ├── stopAuto()
    │                            └── process exits
```
Workers check for signals between units (in `handleAgentEnd`). The coordinator also sends `SIGTERM` for immediate response on stop.
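
The "check between units" behavior can be modelled as a small state machine. The sketch below is a simplified simulation, not the actual `handleAgentEnd` logic: `applySignal` and `runUnits` are hypothetical names, and a paused worker here simply stops iterating where a real worker would wait for `resume`.

```typescript
type Signal = "pause" | "resume" | "stop";
type WorkerState = "running" | "paused" | "stopped";

const NEXT_STATE: Record<Signal, WorkerState> = {
  pause: "paused",
  resume: "running",
  stop: "stopped",
};

// Apply one consumed signal. A stopped worker never leaves the stopped state.
function applySignal(state: WorkerState, signal: Signal): WorkerState {
  return state === "stopped" ? "stopped" : NEXT_STATE[signal];
}

// Simulate a worker that looks for a pending signal only after finishing a unit,
// so the current unit always completes before a pause or stop takes effect.
function runUnits(
  units: string[],
  signalsAfterUnit: Partial<Record<string, Signal>>,
): { completed: string[]; state: WorkerState } {
  let state: WorkerState = "running";
  const completed: string[] = [];
  for (const unit of units) {
    if (state !== "running") break;
    completed.push(unit);
    const signal = signalsAfterUnit[unit];
    if (signal !== undefined) state = applySignal(state, signal);
  }
  return { completed, state };
}

const result = runUnits(["u1", "u2", "u3"], { u2: "pause" });
console.log(result);
```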
## Merge Reconciliation
When milestones complete, their worktree changes need to merge back to main.
### Merge Order
- **Sequential** (default): Milestones merge in ID order (M001 before M002)
- **By-completion**: Milestones merge in the order they finish
### Conflict Handling
1. `.sf/` state files (STATE.md, metrics.json, etc.) — **auto-resolved** by accepting the milestone branch version
2. Code conflicts — **stop and report**. The merge halts, showing which files conflict. Resolve manually and retry with `/parallel merge <MID>`.
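
The merge order and the two conflict classes can be sketched with two small helpers. Both function names are hypothetical, and the prefix test on `.sf/` is an assumption about how state files are recognized.

```typescript
// Sequential merge order: ascending milestone ID (M001 before M002).
function sequentialOrder(ids: string[]): string[] {
  return [...ids].sort((a, b) => a.localeCompare(b, undefined, { numeric: true }));
}

// Split conflicted paths into `.sf/` state files (auto-resolved in favor of the
// milestone branch) and everything else (the merge halts for manual resolution).
function classifyConflicts(conflictedPaths: string[]): { autoResolved: string[]; manual: string[] } {
  const autoResolved: string[] = [];
  const manual: string[] = [];
  for (const p of conflictedPaths) {
    (p.startsWith(".sf/") ? autoResolved : manual).push(p);
  }
  return { autoResolved, manual };
}

const order = sequentialOrder(["M003", "M001", "M002"]);
const conflicts = classifyConflicts([".sf/STATE.md", "src/types.ts", ".sf/metrics.json"]);
console.log(order, conflicts);
```

A merge would proceed only when `manual` is empty; otherwise the listed files are what the user resolves before retrying.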
### Example
```
/parallel merge
# Merge Results
- **M002** — merged successfully (pushed)
- **M003** — CONFLICT (2 file(s)):
- `src/types.ts`
- `src/middleware.ts`
Resolve conflicts manually and run `/parallel merge M003` to retry.
```
## Budget Management
When `budget_ceiling` is set, the coordinator tracks aggregate cost across all workers:
- Cost is summed from each worker's session status
- When the ceiling is reached, the coordinator signals workers to stop
- Each worker also respects the project-level `budget_ceiling` preference independently
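
The aggregate check amounts to a sum and a comparison. A minimal sketch, assuming each status entry exposes a numeric `cost` in USD (the field name and both function names are hypothetical):

```typescript
interface WorkerCost {
  milestoneId: string;
  cost: number; // USD spent so far by this worker
}

// Sum the cost reported by every worker.
function totalCost(workers: WorkerCost[]): number {
  return workers.reduce((sum, w) => sum + w.cost, 0);
}

// True once aggregate spend has reached the ceiling.
// An undefined ceiling means no limit is configured.
function ceilingReached(workers: WorkerCost[], ceiling?: number): boolean {
  return ceiling !== undefined && totalCost(workers) >= ceiling;
}

const workers: WorkerCost[] = [
  { milestoneId: "M002", cost: 20.25 },
  { milestoneId: "M003", cost: 30 },
];
console.log(totalCost(workers), ceilingReached(workers, 50));
```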
## Health Monitoring
### Doctor Integration
`/doctor` detects parallel session issues:
- **Stale parallel sessions** — Worker process died without cleanup. Doctor finds `.sf/parallel/*.status.json` files with dead PIDs or expired heartbeats and removes them.
Run `/doctor --fix` to clean up automatically.
### Stale Detection
A session is considered stale when either of the following holds:
- The worker PID is no longer running (checked via `process.kill(pid, 0)`)
- The last heartbeat is older than 30 seconds
The coordinator runs stale detection during `refreshWorkerStatuses()` and automatically removes dead sessions.
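
A sketch of the two staleness checks follows. The session shape and the names `isPidAlive` and `isStale` are hypothetical. Treating `EPERM` as "alive" is a choice made for this example, since that error means the process exists but belongs to another user.

```typescript
const HEARTBEAT_TIMEOUT_MS = 30 * 1000;

// Signal 0 sends nothing; it only checks whether the process exists.
// process.kill throws if the PID is not running.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we may not signal it, so count it as alive.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Stale if the worker process is gone OR its last heartbeat is too old.
function isStale(
  session: { pid: number; heartbeatAt: number },
  now: number = Date.now(),
): boolean {
  return !isPidAlive(session.pid) || now - session.heartbeatAt > HEARTBEAT_TIMEOUT_MS;
}

// The current process is certainly alive, so only the heartbeat age matters here.
const fresh = isStale({ pid: process.pid, heartbeatAt: Date.now() });
const expired = isStale({ pid: process.pid, heartbeatAt: Date.now() - 31 * 1000 });
console.log({ fresh, expired });
```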
## Safety Model
| Safety Layer | Protection |
|-------------|------------|
| **Feature flag** | `parallel.enabled: false` by default — existing users unaffected |
| **Eligibility analysis** | Dependency and file overlap checks before starting |
| **Worker isolation** | Separate processes, worktrees, branches, context windows |
| **`SF_MILESTONE_LOCK`** | Each worker only sees its milestone in state derivation |
| **`SF_PARALLEL_WORKER`** | Workers cannot spawn nested parallel sessions |
| **Budget ceiling** | Aggregate cost enforcement across all workers |
| **Signal-based shutdown** | Graceful stop via file signals + SIGTERM |
| **Doctor integration** | Detects and cleans up orphaned sessions |
| **Conflict-aware merge** | Stops on code conflicts, auto-resolves `.sf/` state conflicts |
## File Layout
```
.sf/
├── parallel/                 # Coordinator ↔ worker IPC
│   ├── M002.status.json      # Worker heartbeat + progress
│   ├── M002.signal.json      # Coordinator → worker signals
│   ├── M003.status.json
│   └── M003.signal.json
├── worktrees/                # Git worktrees (one per milestone)
│   ├── M002/                 # M002's isolated checkout
│   │   ├── .sf/              # M002's own state files
│   │   │   ├── auto.lock
│   │   │   ├── metrics.json
│   │   │   └── milestones/
│   │   └── src/              # M002's working copy
│   └── M003/
│       └── ...
└── ...
```
Both `.sf/parallel/` and `.sf/worktrees/` are gitignored — they're runtime-only coordination files that never get committed.
## Troubleshooting
### "Parallel mode is not enabled"
Set `parallel.enabled: true` in your preferences file.
### "No milestones are eligible for parallel execution"
All milestones are either complete or blocked by dependencies. Check `/queue` to see milestone status and dependency chains.
### Worker crashed — how to recover
Workers persist their state to disk automatically. If a worker process dies, the coordinator detects it through the dead PID or an expired heartbeat and marks the worker as crashed. On restart, the worker resumes from disk state: crash recovery, worktree re-entry, and completed-unit tracking carry over from the crashed session. To recover:
1. Run `/doctor --fix` to clean up stale sessions
2. Run `/parallel status` to see current state
3. Re-run `/parallel start` to spawn new workers for remaining milestones
### Merge conflicts after parallel completion
1. Run `/parallel merge` to see which milestones have conflicts
2. Resolve conflicts in the worktree at `.sf/worktrees/<MID>/`
3. Retry with `/parallel merge <MID>`
### Workers seem stuck
Check whether the budget ceiling was reached: `/parallel status` shows per-worker costs. To continue, increase `parallel.budget_ceiling` or remove it.