Parallel Milestone Orchestration
Run multiple milestones simultaneously in isolated git worktrees. Each milestone gets its own worker process, its own branch, and its own context window — while a coordinator tracks progress, enforces budgets, and keeps everything in sync.
Status: Behind parallel.enabled: false by default. Opt-in only, with zero impact on existing users.
Quick Start
- Enable parallel mode in your preferences:
---
parallel:
  enabled: true
  max_workers: 2
---
- Start parallel execution:
/parallel start
SF scans your milestones, checks dependencies and file overlap, shows an eligibility report, and spawns workers for eligible milestones.
- Monitor progress:
/parallel status
- Stop when done:
/parallel stop
How It Works
Architecture
┌─────────────────────────────────────────────────────────┐
│  Coordinator (your SF session)                          │
│                                                         │
│  Responsibilities:                                      │
│  - Eligibility analysis (deps + file overlap)           │
│  - Worker spawning and lifecycle                        │
│  - Budget tracking across all workers                   │
│  - Signal dispatch (pause/resume/stop)                  │
│  - Session status monitoring                            │
│  - Merge reconciliation                                 │
│                                                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  ...          │
│  │   M001   │  │   M003   │  │   M005   │               │
│  └──────────┘  └──────────┘  └──────────┘               │
│       │             │             │                     │
│       ▼             ▼             ▼                     │
│  .sf/worktrees/ .sf/worktrees/ .sf/worktrees/           │
│  M001/          M003/          M005/                    │
│  (milestone/    (milestone/    (milestone/              │
│   M001 branch)   M003 branch)   M005 branch)            │
└─────────────────────────────────────────────────────────┘
Worker Isolation
Each worker is a separate sf process with complete isolation:
| Resource | Isolation Method |
|---|---|
| Filesystem | Git worktree — each worker has its own checkout |
| Git branch | milestone/<MID> — one branch per milestone |
| State derivation | SF_MILESTONE_LOCK env var — deriveState() only sees the assigned milestone |
| Context window | Separate process — each worker has its own agent sessions |
| Metrics | Each worktree has its own .sf/metrics.json |
| Crash recovery | Each worktree has its own .sf/auto.lock |
Coordination
Workers and the coordinator communicate through file-based IPC:
- Session status files (.sf/parallel/<MID>.status.json): workers write heartbeats, the coordinator reads them
- Signal files (.sf/parallel/<MID>.signal.json): the coordinator writes signals, workers consume them
- Atomic writes: write-to-temp + rename prevents partial reads
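The write-to-temp + rename pattern can be sketched as follows. This is a minimal illustration, not SF's actual implementation: `writeJsonAtomic`, `readJson`, and the `WorkerStatus` fields are hypothetical names, and the real status schema is not shown in this document.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical shape of a status file (assumption, not the real schema).
interface WorkerStatus {
  milestoneId: string;
  pid: number;
  lastHeartbeat: number; // epoch milliseconds
}

// Write JSON to a temp file in the same directory, then rename it into place.
// A rename within one filesystem replaces the target in a single step,
// so a reader sees either the old file or the new one, never half of each.
function writeJsonAtomic(filePath: string, data: unknown): void {
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(data, null, 2), "utf8");
  fs.renameSync(tmpPath, filePath);
}

// Returns null when the file is missing or cannot be parsed.
function readJson<T>(filePath: string): T | null {
  try {
    return JSON.parse(fs.readFileSync(filePath, "utf8")) as T;
  } catch {
    return null;
  }
}

// Usage: a worker writes a heartbeat, the coordinator reads it back.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "sf-parallel-"));
const statusFile = path.join(demoDir, "M002.status.json");
writeJsonAtomic(statusFile, { milestoneId: "M002", pid: process.pid, lastHeartbeat: Date.now() });
console.log(readJson<WorkerStatus>(statusFile)?.milestoneId); // M002
```

Keeping the temp file in the same directory as the target matters: a rename across filesystems is not atomic.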
Eligibility Analysis
Before starting parallel execution, SF checks which milestones can safely run concurrently.
Rules
- Not complete: finished milestones are skipped
- Dependencies satisfied: all dependsOn entries must have status complete
- File overlap check: milestones touching the same files get a warning (but are still eligible)
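The three rules above could be expressed roughly like this. It is a sketch under stated assumptions: the `Milestone` shape (in particular the `files` field and the status values other than `complete`) and the `analyzeEligibility` function are hypothetical, chosen only to mirror the rules and the report format.

```typescript
// Hypothetical milestone shape; only dependsOn and the "complete" status come from the docs.
interface Milestone {
  id: string;
  status: "pending" | "active" | "complete";
  dependsOn: string[];
  files: string[]; // files the milestone is expected to touch (assumption)
}

interface EligibilityReport {
  eligible: string[];
  ineligible: { id: string; reason: string }[];
  overlaps: { a: string; b: string; files: string[] }[];
}

function analyzeEligibility(milestones: Milestone[]): EligibilityReport {
  const byId = new Map<string, Milestone>();
  for (const m of milestones) byId.set(m.id, m);

  const eligible: Milestone[] = [];
  const ineligible: { id: string; reason: string }[] = [];

  for (const m of milestones) {
    // Rule 1: finished milestones are skipped.
    if (m.status === "complete") {
      ineligible.push({ id: m.id, reason: "Already complete." });
      continue;
    }
    // Rule 2: every dependency must be complete (unknown IDs count as incomplete).
    const blockers = m.dependsOn.filter((dep) => byId.get(dep)?.status !== "complete");
    if (blockers.length > 0) {
      ineligible.push({ id: m.id, reason: `Blocked by incomplete dependencies: ${blockers.join(", ")}.` });
      continue;
    }
    eligible.push(m);
  }

  // Rule 3: overlaps are warnings only; they never remove a milestone from the eligible list.
  const overlaps: { a: string; b: string; files: string[] }[] = [];
  eligible.forEach((a, i) => {
    for (const b of eligible.slice(i + 1)) {
      const shared = a.files.filter((f) => b.files.includes(f));
      if (shared.length > 0) overlaps.push({ a: a.id, b: b.id, files: shared });
    }
  });

  return { eligible: eligible.map((m) => m.id), ineligible, overlaps };
}
```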
Example Report
# Parallel Eligibility Report
## Eligible for Parallel Execution (2)
- **M002** — Auth System
All dependencies satisfied.
- **M003** — Dashboard UI
All dependencies satisfied.
## Ineligible (2)
- **M001** — Core Types
Already complete.
- **M004** — API Integration
Blocked by incomplete dependencies: M002.
## File Overlap Warnings (1)
- **M002** <-> **M003** — 2 shared file(s):
- `src/types.ts`
- `src/middleware.ts`
File overlaps are warnings, not blockers. Both milestones work in separate worktrees, so they won't interfere at the filesystem level. Conflicts are detected and resolved during merge.
Configuration
Add to ~/.sf/PREFERENCES.md or .sf/PREFERENCES.md:
---
parallel:
  enabled: false                  # Master toggle (default: false)
  max_workers: 2                  # Concurrent workers (1-4, default: 2)
  budget_ceiling: 50.00           # Aggregate cost limit in dollars (optional)
  merge_strategy: "per-milestone" # When to merge: "per-slice" or "per-milestone"
  auto_merge: "confirm"           # "auto", "confirm", or "manual"
---
Configuration Reference
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Master toggle. Must be true for /parallel commands to work. |
| max_workers | number (1-4) | 2 | Maximum concurrent worker processes. Higher values use more memory and API budget. |
| budget_ceiling | number | none | Aggregate cost ceiling in USD across all workers. When reached, no new units are dispatched. |
| merge_strategy | "per-slice" or "per-milestone" | "per-milestone" | When worktree changes merge back to main. Per-milestone waits for the full milestone to complete. |
| auto_merge | "auto", "confirm", "manual" | "confirm" | How merge-back is handled. confirm prompts before merging. manual requires explicit /parallel merge. |
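For readers who consume these preferences from code, the keys and defaults could be modeled as below. This is only a sketch: `ParallelConfig` and `resolveParallelConfig` are hypothetical names, and clamping an out-of-range max_workers (rather than rejecting it) is an assumption, since the document states the 1-4 range but not how violations are handled.

```typescript
type MergeStrategy = "per-slice" | "per-milestone";
type AutoMerge = "auto" | "confirm" | "manual";

interface ParallelConfig {
  enabled: boolean;
  max_workers: number;
  budget_ceiling?: number; // USD; omitted means no ceiling
  merge_strategy: MergeStrategy;
  auto_merge: AutoMerge;
}

// Defaults taken from the configuration reference.
const PARALLEL_DEFAULTS: ParallelConfig = {
  enabled: false,
  max_workers: 2,
  merge_strategy: "per-milestone",
  auto_merge: "confirm",
};

// Hypothetical helper: fill in defaults, then clamp max_workers to the documented 1-4 range.
function resolveParallelConfig(raw: Partial<ParallelConfig> = {}): ParallelConfig {
  const merged: ParallelConfig = { ...PARALLEL_DEFAULTS, ...raw };
  merged.max_workers = Math.min(4, Math.max(1, Math.round(merged.max_workers)));
  return merged;
}

console.log(resolveParallelConfig({ enabled: true, max_workers: 9 }).max_workers); // 4
```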
Commands
| Command | Description |
|---|---|
| /parallel start | Analyze eligibility, confirm, and start workers |
| /parallel status | Show all workers with state, units completed, and cost |
| /parallel stop | Stop all workers (sends SIGTERM) |
| /parallel stop M002 | Stop a specific milestone's worker |
| /parallel pause | Pause all workers (finish current unit, then wait) |
| /parallel pause M002 | Pause a specific worker |
| /parallel resume | Resume all paused workers |
| /parallel resume M002 | Resume a specific worker |
| /parallel merge | Merge all completed milestones back to main |
| /parallel merge M002 | Merge a specific milestone back to main |
Signal Lifecycle
The coordinator communicates with workers through signals:
Coordinator                   Worker
    │                            │
    ├── sendSignal("pause") ──→  │
    │                            ├── consumeSignal()
    │                            ├── pauseAuto()
    │                            │   (finish current unit, wait)
    │                            │
    ├── sendSignal("resume") ─→  │
    │                            ├── consumeSignal()
    │                            ├── resume dispatch loop
    │                            │
    ├── sendSignal("stop") ───→  │
    │   + SIGTERM ────────────→  │
    │                            ├── consumeSignal() or SIGTERM handler
    │                            ├── stopAuto()
    │                            └── process exits
Workers check for signals between units (in handleAgentEnd). The coordinator also sends SIGTERM for immediate response on stop.
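A file-based signal exchange along these lines might look like the sketch below. The function names `sendSignal` and `consumeSignal` come from the diagram, but their bodies, the `SignalFile` shape, and the read-then-delete behavior are assumptions made for illustration.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type Signal = "pause" | "resume" | "stop";

// Hypothetical on-disk shape; the real signal schema is not documented here.
interface SignalFile {
  signal: Signal;
  sentAt: number; // epoch milliseconds
}

function signalPath(parallelDir: string, mid: string): string {
  return path.join(parallelDir, `${mid}.signal.json`);
}

// Coordinator side: write to a temp file, then rename, so a worker never reads a partial file.
function sendSignal(parallelDir: string, mid: string, signal: Signal): void {
  const target = signalPath(parallelDir, mid);
  const tmp = `${target}.${process.pid}.tmp`;
  const payload: SignalFile = { signal, sentAt: Date.now() };
  fs.writeFileSync(tmp, JSON.stringify(payload), "utf8");
  fs.renameSync(tmp, target);
}

// Worker side: called between units. Reads the pending signal, deletes the file, returns it.
function consumeSignal(parallelDir: string, mid: string): Signal | null {
  const target = signalPath(parallelDir, mid);
  let raw: string;
  try {
    raw = fs.readFileSync(target, "utf8");
  } catch {
    return null; // no signal pending
  }
  fs.unlinkSync(target); // each signal is handled at most once
  return (JSON.parse(raw) as SignalFile).signal;
}

const signalDir = fs.mkdtempSync(path.join(os.tmpdir(), "sf-signal-"));
sendSignal(signalDir, "M002", "pause");
console.log(consumeSignal(signalDir, "M002")); // pause
console.log(consumeSignal(signalDir, "M002")); // null
```

Because the worker only polls between units, a file signal alone can take as long as the current unit to be noticed, which is consistent with the coordinator also sending SIGTERM on stop.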
Merge Reconciliation
When milestones complete, their worktree changes need to merge back to main.
Merge Order
- Sequential (default): Milestones merge in ID order (M001 before M002)
- By-completion: Milestones merge in the order they finish
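The two orderings amount to two sort keys. In this sketch, `CompletedMilestone`, `MergeOrder`, and `orderForMerge` are hypothetical names; the document does not say how the order is selected or stored.

```typescript
// Hypothetical record of a finished milestone; field names are assumptions.
interface CompletedMilestone {
  id: string;          // e.g. "M002"
  completedAt: number; // epoch milliseconds
}

type MergeOrder = "sequential" | "by-completion";

// Return a new array in the order merges would be attempted.
function orderForMerge(
  finished: CompletedMilestone[],
  order: MergeOrder = "sequential",
): CompletedMilestone[] {
  const sorted = [...finished];
  if (order === "sequential") {
    // Numeric-aware comparison so M002 sorts before M010.
    sorted.sort((a, b) => a.id.localeCompare(b.id, undefined, { numeric: true }));
  } else {
    sorted.sort((a, b) => a.completedAt - b.completedAt);
  }
  return sorted;
}

const finishedMilestones: CompletedMilestone[] = [
  { id: "M003", completedAt: 1_000 },
  { id: "M002", completedAt: 2_000 },
];
console.log(orderForMerge(finishedMilestones).map((m) => m.id));                  // [ 'M002', 'M003' ]
console.log(orderForMerge(finishedMilestones, "by-completion").map((m) => m.id)); // [ 'M003', 'M002' ]
```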
Conflict Handling
- .sf/ state files (STATE.md, metrics.json, etc.): auto-resolved by accepting the milestone branch version
- Code conflicts: stop and report. The merge halts and shows which files conflict. Resolve manually and retry with /parallel merge <MID>.
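The split between auto-resolved and manual conflicts can be illustrated with a small classifier. `planConflictResolution` and `ConflictPlan` are hypothetical names, and treating every path under .sf/ as a state file is an assumption drawn from the rule above.

```typescript
interface ConflictPlan {
  autoResolve: string[]; // .sf/ state files: take the milestone branch version
  manual: string[];      // code files: need a human
  halt: boolean;         // true when the merge must stop and report
}

// Classify conflicted paths (relative to the repository root, using "/" separators).
function planConflictResolution(conflictedPaths: string[]): ConflictPlan {
  const autoResolve: string[] = [];
  const manual: string[] = [];
  for (const p of conflictedPaths) {
    if (p.startsWith(".sf/")) autoResolve.push(p);
    else manual.push(p);
  }
  return { autoResolve, manual, halt: manual.length > 0 };
}

const conflictPlan = planConflictResolution([".sf/STATE.md", ".sf/metrics.json", "src/types.ts"]);
console.log(conflictPlan.autoResolve); // [ '.sf/STATE.md', '.sf/metrics.json' ]
console.log(conflictPlan.manual);      // [ 'src/types.ts' ]
console.log(conflictPlan.halt);        // true
```

In plain git terms, accepting the milestone branch version while merging that branch into main corresponds to taking the "theirs" side of each conflicted file; whether SF does exactly this internally is not stated in this document.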
Example
/parallel merge
# Merge Results
- **M002** — merged successfully (pushed)
- **M003** — CONFLICT (2 file(s)):
- `src/types.ts`
- `src/middleware.ts`
Resolve conflicts manually and run `/parallel merge M003` to retry.
Budget Management
When budget_ceiling is set, the coordinator tracks aggregate cost across all workers:
- Cost is summed from each worker's session status
- When the ceiling is reached, the coordinator signals workers to stop
- Each worker also respects the project-level budget_ceiling preference independently
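The aggregate check reduces to a sum and a comparison. The sketch below assumes each session status exposes a numeric cost field; `WorkerCostStatus`, `aggregateCost`, and `isCeilingReached` are hypothetical names.

```typescript
// Hypothetical subset of a worker's session status; field names are assumptions.
interface WorkerCostStatus {
  milestoneId: string;
  cost: number; // USD spent so far by this worker
}

function aggregateCost(statuses: WorkerCostStatus[]): number {
  return statuses.reduce((total, s) => total + s.cost, 0);
}

// With no ceiling configured, the budget never blocks.
function isCeilingReached(statuses: WorkerCostStatus[], budgetCeiling?: number): boolean {
  return budgetCeiling !== undefined && aggregateCost(statuses) >= budgetCeiling;
}

const workerCosts: WorkerCostStatus[] = [
  { milestoneId: "M002", cost: 20.25 },
  { milestoneId: "M003", cost: 30.0 },
];
console.log(aggregateCost(workerCosts));        // 50.25
console.log(isCeilingReached(workerCosts, 50)); // true
console.log(isCeilingReached(workerCosts));     // false
```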
Health Monitoring
Doctor Integration
/doctor detects parallel session issues:
- Stale parallel sessions: the worker process died without cleanup. Doctor finds .sf/parallel/*.status.json files with dead PIDs or expired heartbeats and removes them.
Run /doctor --fix to clean up automatically.
Stale Detection
Sessions are considered stale when:
- The worker PID is no longer running (checked via process.kill(pid, 0))
- The last heartbeat is older than 30 seconds
The coordinator runs stale detection during refreshWorkerStatuses() and automatically removes dead sessions.
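A stale check based on these two conditions might look like the following. It is a sketch, not the body of refreshWorkerStatuses(): `isPidAlive`, `isStale`, and the `SessionHeartbeat` fields are hypothetical, and it assumes either condition alone marks a session as stale.

```typescript
const HEARTBEAT_TIMEOUT_MS = 30_000; // 30 seconds, per the rule above

// Hypothetical subset of a session status file.
interface SessionHeartbeat {
  pid: number;
  lastHeartbeat: number; // epoch milliseconds
}

function isPidAlive(pid: number): boolean {
  try {
    // Signal 0 performs the existence and permission check without delivering a signal.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it; anything else counts as dead.
    return (err as { code?: string }).code === "EPERM";
  }
}

function isStale(session: SessionHeartbeat, now: number = Date.now()): boolean {
  const heartbeatExpired = now - session.lastHeartbeat > HEARTBEAT_TIMEOUT_MS;
  return heartbeatExpired || !isPidAlive(session.pid);
}

const currentTime = Date.now();
console.log(isStale({ pid: process.pid, lastHeartbeat: currentTime }, currentTime));          // false
console.log(isStale({ pid: process.pid, lastHeartbeat: currentTime - 31_000 }, currentTime)); // true
```

Checking the heartbeat as well as the PID covers the case where the operating system has reused a dead worker's PID for an unrelated process.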
Safety Model
| Safety Layer | Protection |
|---|---|
| Feature flag | parallel.enabled: false by default — existing users unaffected |
| Eligibility analysis | Dependency and file overlap checks before starting |
| Worker isolation | Separate processes, worktrees, branches, context windows |
| SF_MILESTONE_LOCK | Each worker only sees its milestone in state derivation |
| SF_PARALLEL_WORKER | Workers cannot spawn nested parallel sessions |
| Budget ceiling | Aggregate cost enforcement across all workers |
| Signal-based shutdown | Graceful stop via file signals + SIGTERM |
| Doctor integration | Detects and cleans up orphaned sessions |
| Conflict-aware merge | Stops on code conflicts, auto-resolves .sf/ state conflicts |
File Layout
.sf/
├── parallel/ # Coordinator ↔ worker IPC
│ ├── M002.status.json # Worker heartbeat + progress
│ ├── M002.signal.json # Coordinator → worker signals
│ ├── M003.status.json
│ └── M003.signal.json
├── worktrees/ # Git worktrees (one per milestone)
│ ├── M002/ # M002's isolated checkout
│ │ ├── .sf/ # M002's own state files
│ │ │ ├── auto.lock
│ │ │ ├── metrics.json
│ │ │ └── milestones/
│ │ └── src/ # M002's working copy
│ └── M003/
│ └── ...
└── ...
Both .sf/parallel/ and .sf/worktrees/ are gitignored — they're runtime-only coordination files that never get committed.
Troubleshooting
"Parallel mode is not enabled"
Set parallel.enabled: true in your preferences file.
"No milestones are eligible for parallel execution"
All milestones are either complete or blocked by dependencies. Check /queue to see milestone status and dependency chains.
Worker crashed — how to recover
Workers persist their state to disk automatically. If a worker process dies, the coordinator detects the dead PID or the expired heartbeat and marks the worker as crashed. On restart, the worker resumes from disk state: crash recovery, worktree re-entry, and completed-unit tracking carry over from the crashed session.
- Run /doctor --fix to clean up stale sessions
- Run /parallel status to see current state
- Re-run /parallel start to spawn new workers for remaining milestones
Merge conflicts after parallel completion
- Run /parallel merge to see which milestones have conflicts
- Resolve conflicts in the worktree at .sf/worktrees/<MID>/
- Retry with /parallel merge <MID>
Workers seem stuck
Check whether the budget ceiling was reached: /parallel status shows per-worker costs. Increase parallel.budget_ceiling or remove it to continue.