Mikael Hugo b5893d1c28 Make SF direct command surface baseline

2026-05-08 01:34:07 +02:00

12 KiB

Raw Permalink Blame History

Parallel Milestone Orchestration

Run multiple milestones simultaneously in isolated git worktrees. Each milestone gets its own worker process, its own branch, and its own context window — while a coordinator tracks progress, enforces budgets, and keeps everything in sync.

Status: Behind parallel.enabled: false by default. Opt-in only — zero impact to existing users.

Quick Start

Enable parallel mode in your preferences:

---
parallel:
  enabled: true
  max_workers: 2
---

Start parallel execution:

/parallel start

SF scans your milestones, checks dependencies and file overlap, shows an eligibility report, and spawns workers for eligible milestones.

Monitor progress:

/parallel status

Stop when done:

/parallel stop

How It Works

Architecture

┌─────────────────────────────────────────────────────────┐
│  Coordinator (your SF session)                         │
│                                                         │
│  Responsibilities:                                      │
│  - Eligibility analysis (deps + file overlap)           │
│  - Worker spawning and lifecycle                        │
│  - Budget tracking across all workers                   │
│  - Signal dispatch (pause/resume/stop)                  │
│  - Session status monitoring                            │
│  - Merge reconciliation                                 │
│                                                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  ...          │
│  │ M001     │  │ M003     │  │ M005     │              │
│  └──────────┘  └──────────┘  └──────────┘              │
│       │              │              │                   │
│       ▼              ▼              ▼                   │
│  .sf/worktrees/ .sf/worktrees/ .sf/worktrees/       │
│  M001/           M003/           M005/                  │
│  (milestone/     (milestone/     (milestone/            │
│   M001 branch)    M003 branch)    M005 branch)          │
└─────────────────────────────────────────────────────────┘

Worker Isolation

Each worker is a separate sf process with complete isolation:

Resource	Isolation Method
Filesystem	Git worktree — each worker has its own checkout
Git branch	`milestone/<MID>` — one branch per milestone
State derivation	`SF_MILESTONE_LOCK` env var — `deriveState()` only sees the assigned milestone
Context window	Separate process — each worker has its own agent sessions
Metrics	Each worktree has its own `.sf/metrics.json`
Crash recovery	Each worktree has its own `.sf/auto.lock`

Coordination

Workers and the coordinator communicate through file-based IPC:

Session status files (.sf/parallel/<MID>.status.json) — workers write heartbeats, the coordinator reads them
Signal files (.sf/parallel/<MID>.signal.json) — coordinator writes signals, workers consume them
Atomic writes — write-to-temp + rename prevents partial reads

Eligibility Analysis

Before starting parallel execution, SF checks which milestones can safely run concurrently.

Rules

Not complete — Finished milestones are skipped
Dependencies satisfied — All dependsOn entries must have status complete
File overlap check — Milestones touching the same files get a warning (but are still eligible)

Example Report

# Parallel Eligibility Report

## Eligible for Parallel Execution (2)

- **M002** — Auth System
  All dependencies satisfied.
- **M003** — Dashboard UI
  All dependencies satisfied.

## Ineligible (2)

- **M001** — Core Types
  Already complete.
- **M004** — API Integration
  Blocked by incomplete dependencies: M002.

## File Overlap Warnings (1)

- **M002** <-> **M003** — 2 shared file(s):
  - `src/types.ts`
  - `src/middleware.ts`

File overlaps are warnings, not blockers. Both milestones work in separate worktrees, so they won't interfere at the filesystem level. Conflicts are detected and resolved during merge.

Configuration

Add to ~/.sf/PREFERENCES.md or .sf/PREFERENCES.md:

---
parallel:
  enabled: false            # Master toggle (default: false)
  max_workers: 2            # Concurrent workers (1-4, default: 2)
  budget_ceiling: 50.00     # Aggregate cost limit in dollars (optional)
  merge_strategy: "per-milestone"  # When to merge: "per-slice" or "per-milestone"
  auto_merge: "confirm"            # "auto", "confirm", or "manual"
---

Configuration Reference

Key	Type	Default	Description
`enabled`	boolean	`false`	Master toggle. Must be `true` for `/parallel` commands to work.
`max_workers`	number (1-4)	`2`	Maximum concurrent worker processes. Higher values use more memory and API budget.
`budget_ceiling`	number	none	Aggregate cost ceiling in USD across all workers. When reached, no new units are dispatched.
`merge_strategy`	`"per-slice"` or `"per-milestone"`	`"per-milestone"`	When worktree changes merge back to main. Per-milestone waits for the full milestone to complete.
`auto_merge`	`"auto"`, `"confirm"`, `"manual"`	`"confirm"`	How merge-back is handled. `confirm` prompts before merging. `manual` requires explicit `/parallel merge`.

Commands

Command	Description
`/parallel start`	Analyze eligibility, confirm, and start workers
`/parallel status`	Show all workers with state, units completed, and cost
`/parallel stop`	Stop all workers (sends SIGTERM)
`/parallel stop M002`	Stop a specific milestone's worker
`/parallel pause`	Pause all workers (finish current unit, then wait)
`/parallel pause M002`	Pause a specific worker
`/parallel resume`	Resume all paused workers
`/parallel resume M002`	Resume a specific worker
`/parallel merge`	Merge all completed milestones back to main
`/parallel merge M002`	Merge a specific milestone back to main

Signal Lifecycle

The coordinator communicates with workers through signals:

Coordinator                    Worker
    │                            │
    ├── sendSignal("pause") ──→  │
    │                            ├── consumeSignal()
    │                            ├── pauseAuto()
    │                            │   (finish current unit, wait)
    │                            │
    ├── sendSignal("resume") ─→  │
    │                            ├── consumeSignal()
    │                            ├── resume dispatch loop
    │                            │
    ├── sendSignal("stop") ───→  │
    │   + SIGTERM ────────────→  │
    │                            ├── consumeSignal() or SIGTERM handler
    │                            ├── stopAuto()
    │                            └── process exits

Workers check for signals between units (in handleAgentEnd). The coordinator also sends SIGTERM for immediate response on stop.

Merge Reconciliation

When milestones complete, their worktree changes need to merge back to main.

Merge Order

Sequential (default): Milestones merge in ID order (M001 before M002)
By-completion: Milestones merge in the order they finish

Conflict Handling

.sf/ state files (STATE.md, metrics.json, etc.) — auto-resolved by accepting the milestone branch version
Code conflicts — stop and report. The merge halts, showing which files conflict. Resolve manually and retry with /parallel merge <MID>.

Example

/parallel merge

# Merge Results

- **M002** — merged successfully (pushed)
- **M003** — CONFLICT (2 file(s)):
  - `src/types.ts`
  - `src/middleware.ts`
  Resolve conflicts manually and run `/parallel merge M003` to retry.

Budget Management

When budget_ceiling is set, the coordinator tracks aggregate cost across all workers:

Cost is summed from each worker's session status
When the ceiling is reached, the coordinator signals workers to stop
Each worker also respects the project-level budget_ceiling preference independently

Health Monitoring

Doctor Integration

/doctor detects parallel session issues:

Stale parallel sessions — Worker process died without cleanup. Doctor finds .sf/parallel/*.status.json files with dead PIDs or expired heartbeats and removes them.

Run /doctor --fix to clean up automatically.

Stale Detection

Sessions are considered stale when:

The worker PID is no longer running (checked via process.kill(pid, 0))
The last heartbeat is older than 30 seconds

The coordinator runs stale detection during refreshWorkerStatuses() and automatically removes dead sessions.

Safety Model

Safety Layer	Protection
Feature flag	`parallel.enabled: false` by default — existing users unaffected
Eligibility analysis	Dependency and file overlap checks before starting
Worker isolation	Separate processes, worktrees, branches, context windows
`SF_MILESTONE_LOCK`	Each worker only sees its milestone in state derivation
`SF_PARALLEL_WORKER`	Workers cannot spawn nested parallel sessions
Budget ceiling	Aggregate cost enforcement across all workers
Signal-based shutdown	Graceful stop via file signals + SIGTERM
Doctor integration	Detects and cleans up orphaned sessions
Conflict-aware merge	Stops on code conflicts, auto-resolves `.sf/` state conflicts

File Layout

.sf/
├── parallel/                    # Coordinator ↔ worker IPC
│   ├── M002.status.json         # Worker heartbeat + progress
│   ├── M002.signal.json         # Coordinator → worker signals
│   ├── M003.status.json
│   └── M003.signal.json
├── worktrees/                   # Git worktrees (one per milestone)
│   ├── M002/                    # M002's isolated checkout
│   │   ├── .sf/                # M002's own state files
│   │   │   ├── auto.lock
│   │   │   ├── metrics.json
│   │   │   └── milestones/
│   │   └── src/                 # M002's working copy
│   └── M003/
│       └── ...
└── ...

Both .sf/parallel/ and .sf/worktrees/ are gitignored — they're runtime-only coordination files that never get committed.

Troubleshooting

"Parallel mode is not enabled"

Set parallel.enabled: true in your preferences file.

"No milestones are eligible for parallel execution"

All milestones are either complete or blocked by dependencies. Check /queue to see milestone status and dependency chains.

Worker crashed — how to recover

Workers now persist their state to disk automatically. If a worker process dies, the coordinator detects the dead PID via heartbeat expiry and marks the worker as crashed. On restart, the worker picks up from disk state — crash recovery, worktree re-entry, and completed-unit tracking carry over from the crashed session.

Run /doctor --fix to clean up stale sessions
Run /parallel status to see current state
Re-run /parallel start to spawn new workers for remaining milestones

Merge conflicts after parallel completion

Run /parallel merge to see which milestones have conflicts
Resolve conflicts in the worktree at .sf/worktrees/<MID>/
Retry with /parallel merge <MID>

Workers seem stuck

Check if budget ceiling was reached: /parallel status shows per-worker costs. Increase parallel.budget_ceiling or remove it to continue.

12 KiB Raw Permalink Blame History