Mikael Hugo 15269f4176 sf snapshot: uncommitted changes after 202m inactivity

2026-05-08 13:31:08 +02:00

21 KiB

SF Agent Mode System

Status: Draft specification. Promoted from copilot-thoughts.md research notes.

Scope: TUI mode surface, command structure, orthogonal state axes, skills, background work, runtime target.

Decision authority: Product + architecture review required before implementation.

1. Problem Statement

SF's old command surface treated mode switching as separate commands rather than persistent states. There was no visible indicator of the current mode, and the /sf prefix positioned SF as a plugin rather than the system itself.

Competitors (Copilot CLI, Factory Droid, Amp) have cleaner mode surfaces with visible state and orthogonal controls. SF has deeper autonomous machinery but weaker presentation.

Goal: Make SF's mode system as obvious as Vim's insert/normal mode indicator, with the control depth of Factory Droid's autonomy levels and the skill system of Amp.

2. Orthogonal State Axes

SF state is five independent axes, not one overloaded "mode."

workMode:          chat | plan | build | review | repair | research
runControl:        manual | assisted | autonomous
permissionProfile: restricted | normal | trusted | unrestricted
modelMode:         fast | smart | deep
surface:           tui | web | headless | rpc

2.1 Axis Definitions

Axis	Question It Answers	Values
`workMode`	What kind of work is SF doing?	`chat`, `plan`, `build`, `review`, `repair`, `research`
`runControl`	Who advances the loop?	`manual` (user), `assisted` (one unit then pause), `autonomous` (continuous)
`permissionProfile`	What may proceed without approval?	`restricted`, `normal`, `trusted`, `unrestricted`
`modelMode`	Speed/cost/reasoning posture?	`fast` (cheap), `smart` (balanced), `deep` (reasoning)
`surface`	How is the user connected?	`tui`, `web`, `headless`, `rpc`
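The axes can be modelled as independent literal unions, so that no axis value constrains another. The following is a minimal TypeScript sketch; the names `AXES`, `SfState`, and `parseState` are hypothetical and not taken from the SF codebase.

```typescript
// Axis names and values come from the table above; everything else is illustrative.
const AXES = {
  workMode: ["chat", "plan", "build", "review", "repair", "research"],
  runControl: ["manual", "assisted", "autonomous"],
  permissionProfile: ["restricted", "normal", "trusted", "unrestricted"],
  modelMode: ["fast", "smart", "deep"],
  surface: ["tui", "web", "headless", "rpc"],
} as const;

type Axis = keyof typeof AXES;

// One field per axis, each typed as the union of that axis's values.
type SfState = { [K in Axis]: (typeof AXES)[K][number] };

// Validate an untyped record (for example, parsed JSON) one axis at a time.
function parseState(input: Record<string, unknown>): SfState {
  for (const axis of Object.keys(AXES) as Axis[]) {
    const allowed: readonly string[] = AXES[axis];
    const value = input[axis];
    if (typeof value !== "string" || !allowed.includes(value)) {
      throw new Error(`invalid ${axis}: ${String(value)}`);
    }
  }
  return input as unknown as SfState;
}
```

Because each axis is checked on its own, a combination such as `research | autonomous | restricted | deep` passes without any cross-axis rule.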

2.2 Example Combinations

plan     | manual     | normal     | deep     → user plans with reasoning model
build    | autonomous | trusted    | smart    → continuous implementation
repair   | assisted   | normal     | smart    → one repair unit at a time
research | autonomous | restricted | deep     → continuous research, read-only
review   | manual     | restricted | deep     → user reviews with reasoning model

2.3 Rules

permissionProfile never implies runControl. Autonomous run with restricted permissions is valid.
runControl never implies permissionProfile. Manual run with unrestricted permissions is valid.
Denylists and safety gates override permissionProfile regardless of value.
Every risk decision logs all five axis values.
sandboxProfile may become a sixth axis later. It is separate from permissionProfile: sandboxing controls process/filesystem/network containment, while permission profile controls what SF may approve.
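As an illustration of the rules above, a risk decision could check the denylist before the permission profile, ignore `runControl` entirely, and record all five axis values. This is a sketch under assumptions: `decideRisk`, `RiskDecision`, and the denylist patterns are hypothetical.

```typescript
type Profile = "restricted" | "normal" | "trusted" | "unrestricted";

interface AxisState {
  workMode: string;
  runControl: string;
  permissionProfile: Profile;
  modelMode: string;
  surface: string;
}

interface RiskDecision {
  action: string;
  allowed: boolean;
  reason: "denylist" | "profile-too-low" | "within-profile";
  axes: AxisState; // all five axis values are logged with every decision
}

const PROFILE_RANK: Record<Profile, number> = {
  restricted: 0,
  normal: 1,
  trusted: 2,
  unrestricted: 3,
};

// Hypothetical denylist patterns, for illustration only.
const DENYLIST: RegExp[] = [/^rm\s+-rf\s+\/(\s|$)/, /^git\s+push\b.*--force\b/];

function decideRisk(action: string, required: Profile, state: AxisState): RiskDecision {
  let reason: RiskDecision["reason"];
  if (DENYLIST.some((pattern) => pattern.test(action))) {
    reason = "denylist"; // overrides every profile, including unrestricted
  } else if (PROFILE_RANK[state.permissionProfile] < PROFILE_RANK[required]) {
    reason = "profile-too-low";
  } else {
    reason = "within-profile";
  }
  // runControl is recorded in the log but never consulted for the decision.
  return { action, allowed: reason === "within-profile", reason, axes: { ...state } };
}
```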

3. Work Modes

3.1 `chat`

Default conversational mode. Questions, explanations, low-commitment exploration. No durable artifacts created without explicit user request.

3.2 `plan`

Research, clarify, write or update specs, derive tasks, and produce an explicit acceptance point before implementation. The primary user journey starts here.

Plan → Build handoff:

plan | manual | normal | deep
accept plan
build | autonomous | selected-permission-profile | smart

Surfaces:

TUI: plan acceptance prompt includes "run autonomously" button
Web: plan acceptance button includes "run autonomously"
Headless: --autonomous chains into direct /autonomous
RPC: machine event records transition explicitly

3.3 `build`

Implement, test, lint, typecheck, verify, and prepare commit-ready changes. This is the default work mode for autonomous runs.

3.4 `review`

Inspect diffs, tests, risks, regressions, security issues, missing evidence. Requires reasoning model (deep).

3.5 `repair`

Fix SF health, repo health, generated/runtime drift, broken generated state, bad command surfaces, failing workflow infrastructure, stale locks, broken installed runtime copies, and failed gates.

Doctor is the diagnostic engine, not the mode. /doctor inspects. /repair switches work mode. repair is a workMode, not a separate subsystem.

Commands:

/doctor          → inspect and report
/doctor fix      → deterministic auto-fix
/doctor heal     → LLM-assisted deep healing
/repair          → switch workMode to repair
/repair --autonomous → repair until clean, blocked, or limit-hit

Auto-transition to repair:

build | autonomous | trusted | smart
→ repair | autonomous | normal | smart

Allowed when:

pre-dispatch health gate fails
installed runtime drift detected
SF cannot dispatch safely
repo workflow state corrupted

Policy: configurable per project. Options: auto, ask, log-only.
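The three policy options could be handled roughly as follows. This is a sketch, not SF's implementation: `onRepairTrigger` and the trigger names are hypothetical, and capping the permission profile at `normal` is an assumption inferred from the single example above (`trusted` becomes `normal`).

```typescript
type RepairPolicy = "auto" | "ask" | "log-only";
type RepairTrigger =
  | "health-gate-failed"
  | "runtime-drift"
  | "unsafe-dispatch"
  | "workflow-state-corrupted";
type Profile = "restricted" | "normal" | "trusted" | "unrestricted";

interface ModeState {
  workMode: string;
  runControl: string;
  permissionProfile: Profile;
  modelMode: string;
}

type Outcome =
  | { kind: "switched"; state: ModeState; trigger: RepairTrigger }
  | { kind: "ask"; proposed: ModeState; trigger: RepairTrigger }
  | { kind: "logged"; state: ModeState; trigger: RepairTrigger };

const RANK: Record<Profile, number> = { restricted: 0, normal: 1, trusted: 2, unrestricted: 3 };

function onRepairTrigger(state: ModeState, trigger: RepairTrigger, policy: RepairPolicy): Outcome {
  // Assumption: the transition never raises permissions and caps them at normal.
  const capped: Profile =
    RANK[state.permissionProfile] > RANK.normal ? "normal" : state.permissionProfile;
  const proposed: ModeState = { ...state, workMode: "repair", permissionProfile: capped };

  switch (policy) {
    case "auto":
      return { kind: "switched", state: proposed, trigger };
    case "ask":
      return { kind: "ask", proposed, trigger }; // caller prompts the user
    case "log-only":
      return { kind: "logged", state, trigger }; // state is left unchanged
  }
}
```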

3.6 `research`

Longer-form codebase, competitor, design, API, or dependency research. Uses web search, local code exploration, cross-repo research, helper agents.

4. Run Control

Value	Behavior
`manual`	User drives every step. Tool calls require approval.
`assisted`	SF executes one unit, then pauses for user review.
`autonomous`	SF continues until done, blocked, interrupted, budget-hit, or limit-hit.

4.1 Commands

/control manual
/control assisted
/control autonomous
/autonomous    → alias for /control autonomous
/next          → alias for /control assisted (one unit)
/pause         → pause autonomous, preserve state
/stop          → stop autonomous, clear state
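The three run-control values differ only in who advances the loop. The sketch below shows one way to express that; `runLoop`, `UnitResult`, and the `awaiting-user` and `paused-for-review` stop reasons are hypothetical, and interruption handling is omitted for brevity.

```typescript
type RunControl = "manual" | "assisted" | "autonomous";
type StopReason =
  | "awaiting-user"
  | "paused-for-review"
  | "done"
  | "blocked"
  | "budget-hit"
  | "limit-hit";

interface UnitResult {
  done: boolean;
  blocked: boolean;
  cost: number;
}

interface Limits {
  budget: number; // maximum total cost
  maxUnits: number; // maximum number of units per run
}

function runLoop(
  control: RunControl,
  runUnit: () => UnitResult,
  limits: Limits,
): { units: number; reason: StopReason } {
  // manual: the user drives every step, so the loop never advances on its own.
  if (control === "manual") return { units: 0, reason: "awaiting-user" };

  let units = 0;
  let spent = 0;
  while (true) {
    const result = runUnit();
    units += 1;
    spent += result.cost;
    if (result.done) return { units, reason: "done" };
    if (result.blocked) return { units, reason: "blocked" };
    // assisted: exactly one unit, then pause for user review.
    if (control === "assisted") return { units, reason: "paused-for-review" };
    // autonomous: continue until a budget or unit limit is reached.
    if (spent >= limits.budget) return { units, reason: "budget-hit" };
    if (units >= limits.maxUnits) return { units, reason: "limit-hit" };
  }
}
```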

4.2 Transition Scopes

Scope	Behavior
`now`	Apply immediately if no tool active. Abort current tool if policy allows.
`after-current-tool`	Finish active tool, then switch.
`after-current-unit`	Finish current SF unit, then switch.
`next-milestone`	Switch after current milestone completes.

Transitions requested during an autonomous run affect future decisions only. They never mutate an active tool call mid-execution.
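One way to honor these scopes is to queue each requested transition and release it at the matching boundary event. The sketch below is illustrative: `TransitionQueue` and the boundary event names are hypothetical, and deferring a `now` request to the end of the active tool when aborting is not allowed is an assumption the spec does not state.

```typescript
type Scope = "now" | "after-current-tool" | "after-current-unit" | "next-milestone";
type Boundary = "tool-finished" | "unit-finished" | "milestone-finished";

interface Transition {
  axis: string;
  value: string;
  scope: Scope;
}

// Boundary event that releases a queued transition for each scope.
const RELEASE_ON: Record<Scope, Boundary | null> = {
  "now": null, // never queued as-is
  "after-current-tool": "tool-finished",
  "after-current-unit": "unit-finished",
  "next-milestone": "milestone-finished",
};

class TransitionQueue {
  private pending: Transition[] = [];

  request(
    t: Transition,
    toolActive: boolean,
    mayAbortTool: boolean,
  ): "applied" | "aborted-then-applied" | "queued" {
    if (t.scope === "now") {
      if (!toolActive) return "applied";
      if (mayAbortTool) return "aborted-then-applied";
      // Assumption: without permission to abort, wait for the tool to finish.
      this.pending.push({ ...t, scope: "after-current-tool" });
      return "queued";
    }
    this.pending.push(t);
    return "queued";
  }

  // Called by the run loop at each boundary; returns the transitions to apply.
  onBoundary(boundary: Boundary): Transition[] {
    const released = this.pending.filter((t) => RELEASE_ON[t.scope] === boundary);
    this.pending = this.pending.filter((t) => !released.includes(t));
    return released;
  }
}
```

Queued transitions are only handed back between tool calls, units, or milestones, so an active tool call is never modified in place.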

4.3 Transition Logging

Every transition persists:

{
  "timestamp": "2026-05-08T10:00:00Z",
  "from": {"workMode": "build", "runControl": "autonomous"},
  "to": {"workMode": "repair", "runControl": "autonomous"},
  "reason": "pre-dispatch health gate failed",
  "scope": "after-current-unit",
  "sessionId": "..."
}

5. Permission Profiles

Profile	Description
`restricted`	Read-only and explicitly allowlisted actions.
`normal`	Safe edits, non-destructive local commands.
`trusted`	Build/test/install/local commits and bounded repo automation.
`unrestricted`	High-risk orchestration only in intentionally trusted environments.

5.1 Enforcement

Permission profile is enforced at three layers:

Tool registry: Each tool declares its required profile. Tools that require a higher profile than the active one are hidden from the model.
Execution gate: Each tool call checks profile at invocation. Violation = error.
Safety harness: Destructive operations (delete, push to production, etc.) require explicit confirmation regardless of profile.
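The three layers can be sketched as a visibility filter plus an invocation check. The `Tool` shape, the function names, and the example tool names are hypothetical; the ordering of profiles follows the table in this section.

```typescript
type Profile = "restricted" | "normal" | "trusted" | "unrestricted";

const RANK: Record<Profile, number> = { restricted: 0, normal: 1, trusted: 2, unrestricted: 3 };

interface Tool {
  name: string;
  requires: Profile; // minimum profile declared by the tool
  destructive: boolean;
}

// Layer 1, tool registry: only tools within the active profile are shown to the model.
function visibleTools(tools: Tool[], active: Profile): Tool[] {
  return tools.filter((tool) => RANK[tool.requires] <= RANK[active]);
}

// Layers 2 and 3: checked again at invocation time, independent of visibility.
function checkInvocation(tool: Tool, active: Profile, confirmed: boolean): void {
  // Layer 2, execution gate: a profile violation is an error.
  if (RANK[tool.requires] > RANK[active]) {
    throw new Error(`${tool.name} requires ${tool.requires}; active profile is ${active}`);
  }
  // Layer 3, safety harness: destructive operations need confirmation at any profile.
  if (tool.destructive && !confirmed) {
    throw new Error(`${tool.name} is destructive and needs explicit confirmation`);
  }
}

// Hypothetical registry entries, for illustration only.
const exampleTools: Tool[] = [
  { name: "read_file", requires: "restricted", destructive: false },
  { name: "edit_file", requires: "normal", destructive: false },
  { name: "delete_branch", requires: "trusted", destructive: true },
];
```

Checking the profile at both the registry and the gate means a tool call that bypasses the filtered list still fails at invocation.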

5.2 Commands

/permission-profile restricted
/permission-profile normal
/permission-profile trusted
/permission-profile unrestricted

6. Model Modes

Mode	Use Case	Routing Hint
`fast`	Small bounded tasks	Cheapest available model
`smart`	Default balanced work	Default routing table
`deep`	Planning, debugging, research, review	Reasoning model (o1, Claude Opus, etc.)

modelMode guides routing. It does not replace explicit /model selection.
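A minimal sketch of that precedence follows. The model identifiers and `resolveModel` are placeholders, not real routing entries.

```typescript
type ModelMode = "fast" | "smart" | "deep";

// Hypothetical routing hints; real entries would come from the routing table.
const ROUTING_HINT: Record<ModelMode, string> = {
  fast: "example-cheap-model",
  smart: "example-default-model",
  deep: "example-reasoning-model",
};

// An explicit /model selection always wins; modelMode only supplies the fallback.
function resolveModel(mode: ModelMode, explicitModel?: string): string {
  return explicitModel ?? ROUTING_HINT[mode];
}
```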

7. Mode Switching UX

7.1 Direct Commands

/mode chat
/mode plan
/mode build
/mode review
/mode repair
/mode research
/control manual
/control assisted
/control autonomous
/permission-profile restricted
/permission-profile normal
/permission-profile trusted
/permission-profile unrestricted
/model-mode fast
/model-mode smart
/model-mode deep

7.2 Combined Forms

/mode repair --autonomous --permission-profile normal
/mode build --autonomous --permission-profile trusted
/mode research --autonomous --permission-profile restricted --model-mode deep
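A combined form can be parsed into a partial state update in one pass. The parser below is a sketch that covers only the flags shown above; `parseModeCommand` and its return shape are hypothetical.

```typescript
const WORK_MODES: readonly string[] = ["chat", "plan", "build", "review", "repair", "research"];
const PROFILES: readonly string[] = ["restricted", "normal", "trusted", "unrestricted"];
const MODEL_MODES: readonly string[] = ["fast", "smart", "deep"];

interface ModeCommand {
  workMode: string;
  runControl?: "autonomous";
  permissionProfile?: string;
  modelMode?: string;
}

// Return the value if it is one of the allowed options, otherwise throw.
function oneOf(allowed: readonly string[], value: string | undefined, label: string): string {
  if (value === undefined || !allowed.includes(value)) {
    throw new Error(`invalid ${label}: ${String(value)}`);
  }
  return value;
}

function parseModeCommand(line: string): ModeCommand {
  const [root, mode, ...rest] = line.trim().split(/\s+/);
  if (root !== "/mode") throw new Error(`not a /mode command: ${String(root)}`);
  const command: ModeCommand = { workMode: oneOf(WORK_MODES, mode, "workMode") };

  for (let i = 0; i < rest.length; i++) {
    const flag = rest[i];
    if (flag === "--autonomous") {
      command.runControl = "autonomous";
    } else if (flag === "--permission-profile") {
      command.permissionProfile = oneOf(PROFILES, rest[++i], "permissionProfile");
    } else if (flag === "--model-mode") {
      command.modelMode = oneOf(MODEL_MODES, rest[++i], "modelMode");
    } else {
      throw new Error(`unknown flag: ${String(flag)}`);
    }
  }
  return command;
}
```

Axes that the command does not mention stay `undefined`, so the caller can leave them at their current values.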

7.3 Autonomous Steering

/steer mode repair
/steer mode review after-current-unit
/steer permission-profile restricted now
/steer model-mode deep for-next-unit

7.4 Keyboard Shortcuts

Shortcut	Action
`Ctrl+Shift+M`	Cycle workMode: chat → plan → build → review → repair → research → chat
`Ctrl+Shift+A`	Set runControl to autonomous
`Ctrl+Shift+S`	Set runControl to assisted (step)
`Ctrl+Shift+I`	Set runControl to manual (interactive)
`Ctrl+Shift+R`	Set workMode to repair
`Ctrl+Shift+P`	Cycle permissionProfile: restricted → normal → trusted → unrestricted → restricted
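Both cycling shortcuts reduce to stepping through an ordered list and wrapping at the end. A small sketch follows; `cycle` is a hypothetical helper name.

```typescript
const WORK_MODE_ORDER = ["chat", "plan", "build", "review", "repair", "research"] as const;
const PROFILE_ORDER = ["restricted", "normal", "trusted", "unrestricted"] as const;

// Return the value after `current`, wrapping to the first value at the end.
// An unknown `current` (indexOf returns -1) restarts at the first value.
function cycle<T>(values: readonly T[], current: T): T {
  if (values.length === 0) throw new Error("nothing to cycle through");
  const index = values.indexOf(current);
  return values[(index + 1) % values.length] as T;
}
```

For example, `cycle(WORK_MODE_ORDER, "research")` wraps back to `"chat"`, matching the `Ctrl+Shift+M` order in the table.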

8. Status and Mode Badge

8.1 Full Status Line

SF  build | autonomous | trusted | smart

8.2 Compact Badge Form

For narrow terminals (< 80 cols):

[B][A][T][S]

8.3 Critical State Labels

When workMode is repair or review, show full labels regardless of width:

repair | autonomous | normal | smart
review | assisted | normal | deep
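The width and critical-state rules from 8.2 and 8.3 could be combined in a single render function. This is a sketch: `renderBadge` is a hypothetical name, and deriving each compact letter from the first character of the value is an assumption based on the `[B][A][T][S]` example.

```typescript
interface BadgeState {
  workMode: string;
  runControl: string;
  permissionProfile: string;
  modelMode: string;
}

function renderBadge(state: BadgeState, columns: number): string {
  const parts = [state.workMode, state.runControl, state.permissionProfile, state.modelMode];
  // repair and review always show full labels, regardless of terminal width.
  const critical = state.workMode === "repair" || state.workMode === "review";
  if (columns >= 80 || critical) return parts.join(" | ");
  // Narrow terminals: one bracketed initial per axis.
  return parts.map((part) => `[${part.charAt(0).toUpperCase()}]`).join("");
}
```

With first-letter initials, `review`, `repair`, and `research` would all abbreviate to `[R]`; the critical-state rule removes the ambiguity for the first two.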

8.4 Badge Placement

Surface	Placement
TUI header	Left side, after "SF" logo
TUI status bar	Bottom line when header hidden
tmux/terminal title	`SF[build
Web	Top bar, color-coded chip

8.5 Badge During Auto Mode

Current code hides header/footer during auto mode (if (isAutoActive()) return []). This must change:

Show minimal header during auto: badge + project name only
Or show badge in dedicated status bar separate from header/footer
Badge color pulses slowly during autonomous execution (subtle animation)

8.6 Badge Colors

Axis	Value	Color
workMode	`chat`	dim
workMode	`plan`	accent
workMode	`build`	success
workMode	`review`	warning
workMode	`repair`	error
workMode	`research`	info
runControl	`manual`	dim
runControl	`assisted`	warning
runControl	`autonomous`	success (pulsing)
permissionProfile	`restricted`	success
permissionProfile	`normal`	dim
permissionProfile	`trusted`	warning
permissionProfile	`unrestricted`	error

9. Background Work Surface (`/tasks`)

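Unified view of all background work. It consolidates the work-inspection views previously scattered across /status, /queue, and /parallel status; those commands remain available for their own purposes (see 9.3).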

9.1 What `/tasks` Shows

autonomous task lifecycle rows
parallel workers
scheduled autonomous dispatches and queued scheduler rows
background shell sessions
stuck or resumable sessions
remote questions waiting for answers
current cost/budget state
last checkpoint and next action

9.2 Data Model

SQLite tables:

-- Durable task state
CREATE TABLE tasks (
  id TEXT PRIMARY KEY,
  work_mode TEXT NOT NULL,
  run_control TEXT NOT NULL,
  permission_profile TEXT NOT NULL,
  model_mode TEXT NOT NULL,
  status TEXT NOT NULL, -- todo | running | verifying | reviewing | done | blocked | paused | failed | cancelled | retrying
  dependency_blockers TEXT, -- JSON array of task IDs
  retry_count INTEGER DEFAULT 0,
  max_retries INTEGER DEFAULT 3,
  checkpoint_ref TEXT, -- git ref or patch file
  cost_budget REAL,
  cost_spent REAL DEFAULT 0,
  created_at TEXT, -- Temporal.Instant ISO
  started_at TEXT,
  completed_at TEXT,
  next_action_at TEXT, -- Temporal.ZonedDateTime for scheduled
  intent_claim TEXT -- for parallel workers: "I will edit src/foo.ts lines 10-50"
);

-- Scheduler state is separate from task lifecycle state
CREATE TABLE task_scheduler (
  task_id TEXT PRIMARY KEY REFERENCES tasks(id),
  status TEXT NOT NULL, -- queued | due | claimed | dispatched | consumed | expired
  due_at TEXT,
  claimed_by TEXT,
  dispatched_at TEXT,
  consumed_at TEXT,
  expires_at TEXT
);

-- Ephemeral running state
CREATE TABLE task_runtime (
  task_id TEXT PRIMARY KEY REFERENCES tasks(id),
  process_pid INTEGER,
  worktree_path TEXT,
  current_model TEXT,
  context_usage_percent REAL,
  last_heartbeat_at TEXT -- Temporal.Instant; the foreign key is already declared inline on task_id
);

-- Transition log
CREATE TABLE task_transitions (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  task_id TEXT NOT NULL,
  from_status TEXT NOT NULL,
  to_status TEXT NOT NULL,
  reason TEXT,
  scope TEXT,
  timestamp TEXT NOT NULL -- Temporal.Instant
);

Parallel workers must stay worktree-isolated and report heartbeat/status into .sf state. Scheduler rows may use queued; task lifecycle rows use todo.

9.3 Complementary Commands

/tasks does not replace:

/status → project health dashboard
/queue → milestone/slice dispatch order
/parallel status → parallel orchestrator detail
/session-report → cost/token summary
/logs → activity logs
/forensics → execution forensics

10. Skills System

10.1 Directory Structure

.agents/skills/<skill-name>/
  SKILL.md          -- skill definition with YAML frontmatter
  scripts/          -- supporting scripts
  schemas/          -- JSON schemas for inputs/outputs
  checklists/       -- verification checklists
  mcp.json          -- MCP server config if applicable

10.2 Skill Frontmatter

---
name: forge-command-surface
description: Use when changing SF slash commands, browser command parity, or headless command dispatch.
user-invocable: true
model-invocable: true
side-effects: code-edits
permission-profile: normal
---

Fields:

name: unique identifier
description: when to use this skill
user-invocable: can user explicitly invoke?
model-invocable: can model auto-invoke when relevant?
side-effects: none, code-edits, production-mutation, etc.
permission-profile: minimum profile required

10.3 Skill Categories

Type	Example	`model-invocable`
Background knowledge	`forge-autonomous-runtime`	true
User tool	`production-deploy`	false
Shared capability	`forge-command-surface`	true

Dangerous skills (production-mutation) are never model-invoked by default.
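The frontmatter fields from 10.2 and the rule above can be combined into a simple invocation check. The sketch below uses a deliberately minimal `key: value` parser rather than a full YAML library; `parseSkillFrontmatter`, `SkillMeta`, and `mayModelInvoke` are hypothetical names.

```typescript
interface SkillMeta {
  name: string;
  description: string;
  userInvocable: boolean;
  modelInvocable: boolean;
  sideEffects: string;
  permissionProfile: string;
}

// Minimal parser for flat "key: value" frontmatter between two --- lines.
function parseSkillFrontmatter(text: string): SkillMeta {
  const lines = text.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") throw new Error("missing opening ---");
  const fields = new Map<string, string>();
  let closed = false;
  for (const line of lines.slice(1)) {
    if (line.trim() === "---") {
      closed = true;
      break;
    }
    const colon = line.indexOf(":");
    if (colon > 0) fields.set(line.slice(0, colon).trim(), line.slice(colon + 1).trim());
  }
  if (!closed) throw new Error("missing closing ---");

  const need = (key: string): string => {
    const value = fields.get(key);
    if (value === undefined) throw new Error(`missing field: ${key}`);
    return value;
  };
  return {
    name: need("name"),
    description: need("description"),
    userInvocable: need("user-invocable") === "true",
    modelInvocable: need("model-invocable") === "true",
    sideEffects: need("side-effects"),
    permissionProfile: need("permission-profile"),
  };
}

// Default policy: production-mutation skills are never model-invoked.
function mayModelInvoke(skill: SkillMeta): boolean {
  return skill.modelInvocable && skill.sideEffects !== "production-mutation";
}
```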

10.4 Auto-Creation Flow

Detect repeated repo-specific evidence (same files, commands, failure modes, rules)
Propose skill in manual/restricted contexts
Generate/update automatically only when policy allows
Record repeated source evidence in .sf state
Keep narrow, linted, and evaled like code
Commit with repo when accepted

10.5 Skill Eval Cases

Every auto-created skill needs eval cases:

.agents/skills/<skill-name>/evals/
  case-1/
    task.md          -- user-like prompt
    grader.js        -- deterministic checker
    hidden/          -- reference answers (not visible to agent)
    work/            -- agent workspace

Graders inspect: files, artifacts, answer.json, trace.jsonl, result state. Failed trials preserve workspace for debugging.

11. Command Surfaces

11.1 Human Slash Commands

SF registers direct command roots only:

/status
/autonomous
/doctor
/rate
/session-report
/parallel
/remote
/tasks

/sf is not a command root. TUI and browser command parity tests reject it so compatibility shims do not grow back.

/remote is a full-session steering surface. Remote answers may change workMode, runControl, permissionProfile, and modelMode; they are not limited to question delivery.

11.2 Shell Surface

Machine surface remains prefixed:

sf headless autonomous
sf headless --autonomous ...

The shell prefix is the executable name, not an interactive slash-command namespace.

12. Runtime Target: Node 26

12.1 Policy

current compatibility floor: Node 26.1+
internal target runtime:     Node 26.1
canonical baseline:          Node 26.1
Node 25:                     skip except quick probes

12.2 Why Node 26

Temporal enabled by default
V8 14.6 baseline
Undici 8 HTTP/fetch baseline
Removes legacy APIs, hardens against old assumptions

12.3 Temporal Adoption

Store semantic type, not just formatted string:

Concept	Temporal Type	Use Case
Exact instant	`Temporal.Instant`	Journal events, checkpoints, lock leases
Local time	`Temporal.ZonedDateTime`	Reminders, schedules, audits
Calendar date	`Temporal.PlainDate`	Daily reports, milestone reviews
Wall-clock time	`Temporal.PlainTime`	Recurring policies
Time amount	`Temporal.Duration`	Budgets, leases, cooldowns, retry delays

12.4 Adoption Priority

sf schedule — highest user-visible impact
Lock/lease — highest operational correctness
Journals/traces — highest debugging impact
Session reports — nice to have
Background tasks — future work

12.5 Gate

node@26 --version
npm run lint
npm run typecheck:extensions
npm run build
npm test
sf --version
sf --help
sf --print "ping"

13. Implementation Pull-Through

13.1 Already Directionally Right

UOK lifecycle records carry runControl
UOK lifecycle records carry permissionProfile
Schedule command state uses autonomous_dispatch
DB-backed state, recovery, verification, scheduling, captures, forensics
Skills and project-specific skill paths exist
Parallel orchestration and remote-question infrastructure

13.2 Still Needed

Priority	Item	Effort
P2	Decide whether `sandboxProfile` becomes a sixth persisted axis	Medium
P2	Remove `/sf` from docs/web/tests (Phase 2 deprecation)	Small

13.3 Recently Completed (This Session)

Priority	Item	Status
P1	Centralized metrics system (`metrics-central.js`)	✓
P1	Cost command (`/cost`) with DB + ledger queries	✓
P1	Explicit stage commands (`/research`, `/plan`, `/implement`)	✓
P2	Reasoning assist foundation (`reasoning-assist.js`)	✓
P2	Self-feedback → workMode auto-transition	✓
P2	UOK events carry workMode + modelMode	✓
P2	`/sf` prefix deprecation warning (Phase 1)	✓

13.4 Completed

Priority	Item	Status
P0	Make mode state durable in SQLite	✓
P0	Add direct `/mode`, `/control`, `/permission-profile`, `/model-mode` commands	✓
P0	Add visible mode badge to TUI header/status bar	✓
P1	Make `--autonomous` chain into direct `/autonomous`	✓
P1	Expose autonomous continuation limits in settings and status	✓
P1	Add `/tasks` backed by DB execution graph state	✓
P1	Make `repair` first-class workflow over `doctor`	✓
P1	Enhanced `/steer` with mode/permission-profile/model-mode transitions	✓
P1	TUI keyboard shortcuts for mode cycling (Ctrl+Shift+M/R/A/S/P)	✓
P1	Minimal auto-mode header/footer (badge visible during autonomy)	✓
P1	Remove `/sf` namespace registration and parity-test against fallback	✓
P1	Parallel worker intent/claim registry backed by UOK SQLite coordination	✓
P1	Skill eval harness foundation	✓
P1	Terminal title mode indicator	✓
P2	Policy-aware project skill suggestion/generation with DB cooldown	✓
P2	Schema-backed task frontmatter (risk, mutation, verification, approval)	✓
P2	Subagent provider/model/permission inheritance audit and guard	✓
P2	Remote steering as full-session surface from remote answers	✓

14. Open Questions

Should plan mode show badge [P] or plan text in full?
Should paused autonomous show previous badge dimmed, or [P] for paused?
Should mode be per-session or per-project? (Current: per-session)
Should badge appear in tmux/terminal window titles?
Should mode transitions have sound/notification?
Should repair auto-transition be ask by default for new projects?
Should skill eval cases run in CI or only on-demand?
Should /tasks be a TUI overlay or a separate scrollable panel?
Should reasoning assist call a fast model automatically, or only prepare prompts for now?

15. References

GitHub Docs, "Allowing GitHub Copilot CLI to work autonomously" — https://docs.github.com/en/copilot/concepts/agents/copilot-cli/autopilot
Factory Droid, "Autonomy Level" — https://docs.factory.ai/cli/user-guides/auto-run
Amp manual — https://ampcode.com/manual
Smelt (mode cycling) — https://github.com/leonardcser/smelt
ORCH (task state machine) — https://github.com/oxgeneral/ORCH
AgentPlane (schema-first tasks) — https://github.com/basilisk-labs/agentplane
Relay (channels and tickets) — https://github.com/jcast90/relay
Sage (runtime-neutral orchestration) — https://github.com/youwangd/SageCLI
Wit (symbol-level locks) — https://github.com/amaar-mc/wit
