singularity-forge/docs/user-docs/troubleshooting.md
2026-05-08 03:01:20 +02:00

Troubleshooting

/doctor

The built-in diagnostic tool validates .sf/ integrity:

/doctor

It checks:

  • File structure and naming conventions
  • Roadmap ↔ slice ↔ task referential integrity
  • Completion state consistency
  • Git worktree health (worktree and branch modes only — skipped in none mode)
  • Stale lock files and orphaned runtime records

Common Issues

Autonomous mode loops on the same unit

Symptoms: The same unit (e.g., research-slice or plan-slice) dispatches repeatedly until hitting the dispatch limit.

Causes:

  • Stale cache after a crash — the in-memory file listing doesn't reflect new artifacts
  • The LLM didn't produce the expected artifact file

Fix: Run /doctor to repair state, then resume with /autonomous. If the issue persists, check that the expected artifact file exists on disk.

Autonomous mode stops with "Loop detected"

Cause: A unit failed to produce its expected artifact twice in a row.

Fix: Check the task plan for clarity. If the plan is ambiguous, refine it manually, then run /autonomous to resume.

Wrong files in worktree

Symptoms: Planning artifacts or code appear in the wrong directory.

Cause: The LLM wrote to the main repo instead of the worktree.

Fix: This was fixed in v2.14+. If you're on an older version, update. The dispatch prompt now includes explicit working directory instructions.

command not found: sf after install

Symptoms: npm install -g singularity-forge succeeds but sf isn't found.

Cause: npm's global bin directory isn't in your shell's $PATH.

Fix:

# Find where npm installed the binary
npm prefix -g

# Add the bin directory to your PATH if missing
echo 'export PATH="$(npm prefix -g)/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

Workaround: Run npx singularity-forge or $(npm prefix -g)/bin/sf directly.

Common causes:

  • Version manager (nvm, fnm, mise) — global bin is version-specific; ensure your version manager initializes in your shell config
  • oh-my-zsh — the gitfast plugin aliases sf to git svn dcommit. Check with alias sf and unalias if needed
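A quick way to confirm the PATH cause is to test whether npm's global bin directory appears in $PATH. The sketch below assumes a POSIX shell and the Unix-style layout used above (binaries under `$(npm prefix -g)/bin`); `path_contains` is a helper name introduced here for illustration, not part of SF.

```shell
# Return 0 if directory $1 is an entry in $PATH.
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Only run the check when npm is installed.
if command -v npm >/dev/null 2>&1; then
  npm_bin="$(npm prefix -g)/bin"
  if path_contains "$npm_bin"; then
    echo "OK: $npm_bin is on PATH"
  else
    echo "Missing from PATH: $npm_bin"
  fi
fi

# Show whether 'sf' resolves to a binary, an alias, or a function.
type sf 2>/dev/null || echo "sf is not resolvable in this shell"
```

Paste this into the interactive shell where the problem occurs rather than running it as a script, since aliases are normally loaded only in interactive sessions.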

npm install -g singularity-forge fails

Common causes:

  • Missing workspace packages — fixed in v2.10.4+
  • postinstall hangs on Linux (Playwright --with-deps triggering sudo) — fixed in v2.3.6+
  • Node.js version too old — requires ≥ 24.0.0
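To rule out the Node.js version cause, compare the installed major version against the minimum listed above. This is a minimal sketch; `node_major` is a hypothetical helper that only parses the string printed by `node --version`.

```shell
# Print the major version from a string such as "v24.1.0" or "24.1.0".
node_major() {
  ver="${1#v}"          # drop the leading "v" that `node --version` prints
  echo "${ver%%.*}"     # keep everything before the first dot
}

required=24
if command -v node >/dev/null 2>&1; then
  major="$(node_major "$(node --version)")"
  if [ "$major" -ge "$required" ]; then
    echo "Node $major satisfies >= $required"
  else
    echo "Node $major is too old; need >= $required"
  fi
else
  echo "node not found on PATH"
fi
```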

Provider errors during autonomous mode

Symptoms: Autonomous mode pauses with a provider error (rate limit, server error, auth failure).

How SF handles it (v2.26):

| Error type | Auto-resume? | Delay |
|---|---|---|
| Rate limit (429, "too many requests") | Yes | retry-after header or 60s |
| Server error (500, 502, 503, "overloaded") | Yes | 30s |
| Auth/billing ("unauthorized", "invalid key") | No | Manual resume |
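The table above can be read as a small decision function. The sketch below is illustrative only and is not SF's implementation: `resume_delay` is a hypothetical name, it looks only at the HTTP status (not at message text such as "overloaded"), and treating every unlisted status as a manual resume is an assumption.

```shell
# Illustrative only: map an HTTP status to the pause described in the table.
# Prints the delay in seconds, or "manual" when no auto-resume applies.
# $1 = HTTP status, $2 = optional Retry-After value in seconds
resume_delay() {
  case "$1" in
    429)          echo "${2:-60}" ;;   # Retry-After if provided, else 60s
    500|502|503)  echo 30 ;;
    *)            echo manual ;;       # assumption: everything else needs a manual resume
  esac
}
```

For example, `resume_delay 429 12` prints 12, while `resume_delay 401` prints manual.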

For transient errors, SF pauses briefly and resumes automatically. For permanent errors, configure fallback models:

models:
  execution:
    model: claude-sonnet-4-6
    fallbacks:
      - openrouter/minimax/minimax-m2.5

Machine surface: sf headless autonomous restarts the process automatically on crash (default 3 attempts with exponential backoff). Combined with provider error recovery, this enables true overnight unattended execution.

For common provider setup issues (role errors, streaming errors, model ID mismatches), see the Provider Setup Guide — Common Pitfalls.

Budget ceiling reached

Symptoms: Autonomous mode pauses with "Budget ceiling reached."

Fix: Increase budget_ceiling in preferences, or switch to the budget token profile to reduce per-unit cost, then resume with /autonomous.

Stale lock file

Symptoms: Autonomous mode won't start, says another session is running.

Fix: SF automatically detects stale locks — if the owning PID is dead, the lock is cleaned up and re-acquired on the next /autonomous. This includes stranded .sf.lock/ directories left by proper-lockfile after crashes. If automatic recovery fails, delete .sf/auto.lock and the .sf.lock/ directory manually:

rm -f .sf/auto.lock
rm -rf "$(dirname .sf)/.sf.lock"
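Before deleting the lock by hand, it is worth confirming that the owning process is really gone. The sketch below assumes you already know the PID (for example from `ps`); it does not read `.sf/auto.lock`, because the lock file's format is not described here. `pid_alive` is a hypothetical helper.

```shell
# Return 0 if a process with PID $1 exists and we are allowed to signal it.
pid_alive() {
  kill -0 "$1" 2>/dev/null
}

# Hypothetical usage: pass the PID you believe owns the lock
# (defaults to this shell's own PID so the snippet runs as-is).
owner_pid="${1:-$$}"
if pid_alive "$owner_pid"; then
  echo "PID $owner_pid is still running; do not delete the lock"
else
  echo "PID $owner_pid is gone; the lock is stale"
fi
```

Note that `kill -0` also fails for a live process owned by another user, so treat a "gone" result with care on shared machines.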

Git merge conflicts

Symptoms: Worktree merge fails on .sf/ files.

Fix: SF auto-resolves conflicts on .sf/ runtime files. For content conflicts in code files, the LLM is given an opportunity to resolve them via a fix-merge session. If that fails, manual resolution is needed.

Pre-dispatch says the milestone integration branch no longer exists

Symptoms: Autonomous mode or /doctor reports that a milestone recorded an integration branch that no longer exists in git.

What it means: The milestone's .sf/milestones/<MID>/<MID>-META.json still points at the branch that was active when the milestone started, but that branch has since been renamed or deleted.

Current behavior:

  • If SF can deterministically recover to a safe branch, it no longer hard-stops autonomous mode.
  • Safe fallbacks are:
    • explicit git.main_branch when configured and present
    • the repo's detected default integration branch (for example main or master)
  • In that case /doctor reports a warning and /doctor fix rewrites the stale metadata to the effective branch.
  • SF still blocks when no safe fallback branch can be determined.

Fix:

  • Run /doctor fix to rewrite the stale milestone metadata automatically when the fallback is obvious.
  • If SF still blocks, recreate the missing branch or update your git preferences so git.main_branch points at a real branch.
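To check by hand which branches are available as a fallback, standard git plumbing is enough. This is a sketch using hypothetical helper names; it shows one common way to test for a branch and read the remote's default branch, and is not necessarily how SF detects the default integration branch.

```shell
# Return 0 if local branch $1 exists in the current repository.
branch_exists() {
  git show-ref --verify --quiet "refs/heads/$1"
}

# Print the remote's default branch (for example "main") if origin/HEAD is set.
default_branch() {
  ref="$(git symbolic-ref --quiet --short refs/remotes/origin/HEAD 2>/dev/null)" || return 1
  echo "${ref#origin/}"
}
```

For example, `branch_exists main && echo "main is present"` confirms that a value you intend to set for git.main_branch points at a real branch.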

Transient EBUSY / EPERM / EACCES while writing .sf/ files

Symptoms: Autonomous mode or /doctor occasionally fails while updating .sf/ files with errors like EBUSY, EPERM, or EACCES.

Cause: Antivirus, indexers, editors, or filesystem watchers can briefly lock the destination or temp file just as SF performs the atomic rename.

Current behavior: SF now retries those transient rename failures with a short bounded backoff before surfacing an error. The retry is intentionally limited so genuine filesystem problems still fail loudly instead of hanging forever.

Fix:

  • Re-run the operation; most transient lock races clear quickly.
  • If the error persists, close tools that may be holding the file open and then retry.
  • If repeated failures continue, run /doctor to confirm the repo state is still healthy and report the exact path + error code.
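The bounded-retry idea described above can be sketched in a few lines of shell. The attempt count and the doubling delay are assumptions chosen for illustration; SF's actual retry limits are not stated here, and `retry_bounded` is a hypothetical helper.

```shell
# Run a command up to $1 times, doubling the pause between attempts.
# Usage: retry_bounded <max_attempts> <command> [args...]
retry_bounded() {
  max="$1"; shift
  attempt=1
  delay=1
  while :; do
    "$@" && return 0                       # success: stop retrying
    [ "$attempt" -ge "$max" ] && return 1  # out of attempts: fail loudly
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

A call such as `retry_bounded 4 mv state.tmp state.json` (file names are placeholders) gives up after four attempts, so a genuine filesystem problem still surfaces as an error.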

Node v26 web boot failure

Symptoms: sf --web fails with ERR_UNSUPPORTED_NODE_MODULES_TYPE_STRIPPING on Node v26.

Cause: Node v26 changed type-stripping behavior for node_modules, breaking the Next.js web build.

Fix: Fixed in v2.42.0+ (#1864). Upgrade to the latest version.

Orphan web server process

Symptoms: sf --web fails because port 3000 is already in use, even though no SF session is running.

Cause: A previous web server process was not cleaned up on exit.

Fix: Fixed in v2.42.0+. SF now cleans up stale web server processes automatically. If you're on an older version, kill the orphan process manually: lsof -ti:3000 | xargs kill.

Non-JS project blocked by worktree health check

Symptoms: Worktree health check fails or blocks autonomous mode in projects that don't use Node.js (e.g., Rust, Go, Python).

Cause: The worktree health check only recognized JavaScript ecosystems prior to v2.42.0.

Fix: Fixed in v2.42.0+ (#1860). The health check now supports 17+ ecosystems. Upgrade to the latest version.

German/non-English locale git errors

Symptoms: Git commands fail or produce unexpected results when the system locale is non-English (e.g., German).

Cause: SF parsed git output assuming English locale strings.

Fix: Fixed in v2.42.0+. All git commands now force LC_ALL=C to ensure consistent English output regardless of system locale.
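The same technique is useful in your own scripts that parse git output. A minimal sketch, with `git_c` as a hypothetical wrapper name:

```shell
# Run git with the C locale so its messages are always untranslated English.
# The assignment applies only to this one git invocation.
git_c() {
  LC_ALL=C git "$@"
}
```

Calling `git_c status` then produces English messages even when LANG is set to a non-English locale such as de_DE.UTF-8.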

MCP Client Issues

mcp_servers shows no configured servers

Symptoms: mcp_servers reports no servers configured.

Common causes:

  • No .mcp.json or .sf/mcp.json file exists in the current project
  • The config file is malformed JSON
  • The server is configured in a different project directory than the one where you launched SF

Fix:

  • Add the server to .mcp.json or .sf/mcp.json
  • Verify the file parses as JSON
  • Re-run mcp_servers(refresh=true)
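To check the "parses as JSON" step from the command line, something like the following works. It is a sketch that only tests syntax, not whether the contents match the schema SF expects; `is_valid_json` is a hypothetical helper that uses node when available and falls back to python3.

```shell
# Return 0 if file $1 exists and parses as JSON.
# Uses node when available, otherwise falls back to python3.
is_valid_json() {
  [ -f "$1" ] || return 1
  if command -v node >/dev/null 2>&1; then
    node -e 'JSON.parse(require("fs").readFileSync(process.argv[1], "utf8"))' "$1" >/dev/null 2>&1
  elif command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$1" >/dev/null 2>&1
  else
    echo "need node or python3 to validate JSON" >&2
    return 2
  fi
}

# Check both config locations mentioned above, relative to the current directory.
for f in .mcp.json .sf/mcp.json; do
  if [ ! -f "$f" ]; then
    echo "$f: not present"
  elif is_valid_json "$f"; then
    echo "$f: valid JSON"
  else
    echo "$f: does not parse as JSON"
  fi
done
```

Run it from the directory where you launch SF, since a config in a different project directory is one of the causes listed above.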

mcp_discover times out

Symptoms: mcp_discover fails with a timeout.

Common causes:

  • The server process starts but never completes the MCP handshake
  • The configured command points to a script that hangs on startup
  • The server is waiting on an unavailable dependency or backend service

Fix:

  • Run the configured command directly outside SF and confirm the server actually starts
  • Check that any backend URLs or required services are reachable
  • For local custom servers, verify the implementation is using an MCP SDK or a correct stdio protocol implementation

mcp_discover reports connection closed

Symptoms: mcp_discover fails immediately with a connection-closed error.

Common causes:

  • Wrong executable path
  • Wrong script path
  • Missing runtime dependency
  • The server crashes before responding

Fix:

  • Verify command and args paths are correct and absolute
  • Run the command manually to catch import/runtime errors
  • Check that the configured interpreter or runtime exists on the machine

mcp_call fails because required arguments are missing

Symptoms: A discovered MCP tool exists, but calling it fails validation because required fields are missing.

Common causes:

  • The call shape is wrong
  • The target server's tool schema changed
  • You're calling a stale server definition or stale branch build

Fix:

  • Re-run mcp_discover(server="name") and confirm the exact required argument names
  • Call the tool with mcp_call(server="name", tool="tool_name", args={...})
  • If you're developing SF itself, rebuild after schema changes with npm run build

Local stdio server works manually but not in SF

Symptoms: Running the server command manually seems fine, but SF can't connect.

Common causes:

  • The server depends on shell state that SF doesn't inherit
  • Relative paths only work from a different working directory
  • Required environment variables exist in your shell but not in the MCP config

Fix:

  • Use absolute paths for command and script arguments
  • Set required environment variables in the MCP config's env block
  • If needed, set cwd explicitly in the server definition
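One way to reproduce this class of failure outside SF is to run the server command with a nearly empty environment and a neutral working directory. This sketch deliberately strips more than SF probably does (the environment SF passes to servers is not specified here), so a server that starts under `run_clean`, a hypothetical helper, does not depend on your shell state.

```shell
# Run a command with an almost-empty environment and "/" as the working
# directory, to expose hidden dependencies on shell state.
# Usage: run_clean <command> [args...]
run_clean() {
  ( cd / && env -i PATH="/usr/bin:/bin" HOME="${HOME:-/}" "$@" )
}
```

If `run_clean /abs/path/to/server` (path is a placeholder) fails where the plain command succeeds, the error message usually names the missing variable or relative path that belongs in the env or cwd settings.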

Session lock stolen by /next in another terminal

Symptoms: Running /next (assisted mode) in a second terminal causes a running autonomous mode session to lose its lock.

Fix: Fixed in v2.36.0. Bare /next no longer steals the session lock from a running autonomous mode session. Upgrade to the latest version.

Worktree commits landing on main instead of milestone branch

Symptoms: Autonomous mode commits in a worktree end up on main instead of the milestone/<MID> branch.

Fix: Fixed in v2.37.1. CWD is now realigned before dispatch and stale merge state is cleaned on failure. Upgrade to the latest version.

Extension loader fails with subpath export error

Symptoms: Extension fails to load with a Cannot find module error referencing npm subpath exports.

Cause: Dynamic imports in the extension loader didn't resolve npm subpath exports (e.g., @pkg/foo/bar).

Fix: Fixed in v2.38+. The extension loader now auto-resolves npm subpath exports and creates a node_modules symlink for dynamic import resolution. Upgrade to the latest version.

Recovery Procedures

Reset autonomous mode state

rm .sf/auto.lock
rm .sf/completed-units.json

Then run /autonomous to restart from the current on-disk state.

Reset routing history

If adaptive model routing is producing bad results, clear the routing history:

rm .sf/routing-history.json

Full state rebuild

/doctor

Doctor derives current state from the DB-backed runtime model when available, regenerates projections such as STATE.md, and fixes detected inconsistencies. File-based plan and roadmap parsing is only a recovery path for unmigrated or damaged state.

Getting Help

  • GitHub Issues: github.com/singularity-ng/singularity-forge/issues
  • Dashboard: Ctrl+Alt+G or /status for real-time diagnostics
  • Forensics: /forensics for structured post-mortem analysis of autonomous mode failures
  • Session logs: .sf/activity/ contains JSONL session dumps for crash forensics

Database Issues

"SF database is not available"

Symptoms: sf_decision_save, sf_requirement_update, or sf_summary_save fail with this error.

Cause: The SQLite database wasn't initialized. This happens in manual /next sessions (non-autonomous mode) on versions before v2.29.

Fix: v2.29+ auto-initializes the database on the first tool call. Upgrade to the latest version.

Verification Issues

Verification gate fails with shell syntax error

Symptoms: stderr: /bin/sh: 1: Syntax error: "(" unexpected during verification checks.

Cause: A description-like string (e.g., All 10 checks pass (build, lint)) was treated as a shell command. This can happen when task plans have verify: fields with prose instead of actual commands.

Fix: v2.29+ filters preference commands through isLikelyCommand(). Ensure verification_commands in preferences contains only valid shell commands, not descriptions.
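You can screen your own verification entries with a rough heuristic before SF runs them. The sketch below is not SF's isLikelyCommand(), whose logic is not described here; `looks_like_command` is a hypothetical helper that simply checks whether the first word resolves to something the shell can run.

```shell
# Return 0 if the first word of string $1 resolves to a runnable command,
# builtin, or shell keyword in the current shell.
looks_like_command() {
  first="${1%% *}"                      # text before the first space
  command -v "$first" >/dev/null 2>&1
}
```

With this, `looks_like_command "ls -la"` succeeds, while prose such as "All 10 checks pass (build, lint)" fails unless a program named All happens to be installed.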

LSP (Language Server Protocol)

"LSP isn't available in this workspace"

SF auto-detects language servers based on project files (e.g. package.json → TypeScript, Cargo.toml → Rust, go.mod → Go). If no servers are detected, the agent skips LSP features.

Check status:

lsp status

This shows which servers are active. If none are found, it diagnoses why, including which project markers were detected and which server commands are missing.

Common fixes:

| Project type | Install command |
|---|---|
| TypeScript/JavaScript | npm install -g typescript-language-server typescript |
| Python | pip install pyright or pip install python-lsp-server |
| Rust | rustup component add rust-analyzer |
| Go | go install golang.org/x/tools/gopls@latest |

After installing, run lsp reload to restart detection without restarting SF.
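If a server still is not detected after installing, check that its executable is actually on PATH. The sketch below uses the binary names these packages normally install; which names SF itself probes for is an assumption, and `report_servers` is a hypothetical helper.

```shell
# Print whether each named executable can be found on PATH.
# Usage: report_servers <name> [name...]
report_servers() {
  for server in "$@"; do
    if command -v "$server" >/dev/null 2>&1; then
      echo "found:   $server"
    else
      echo "missing: $server"
    fi
  done
}

report_servers typescript-language-server pyright-langserver pylsp rust-analyzer gopls
```

A "missing" result for a server you just installed usually means its install directory (for example Go's bin directory or a version manager's global bin) is not on PATH in the shell that launched SF.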

Verify: After applying either fix, test with:

terminal-notifier -title "SF" -message "working!" -sound Glass