Troubleshooting
/sf doctor
The built-in diagnostic tool validates .sf/ integrity:
/sf doctor
It checks:
- File structure and naming conventions
- Roadmap ↔ slice ↔ task referential integrity
- Completion state consistency
- Git worktree health (worktree and branch modes only — skipped in none mode)
- Stale lock files and orphaned runtime records
Common Issues
Auto mode loops on the same unit
Symptoms: The same unit (e.g., research-slice or plan-slice) dispatches repeatedly until hitting the dispatch limit.
Causes:
- Stale cache after a crash — the in-memory file listing doesn't reflect new artifacts
- The LLM didn't produce the expected artifact file
Fix: Run /sf doctor to repair state, then resume with /sf autonomous. If the issue persists, check that the expected artifact file exists on disk.
Auto mode stops with "Loop detected"
Cause: A unit failed to produce its expected artifact twice in a row.
Fix: Check the task plan for clarity. If the plan is ambiguous, refine it manually, then run /sf autonomous to resume.
Wrong files in worktree
Symptoms: Planning artifacts or code appear in the wrong directory.
Cause: The LLM wrote to the main repo instead of the worktree.
Fix: Fixed in v2.14+. The dispatch prompt now includes explicit working directory instructions. If you're on an older version, update.
command not found: sf after install
Symptoms: npm install -g singularity-forge succeeds but sf isn't found.
Cause: npm's global bin directory isn't in your shell's $PATH.
Fix:
# Find where npm installed the binary
npm prefix -g
# Add the bin directory to your PATH if missing
echo 'export PATH="$(npm prefix -g)/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
Workaround: Run npx singularity-forge or $(npm prefix -g)/bin/sf directly.
Common causes:
- Version manager (nvm, fnm, mise) — global bin is version-specific; ensure your version manager initializes in your shell config
- oh-my-zsh: the gitfast plugin aliases sf to git svn dcommit. Check with alias sf and run unalias sf if needed
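To confirm the PATH diagnosis before editing your shell config, a small check like the following can help. This is a minimal sketch: path_contains is a hypothetical helper name, and it assumes npm places global binaries in the bin directory under npm prefix -g, as in the fix above.

```shell
#!/bin/sh
# Return 0 if the directory given as $1 is an entry in $PATH, 1 otherwise.
# (path_contains is an illustrative helper, not part of SF or npm.)
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report whether npm's global bin directory is reachable (skipped if npm is absent).
if command -v npm >/dev/null 2>&1; then
  npm_bin="$(npm prefix -g)/bin"
  if path_contains "$npm_bin"; then
    echo "on PATH: $npm_bin"
  else
    echo "missing from PATH: $npm_bin"
  fi
fi
```

If the directory is reported as on PATH and sf still resolves to something else, the alias case in the list above is the more likely culprit.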
npm install -g singularity-forge fails
Common causes:
- Missing workspace packages (fixed in v2.10.4+)
- postinstall hangs on Linux (Playwright --with-deps triggering sudo) (fixed in v2.3.6+)
- Node.js version too old (requires ≥ 24.0.0)
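To rule out the Node.js version cause quickly, you can compare the major version against the stated minimum. A minimal sketch, where node_major_at_least is a hypothetical helper and only the major component is compared:

```shell
#!/bin/sh
# Succeed if a Node.js version string such as "v24.1.0" has a major version >= $2.
# (node_major_at_least is an illustrative helper, not part of SF.)
node_major_at_least() {
  version="${1#v}"        # strip the leading "v", if any
  major="${version%%.*}"  # keep everything before the first dot
  [ "$major" -ge "$2" ]
}

# Check the installed Node.js against the documented minimum of 24 (skipped if node is absent).
if command -v node >/dev/null 2>&1; then
  if node_major_at_least "$(node --version)" 24; then
    echo "Node.js $(node --version) is new enough"
  else
    echo "Node.js $(node --version) is too old; SF requires >= 24.0.0"
  fi
fi
```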
Provider errors during autonomous mode
Symptoms: Auto mode pauses with a provider error (rate limit, server error, auth failure).
How SF handles it (v2.26):
| Error type | Auto-resume? | Delay |
|---|---|---|
| Rate limit (429, "too many requests") | ✅ Yes | retry-after header or 60s |
| Server error (500, 502, 503, "overloaded") | ✅ Yes | 30s |
| Auth/billing ("unauthorized", "invalid key") | ❌ No | Manual resume |
For transient errors, SF pauses briefly and resumes automatically. For permanent errors, configure fallback models:
models:
  execution:
    model: claude-sonnet-4-6
    fallbacks:
      - openrouter/minimax/minimax-m2.5
Headless mode: sf headless autonomous auto-restarts the entire process on crash (default 3 attempts with exponential backoff). Combined with provider error auto-resume, this enables true overnight unattended execution.
For common provider setup issues (role errors, streaming errors, model ID mismatches), see the Provider Setup Guide — Common Pitfalls.
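The auto-resume table in this section can be read as a simple classification rule. The sketch below only illustrates that table; it is not SF's implementation. resume_delay is a hypothetical function, matching on message substrings is a simplification, and treating unrecognized errors as manual is an assumption the table does not state.

```shell
#!/bin/sh
# Print the auto-resume delay in seconds for an error message, or "manual"
# when the table says a manual resume is needed.
# $1 = error text, $2 = optional retry-after value in seconds
# (illustrative only; the substring matching is deliberately naive)
resume_delay() {
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *429*|*"too many requests"*)    echo "${2:-60}" ;;  # rate limit: retry-after or 60s
    *500*|*502*|*503*|*overloaded*) echo 30 ;;          # server error: 30s
    *unauthorized*|*"invalid key"*) echo manual ;;      # auth/billing: no auto-resume
    *)                              echo manual ;;      # assumption: unknown errors need a human
  esac
}

resume_delay "HTTP 429 Too Many Requests"      # prints 60
resume_delay "HTTP 429 Too Many Requests" 12   # prints 12
resume_delay "503 Service Unavailable"         # prints 30
resume_delay "401 Unauthorized"                # prints manual
```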
Budget ceiling reached
Symptoms: Auto mode pauses with "Budget ceiling reached."
Fix: Increase budget_ceiling in preferences, or switch to the budget token profile to reduce per-unit cost, then resume with /sf autonomous.
Stale lock file
Symptoms: Auto mode won't start, says another session is running.
Fix: SF automatically detects stale locks — if the owning PID is dead, the lock is cleaned up and re-acquired on the next /sf autonomous. This includes stranded .sf.lock/ directories left by proper-lockfile after crashes. If automatic recovery fails, delete .sf/auto.lock and the .sf.lock/ directory manually:
rm -f .sf/auto.lock
rm -rf "$(dirname .sf)/.sf.lock"
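Before deleting a lock by hand, it is worth confirming that no SF process is still running. Stale-lock detection hinges on whether the owning PID is alive, and the sketch below shows that liveness test. It assumes you already know the PID, since this guide does not describe how SF records it in the lock. pid_alive is a hypothetical helper, not an SF command.

```shell
#!/bin/sh
# Succeed if a process with PID $1 exists and the current user may signal it.
# Signal 0 performs the permission and existence checks without sending anything.
# (pid_alive is an illustrative helper, not part of SF.)
pid_alive() {
  kill -0 "$1" 2>/dev/null
}

# The current shell is always alive, so this prints the message.
pid_alive "$$" && echo "PID $$ is alive"
```

A process owned by another user also fails this test, so treat a negative result as "probably gone" rather than proof.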
Git merge conflicts
Symptoms: Worktree merge fails on .sf/ files.
Fix: SF auto-resolves conflicts on .sf/ runtime files. For content conflicts in code files, the LLM is given an opportunity to resolve them via a fix-merge session. If that fails, manual resolution is needed.
Pre-dispatch says the milestone integration branch no longer exists
Symptoms: Auto mode or /sf doctor reports that a milestone recorded an integration branch that no longer exists in git.
What it means: The milestone's .sf/milestones/<MID>/<MID>-META.json still points at the branch that was active when the milestone started, but that branch has since been renamed or deleted.
Current behavior:
- If SF can deterministically recover to a safe branch, it no longer hard-stops autonomous mode.
- Safe fallbacks are:
  - explicit git.main_branch when configured and present
  - the repo's detected default integration branch (for example main or master)
- In that case /sf doctor reports a warning and /sf doctor fix rewrites the stale metadata to the effective branch.
- SF still blocks when no safe fallback branch can be determined.
Fix:
- Run /sf doctor fix to rewrite the stale milestone metadata automatically when the fallback is obvious.
- If SF still blocks, recreate the missing branch or update your git preferences so git.main_branch points at a real branch.
Transient EBUSY / EPERM / EACCES while writing .sf/ files
Symptoms: Autonomous mode or doctor occasionally fails while updating .sf/ files with errors like EBUSY, EPERM, or EACCES.
Cause: Antivirus, indexers, editors, or filesystem watchers can briefly lock the destination or temp file just as SF performs the atomic rename.
Current behavior: SF now retries those transient rename failures with a short bounded backoff before surfacing an error. The retry is intentionally limited so genuine filesystem problems still fail loudly instead of hanging forever.
Fix:
- Re-run the operation; most transient lock races clear quickly.
- If the error persists, close tools that may be holding the file open and then retry.
- If repeated failures continue, run /sf doctor to confirm the repo state is still healthy and report the exact path + error code.
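The bounded-retry behavior described in this section follows a common pattern: try the rename, back off briefly, and give up after a fixed number of attempts. The sketch below shows the pattern in shell. It is not SF's code; retry_bounded is a hypothetical helper, and the attempt count and delays are arbitrary.

```shell
#!/bin/sh
# Run a command up to $1 times. Sleep $2 seconds after the first failure and
# double the delay after each further failure. Returns 0 on the first success
# and 1 once the attempts are exhausted.
# (retry_bounded is an illustrative helper, not part of SF.)
retry_bounded() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0
    [ "$n" -ge "$attempts" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    n=$((n + 1))
  done
}

# Example: retry a rename of a temp file up to 3 times, starting with a 1s delay.
tmp_dir=$(mktemp -d)
echo '{}' > "$tmp_dir/STATE.tmp"
retry_bounded 3 1 mv "$tmp_dir/STATE.tmp" "$tmp_dir/STATE.json" && echo "renamed"
rm -rf "$tmp_dir"
```

Capping the attempts is what lets a genuine filesystem problem surface as an error instead of a hang.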
Node v24 web boot failure
Symptoms: sf --web fails with ERR_UNSUPPORTED_NODE_MODULES_TYPE_STRIPPING on Node v24.
Cause: Node v24 changed type-stripping behavior for node_modules, breaking the Next.js web build.
Fix: Fixed in v2.42.0+ (#1864). Upgrade to the latest version.
Orphan web server process
Symptoms: sf --web fails because port 3000 is already in use, even though no SF session is running.
Cause: A previous web server process was not cleaned up on exit.
Fix: Fixed in v2.42.0+. SF now cleans up stale web server processes automatically. If you're on an older version, kill the orphan process manually: lsof -ti:3000 | xargs kill.
Non-JS project blocked by worktree health check
Symptoms: Worktree health check fails or blocks autonomous mode in projects that don't use Node.js (e.g., Rust, Go, Python).
Cause: The worktree health check only recognized JavaScript ecosystems prior to v2.42.0.
Fix: Fixed in v2.42.0+ (#1860). The health check now supports 17+ ecosystems. Upgrade to the latest version.
German/non-English locale git errors
Symptoms: Git commands fail or produce unexpected results when the system locale is non-English (e.g., German).
Cause: SF parsed git output assuming English locale strings.
Fix: Fixed in v2.42.0+. All git commands now force LC_ALL=C to ensure consistent English output regardless of system locale.
MCP Client Issues
mcp_servers shows no configured servers
Symptoms: mcp_servers reports no servers configured.
Common causes:
- No .mcp.json or .sf/mcp.json file exists in the current project
- The config file is malformed JSON
- The server is configured in a different project directory than the one where you launched SF
Fix:
- Add the server to .mcp.json or .sf/mcp.json
- Verify the file parses as JSON
- Re-run mcp_servers(refresh=true)
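The "parses as JSON" step can be checked from the shell. A minimal sketch using Node.js, which SF already requires: json_ok is a hypothetical helper, and it only checks that the file parses, not that its contents match what SF expects.

```shell
#!/bin/sh
# Succeed if the file at $1 parses as JSON. With "node -e", the first extra
# argument is available as process.argv[1].
# (json_ok is an illustrative helper, not part of SF.)
json_ok() {
  node -e 'JSON.parse(require("fs").readFileSync(process.argv[1], "utf8"))' "$1" 2>/dev/null
}

# Check both config locations from the project root.
for f in .mcp.json .sf/mcp.json; do
  if [ ! -f "$f" ]; then
    echo "$f: not found"
  elif json_ok "$f"; then
    echo "$f: valid JSON"
  else
    echo "$f: malformed JSON (or node is not installed)"
  fi
done
```

Run it from the directory where you launch SF, since a config in a different project directory is one of the causes listed above.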
mcp_discover times out
Symptoms: mcp_discover fails with a timeout.
Common causes:
- The server process starts but never completes the MCP handshake
- The configured command points to a script that hangs on startup
- The server is waiting on an unavailable dependency or backend service
Fix:
- Run the configured command directly outside SF and confirm the server actually starts
- Check that any backend URLs or required services are reachable
- For local custom servers, verify the implementation is using an MCP SDK or a correct stdio protocol implementation
mcp_discover reports connection closed
Symptoms: mcp_discover fails immediately with a connection-closed error.
Common causes:
- Wrong executable path
- Wrong script path
- Missing runtime dependency
- The server crashes before responding
Fix:
- Verify command and args paths are correct and absolute
- Run the command manually to catch import/runtime errors
- Check that the configured interpreter or runtime exists on the machine
mcp_call fails because required arguments are missing
Symptoms: A discovered MCP tool exists, but calling it fails validation because required fields are missing.
Common causes:
- The call shape is wrong
- The target server's tool schema changed
- You're calling a stale server definition or stale branch build
Fix:
- Re-run mcp_discover(server="name") and confirm the exact required argument names
- Call the tool with mcp_call(server="name", tool="tool_name", args={...})
- If you're developing SF itself, rebuild after schema changes with npm run build
Local stdio server works manually but not in SF
Symptoms: Running the server command manually seems fine, but SF can't connect.
Common causes:
- The server depends on shell state that SF doesn't inherit
- Relative paths only work from a different working directory
- Required environment variables exist in your shell but not in the MCP config
Fix:
- Use absolute paths for command and script arguments
- Set required environment variables in the MCP config's env block
- If needed, set cwd explicitly in the server definition
Session lock stolen by /sf in another terminal
Symptoms: Running /sf (step mode) in a second terminal causes a running autonomous mode session to lose its lock.
Fix: Fixed in v2.36.0. Bare /sf no longer steals the session lock from a running autonomous mode session. Upgrade to the latest version.
Worktree commits landing on main instead of milestone branch
Symptoms: Autonomous mode commits in a worktree end up on main instead of the milestone/<MID> branch.
Fix: Fixed in v2.37.1. CWD is now realigned before dispatch and stale merge state is cleaned on failure. Upgrade to the latest version.
Extension loader fails with subpath export error
Symptoms: Extension fails to load with a Cannot find module error referencing npm subpath exports.
Cause: Dynamic imports in the extension loader didn't resolve npm subpath exports (e.g., @pkg/foo/bar).
Fix: Fixed in v2.38+. The extension loader now auto-resolves npm subpath exports and creates a node_modules symlink for dynamic import resolution. Upgrade to the latest version.
Recovery Procedures
Reset autonomous mode state
rm .sf/auto.lock
rm .sf/completed-units.json
Then run /sf autonomous to restart from current disk state.
Reset routing history
If adaptive model routing is producing bad results, clear the routing history:
rm .sf/routing-history.json
Full state rebuild
/sf doctor
Doctor rebuilds STATE.md from plan and roadmap files on disk and fixes detected inconsistencies.
Getting Help
- GitHub Issues: github.com/singularity-ng/singularity-forge/issues
- Dashboard: Ctrl+Alt+G or /sf status for real-time diagnostics
- Forensics: /sf forensics for structured post-mortem analysis of autonomous mode failures
- Session logs: .sf/activity/ contains JSONL session dumps for crash forensics
Database Issues
"SF database is not available"
Symptoms: sf_decision_save, sf_requirement_update, or sf_summary_save fail with this error.
Cause: The SQLite database wasn't initialized. This happens in manual /sf sessions (non-autonomous mode) on versions before v2.29.
Fix: Updated in v2.29+ to auto-initialize the database on first tool call. Upgrade to the latest version.
Verification Issues
Verification gate fails with shell syntax error
Symptoms: stderr: /bin/sh: 1: Syntax error: "(" unexpected during verification checks.
Cause: A description-like string (e.g., All 10 checks pass (build, lint)) was treated as a shell command. This can happen when task plans have verify: fields with prose instead of actual commands.
Fix: Updated in v2.29+ to filter preference commands through isLikelyCommand(). Ensure verification_commands in preferences contains only valid shell commands, not descriptions.
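To screen your own verify strings before SF runs them, you can ask the shell to parse each one without executing it. This is a rough sketch and not how isLikelyCommand() works internally (this guide does not describe that). parses_as_shell is a hypothetical helper built on the shell's -n (no-exec) option.

```shell
#!/bin/sh
# Succeed if $1 parses as shell syntax. The -n option reads the command
# without executing it, so nothing is run.
# (parses_as_shell is an illustrative helper, not part of SF.)
parses_as_shell() {
  sh -n -c "$1" 2>/dev/null
}

# Compare a real command with the prose example from this section.
for candidate in 'npm run build && npm test' 'All 10 checks pass (build, lint)'; do
  if parses_as_shell "$candidate"; then
    echo "parses:   $candidate"
  else
    echo "rejected: $candidate"
  fi
done
```

This only catches prose that breaks shell syntax, such as the unquoted parentheses above. A plain phrase like "All checks pass" still parses as a command name with arguments, so a clean result does not prove the string is a real command.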
LSP (Language Server Protocol)
"LSP isn't available in this workspace"
SF auto-detects language servers based on project files (e.g. package.json → TypeScript, Cargo.toml → Rust, go.mod → Go). If no servers are detected, the agent skips LSP features.
Check status:
lsp status
This shows which servers are active and, if none are found, diagnoses why, including which project markers were detected and which server commands are missing.
Common fixes:
| Project type | Install command |
|---|---|
| TypeScript/JavaScript | npm install -g typescript-language-server typescript |
| Python | pip install pyright or pip install python-lsp-server |
| Rust | rustup component add rust-analyzer |
| Go | go install golang.org/x/tools/gopls@latest |
After installing, run lsp reload to restart detection without restarting SF.
Verify: After applying either fix, test with:
terminal-notifier -title "SF" -message "working!" -sound Glass