# Troubleshooting

## `/doctor`

The built-in diagnostic tool validates `.sf/` integrity:

```
/doctor
```

It checks:

- File structure and naming conventions
- Roadmap ↔ slice ↔ task referential integrity
- Completion state consistency
- Git worktree health (worktree and branch modes only; skipped in none mode)
- Stale lock files and orphaned runtime records

## Common Issues

### Autonomous mode loops on the same unit

**Symptoms:** The same unit (e.g., `research-slice` or `plan-slice`) dispatches repeatedly until hitting the dispatch limit.

**Causes:**

- Stale cache after a crash: the in-memory file listing doesn't reflect new artifacts
- The LLM didn't produce the expected artifact file

**Fix:** Run `/doctor` to repair state, then resume with `/autonomous`. If the issue persists, check that the expected artifact file exists on disk.

### Autonomous mode stops with "Loop detected"

**Cause:** A unit failed to produce its expected artifact twice in a row.

**Fix:** Check the task plan for clarity. If the plan is ambiguous, refine it manually, then run `/autonomous` to resume.

### Wrong files in worktree

**Symptoms:** Planning artifacts or code appear in the wrong directory.

**Cause:** The LLM wrote to the main repo instead of the worktree.

**Fix:** This was fixed in v2.14+. If you're on an older version, update. The dispatch prompt now includes explicit working directory instructions.

### `command not found: sf` after install

**Symptoms:** `npm install -g singularity-forge` succeeds but `sf` isn't found.

**Cause:** npm's global bin directory isn't in your shell's `$PATH`.

**Fix:**

```bash
# Find where npm installed the binary
npm prefix -g

# Add the bin directory to your PATH if missing
echo 'export PATH="$(npm prefix -g)/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

**Workaround:** Run `npx singularity-forge` or `$(npm prefix -g)/bin/sf` directly.
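To confirm the diagnosis before editing your shell config, you can check whether npm's global bin directory is actually listed in `$PATH`. The following is a minimal sketch, not part of SF: `path_contains` is a hypothetical helper name, and the script assumes a Unix-like layout where global binaries live under `<prefix>/bin`.

```bash
# Hypothetical helper (not an SF command): succeeds if directory $1
# appears as an exact entry in the colon-separated PATH-style string $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run the diagnosis only when npm itself is available.
if command -v npm >/dev/null 2>&1; then
  # Assumption: Unix-like layout, global binaries under <prefix>/bin.
  npm_bin="$(npm prefix -g)/bin"
  if path_contains "$npm_bin" "$PATH"; then
    echo "OK: $npm_bin is on PATH"
  else
    echo "Missing: $npm_bin is not on PATH"
  fi
fi
```

If the script reports the directory as missing, the `export PATH=...` fix above applies. If it reports OK and `sf` is still not found, look at the other causes listed for this issue.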
**Common causes:**

- **Version manager (nvm, fnm, mise):** global bin is version-specific; ensure your version manager initializes in your shell config
- **oh-my-zsh:** the `gitfast` plugin aliases `sf` to `git svn dcommit`. Check with `alias sf` and unalias if needed

### `npm install -g singularity-forge` fails

**Common causes:**

- Missing workspace packages; fixed in v2.10.4+
- `postinstall` hangs on Linux (Playwright `--with-deps` triggering sudo); fixed in v2.3.6+
- Node.js version too old; requires ≥ 24.0.0

### Provider errors during autonomous mode

**Symptoms:** Autonomous mode pauses with a provider error (rate limit, server error, auth failure).

**How SF handles it (v2.26):**

| Error type | Auto-resume? | Delay |
|-----------|-------------|-------|
| Rate limit (429, "too many requests") | ✅ Yes | retry-after header or 60s |
| Server error (500, 502, 503, "overloaded") | ✅ Yes | 30s |
| Auth/billing ("unauthorized", "invalid key") | ❌ No | Manual resume |

For transient errors, SF pauses briefly and resumes automatically. For permanent errors, configure fallback models:

```yaml
models:
  execution:
    model: claude-sonnet-4-6
    fallbacks:
      - openrouter/minimax/minimax-m2.5
```

**Machine surface:** `sf headless autonomous` restarts the process automatically on crash (default 3 attempts with exponential backoff). Combined with provider error recovery, this enables true overnight unattended execution.

For common provider setup issues (role errors, streaming errors, model ID mismatches), see the [Provider Setup Guide: Common Pitfalls](./providers.md#common-pitfalls).

### Budget ceiling reached

**Symptoms:** Autonomous mode pauses with "Budget ceiling reached."

**Fix:** Increase `budget_ceiling` in preferences, or switch to `budget` token profile to reduce per-unit cost, then resume with `/autonomous`.

### Stale lock file

**Symptoms:** Autonomous mode won't start, says another session is running.
**Fix:** SF automatically detects stale locks: if the owning PID is dead, the lock is cleaned up and re-acquired on the next `/autonomous`. This includes stranded `.sf.lock/` directories left by `proper-lockfile` after crashes. If automatic recovery fails, delete `.sf/auto.lock` and the `.sf.lock/` directory manually:

```bash
rm -f .sf/auto.lock
rm -rf "$(dirname .sf)/.sf.lock"
```

### Git merge conflicts

**Symptoms:** Worktree merge fails on `.sf/` files.

**Fix:** SF auto-resolves conflicts on `.sf/` runtime files. For content conflicts in code files, the LLM is given an opportunity to resolve them via a fix-merge session. If that fails, manual resolution is needed.

### Pre-dispatch says the milestone integration branch no longer exists

**Symptoms:** Autonomous mode or `/doctor` reports that a milestone recorded an integration branch that no longer exists in git.

**What it means:** The milestone's `.sf/milestones//-META.json` still points at the branch that was active when the milestone started, but that branch has since been renamed or deleted.

**Current behavior:**

- If SF can deterministically recover to a safe branch, it no longer hard-stops autonomous mode.
- Safe fallbacks are:
  - explicit `git.main_branch` when configured and present
  - the repo's detected default integration branch (for example `main` or `master`)
- In that case `/doctor` reports a warning and `/doctor fix` rewrites the stale metadata to the effective branch.
- SF still blocks when no safe fallback branch can be determined.

**Fix:**

- Run `/doctor fix` to rewrite the stale milestone metadata automatically when the fallback is obvious.
- If SF still blocks, recreate the missing branch or update your git preferences so `git.main_branch` points at a real branch.

### Transient `EBUSY` / `EPERM` / `EACCES` while writing `.sf/` files

**Symptoms:** Autonomous mode or doctor occasionally fails while updating `.sf/` files with errors like `EBUSY`, `EPERM`, or `EACCES`.
**Cause:** Antivirus, indexers, editors, or filesystem watchers can briefly lock the destination or temp file just as SF performs the atomic rename.

**Current behavior:** SF now retries those transient rename failures with a short bounded backoff before surfacing an error. The retry is intentionally limited so genuine filesystem problems still fail loudly instead of hanging forever.

**Fix:**

- Re-run the operation; most transient lock races clear quickly.
- If the error persists, close tools that may be holding the file open and then retry.
- If repeated failures continue, run `/doctor` to confirm the repo state is still healthy and report the exact path + error code.

### Node v26 web boot failure

**Symptoms:** `sf --web` fails with `ERR_UNSUPPORTED_NODE_MODULES_TYPE_STRIPPING` on Node v26.

**Cause:** Node v26 changed type-stripping behavior for `node_modules`, breaking the Next.js web build.

**Fix:** Fixed in v2.42.0+ (#1864). Upgrade to the latest version.

### Orphan web server process

**Symptoms:** `sf --web` fails because port 3000 is already in use, even though no SF session is running.

**Cause:** A previous web server process was not cleaned up on exit.

**Fix:** Fixed in v2.42.0+. SF now cleans up stale web server processes automatically. If you're on an older version, kill the orphan process manually: `lsof -ti:3000 | xargs kill`.

### Non-JS project blocked by worktree health check

**Symptoms:** Worktree health check fails or blocks autonomous mode in projects that don't use Node.js (e.g., Rust, Go, Python).

**Cause:** The worktree health check only recognized JavaScript ecosystems prior to v2.42.0.

**Fix:** Fixed in v2.42.0+ (#1860). The health check now supports 17+ ecosystems. Upgrade to the latest version.

### German/non-English locale git errors

**Symptoms:** Git commands fail or produce unexpected results when the system locale is non-English (e.g., German).

**Cause:** SF parsed git output assuming English locale strings.

**Fix:** Fixed in v2.42.0+.
All git commands now force `LC_ALL=C` to ensure consistent English output regardless of system locale.

## MCP Client Issues

### `mcp_servers` shows no configured servers

**Symptoms:** `mcp_servers` reports no servers configured.

**Common causes:**

- No `.mcp.json` or `.sf/mcp.json` file exists in the current project
- The config file is malformed JSON
- The server is configured in a different project directory than the one where you launched SF

**Fix:**

- Add the server to `.mcp.json` or `.sf/mcp.json`
- Verify the file parses as JSON
- Re-run `mcp_servers(refresh=true)`

### `mcp_discover` times out

**Symptoms:** `mcp_discover` fails with a timeout.

**Common causes:**

- The server process starts but never completes the MCP handshake
- The configured command points to a script that hangs on startup
- The server is waiting on an unavailable dependency or backend service

**Fix:**

- Run the configured command directly outside SF and confirm the server actually starts
- Check that any backend URLs or required services are reachable
- For local custom servers, verify the implementation is using an MCP SDK or a correct stdio protocol implementation

### `mcp_discover` reports connection closed

**Symptoms:** `mcp_discover` fails immediately with a connection-closed error.

**Common causes:**

- Wrong executable path
- Wrong script path
- Missing runtime dependency
- The server crashes before responding

**Fix:**

- Verify `command` and `args` paths are correct and absolute
- Run the command manually to catch import/runtime errors
- Check that the configured interpreter or runtime exists on the machine

### `mcp_call` fails because required arguments are missing

**Symptoms:** A discovered MCP tool exists, but calling it fails validation because required fields are missing.
**Common causes:**

- The call shape is wrong
- The target server's tool schema changed
- You're calling a stale server definition or stale branch build

**Fix:**

- Re-run `mcp_discover(server="name")` and confirm the exact required argument names
- Call the tool with `mcp_call(server="name", tool="tool_name", args={...})`
- If you're developing SF itself, rebuild after schema changes with `npm run build`

### Local stdio server works manually but not in SF

**Symptoms:** Running the server command manually seems fine, but SF can't connect.

**Common causes:**

- The server depends on shell state that SF doesn't inherit
- Relative paths only work from a different working directory
- Required environment variables exist in your shell but not in the MCP config

**Fix:**

- Use absolute paths for `command` and script arguments
- Set required environment variables in the MCP config's `env` block
- If needed, set `cwd` explicitly in the server definition

### Session lock stolen by `/next` in another terminal

**Symptoms:** Running `/next` (assisted mode) in a second terminal causes a running autonomous mode session to lose its lock.

**Fix:** Fixed in v2.36.0. Bare `/next` no longer steals the session lock from a running autonomous mode session. Upgrade to the latest version.

### Worktree commits landing on main instead of milestone branch

**Symptoms:** Autonomous mode commits in a worktree end up on `main` instead of the `milestone/` branch.

**Fix:** Fixed in v2.37.1. CWD is now realigned before dispatch and stale merge state is cleaned on failure. Upgrade to the latest version.

### Extension loader fails with subpath export error

**Symptoms:** Extension fails to load with a `Cannot find module` error referencing npm subpath exports.

**Cause:** Dynamic imports in the extension loader didn't resolve npm subpath exports (e.g., `@pkg/foo/bar`).

**Fix:** Fixed in v2.38+.
The extension loader now auto-resolves npm subpath exports and creates a `node_modules` symlink for dynamic import resolution. Upgrade to the latest version.

## Recovery Procedures

### Reset autonomous mode state

```bash
rm .sf/auto.lock
rm .sf/completed-units.json
```

Then run `/autonomous` to restart from the current disk state.

### Reset routing history

If adaptive model routing is producing bad results, clear the routing history:

```bash
rm .sf/routing-history.json
```

### Full state rebuild

```
/doctor
```

Doctor derives current state from the DB-backed runtime model when available, regenerates projections such as `STATE.md`, and fixes detected inconsistencies. File-based plan and roadmap parsing is only a recovery path for unmigrated or damaged state.

## Getting Help

- **GitHub Issues:** [github.com/singularity-ng/singularity-forge/issues](https://github.com/singularity-ng/singularity-forge/issues)
- **Dashboard:** `Ctrl+Alt+G` or `/status` for real-time diagnostics
- **Forensics:** `/forensics` for structured post-mortem analysis of autonomous mode failures
- **Session logs:** `.sf/activity/` contains JSONL session dumps for crash forensics

## Database Issues

### "SF database is not available"

**Symptoms:** `sf_decision_save`, `sf_requirement_update`, or `sf_summary_save` fail with this error.

**Cause:** The SQLite database wasn't initialized. This happens in manual `/next` sessions (non-autonomous mode) on versions before v2.29.

**Fix:** Updated in v2.29+ to auto-initialize the database on first tool call. Upgrade to the latest version.

## Verification Issues

### Verification gate fails with shell syntax error

**Symptoms:** `stderr: /bin/sh: 1: Syntax error: "(" unexpected` during verification checks.

**Cause:** A description-like string (e.g., `All 10 checks pass (build, lint)`) was treated as a shell command. This can happen when task plans have `verify:` fields with prose instead of actual commands.
**Fix:** Updated in v2.29+ to filter preference commands through `isLikelyCommand()`. Ensure `verification_commands` in preferences contains only valid shell commands, not descriptions.

## LSP (Language Server Protocol)

### "LSP isn't available in this workspace"

SF auto-detects language servers based on project files (e.g. `package.json` → TypeScript, `Cargo.toml` → Rust, `go.mod` → Go). If no servers are detected, the agent skips LSP features.

**Check status:**

```
lsp status
```

This shows which servers are active and, if none are found, diagnoses why, including which project markers were detected and which server commands are missing.

**Common fixes:**

| Project type | Install command |
|-------------|-----------------|
| TypeScript/JavaScript | `npm install -g typescript-language-server typescript` |
| Python | `pip install pyright` or `pip install python-lsp-server` |
| Rust | `rustup component add rust-analyzer` |
| Go | `go install golang.org/x/tools/gopls@latest` |

After installing, run `lsp reload` to restart detection without restarting SF.

**Verify:** After applying either fix, test with:

```bash
terminal-notifier -title "SF" -message "working!" -sound Glass
```