ADR-013: Network and remote-execution layer
Date: 2026-04-29
Status: proposed (deferred; capture for staged execution)
Context
sf today runs as a single daemon per host. Three forces push it toward a multi-host topology:
- SSH workers: the orchestrator dispatches unit attempts to remote hosts (GPU, Windows, parallel scaling) — needs an SSH-served worker process.
- Singularity Memory remote-mode (ADR-012, ADR-014): the cross-instance knowledge layer runs as a service on the tailnet, reachable from SF and other clients.
- Multi-instance federation (ADR-012): future federated agents and benchmarks ride the same network substrate.
This ADR fixes the network and SSH-execution layer the above all depend on.
Decision
- Network substrate: tailnet, meaning the Tailscale wire protocol with Headscale as the self-hosted control plane (the user already runs Headscale at mikki-bunker). sf core is wire-agnostic; it assumes addressable, authenticated peers.
- SSH worker host stack: Go + charmbracelet/wish + charmbracelet/x/xpty (Linux/macOS) and charmbracelet/x/conpty (Windows). One thin Go shim per worker host; the orchestrator (TS) talks SSH stdio to it.
- Worker observability: charmbracelet/promwish. Prometheus middleware mounted on Wish gives /metrics for free.
- Worker identity: charmbracelet/x/sshkey + charmbracelet/melt, for auto-provisioning plus Ed25519-with-seed-words backup.
Alternatives Considered
Network substrate
- Public internet + sshd + manual key management — works, but key sprawl is a real problem (each new host adds N×M keys), and dynamic IPs break stable hostnames. Tailnet's MagicDNS + ACLs replace both. Rejected.
- Plain WireGuard mesh — no control plane; manual peer config. Higher ops overhead than Headscale. Rejected.
- Tailscale-the-service — fine, but Headscale is already running and self-hosted means full ownership. Rejected.
- ZeroTier / Netbird: viable alternatives. Rejected because the user already runs Headscale, so switching would add cost with nothing to gain.
SSH worker stack
- Node-based SSH server (ssh2 lib): keeps everything in TS but reinvents what Wish gives for free; no battle-tested middleware patterns. Rejected.
- OpenSSH sshd with ForceCommand: works for simple cases, terrible for multiplexed agent dispatch with per-connection state. Rejected.
- Plain Go golang.org/x/crypto/ssh: lower-level than Wish, no middleware, no built-in metrics. Rejected; Wish wraps the right primitives.
Consequences
Positive
- sf's network model is explicit: tailnet first, ACLs in Headscale's admin, no per-service auth invention.
- SSH worker host inherits Wish's mature middleware (wish/logging, wish/elapsed, etc.) and promwish observability.
- Cross-platform pty support (xpty on Linux/macOS, conpty on Windows) lets workers spawn real ttys for the agent; this is load-bearing for Windows-only test runs on mikki-bunker-windows.
- Stable hostnames via Headscale's MagicDNS: mikki-bunker.tailnet.ts.hugo.dk resolves regardless of network change.
- Identity story is clean: each worker host has its own Ed25519 keypair (sshkey), backed up via melt seed words.
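The seed-word backup rests on a property of Ed25519: the whole key pair is derived deterministically from a 32-byte seed, so restoring the seed restores the worker's identity. The following is a minimal Go sketch of that property using only the standard library. It does not call sshkey or melt, and the helper names (newWorkerSeed, restoreKey, sameIdentity) are hypothetical.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/rand"
	"errors"
)

// newWorkerSeed returns 32 random bytes. In the ADR's design this is the
// material that seed words would encode (assumption: the encoding itself
// is left to melt and is not reproduced here).
func newWorkerSeed() ([]byte, error) {
	seed := make([]byte, ed25519.SeedSize)
	if _, err := rand.Read(seed); err != nil {
		return nil, err
	}
	return seed, nil
}

// restoreKey rebuilds the full key pair from a seed. The same seed always
// yields the same key pair.
func restoreKey(seed []byte) (ed25519.PublicKey, ed25519.PrivateKey, error) {
	if len(seed) != ed25519.SeedSize {
		return nil, nil, errors.New("seed must be exactly 32 bytes")
	}
	priv := ed25519.NewKeyFromSeed(seed)
	pub := priv.Public().(ed25519.PublicKey)
	return pub, priv, nil
}

// sameIdentity reports whether two public keys identify the same worker.
func sameIdentity(a, b ed25519.PublicKey) bool {
	return bytes.Equal(a, b)
}
```

Calling restoreKey twice with the same seed returns equal public keys, which is what makes a seed-only backup sufficient to recover a worker host's identity.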
Negative
- Tailnet dependency: when Headscale is down, new connections can't auth (existing connections survive). Mitigation: Headscale on a stable host with monitoring.
- Polyglot deployment: TS orchestrator + Go worker. One clean SSH-stdio boundary, but two languages to keep in CI. Acceptable per ADR-016 (parallel build).
- ACL drift: if Headscale ACLs forbid a worker host, sf degrades silently. A doctor check should detect this and surface it explicitly (see Tier 2 under Sequencing).
Risks and mitigations
- Risk: SSH disconnect mid-turn produces zombie agent processes.
  - Mitigation: worker cleanup script on disconnect; --sf-run-id=<id> marker on the agent process for pgrep/kill.
- Risk: wish API churn pre-1.0.
  - Mitigation: pin a version; planned upgrade window once per quarter.
- Risk: xpty/conpty edge cases on niche shells.
  - Mitigation: worker has a flag to fall back to non-pty stdio; logged loudly.
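One way the run-id marker could drive cleanup is sketched below in Go. The ADR specifies only the --sf-run-id=<id> marker and the use of pgrep/kill; the allowed run-id character set, the helper names, and the exact pgrep pattern are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// runIDPattern restricts run ids to characters that are literal in an
// extended regular expression (assumption: ids look like "run-42").
var runIDPattern = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

// runIDMarker returns the argument appended to the agent's command line.
func runIDMarker(id string) (string, error) {
	if !runIDPattern.MatchString(id) {
		return "", fmt.Errorf("invalid run id %q", id)
	}
	return "--sf-run-id=" + id, nil
}

// pgrepArgv builds a command that lists the PIDs of processes carrying the
// marker. "--" ends option parsing so the pattern's leading dashes are not
// read as flags; the trailing "( |$)" keeps run-4 from matching run-42.
func pgrepArgv(id string) ([]string, error) {
	marker, err := runIDMarker(id)
	if err != nil {
		return nil, err
	}
	return []string{"pgrep", "-f", "--", marker + "( |$)"}, nil
}
```

A cleanup script would run the returned command on disconnect and signal each PID it prints. Validating the id first keeps shell or regex metacharacters out of the pattern.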
Out of Scope
- Multi-tenant network isolation (one tailnet, multiple users with separate ACL domains) — defer until concrete need.
- Public-internet exposure: sf is tailnet-only by deployment recommendation. If a use case needs a public endpoint, it goes through tailscale funnel or a dedicated reverse proxy outside sf.
- Cross-tailnet federation: out of scope; one tailnet per deployment.
Sequencing
| When | Action |
|---|---|
| Now | Capture this ADR as the deployment assumption. |
| Tier 1 (next 1–3 months) | Build sf-worker (Go + Wish + xpty/conpty + promwish) as a separate package or repo. Orchestrator-side dispatch path in TS already plans for worker_host per SPEC §22 — just point it at the SSH endpoint. |
| Tier 2 | Doctor check: validate tailnet ACL allows the orchestrator → all configured worker hosts. Surface failures in sf doctor. |
| Tier 3 | Worker auto-provisioning script: sf worker bootstrap <host> generates a key, registers with Headscale, drops the worker binary. |
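The Tier 2 doctor check could start as a plain reachability probe, sketched here in Go. This assumes that a TCP connect to each worker's SSH port is an acceptable proxy for "the ACL allows this path"; a real check might inspect Headscale policy instead. The names checkWorkers, tcpDial, and finding are hypothetical, and the dialer is injected so the logic can be exercised without a network.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// errProbeFailed wraps any connection failure seen by the probe.
var errProbeFailed = errors.New("probe failed")

// dialFunc attempts to reach addr ("host:port") and reports failure.
type dialFunc func(addr string) error

// tcpDial returns a dialFunc that opens and immediately closes a TCP
// connection, giving up after timeout.
func tcpDial(timeout time.Duration) dialFunc {
	return func(addr string) error {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			return fmt.Errorf("%w: %v", errProbeFailed, err)
		}
		return conn.Close()
	}
}

// finding records one worker host the orchestrator could not reach.
type finding struct {
	Host string
	Err  error
}

// checkWorkers probes every configured host and returns only the failures,
// so an empty result means every host was reachable.
func checkWorkers(hosts []string, port string, dial dialFunc) []finding {
	var failed []finding
	for _, h := range hosts {
		if err := dial(net.JoinHostPort(h, port)); err != nil {
			failed = append(failed, finding{Host: h, Err: err})
		}
	}
	return failed
}
```

In use, sf doctor would call checkWorkers(hosts, "22", tcpDial(3*time.Second)) and print each finding, which turns silent ACL drift into an explicit report.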
Implementation Sketch
[sf orchestrator (TS)] on the daemon host
│
│ ssh user@worker.tailnet.ts.hugo.dk -- carries sf-rpc envelope
│
▼
[sf-worker (Go)] on each worker tailnet node
├── wish.Server with logging + elapsed + promwish middleware
├── per-connection handler spawns the agent via xpty/conpty
├── /metrics via promwish — scraped by your Prometheus
└── /healthz, /readyz simple HTTP for orchestrator health checks
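The /healthz and /readyz endpoints in the diagram could look roughly like the Go sketch below. It assumes /healthz means "the process is up" and /readyz means "the worker can accept an attempt"; the ADR does not define those semantics. The workerHealth type and the probeStatus helper are hypothetical, and probeStatus exists only to exercise the handlers in-process.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// workerHealth tracks whether the worker is ready to accept attempts.
// The zero value reports "not ready".
type workerHealth struct {
	ready atomic.Bool
}

// SetReady flips the readiness flag; safe for concurrent use.
func (w *workerHealth) SetReady(ok bool) { w.ready.Store(ok) }

// Handler serves /healthz (always 200 while the process runs) and
// /readyz (200 when ready, 503 otherwise).
func (w *workerHealth) Handler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(rw http.ResponseWriter, _ *http.Request) {
		rw.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/readyz", func(rw http.ResponseWriter, _ *http.Request) {
		if !w.ready.Load() {
			http.Error(rw, "not ready", http.StatusServiceUnavailable)
			return
		}
		rw.WriteHeader(http.StatusOK)
	})
	return mux
}

// probeStatus issues an in-process GET and returns the status code.
func probeStatus(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}
```

A worker would pass Handler() to http.ListenAndServe on a tailnet-facing address and call SetReady(true) once the SSH server is listening; the listen address is a deployment choice not fixed by this ADR.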
The worker is stateless — claim, lease, retry, persistence are all the orchestrator's job. Older SPEC notes captured this as distributed-execution evidence; current implementation must persist accepted requirements through .sf/DB-backed state. The worker just executes one attempt at a time and streams output.
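The "one attempt at a time, stream output" behaviour can be sketched in Go with the standard library alone. This reads "one attempt at a time" as a per-process limit, which is an assumption; the names attemptSlot and runAttempt are hypothetical, and a real worker would pass the SSH session as the reader and writer and spawn through xpty/conpty rather than os/exec.

```go
package main

import (
	"context"
	"errors"
	"io"
	"os/exec"
)

// errBusy is returned when an attempt is already running on this worker.
var errBusy = errors.New("worker busy: an attempt is already running")

// attemptSlot admits at most one attempt at a time.
type attemptSlot chan struct{}

func newAttemptSlot() attemptSlot { return make(attemptSlot, 1) }

// tryAcquire claims the slot without blocking; false means it is taken.
func (s attemptSlot) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees the slot for the next attempt.
func (s attemptSlot) release() { <-s }

// runAttempt runs the agent command, feeding it from in and streaming its
// stdout and stderr to out as they are produced. It keeps no state of its
// own: claim, lease, and retry stay with the orchestrator.
func runAttempt(ctx context.Context, slot attemptSlot, argv []string, in io.Reader, out io.Writer) error {
	if len(argv) == 0 {
		return errors.New("empty agent command")
	}
	if !slot.tryAcquire() {
		return errBusy
	}
	defer slot.release()

	// Cancelling ctx (for example on SSH disconnect) kills the process.
	cmd := exec.CommandContext(ctx, argv[0], argv[1:]...)
	cmd.Stdin = in
	cmd.Stdout = out
	cmd.Stderr = out
	return cmd.Run()
}
```

Rejecting a second attempt with errBusy, rather than queueing it, keeps the worker stateless: the orchestrator sees the refusal and decides whether to retry or dispatch elsewhere.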
References
- Older distributed-execution SPEC notes: external design evidence only; project accepted facts into .sf/DB-backed state before treating them as operational.
- ADR-012: Multi-instance federation (this ADR provides the substrate).
- ADR-014: Singularity Knowledge + Agent Platform (deploys onto this substrate).
- ADR-016: Charm AI stack adoption strategy (frames why Go for new services).
- charmbracelet/wish: SSH server framework.
- charmbracelet/x/xpty, charmbracelet/x/conpty: pty primitives.
- charmbracelet/promwish: Prometheus middleware for Wish.
- Headscale: open-source Tailscale control plane.