ADR-013: Network and remote-execution layer

Date: 2026-04-29
Status: proposed (deferred — captured for staged execution)

Context

sf today runs as a single daemon per host. Three forces push it toward a multi-host topology:

  • SSH workers: the orchestrator dispatches unit attempts to remote hosts (GPU, Windows, parallel scaling) — needs an SSH-served worker process.
  • Singularity Memory remote-mode (ADR-012, ADR-014): the cross-instance knowledge layer runs as a service on the tailnet, reachable from SF and other clients.
  • Multi-instance federation (ADR-012): future federated agents and benchmarks ride the same network substrate.

This ADR fixes the network and SSH-execution layer that all three depend on.

Decision

  • Network substrate: tailnet — Tailscale wire protocol with Headscale as the self-hosted control plane (the user already runs Headscale at mikki-bunker). sf core is wire-agnostic; it assumes addressable, authenticated peers.
  • SSH worker host stack: Go + charmbracelet/wish + charmbracelet/x/xpty (Linux/macOS) and charmbracelet/x/conpty (Windows). One thin Go shim per worker host; the orchestrator (TS) talks SSH stdio to it (a minimal server sketch follows this list).
  • Worker observability: charmbracelet/promwish — Prometheus middleware mounted on Wish gives /metrics for free.
  • Worker identity: charmbracelet/x/sshkey + charmbracelet/melt — auto-provisioning + Ed25519-with-seed-words backup.
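
A minimal sketch of that worker stack, wiring the decided middleware together. The bind addresses, ports, host-key path, and the sf_worker metrics label are illustrative placeholders, not settled names:

```go
package main

import (
	"context"
	"errors"
	"log"
	"net"
	"os/signal"
	"syscall"

	"github.com/charmbracelet/promwish"
	"github.com/charmbracelet/ssh"
	"github.com/charmbracelet/wish"
	"github.com/charmbracelet/wish/elapsed"
	"github.com/charmbracelet/wish/logging"
)

func main() {
	srv, err := wish.NewServer(
		// Illustrative address; in practice this binds the worker's tailnet interface.
		wish.WithAddress(net.JoinHostPort("0.0.0.0", "2222")),
		wish.WithHostKeyPath(".ssh/sf_worker_ed25519"),
		wish.WithMiddleware(
			// Innermost handler: where per-connection agent dispatch would live.
			func(next ssh.Handler) ssh.Handler {
				return func(sess ssh.Session) {
					wish.Println(sess, "sf-worker: ready")
					next(sess)
				}
			},
			// Prometheus /metrics on a side port, plus structured logs and timing.
			promwish.Middleware("0.0.0.0:9222", "sf_worker"),
			logging.Middleware(),
			elapsed.Middleware(),
		),
	)
	if err != nil {
		log.Fatalln(err)
	}

	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer cancel()
	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, ssh.ErrServerClosed) {
			log.Fatalln(err)
		}
	}()
	<-ctx.Done()
	_ = srv.Shutdown(context.Background())
}
```

With promwish mounted, the observability bullet above should cost exactly one middleware line; everything else is Wish defaults.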

Alternatives Considered

Network substrate

  • Public internet + sshd + manual key management — works, but key sprawl is a real problem (with N clients and M hosts, pairwise keys grow as N×M, so every new host adds N more keys to distribute), and dynamic IPs break stable hostnames. Tailnet's MagicDNS + ACLs replace both. Rejected.
  • Plain WireGuard mesh — no control plane; manual peer config. Higher ops overhead than Headscale. Rejected.
  • Tailscale-the-service — fine, but Headscale is already running and self-hosted means full ownership. Rejected.
  • ZeroTier / Netbird — viable alternatives. Rejected because the user already runs Headscale and switching would gain nothing.

SSH worker stack

  • Node-based SSH server (ssh2 lib) — keeps everything TS but reinvents what Wish gives for free; no battle-tested middleware patterns. Rejected.
  • OpenSSH sshd with ForceCommand — works for simple cases, terrible for multiplexed agent dispatch with per-connection state. Rejected.
  • Plain Go crypto/ssh — lower-level than Wish, no middleware, no built-in metrics. Rejected — Wish wraps the right primitives.

Consequences

Positive

  • sf's network model is explicit: tailnet first, ACLs in Headscale's admin, no per-service auth invention.
  • SSH worker host inherits Wish's mature middleware (wish/logging, wish/elapsed, etc.) and promwish observability.
  • Cross-platform pty support (xpty Linux/macOS, conpty Windows) lets workers spawn real ttys for the agent — load-bearing for Windows-only test runs on mikki-bunker-windows.
  • Stable hostnames via Headscale's MagicDNS — mikki-bunker.tailnet.ts.hugo.dk resolves regardless of network changes.
  • Identity story is clean: each worker host has its own Ed25519 keypair (sshkey), backed up via melt seed words (see the sketch after this list).
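
A rough sketch of the backup half of that story. Key generation goes through the standard library here (sshkey's auto-provisioning and persistence are elided), and melt's exact ToMnemonic signature is an assumption taken from its README rather than a pinned API:

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
	"log"

	"github.com/charmbracelet/melt"
)

func main() {
	// Generate the worker host's Ed25519 keypair (sshkey would normally
	// auto-provision and persist this on first boot).
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		log.Fatalln(err)
	}

	// Back the private key up as seed words; treat the exact melt
	// signature as assumed, not verified against a pinned version.
	mnemonic, err := melt.ToMnemonic(&priv)
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Println("store these words offline:", mnemonic)
}
```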

Negative

  • Tailnet dependency: when Headscale is down, new connections can't auth (existing connections survive). Mitigation: Headscale on a stable host with monitoring.
  • Polyglot deployment: TS orchestrator + Go worker. One clean SSH-stdio boundary, but two languages to keep in CI. Acceptable per ADR-016 (parallel build).
  • ACL drift: if Headscale ACLs forbid a worker host, sf degrades silently. A doctor check should detect and surface this explicitly (see "Sequencing" below).

Risks and mitigations

  • Risk: SSH disconnect mid-turn produces zombie agent processes.
    • Mitigation: worker cleanup script on disconnect; --sf-run-id=<id> marker on the agent process for pgrep / kill (sketched after this list).
  • Risk: wish API churn pre-1.0.
    • Mitigation: pin a version; planned upgrade window once per quarter.
  • Risk: xpty / conpty edge cases on niche shells.
    • Mitigation: worker has a flag to fall back to non-pty stdio; logged loudly.
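
The cleanup mitigation reduces to matching that argv marker. A Unix-side sketch (Windows would go through taskkill instead); reapAgent and the sample run id are illustrative names:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// reapAgent kills any agent process still carrying the run's marker flag.
// Called from the worker's disconnect handler; pkill -f matches the full
// command line, so the --sf-run-id=<id> argv marker is enough to find it.
func reapAgent(runID string) error {
	cmd := exec.Command("pkill", "-TERM", "-f", fmt.Sprintf("--sf-run-id=%s", runID))
	if err := cmd.Run(); err != nil {
		// pkill exits 1 when nothing matched; that just means no zombie.
		if ee, ok := err.(*exec.ExitError); ok && ee.ExitCode() == 1 {
			return nil
		}
		return err
	}
	log.Printf("reaped leftover agent for run %s", runID)
	return nil
}

func main() {
	// Illustrative run id; real ids come from the orchestrator's dispatch.
	if err := reapAgent("r-123"); err != nil {
		log.Fatalln(err)
	}
}
```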

Out of Scope

  • Multi-tenant network isolation (one tailnet, multiple users with separate ACL domains) — defer until concrete need.
  • Public-internet exposure — sf is tailnet-only by deployment recommendation. If a use case needs a public endpoint, it goes through Tailscale Funnel or a dedicated reverse proxy outside sf.
  • Cross-tailnet federation — out of scope; one tailnet per deployment.

Sequencing

| When | Action |
| --- | --- |
| Now | Capture this ADR as the deployment assumption. |
| Tier 1 (next 1–3 months) | Build sf-worker (Go + Wish + xpty/conpty + promwish) as a separate package or repo. The orchestrator-side dispatch path in TS already plans for worker_host per SPEC §22 — just point it at the SSH endpoint. |
| Tier 2 | Doctor check: validate that the tailnet ACL allows the orchestrator → all configured worker hosts. Surface failures in sf doctor (a connectivity probe is sketched below). |
| Tier 3 | Worker auto-provisioning script: sf worker bootstrap <host> generates a key, registers with Headscale, drops the worker binary. |
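
The Tier 2 doctor check can start as a plain reachability probe from the orchestrator to each worker's SSH port; a deeper version could query Headscale's ACL policy directly. Shown in Go for consistency with the other sketches even though the orchestrator side is TS; doctorProbe, the port, and the hostname are illustrative:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// doctorProbe dials each configured worker's SSH port over the tailnet.
// A refused or timed-out dial is surfaced explicitly instead of letting
// ACL drift degrade sf silently.
func doctorProbe(workers []string) []error {
	var failures []error
	for _, host := range workers {
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, "2222"), 5*time.Second)
		if err != nil {
			failures = append(failures, fmt.Errorf("worker %s unreachable (ACL drift?): %w", host, err))
			continue
		}
		conn.Close()
	}
	return failures
}

func main() {
	// Hostname is illustrative, following the MagicDNS pattern above.
	for _, err := range doctorProbe([]string{"mikki-bunker-windows.tailnet.ts.hugo.dk"}) {
		fmt.Println("sf doctor:", err)
	}
}
```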

Implementation Sketch

```
[sf orchestrator (TS)]                       on the daemon host
        │
        │  ssh user@worker.tailnet.ts.hugo.dk  --  carries sf-rpc envelope
        │
        ▼
[sf-worker (Go)]                             on each worker tailnet node
  ├── wish.Server                            with logging + elapsed + promwish middleware
  ├── per-connection handler                 spawns the agent via xpty/conpty
  ├── /metrics                               via promwish — scraped by your Prometheus
  └── /healthz, /readyz                      simple HTTP for orchestrator health checks
```

The worker is stateless — claim, lease, retry, and persistence are all the orchestrator's job. Older SPEC notes captured this split as distributed-execution evidence; the current implementation must persist accepted requirements through .sf/DB-backed state. The worker just executes one attempt at a time and streams output (a per-connection handler is sketched below).
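
How the per-connection handler from the diagram could look. xpty's NewPty/Start/io surface is treated as an assumption here, and sf-agent plus its flag value are illustrative; the handler slots into the middleware chain of the Decision sketch:

```go
package main

import (
	"io"
	"os/exec"

	"github.com/charmbracelet/ssh"
	"github.com/charmbracelet/wish"
	"github.com/charmbracelet/x/xpty"
)

// attemptHandler spawns one agent attempt in a real pty and streams its
// output back over the SSH session. xpty.NewPty/Start are assumed surface.
func attemptHandler(next ssh.Handler) ssh.Handler {
	return func(sess ssh.Session) {
		ptyReq, _, isPty := sess.Pty()
		if !isPty {
			wish.Fatalln(sess, "sf-worker: pty required (or pass the non-pty fallback flag)")
			return
		}

		// xpty picks the platform backend: a Unix pty or ConPTY on Windows.
		p, err := xpty.NewPty(ptyReq.Window.Width, ptyReq.Window.Height)
		if err != nil {
			wish.Fatalln(sess, err)
			return
		}
		defer p.Close()

		// One attempt per connection; the run-id marker makes cleanup findable.
		cmd := exec.Command("sf-agent", "--sf-run-id="+sess.Context().SessionID())
		if err := p.Start(cmd); err != nil {
			wish.Fatalln(sess, err)
			return
		}

		// Stream both directions; the orchestrator owns everything stateful.
		go func() { _, _ = io.Copy(p, sess) }()
		_, _ = io.Copy(sess, p)
		next(sess)
	}
}
```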

References

  • Older distributed-execution SPEC notes — external design evidence only; the project accepted facts into .sf/DB-backed state before treating them as operational.
  • ADR-012 — Multi-instance federation (this ADR provides the substrate).
  • ADR-014 — Singularity Knowledge + Agent Platform (deploys onto this substrate).
  • ADR-016 — Charm AI stack adoption strategy (frames why Go for new services).
  • charmbracelet/wish — SSH server framework.
  • charmbracelet/x/xpty, charmbracelet/x/conpty — pty primitives.
  • charmbracelet/promwish — Prometheus middleware for Wish.
  • Headscale — open-source Tailscale control plane.