ADR-012: Multi-instance federation — when sf instances interlink

Date: 2026-04-29
Status: proposed (deferred; captured for future implementation)

Context

sf today is per-project: each project has its own .sf/sf.db, and a single daemon (packages/daemon) on a host serves all projects under its scan roots. As deployment grows beyond one host (laptop, mikki-bunker, aidev), the question arises: should sf instances on different hosts (or different projects on the same host) interlink? And if so, on which surfaces?

Without thought-out federation, instances repeatedly re-learn the same lessons — anti-patterns, model outages, provider quirks — wasting tokens and duplicating mistakes. With over-eager federation, sf inherits cross-host trust, schema-version, and latency problems it doesn't need yet.

This ADR maps the federation surfaces, takes a position on each, and sequences the work.

Decision

Defer most federation. Wire Singularity Memory first as the single load-bearing federation primitive; defer federated benchmarks, cross-repo orchestration, and federated agents until the pain is concrete.

Federation Surfaces

Surface 1 — Knowledge (anti-patterns, learnings, contracts)

Status: captured in older SPEC notes as §16; treat that as external design evidence, not current operational authority. The current sf working model must project accepted federation facts into .sf/DB-backed state. Singularity Memory (sm) is the proposed cross-instance knowledge layer: an HTTP API holding memories, learnings, and anti-patterns.

Code reality: not yet wired. src/resources/extensions/sf/memory-store.ts and memory-extractor.ts write to a local SQLite memories table. The spec's "remote-mode" isn't connected.

Decision: wire it. Singularity Memory is the load-bearing federation primitive. If Mikki learns "Provider X drops requests at 03:00 UTC", that anti-pattern should be reachable from any sf instance on the tailnet without re-learning. Once wired, ~80% of the "should they interlink?" question answers itself.
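
As a sense-check of what "wire it" means, here is a minimal sketch of a remote-mode client, assuming an HTTP API shaped roughly as the spec describes. Every name below (RemoteMemoryClient, the /memories endpoint, the Memory fields) is a hypothetical illustration, not current sf code; today only the local SQLite memories table exists.

```ts
// Hypothetical remote-mode client for memory-store.ts. The endpoint
// shape and types are assumptions; only the local SQLite `memories`
// table is real today.
interface Memory {
  id: string;
  category: 'anti-pattern' | 'learning' | 'contract';
  content: string;
  source_host: string; // provenance: which instance wrote this memory
  recorded_at: string; // ISO-8601 timestamp
}

class RemoteMemoryClient {
  constructor(private baseUrl: string) {}

  // Recall is best-effort: any failure logs degraded mode and returns
  // an empty result, so local scheduling never blocks on the tailnet.
  async recall(category: Memory['category'], query: string): Promise<Memory[]> {
    try {
      const url = `${this.baseUrl}/memories?category=${category}&q=${encodeURIComponent(query)}`;
      const res = await fetch(url, { signal: AbortSignal.timeout(2_000) });
      return res.ok ? ((await res.json()) as Memory[]) : [];
    } catch {
      console.warn('singularity-memory unreachable; degrading to local-only recall');
      return [];
    }
  }

  // Writes are fire-and-forget; the local memories table remains the
  // write-through source of truth.
  async store(memory: Omit<Memory, 'id'>): Promise<void> {
    try {
      await fetch(`${this.baseUrl}/memories`, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(memory),
      });
    } catch {
      // best-effort: a dropped remote write is acceptable
    }
  }
}
```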

Surface 2 — Benchmarks and circuit breakers

Status: per-DB today. benchmark_results and circuit_breakers tables live in each project's .sf/sf.db. One instance trips a breaker on kimi-coding/k2p5; another instance has to independently rediscover the outage.

Decision: defer; revisit after Singularity Memory lands. Two clean options when we revisit:

  • Ride Singularity Memory — store benchmark observations as a memory category, recall as needed. Cheap; semantically clean (benchmarks ARE learning).
  • Separate thin HTTP service — purpose-built benchmark aggregator with statistical smoothing and a publish/subscribe channel for circuit-breaker events.

The pain ceiling is bounded today (per-instance discovery is at worst a few wasted dispatches). Only build when concrete cost emerges.
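
If Surface 2 ends up riding Singularity Memory, a benchmark observation could be encoded as one more memory category. A sketch under that assumption (field names hypothetical; today these rows live in benchmark_results inside each project's .sf/sf.db):

```ts
// Hypothetical shape of a benchmark observation stored as a memory.
// recorded_at and host are mandatory so consumers can weight by
// recency and filter by provenance (see Risks and mitigations).
interface BenchmarkObservation {
  category: 'benchmark';
  model: string;           // e.g. 'kimi-coding/k2p5'
  metric: 'latency_ms' | 'error_rate' | 'breaker_tripped';
  value: number;
  host: string;            // which instance observed it
  recorded_at: string;     // ISO-8601
}

// Example: one instance trips a circuit breaker and publishes the event
// so other instances don't have to rediscover the outage themselves.
const breakerTrip: BenchmarkObservation = {
  category: 'benchmark',
  model: 'kimi-coding/k2p5',
  metric: 'breaker_tripped',
  value: 1,
  host: 'mikki-bunker',
  recorded_at: new Date().toISOString(),
};
```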

Surface 3 — Cross-project unit dependencies

Status: not designed. sf has no concept of "milestone in repo A produces an artefact repo B depends on". The unit hierarchy (milestone → slice → task) is project-local.

Decision: out of scope for sf. Cross-repo orchestration is a different abstraction layer — it belongs in a meta-coordinator that consumes sf's daemon/RPC or headless interfaces, not in sf itself. Building it inside sf would conflate "agent that ships one project" with "fleet manager that ships an org's roadmap." Different products.

Surface 4 — Federated persistent agents

Status: not designed. Older SPEC notes sketched persistent agents scoped to a single project's DB; those notes are evidence only until projected into current .sf/DB state.

Decision: defer. Per-instance for v3. If Mikki has a "code-reviewer" persistent agent, it lives in Mikki's DB. Federation requires:

  • Cross-host auth (who can wake whose agents).
  • Agent-state schema versioning (instances may run different sf versions).
  • Leader-election story for shared-agent updates.
  • A migration path from per-instance → federated.

None of this earns its keep until we have a concrete use case where one agent should genuinely serve multiple projects/hosts. Premature now.

Surface 5 — Distributed execution (clarifying note, not federation)

Status: captured in older SPEC notes; not built. SSH workers mean one daemon dispatches units to remote worker hosts.

Decision: clarify that this is NOT federation. Distributed execution = one daemon owns many workers (parallel scaling). Federation = many daemons share state across hosts (knowledge sharing). Different problems. The spec already separates them; this ADR just affirms the line.

Consequences

Positive (after Singularity Memory lands)

  • Knowledge sharing without re-learning — anti-patterns, gotchas, contract findings reachable across hosts and other agent products on the tailnet.
  • Lower per-instance cost — fewer wasted dispatches re-discovering provider quirks.
  • Reusable for non-sf agents — Hermes, Claude Code, Cursor can also read/write Singularity Memory, so the network effect grows beyond sf.

Negative

  • Tailnet dependency — when remote-mode Singularity Memory is configured, tailnet outage degrades sf to local-only. Mitigation: spec already allows embedded (in-process) mode; remote is opt-in.
  • Cross-instance prompt-injection surface — a malicious memory written by one instance could leak into another's recall. Mitigation: Singularity Memory MUST track provenance per memory and let consumers filter by trusted source. Capture as a sub-ADR if/when implemented.
  • Schema versioning across instances — different sf versions accessing the same memory store. Mitigation: the memory schema must be append-only and additive, with new fields optional on read (sketched below).
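
A minimal sketch of that additive rule, with hypothetical version names: a newer writer may add fields, but an older reader must still be able to parse the record by treating everything it doesn't know as optional.

```ts
// Hypothetical schema evolution. v2 only adds fields; it never removes
// or retypes a v1 field, so an instance that only knows v1 can read
// v2 records and ignore the extras.
interface MemoryRecordV1 {
  id: string;
  content: string;
  recorded_at: string;
}

interface MemoryRecordV2 extends MemoryRecordV1 {
  source_host?: string;                   // optional on read: absent in v1 rows
  trust_level?: 'trusted' | 'unverified';
}
```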

Risks and mitigations

  • Risk: Singularity Memory becomes a bottleneck — sf can't dispatch when memory is down.
    • Mitigation: sf MUST treat memory as best-effort. A memory-fetch failure logs a degraded-mode event and proceeds with empty recall (see the sketch after this list). Local SQLite stays the authoritative scheduler state.
  • Risk: federated benchmarks make sf overconfident in stale data.
    • Mitigation: every benchmark observation carries recorded_at and host. Consumers weight by recency and discard observations older than circuit_breaker_resets_at + N.
  • Risk: cross-instance attacker plants poisoned anti-patterns to steer agent behaviour.
    • Mitigation: same as the prompt-injection mitigation above — provenance + trusted-source filter, plus rate-limiting per writer.
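
Taken together, these mitigations reduce to a small filter pass between recall and prompt assembly. A sketch, reusing the hypothetical Memory shape from Surface 1 (the weight-by-recency rule is simplified to a hard age cutoff):

```ts
// Hypothetical post-recall filter covering the stale-data and
// poisoned-memory risks; names are illustrative, not current sf API.
function filterRecall(
  memories: Memory[],
  trustedHosts: Set<string>,
  maxAgeMs: number,
): Memory[] {
  const cutoff = Date.now() - maxAgeMs;
  return memories
    // recency: discard observations past the staleness cutoff
    .filter((m) => Date.parse(m.recorded_at) >= cutoff)
    // provenance: only memories from trusted writers reach a prompt
    .filter((m) => trustedHosts.has(m.source_host));
}
```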

Out of Scope

  • Cross-repo unit graph — meta-coordinator territory.
  • Federated persistent-agent fleets — defer until concrete pain.
  • Multi-tenant Singularity Memory — current design assumes a single-user-or-team trust domain. Multi-tenant is a separate product.
  • Auto-sharding sf instances — sf is one daemon per host; we don't horizontally split a single host's daemon.

Sequencing

  • Tier 1+ (next 1–3 months): wire Singularity Memory remote-mode in memory-store.ts. Provider chain fallback: remote → embedded → local-only (see the sketch below). Promote accepted runtime requirements into .sf/DB-backed state once landed.
  • After Singularity Memory has run in production for 1+ month: decide whether to ride it for benchmarks (Surface 2) or build a separate service, driven by the observed cost of duplicated benchmark discovery.
  • If/when concrete cross-instance agent pain shows up: reopen Surface 4 (federated persistent agents). Don't pre-build.
  • Never in sf: Surface 3 (cross-repo unit deps); that's a separate product.
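
The provider chain could reduce to a small resolver that probes remote and falls back. A sketch with assumed config keys and a hypothetical /health endpoint:

```ts
// Hypothetical provider-chain resolution for memory-store.ts:
// remote → embedded → local-only. Config keys and the /health
// endpoint are assumptions, not current sf code.
type MemoryStoreMode =
  | { mode: 'remote'; baseUrl: string }
  | { mode: 'embedded'; dbPath: string }
  | { mode: 'local'; dbPath: string };

async function resolveMemoryStore(opts: {
  remoteUrl?: string;       // e.g. a tailnet address for Singularity Memory
  embeddedDbPath?: string;  // in-process mode allowed by the spec
}): Promise<MemoryStoreMode> {
  if (opts.remoteUrl) {
    try {
      const res = await fetch(`${opts.remoteUrl}/health`, {
        signal: AbortSignal.timeout(1_000),
      });
      if (res.ok) return { mode: 'remote', baseUrl: opts.remoteUrl };
    } catch {
      // remote unreachable: degrade to the next provider, never fail hard
    }
  }
  if (opts.embeddedDbPath) return { mode: 'embedded', dbPath: opts.embeddedDbPath };
  return { mode: 'local', dbPath: '.sf/sf.db' }; // per-project default
}
```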

References

  • Older SPEC notes for Singularity Memory, persistent agents, inter-agent messaging, and distributed execution — external design evidence only; project accepted facts into .sf/DB-backed state before treating them as operational.
  • src/resources/extensions/sf/memory-store.ts — current local-only memory store.
  • packages/daemon/src/daemon.ts — single-host daemon process.
  • docs/dev/ADR-011-swarm-chat-and-debate-mode.md — related: ephemeral swarms within a single instance.