A2A Adoption Plan for Singularity-Forge — Production Grade
Author: Research synthesis
Date: 2026-05-08
Status: Draft — for review
Scope: A2A as the internal agent communication protocol for SF dispatch layer
Executive Summary
SF's 5 dispatch mechanisms + MessageBus are functionally complete but architecturally siloed. A2A provides a standardized protocol that maps 1:1 onto SF's semantics. The existing MessageBus is preserved as the transport; A2A is the semantic layer on top.
This is a production-grade plan. Every section covers: error handling, failure modes, rollback procedures, observability, and testing strategy.
Quick Reference
| Concern | Decision |
|---|---|
| A2A as internal protocol | YES — standardizes Task state, priority, capability discovery |
| MessageBus | Wrap as A2AMessageService transport; add AgentRegistry |
| Transport | SQLite-backed MessageBus (not HTTP/WebSocket) for local process agents |
| External A2A | Optional; wired later when HTTP exposure is needed |
| Migration | 6 phases; each phase is independently deployable and rollback-safe |
| Feature flag | SF_A2A_ENABLED — gates all new A2A behavior; default OFF until Phase 6 |
1. Architecture Overview
1.1 System Diagram
┌──────────────────────────────────────────────────────────────────────┐
│ Coordinator (UOK Kernel or subagent tool) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ DispatchService │ │
│ │ ├── A2AClient (send/receive) │ │
│ │ ├── AgentRegistry (capability lookup) │ │
│ │ └── AgentCard (self-description) │ │
│ └────────────────────────────────────────────────────────────┘ │
└───────────────────────────┬──────────────────────────────────────────┘
│ A2AMessageService (wraps MessageBus)
│ bus.send(), bus.broadcast(), bus.sendOnce()
▼
┌──────────────────────────────────────────────────────────────────────┐
│ MessageBus (SQLite-backed, existing) │
│ ├── Durable at-least-once delivery │
│ ├── TTL-based auto-compaction │
│ ├── AgentInbox per agent (per-queue) │
│ └── sendOnce for idempotent delivery │
└───────────────────────────┬──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ Worker Agents (git worktrees, one per milestone/slice) │
│ ├── AgentCard (role: worker, isolation: full) │
│ ├── AgentInbox subscription │
│ ├── Project SQLite WAL (read/write) │
│ └── Emits: task_updated, cost, heartbeat │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ Constrained Subagents (no project DB) │
│ ├── AgentCard (role: subagent, isolation: constrained) │
│ ├── Limited tool scope (4 tools) │
│ ├── AgentInbox (optional, opt-in via useMessageBus) │
│ └── Returns structured output via A2A message │
└──────────────────────────────────────────────────────────────────────┘
1.2 A2A Semantic Mapping
| SF Concept | A2A Concept |
|---|---|
| milestone / slice / task | A2A Task (id, status, metadata) |
| UOK Kernel | A2A Client + Coordinator Agent |
| Worker (parallel orchestrator) | A2A Agent |
| MessageBus.send() | A2A MessageService.send() |
| MessageBus.sendOnce() | A2A idempotent delivery |
| MessageBus.broadcast() | A2A MessageService.broadcast() |
| AgentInbox per worker | A2A per-agent subscription queue |
| File-based status files | A2A AgentStatus (online/busy/idle/offline/error) |
| adversarial-partner/combatant/architect | A2A Agent with specialized capabilities |
| parallel / debate / chain modes | A2A CommunicationPattern |
2. A2A Type System
2.1 Core Types
// File: src/resources/extensions/sf/dispatch/a2a-types.ts
import type {
AgentCard,
AgentCapabilities,
Task,
TaskStatus,
Message,
} from "@a2a-js/sdk";
/**
* A2A Task state — maps directly from SF unit runtime status.
* These are the ONLY authoritative task states.
*/
export const A2A_TASK_STATES = [
"submitted",
"working",
"completed",
"failed",
"cancelled",
] as const;
export type A2ATaskState = (typeof A2A_TASK_STATES)[number];
/**
* SF-specific task extensions — runtime states that A2A doesn't model.
* These live in task.metadata.sf_state and are NOT authoritative.
* DB is the authority for these.
*/
export const SF_TASK_EXTENSIONS = [
"verifying",
"reviewing",
"blocked",
"paused",
"retrying",
"pending_input",
] as const;
export type SFTaskExtension = (typeof SF_TASK_EXTENSIONS)[number];
/**
* Message priority levels — determines delivery urgency and retry budget.
*/
export const MESSAGE_PRIORITIES = ["low", "normal", "high", "urgent"] as const;
export type MessagePriority = (typeof MESSAGE_PRIORITIES)[number];
/**
* Dispatch mode → A2A CommunicationPattern mapping.
*/
export const DISPATCH_TO_PATTERN: Record<string, string> = {
single: "request_response",
parallel: "notification",
debate: "streaming",
chain: "request_response",
};
/**
* SF-specific capability extensions on top of A2A AgentCapabilities.
*/
export interface SFAgentCapabilities extends AgentCapabilities {
/** Domain role */
role: "coordinator" | "worker" | "subagent" | "reviewer" | "adversary" | "architect" | "researcher";
/** Isolation level — determines DB access */
isolation: "full" | "constrained";
/** For constrained agents — which tools are permitted */
toolScope?: Array<"file_read" | "file_write" | "execute" | "query" | "memory_read" | "memory_write">;
/** Model tier for cost and routing decisions */
modelTier: "primary" | "validation" | "worker";
/** Domain specializations */
specializations?: Array<
| "milestone_planning"
| "slice_planning"
| "code_review"
| "security_review"
| "adversarial_review"
| "architecture_analysis"
| "research"
| "verification"
>;
}
/**
* SF AgentCard — extends A2A AgentCard with SF-specific capabilities.
* Published by each agent on startup; cached in AgentRegistry.
*/
export interface SFAgentCard extends AgentCard {
capabilities: SFAgentCapabilities;
metadata?: {
basePath?: string;
milestoneId?: string;
sliceId?: string;
worktreePath?: string;
pid?: number;
startedAt?: string;
};
}
/**
* SF Task metadata — stored in A2A Task.metadata.
* sf_state is NOT authoritative — DB is the authority.
*/
export interface SFTaskMetadata {
scope: "milestone" | "slice" | "task" | "inline";
milestoneId: string;
sliceId?: string;
taskId?: string;
title: string;
/** Non-authoritative runtime hint — DB is authority */
sf_state?: SFTaskExtension;
/** Base path for DB access */
basePath: string;
}
/**
* A2A Message envelope used internally.
* Wraps MessageBus messages with A2A metadata.
*/
export interface SFA2AMessage {
id: string;
type: "message" | "task_submitted" | "task_updated" | "task_completed" | "control" | "error";
from: string;
to: string | string[];
body: Record<string, unknown>;
priority: MessagePriority;
sentAt: string;
deliveredAt?: string;
correlationId?: string;
conversationId?: string;
ttlMs?: number;
taskId?: string;
metadata?: Record<string, unknown>;
}
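The split between the five authoritative A2A states and the non-authoritative sf_state hint can be sketched as one projection function. This is a minimal sketch, not the planned a2a-task.ts: the helper name projectState is hypothetical, and it assumes every SF extension state is reported to A2A as "working", which the plan does not spell out.

```typescript
const A2A_TASK_STATES = ["submitted", "working", "completed", "failed", "cancelled"] as const;
type A2ATaskState = (typeof A2A_TASK_STATES)[number];

const SF_TASK_EXTENSIONS = [
  "verifying", "reviewing", "blocked", "paused", "retrying", "pending_input",
] as const;
type SFTaskExtension = (typeof SF_TASK_EXTENSIONS)[number];

interface ProjectedState {
  state: A2ATaskState;        // authoritative A2A state
  sf_state?: SFTaskExtension; // non-authoritative hint; the DB stays the authority
}

function isA2ATaskState(value: string): value is A2ATaskState {
  return (A2A_TASK_STATES as readonly string[]).indexOf(value) !== -1;
}

function isSFTaskExtension(value: string): value is SFTaskExtension {
  return (SF_TASK_EXTENSIONS as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper. ASSUMPTION: all SF extension states count as in-flight work.
function projectState(raw: string): ProjectedState {
  if (isA2ATaskState(raw)) return { state: raw };
  if (isSFTaskExtension(raw)) return { state: "working", sf_state: raw };
  throw new Error(`Unknown task state: ${raw}`);
}
```

Under that assumption, a unit that is "verifying" in the DB would surface to A2A consumers as { state: "working", sf_state: "verifying" }, while an unknown status fails loudly instead of being guessed.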
3. Error Handling
3.1 Message Delivery Errors
| Error | Detection | Response |
|---|---|---|
| Recipient offline | AgentRegistry.getStatus() === "offline" | Buffer message; deliver on reconnect |
| Inbox full (max 1000) | AgentInbox.unreadCount >= maxInboxSize | Reject with INBOX_OVERFLOW; caller retries with backoff |
| TTL exceeded | Date.now() - sentAt > ttlMs | Discard; caller notified via error response |
| DB write conflict | SQLite SQLITE_BUSY | Retry with exponential backoff (max 3 attempts, 100ms base) |
| Invalid recipient | AgentRegistry.getCard(to) === undefined | Return AGENT_NOT_FOUND error; do not retry |
3.2 Retry Strategy
// File: src/resources/extensions/sf/dispatch/a2a-service.ts
const RETRY_CONFIG = {
maxAttempts: 3,
baseDelayMs: 100,
maxDelayMs: 5000,
backoffMultiplier: 2.0,
jitterFactor: 0.1, // 10% random jitter to prevent thundering herd
} as const;
export class DeliveryError extends Error {
constructor(
message: string,
public readonly code: string,
public readonly retryable: boolean,
public readonly attempts: number,
) {
super(message);
this.name = "DeliveryError";
}
}
async function sendWithRetry(
params: SendParams,
attempt = 1,
): Promise<string> {
const { from, to, body, metadata = {} } = params;
try {
return await doSend(from, to, body, metadata);
} catch (err) {
const isRetryable =
err instanceof DeliveryError && err.retryable && attempt < RETRY_CONFIG.maxAttempts;
if (!isRetryable) {
throw err;
}
const delay = Math.min(
RETRY_CONFIG.baseDelayMs * Math.pow(RETRY_CONFIG.backoffMultiplier, attempt - 1),
RETRY_CONFIG.maxDelayMs,
);
const jitter = delay * RETRY_CONFIG.jitterFactor * Math.random();
await sleep(delay + jitter);
return sendWithRetry(params, attempt + 1);
}
}
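To make the retry budget concrete, the sketch below computes the pre-jitter delays implied by RETRY_CONFIG. The helper names baseDelayFor and delaySchedule are hypothetical and exist only for this illustration; the constants are copied from the config above.

```typescript
const RETRY_CONFIG = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 5000,
  backoffMultiplier: 2.0,
  jitterFactor: 0.1,
} as const;

// Delay (before jitter) that follows a failed attempt (1-based), capped at maxDelayMs.
function baseDelayFor(attempt: number): number {
  const uncapped =
    RETRY_CONFIG.baseDelayMs * Math.pow(RETRY_CONFIG.backoffMultiplier, attempt - 1);
  return Math.min(uncapped, RETRY_CONFIG.maxDelayMs);
}

// A sleep happens only after attempts 1..maxAttempts-1; the final failure is rethrown.
function delaySchedule(): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < RETRY_CONFIG.maxAttempts; attempt++) {
    delays.push(baseDelayFor(attempt));
  }
  return delays;
}

const schedule = delaySchedule(); // [100, 200]
const totalBaseWaitMs = schedule.reduce((sum, d) => sum + d, 0); // 300
```

With three attempts there are only two sleeps, so a send that exhausts its retries adds roughly 300 ms of waiting, plus at most 10% jitter. The 5000 ms cap only comes into play if maxAttempts is raised to 7 or more.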
3.3 Agent Crash Handling
Worker crash detection:
1. Worker process exits → SIGCHLD handler
2. Update AgentRegistry status: "offline"
3. MessageBus retains undelivered messages (TTL not expired)
4. Coordinator polls AgentRegistry.getStatus() every 30s
5. On reconnect: worker re-registers AgentCard
6. Buffered messages delivered to reconnected AgentInbox
7. Coordinator re-sends any unacknowledged task_updated messages
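The crash and reconnect steps above can be sketched with an in-memory stand-in. CrashAwareRegistry is a hypothetical class that merges the roles of AgentRegistry and MessageBus retention for illustration only; the real components are SQLite-backed and their APIs may differ.

```typescript
type AgentStatus = "online" | "offline";

interface BufferedMessage {
  id: string;
  to: string;
  sentAt: number; // epoch ms
  ttlMs: number;
}

// Hypothetical in-memory stand-in for AgentRegistry plus MessageBus retention.
class CrashAwareRegistry {
  private status: Record<string, AgentStatus> = {};
  private pending: BufferedMessage[] = [];

  getStatus(agentId: string): AgentStatus {
    return this.status[agentId] ?? "offline";
  }

  // Step 2: called from the process-exit handler.
  markOffline(agentId: string): void {
    this.status[agentId] = "offline";
  }

  // Step 3: messages for offline agents are retained rather than dropped.
  send(msg: BufferedMessage): "delivered" | "buffered" {
    if (this.getStatus(msg.to) === "online") return "delivered";
    this.pending.push(msg);
    return "buffered";
  }

  // Steps 5-6: re-registration flushes retained messages whose TTL has not expired.
  register(agentId: string, now: number): BufferedMessage[] {
    this.status[agentId] = "online";
    const mine = this.pending.filter((m) => m.to === agentId);
    this.pending = this.pending.filter((m) => m.to !== agentId);
    return mine.filter((m) => now - m.sentAt <= m.ttlMs);
  }
}
```

Messages whose TTL expired while the worker was down are silently discarded in this sketch; the plan's error table says the caller should be notified in that case, which a real implementation would need to add.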
3.4 Panic Mode
When messageService fails to deliver HIGH/URGENT messages 3 times consecutively:
- Log A2A_DELIVERY_PANIC event to .sf/journal/
- Fall back to file-based signal (session-status-io.js)
- Emit sf_dispatch_degraded event
- Dashboard shows "dispatch degraded" warning
- Auto-recovery when MessageBus recovers
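The trigger and recovery conditions for panic mode can be sketched as a small counter. PanicModeTracker and its callback are hypothetical names; the sketch assumes that any successful delivery counts as "MessageBus recovered", which the plan does not define precisely.

```typescript
type MessagePriority = "low" | "normal" | "high" | "urgent";

// Hypothetical tracker: enters panic mode after `threshold` consecutive
// HIGH/URGENT delivery failures and leaves it on the next success.
class PanicModeTracker {
  private consecutiveFailures = 0;
  private panic = false;

  constructor(
    private readonly threshold: number = 3,
    private readonly onEnter: () => void = () => {},
  ) {}

  recordFailure(priority: MessagePriority): void {
    // Low/normal failures neither count toward nor reset the streak.
    if (priority !== "high" && priority !== "urgent") return;
    this.consecutiveFailures += 1;
    if (!this.panic && this.consecutiveFailures >= this.threshold) {
      this.panic = true;
      this.onEnter(); // e.g. journal the event and switch to the file-based signal
    }
  }

  // ASSUMPTION: one successful delivery is enough to treat the bus as recovered.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.panic = false;
  }

  get isPanicMode(): boolean {
    return this.panic;
  }
}
```

The onEnter callback fires once per panic episode, so the journal entry and the sf_dispatch_degraded event would not be repeated for every failure that follows.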
4. Backpressure and Flow Control
4.1 Per-Agent Inbox Backpressure
// File: src/resources/extensions/sf/dispatch/a2a-service.ts
const INBOX_CONFIG = {
maxInboxSize: 1000, // Per-agent queue limit
maxMessageSizeBytes: 64 * 1024, // 64 KB per message body
highWaterMark: 800, // Warn when inbox reaches 80%
overflowAction: "reject", // "reject" | "drop_oldest"
} as const;
interface SendParams {
from: string;
to: string;
body: Record<string, unknown>;
metadata?: {
priority?: MessagePriority;
ttlMs?: number;
replyTo?: string;
taskId?: string;
};
}
function validateSend(params: SendParams): void {
const bodySize = Buffer.byteLength(JSON.stringify(params.body), "utf8"); // bytes, not UTF-16 code units
if (bodySize > INBOX_CONFIG.maxMessageSizeBytes) {
throw new DeliveryError(
`Message body ${bodySize} bytes exceeds limit ${INBOX_CONFIG.maxMessageSizeBytes}`,
"MESSAGE_TOO_LARGE",
false, // Not retryable
0,
);
}
const inbox = bus.getInbox(params.to);
if (inbox.unreadCount >= INBOX_CONFIG.maxInboxSize) {
throw new DeliveryError(
`Inbox for ${params.to} is full (${inbox.unreadCount}/${INBOX_CONFIG.maxInboxSize})`,
"INBOX_OVERFLOW",
true, // Retryable after inbox drains
0,
);
}
if (inbox.unreadCount >= INBOX_CONFIG.highWaterMark) {
logWarning("dispatch", `Inbox for ${params.to} at ${inbox.unreadCount}/${INBOX_CONFIG.maxInboxSize}`);
}
}
4.2 Coordinator Outbox Backpressure
When the coordinator sends faster than workers can consume:
// Coordinator: batch outgoing messages, flush on interval
const outbox = new Map<string, SFA2AMessage[]>();
const FLUSH_INTERVAL_MS = 500;
setInterval(() => {
for (const [to, messages] of outbox) {
if (messages.length === 0) continue;
bus.broadcast(coordinatorId, [to], { batch: messages });
messages.length = 0; // drain
}
}, FLUSH_INTERVAL_MS);
// Caller adds to outbox instead of sending immediately
function scheduleSend(params: SendParams): void {
const queue = outbox.get(params.to) ?? [];
queue.push(wrapAsA2AMessage(params));
outbox.set(params.to, queue);
}
4.3 Memory Budget Per Worker
Each worker has a memory budget for buffering messages it cannot process immediately:
MAX_BUFFERED_MESSAGES_PER_WORKER = 100
MAX_BUFFERED_BYTES_PER_WORKER = 10 * 1024 * 1024 // 10 MB
If a worker's local buffer exceeds either limit, the oldest messages are dropped rather than rejected, because the sender has already moved on.
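A drop-oldest buffer bounded by both message count and total bytes might look like the following. WorkerBuffer is a hypothetical class; the two limits are the ones stated above, and each message is assumed to carry a precomputed sizeBytes.

```typescript
const MAX_BUFFERED_MESSAGES_PER_WORKER = 100;
const MAX_BUFFERED_BYTES_PER_WORKER = 10 * 1024 * 1024; // 10 MB

interface BufferedItem {
  id: string;
  sizeBytes: number; // assumed to be computed by the caller
}

// Hypothetical FIFO buffer that evicts from the front when either limit is exceeded.
class WorkerBuffer {
  private items: BufferedItem[] = [];
  private totalBytes = 0;

  constructor(
    private readonly maxMessages: number = MAX_BUFFERED_MESSAGES_PER_WORKER,
    private readonly maxBytes: number = MAX_BUFFERED_BYTES_PER_WORKER,
  ) {}

  // Adds a message and returns whatever had to be dropped to stay within budget.
  push(item: BufferedItem): BufferedItem[] {
    this.items.push(item);
    this.totalBytes += item.sizeBytes;
    const dropped: BufferedItem[] = [];
    while (this.items.length > this.maxMessages || this.totalBytes > this.maxBytes) {
      const oldest = this.items.shift();
      if (!oldest) break;
      this.totalBytes -= oldest.sizeBytes;
      dropped.push(oldest);
    }
    return dropped;
  }

  get length(): number {
    return this.items.length;
  }

  get bytes(): number {
    return this.totalBytes;
  }
}
```

Returning the dropped items lets the caller log or count them, which keeps silent loss visible in the observability layer.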
5. Observability
5.1 Metrics
// File: src/resources/extensions/sf/dispatch/metrics.ts
export const A2A_METRICS = {
// Message throughput
"sf_a2a_messages_sent_total": {
type: "counter",
help: "Total A2A messages sent",
labels: ["priority", "from_role", "to_role"],
},
"sf_a2a_messages_delivered_total": {
type: "counter",
help: "Total A2A messages delivered to recipient inbox",
labels: ["priority", "from_role", "to_role"],
},
"sf_a2a_messages_failed_total": {
type: "counter",
help: "Total A2A message delivery failures",
labels: ["priority", "error_code"],
},
"sf_a2a_message_delivery_latency_ms": {
type: "histogram",
help: "End-to-end message delivery latency (send to inbox receipt)",
buckets: [10, 50, 100, 500, 1000, 5000],
},
"sf_a2a_inbox_size": {
type: "gauge",
help: "Current inbox size per agent",
labels: ["agent_id", "role"],
},
"sf_a2a_retry_total": {
type: "counter",
help: "Total retry attempts",
labels: ["priority", "attempt_number"],
},
"sf_a2a_agent_status": {
type: "gauge",
help: "Agent status (1=online, 0.5=busy, 0.1=idle, 0=offline/error)",
labels: ["agent_id", "role"],
},
} as const;
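The numeric encoding in the sf_a2a_agent_status help text can be sketched as a lookup plus a renderer for the Prometheus text exposition format. STATUS_GAUGE_VALUE and renderStatusGauge are hypothetical names; a real exporter would more likely use a metrics client library.

```typescript
type AgentStatus = "online" | "busy" | "idle" | "offline" | "error";

// Values taken from the help text of sf_a2a_agent_status.
const STATUS_GAUGE_VALUE: Record<AgentStatus, number> = {
  online: 1,
  busy: 0.5,
  idle: 0.1,
  offline: 0,
  error: 0,
};

// Escape backslash, double quote, and newline as the text exposition format requires.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Hypothetical renderer for a single gauge sample line.
function renderStatusGauge(agentId: string, role: string, status: AgentStatus): string {
  const labels = `agent_id="${escapeLabel(agentId)}",role="${escapeLabel(role)}"`;
  return `sf_a2a_agent_status{${labels}} ${STATUS_GAUGE_VALUE[status]}`;
}
```

Because offline and error both encode to 0, a dashboard cannot tell them apart from this gauge alone; the a2a.agent_offline log event would be needed to distinguish the two.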
5.2 Structured Logging
Every A2A operation emits structured log lines:
// File: src/resources/extensions/sf/dispatch/logger.ts
type A2ALogEvent =
| { event: "a2a.send"; from: string; to: string; priority: string; messageId: string; sizeBytes: number }
| { event: "a2a.delivered"; messageId: string; to: string; latencyMs: number }
| { event: "a2a.delivery_failed"; messageId: string; error: string; retryable: boolean; attempt: number }
| { event: "a2a.agent_registered"; agentId: string; role: string; capabilities: string[] }
| { event: "a2a.agent_offline"; agentId: string; reason: string }
| { event: "a2a.inbox_overflow"; agentId: string; size: number; action: string }
| { event: "a2a.panic_mode"; reason: string; fallback_used: boolean };
function logA2A(event: A2ALogEvent): void {
const line = JSON.stringify({
ts: new Date().toISOString(),
...event,
});
workflowLogger.log("dispatch", line);
}
5.3 Trace Context
Propagate trace context through A2A messages for debugging:
interface TraceContext {
traceId: string; // ULID — unique per dispatch session
spanId: string; // Per-message ID
parentSpanId?: string;
}
function injectTraceContext(msg: SFA2AMessage): SFA2AMessage {
const spanId = ulid();
return {
...msg,
metadata: {
...msg.metadata,
trace: {
traceId: currentTraceId(),
spanId,
parentSpanId: currentSpanId(),
},
},
};
}
Traces are stored in .sf/journal/a2a-traces/{date}.jsonl and queryable via sf trace <traceId>.
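As an illustration of what a trace query could do with these files, the sketch below filters JSONL lines by traceId and walks parentSpanId links back to the root. It assumes each line is a serialized message with the context under metadata.trace, as produced by injectTraceContext; the on-disk record shape and both helper names are assumptions.

```typescript
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

// Hypothetical reader: keeps only spans belonging to the requested trace.
function parseTraceLines(jsonl: string, traceId: string): TraceContext[] {
  const spans: TraceContext[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const trace = JSON.parse(line)?.metadata?.trace as TraceContext | undefined;
    if (trace && trace.traceId === traceId) spans.push(trace);
  }
  return spans;
}

// Returns span IDs from the root down to `spanId`; empty if the span is unknown.
function spanPath(spans: TraceContext[], spanId: string): string[] {
  const bySpanId: Record<string, TraceContext> = {};
  for (const span of spans) bySpanId[span.spanId] = span;

  const path: string[] = [];
  let current: TraceContext | undefined = bySpanId[spanId];
  // The indexOf check guards against accidental parent cycles.
  while (current && path.indexOf(current.spanId) === -1) {
    path.unshift(current.spanId);
    current = current.parentSpanId ? bySpanId[current.parentSpanId] : undefined;
  }
  return path;
}
```

A malformed line makes JSON.parse throw in this sketch; a production reader would probably skip and count such lines instead.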
6. Security
6.1 Agent Authentication
Every A2A message must carry a valid agent identity. Identity is established at agent startup:
// File: src/resources/extensions/sf/dispatch/auth.ts
import { createHmac } from "node:crypto";
/**
* Agent identity token — HMAC-SHA256 of agent ID + basePath + startup timestamp.
* Used to authenticate messages from agents.
* Generated once at agent startup; stored in process.env.SF_AGENT_TOKEN.
*/
function generateAgentToken(agentId: string, basePath: string): string {
const secret = process.env.SF_A2A_SHARED_SECRET ?? process.env.SF_DB_KEY ?? "sf-insecure-dev-secret";
const payload = `${agentId}:${basePath}:${Date.now()}`;
return createHmac("sha256", secret).update(payload).digest("hex").slice(0, 32);
}
function verifyAgentToken(token: string, agentId: string): boolean {
// Tokens are per-startup: each agent start issues a fresh token, and a token is not reusable across restarts
// Verification is membership check: token must have been issued for this agentId
return validTokens.has(`${agentId}:${token}`);
}
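The code above does not show how validTokens gets populated. One way to connect issuing and verification is sketched below, under stated assumptions: AgentTokenStore is a hypothetical class, it keeps one current token per agent so a restart invalidates the previous token, and it requires an explicit secret instead of falling back to a built-in default.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical store: one current token per agent ID.
class AgentTokenStore {
  private readonly issued = new Map<string, string>();

  constructor(private readonly secret: string) {
    if (secret.length === 0) throw new Error("A2A shared secret must be set");
  }

  // Same derivation as generateAgentToken; `now` is injectable for tests.
  issue(agentId: string, basePath: string, now: number = Date.now()): string {
    const payload = `${agentId}:${basePath}:${now}`;
    const token = createHmac("sha256", this.secret).update(payload).digest("hex").slice(0, 32);
    this.issued.set(agentId, token); // replaces any token from a previous startup
    return token;
  }

  verify(agentId: string, token: string): boolean {
    const expected = this.issued.get(agentId);
    if (expected === undefined) return false;
    const a = Buffer.from(expected, "utf8");
    const b = Buffer.from(token, "utf8");
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return a.length === b.length && timingSafeEqual(a, b);
  }

  revoke(agentId: string): void {
    this.issued.delete(agentId);
  }
}
```

Calling revoke from the crash handler would close the window in which a dead worker's token is still accepted.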
6.2 Input Validation
// File: src/resources/extensions/sf/dispatch/validation.ts
const MAX_BODY_DEPTH = 20; // Nested object depth
const MAX_ARRAY_LENGTH = 1000; // Max array items in body
const MAX_STRING_LENGTH = 100_000; // Max string value length
const FORBIDDEN_KEYS = ["__proto__", "constructor", "prototype"]; // Prototype pollution
function validateMessageBody(body: unknown, depth = 0): void {
if (depth > MAX_BODY_DEPTH) throw new ValidationError("BODY_TOO_DEEP");
if (Array.isArray(body)) {
if (body.length > MAX_ARRAY_LENGTH) throw new ValidationError("ARRAY_TOO_LARGE");
for (const item of body) validateMessageBody(item, depth + 1);
return;
}
if (typeof body === "object" && body !== null) {
for (const [k, v] of Object.entries(body)) {
if (FORBIDDEN_KEYS.includes(k)) throw new ValidationError(`FORBIDDEN_KEY: ${k}`);
validateMessageBody(v, depth + 1);
}
return;
}
if (typeof body === "string" && body.length > MAX_STRING_LENGTH) {
throw new ValidationError("STRING_TOO_LONG");
}
}
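One subtlety of the FORBIDDEN_KEYS check is worth illustrating: in a JavaScript object literal, a "__proto__" key sets the prototype and creates no own property, whereas JSON.parse creates an own "__proto__" property. Since message bodies arrive as parsed JSON, the check does fire on real input. The sketch below uses a hypothetical findForbiddenKey helper that mirrors only the key check, without the depth and size limits.

```typescript
const FORBIDDEN_KEYS = ["__proto__", "constructor", "prototype"];

// Hypothetical helper: returns the first forbidden own key found, if any.
function findForbiddenKey(body: unknown): string | undefined {
  if (Array.isArray(body)) {
    for (const item of body) {
      const hit = findForbiddenKey(item);
      if (hit !== undefined) return hit;
    }
    return undefined;
  }
  if (typeof body === "object" && body !== null) {
    for (const key of Object.keys(body)) {
      if (FORBIDDEN_KEYS.indexOf(key) !== -1) return key;
      const hit = findForbiddenKey((body as Record<string, unknown>)[key]);
      if (hit !== undefined) return hit;
    }
  }
  return undefined;
}

// Literal syntax: sets the prototype, so there is no own "__proto__" key to find.
const fromLiteral = { "__proto__": { evil: true } };

// Wire format: JSON.parse creates an own "__proto__" property, which the check sees.
const fromWire: unknown = JSON.parse('{"__proto__": {"evil": true}}');

const literalHit = findForbiddenKey(fromLiteral); // undefined
const wireHit = findForbiddenKey(fromWire);       // "__proto__"
```

Tests for the validator should therefore build hostile payloads with JSON.parse rather than with object literals.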
6.3 Capability Enforcement
The AgentRegistry enforces that agents only perform actions consistent with their registered capabilities:
// File: src/resources/extensions/sf/dispatch/capability-enforcer.ts
function enforceCapabilities(agentId: string, action: string): void {
const card = registry.getCard(agentId);
if (!card) throw new DeliveryError(`Unknown agent: ${agentId}`, "AGENT_NOT_FOUND", false, 0);
const caps = card.capabilities as SFAgentCapabilities;
switch (action) {
case "write_project_db":
if (caps.isolation !== "full") {
throw new DeliveryError(
`${agentId} cannot write project DB (isolation: ${caps.isolation})`,
"ISOLATION_VIOLATION",
false,
0,
);
}
break;
case "send_to_worker":
if (caps.role === "subagent") {
// Constrained subagents can only send to their parent
throw new DeliveryError("Subagent cannot send to workers", "CAPABILITY_DENIED", false, 0);
}
break;
case "read_project_context":
// All agents can read project context (it's in the prompt)
break;
}
}
7. Testing Strategy
7.1 Unit Tests
// File: src/resources/extensions/sf/dispatch/a2a-service.test.ts
describe("A2AMessageService", () => {
let bus: MessageBus;
let registry: AgentRegistry;
let service: A2AMessageService;
beforeEach(() => {
const dir = tmpDir(); // one shared directory so all three components use the same SQLite files
bus = new MessageBus(dir);
registry = new AgentRegistry(dir, bus);
service = new A2AMessageService(dir, registry);
});
test("send_delivers_to_recipient_inbox", async () => {
registry.register(workerCard("worker:1"));
const id = service.send({
from: "coordinator",
to: "worker:1",
body: { type: "task_submitted", taskId: "M01" },
});
const inbox = service.getInbox("worker:1");
const msgs = inbox.list();
expect(msgs).toHaveLength(1);
expect(msgs[0].id).toBe(id);
});
test("sendWithRetry_retries_on_retryable_error", async () => {
// Simulate transient DB busy
vi.spyOn(bus, "send")
.mockRejectedValueOnce(new DeliveryError("busy", "DB_BUSY", true, 1))
.mockResolvedValueOnce("msg-1");
const id = await sendWithRetry({ from: "c", to: "w", body: { test: true } });
expect(id).toBe("msg-1");
expect(bus.send).toHaveBeenCalledTimes(2);
});
test("sendWithRetry_does_not_retry_non_retryable_error", async () => {
vi.spyOn(bus, "send").mockRejectedValueOnce(
new DeliveryError("unknown agent", "AGENT_NOT_FOUND", false, 1),
);
// toThrow("...") matches the error message, so assert on the error code instead
await expect(sendWithRetry({ from: "c", to: "w", body: { test: true } }))
.rejects.toMatchObject({ code: "AGENT_NOT_FOUND" });
expect(bus.send).toHaveBeenCalledTimes(1);
});
test("sendOnce_same_key_returns_same_id", async () => {
const id1 = service.sendOnce({ from: "c", to: "w", body: { beat: 1 }, dedupeKey: "heartbeat" });
const id2 = service.sendOnce({ from: "c", to: "w", body: { beat: 2 }, dedupeKey: "heartbeat" });
expect(id1).toBe(id2); // Idempotent
});
test("validateMessageBody_rejects_deep_objects", () => {
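// Nest one level deeper than the limit; a 5-level object would pass validation
let deep: Record<string, unknown> = {};
for (let i = 0; i <= MAX_BODY_DEPTH; i++) deep = { child: deep };
expect(() => validateMessageBody(deep)).toThrow("BODY_TOO_DEEP");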
});
test("validateMessageBody_rejects_prototype_pollution", () => {
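// A literal "__proto__" key sets the prototype; JSON.parse creates the own key an attacker would send
const polluted = JSON.parse('{"__proto__": {"evil": true}}');
expect(() => validateMessageBody(polluted))
.toThrow("FORBIDDEN_KEY");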
});
});
7.2 Integration Tests
// File: src/resources/extensions/sf/tests/a2a-integration.test.ts
describe("A2A Integration", () => {
test("worker_registers_and_receives_task", async () => {
const { coordinator, worker, service, registry } = setupTwoAgentSystem();
// Worker starts, registers
await worker.start();
await waitFor(() => registry.getStatus("worker:1") === "online");
// Coordinator sends task
service.send({
from: "coordinator",
to: "worker:1",
body: { type: "task_submitted", taskId: "M01" },
});
// Worker receives
const msg = await worker.waitForMessage("task_submitted");
expect(msg.body.taskId).toBe("M01");
});
test("worker_crash_does_not_lose_messages", async () => {
const { coordinator, worker, service } = setupTwoAgentSystem();
await worker.start();
service.send({ from: "coordinator", to: "worker:1", body: { type: "task_submitted" } });
// Worker crashes and restarts
await worker.kill();
await worker.start();
// Message should still be in inbox after restart
const msg = await worker.waitForMessage("task_submitted");
expect(msg).toBeDefined();
});
test("coordinator_receives_worker_heartbeat", async () => {
const { coordinator, worker, service } = setupTwoAgentSystem();
await worker.start();
worker.sendHeartbeat();
const msg = await coordinator.waitForMessage("worker.heartbeat");
expect(msg.from).toBe("worker:1");
});
});
7.3 Chaos Tests
// File: src/resources/extensions/sf/tests/a2a-chaos.test.ts
describe("A2A Chaos", () => {
test("messages_delivered_despite_slow_worker", async () => {
// Worker is slow to process (simulate 10s processing time)
worker.simulateSlowProcessing(10_000);
// Send 100 messages while worker is slow
const sends = Array.from({ length: 100 }, (_, i) =>
service.send({ from: "c", to: "w", body: { seq: i } }),
);
const results = await Promise.allSettled(sends);
// All succeed (buffered, not rejected)
expect(results.filter(r => r.status === "fulfilled")).toHaveLength(100);
// Worker processes all after recovery
worker.simulateFastProcessing();
await worker.processAllBuffered();
const received = await worker.getAllMessages();
expect(received).toHaveLength(100);
});
test("panic_mode_activates_on_repeated_failure", async () => {
bus.simulatePermanentFailure();
for (let i = 0; i < 3; i++) {
try {
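// Panic mode counts only HIGH/URGENT failures (3.4), so send at high priority
await service.send({ from: "c", to: "w", body: { test: true }, metadata: { priority: "high" } });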
} catch {}
}
// Panic mode should be active
expect(service.isPanicMode).toBe(true);
// File-based fallback should be active
expect(sessionStatusSignalWasUsed()).toBe(true);
});
});
8. Rollback Procedures
8.1 Feature Flag
All A2A behavior is gated by SF_A2A_ENABLED:
// File: src/resources/extensions/sf/dispatch/service.ts
const A2A_ENABLED = process.env.SF_A2A_ENABLED === "1";
export class DispatchService {
private messageService: A2AMessageService | null = null;
constructor(opts: DispatchOptions) {
if (A2A_ENABLED) {
this.messageService = new A2AMessageService(opts.basePath, this.registry);
}
// ...
}
async pause(workerId: string): Promise<void> {
if (this.messageService && A2A_ENABLED) {
await this.messageService.send({
from: "coordinator",
to: workerId,
body: { type: "control", action: "pause" },
metadata: { priority: "high" },
});
} else {
// Legacy file-based signal
sendSignal(this.basePath, workerId, "pause");
}
}
}
8.2 Per-Phase Rollback
| Phase | Rollback |
|---|---|
| Phase 1: A2A adapter types | Delete a2a-types.ts, a2a-task.ts. No behavior change — code not wired yet. |
| Phase 2: AgentRegistry | Delete capability-registry.ts. Remove registry from DispatchService constructor. No behavior change. |
| Phase 3: MessageBus wiring | Set SF_A2A_ENABLED=0. File-based IPC (sendSignal) is the automatic fallback. |
| Phase 4: Subagent A2A | Delete subagent/a2a.ts. Restore original subagent/index.js from git. |
| Phase 5: UOK kernel A2A | Revert uok/kernel.js to pre-Phase-5 state from git. |
| Phase 6: Fallback removal | session-status-io.js is never removed — it stays as crash-recovery fallback permanently. |
8.3 Emergency Rollback
# Emergency: disable A2A entirely
SF_A2A_ENABLED=0 sf headless autonomous
# Emergency: revert to specific phase
git stash
git checkout phase2-end # tag or branch at end of Phase 2
SF_A2A_ENABLED=0 sf headless autonomous
# Verify rollback
npx vitest run src/resources/extensions/sf/tests/uok-message-bus.test.mjs
9. Migration Phases (Detailed)
Phase 1: A2A Type Definitions (Week 1-2)
Risk: Zero | Behavior: identical
Files created:
dispatch/a2a-types.ts — A2A types + SF extensions
dispatch/a2a-task.ts — Task creation + state mapping
dispatch/a2a-errors.ts — DeliveryError + error codes
Files modified:
None (types are additive, not wired)
Verification:
npx tsc --noEmit src/resources/extensions/sf/dispatch/a2a-types.ts
npx vitest run src/resources/extensions/sf/dispatch/a2a-task.test.ts
Phase 2: AgentRegistry (Week 2-3)
Risk: Low | Behavior: additive
Files created:
dispatch/capability-registry.ts — AgentRegistry + SF_CAPABILITY_DEFINITIONS
Files modified:
dispatch/service.ts — Add registry to DispatchService (opt-in via feature flag)
dispatch/index.ts — Export new types
Verification:
npx vitest run src/resources/extensions/sf/dispatch/capability-registry.test.ts
SF_A2A_ENABLED=0 npm run test:unit # existing tests pass
Phase 3: MessageBus Wiring (Week 3-4)
Risk: Medium | Behavior: pause/resume/stop now use MessageBus
Files created:
dispatch/a2a-service.ts — A2AMessageService wrapping MessageBus
Files modified:
dispatch/service.ts — Wire MessageBus into pause/resume/stop
dispatch/worker-*.ts — Register AgentCard on spawn
session-status-io.ts — Mark as crash-recovery fallback (never primary)
Before: sendSignal(basePath, id, "pause") → signal file
After: messageService.send({ from, to, body: { type: "control", action: "pause" }, metadata: { priority: "high" } })
Fallback: File signal if MessageBus delivery fails 3 times
Verification:
SF_A2A_ENABLED=1 npx vitest run src/resources/extensions/sf/tests/a2a-integration.test.ts
SF_A2A_ENABLED=0 npm run test:unit # existing tests pass
Phase 4: Subagent A2A (Week 4-5)
Risk: Medium | Behavior: subagent modes unchanged
Files modified:
subagent/index.ts — Use DispatchService internally
dispatch/service.ts — Handle isolation: constrained
Verification:
SF_A2A_ENABLED=1 npx vitest run src/resources/extensions/sf/tests/subagent-a2a.test.ts
SF_A2A_ENABLED=0 npm run test:unit # existing tests pass
Phase 5: UOK Kernel A2A (Week 5-6)
Risk: Medium | Behavior: UOK autonomous loop uses A2A
Files modified:
uok/kernel.ts — Use DispatchService + A2AMessageService
uok/index.ts — Export new A2A types
Verification:
SF_A2A_ENABLED=1 npm run test:integration # Full integration suite
SF_A2A_ENABLED=0 npm run test:integration # Legacy still works
Phase 6: A2A Default On (Week 6-7)
Risk: Low | Behavior: A2A is now the default
Actions:
1. Set SF_A2A_ENABLED=1 as default in preferences
2. Document in CHANGELOG.md
3. Monitor for 1 week before declaring stable
10. Operational Runbooks
10.1 Dispatch Degraded
Symptoms: Dashboard shows "dispatch degraded"; sf_dispatch_degraded events in journal
Diagnosis:
# Check MessageBus health
node -e "import('./src/resources/extensions/sf/uok/message-bus.js').then(m => {
const metrics = m.getUokMessageBusMetrics();
console.log(JSON.stringify(metrics, null, 2));
})"
# Check for panic mode
cat .sf/journal/*.jsonl | jq 'select(.event == "a2a.panic_mode")' | tail -5
Fix:
# Restart with A2A off; file-based IPC takes over
SF_A2A_ENABLED=0 sf headless autonomous
# After the fix: re-enable A2A
sf config set SF_A2A_ENABLED=1
10.2 Worker Not Receiving Messages
Symptoms: Worker shows "offline" but process is running
Diagnosis:
# Check worker AgentCard registration
curl -s http://localhost:3030/api/dispatch/agents | jq '.[] | select(.role == "worker")'
# Check worker inbox size
node -e "const m = require('./src/resources/extensions/sf/dispatch/metrics'); console.log(m.getInboxMetrics('worker:M01'))"
# Check recent delivery failures
cat .sf/journal/*.jsonl | jq 'select(.event == "a2a.delivery_failed")' | tail -20
Fix:
# Restart the worker process
sf parallel stop M01
sf parallel start M01
# Or: send SIGUSR1 to worker to re-register its AgentCard
kill -USR1 $(pgrep -f "sf.*M01")
10.3 Inbox Overflow
Symptoms: "INBOX_OVERFLOW" errors in logs; workers missing messages
Diagnosis:
# Find overflowing inboxes
node -e "import('./src/resources/extensions/sf/dispatch/metrics').then(m => {
Object.entries(m.getAllInboxSizes()).forEach(([id, size]) => {
if (size > 900) console.log(id, size);
});
})"
Fix:
# Compact all message buses (removes messages older than retention)
sf uok messages compact
# Or: increase inbox size limit temporarily
SF_INBOX_MAX_SIZE=5000 sf headless autonomous
11. Performance Targets
| Metric | Target | Critical Threshold |
|---|---|---|
| Message delivery latency (local) | < 50ms p50, < 500ms p99 | > 2000ms |
| Inbox delivery for 100 parallel workers | < 5s end-to-end | > 15s |
| Agent registration time | < 100ms | > 1000ms |
| Message throughput | > 1000 msg/s per coordinator | < 100 msg/s |
| Memory per worker (idle) | < 50 MB | > 200 MB |
| Memory per coordinator (10 workers) | < 200 MB | > 500 MB |
| DB WAL size growth | < 10 MB/day | > 100 MB/day |
| Recovery time after coordinator crash | < 5s | > 30s |
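The latency row can be turned into an automated check. The sketch below uses the nearest-rank percentile method and classifies a sample window against the p50/p99 targets and the critical threshold. The function names are hypothetical, and applying the 2000 ms critical threshold to p99 (rather than to the maximum or mean) is an assumption.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

type LatencyVerdict = "ok" | "degraded" | "critical";

// Targets from the table: p50 < 50 ms, p99 < 500 ms, critical above 2000 ms.
function classifyLatency(samplesMs: number[]): LatencyVerdict {
  const p50 = percentile(samplesMs, 50);
  const p99 = percentile(samplesMs, 99);
  if (p99 > 2000) return "critical"; // ASSUMPTION: threshold is applied to p99
  if (p50 < 50 && p99 < 500) return "ok";
  return "degraded";
}
```

Fed from the sf_a2a_message_delivery_latency_ms observations, a check like this could gate each phase's "monitor before declaring stable" step.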
12. File Manifest
New Files
| File | Lines (est) | Purpose |
|---|---|---|
| dispatch/a2a-types.ts | 120 | Core A2A types + SF extensions |
| dispatch/a2a-task.ts | 80 | Task creation + state mapping |
| dispatch/a2a-errors.ts | 60 | DeliveryError + error codes |
| dispatch/a2a-service.ts | 250 | A2AMessageService wrapping MessageBus |
| dispatch/capability-registry.ts | 180 | AgentRegistry + SF_CAPABILITY_DEFINITIONS |
| dispatch/metrics.ts | 60 | A2A Prometheus metrics |
| dispatch/logger.ts | 40 | A2A structured logging |
| dispatch/validation.ts | 70 | Message body validation |
| dispatch/auth.ts | 50 | Agent token generation + verification |
| dispatch/index.ts | 30 | Barrel exports |
| dispatch/a2a-service.test.ts | 200 | Unit tests |
| tests/a2a-integration.test.ts | 300 | Integration tests |
| tests/a2a-chaos.test.ts | 150 | Chaos tests |
| Total new | ~1600 LOC | |
Modified Files
| File | Change |
|---|---|
| dispatch/service.ts | Add registry + messageService; wire pause/resume/stop |
| dispatch/worker-orchestrator.ts | Register AgentCard on spawn; open AgentInbox |
| uok/kernel.ts | Register coordinator AgentCard; use DispatchService |
| uok/message-bus.js | Add AgentCard types (no behavior change) |
| uok/index.ts | Export A2A types |
| subagent/index.ts | Use DispatchService; remove ~600 LOC spawn management |
| session-status-io.ts | Mark as crash-recovery fallback only |
Summary
| Question | Answer |
|---|---|
| A2A as internal protocol | YES — Task state, priority, capability discovery |
| Transport | SQLite MessageBus (not HTTP/WebSocket) |
| External A2A | Optional; wired later |
| Feature flag | SF_A2A_ENABLED gates all behavior |
| Migration | 6 phases; each independently rollback-safe |
| Error handling | Retry with exponential backoff; panic mode with file-based fallback |
| Backpressure | Per-inbox limits; coordinator outbox batching |
| Observability | Prometheus metrics + structured JSONL logging |
| Security | Agent tokens, input validation, capability enforcement |
| Testing | Unit + integration + chaos tests for every phase |
| Rollback | SF_A2A_ENABLED=0 disables all new behavior instantly |