singularity-forge/docs/records/2026-05-07-metrics-central-vs-ra-aid-review.md


Metrics Central vs RA.Aid Architecture Review

Date: 2026-05-07
Reviewer: Claude Code (SF)
Scope: metrics-central.js and its wiring, compared against RA.Aid patterns


RA.Aid Architecture Summary

RA.Aid is a Python-based autonomous coding agent with these key architectural decisions:

| Layer | Pattern |
| --- | --- |
| State | Peewee ORM over SQLite (.ra-aid/pk.db), WAL mode, contextvars for connection scoping |
| Agents | LangGraph agents (research → planning → implementation) with explicit stage boundaries |
| Memory | Key facts, key snippets, research notes, trajectories — all DB-backed with repositories |
| Trajectory | Every tool call recorded: tool_name, parameters, result, cost, tokens, is_error, error_message |
| Config | JSON config file + runtime config repository with defaults |
| Shell | Interactive approval with cowboy_mode bypass, trajectory logging, timeout handling |
| Reasoning | Optional expert model consultation before each stage (reasoning_assist) |
| Recovery | Fallback handlers, retry with backoff, agent thread manager |

RA.Aid's Observability Model

RA.Aid doesn't have a separate metrics system. Instead, observability is embedded in the trajectory:

  • Every tool execution → Trajectory record with cost, tokens, timing
  • Every stage transition → Trajectory record with record_type="stage_transition"
  • Every human input → HumanInput record linked to trajectories
  • Every error → Trajectory with is_error=true, error_type, error_details

This is event-sourced observability: the DB is the single source of truth for both state AND metrics.
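As a concrete sketch, a single trajectory record under this model might look like the following (field names are taken from the summary above; the real RA.Aid schema may differ):

```javascript
// Hypothetical trajectory record, using the fields named in the summary
// above (tool_name, parameters, cost, tokens, is_error, error_message).
const trajectoryRecord = {
  session_id: "sess-42",           // links back to the session / HumanInput
  record_type: "tool_execution",   // or "stage_transition"
  tool_name: "run_shell_command",
  parameters: { command: "pytest -q" },
  current_cost: 0.0042,            // USD for this call
  input_tokens: 512,
  output_tokens: 128,
  is_error: false,
  error_message: null,
};

// Aggregation happens at query time: summing current_cost over a
// session's records yields per-session cost with no separate
// metrics pipeline.
```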


Our Metrics-Central.js Design

What We Built

A Prometheus-compatible metrics collector with:

  • Counter, Gauge, Histogram types
  • In-memory aggregation with 60s flush to .sf/runtime/sf-metrics.prom
  • Pre-defined metric metadata registry
  • Wiring into subagent inheritance and mode transitions
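In miniature, the three metric types behave like this (an illustrative sketch, not the actual metrics-central.js implementation):

```javascript
// Illustrative miniature of the three metric types; the real
// metrics-central.js implementation differs in detail.
class Counter {
  constructor() { this.value = 0; }
  inc(amount = 1) { this.value += amount; }  // monotonically increasing
}

class Gauge {
  constructor() { this.value = 0; }
  set(value) { this.value = value; }         // may go up or down
}

class Histogram {
  constructor(buckets) {
    this.buckets = [...buckets].sort((a, b) => a - b);        // ascending order
    this.counts = new Array(this.buckets.length + 1).fill(0); // last slot = +Inf
    this.sum = 0;
    this.count = 0;
  }
  observe(value) {
    this.sum += value;
    this.count += 1;
    const i = this.buckets.findIndex((b) => value <= b);
    this.counts[i === -1 ? this.buckets.length : i] += 1;
  }
}
```

Note that Prometheus exposition makes histogram buckets cumulative (each le bucket includes all smaller ones); a sketch like this keeps per-bucket counts internally and would accumulate them at render time.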

Design Decisions and Their Trade-offs

| Decision | Rationale | RA.Aid Comparison |
| --- | --- | --- |
| Prometheus text format | Compatible with existing exposition, scrapeable by Grafana | RA.Aid uses DB queries; we support both |
| In-memory aggregation | Zero dependencies, fast | RA.Aid queries DB directly; we add a layer |
| 60s flush interval | Batch writes, reduce I/O | RA.Aid writes per event; we batch |
| Separate from trajectory/audit | Metrics are aggregated views, not individual events | RA.Aid conflates events and metrics |
| Metric metadata registry | Pre-defined help text and labels | RA.Aid uses Peewee model definitions |

The Review: 5 Lenses

Lens 1: Data Model Consistency

RA.Aid Pattern: Single SQLite DB with typed models. Trajectory is the universal event log.

Our Pattern: Dual persistence:

  • SQLite for operational state (UOK, sessions, tasks)
  • Prometheus text file for metrics exposition
  • JSONL for event durability

Verdict: ⚠️ NEEDS WORK

We have THREE observability sinks (SQLite, Prometheus file, JSONL) where RA.Aid has one. This creates:

  • Risk of inconsistency between sf-metrics.prom and sf.db
  • No unified query surface for "show me all subagent blocks in the last hour"
  • Metrics file is write-only; no read path for programmatic consumption

Recommendation: Add a metrics table to sf.db that mirrors the Prometheus data model. The text file becomes a projection, not a source of truth.

CREATE TABLE metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    type TEXT NOT NULL CHECK(type IN ('counter', 'gauge', 'histogram')),
    labels TEXT, -- JSON object
    value REAL NOT NULL,
    timestamp TEXT NOT NULL DEFAULT (datetime('now')),
    session_id TEXT
);
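With such a table in place, the "all subagent blocks in the last hour" question from above reduces to a single query. The metric name and the labels column follow the schema sketch; the exact names are assumptions:

```javascript
// Hypothetical query against the proposed metrics table. The .prom text
// file would be regenerated from rows like these, rather than being the
// source of truth.
const blockedLastHourSql = `
  SELECT labels, SUM(value) AS blocked
  FROM metrics
  WHERE name = 'sf_subagent_dispatch_blocked'
    AND timestamp >= datetime('now', '-1 hour')
  GROUP BY labels;
`;
```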

Lens 2: Event-Sourced vs Aggregated

RA.Aid Pattern: Every event is a row. Aggregation happens at query time.

Our Pattern: Aggregation happens at write time. Individual events are lost.

Verdict: ACCEPTABLE for metrics, but incomplete for observability

For counters and gauges, aggregation is correct. But for debugging "why was this subagent blocked?", we need the individual event, not just sf_subagent_dispatch_blocked{reason="provider"} 5.

Recommendation: Keep metrics-central for aggregated Prometheus output, but ALSO emit individual events to the audit/trajectory system. The metric is the summary; the trajectory is the detail.
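A sketch of that dual-emit pattern (the in-memory stores and emitAuditEvent here are stand-in stubs; the real system would use metrics-central and the audit/trajectory subsystem):

```javascript
// Stand-in stubs so the sketch is self-contained.
const counters = new Map();
const auditLog = [];
function recordCounter(name, labels = {}, amount = 1) {
  const key = `${name}|${JSON.stringify(labels)}`;
  counters.set(key, (counters.get(key) ?? 0) + amount);
}
function emitAuditEvent(event) { auditLog.push(event); }

// The metric is the summary; the trajectory event is the detail.
function recordBlockedDispatch(reason, detail) {
  recordCounter("sf_subagent_dispatch_blocked", { reason });
  emitAuditEvent({
    type: "subagent_dispatch_blocked",
    reason,
    detail, // full context: which subagent, which provider, why
    timestamp: new Date().toISOString(),
  });
}
```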

Lens 3: Context and Session Scoping

RA.Aid Pattern: Every record has a session_id foreign key. Contextvars scope the DB connection.

Our Pattern: Metrics are global to the process. No session scoping.

Verdict: GAP

Our metrics can't answer: "How many subagent dispatches were blocked in session X?" This is critical for:

  • Per-session cost attribution
  • Debugging why a specific run failed
  • Multi-tenant scenarios (if SF ever serves multiple users)

Recommendation: Add session_id label to all metrics. Use ctx.sessionId or getAutoSession().currentTraceId.
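The wrapper could look like this (ctx.sessionId comes from the recommendation above; the helper name is illustrative):

```javascript
// Merge the session id into every label set before recording, so any
// metric can be filtered per session.
function withSession(labels, ctx) {
  return { ...labels, session_id: ctx?.sessionId ?? "unknown" };
}

// e.g. recordCounter("sf_subagent_dispatch_blocked",
//                    withSession({ reason: "provider" }, ctx));
```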

Lens 4: Cost and Token Tracking

RA.Aid Pattern: Every trajectory record has current_cost, input_tokens, output_tokens.

Our Pattern: No cost/token metrics in metrics-central yet.

Verdict: MISSING

RA.Aid tracks cost per tool call. We track cost in metrics.js (SQLite + JSONL) but not in metrics-central. This means:

  • No Prometheus-compatible cost metrics
  • No cost alerts from Grafana
  • No cost attribution by work mode or permission profile

Recommendation: Add cost/token metrics:

"sf_cost_total": { help: "Total cost in USD", labels: ["work_mode", "model_id"] },
"sf_tokens_input_total": { help: "Total input tokens", labels: ["model_id"] },
"sf_tokens_output_total": { help: "Total output tokens", labels: ["model_id"] },

Lens 5: Error Handling and Resilience

RA.Aid Pattern: Every error is caught, logged, and stored in the trajectory with full context.

Our Pattern: flushMetrics() catches and logs with logWarning(). No retry.

Verdict: ⚠️ ACCEPTABLE but could be stronger

Our flush failure is best-effort, which matches RA.Aid's philosophy. But RA.Aid also:

  • Reopens closed DB connections automatically
  • Has fallback handlers for agent failures
  • Records error details in the trajectory

Recommendation:

  1. Add retry with exponential backoff for flush failures
  2. If flush fails 3 times, emit a metrics_flush_failed counter
  3. On process exit, attempt a final synchronous flush
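Recommendations 1 and 2 could be sketched like this (flushOnce stands in for the real write-to-.prom-file step; the counter name follows the sf_ convention and is an assumption):

```javascript
// Retry a flush with exponential backoff; after the final failure, bump
// a self-monitoring counter so the failure itself is observable.
async function flushWithRetry(flushOnce, recordCounter,
                              { retries = 3, baseMs = 200 } = {}) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      await flushOnce();
      return true;
    } catch (err) {
      if (attempt === retries - 1) {
        recordCounter("sf_metrics_flush_failed_total");
        return false;
      }
      // 200ms, 400ms, 800ms, ... between attempts.
      await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** attempt));
    }
  }
}
```

Recommendation 3 is a separate path: process "exit" handlers only allow synchronous work, so the final exit-time flush must bypass this async retry loop and write synchronously.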

Specific Code Review Findings

Finding 1: Unused Import

import { isDbAvailable } from "./sf-db.js";

This is imported but never used. The JSDoc mentions "Optional SQLite persistence" but it's not implemented.

Fix: Either implement DB persistence or remove the import.

Finding 2: Histogram Bucket Sorting

this.buckets = [...buckets].sort((a, b) => a - b);

The spread creates a copy before sorting, so the caller's array is not mutated, and the ascending sort guarantees the bucket order Prometheus expects.

Verdict: Correct.

Finding 3: Label Key Serialization

_key(labels) {
    return this.labelNames.map((k) => `${k}=${labels[k] ?? ""}`).join(",");
}

If a label value contains = or ,, the key parsing will break.

Fix: Add escaping or use a structured key format (e.g., JSON).
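One possible fix keeps the key-per-label-set approach but makes it collision-proof (the helper name mirrors _key above; this is a sketch, not a drop-in patch):

```javascript
// JSON-encode [name, value] pairs so "=" and "," inside a label value
// can no longer collide with the key's own separators.
function labelKey(labelNames, labels) {
  return JSON.stringify(labelNames.map((k) => [k, labels[k] ?? ""]));
}

// The naive format maps distinct label sets such as {a: "1,b=2", b: ""}
// and {a: "1", b: "2,b="} to the same key ("a=1,b=2,b="); the JSON form
// keeps them apart.
```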

Finding 4: No Validation on Metric Names

export function recordCounter(name, labels = {}, amount = 1) {
    const meta = getMetricMeta(name);
    getRegistry().counter(name, meta.help, Object.keys(labels)).inc(labels, amount);
}

If name contains spaces or invalid Prometheus characters, the output will be malformed.

Fix: Add validateMetricName(name) that rejects invalid characters.
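Prometheus metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*, so the suggested validateMetricName can be a single regex check (a sketch):

```javascript
// Metric name rule from the Prometheus data model:
// first char letter/underscore/colon, then letters/digits/underscores/colons.
const METRIC_NAME_RE = /^[a-zA-Z_:][a-zA-Z0-9_:]*$/;

function validateMetricName(name) {
  if (typeof name !== "string" || !METRIC_NAME_RE.test(name)) {
    throw new Error(`Invalid Prometheus metric name: ${JSON.stringify(name)}`);
  }
  return name;
}
```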

Finding 5: Timer Unref

if (_flushTimer.unref) _flushTimer.unref();

The feature check is correct for Node.js; in runtimes where timers lack unref, the guard simply skips the call, so the flush timer could keep the process alive there.

Verdict: Acceptable; the guard degrades gracefully.


Overall Assessment

| Dimension | Grade | Notes |
| --- | --- | --- |
| Correctness | B+ | Prometheus output is valid, but label escaping needs work |
| Completeness | B | Missing cost/token metrics, session scoping, DB persistence |
| Consistency with SF | A | Fits the extension model, uses existing patterns |
| Consistency with RA.Aid | C | RA.Aid would prefer event-sourced over aggregated |
| Production Readiness | B | Needs retry, validation, and DB projection before GA |

Priority Fixes

  1. P0: Add session_id label to all metrics
  2. P0: Remove unused isDbAvailable import or implement DB persistence
  3. P1: Add cost/token metrics
  4. P1: Fix label value escaping
  5. P1: Add metric name validation
  6. P2: Add retry with backoff for flush failures
  7. P2: Add final flush on process exit
  8. P2: Consider a metrics table in sf.db as source of truth

RA.Aid Patterns Worth Adopting

  1. Trajectory-style event logging: Every metric should have a corresponding event in the audit/trajectory system
  2. Session-scoped connections: All observability should be filterable by session
  3. Per-tool cost tracking: Every tool call should record cost and tokens
  4. Error detail preservation: When metrics indicate failure, the detail should be queryable

Conclusion

metrics-central.js is a solid Prometheus-compatible metrics layer that fills a real gap in SF's observability. However, it prioritizes exposition format over observability depth. RA.Aid's trajectory model is superior for debugging and audit because it preserves every event.

The right path forward:

  1. Keep metrics-central for Prometheus output (Grafana compatibility)
  2. Add a metrics table to sf.db for queryable aggregation
  3. Ensure every metric has a corresponding audit/trajectory event
  4. Add session scoping and cost tracking

This gives us the best of both worlds: Prometheus for dashboards, SQLite for queries, and trajectory for debugging.