singularity-forge/docs/records/2026-05-07-metrics-central-vs-ra-aid-review.md


Metrics Central vs RA.Aid Architecture Review

Date: 2026-05-07
Reviewer: Claude Code (SF)
Scope: metrics-central.js and its wiring, compared against RA.Aid patterns


RA.Aid Architecture Summary

RA.Aid is a Python-based autonomous coding agent with these key architectural decisions:

| Layer | Pattern |
| --- | --- |
| State | Peewee ORM over SQLite (.ra-aid/pk.db), WAL mode, contextvars for connection scoping |
| Agents | LangGraph agents (research → planning → implementation) with explicit stage boundaries |
| Memory | Key facts, key snippets, research notes, trajectories — all DB-backed with repositories |
| Trajectory | Every tool call recorded: tool_name, parameters, result, cost, tokens, is_error, error_message |
| Config | JSON config file + runtime config repository with defaults |
| Shell | Interactive approval with cowboy_mode bypass, trajectory logging, timeout handling |
| Reasoning | Optional expert model consultation before each stage (reasoning_assist) |
| Recovery | Fallback handlers, retry with backoff, agent thread manager |

RA.Aid's Observability Model

RA.Aid doesn't have a separate metrics system. Instead, observability is embedded in the trajectory:

  • Every tool execution → Trajectory record with cost, tokens, timing
  • Every stage transition → Trajectory record with record_type="stage_transition"
  • Every human input → HumanInput record linked to trajectories
  • Every error → Trajectory with is_error=true, error_type, error_details

This is event-sourced observability: the DB is the single source of truth for both state AND metrics.
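As a concrete sketch, a single trajectory record under this model might look like the following (field names are taken from the summary above; the real RA.Aid schema may differ):

```javascript
// Hypothetical trajectory record, using the fields named in the summary
// above (tool_name, parameters, cost, tokens, is_error, error_message).
const trajectoryRecord = {
  session_id: "sess-42",           // links back to the session / HumanInput
  record_type: "tool_execution",   // or "stage_transition"
  tool_name: "run_shell_command",
  parameters: { command: "pytest -q" },
  current_cost: 0.0042,            // USD for this call
  input_tokens: 512,
  output_tokens: 128,
  is_error: false,
  error_message: null,
};

// Aggregation happens at query time: summing current_cost over a
// session's records yields per-session cost with no separate
// metrics pipeline.
```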


Our Metrics-Central.js Design

What We Built

A Prometheus-compatible metrics collector with:

  • Counter, Gauge, Histogram types
  • In-memory aggregation with 60s flush to .sf/runtime/sf-metrics.prom
  • Pre-defined metric metadata registry
  • Wiring into subagent inheritance and mode transitions
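In miniature, the three metric types behave like this (an illustrative sketch, not the actual metrics-central.js implementation):

```javascript
// Illustrative miniature of the three metric types; the real
// metrics-central.js implementation differs in detail.
class Counter {
  constructor() { this.value = 0; }
  inc(amount = 1) { this.value += amount; }  // monotonically increasing
}

class Gauge {
  constructor() { this.value = 0; }
  set(value) { this.value = value; }         // may go up or down
}

class Histogram {
  constructor(buckets) {
    this.buckets = [...buckets].sort((a, b) => a - b);        // ascending order
    this.counts = new Array(this.buckets.length + 1).fill(0); // last slot = +Inf
    this.sum = 0;
    this.count = 0;
  }
  observe(value) {
    this.sum += value;
    this.count += 1;
    const i = this.buckets.findIndex((b) => value <= b);
    this.counts[i === -1 ? this.buckets.length : i] += 1;
  }
}
```

Note that Prometheus exposition makes histogram buckets cumulative (each le bucket includes all smaller ones); a sketch like this keeps per-bucket counts internally and would accumulate them at render time.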

Design Decisions and Their Trade-offs

| Decision | Rationale | RA.Aid Comparison |
| --- | --- | --- |
| Prometheus text format | Compatible with existing exposition, scrapeable by Grafana | RA.Aid uses DB queries; we support both |
| In-memory aggregation | Zero dependencies, fast | RA.Aid queries DB directly; we add a layer |
| 60s flush interval | Batch writes, reduce I/O | RA.Aid writes per event; we batch |
| Separate from trajectory/audit | Metrics are aggregated views, not individual events | RA.Aid conflates events and metrics |
| Metric metadata registry | Pre-defined help text and labels | RA.Aid uses Peewee model definitions |

The Review: 5 Lenses

Lens 1: Data Model Consistency

RA.Aid Pattern: Single SQLite DB with typed models. Trajectory is the universal event log.

Our Pattern: Dual persistence:

  • SQLite for operational state (UOK, sessions, tasks)
  • Prometheus text file for metrics exposition
  • JSONL for event durability

Verdict: ⚠️ NEEDS WORK

We have THREE observability sinks (SQLite, Prometheus file, JSONL) where RA.Aid has one. This creates:

  • Risk of inconsistency between sf-metrics.prom and sf.db
  • No unified query surface for "show me all subagent blocks in the last hour"
  • Metrics file is write-only; no read path for programmatic consumption

Recommendation: Add a metrics table to sf.db that mirrors the Prometheus data model. The text file becomes a projection, not a source of truth.

CREATE TABLE metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    type TEXT NOT NULL CHECK(type IN ('counter', 'gauge', 'histogram')),
    labels TEXT, -- JSON object
    value REAL NOT NULL,
    timestamp TEXT NOT NULL DEFAULT (datetime('now')),
    session_id TEXT
);
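With such a table in place, the "all subagent blocks in the last hour" question from above reduces to a single query. The metric name and the labels column follow the schema sketch; the exact names are assumptions:

```javascript
// Hypothetical query against the proposed metrics table. The .prom text
// file would be regenerated from rows like these, rather than being the
// source of truth.
const blockedLastHourSql = `
  SELECT labels, SUM(value) AS blocked
  FROM metrics
  WHERE name = 'sf_subagent_dispatch_blocked'
    AND timestamp >= datetime('now', '-1 hour')
  GROUP BY labels;
`;
```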

Lens 2: Event-Sourced vs Aggregated

RA.Aid Pattern: Every event is a row. Aggregation happens at query time.

Our Pattern: Aggregation happens at write time. Individual events are lost.

Verdict: ACCEPTABLE for metrics, but incomplete for observability

For counters and gauges, aggregation is correct. But for debugging "why was this subagent blocked?", we need the individual event, not just sf_subagent_dispatch_blocked{reason="provider"} 5.

Recommendation: Keep metrics-central for aggregated Prometheus output, but ALSO emit individual events to the audit/trajectory system. The metric is the summary; the trajectory is the detail.
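A sketch of that dual-emit pattern (the in-memory stores and emitAuditEvent here are stand-in stubs; the real system would use metrics-central and the audit/trajectory subsystem):

```javascript
// Stand-in stubs so the sketch is self-contained.
const counters = new Map();
const auditLog = [];
function recordCounter(name, labels = {}, amount = 1) {
  const key = `${name}|${JSON.stringify(labels)}`;
  counters.set(key, (counters.get(key) ?? 0) + amount);
}
function emitAuditEvent(event) { auditLog.push(event); }

// The metric is the summary; the trajectory event is the detail.
function recordBlockedDispatch(reason, detail) {
  recordCounter("sf_subagent_dispatch_blocked", { reason });
  emitAuditEvent({
    type: "subagent_dispatch_blocked",
    reason,
    detail, // full context: which subagent, which provider, why
    timestamp: new Date().toISOString(),
  });
}
```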

Lens 3: Context and Session Scoping

RA.Aid Pattern: Every record has a session_id foreign key. Contextvars scope the DB connection.

Our Pattern: Metrics are global to the process. No session scoping.

Verdict: GAP

Our metrics can't answer: "How many subagent dispatches were blocked in session X?" This is critical for:

  • Per-session cost attribution
  • Debugging why a specific run failed
  • Multi-tenant scenarios (if SF ever serves multiple users)

Recommendation: Add session_id label to all metrics. Use ctx.sessionId or getAutoSession().currentTraceId.
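The wrapper could look like this (ctx.sessionId comes from the recommendation above; the helper name is illustrative):

```javascript
// Merge the session id into every label set before recording, so any
// metric can be filtered per session.
function withSession(labels, ctx) {
  return { ...labels, session_id: ctx?.sessionId ?? "unknown" };
}

// e.g. recordCounter("sf_subagent_dispatch_blocked",
//                    withSession({ reason: "provider" }, ctx));
```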

Lens 4: Cost and Token Tracking

RA.Aid Pattern: Every trajectory record has current_cost, input_tokens, output_tokens.

Our Pattern: No cost/token metrics in metrics-central yet.

Verdict: MISSING

RA.Aid tracks cost per tool call. We track cost in metrics.js (SQLite + JSONL) but not in metrics-central. This means:

  • No Prometheus-compatible cost metrics
  • No cost alerts from Grafana
  • No cost attribution by work mode or permission profile

Recommendation: Add cost/token metrics:

"sf_cost_total": { help: "Total cost in USD", labels: ["work_mode", "model_id"] },
"sf_tokens_input_total": { help: "Total input tokens", labels: ["model_id"] },
"sf_tokens_output_total": { help: "Total output tokens", labels: ["model_id"] },

Lens 5: Error Handling and Resilience

RA.Aid Pattern: Every error is caught, logged, and stored in the trajectory with full context.

Our Pattern: flushMetrics() catches and logs with logWarning(). No retry.

Verdict: ⚠️ ACCEPTABLE but could be stronger

Our flush failure is best-effort, which matches RA.Aid's philosophy. But RA.Aid also:

  • Reopens closed DB connections automatically
  • Has fallback handlers for agent failures
  • Records error details in the trajectory

Recommendation:

  1. Add retry with exponential backoff for flush failures
  2. If flush fails 3 times, emit a metrics_flush_failed counter
  3. On process exit, attempt a final synchronous flush
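Recommendations 1 and 2 could be sketched like this (flushOnce stands in for the real write-to-.prom-file step; the counter name follows the sf_ convention and is an assumption):

```javascript
// Retry a flush with exponential backoff; after the final failure, bump
// a self-monitoring counter so the failure itself is observable.
async function flushWithRetry(flushOnce, recordCounter,
                              { retries = 3, baseMs = 200 } = {}) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      await flushOnce();
      return true;
    } catch (err) {
      if (attempt === retries - 1) {
        recordCounter("sf_metrics_flush_failed_total");
        return false;
      }
      // 200ms, 400ms, 800ms, ... between attempts.
      await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** attempt));
    }
  }
}
```

Recommendation 3 is a separate path: process "exit" handlers only allow synchronous work, so the final exit-time flush must bypass this async retry loop and write synchronously.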

Specific Code Review Findings

Finding 1: Unused Import

import { isDbAvailable } from "./sf-db.js";

This is imported but never used. The JSDoc mentions "Optional SQLite persistence" but it's not implemented.

Fix: Either implement DB persistence or remove the import.

Finding 2: Histogram Bucket Sorting

this.buckets = [...buckets].sort((a, b) => a - b);

The spread creates a copy before sorting, so the caller's array is not mutated, and the ascending sort guarantees the bucket order Prometheus expects.

Verdict: Correct.

Finding 3: Label Key Serialization

_key(labels) {
    return this.labelNames.map((k) => `${k}=${labels[k] ?? ""}`).join(",");
}

If a label value contains = or ,, the key parsing will break.

Fix: Add escaping or use a structured key format (e.g., JSON).
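One possible fix keeps the key-per-label-set approach but makes it collision-proof (the helper name mirrors _key above; this is a sketch, not a drop-in patch):

```javascript
// JSON-encode [name, value] pairs so "=" and "," inside a label value
// can no longer collide with the key's own separators.
function labelKey(labelNames, labels) {
  return JSON.stringify(labelNames.map((k) => [k, labels[k] ?? ""]));
}

// The naive format maps distinct label sets such as {a: "1,b=2", b: ""}
// and {a: "1", b: "2,b="} to the same key ("a=1,b=2,b="); the JSON form
// keeps them apart.
```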

Finding 4: No Validation on Metric Names

export function recordCounter(name, labels = {}, amount = 1) {
    const meta = getMetricMeta(name);
    getRegistry().counter(name, meta.help, Object.keys(labels)).inc(labels, amount);
}

If name contains spaces or invalid Prometheus characters, the output will be malformed.

Fix: Add validateMetricName(name) that rejects invalid characters.
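Prometheus metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*, so the suggested validateMetricName can be a single regex check (a sketch):

```javascript
// Metric name rule from the Prometheus data model:
// first char letter/underscore/colon, then letters/digits/underscores/colons.
const METRIC_NAME_RE = /^[a-zA-Z_:][a-zA-Z0-9_:]*$/;

function validateMetricName(name) {
  if (typeof name !== "string" || !METRIC_NAME_RE.test(name)) {
    throw new Error(`Invalid Prometheus metric name: ${JSON.stringify(name)}`);
  }
  return name;
}
```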

Finding 5: Timer Unref

if (_flushTimer.unref) _flushTimer.unref();

The feature check is correct for Node.js; in runtimes where timers lack unref, the guard simply skips the call, so the flush timer could keep the process alive there.

Verdict: Acceptable; the guard degrades gracefully.


Overall Assessment

| Dimension | Grade | Notes |
| --- | --- | --- |
| Correctness | B+ | Prometheus output is valid, but label escaping needs work |
| Completeness | B | Missing cost/token metrics, session scoping, DB persistence |
| Consistency with SF | A | Fits the extension model, uses existing patterns |
| Consistency with RA.Aid | C | RA.Aid would prefer event-sourced over aggregated |
| Production Readiness | B | Needs retry, validation, and DB projection before GA |

Priority Fixes

  1. P0: Add session_id label to all metrics
  2. P0: Remove unused isDbAvailable import or implement DB persistence
  3. P1: Add cost/token metrics
  4. P1: Fix label value escaping
  5. P1: Add metric name validation
  6. P2: Add retry with backoff for flush failures
  7. P2: Add final flush on process exit
  8. P2: Consider a metrics table in sf.db as source of truth

RA.Aid Patterns Worth Adopting

  1. Trajectory-style event logging: Every metric should have a corresponding event in the audit/trajectory system
  2. Session-scoped connections: All observability should be filterable by session
  3. Per-tool cost tracking: Every tool call should record cost and tokens
  4. Error detail preservation: When metrics indicate failure, the detail should be queryable

Conclusion

metrics-central.js is a solid Prometheus-compatible metrics layer that fills a real gap in SF's observability. However, it prioritizes exposition format over observability depth. RA.Aid's trajectory model is superior for debugging and audit because it preserves every event.

The right path forward:

  1. Keep metrics-central for Prometheus output (Grafana compatibility)
  2. Add a metrics table to sf.db for queryable aggregation
  3. Ensure every metric has a corresponding audit/trajectory event
  4. Add session scoping and cost tracking

This gives us the best of both worlds: Prometheus for dashboards, SQLite for queries, and trajectory for debugging.