Repo-native Harness Architecture

Purpose

This document defines how sf builds, runs, and evolves repository-specific harnesses while preserving the split between tracked repo contracts, .sf/sf.db operational state, and Singularity Memory.

Goals

  • Generate harnesses that match the repo's actual stack, risks, and production contract.
  • Understand untracked files without silently owning them.
  • Use deterministic evidence before model judgment.
  • Retain proven lessons and anti-patterns into Singularity Memory.
  • Evolve harnesses when the repo changes.
  • Keep every generated file reviewable by the repo owner.

Non-goals

  • Replacing the repo's existing test runner or CI provider.
  • Treating LLM judge scores as sufficient for critical engineering correctness.
  • Storing memories or embeddings inside .sf/sf.db.
  • Auto-staging untracked files that sf merely observed.

System Flow

repo files + git + docs + CI + package manifests + prior runs
        |
        v
Repo Profiler
        |
        v
Risk Classifier
        |
        v
Harness Planner <---- Singularity Memory recall
        |
        v
Template Kit Registry
        |
        v
Harness Writer ---- tracked files: SPEC, ARCHITECTURE, harness/, gates/, CI snippets
        |
        v
Evidence Runner ---- .sf/sf.db: runs, cases, results, observations, drift
        |
        v
Memory Retainer ---- Singularity Memory: patterns, anti-patterns, repo risk notes
        |
        v
Evolution Engine ---- schedules harness update proposals

Auto-flow Integration

Repo-native harnessing should enter the sf flow in stages. The early stages are read-only or evidence-only; prompt behavior changes come after fixtures exist.

| Flow point | Add now or later | Behavior |
| --- | --- | --- |
| Session start / sf init | First implementation slice | Create a read-only RepoProfile snapshot from source, docs, CI, manifests, git status, and prior run history. |
| Plan phase | Later, after profiler tests | Surface missing harness coverage as a planning input, not as an automatic file write. |
| Execute phase | Later | Allow a task to adopt a proposed harness file only when the task plan claims it. |
| Verify phase | First implementation slice after manifest | Run harness commands and eval suites declared in harness/manifest.json. |
| PostUnit hook | First implementation slice | Store evidence summaries in .sf/sf.db; retain durable learnings and anti-patterns into Singularity Memory. |
| Reassess phase | Later | Use failed gates and repeated drift to propose harness updates. |
| Workflow prompt injection | Last | Inject top harness lessons and anti-patterns into prompts only after fixture coverage proves it improves outcomes. |

The immediate flow contract is:

  1. Observe repo shape.
  2. Record untracked files as observations only.
  3. Compare observed risk against existing harness coverage.
  4. Propose harness changes as reviewable artifacts.
  5. Run accepted harnesses.
  6. Retain evidence-backed lessons.

Do not jump directly to automatic prompt injection. That is where stale or noisy memory can degrade agent behavior before the evidence path is reliable.
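
As a sketch, the staging can be encoded as an explicit gate rather than a convention. The stage names and the coverage check below are hypothetical, not current sf identifiers:

// Hypothetical encoding of the staged contract. Prompt injection is the
// only stage gated on proven fixture coverage, matching the table above.
type FlowStage =
  | "observe"          // read-only RepoProfile snapshot
  | "record"           // untracked files as observations only
  | "compare"          // observed risk vs. harness coverage
  | "propose"          // reviewable artifacts, no silent writes
  | "run"              // accepted harnesses only
  | "retain"           // evidence-backed lessons
  | "inject_prompts";  // last, and only with fixture evidence

function stageEnabled(stage: FlowStage, fixtureCoverageProven: boolean): boolean {
  return stage !== "inject_prompts" || fixtureCoverageProven;
}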

Data Ownership

| Data | Stored in | Why |
| --- | --- | --- |
| Human contract | Tracked repo files | Reviewable, diffable, travels with code. |
| Executable gates and eval cases | Tracked repo files | CI can run them without sf internals. |
| Run history | .sf/sf.db | Local operational evidence and fast queries. |
| Repo profile snapshots | .sf/sf.db | Derived state, can be recomputed. |
| Untracked-file observations | .sf/sf.db | Important context, but not owned by sf. |
| Learnings and anti-patterns | Singularity Memory | Durable knowledge across sessions and tools. |
| Large reports | .sf/reports/ or harness report dirs | Avoid bloating SQLite and prompts. |

Repo Profiler

The profiler reads the repository and emits a RepoProfile snapshot:

interface RepoProfile {
  profileId: string;
  projectHash: string;
  git: {
    head: string | null;
    branch: string | null;
    remoteHash: string | null;
    dirty: boolean;
    changedFiles: RepoFileObservation[];
  };
  stacks: StackSignal[];
  entrypoints: EntrypointSignal[];
  tests: TestSignal[];
  ci: CiSignal[];
  docs: DocumentSignal[];
  dataStores: DataStoreSignal[];
  networkSurfaces: NetworkSurfaceSignal[];
  riskHints: RiskHint[];
  createdAt: number;
}

Inputs include:

  • git status --short, git ls-files, current branch, remote, recent history.
  • Package manifests such as package.json, go.mod, Cargo.toml, pyproject.toml, flake.nix, Dockerfiles, Compose files, devcontainers.
  • Test directories, scripts, CI workflows, lint configs, migrations, fixtures, browser tests, smoke tests.
  • Documentation such as SPEC.md, ARCHITECTURE.md, AGENTS.md, ADRs, runbooks, deployment docs.
  • Source structure, entry points, route maps, command definitions, service definitions, generated files.
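
A minimal sketch of the git slice of these inputs, assuming Node built-ins and the RepoFileObservation fields defined below; the parsing here is illustrative, not the sf implementation:

import { execFileSync } from "node:child_process";

// Illustrative sketch using only Node built-ins. Rename entries
// ("R  old -> new") keep the new path; real parsing must also handle
// quoted paths and submodules.
function observeWorkingTree(repoDir: string) {
  const porcelain = execFileSync("git", ["status", "--porcelain"], {
    cwd: repoDir,
    encoding: "utf8",
  });
  const now = Date.now();
  return porcelain
    .split("\n")
    .filter(Boolean)
    .map((line) => {
      const code = line.slice(0, 2);
      const path = line.slice(3).split(" -> ").pop()!;
      const gitStatus =
        code === "??" ? "untracked"
        : code.includes("D") ? "deleted"
        : code.includes("R") ? "renamed"
        : "modified";
      return {
        path,
        gitStatus,
        // Untracked defaults to observed_only per the policy below.
        ownership: gitStatus === "untracked" ? "observed_only" : "user_owned",
        firstSeenAt: now,
        lastSeenAt: now,
      };
    });
}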

Untracked File Policy

Untracked files are part of repo reality. sf must see them.

interface RepoFileObservation {
  path: string;
  gitStatus: "tracked" | "modified" | "deleted" | "renamed" | "untracked" | "ignored";
  ownership: "sf_generated" | "user_owned" | "observed_only" | "candidate_harness";
  language: string | null;
  sizeBytes: number;
  contentHash: string | null;
  summary: string | null;
  firstSeenAt: number;
  lastSeenAt: number;
  adoptedAt: number | null;
  adoptionUnitId: string | null;
}

Rules:

  • untracked defaults to observed_only.
  • observed_only files can influence context, risk classification, and memory.
  • observed_only files cannot be staged, deleted, reformatted, moved, or overwritten by automatic flows.
  • A file becomes sf_generated or candidate_harness only when a unit plan declares that ownership and the diff is reviewable.
  • Repeated observations can produce a harness recommendation, not an automatic commit.

This lets sf understand documents, scratch specs, generated reports, and local experiments without turning them into accidental repository history.
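
A sketch of the write guard these rules imply, using the RepoFileObservation interface above; the function name is illustrative:

// Automatic flows may only write files that a unit plan has claimed.
// observed_only is a hard stop: influence context, never mutate.
function canAutoWrite(obs: RepoFileObservation): boolean {
  if (obs.ownership === "observed_only") return false;
  if (obs.ownership === "user_owned") return false;
  // sf_generated and candidate_harness require an adopting unit on record.
  return obs.adoptedAt !== null && obs.adoptionUnitId !== null;
}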

Risk Classifier

The classifier maps RepoProfile to required harness families.

| Risk family | Signals | Required harness examples |
| --- | --- | --- |
| Web | Next.js, Playwright, routes, CSS, browser tools | Playwright smoke, a11y, visual diffs, performance budget, browser trace replay |
| Agent | tool registry, prompts, MCP, provider SDKs | fixture replay, trajectory assertions, tool permission tests, injection red-team |
| RAG / retrieval | vector DB, embeddings, search, chunking | recall@k, MRR, NDCG, near-miss sets, faithfulness, context recall |
| Infrastructure | Nix, Docker, CI, deploy scripts | build matrix, secret scan, config validation, rollback checks |
| Database | migrations, SQL, ORM | migration up/down, data contract tests, destructive-change guard |
| Windows service | _windows.go, service managers, PowerShell | GOOS windows build, service install smoke, PowerShell contract tests |
| Security | auth, sessions, tokens, secrets | auth bypass tests, CSRF, rate limit, sensitive log scan |
| Performance | native bindings, compile-heavy code, hot loops | benchmark suite, regression threshold, flamegraph capture |
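
A sketch of the mapping over the RepoProfile above, assuming each signal carries a string id; the predicates shown are illustrative, and the real classifier weighs far more signals per the table:

type RiskFamily =
  | "web" | "agent" | "rag" | "infrastructure"
  | "database" | "windows_service" | "security" | "performance";

// Families are additive: one repo can trigger several rows of the table.
// Only a few example rules are shown here.
const RULES: Array<[RiskFamily, (p: RepoProfile) => boolean]> = [
  ["web", (p) => p.stacks.some((s) => ["nextjs", "playwright"].includes(s.id))],
  ["rag", (p) => p.dataStores.some((d) => d.id === "vector")],
  ["database", (p) => p.dataStores.length > 0],
  ["security", (p) => p.riskHints.some((h) => h.id === "auth")],
];

function classify(profile: RepoProfile): RiskFamily[] {
  return RULES.filter(([, match]) => match(profile)).map(([family]) => family);
}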

Harness Planner

The planner compares the required harness families against the repo's current harness inventory.

Outputs:

  • missing: risks with no harness coverage.
  • weak: harness exists but lacks thresholds, fixtures, CI wiring, or reports.
  • stale: harness references files/scripts that no longer exist.
  • overbroad: harness is too slow or too generic for the risk.
  • proposed: exact files and commands to add or modify.

Every proposal must include:

  • Purpose.
  • Consumer.
  • Risk protected.
  • Files written.
  • Commands run.
  • Blocking criteria.
  • Rollback path.
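
The required fields map naturally onto a proposal record; a hypothetical shape (field names are assumptions, not a tracked schema):

interface HarnessProposal {
  finding: "missing" | "weak" | "stale" | "overbroad";
  purpose: string;             // why the harness exists
  consumer: string;            // who runs or reads it: CI, sf verify, humans
  riskProtected: string;       // risk family from the classifier
  filesWritten: string[];      // exact reviewable artifacts
  commandsRun: string[];       // exact commands, no implicit side effects
  blockingCriteria: string;    // what turns a run red
  rollbackPath: string;        // how to remove the harness cleanly
}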

Template Kit Registry

Template kits are starting points, not permanent truth.

interface HarnessTemplateKit {
  id: string;
  title: string;
  appliesWhen: RiskHint[];
  writes: TemplateOutput[];
  commands: HarnessCommand[];
  requiredEvidence: EvidenceRequirement[];
  evolutionRules: EvolutionRule[];
}

Core kits:

| Kit | Files |
| --- | --- |
| go-service | harness/manifest.json, gates/go-test.sh, gates/go-vet.sh, optional gates/windows-build.sh |
| typescript-cli | gates/npm-build.sh, gates/typecheck.sh, fixture replay config |
| agent-runtime | harness/evals/agent/*.jsonl, trajectory assertions, injection red-team cases |
| rag-system | retrieval datasets, recall metrics, near-miss cases, judge rubrics |
| web-app | Playwright smoke, visual baseline policy, a11y checks |
| database | migration tests, destructive SQL guard, seed data fixtures |
| nix-project | nix flake check, dev shell smoke, direnv policy checks |
| charm-service | Go build/test, Wish SSH smoke, VCR session recording checks |
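
As an illustration, a go-service kit instance might look like this; the nested shapes of TemplateOutput, HarnessCommand, EvidenceRequirement, and EvolutionRule are assumed here, since only HarnessTemplateKit is defined above:

// Assumed minimal shapes, for illustration only.
const goServiceKit = {
  id: "go-service",
  title: "Go service baseline",
  appliesWhen: [{ signal: "go.mod", weight: 1.0 }],
  writes: [
    { path: "harness/manifest.json", template: "go-service/manifest" },
    { path: "gates/go-test.sh", template: "go-service/go-test" },
    { path: "gates/go-vet.sh", template: "go-service/go-vet" },
  ],
  commands: [
    { id: "go-test", command: "bash gates/go-test.sh", blocks: true },
    { id: "go-vet", command: "bash gates/go-vet.sh", blocks: true },
  ],
  requiredEvidence: [{ kind: "exit_code", commandId: "go-test" }],
  evolutionRules: [
    { when: "new _windows.go files appear", propose: "gates/windows-build.sh" },
  ],
};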

Harness Manifest

Each repo can carry a manifest:

{
  "schema": "sf.harness.v1",
  "owner": "sf",
  "generatedBy": "sf",
  "repoProfileId": "01J...",
  "riskFamilies": ["agent", "rag", "web"],
  "commands": [
    {
      "id": "fixture-replay",
      "command": "npm run test:fixtures",
      "phase": "post_slice",
      "blocks": true,
      "timeoutSeconds": 300
    }
  ],
  "evalSuites": [
    {
      "id": "agent-tool-safety",
      "path": "harness/evals/agent-tool-safety.jsonl",
      "runner": "sf-eval",
      "threshold": 0.95
    }
  ]
}

The manifest is a tracked contract. .sf/sf.db stores run history for the manifest, not the manifest itself.
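
A sketch of the verify-phase consumer, assuming Node built-ins and the sf.harness.v1 shape above, with error handling trimmed:

import { readFileSync } from "node:fs";
import { execSync } from "node:child_process";

type HarnessCmd = {
  id: string;
  command: string;
  blocks: boolean;
  timeoutSeconds: number;
};

// Run every command declared in the tracked manifest and return
// structured results for the .sf/sf.db run ledger.
function runHarness(manifestPath: string) {
  const manifest = JSON.parse(readFileSync(manifestPath, "utf8"));
  if (manifest.schema !== "sf.harness.v1") {
    throw new Error(`unknown harness schema: ${manifest.schema}`);
  }
  return (manifest.commands as HarnessCmd[]).map((cmd) => {
    const startedAt = Date.now();
    try {
      execSync(cmd.command, {
        timeout: cmd.timeoutSeconds * 1000,
        stdio: "pipe",
      });
      return { id: cmd.id, ok: true, blocks: cmd.blocks, ms: Date.now() - startedAt };
    } catch {
      return { id: cmd.id, ok: false, blocks: cmd.blocks, ms: Date.now() - startedAt };
    }
  });
}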

Judge Rig

The judge rig follows one rule: deterministic evidence first, model judgment second.

Implementation boundary

This is documented now; it is not part of the current repo-profiler slice.

Placement:

  • sf core stays in TypeScript for repo profiling, harness proposal planning, project preferences/config, and .sf/sf.db run ledgers.
  • Deterministic and structural assertions can run locally from sf because they already map to commands, AST checks, schemas, and git/diff checks.
  • Model-judge execution and calibration should be a future Go/Charm service, not another TS subsystem. Use fantasy/catwalk for model/provider routing, Go HTTP/MCP APIs for sf integration, and promwish-style metrics when it is daemonized.
  • Durable calibration lessons belong in Singularity Memory. Local .sf/sf.db stores run IDs, rubric hashes, model IDs, scores, raw output references, and pass/fail summaries.
  • Repo-local custom skills remain out of scope. Repo-specific eval suites or harness files are later opt-in proposals only.

Case format

{
  "id": "rag-role-reversal-001",
  "kind": "retrieval",
  "input": {
    "query": "Which service owns failover routing?",
    "expected_documents": ["docs/architecture.md#gateway"]
  },
  "assert": [
    { "type": "recall_at_k", "k": 5, "threshold": 1.0 },
    { "type": "context_recall", "threshold": 0.85 },
    { "type": "llm_rubric", "rubric": "Answer must identify the gateway and not the portal as the routing owner.", "advisory": true }
  ],
  "tags": ["rag", "role-reversal", "near-miss"]
}

Assertion types

| Type | Blocking default | Notes |
| --- | --- | --- |
| exit_code | yes | Command pass/fail. |
| contains / not_contains | yes | Deterministic text contracts. |
| json_schema | yes | Structured output contract. |
| ast_match | yes | Code shape and API use. |
| recall_at_k | yes when calibrated | Retrieval coverage. |
| mrr / ndcg | yes when calibrated | Ranking quality. |
| tool_call_f1 | yes when calibrated | Agent tool precision/recall. |
| trajectory_goal_success | no by default | Useful judge signal, requires trace data. |
| llm_rubric | no by default | Advisory until calibrated. |
| factuality | no by default | Needs references and judge calibration. |
| select_best | no | Useful for model/prompt comparison. |
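
A sketch of the ordering rule this table implies: deterministic assertions run first and can short-circuit paid judge calls, and advisory results never flip a verdict. The runner signature is hypothetical:

const DETERMINISTIC = new Set([
  "exit_code", "contains", "not_contains", "json_schema", "ast_match",
]);

interface Assertion { type: string; advisory?: boolean }
interface AssertionResult { assertion: Assertion; pass: boolean }

async function evaluateCase(
  asserts: Assertion[],
  runOne: (a: Assertion) => Promise<AssertionResult>,
) {
  const det = asserts.filter((a) => DETERMINISTIC.has(a.type));
  const judged = asserts.filter((a) => !DETERMINISTIC.has(a.type));
  const results: AssertionResult[] = [];
  for (const a of det) results.push(await runOne(a));
  // Deterministic evidence first: skip model judgment when it already blocks.
  const blocked = results.some((r) => !r.pass && !r.assertion.advisory);
  if (!blocked) for (const a of judged) results.push(await runOne(a));
  const fail = results.some((r) => !r.pass && !r.assertion.advisory);
  return { verdict: fail ? "fail" : "pass", results };
}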

Judge calibration

Before a model judge can block:

  • The rubric file must be tracked.
  • The judge model and provider must be pinned.
  • A calibration suite with known pass/fail examples must pass.
  • A disagreement policy must exist for high-risk suites.
  • The runner must store the judge prompt hash, model ID, score, reason, and raw output reference.

For high-risk agent or RAG gates, use either deterministic metrics or a judge quorum. A single uncalibrated model opinion is never enough.

Calibration lifecycle:

  1. Build a golden set with known pass, fail, and ambiguous examples from real bugs, traces, PR reviews, bad retrievals, prompt-injection attempts, and good outputs.
  2. Split it into calibration and held-out suites. Tune rubrics only against the calibration suite.
  3. Pin the judge provider, model ID, temperature, output schema, rubric file, and rubric hash.
  4. Measure false-pass rate, false-block rate, precision/recall/F1 for the failure class, schema validity, quorum disagreement, and rerun stability (see the sketch after this list).
  5. Keep the judge advisory until the held-out suite meets the threshold for the risk family.
  6. Promote to blocking only with either a deterministic/structural companion gate or a calibrated judge quorum.
  7. Recalibrate when the rubric, judge model, provider, prompt wrapper, eval case schema, or target workflow changes.
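
The step 4 metrics reduce to a small report over the golden labels; a sketch, where the positive class is "known failure" and the field names are illustrative:

interface CalibrationCase {
  label: "pass" | "fail";   // golden-set ground truth
  judged: "pass" | "fail";  // pinned judge's verdict
}

function calibrationReport(cases: CalibrationCase[]) {
  const count = (f: (c: CalibrationCase) => boolean) => cases.filter(f).length;
  const caught = count((c) => c.label === "fail" && c.judged === "fail");
  const falsePasses = count((c) => c.label === "fail" && c.judged === "pass");
  const falseBlocks = count((c) => c.label === "pass" && c.judged === "fail");
  const precision = caught / Math.max(caught + falseBlocks, 1);
  const recall = caught / Math.max(caught + falsePasses, 1);
  return {
    // Critical gates demand a zero false-pass rate on held-out failures.
    falsePassRate: falsePasses / Math.max(count((c) => c.label === "fail"), 1),
    // High false-block rates teach developers to route around the gate.
    falseBlockRate: falseBlocks / Math.max(count((c) => c.label === "pass"), 1),
    precision,
    recall,
    f1: (2 * precision * recall) / Math.max(precision + recall, Number.EPSILON),
  };
}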

Default promotion bar:

  • Critical/security gates: zero false passes on held-out critical failures, plus deterministic or structural companion evidence.
  • Product-quality gates: false-block rate low enough that developers do not route around the gate; judge remains advisory if noisy.
  • RAG/agent metrics: calibrated thresholds for recall@k, MRR/NDCG, context-recall, tool-call F1, or trajectory success; model rubrics explain failures but do not replace the metric.

Singularity Memory Integration

Pre-dispatch:

  • Recall repo-specific harness lessons from project/{hash}.
  • Recall global engineering anti-patterns from global/coding.
  • Inject only the top relevant items into context.
  • Keep untracked observations summarized, not pasted wholesale.

Post-unit:

  • Retain successful harness changes only after gates pass.
  • Retain failures as anti-patterns with source unit and evidence IDs.
  • Retain judge calibration results separately from normal coding memories.
  • Link memory entries to .sf/sf.db run IDs and report paths.

Over time:

  • Repeated failing eval cases become anti-patterns.
  • Repeated successful fixes mature from candidate to established to proven.
  • Stale memories decay unless revalidated by passing evidence.
  • Drift events propose new harness tasks when repo reality changes.
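
A hypothetical shape for a retained entry with the evidence linkage described above; the Singularity Memory schema itself is outside this document:

interface RetainedLesson {
  scope: "project" | "global";            // project/{hash} vs global/coding
  kind: "pattern" | "anti_pattern" | "judge_calibration";
  maturity: "candidate" | "established" | "proven";
  claim: string;                          // the lesson, summarized for prompts
  sourceUnitId: string;
  evidenceRunIds: string[];               // links into the .sf/sf.db run ledger
  reportPaths: string[];                  // large artifacts stay on disk
  lastValidatedAt: number;                // decays unless revalidated by passing evidence
}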

Web As TUI

For web repos, treat the browser as an evented terminal:

  • The DOM/accessibility tree is the screen buffer.
  • User actions are keypress/click/form events.
  • Playwright traces are VCR recordings.
  • Visual diffs are frame comparisons.
  • Browser console and network logs are stderr/stdout.

The web harness should include action replay, semantic assertions, accessibility checks, screenshot diffs, and performance budgets. It should not rely on screenshots alone.
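
A sketch of one such case, assuming @playwright/test and @axe-core/playwright plus a configured baseURL; selectors, file names, and thresholds are illustrative:

import { test, expect } from "@playwright/test";
import AxeBuilder from "@axe-core/playwright";

test("home page smoke", async ({ page }) => {
  // Browser console is the harness stderr: collect errors as evidence.
  const consoleErrors: string[] = [];
  page.on("console", (msg) => {
    if (msg.type() === "error") consoleErrors.push(msg.text());
  });

  await page.goto("/");

  // Semantic assertion against the accessibility tree, not pixels.
  await expect(page.getByRole("navigation")).toBeVisible();

  // Accessibility scan; violations block like any other gate.
  const a11y = await new AxeBuilder({ page }).analyze();
  expect(a11y.violations).toEqual([]);

  // Frame comparison against the checked-in visual baseline.
  await expect(page).toHaveScreenshot("home.png", { maxDiffPixelRatio: 0.01 });

  expect(consoleErrors).toEqual([]);
});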

Acceptance Criteria

  • sf can profile a repo and produce a stable RepoProfile snapshot.
  • sf records untracked files as observed_only and never stages them by default.
  • sf can generate a reviewable harness manifest and at least one executable gate from a template kit.
  • sf can run a mixed deterministic/model-judge eval suite and store structured results.
  • sf retains successful patterns and failed anti-patterns into Singularity Memory with evidence links.
  • sf can detect harness drift and propose a follow-up unit instead of silently mutating files.