Repo-native Harness Architecture

Purpose

This document defines how sf builds, runs, and evolves repository-specific harnesses while preserving the split between tracked repo contracts, .sf/sf.db operational state, and Singularity Memory.

Goals

  • Generate harnesses that match the repo's actual stack, risks, and production contract.
  • Understand untracked files without silently owning them.
  • Use deterministic evidence before model judgment.
  • Retain proven lessons and anti-patterns into Singularity Memory.
  • Evolve harnesses when the repo changes.
  • Keep every generated file reviewable by the repo owner.

Non-goals

  • Replacing the repo's existing test runner or CI provider.
  • Treating LLM judge scores as sufficient for critical engineering correctness.
  • Storing memories or embeddings inside .sf/sf.db.
  • Auto-staging untracked files that sf merely observed.

System Flow

repo files + git + docs + CI + package manifests + prior runs
        |
        v
Repo Profiler
        |
        v
Risk Classifier
        |
        v
Harness Planner <---- Singularity Memory recall
        |
        v
Template Kit Registry
        |
        v
Harness Writer ---- tracked files: SPEC, ARCHITECTURE, harness/, gates/, CI snippets
        |
        v
Evidence Runner ---- .sf/sf.db: runs, cases, results, observations, drift
        |
        v
Memory Retainer ---- Singularity Memory: patterns, anti-patterns, repo risk notes
        |
        v
Evolution Engine ---- schedules harness update proposals

Auto-flow Integration

Repo-native harnessing should enter the sf flow in stages. The early stages are read-only or evidence-only; prompt behavior changes come after fixtures exist.

| Flow point | Add now or later | Behavior |
| --- | --- | --- |
| Session start / sf init | First implementation slice | Create a read-only RepoProfile snapshot from source, docs, CI, manifests, git status, and prior run history. |
| Plan phase | Later, after profiler tests | Surface missing harness coverage as a planning input, not as an automatic file write. |
| Execute phase | Later | Allow a task to adopt a proposed harness file only when the task plan claims it. |
| Verify phase | First implementation slice after manifest | Run harness commands and eval suites declared in harness/manifest.json. |
| PostUnit hook | First implementation slice | Store evidence summaries in .sf/sf.db; retain durable learnings and anti-patterns into Singularity Memory. |
| Reassess phase | Later | Use failed gates and repeated drift to propose harness updates. |
| Workflow prompt injection | Last | Inject top harness lessons and anti-patterns into prompts only after fixture coverage proves it improves outcomes. |

The immediate flow contract is:

  1. Observe repo shape.
  2. Record untracked files as observations only.
  3. Compare observed risk against existing harness coverage.
  4. Propose harness changes as reviewable artifacts.
  5. Run accepted harnesses.
  6. Retain evidence-backed lessons.

Do not jump directly to automatic prompt injection. That is where stale or noisy memory can degrade agent behavior before the evidence path is reliable.
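
As a sketch, the staging can be encoded as an explicit gate rather than a convention. The stage names and the coverage check below are hypothetical, not current sf identifiers:

// Hypothetical encoding of the staged contract. Prompt injection is the
// only stage gated on proven fixture coverage, matching the table above.
type FlowStage =
  | "observe"          // read-only RepoProfile snapshot
  | "record"           // untracked files as observations only
  | "compare"          // observed risk vs. harness coverage
  | "propose"          // reviewable artifacts, no silent writes
  | "run"              // accepted harnesses only
  | "retain"           // evidence-backed lessons
  | "inject_prompts";  // last, and only with fixture evidence

function stageEnabled(stage: FlowStage, fixtureCoverageProven: boolean): boolean {
  return stage !== "inject_prompts" || fixtureCoverageProven;
}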

Data Ownership

| Data | Stored in | Why |
| --- | --- | --- |
| Human contract | Tracked repo files | Reviewable, diffable, travels with code. |
| Executable gates and eval cases | Tracked repo files | CI can run them without sf internals. |
| Run history | .sf/sf.db | Local operational evidence and fast queries. |
| Repo profile snapshots | .sf/sf.db | Derived state, can be recomputed. |
| Untracked-file observations | .sf/sf.db | Important context, but not owned by sf. |
| Learnings and anti-patterns | Singularity Memory | Durable knowledge across sessions and tools. |
| Large reports | .sf/reports/ or harness report dirs | Avoid bloating SQLite and prompts. |

Repo Profiler

The profiler reads the repository and emits a RepoProfile snapshot:

interface RepoProfile {
  profileId: string;
  projectHash: string;
  git: {
    head: string | null;
    branch: string | null;
    remoteHash: string | null;
    dirty: boolean;
    changedFiles: RepoFileObservation[];
  };
  stacks: StackSignal[];
  entrypoints: EntrypointSignal[];
  tests: TestSignal[];
  ci: CiSignal[];
  docs: DocumentSignal[];
  dataStores: DataStoreSignal[];
  networkSurfaces: NetworkSurfaceSignal[];
  riskHints: RiskHint[];
  createdAt: number;
}

Inputs include:

  • git status --short, git ls-files, current branch, remote, recent history.
  • Package manifests such as package.json, go.mod, Cargo.toml, pyproject.toml, flake.nix, Dockerfiles, Compose files, devcontainers.
  • Test directories, scripts, CI workflows, lint configs, migrations, fixtures, browser tests, smoke tests.
  • Documentation such as SPEC.md, ARCHITECTURE.md, AGENTS.md, ADRs, runbooks, deployment docs.
  • Source structure, entry points, route maps, command definitions, service definitions, generated files.
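
A minimal sketch of the git slice of these inputs, assuming Node built-ins and the RepoFileObservation fields defined below; the parsing here is illustrative, not the sf implementation:

import { execFileSync } from "node:child_process";

// Illustrative sketch using only Node built-ins. Rename entries
// ("R  old -> new") keep the new path; real parsing must also handle
// quoted paths and submodules.
function observeWorkingTree(repoDir: string) {
  const porcelain = execFileSync("git", ["status", "--porcelain"], {
    cwd: repoDir,
    encoding: "utf8",
  });
  const now = Date.now();
  return porcelain
    .split("\n")
    .filter(Boolean)
    .map((line) => {
      const code = line.slice(0, 2);
      const path = line.slice(3).split(" -> ").pop()!;
      const gitStatus =
        code === "??" ? "untracked"
        : code.includes("D") ? "deleted"
        : code.includes("R") ? "renamed"
        : "modified";
      return {
        path,
        gitStatus,
        // Untracked defaults to observed_only per the policy below.
        ownership: gitStatus === "untracked" ? "observed_only" : "user_owned",
        firstSeenAt: now,
        lastSeenAt: now,
      };
    });
}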

Untracked File Policy

Untracked files are part of repo reality. sf must see them.

interface RepoFileObservation {
  path: string;
  gitStatus: "tracked" | "modified" | "deleted" | "renamed" | "untracked" | "ignored";
  ownership: "sf_generated" | "user_owned" | "observed_only" | "candidate_harness";
  language: string | null;
  sizeBytes: number;
  contentHash: string | null;
  summary: string | null;
  firstSeenAt: number;
  lastSeenAt: number;
  adoptedAt: number | null;
  adoptionUnitId: string | null;
}

Rules:

  • untracked defaults to observed_only.
  • observed_only files can influence context, risk classification, and memory.
  • observed_only files cannot be staged, deleted, reformatted, moved, or overwritten by automatic flows.
  • A file becomes sf_generated or candidate_harness only when a unit plan declares that ownership and the diff is reviewable.
  • Repeated observations can produce a harness recommendation, not an automatic commit.

This lets sf understand documents, scratch specs, generated reports, and local experiments without turning them into accidental repository history.
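
A sketch of the write guard these rules imply, using the RepoFileObservation interface above; the function name is illustrative:

// Automatic flows may only write files that a unit plan has claimed.
// observed_only is a hard stop: influence context, never mutate.
function canAutoWrite(obs: RepoFileObservation): boolean {
  if (obs.ownership === "observed_only") return false;
  if (obs.ownership === "user_owned") return false;
  // sf_generated and candidate_harness require an adopting unit on record.
  return obs.adoptedAt !== null && obs.adoptionUnitId !== null;
}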

Risk Classifier

The classifier maps RepoProfile to required harness families.

| Risk family | Signals | Required harness examples |
| --- | --- | --- |
| Web | Next.js, Playwright, routes, CSS, browser tools | Playwright smoke, a11y, visual diffs, performance budget, browser trace replay |
| Agent | tool registry, prompts, MCP, provider SDKs | fixture replay, trajectory assertions, tool permission tests, injection red-team |
| RAG / retrieval | vector DB, embeddings, search, chunking | recall@k, MRR, NDCG, near-miss sets, faithfulness, context recall |
| Infrastructure | Nix, Docker, CI, deploy scripts | build matrix, secret scan, config validation, rollback checks |
| Database | migrations, SQL, ORM | migration up/down, data contract tests, destructive-change guard |
| Windows service | _windows.go, service managers, PowerShell | GOOS windows build, service install smoke, PowerShell contract tests |
| Security | auth, sessions, tokens, secrets | auth bypass tests, CSRF, rate limit, sensitive log scan |
| Performance | native bindings, compile-heavy code, hot loops | benchmark suite, regression threshold, flamegraph capture |
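
A sketch of the mapping over the RepoProfile above, assuming each signal carries a string id; the predicates shown are illustrative, and the real classifier weighs far more signals per the table:

type RiskFamily =
  | "web" | "agent" | "rag" | "infrastructure"
  | "database" | "windows_service" | "security" | "performance";

// Families are additive: one repo can trigger several rows of the table.
// Only a few example rules are shown here.
const RULES: Array<[RiskFamily, (p: RepoProfile) => boolean]> = [
  ["web", (p) => p.stacks.some((s) => ["nextjs", "playwright"].includes(s.id))],
  ["rag", (p) => p.dataStores.some((d) => d.id === "vector")],
  ["database", (p) => p.dataStores.length > 0],
  ["security", (p) => p.riskHints.some((h) => h.id === "auth")],
];

function classify(profile: RepoProfile): RiskFamily[] {
  return RULES.filter(([, match]) => match(profile)).map(([family]) => family);
}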

Harness Planner

The planner compares the required harness families against the repo's current harness inventory.

Outputs:

  • missing: risks with no harness coverage.
  • weak: harness exists but lacks thresholds, fixtures, CI wiring, or reports.
  • stale: harness references files/scripts that no longer exist.
  • overbroad: harness is too slow or too generic for the risk.
  • proposed: exact files and commands to add or modify.

Every proposal must include:

  • Purpose.
  • Consumer.
  • Risk protected.
  • Files written.
  • Commands run.
  • Blocking criteria.
  • Rollback path.
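
The required fields map naturally onto a proposal record; a hypothetical shape (field names are assumptions, not a tracked schema):

interface HarnessProposal {
  finding: "missing" | "weak" | "stale" | "overbroad";
  purpose: string;             // why the harness exists
  consumer: string;            // who runs or reads it: CI, sf verify, humans
  riskProtected: string;       // risk family from the classifier
  filesWritten: string[];      // exact reviewable artifacts
  commandsRun: string[];       // exact commands, no implicit side effects
  blockingCriteria: string;    // what turns a run red
  rollbackPath: string;        // how to remove the harness cleanly
}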

Template Kit Registry

Template kits are starting points, not permanent truth.

interface HarnessTemplateKit {
  id: string;
  title: string;
  appliesWhen: RiskHint[];
  writes: TemplateOutput[];
  commands: HarnessCommand[];
  requiredEvidence: EvidenceRequirement[];
  evolutionRules: EvolutionRule[];
}

Core kits:

| Kit | Files |
| --- | --- |
| go-service | harness/manifest.json, gates/go-test.sh, gates/go-vet.sh, optional gates/windows-build.sh |
| typescript-cli | gates/npm-build.sh, gates/typecheck.sh, fixture replay config |
| agent-runtime | harness/evals/agent/*.jsonl, trajectory assertions, injection red-team cases |
| rag-system | retrieval datasets, recall metrics, near-miss cases, judge rubrics |
| web-app | Playwright smoke, visual baseline policy, a11y checks |
| database | migration tests, destructive SQL guard, seed data fixtures |
| nix-project | nix flake check, dev shell smoke, direnv policy checks |
| charm-service | Go build/test, Wish SSH smoke, VCR session recording checks |
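
As an illustration, a go-service kit instance might look like this; the nested shapes of TemplateOutput, HarnessCommand, EvidenceRequirement, and EvolutionRule are assumed here, since only HarnessTemplateKit is defined above:

// Assumed minimal shapes, for illustration only.
const goServiceKit = {
  id: "go-service",
  title: "Go service baseline",
  appliesWhen: [{ signal: "go.mod", weight: 1.0 }],
  writes: [
    { path: "harness/manifest.json", template: "go-service/manifest" },
    { path: "gates/go-test.sh", template: "go-service/go-test" },
    { path: "gates/go-vet.sh", template: "go-service/go-vet" },
  ],
  commands: [
    { id: "go-test", command: "bash gates/go-test.sh", blocks: true },
    { id: "go-vet", command: "bash gates/go-vet.sh", blocks: true },
  ],
  requiredEvidence: [{ kind: "exit_code", commandId: "go-test" }],
  evolutionRules: [
    { when: "new _windows.go files appear", propose: "gates/windows-build.sh" },
  ],
};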

Harness Manifest

Each repo can carry a manifest:

{
  "schema": "sf.harness.v1",
  "owner": "sf",
  "generatedBy": "sf",
  "repoProfileId": "01J...",
  "riskFamilies": ["agent", "rag", "web"],
  "commands": [
    {
      "id": "fixture-replay",
      "command": "npm run test:fixtures",
      "phase": "post_slice",
      "blocks": true,
      "timeoutSeconds": 300
    }
  ],
  "evalSuites": [
    {
      "id": "agent-tool-safety",
      "path": "harness/evals/agent-tool-safety.jsonl",
      "runner": "sf-eval",
      "threshold": 0.95
    }
  ]
}

The manifest is a tracked contract. .sf/sf.db stores run history for the manifest, not the manifest itself.
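
A sketch of the verify-phase consumer, assuming Node built-ins and the sf.harness.v1 shape above, with error handling trimmed:

import { readFileSync } from "node:fs";
import { execSync } from "node:child_process";

type HarnessCmd = {
  id: string;
  command: string;
  blocks: boolean;
  timeoutSeconds: number;
};

// Run every command declared in the tracked manifest and return
// structured results for the .sf/sf.db run ledger.
function runHarness(manifestPath: string) {
  const manifest = JSON.parse(readFileSync(manifestPath, "utf8"));
  if (manifest.schema !== "sf.harness.v1") {
    throw new Error(`unknown harness schema: ${manifest.schema}`);
  }
  return (manifest.commands as HarnessCmd[]).map((cmd) => {
    const startedAt = Date.now();
    try {
      execSync(cmd.command, {
        timeout: cmd.timeoutSeconds * 1000,
        stdio: "pipe",
      });
      return { id: cmd.id, ok: true, blocks: cmd.blocks, ms: Date.now() - startedAt };
    } catch {
      return { id: cmd.id, ok: false, blocks: cmd.blocks, ms: Date.now() - startedAt };
    }
  });
}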

Judge Rig

The judge rig follows one rule: deterministic evidence first, model judgment second.

Implementation boundary

This is documented now; it is not part of the current repo-profiler slice.

Placement:

  • sf core stays in TypeScript for repo profiling, harness proposal planning, project preferences/config, and .sf/sf.db run ledgers.
  • Deterministic and structural assertions can run locally from sf because they already map to commands, AST checks, schemas, and git/diff checks.
  • Model-judge execution and calibration should be a future Go/Charm service, not another TS subsystem. Use fantasy/catwalk for model/provider routing, Go HTTP/MCP APIs for sf integration, and promwish-style metrics when it is daemonized.
  • Durable calibration lessons belong in Singularity Memory. Local .sf/sf.db stores run IDs, rubric hashes, model IDs, scores, raw output references, and pass/fail summaries.
  • Repo-local custom skills remain out of scope. Repo-specific eval suites or harness files are later opt-in proposals only.

Case format

{
  "id": "rag-role-reversal-001",
  "kind": "retrieval",
  "input": {
    "query": "Which service owns failover routing?",
    "expected_documents": ["docs/architecture.md#gateway"]
  },
  "assert": [
    { "type": "recall_at_k", "k": 5, "threshold": 1.0 },
    { "type": "context_recall", "threshold": 0.85 },
    { "type": "llm_rubric", "rubric": "Answer must identify the gateway and not the portal as the routing owner.", "advisory": true }
  ],
  "tags": ["rag", "role-reversal", "near-miss"]
}

Assertion types

| Type | Blocking default | Notes |
| --- | --- | --- |
| exit_code | yes | Command pass/fail. |
| contains / not_contains | yes | Deterministic text contracts. |
| json_schema | yes | Structured output contract. |
| ast_match | yes | Code shape and API use. |
| recall_at_k | yes when calibrated | Retrieval coverage. |
| mrr / ndcg | yes when calibrated | Ranking quality. |
| tool_call_f1 | yes when calibrated | Agent tool precision/recall. |
| trajectory_goal_success | no by default | Useful judge signal, requires trace data. |
| llm_rubric | no by default | Advisory until calibrated. |
| factuality | no by default | Needs references and judge calibration. |
| select_best | no | Useful for model/prompt comparison. |
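
A sketch of the ordering rule this table implies: deterministic assertions run first and can short-circuit paid judge calls, and advisory results never flip a verdict. The runner signature is hypothetical:

const DETERMINISTIC = new Set([
  "exit_code", "contains", "not_contains", "json_schema", "ast_match",
]);

interface Assertion { type: string; advisory?: boolean }
interface AssertionResult { assertion: Assertion; pass: boolean }

async function evaluateCase(
  asserts: Assertion[],
  runOne: (a: Assertion) => Promise<AssertionResult>,
) {
  const det = asserts.filter((a) => DETERMINISTIC.has(a.type));
  const judged = asserts.filter((a) => !DETERMINISTIC.has(a.type));
  const results: AssertionResult[] = [];
  for (const a of det) results.push(await runOne(a));
  // Deterministic evidence first: skip model judgment when it already blocks.
  const blocked = results.some((r) => !r.pass && !r.assertion.advisory);
  if (!blocked) for (const a of judged) results.push(await runOne(a));
  const fail = results.some((r) => !r.pass && !r.assertion.advisory);
  return { verdict: fail ? "fail" : "pass", results };
}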

Judge calibration

Before a model judge can block:

  • The rubric file must be tracked.
  • The judge model and provider must be pinned.
  • A calibration suite with known pass/fail examples must pass.
  • A disagreement policy must exist for high-risk suites.
  • The runner must store the judge prompt hash, model ID, score, reason, and raw output reference.

For high-risk agent or RAG gates, use either deterministic metrics or a judge quorum. A single uncalibrated model opinion is never enough.

Calibration lifecycle:

  1. Build a golden set with known pass, fail, and ambiguous examples from real bugs, traces, PR reviews, bad retrievals, prompt-injection attempts, and good outputs.
  2. Split it into calibration and held-out suites. Tune rubrics only against the calibration suite.
  3. Pin the judge provider, model ID, temperature, output schema, rubric file, and rubric hash.
  4. Measure false-pass rate, false-block rate, precision/recall/F1 for the failure class, schema validity, quorum disagreement, and rerun stability (see the sketch after this list).
  5. Keep the judge advisory until the held-out suite meets the threshold for the risk family.
  6. Promote to blocking only with either a deterministic/structural companion gate or a calibrated judge quorum.
  7. Recalibrate when the rubric, judge model, provider, prompt wrapper, eval case schema, or target workflow changes.
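
The step 4 metrics reduce to a small report over the golden labels; a sketch, where the positive class is "known failure" and the field names are illustrative:

interface CalibrationCase {
  label: "pass" | "fail";   // golden-set ground truth
  judged: "pass" | "fail";  // pinned judge's verdict
}

function calibrationReport(cases: CalibrationCase[]) {
  const count = (f: (c: CalibrationCase) => boolean) => cases.filter(f).length;
  const caught = count((c) => c.label === "fail" && c.judged === "fail");
  const falsePasses = count((c) => c.label === "fail" && c.judged === "pass");
  const falseBlocks = count((c) => c.label === "pass" && c.judged === "fail");
  const precision = caught / Math.max(caught + falseBlocks, 1);
  const recall = caught / Math.max(caught + falsePasses, 1);
  return {
    // Critical gates demand a zero false-pass rate on held-out failures.
    falsePassRate: falsePasses / Math.max(count((c) => c.label === "fail"), 1),
    // High false-block rates teach developers to route around the gate.
    falseBlockRate: falseBlocks / Math.max(count((c) => c.label === "pass"), 1),
    precision,
    recall,
    f1: (2 * precision * recall) / Math.max(precision + recall, Number.EPSILON),
  };
}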

Default promotion bar:

  • Critical/security gates: zero false passes on held-out critical failures, plus deterministic or structural companion evidence.
  • Product-quality gates: false-block rate low enough that developers do not route around the gate; judge remains advisory if noisy.
  • RAG/agent metrics: calibrated thresholds for recall@k, MRR/NDCG, context-recall, tool-call F1, or trajectory success; model rubrics explain failures but do not replace the metric.

Singularity Memory Integration

Pre-dispatch:

  • Recall repo-specific harness lessons from project/{hash}.
  • Recall global engineering anti-patterns from global/coding.
  • Inject only the top relevant items into context.
  • Keep untracked observations summarized, not pasted wholesale.

Post-unit:

  • Retain successful harness changes only after gates pass.
  • Retain failures as anti-patterns with source unit and evidence IDs.
  • Retain judge calibration results separately from normal coding memories.
  • Link memory entries to .sf/sf.db run IDs and report paths.

Over time:

  • Repeated failing eval cases become anti-patterns.
  • Repeated successful fixes mature from candidate to established to proven.
  • Stale memories decay unless revalidated by passing evidence.
  • Drift events propose new harness tasks when repo reality changes.
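
A hypothetical shape for a retained entry with the evidence linkage described above; the Singularity Memory schema itself is outside this document:

interface RetainedLesson {
  scope: "project" | "global";            // project/{hash} vs global/coding
  kind: "pattern" | "anti_pattern" | "judge_calibration";
  maturity: "candidate" | "established" | "proven";
  claim: string;                          // the lesson, summarized for prompts
  sourceUnitId: string;
  evidenceRunIds: string[];               // links into the .sf/sf.db run ledger
  reportPaths: string[];                  // large artifacts stay on disk
  lastValidatedAt: number;                // decays unless revalidated by passing evidence
}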

Web As TUI

For web repos, treat the browser as an evented terminal:

  • The DOM/accessibility tree is the screen buffer.
  • User actions are keypress/click/form events.
  • Playwright traces are VCR recordings.
  • Visual diffs are frame comparisons.
  • Browser console and network logs are stderr/stdout.

The web harness should include action replay, semantic assertions, accessibility checks, screenshot diffs, and performance budgets. It should not rely on screenshots alone.
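
A sketch of one such case, assuming @playwright/test and @axe-core/playwright plus a configured baseURL; selectors, file names, and thresholds are illustrative:

import { test, expect } from "@playwright/test";
import AxeBuilder from "@axe-core/playwright";

test("home page smoke", async ({ page }) => {
  // Browser console is the harness stderr: collect errors as evidence.
  const consoleErrors: string[] = [];
  page.on("console", (msg) => {
    if (msg.type() === "error") consoleErrors.push(msg.text());
  });

  await page.goto("/");

  // Semantic assertion against the accessibility tree, not pixels.
  await expect(page.getByRole("navigation")).toBeVisible();

  // Accessibility scan; violations block like any other gate.
  const a11y = await new AxeBuilder({ page }).analyze();
  expect(a11y.violations).toEqual([]);

  // Frame comparison against the checked-in visual baseline.
  await expect(page).toHaveScreenshot("home.png", { maxDiffPixelRatio: 0.01 });

  expect(consoleErrors).toEqual([]);
});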

Acceptance Criteria

  • sf can profile a repo and produce a stable RepoProfile snapshot.
  • sf records untracked files as observed_only and never stages them by default.
  • sf can generate a reviewable harness manifest and at least one executable gate from a template kit.
  • sf can run a mixed deterministic/model-judge eval suite and store structured results.
  • sf retains successful patterns and failed anti-patterns into Singularity Memory with evidence links.
  • sf can detect harness drift and propose a follow-up unit instead of silently mutating files.