Repo-native Harness Architecture
Purpose
This document defines how sf builds, runs, and evolves repository-specific harnesses while preserving the split between tracked repo contracts, .sf/sf.db operational state, and Singularity Memory.
Goals
- Generate harnesses that match the repo's actual stack, risks, and production contract.
- Understand untracked files without silently owning them.
- Use deterministic evidence before model judgment.
- Retain proven lessons and anti-patterns into Singularity Memory.
- Evolve harnesses when the repo changes.
- Keep every generated file reviewable by the repo owner.
Non-goals
- Replacing the repo's existing test runner or CI provider.
- Treating LLM judge scores as sufficient for critical engineering correctness.
- Storing memories or embeddings inside `.sf/sf.db`.
- Auto-staging untracked files that sf merely observed.
System Flow
```text
repo files + git + docs + CI + package manifests + prior runs
        |
        v
  Repo Profiler
        |
        v
  Risk Classifier
        |
        v
  Harness Planner <---- Singularity Memory recall
        |
        v
  Template Kit Registry
        |
        v
  Harness Writer ---- tracked files: SPEC, ARCHITECTURE, harness/, gates/, CI snippets
        |
        v
  Evidence Runner ---- .sf/sf.db: runs, cases, results, observations, drift
        |
        v
  Memory Retainer ---- Singularity Memory: patterns, anti-patterns, repo risk notes
        |
        v
  Evolution Engine ---- schedules harness update proposals
```
Auto-flow Integration
Repo-native harnessing should enter the sf flow in stages. The early stages are read-only or evidence-only; prompt behavior changes come after fixtures exist.
| Flow point | Add now or later | Behavior |
|---|---|---|
| Session start / sf init | First implementation slice | Create a read-only RepoProfile snapshot from source, docs, CI, manifests, git status, and prior run history. |
| Plan phase | Later, after profiler tests | Surface missing harness coverage as a planning input, not as an automatic file write. |
| Execute phase | Later | Allow a task to adopt a proposed harness file only when the task plan claims it. |
| Verify phase | First implementation slice after manifest | Run harness commands and eval suites declared in harness/manifest.json. |
| PostUnit hook | First implementation slice | Store evidence summaries in .sf/sf.db; retain durable learnings and anti-patterns into Singularity Memory. |
| Reassess phase | Later | Use failed gates and repeated drift to propose harness updates. |
| Workflow prompt injection | Last | Inject top harness lessons and anti-patterns into prompts only after fixture coverage proves it improves outcomes. |
The immediate flow contract is:
- Observe repo shape.
- Record untracked files as observations only.
- Compare observed risk against existing harness coverage.
- Propose harness changes as reviewable artifacts.
- Run accepted harnesses.
- Retain evidence-backed lessons.
Do not jump directly to automatic prompt injection. That is where stale or noisy memory can degrade agent behavior before the evidence path is reliable.
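The staged ordering above can be sketched as a simple guard; the stage names and `canRun` helper are illustrative, not sf's actual internals:

```typescript
// Illustrative sketch of the staged flow contract; all names are hypothetical.
type Stage = "observe" | "record" | "compare" | "propose" | "run" | "retain";

const FLOW_ORDER: Stage[] = ["observe", "record", "compare", "propose", "run", "retain"];

// A stage may only run once every earlier stage has completed, so harnesses
// never execute before proposals exist, and retention never precedes evidence.
function canRun(stage: Stage, completed: Set<Stage>): boolean {
  const idx = FLOW_ORDER.indexOf(stage);
  return FLOW_ORDER.slice(0, idx).every((s) => completed.has(s));
}
```

The point of the guard is that "retain" and any prompt-injection step sit at the end of the chain, behind the evidence-producing stages.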
Data Ownership
| Data | Stored in | Why |
|---|---|---|
| Human contract | Tracked repo files | Reviewable, diffable, travels with code. |
| Executable gates and eval cases | Tracked repo files | CI can run them without sf internals. |
| Run history | `.sf/sf.db` | Local operational evidence and fast queries. |
| Repo profile snapshots | `.sf/sf.db` | Derived state; can be recomputed. |
| Untracked-file observations | `.sf/sf.db` | Important context, but not owned by sf. |
| Learnings and anti-patterns | Singularity Memory | Durable knowledge across sessions and tools. |
| Large reports | `.sf/reports/` or harness report dirs | Avoid bloating SQLite and prompts. |
Repo Profiler
The profiler reads the repository and emits a RepoProfile snapshot:
```typescript
interface RepoProfile {
  profileId: string;
  projectHash: string;
  git: {
    head: string | null;
    branch: string | null;
    remoteHash: string | null;
    dirty: boolean;
    changedFiles: RepoFileObservation[];
  };
  stacks: StackSignal[];
  entrypoints: EntrypointSignal[];
  tests: TestSignal[];
  ci: CiSignal[];
  docs: DocumentSignal[];
  dataStores: DataStoreSignal[];
  networkSurfaces: NetworkSurfaceSignal[];
  riskHints: RiskHint[];
  createdAt: number;
}
```
Inputs include:
- `git status --short`, `git ls-files`, current branch, remote, recent history.
- Package manifests such as `package.json`, `go.mod`, `Cargo.toml`, `pyproject.toml`, `flake.nix`, Dockerfiles, Compose files, devcontainers.
- Test directories, scripts, CI workflows, lint configs, migrations, fixtures, browser tests, smoke tests.
- Documentation such as `SPEC.md`, `ARCHITECTURE.md`, `AGENTS.md`, ADRs, runbooks, deployment docs.
- Source structure, entry points, route maps, command definitions, service definitions, generated files.
Untracked File Policy
Untracked files are part of repo reality. sf must see them.
```typescript
interface RepoFileObservation {
  path: string;
  gitStatus: "tracked" | "modified" | "deleted" | "renamed" | "untracked" | "ignored";
  ownership: "sf_generated" | "user_owned" | "observed_only" | "candidate_harness";
  language: string | null;
  sizeBytes: number;
  contentHash: string | null;
  summary: string | null;
  firstSeenAt: number;
  lastSeenAt: number;
  adoptedAt: number | null;
  adoptionUnitId: string | null;
}
```
Rules:
- `untracked` defaults to `observed_only`.
- `observed_only` files can influence context, risk classification, and memory.
- `observed_only` files cannot be staged, deleted, reformatted, moved, or overwritten by automatic flows.
- A file becomes `sf_generated` or `candidate_harness` only when a unit plan declares that ownership and the diff is reviewable.
- Repeated observations can produce a harness recommendation, not an automatic commit.
This lets sf understand documents, scratch specs, generated reports, and local experiments without turning them into accidental repository history.
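A minimal sketch of the default ownership rule, assuming the hypothetical helpers `defaultOwnership` and `mayAutoMutate` (neither is part of sf today, and the treatment of `ignored` and tracked files here is an assumption):

```typescript
// Hypothetical sketch of the untracked-file ownership policy.
type GitStatus = "tracked" | "modified" | "deleted" | "renamed" | "untracked" | "ignored";
type Ownership = "sf_generated" | "user_owned" | "observed_only" | "candidate_harness";

// Untracked files default to observed_only per the rules above.
// Mapping ignored files to observed_only and everything else to user_owned
// is an illustrative assumption, not a documented rule.
function defaultOwnership(status: GitStatus): Ownership {
  return status === "untracked" || status === "ignored" ? "observed_only" : "user_owned";
}

// Guard checked before any automatic write: only files sf explicitly owns
// may be mutated without a reviewable unit plan.
function mayAutoMutate(ownership: Ownership): boolean {
  return ownership === "sf_generated";
}
```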
Risk Classifier
The classifier maps RepoProfile to required harness families.
| Risk family | Signals | Required harness examples |
|---|---|---|
| Web | Next.js, Playwright, routes, CSS, browser tools | Playwright smoke, a11y, visual diffs, performance budget, browser trace replay |
| Agent | tool registry, prompts, MCP, provider SDKs | fixture replay, trajectory assertions, tool permission tests, injection red-team |
| RAG / retrieval | vector DB, embeddings, search, chunking | recall@k, MRR, NDCG, near-miss sets, faithfulness, context recall |
| Infrastructure | Nix, Docker, CI, deploy scripts | build matrix, secret scan, config validation, rollback checks |
| Database | migrations, SQL, ORM | migration up/down, data contract tests, destructive-change guard |
| Windows service | `_windows.go`, service managers, PowerShell | `GOOS=windows` build, service install smoke, PowerShell contract tests |
| Security | auth, sessions, tokens, secrets | auth bypass tests, CSRF, rate limit, sensitive log scan |
| Performance | native bindings, compile-heavy code, hot loops | benchmark suite, regression threshold, flamegraph capture |
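As a rough illustration of the mapping, a classifier could match signal substrings against the observed file list. The signal strings and family subset below are simplified examples, not the real signal set:

```typescript
// Simplified sketch of signal-to-family mapping; the real classifier
// consumes the full RepoProfile, not just file paths.
type RiskFamily = "web" | "agent" | "rag" | "infra" | "database" | "security";

// A family fires when any of its signal substrings appears in a file path.
// These signals are illustrative placeholders.
const SIGNALS: Record<RiskFamily, string[]> = {
  web: ["next.config", "playwright.config"],
  agent: ["mcp", "prompts/"],
  rag: ["embeddings", "chunking"],
  infra: ["Dockerfile", "flake.nix"],
  database: ["migrations/"],
  security: ["auth", "sessions"],
};

function riskFamilies(files: string[]): RiskFamily[] {
  return (Object.keys(SIGNALS) as RiskFamily[]).filter((family) =>
    SIGNALS[family].some((sig) => files.some((f) => f.includes(sig)))
  );
}
```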
Harness Planner
The planner compares the required harness families against the repo's current harness inventory.
Outputs:
- `missing`: risks with no harness coverage.
- `weak`: harness exists but lacks thresholds, fixtures, CI wiring, or reports.
- `stale`: harness references files/scripts that no longer exist.
- `overbroad`: harness is too slow or too generic for the risk.
- `proposed`: exact files and commands to add or modify.
Every proposal must include:
- Purpose.
- Consumer.
- Risk protected.
- Files written.
- Commands run.
- Blocking criteria.
- Rollback path.
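The gap analysis can be sketched as a pure function over the required families and the current inventory; the `HarnessInventoryEntry` shape and `planGaps` helper are hypothetical:

```typescript
// Hypothetical inventory shape; the real planner tracks more per harness.
interface HarnessInventoryEntry {
  family: string;
  hasThresholds: boolean;
  referencedPaths: string[];
}

// Compare required risk families against what the repo already has.
function planGaps(
  required: string[],
  inventory: HarnessInventoryEntry[],
  existingPaths: Set<string>
) {
  const covered = new Set(inventory.map((h) => h.family));
  return {
    // Required families with no harness at all.
    missing: required.filter((f) => !covered.has(f)),
    // Harnesses that exist but lack thresholds (one of several weakness signals).
    weak: inventory.filter((h) => !h.hasThresholds).map((h) => h.family),
    // Harnesses referencing files that no longer exist in the repo.
    stale: inventory
      .filter((h) => h.referencedPaths.some((p) => !existingPaths.has(p)))
      .map((h) => h.family),
  };
}
```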
Template Kit Registry
Template kits are starting points, not permanent truth.
```typescript
interface HarnessTemplateKit {
  id: string;
  title: string;
  appliesWhen: RiskHint[];
  writes: TemplateOutput[];
  commands: HarnessCommand[];
  requiredEvidence: EvidenceRequirement[];
  evolutionRules: EvolutionRule[];
}
```
Core kits:
| Kit | Files |
|---|---|
| `go-service` | `harness/manifest.json`, `gates/go-test.sh`, `gates/go-vet.sh`, optional `gates/windows-build.sh` |
| `typescript-cli` | `gates/npm-build.sh`, `gates/typecheck.sh`, fixture replay config |
| `agent-runtime` | `harness/evals/agent/*.jsonl`, trajectory assertions, injection red-team cases |
| `rag-system` | retrieval datasets, recall metrics, near-miss cases, judge rubrics |
| `web-app` | Playwright smoke, visual baseline policy, a11y checks |
| `database` | migration tests, destructive SQL guard, seed data fixtures |
| `nix-project` | `nix flake check`, dev shell smoke, direnv policy checks |
| `charm-service` | Go build/test, Wish SSH smoke, VCR session recording checks |
Harness Manifest
Each repo can carry a manifest:
```json
{
  "schema": "sf.harness.v1",
  "owner": "sf",
  "generatedBy": "sf",
  "repoProfileId": "01J...",
  "riskFamilies": ["agent", "rag", "web"],
  "commands": [
    {
      "id": "fixture-replay",
      "command": "npm run test:fixtures",
      "phase": "post_slice",
      "blocks": true,
      "timeoutSeconds": 300
    }
  ],
  "evalSuites": [
    {
      "id": "agent-tool-safety",
      "path": "harness/evals/agent-tool-safety.jsonl",
      "runner": "sf-eval",
      "threshold": 0.95
    }
  ]
}
```
The manifest is a tracked contract. .sf/sf.db stores run history for the manifest, not the manifest itself.
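A runner consuming this contract might parse the manifest and pick out the commands that gate a phase. The `blockingCommands` helper and the second (non-blocking) command entry are illustrative only:

```typescript
// Hypothetical helper: select the commands that gate a given phase.
// Field names follow the sf.harness.v1 example above.
interface HarnessCommand {
  id: string;
  command: string;
  phase: string;
  blocks: boolean;
  timeoutSeconds: number;
}

// Only commands declared for this phase and marked blocking can gate a slice.
function blockingCommands(commands: HarnessCommand[], phase: string): HarnessCommand[] {
  return commands.filter((c) => c.phase === phase && c.blocks);
}

// Parse a minimal manifest fragment; the "lint" entry is invented for contrast.
const manifest = JSON.parse(`{
  "schema": "sf.harness.v1",
  "commands": [
    { "id": "fixture-replay", "command": "npm run test:fixtures",
      "phase": "post_slice", "blocks": true, "timeoutSeconds": 300 },
    { "id": "lint", "command": "npm run lint",
      "phase": "post_slice", "blocks": false, "timeoutSeconds": 120 }
  ]
}`);
```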
Judge Rig
The judge rig follows one rule: deterministic evidence first, model judgment second.
Implementation boundary
This is documented now; it is not part of the current repo-profiler slice.
Placement:
- SF core stays in TypeScript for repo profiling, harness proposal planning, project preferences/config, and `.sf/sf.db` run ledgers.
- Deterministic and structural assertions can run locally from SF because they already map to commands, AST checks, schemas, and git/diff checks.
- Model-judge execution and calibration should be a future Go/Charm service, not another TS subsystem. Use `fantasy/catwalk` for model/provider routing, Go HTTP/MCP APIs for SF integration, and `promwish`-style metrics when it is daemonized.
- Durable calibration lessons belong in Singularity Memory. Local `.sf/sf.db` stores run IDs, rubric hashes, model IDs, scores, raw output references, and pass/fail summaries.
- Repo-local custom skills remain out of scope. Repo-specific eval suites or harness files are later opt-in proposals only.
Case format
```json
{
  "id": "rag-role-reversal-001",
  "kind": "retrieval",
  "input": {
    "query": "Which service owns failover routing?",
    "expected_documents": ["docs/architecture.md#gateway"]
  },
  "assert": [
    { "type": "recall_at_k", "k": 5, "threshold": 1.0 },
    { "type": "context_recall", "threshold": 0.85 },
    { "type": "llm_rubric", "rubric": "Answer must identify the gateway and not the portal as the routing owner.", "advisory": true }
  ],
  "tags": ["rag", "role-reversal", "near-miss"]
}
```
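The deterministic side of such a case can be scored without any model. Below is a sketch of `recall_at_k` and of the rule that advisory assertions never block; both helpers and the trimmed `Assertion` union are hypothetical:

```typescript
// Trimmed, illustrative assertion union; the real schema has more variants.
type Assertion =
  | { type: "recall_at_k"; k: number; threshold: number }
  | { type: "contains"; value: string }
  | { type: "llm_rubric"; rubric: string; advisory: true };

// recall@k: fraction of expected documents present in the top-k retrieved.
function recallAtK(retrieved: string[], expected: string[], k: number): number {
  const top = new Set(retrieved.slice(0, k));
  const hit = expected.filter((d) => top.has(d)).length;
  return expected.length === 0 ? 1 : hit / expected.length;
}

// Advisory assertions report but never block; everything else blocks by default.
function blocksOnFailure(a: Assertion): boolean {
  return !("advisory" in a && a.advisory);
}
```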
Assertion types
| Type | Blocking default | Notes |
|---|---|---|
| `exit_code` | yes | Command pass/fail. |
| `contains` / `not_contains` | yes | Deterministic text contracts. |
| `json_schema` | yes | Structured output contract. |
| `ast_match` | yes | Code shape and API use. |
| `recall_at_k` | yes when calibrated | Retrieval coverage. |
| `mrr` / `ndcg` | yes when calibrated | Ranking quality. |
| `tool_call_f1` | yes when calibrated | Agent tool precision/recall. |
| `trajectory_goal_success` | no by default | Useful judge signal; requires trace data. |
| `llm_rubric` | no by default | Advisory until calibrated. |
| `factuality` | no by default | Needs references and judge calibration. |
| `select_best` | no | Useful for model/prompt comparison. |
Judge calibration
Before a model judge can block:
- The rubric file must be tracked.
- The judge model and provider must be pinned.
- A calibration suite with known pass/fail examples must pass.
- A disagreement policy must exist for high-risk suites.
- The runner must store the judge prompt hash, model ID, score, reason, and raw output reference.
For high-risk agent or RAG gates, use either deterministic metrics or a judge quorum. A single uncalibrated model opinion is never enough.
Calibration lifecycle:
- Build a golden set with known pass, fail, and ambiguous examples from real bugs, traces, PR reviews, bad retrievals, prompt-injection attempts, and good outputs.
- Split it into calibration and held-out suites. Tune rubrics only against the calibration suite.
- Pin the judge provider, model ID, temperature, output schema, rubric file, and rubric hash.
- Measure false-pass rate, false-block rate, precision/recall/F1 for the failure class, schema validity, quorum disagreement, and rerun stability.
- Keep the judge advisory until the held-out suite meets the threshold for the risk family.
- Promote to blocking only with either a deterministic/structural companion gate or a calibrated judge quorum.
- Recalibrate when the rubric, judge model, provider, prompt wrapper, eval case schema, or target workflow changes.
Default promotion bar:
- Critical/security gates: zero false passes on held-out critical failures, plus deterministic or structural companion evidence.
- Product-quality gates: false-block rate low enough that developers do not route around the gate; judge remains advisory if noisy.
- RAG/agent metrics: calibrated thresholds for recall@k, MRR/NDCG, context-recall, tool-call F1, or trajectory success; model rubrics explain failures but do not replace the metric.
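The calibration metrics above can be computed mechanically from a labeled golden set. In this sketch the case shape, the helpers, and the non-critical 5% threshold are illustrative assumptions:

```typescript
// Hypothetical golden-set case: a human label versus the judge's verdict.
interface CalibrationCase { expected: "pass" | "fail"; judged: "pass" | "fail"; }

// False-pass rate: real failures the judge let through.
// False-block rate: real passes the judge rejected.
function calibrationRates(cases: CalibrationCase[]) {
  const falsePass = cases.filter((c) => c.expected === "fail" && c.judged === "pass").length;
  const falseBlock = cases.filter((c) => c.expected === "pass" && c.judged === "fail").length;
  const fails = cases.filter((c) => c.expected === "fail").length;
  const passes = cases.filter((c) => c.expected === "pass").length;
  return {
    falsePassRate: fails === 0 ? 0 : falsePass / fails,
    falseBlockRate: passes === 0 ? 0 : falseBlock / passes,
  };
}

// Promotion bar: critical gates require zero held-out false passes;
// the 5% bar for other gates is an illustrative placeholder, not a spec value.
function mayBlock(rates: { falsePassRate: number }, critical: boolean): boolean {
  return critical ? rates.falsePassRate === 0 : rates.falsePassRate < 0.05;
}
```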
Singularity Memory Integration
Pre-dispatch:
- Recall repo-specific harness lessons from `project/{hash}`.
- Recall global engineering anti-patterns from `global/coding`.
- Inject only the top relevant items into context.
- Keep untracked observations summarized, not pasted wholesale.
Post-unit:
- Retain successful harness changes only after gates pass.
- Retain failures as anti-patterns with source unit and evidence IDs.
- Retain judge calibration results separately from normal coding memories.
- Link memory entries to `.sf/sf.db` run IDs and report paths.
Over time:
- Repeated failing eval cases become anti-patterns.
- Repeated successful fixes mature from candidate to established to proven.
- Stale memories decay unless revalidated by passing evidence.
- Drift events propose new harness tasks when repo reality changes.
Web As TUI
For web repos, treat the browser as an evented terminal:
- The DOM/accessibility tree is the screen buffer.
- User actions are keypress/click/form events.
- Playwright traces are VCR recordings.
- Visual diffs are frame comparisons.
- Browser console and network logs are stderr/stdout.
The web harness should include action replay, semantic assertions, accessibility checks, screenshot diffs, and performance budgets. It should not rely on screenshots alone.
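Treating the accessibility tree as the screen buffer, a semantic assertion reduces to a tree search by role and accessible name. This sketch uses a simplified node shape and a hypothetical `findByRole` helper, not Playwright's actual API:

```typescript
// Simplified accessibility-tree node; real a11y trees carry far more state.
interface A11yNode { role: string; name: string; children?: A11yNode[]; }

// Find a node by role and accessible name, the way a terminal harness
// greps a screen buffer instead of diffing pixels.
function findByRole(node: A11yNode, role: string, name: string): A11yNode | null {
  if (node.role === role && node.name === name) return node;
  for (const child of node.children ?? []) {
    const hit = findByRole(child, role, name);
    if (hit) return hit;
  }
  return null;
}
```

Asserting on roles and names this way is what keeps the harness from relying on screenshots alone.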
Acceptance Criteria
- sf can profile a repo and produce a stable `RepoProfile` snapshot.
- sf records untracked files as `observed_only` and never stages them by default.
- sf can generate a reviewable harness manifest and at least one executable gate from a template kit.
- sf can run a mixed deterministic/model-judge eval suite and store structured results.
- sf retains successful patterns and failed anti-patterns into Singularity Memory with evidence links.
- sf can detect harness drift and propose a follow-up unit instead of silently mutating files.