singularity-forge/docs/SPEC_FIRST_TDD.md
2026-05-08 03:01:20 +02:00


sf Spec-First TDD

The change-method constitution for sf. Terse and procedural — optimized for agent retrieval. It operationalizes ADR-0000: SF Is a Purpose-to-Software Compiler.

Purpose

Every change in sf must:

  1. solve a real system need
  2. preserve or increase system value
  3. clarify behavior before implementation
  4. make tests define the contract
  5. find and close gaps in what already exists

Priority: purpose > value > contract > working code.

If purpose and value are clear but implementation is uncertain, write contract tests first and align code to them.

Iron Law

THE TEST IS THE SPEC.  THE JSDOC IS THE PURPOSE.  CODE EXISTS TO FULFILL PURPOSE.

NO BEHAVIOR CHANGE WITHOUT A FAILING TEST FIRST.
NO COMPLETION WITHOUT A REAL CONSUMER.
NO JUDGMENT CALL WITHOUT A CONFIDENCE AND FALSIFIER.

The test is the spec — not verification of the spec. Tests describe what the software MUST do, not what it happens to do. A test that mirrors implementation rubber-stamps bugs.

The JSDoc is the purpose — every exported function, type, and class opens with a one-line Purpose: statement. If you can't write the purpose before the code, you don't know what you're building. Purpose drives what the test asserts. Code without a stated purpose cannot be verified.

Code exists to fulfill purpose — not to compile, not to pass lint, not to look clean. Quality measure: does it satisfy the purpose (JSDoc) as verified by the spec (test)? Code that compiles but doesn't serve its stated purpose is a bug.
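The one-line `Purpose:` convention can be this small. A minimal sketch — `isClaimable` and its signature are illustrative, not sf's real API:

```typescript
/**
 * Purpose: report whether a lease can be claimed — it was never claimed,
 * or its previous claim has expired.
 */
export function isClaimable(claimUntil: number | null, now: number): boolean {
  // The Purpose: line above is exactly what the spec (test) asserts against.
  return claimUntil === null || claimUntil <= now;
}
```

If you cannot write that first line before the body, the Purpose Gate below is already telling you something.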

Purposeful tests vs. mechanical tests

| Kind | Asserts | Survives refactor? |
| --- | --- | --- |
| Purposeful | "`claim()` returns `rows_affected=1` only when the lease was free or expired" | yes |
| Mechanical | `mockDb.update.calls.length === 1` | no |

Write purposeful tests first. They are the spec. A different implementation that passes them is equally correct. Add mechanical tests only as labelled implementation guards for specific failure modes (resource leaks, infinite loops).

Three-tier test organization

  1. Behaviour contracts (primary) — what the consumer receives. The spec.
  2. Degradation contracts — what happens when dependencies fail. Consumer must always get a useful response; failure must degrade, not crash.
  3. Implementation guards (secondary, labelled) — protect against specific failure modes. A refactor that changes internals updates guards, not behaviour contracts.

Decomposition Path

.sf working model + DB roadmap → Milestone → Slice → Task → contract test → code → evidence

Reject: prompt → files → hope.

Every unit (milestone, slice, task) sits in one of those rows. If a piece of work doesn't, it is unspecified.

Purpose Gate

Every artifact (slice plan, task plan, function, test, ADR) must answer the same 8 PDD fields captured by the purpose-driven-development skill — these fields ARE the Purpose Gate:

  • Purpose: why this behaviour exists.
  • Consumer: who depends on the outcome in production (real caller, not just tests).
  • Contract: what observable behaviour proves success — what the consumer receives, not how the implementation works internally.
  • Failure boundary: what correct failure looks like if the purpose can't be fulfilled — degrade, surface, do not swallow.
  • Evidence: the test, metric, or repro that proves the contract. Each criterion must be machine-executable (named test, queryable metric, runnable command) OR explicitly tagged [MANUAL: reviewer + scenario]. Prose-only evidence is unfalsifiable and rejected.
  • Non-goals: what this is not solving.
  • Invariants: what must remain true. If the change touches async, queues, timers, or state machines, split into safety ("X never happens") + liveness ("Y eventually happens"). Pure synchronous code may use safety-only.
  • Assumptions: conditions about the world that MUST be true for this spec to be valid — locking protocols, API stability, caller invariants, deployment context, data shape. World-side failures (assumption violated) are invisible to internal tests and are the most expensive failure class.

If any field is missing: BLOCKED: purpose unclear — [which field is missing]. Do not invent a plausible answer to proceed. Surfacing the gap is more valuable than rationalising past it.

Treat the contract as a falsifiable hypothesis: name the evidence that would prove it wrong before implementation locks in. A contract without a falsifier is half a contract.

Workflow (mapped to sf's phase machine)

Research phase — name the problem

Before any plan:

  • Where does this sit in .sf/PROJECT.md, .sf/REQUIREMENTS.md, .sf/DECISIONS.md, or DB-backed roadmap state?
  • Why is it useful, who needs it, what does it enable?
  • What breaks if wrong, what is out of scope?

For brownfield changes, consumer discovery precedes purpose articulation. Use rg / git grep to find real callers — never assume. You cannot reason about "what breaks" until you know who calls the code.

```sh
rg -nF "functionName" src/ packages/ --type=ts
git grep -n "functionName"
```

If you can't name a real consumer, stop. Don't add code yet.

Plan phase — clarify before deciding

Clarify highest-impact unknowns first: behaviour, acceptance criteria, data invariants, failure handling, security, integration boundaries.

For non-trivial contracts, pressure-test before locking the plan via the advisory-partner skill — this is sf's adversarial review surface, already wired into the Q3/Q4 gates and validate-milestone. It runs with the validation model, distinct from the planning/execution model — that's the point.

  1. Advocate pass — strengthen the best version of the contract.
  2. Challenger pass — attack assumptions AND propose an alternative. A challenger anchored to the advocate's framing is not adversarial.
  3. Falsifier (required gate, blocks Plan→Execute): FALSIFIER: this contract is wrong if [specific observable condition]. Generic falsifiers ("wrong if it doesn't work") are process failures.

Find the devil and find the experts:

  • Devil — finds the specific failure that compounds silently: wrong assumption → wrong test → wrong code → wrong evidence, all passing.
  • Experts — domain specialists who know what right looks like. Pick expertise matching the decision: SRE (reliability), security (trust boundary), distributed systems (consistency), API reviewer (ergonomics).

Both forces must act on the contract before it becomes tests. One strong pass each, unless concrete risk remains.

Plan from contracts, not files

Purpose re-check: restate purpose from the Research step in one sentence. If the plan now serves a different purpose, the contract drifted — go back.

Each behaviour slice defines: consumer, contract, code path, validation, falsifier.

| Good | Bad |
| --- | --- |
| Add failing test proving `claim()` rejects expired-lease takeover when `claim_until > now()`. | Edit `src/resources/extensions/sf/auto-dispatch.ts`. |

TDD phase — write the test first

  1. Write the failing test.
  2. Make it fail for the right reason (feature missing, not typo).
  3. Only then write production code.

Purpose re-check: does this test prove behaviour serving the stated purpose?

Test types:

| Behaviour | Test type |
| --- | --- |
| Pure logic, local invariants | Unit |
| Interface/schema contracts | Contract |
| Storage, orchestration, multi-component | Integration |
| Existing behaviour you must preserve | Characterisation |
| State machines, routing, normalisation | Property/invariant |

Test naming: test_<what>_<when>_<expected> or describe-blocks structured the same way. The name is the contract claim.

```sh
npm run test:unit -- path/to/file.test.ts
```

If it passes immediately, you're testing existing behaviour. Fix the test.

Execute phase — minimal production code

Smallest change that makes the spec (test) green while serving the purpose (JSDoc). Nothing more. No YAGNI violations, no surrounding cleanup.

Do not weaken the test to fit sloppy code — fix the code. Code that compiles and passes lint but doesn't fulfil its stated purpose is a bug.

Verify phase — green, lint, type-check

```sh
npm run typecheck:extensions
npm test
```

All tests green. Zero lint/type errors. Then refactor while green.

Review phase — verify usefulness

Purpose re-check (final): does the code serve a real production consumer?

Verify: who calls it (rg for usages), what production path depends on it, what signal would reveal breakage. If only tests call it, it is not finished or not needed.

Falsifier follow-through: re-check the falsifier from the Plan phase. If the falsifier is observable post-deploy, add it to monitoring or to the unit's verification commands. A falsifier that is never checked after deploy is half a contract.

Zero callers ≠ zero purpose. Before deleting: does it serve an unmet need (wire it in) or is it superseded (delete it)? Never test for absence of old code — test that new behaviour works.

Confidence Gate (between phases)

After completing a step, state confidence as a number from 0.0 to 1.0 and a one-line reason. The number forces a pause to assess rather than ploughing ahead on momentum.

| Step | Threshold | Below threshold |
| --- | --- | --- |
| Purpose & consumer | 0.95 | Run an adversarial review wave (advisory-partner Q3/Q5). |
| Contract test | 0.90 | Adversarial review wave. |
| Implementation | 0.95 | Add a specialist reviewer for the touched boundary (e.g. provider/transport/security). |
| Final evidence | 0.97 | Full adversarial: advocate + challenger + specialist. |

Skip the gate for trivial steps (typo fix, exhaustive matches with full coverage). The gate earns its keep on I/O boundaries, async loading, protocol integration, and anything touching real backends or models.

LLM confidence numbers are poorly calibrated in absolute terms — the relative signal matters. If you write 0.7, you know you're guessing. Act on that.

Tests Find Gaps

Testing existing code is one of the highest-value activities sf can do. A test that reveals an existing gap is more valuable than one validating new code — the gap was compounding in production.

High-value gap tests:

  • Purpose — does this module do what its JSDoc claims?
  • Fallback — does failure surface or get masked?
  • Persistence — does state survive restart? (especially .sf/sf.db, .sf/runtime/*.json)
  • Boundary — what happens at empty input, max value, network partition, expired claim?
  • Contract — does the caller get what it expects?

When a test fails against existing code, fix the code. The test told you what was broken.

50 tested features > 500 untested ones.

Test Rules

  • Test first. Without it, you mirror implementation — bugs and all.
  • Bug = missing correct-behaviour test. Write a test for the correct behaviour first; it must fail (RED) because the bug exists. If it passes immediately, the test is wrong (testing the broken behaviour) — fix the test, not the code.
  • Bug reports → failing regression test first.
  • Behaviour change without tests is incomplete.
  • Bad tests produce bad code. A test validating silent failure is wrong — rewrite it.
  • Test through the public contract. Don't expose _helpers for testability; assert through real callers.
  • Tests pin behaviour, not internal decomposition. A test that breaks on refactor without behaviour change is mechanical, not purposeful.
  • Critical invariants may need property tests, not just examples (e.g. ULID monotonicity, claim race, idempotent migrations).
  • Fix code to satisfy live-contract tests. Fix or delete tests encoding stale behaviour.
  • Fallbacks must deliver working behaviour or not exist. A fallback that silently returns nothing is worse than none.

Test Boundaries

  • Test through the public contract that production consumers use.
  • Do not promote _helper to helper for testing convenience.
  • Assert through public methods, not implementation detail.
  • Tests pin behaviour, not internal decomposition.
  • For Node.js native test runner: async test functions and await; never call .then()/.catch() chains in test bodies when await expresses the same contract.

Self-Modification Boundary

sf modifies its own codebase via the auto-loop. Without a protected zone, constitutional drift is silent.

Protected files (human approval required): .sf/PRINCIPLES.md, .sf/TASTE.md, .sf/ANTI-GOALS.md, .sf/REQUIREMENTS.md, .sf/DECISIONS.md, BUILD_PLAN.md, UPSTREAM_PORT_GUIDE.md, AGENTS.md, CLAUDE.md, CONTRIBUTING.md, docs/SPEC_FIRST_TDD.md, every docs/dev/ADR-*.md.

Autonomous agents may propose changes but must not merge to these without human review.

Test infrastructure (tests/, *.test.ts, tsconfig*.json, lint config) requires advocate/challenger/falsifier — a change to test infra can make all future tests pass vacuously. Treat test-infra changes as governance-adjacent: they alter the validity of every test that runs after them. A corrupted test runner is more dangerous than a corrupted test.

Evidence

Required for production-impacting changes:

  • failing test → passing test → type-check → lint
  • advocate's strongest support, challenger's strongest opposition, falsifier + outcome
  • runtime evidence: traces (.sf/traces/), event log (.sf/event-log.jsonl), gate results
  • for non-trivial runtime/provider fixes: explicit repro before code, solved boundary after code

Persist learning: when a unit produces a gotcha or anti-pattern, write to sf's memory store (memories table) so the next unit sees it. Evidence that only lives in the conversation dies on restart.

Degraded Operation

| Dependency down | Behaviour |
| --- | --- |
| Native engine (`forge_engine.node`) | Fall back to JS implementations; log degraded mode. Never silently proceed without confirming the fallback path is wired. |
| `node:sqlite` unavailable | Block DB-owned operations; there is no no-DB planning mode or alternate SQLite engine fallback. Read files only as human evidence. |
| LLM provider | Try the next allowed provider per `~/.sf/preferences.md`; if exhausted, halt the unit with `ErrModelUnavailable` (no silent skip). |
| SOPS unavailable | Use already-exported env vars; log that secret refresh is unavailable. Block secret-touching commands. |

When a dependency is down: operate in defined degraded mode or stop. Never silently proceed.
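The provider row above can be sketched as a loop. Hypothetical — `callProvider`, `ErrModelUnavailable`, and the sync signature are illustrative (the real call is async); none of this is sf's actual API:

```typescript
class ErrModelUnavailable extends Error {}

function callWithFallback(
  providers: string[], // allowed providers, in preference order
  callProvider: (p: string) => string, // sync stand-in for the real async call
): string {
  for (const p of providers) {
    try {
      return callProvider(p); // first provider that answers wins
    } catch {
      // Degrade visibly: log and move to the next allowed provider.
      console.warn(`provider ${p} failed; trying next allowed provider`);
    }
  }
  // Exhausted: halt the unit loudly — never silently skip.
  throw new ErrModelUnavailable("all allowed providers failed");
}
```

The shape matters more than the names: each failure is surfaced, and exhaustion is a defined halt rather than a quiet no-op.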

Task Template

Each task:

Purpose (need + why) → Consumer (who depends) → Contract (test proving it) → Implementation (code changes) → Evidence (test + lint + runtime signal).

If a task cannot be described this way, it is underspecified.

References

  • GitHub Spec Kit — spec-first authoring patterns.
  • Ousterhout, A Philosophy of Software Design — deep modules, contract pattern.
  • Trail of Bits — anti-rationalisation rules.
  • ACE — original Iron Law / Purpose Gate framing this doc adapts.