Jeremy McSpadden 39b3daee6f feat: add token optimization suite for prompt caching, compression, and smart context selection
Introduces six new modules that work together to reduce token usage across
the dispatch pipeline while preserving semantic content quality:

- Provider-aware token counting with per-provider char/token ratios
- Prompt cache optimizer for maximizing Anthropic/OpenAI cache hit rates
- Structured data formatter (compact notation for decisions/requirements/tasks)
- Deterministic prompt compressor (light/moderate/aggressive levels)
- Semantic chunker with TF-IDF relevance scoring for context selection
- Summary distiller for condensed dependency summaries

Integration points:
- inlineDependencySummaries uses distillation before truncation (3+ deps)
- inlineDecisionsFromDb/inlineRequirementsFromDb use compact format at non-full levels
- buildExecuteTaskPrompt compresses carry-forward when it exceeds 40% of budget
- context-budget.reduceToFit combines compression with section-boundary truncation
- computeBudgets accepts optional provider for accurate char/token ratios

All existing 1475 unit tests + 30 integration tests pass with zero regressions.
157 new tests cover all optimization modules.

Token Optimization Suite — Implementation Plan

Overview

Comprehensive token optimization across the GSD dispatch pipeline. Six phases targeting prompt caching, accurate token counting, structured data compression, prompt compression, semantic context selection, and summary distillation.

Phase 1: Prompt Cache Optimization (P0)

Goal: Restructure dispatch prompt assembly for maximum cache hit rates.

What

Anthropic prompt caching gives 90% savings on cached input tokens. Currently, GSD places cache_control on system prompts and the last user message (in packages/pi-ai/src/providers/anthropic.ts). But dispatch prompts in auto-prompts.ts mix static and dynamic content throughout, reducing cache prefix reuse.
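
For reference, this is roughly how Anthropic's caching is driven: everything up to and including the block marked with cache_control is eligible for prefix caching, so static material must come first and be byte-identical across calls. Illustrative only — the real call lives in packages/pi-ai/src/providers/anthropic.ts, and the model id is a placeholder.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Static material (templates, plan, decisions) goes in the cached system block;
// only the task-specific suffix changes per dispatch.
async function dispatchWithCaching(staticPrefix: string, dynamicSuffix: string) {
  return client.messages.create({
    model: "claude-sonnet-4-20250514", // placeholder model id
    max_tokens: 4096,
    system: [
      { type: "text", text: staticPrefix, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: dynamicSuffix }],
  });
}
```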

Tasks

  1. Create prompt-cache-optimizer.ts — module that separates prompt content into cacheable (static) and dynamic (per-task) sections.

    • Static: templates, plans, decisions, roadmap, project context
    • Dynamic: task-specific instructions, file contents, overrides
    • Export splitForCaching(prompt: string, staticSections: string[]): { staticPrefix: string; dynamicSuffix: string }
  2. Add buildCacheablePrefix() to auto-prompts.ts — for each builder, extract the static portion that's reused across tasks in the same slice:

    • Slice plan (same across all tasks in slice)
    • Decisions register (same across all tasks)
    • Requirements (same within scope)
    • Templates (always the same)
  3. Metrics tracking — extend metrics.ts to track cacheHitRate per unit. Already tracks cacheRead and cacheWrite tokens — add derived percentage.
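
A minimal sketch of the splitForCaching interface from task 1, assuming static sections are passed as literal substrings of the assembled prompt; the matching strategy here is illustrative, not the final implementation.

```typescript
export interface CacheSplit {
  staticPrefix: string;  // stable across tasks in the slice — safe to cache
  dynamicSuffix: string; // task-specific content — changes every dispatch
}

// Pull the known-static sections to the front of the prompt so the cacheable
// prefix is byte-identical across tasks; everything else trails as the suffix.
export function splitForCaching(prompt: string, staticSections: string[]): CacheSplit {
  const staticParts: string[] = [];
  let remainder = prompt;

  for (const section of staticSections) {
    if (section && remainder.includes(section)) {
      staticParts.push(section);
      remainder = remainder.replace(section, "");
    }
  }

  return {
    staticPrefix: staticParts.join("\n\n"),
    dynamicSuffix: remainder.trim(),
  };
}
```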

Files Modified

  • src/resources/extensions/gsd/prompt-cache-optimizer.ts (NEW)
  • src/resources/extensions/gsd/auto-prompts.ts (modify builders)
  • src/resources/extensions/gsd/metrics.ts (add cache hit rate)
  • src/resources/extensions/gsd/tests/prompt-cache-optimizer.test.ts (NEW)

Phase 2: Accurate Multi-Provider Token Counting (P1)

Goal: Replace GPT-4o-only tiktoken with provider-aware counting.

What

token-counter.ts uses tiktoken with the gpt-4o encoder for ALL providers. Claude uses a different tokenizer, so counts can be off by 15-25%, which leads to budget under- or over-allocation.

Tasks

  1. Add provider-aware counting — extend countTokens() to accept an optional provider parameter:

    • anthropic: Use @anthropic-ai/sdk messages.countTokens() for exact counts
    • openai: Keep tiktoken (already accurate)
    • google/mistral/others: Keep chars/4 heuristic (best available)
  2. Add estimateTokensForProvider(text, provider) function — synchronous estimation that uses provider-specific char ratios:

    • Anthropic: ~3.5 chars/token (Claude's tokenizer produces slightly more tokens per character than tiktoken)
    • OpenAI: ~4 chars/token (tiktoken accurate)
    • Others: ~4 chars/token (conservative default)
  3. Update context-budget.ts — use provider-aware CHARS_PER_TOKEN constant based on the configured execution model's provider.
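
A sketch of the synchronous estimator from task 2, using the per-provider ratios listed above. The signature matches the plan; the ratio table is the assumption being encoded.

```typescript
// Approximate characters-per-token ratios per provider; exact counting stays
// on tiktoken for OpenAI and on Anthropic's countTokens endpoint for Claude.
const CHARS_PER_TOKEN: Record<string, number> = {
  anthropic: 3.5,
  openai: 4,
};

const DEFAULT_CHARS_PER_TOKEN = 4; // conservative default for other providers

export function estimateTokensForProvider(text: string, provider?: string): number {
  const ratio = (provider && CHARS_PER_TOKEN[provider]) || DEFAULT_CHARS_PER_TOKEN;
  return Math.ceil(text.length / ratio);
}
```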

Files Modified

  • src/resources/extensions/gsd/token-counter.ts (extend)
  • src/resources/extensions/gsd/context-budget.ts (provider-aware ratio)
  • src/resources/extensions/gsd/tests/token-counter.test.ts (NEW)
  • src/resources/extensions/gsd/tests/context-budget.test.ts (extend)

Phase 3: Structured Data Compression with TOON (P1)

Goal: Reduce token usage for structured data blocks in prompts by 30-60%.

What

Decisions registers, requirements lists, task plans, and metrics are passed as verbose markdown tables. TOON (Token-Oriented Object Notation) removes braces/brackets/quotes, using indentation and tabular patterns instead.

Tasks

  1. Add @toon-format/toon dependency — install the npm package.

  2. Create structured-data-formatter.ts — module that converts structured data to TOON format for prompt injection:

    • formatDecisionsTOON(decisions: Decision[]): string
    • formatRequirementsTOON(requirements: Requirement[]): string
    • formatTaskPlanTOON(tasks: TaskPlanEntry[]): string
    • Each includes a brief format header so the LLM knows how to parse it
  3. Integrate with context-store.ts — add TOON variants of formatDecisionsForPrompt() and formatRequirementsForPrompt().

  4. Gate behind inline level — minimal and standard use TOON; full uses markdown (backward compatible).
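
A hand-rolled sketch of the decisions formatter. It deliberately does not call @toon-format/toon; it just emits the kind of compact, header-plus-rows block the plan describes, and the Decision fields are assumptions about the real type.

```typescript
// Assumed shape — the real Decision type lives in the GSD extension.
interface Decision {
  id: string;
  title: string;
  status: string;
}

// Emit a compact tabular block instead of a markdown table, with a one-line
// format hint so the model knows how to read it.
export function formatDecisionsTOON(decisions: Decision[]): string {
  const header = `decisions[${decisions.length}]{id,title,status}:`;
  const rows = decisions.map((d) => `  ${d.id},${d.title},${d.status}`);
  return [
    "Format: one decision per row, columns as listed in the header.",
    header,
    ...rows,
  ].join("\n");
}

// Example output for two decisions:
// decisions[2]{id,title,status}:
//   D-01,Use SQLite for context store,accepted
//   D-02,Adopt provider-aware token counting,accepted
```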

Files Modified

  • package.json (add dependency)
  • src/resources/extensions/gsd/structured-data-formatter.ts (NEW)
  • src/resources/extensions/gsd/context-store.ts (add TOON variants)
  • src/resources/extensions/gsd/auto-prompts.ts (use TOON when level != full)
  • src/resources/extensions/gsd/tests/structured-data-formatter.test.ts (NEW)

Phase 4: Prompt Compression via LLMLingua-2 (P2)

Goal: Compress large context blocks 3-5x while preserving semantic meaning.

What

When context exceeds budget, instead of dropping entire sections (current behavior), compress them using LLMLingua-2. This preserves information density while reducing tokens.

Tasks

  1. Create prompt-compressor.ts — wrapper around compression logic:

    • compressContext(text: string, targetRatio: number): Promise<string>
    • Supports configurable compression ratios (2x for light, 5x for aggressive)
    • Falls back to section-boundary truncation if compression fails
    • Includes compression stats for metrics
  2. Integrate with context-budget.ts — add compressBeforeTruncate option:

    • When content exceeds budget, try compression first
    • Only truncate if compressed content still exceeds budget
    • Track compression ratio in metrics
  3. Gate behind preference — new compression_strategy preference:

    • "truncate" (default, backward-compatible): current section-boundary truncation
    • "compress": use LLMLingua-2 before truncating
    • compress is auto-enabled for the budget and balanced profiles
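
A sketch of how the compressBeforeTruncate path in context-budget.reduceToFit might sequence the two strategies. Names other than compressContext are placeholders standing in for the real modules.

```typescript
// Placeholders for the real prompt-compressor and truncation helpers.
declare function compressContext(text: string, targetRatio: number): Promise<string>;
declare function truncateAtSectionBoundary(text: string, budgetChars: number): string;

interface ReduceOptions {
  compressBeforeTruncate?: boolean;
}

// Try compression first; only fall back to section-boundary truncation if the
// compressed text still exceeds the budget.
export async function reduceToFit(
  content: string,
  budgetChars: number,
  opts: ReduceOptions = {},
): Promise<string> {
  if (content.length <= budgetChars) return content;

  if (opts.compressBeforeTruncate) {
    const targetRatio = Math.max(2, content.length / budgetChars);
    const compressed = await compressContext(content, targetRatio);
    if (compressed.length <= budgetChars) return compressed;
    content = compressed; // still too big — truncate the compressed form
  }

  return truncateAtSectionBoundary(content, budgetChars);
}
```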

Files Modified

  • src/resources/extensions/gsd/prompt-compressor.ts (NEW)
  • src/resources/extensions/gsd/context-budget.ts (integrate)
  • src/resources/extensions/gsd/preferences.ts (add compression_strategy)
  • src/resources/extensions/gsd/types.ts (add CompressionStrategy type)
  • src/resources/extensions/gsd/tests/prompt-compressor.test.ts (NEW)

Note

LLMLingua-2 JS port (@atjsh/llmlingua-2) is experimental. We'll implement the interface with a fallback path so the feature degrades gracefully. If the JS port isn't stable enough, we can use the Compresso REST API as an alternative, or implement a simpler heuristic compression (remove redundant whitespace, deduplicate repeated patterns, abbreviate common programming terms).
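
A sketch of the simpler heuristic fallback mentioned above, covering whitespace and duplicate-line reduction only; an abbreviation map for common programming terms would be a separate, curated addition.

```typescript
// Cheap, deterministic fallback when LLMLingua-2 is unavailable: collapse
// redundant whitespace and drop exact duplicate lines while preserving order.
export function heuristicCompress(text: string): string {
  const seen = new Set<string>();
  const lines: string[] = [];

  for (const raw of text.split("\n")) {
    const line = raw.replace(/[ \t]+/g, " ").trimEnd();
    const key = line.trim();
    if (key && seen.has(key)) continue; // skip exact repeats
    if (key) seen.add(key);
    lines.push(line);
  }

  // Collapse runs of blank lines down to one.
  return lines.join("\n").replace(/\n{3,}/g, "\n\n");
}
```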


Phase 5: Semantic Context Selection (P2)

Goal: Only include semantically relevant content in prompts instead of entire files.

What

diff-context.ts currently selects recently-changed files. auto-prompts.ts inlines entire files. For large files, this wastes tokens on irrelevant sections.

Tasks

  1. Create semantic-chunker.ts — wrapper for semantic text splitting:

    • chunkByRelevance(content: string, query: string, maxChunks: number): string[]
    • Splits content into semantic chunks (function boundaries, class boundaries, etc.)
    • Scores chunks by relevance to the task description
    • Returns top-N most relevant chunks
    • Uses simple TF-IDF scoring (no embeddings needed for v1)
  2. Integrate with inlineFile() — when inlining large files (>2000 chars), chunk and select relevant portions:

    • Extract task description/plan as the "query"
    • Score file chunks against the query
    • Include only high-scoring chunks with [...N chunks omitted] markers
  3. Add context_selection preference:

    • "full": inline entire files (current behavior)
    • "smart": use semantic chunking for files over threshold
    • Auto-enabled for budget and balanced profiles
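
A compact sketch of the TF-IDF relevance scoring from task 1. Chunking here is a naive blank-line split; the real chunker would respect function and class boundaries as described above.

```typescript
const tokenize = (s: string): string[] =>
  s.toLowerCase().match(/[a-z_$][a-z0-9_$]*/g) ?? [];

// Score each chunk against the task description with TF-IDF and keep the top N,
// returned in their original order so the excerpt still reads top-to-bottom.
export function chunkByRelevance(content: string, query: string, maxChunks: number): string[] {
  const chunks = content.split(/\n{2,}/); // naive chunking for the sketch
  const queryTerms = new Set(tokenize(query));

  // Document frequency of each query term across chunks.
  const chunkTermSets = chunks.map((c) => new Set(tokenize(c)));
  const df = new Map<string, number>();
  for (const term of queryTerms) {
    df.set(term, chunkTermSets.filter((t) => t.has(term)).length);
  }

  const scored = chunks.map((chunk, i) => {
    const terms = tokenize(chunk);
    let score = 0;
    for (const term of queryTerms) {
      const tf = terms.filter((t) => t === term).length / Math.max(terms.length, 1);
      const idf = Math.log((chunks.length + 1) / ((df.get(term) ?? 0) + 1)) + 1;
      score += tf * idf;
    }
    return { chunk, i, score };
  });

  return scored
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxChunks)
    .sort((a, b) => a.i - b.i)
    .map((s) => s.chunk);
}
```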

Files Modified

  • src/resources/extensions/gsd/semantic-chunker.ts (NEW)
  • src/resources/extensions/gsd/auto-prompts.ts (integrate with inlineFile)
  • src/resources/extensions/gsd/preferences.ts (add context_selection)
  • src/resources/extensions/gsd/types.ts (add ContextSelectionMode type)
  • src/resources/extensions/gsd/tests/semantic-chunker.test.ts (NEW)

Phase 6: Summary Distillation (P3)

Goal: Produce tighter dependency summaries when budget is constrained.

What

inlineDependencySummaries() currently concatenates full summaries from prior slices. When a slice has many dependencies, this consumes a large portion of the context budget.

Tasks

  1. Create summary-distiller.ts — reduces multiple summaries to a condensed form:

    • distillSummaries(summaries: string[], budgetChars: number): string
    • Extracts key facts: files modified, decisions made, patterns established
    • Removes verbose prose, keeps structured data
    • Preserves all key_files, key_decisions, provides, requires frontmatter
    • Falls back to section-boundary truncation for non-parseable summaries
  2. Integrate with auto-prompts.ts — use distiller when:

    • Dependency count > 2 AND budget is constrained
    • InlineLevel is "minimal" or "standard"
    • Budget pressure is above 50%
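
A sketch of the distiller's shape from task 1, assuming each summary carries YAML-style frontmatter (key_files, key_decisions, provides, requires) followed by prose. Parsing is deliberately naive, and the fallback for non-parseable summaries is simplified to a head slice rather than true section-boundary truncation.

```typescript
// Keep the structured frontmatter block from each summary and drop the prose;
// summaries with no recognizable frontmatter keep only their opening text.
export function distillSummaries(summaries: string[], budgetChars: number): string {
  const distilled = summaries.map((summary) => {
    const match = summary.match(/^---\n([\s\S]*?)\n---/);
    if (match) return match[1].trim(); // frontmatter only
    return summary.slice(0, 400).trim(); // non-parseable: keep the head
  });

  let out = distilled.join("\n\n");
  if (out.length > budgetChars) {
    out = out.slice(0, budgetChars); // last resort: hard cut at the budget
  }
  return out;
}
```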

Files Modified

  • src/resources/extensions/gsd/summary-distiller.ts (NEW)
  • src/resources/extensions/gsd/auto-prompts.ts (integrate with inlineDependencySummaries)
  • src/resources/extensions/gsd/tests/summary-distiller.test.ts (NEW)

Implementation Order

  1. Phase 2 (token counting) — foundation, needed by other phases
  2. Phase 1 (cache optimization) — highest ROI
  3. Phase 3 (TOON format) — quick win on structured data
  4. Phase 6 (summary distillation) — pure logic, no 3rd party
  5. Phase 5 (semantic chunking) — TF-IDF v1, no 3rd party
  6. Phase 4 (prompt compression) — depends on 3rd party stability

Testing Strategy

  • Each phase adds dedicated unit tests
  • Existing tests must continue to pass (no regressions)
  • Token savings tests validate measurable reduction
  • Run full test suite after each phase: npm run test:unit
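
A minimal example of the kind of token-savings assertion this implies, assuming a vitest-style runner; the import paths and the heuristicCompress helper are the hypothetical ones sketched earlier, not confirmed module names.

```typescript
import { describe, expect, it } from "vitest";
import { estimateTokensForProvider } from "../token-counter"; // assumed path
import { heuristicCompress } from "../prompt-compressor"; // hypothetical helper

describe("token savings", () => {
  it("compression reduces the estimated token count", () => {
    const original = "same line repeated\n".repeat(50);
    const compressed = heuristicCompress(original);

    const before = estimateTokensForProvider(original, "anthropic");
    const after = estimateTokensForProvider(compressed, "anthropic");

    expect(after).toBeLessThan(before);
  });
});
```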