Introduces six new modules that work together to reduce token usage across the dispatch pipeline while preserving semantic content quality:

- Provider-aware token counting with per-provider char/token ratios
- Prompt cache optimizer for maximizing Anthropic/OpenAI cache hit rates
- Structured data formatter (compact notation for decisions/requirements/tasks)
- Deterministic prompt compressor (light/moderate/aggressive levels)
- Semantic chunker with TF-IDF relevance scoring for context selection
- Summary distiller for condensed dependency summaries

Integration points:

- `inlineDependencySummaries` uses distillation before truncation (3+ deps)
- `inlineDecisionsFromDb`/`inlineRequirementsFromDb` use compact format at non-full levels
- `buildExecuteTaskPrompt` compresses carry-forward when it exceeds 40% of budget
- `context-budget.reduceToFit` combines compression with section-boundary truncation
- `computeBudgets` accepts an optional provider for accurate char/token ratios

All existing 1475 unit tests + 30 integration tests pass with zero regressions. 157 new tests cover all optimization modules.
Token Optimization Suite — Implementation Plan
Overview
Comprehensive token optimization across the GSD dispatch pipeline. Six phases targeting prompt caching, accurate token counting, structured data compression, prompt compression, semantic context selection, and context distillation.
Phase 1: Prompt Cache Optimization (P0)
Goal: Restructure dispatch prompt assembly for maximum cache hit rates.
What
Anthropic prompt caching gives 90% savings on cached input tokens. Currently, GSD places `cache_control` on system prompts and the last user message (in `packages/pi-ai/src/providers/anthropic.ts`). But dispatch prompts in `auto-prompts.ts` mix static and dynamic content throughout, reducing cache prefix reuse.
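For reference, this is the shape the optimizer is working toward — an illustrative sketch (not the actual provider code) of Anthropic's `cache_control` breakpoint, where everything before the breakpoint must be byte-identical across requests to score a cache hit:

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Illustrative sketch only — the real wiring lives in the pi-ai Anthropic provider.
const client = new Anthropic();

async function dispatchWithCaching(
  model: string,
  staticPrefix: string,  // templates, plan, decisions — stable across tasks
  dynamicSuffix: string, // task-specific instructions, file contents
) {
  return client.messages.create({
    model,
    max_tokens: 4096,
    system: [
      // Cache breakpoint: everything up to here is reusable across tasks.
      { type: "text", text: staticPrefix, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: dynamicSuffix }],
  });
}
```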
Tasks
- Create `prompt-cache-optimizer.ts` — module that separates prompt content into cacheable (static) and dynamic (per-task) sections (see the sketch after this list):
  - Static: templates, plans, decisions, roadmap, project context
  - Dynamic: task-specific instructions, file contents, overrides
  - Export `splitForCaching(prompt: string, staticSections: string[]): { staticPrefix: string; dynamicSuffix: string }`
- Add `buildCacheablePrefix()` to `auto-prompts.ts` — for each builder, extract the static portion that's reused across tasks in the same slice:
  - Slice plan (same across all tasks in slice)
  - Decisions register (same across all tasks)
  - Requirements (same within scope)
  - Templates (always the same)
- Metrics tracking — extend `metrics.ts` to track `cacheHitRate` per unit. It already tracks `cacheRead` and `cacheWrite` tokens — add a derived percentage.
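A minimal sketch of what `splitForCaching` could look like (an assumed implementation, not the final module) — it moves the listed static sections to the front so they form a stable, cacheable prefix, and pushes everything else into the dynamic suffix:

```typescript
// Sketch only — assumes prompt sections are delimited by "## " headers and that
// `staticSections` lists the header names considered static.
export function splitForCaching(
  prompt: string,
  staticSections: string[],
): { staticPrefix: string; dynamicSuffix: string } {
  const blocks = prompt.split(/(?=^## )/m); // one block per section header

  const staticBlocks: string[] = [];
  const dynamicBlocks: string[] = [];

  for (const block of blocks) {
    const header = block.split("\n", 1)[0].trim();
    const isStatic = staticSections.some((name) => header.includes(name));
    (isStatic ? staticBlocks : dynamicBlocks).push(block);
  }

  // Static content first, so the cacheable prefix is identical across tasks.
  return {
    staticPrefix: staticBlocks.join(""),
    dynamicSuffix: dynamicBlocks.join(""),
  };
}
```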
Files Modified
- `src/resources/extensions/gsd/prompt-cache-optimizer.ts` (NEW)
- `src/resources/extensions/gsd/auto-prompts.ts` (modify builders)
- `src/resources/extensions/gsd/metrics.ts` (add cache hit rate)
- `src/resources/extensions/gsd/tests/prompt-cache-optimizer.test.ts` (NEW)
Phase 2: Accurate Multi-Provider Token Counting (P1)
Goal: Replace GPT-4o-only tiktoken with provider-aware counting.
What
`token-counter.ts` uses tiktoken with the `gpt-4o` encoder for ALL providers. Claude uses a different tokenizer, so counts can be off by 15-25%. This causes budget under/over-allocation.
Tasks
- Add provider-aware counting — extend `countTokens()` to accept an optional `provider` parameter:
  - `anthropic`: use `@anthropic-ai/sdk` `messages.countTokens()` for exact counts
  - `openai`: keep tiktoken (already accurate)
  - `google`/`mistral`/others: keep the chars/4 heuristic (best available)
- Add an `estimateTokensForProvider(text, provider)` function — synchronous estimation that uses provider-specific char ratios (a sketch follows this list):
  - Anthropic: ~3.5 chars/token (Claude's tokenizer yields slightly more tokens per character)
  - OpenAI: ~4 chars/token (tiktoken accurate)
  - Others: ~4 chars/token (conservative default)
- Update `context-budget.ts` — use a provider-aware `CHARS_PER_TOKEN` constant based on the configured execution model's provider.
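A sketch of the synchronous estimator under the assumed ratios above (provider names and exact constants are illustrative):

```typescript
// Sketch only — the ratios mirror the plan above; they are estimates, not exact counts.
const CHARS_PER_TOKEN: Record<string, number> = {
  anthropic: 3.5, // Claude's tokenizer tends to yield more tokens per character
  openai: 4,      // tiktoken gives exact counts; this ratio is only a sync fallback
};

const DEFAULT_CHARS_PER_TOKEN = 4; // conservative default for other providers

export function estimateTokensForProvider(text: string, provider?: string): number {
  const ratio = (provider && CHARS_PER_TOKEN[provider]) || DEFAULT_CHARS_PER_TOKEN;
  return Math.ceil(text.length / ratio);
}
```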
Files Modified
- `src/resources/extensions/gsd/token-counter.ts` (extend)
- `src/resources/extensions/gsd/context-budget.ts` (provider-aware ratio)
- `src/resources/extensions/gsd/tests/token-counter.test.ts` (NEW)
- `src/resources/extensions/gsd/tests/context-budget.test.ts` (extend)
Phase 3: Structured Data Compression with TOON (P1)
Goal: Reduce token usage for structured data blocks in prompts by 30-60%.
What
Decisions registers, requirements lists, task plans, and metrics are passed as verbose markdown tables. TOON (Token-Oriented Object Notation) removes braces/brackets/quotes, using indentation and tabular patterns instead.
Tasks
- Add the `@toon-format/toon` dependency — install the npm package.
- Create `structured-data-formatter.ts` — module that converts structured data to TOON format for prompt injection (see the sketch after this list):
  - `formatDecisionsTOON(decisions: Decision[]): string`
  - `formatRequirementsTOON(requirements: Requirement[]): string`
  - `formatTaskPlanTOON(tasks: TaskPlanEntry[]): string`
  - Each includes a brief format header so the LLM knows how to parse it
- Integrate with `context-store.ts` — add TOON variants of `formatDecisionsForPrompt()` and `formatRequirementsForPrompt()`.
- Gate behind inline level — `minimal` and `standard` use TOON; `full` uses markdown (backward compatible).
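To make the intent concrete, a hand-rolled sketch of the kind of compact tabular output `formatDecisionsTOON` would emit — the real implementation would delegate to `@toon-format/toon`'s encoder, and the `Decision` fields here are placeholders:

```typescript
// Sketch only — approximates TOON-style tabular notation by hand.
// Assumes simple, comma-free field values; the library handles the general case.
interface Decision {
  id: string;
  title: string;
  status: string;
}

export function formatDecisionsTOON(decisions: Decision[]): string {
  const formatHeader =
    "# Compact decisions table: a field-name header, then one comma-separated row per decision.";
  const rows = decisions.map((d) => `  ${d.id},${d.title},${d.status}`);
  return [formatHeader, `decisions[${decisions.length}]{id,title,status}:`, ...rows].join("\n");
}
```

Compared with a markdown table, this drops the pipes, alignment rows, and repeated padding, which is where the targeted 30-60% savings on structured blocks would come from.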
Files Modified
- `package.json` (add dependency)
- `src/resources/extensions/gsd/structured-data-formatter.ts` (NEW)
- `src/resources/extensions/gsd/context-store.ts` (add TOON variants)
- `src/resources/extensions/gsd/auto-prompts.ts` (use TOON when level != full)
- `src/resources/extensions/gsd/tests/structured-data-formatter.test.ts` (NEW)
Phase 4: Prompt Compression via LLMLingua-2 (P2)
Goal: Compress large context blocks 3-5x while preserving semantic meaning.
What
When context exceeds budget, instead of dropping entire sections (current behavior), compress them using LLMLingua-2. This preserves information density while reducing tokens.
Tasks
- Create `prompt-compressor.ts` — wrapper around compression logic:
  - `compressContext(text: string, targetRatio: number): Promise<string>`
  - Supports configurable compression ratios (2x for light, 5x for aggressive)
  - Falls back to section-boundary truncation if compression fails
  - Includes compression stats for metrics
- Integrate with `context-budget.ts` — add a `compressBeforeTruncate` option:
  - When content exceeds budget, try compression first
  - Only truncate if compressed content still exceeds budget
  - Track compression ratio in metrics
- Gate behind preference — new `compression_strategy` preference:
  - `"truncate"` (default, backward-compatible): current section-boundary truncation
  - `"compress"`: use LLMLingua-2 before truncating
  - Budget profile auto-enables compress for `budget` and `balanced`
Files Modified
- `src/resources/extensions/gsd/prompt-compressor.ts` (NEW)
- `src/resources/extensions/gsd/context-budget.ts` (integrate)
- `src/resources/extensions/gsd/preferences.ts` (add compression_strategy)
- `src/resources/extensions/gsd/types.ts` (add CompressionStrategy type)
- `src/resources/extensions/gsd/tests/prompt-compressor.test.ts` (NEW)
Note
The LLMLingua-2 JS port (`@atjsh/llmlingua-2`) is experimental. We'll implement the interface with a fallback path so the feature degrades gracefully. If the JS port isn't stable enough, we can use the Compresso REST API as an alternative, or implement a simpler heuristic compression (remove redundant whitespace, deduplicate repeated patterns, abbreviate common programming terms).
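A sketch of that degradation path (the `llmLinguaCompress` wrapper is hypothetical — the actual call depends on which port or service is adopted; only the heuristic fallback is concrete):

```typescript
// Sketch only — tries the (hypothetical) LLMLingua-2 wrapper first, then falls
// back to cheap deterministic reductions if compression is unavailable or fails.
export async function compressContext(text: string, targetRatio: number): Promise<string> {
  try {
    return await llmLinguaCompress(text, targetRatio);
  } catch {
    return heuristicCompress(text);
  }
}

// Heuristic fallback: collapse runs of whitespace and drop exact duplicate lines.
function heuristicCompress(text: string): string {
  const seen = new Set<string>();
  return text
    .split("\n")
    .map((line) => line.replace(/[ \t]+/g, " ").trimEnd())
    .filter((line) => {
      if (line === "") return true;     // keep blank lines for section structure
      if (seen.has(line)) return false; // deduplicate repeated lines
      seen.add(line);
      return true;
    })
    .join("\n");
}

// Placeholder so the sketch type-checks; not a real package API.
declare function llmLinguaCompress(text: string, targetRatio: number): Promise<string>;
```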
Phase 5: Semantic Context Selection (P2)
Goal: Only include semantically relevant content in prompts instead of entire files.
What
`diff-context.ts` currently selects recently-changed files. `auto-prompts.ts` inlines entire files. For large files, this wastes tokens on irrelevant sections.
Tasks
- Create `semantic-chunker.ts` — wrapper for semantic text splitting (see the sketch after this list):
  - `chunkByRelevance(content: string, query: string, maxChunks: number): string[]`
  - Splits content into semantic chunks (function boundaries, class boundaries, etc.)
  - Scores chunks by relevance to the task description
  - Returns the top-N most relevant chunks
  - Uses simple TF-IDF scoring (no embeddings needed for v1)
- Integrate with `inlineFile()` — when inlining large files (>2000 chars), chunk and select relevant portions:
  - Extract the task description/plan as the "query"
  - Score file chunks against the query
  - Include only high-scoring chunks, with `[...N chunks omitted]` markers
- Add a `context_selection` preference:
  - `"full"`: inline entire files (current behavior)
  - `"smart"`: use semantic chunking for files over the threshold
  - Auto-enabled for `budget` and `balanced` profiles
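A sketch of the TF-IDF v1 (splitting on blank lines rather than true function/class boundaries, which the real chunker would refine):

```typescript
// Sketch only — bag-of-words TF-IDF of each chunk against the task "query".
export function chunkByRelevance(content: string, query: string, maxChunks: number): string[] {
  const chunks = content.split(/\n{2,}/).filter((c) => c.trim().length > 0);
  const tokenize = (s: string) => s.toLowerCase().match(/[a-z0-9_]+/g) ?? [];

  // Document frequency of each term across chunks (the IDF denominator).
  const df = new Map<string, number>();
  for (const chunk of chunks) {
    for (const term of new Set(tokenize(chunk))) df.set(term, (df.get(term) ?? 0) + 1);
  }

  const queryTerms = tokenize(query);
  const scored = chunks.map((chunk, index) => {
    const tokens = tokenize(chunk);
    let score = 0;
    for (const term of queryTerms) {
      const tf = tokens.filter((t) => t === term).length / Math.max(tokens.length, 1);
      const idf = Math.max(Math.log(chunks.length / (1 + (df.get(term) ?? 0))), 0);
      score += tf * idf;
    }
    return { chunk, index, score };
  });

  // Keep the top-N by score, but return them in original document order.
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, maxChunks)
    .sort((a, b) => a.index - b.index)
    .map((s) => s.chunk);
}
```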
Files Modified
- `src/resources/extensions/gsd/semantic-chunker.ts` (NEW)
- `src/resources/extensions/gsd/auto-prompts.ts` (integrate with inlineFile)
- `src/resources/extensions/gsd/preferences.ts` (add context_selection)
- `src/resources/extensions/gsd/types.ts` (add ContextSelectionMode type)
- `src/resources/extensions/gsd/tests/semantic-chunker.test.ts` (NEW)
Phase 6: Summary Distillation (P3)
Goal: Produce tighter dependency summaries when budget is constrained.
What
`inlineDependencySummaries()` currently concatenates full summaries from prior slices. When a slice has many dependencies, this consumes a large portion of the context budget.
Tasks
- Create `summary-distiller.ts` — reduces multiple summaries to a condensed form (see the sketch after this list):
  - `distillSummaries(summaries: string[], budgetChars: number): string`
  - Extracts key facts: files modified, decisions made, patterns established
  - Removes verbose prose, keeps structured data
  - Preserves all `key_files`, `key_decisions`, `provides`, `requires` frontmatter
  - Falls back to section-boundary truncation for non-parseable summaries
- Integrate with `auto-prompts.ts` — use the distiller when:
  - Dependency count > 2 AND budget is constrained
  - InlineLevel is "minimal" or "standard"
  - Budget pressure is above 50%
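A sketch of the distiller's shape (the kept frontmatter keys follow the list above; everything else about summary structure is an assumption):

```typescript
// Sketch only — keeps frontmatter lines for the listed keys plus bullet lines,
// drops prose, and truncates at a line boundary if the result is still over budget.
const KEPT_KEYS = ["key_files", "key_decisions", "provides", "requires"];

export function distillSummaries(summaries: string[], budgetChars: number): string {
  const distilled = summaries.map((summary) => {
    const kept = summary.split("\n").filter(
      (line) =>
        KEPT_KEYS.some((key) => line.trimStart().startsWith(`${key}:`)) ||
        /^\s*[-*]\s/.test(line), // keep structured bullets, drop prose paragraphs
    );
    // Non-parseable summary: keep it whole and let truncation handle it.
    return kept.length > 0 ? kept.join("\n") : summary;
  });

  let out = distilled.join("\n---\n");
  if (out.length > budgetChars) {
    out = out.slice(0, budgetChars);
    out = out.slice(0, out.lastIndexOf("\n") + 1) + "[...distilled summaries truncated]";
  }
  return out;
}
```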
Files Modified
- `src/resources/extensions/gsd/summary-distiller.ts` (NEW)
- `src/resources/extensions/gsd/auto-prompts.ts` (integrate with inlineDependencySummaries)
- `src/resources/extensions/gsd/tests/summary-distiller.test.ts` (NEW)
Implementation Order
- Phase 2 (token counting) — foundation, needed by other phases
- Phase 1 (cache optimization) — highest ROI
- Phase 3 (TOON format) — quick win on structured data
- Phase 6 (summary distillation) — pure logic, no 3rd party
- Phase 5 (semantic chunking) — TF-IDF v1, no 3rd party
- Phase 4 (prompt compression) — depends on 3rd party stability
Testing Strategy
- Each phase adds dedicated unit tests
- Existing tests must continue to pass (no regressions)
- Token savings tests validate measurable reduction
- Run the full test suite after each phase: `npm run test:unit`