pi-coding-agent: Context Optimization Opportunities

Status: Research only — not planned for implementation. Scope: packages/pi-coding-agent and packages/pi-agent-core infrastructure. These changes would benefit every consumer of the pi engine, not just SF.


1. Prompt Caching (cache_control) — Highest Impact

Current state: Every LLM call re-pays full input token cost for the system prompt, tool definitions, and context files. No cache_control breakpoints are set anywhere in the API call path.

Opportunity: Anthropic's KV cache delivers 90% cost reduction on cached tokens (0.1x input rate). Claude Code achieves 92–98% cache hit rates by placing stable content before volatile content.

Where to instrument (packages/pi-ai/src/providers/anthropic.ts):

  • Set cache_control: { type: "ephemeral" } on the last tool definition block
  • Set cache_control after the static system prompt sections (base boilerplate + context files)
  • Leave the per-turn user message uncached

Critical constraint: The cache breakpoint must be placed after all static content and before any dynamic content (timestamps, per-request variables). Moving a timestamp before a cache breakpoint defeats it on every call.

Cache hierarchy: Tools → system → messages. Changing a tool definition invalidates system and message caches. Tool definitions should be sorted deterministically (alphabetically) to prevent spurious cache misses.
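
A minimal sketch of the breakpoint placement, assuming the standard @anthropic-ai/sdk messages.create shape; sortedToolDefinitions, staticSystemPrompt, and conversation are placeholders for values the provider layer already has:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Tools sorted deterministically; the breakpoint on the last one caches the whole tool block.
const tools = sortedToolDefinitions.map((tool, i, arr) =>
  i === arr.length - 1 ? { ...tool, cache_control: { type: "ephemeral" as const } } : tool
);

const response = await client.messages.create({
  model: "claude-sonnet-4-5",   // placeholder model id
  max_tokens: 4096,
  tools,
  system: [
    // Stable content (base boilerplate + context files) ends with a breakpoint.
    { type: "text", text: staticSystemPrompt, cache_control: { type: "ephemeral" } },
    // Volatile content (timestamps, per-request variables) goes after the breakpoint.
    { type: "text", text: `Current time: ${new Date().toISOString()}` },
  ],
  messages: conversation,        // per-turn user messages stay uncached
});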

Expected savings: 80–90% reduction in input token cost for multi-turn sessions (the dominant cost pattern in SF auto-mode).


2. Observation Masking in the Message Pipeline

Current state: agent-loop.ts passes the full context.messages array to the LLM on every turn. Tool results from 50 turns ago are re-read in full on every subsequent call. The transformContext hook exists on AgentContext and fires before every LLM call, but has no default implementation — extensions are responsible for any pruning.

Opportunity: Replace old tool result content with lightweight placeholders after N turns. JetBrains Research tested this on SWE-bench Verified (500 tasks, up to 250-turn trajectories) and found:

  • 50%+ cost reduction vs. unmanaged history
  • Performance matched or slightly exceeded LLM summarization
  • Zero overhead (no extra LLM call required)

Proposed implementation (default transformContext in pi-agent-core):

// Keep last KEEP_RECENT_TURNS verbatim; mask older tool results
const KEEP_RECENT_TURNS = 8;

function defaultObservationMask(messages: AgentMessage[]): AgentMessage[] {
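  // findTurnBoundary (not shown) is assumed to return the index of the first message within the last KEEP_RECENT_TURNS turns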
  const cutoff = findTurnBoundary(messages, KEEP_RECENT_TURNS);
  return messages.map((m, i) => {
    if (i >= cutoff) return m;
    if (m.type === "toolResult" || m.type === "bashExecution") {
      return { ...m, content: "[result masked — within summarized history]", excludeFromContext: false };
    }
    return m;
  });
}

Compaction interaction: Observation masking reduces the token accumulation rate, pushing the compaction threshold further out. The two mechanisms are complementary — masking handles the steady state, compaction handles the rare deep-session case.


3. Earlier Compaction Threshold

Current state (packages/pi-coding-agent/src/core/constants.ts):

COMPACTION_RESERVE_TOKENS = 16_384   // triggers at contextWindow - 16K
COMPACTION_KEEP_RECENT_TOKENS = 20_000

For a 200K context window, compaction fires at ~184K tokens — roughly 92% utilization.

Problem: Context drift (not raw exhaustion) causes ~65% of enterprise agent failures. Performance degrades measurably beyond ~30K tokens per Zylos production data. The current threshold lets sessions run degraded for a long stretch before compaction fires.

Opportunity: Lower the trigger to 70% utilization. For a 200K window, this means compacting at ~140K tokens — roughly 44K tokens earlier.

// Proposed
COMPACTION_THRESHOLD_PERCENT = 0.70   // fire at 70% of contextWindow
COMPACTION_RESERVE_TOKENS = contextWindow * (1 - COMPACTION_THRESHOLD_PERCENT)

Trade-off: More frequent compactions, each happening earlier when there's more "fresh" content to keep. Summary quality improves because less material needs to be discarded at each cut.


4. Tool Result Truncation at Write Time

Current state: TOOL_RESULT_MAX_CHARS = 2_000 in constants.ts, but this limit is only applied during compaction summarization, not when the tool result enters the message store. A bash result returning 50KB of log output is stored and re-sent verbatim until compaction fires.

Opportunity: Truncate at write time in convertToLlm() in messages.ts, or in the tool result handler. Two strategies:

  • Hard truncation: Slice at N chars, append "\n[truncated — {original_length} chars]". Simple, zero overhead.
  • Semantic head/tail: Keep first 500 chars (context, command echo) + last 1000 chars (final output, errors). Better for bash results where the end contains the error.

Recommendation: Semantic head/tail as the default, configurable per tool type. File read results benefit from head; bash/test output benefits from head+tail.
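
A minimal sketch of the head/tail variant, with illustrative limits (the per-tool-type configuration is left out):

const HEAD_CHARS = 500;    // command echo and leading context
const TAIL_CHARS = 1_000;  // final output, errors
const MAX_CHARS = 2_000;   // mirrors TOOL_RESULT_MAX_CHARS

function truncateToolResult(content: string): string {
  if (content.length <= MAX_CHARS) return content;
  const omitted = content.length - HEAD_CHARS - TAIL_CHARS;
  return `${content.slice(0, HEAD_CHARS)}\n[truncated ${omitted} chars]\n${content.slice(-TAIL_CHARS)}`;
}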


5. Context File Deduplication and Trim

Current state (packages/pi-coding-agent/src/core/resource-loader.ts, lines 84–109):

  • Searches from ~/.sf/agent/ → ancestor dirs → cwd
  • Deduplicates by file path but not by content
  • Entire file content concatenated verbatim into system prompt — no trimming, no summarization

Anti-pattern: A project with AGENTS.md at 3 ancestor levels (repo root, workspace, home) injects all three in full. If they share common boilerplate, that content is re-injected multiple times.

Opportunities:

  1. Content deduplication: Hash paragraph-level chunks; skip any chunk already seen in a previously-loaded file
  2. Section-aware loading: Parse ## headings in AGENTS.md; only include sections relevant to the current task type (e.g., ## Testing section only when running tests)
  3. Token budget enforcement: If total context files exceed N tokens, summarize oldest/most-distant file rather than including verbatim
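
A rough sketch of the first opportunity, content deduplication, assuming context files arrive as plain strings in load order:

import { createHash } from "node:crypto";

// Drop paragraph-level chunks that were already injected by an earlier context file.
function dedupeContextFiles(files: string[]): string[] {
  const seen = new Set<string>();
  return files.map((file) =>
    file
      .split(/\n{2,}/)
      .filter((chunk) => {
        const hash = createHash("sha256").update(chunk.trim()).digest("hex");
        if (seen.has(hash)) return false;
        seen.add(hash);
        return true;
      })
      .join("\n\n")
  );
}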

6. Skill Content Lazy Loading and Summarization

Current state: When /skill:name is invoked, the full skill file content is injected inline as <skill>...</skill> in the user message. No chunking, no summarization. A 10KB skill file adds ~2,500 tokens to that turn.

Opportunity:

  • Cached skill injection: If the same skill is used across multiple turns (rare but possible), it's re-injected each time. Cache with cache_control after first injection.
  • Skill digest mode: Inject a 200-token summary of the skill on first reference; full content only if the model requests it via a get_skill_detail tool call. Reduces cost for skills that don't end up being followed.
  • Skill prefetching: Before a known long session (e.g., auto-mode start), pre-inject all likely skills with cache_control so they're cached for the entire session.
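
A rough sketch of the digest-mode tool surface; the tool name and schema below are illustrative, not an existing pi API:

// Hypothetical tool the model can call when a skill digest is not enough.
const getSkillDetailTool = {
  name: "get_skill_detail",
  description: "Return the full content of a skill whose digest was injected earlier in the session.",
  input_schema: {
    type: "object" as const,
    properties: {
      skill: { type: "string", description: "Skill name as used in /skill:name" },
    },
    required: ["skill"],
  },
};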

7. Token Estimation Accuracy

Current state (compaction.ts, line 216): chars / 4 heuristic. This undercounts tokens for English prose (closer to ~3.5 chars/token) and is further off for code with short identifiers or Unicode-heavy content.

Opportunity: Use a proper tokenizer.

  • @anthropic-ai/tokenizer (tiktoken-compatible, ships with the SDK) — accurate but ~5ms per call
  • Tiered approach: use chars/4 for display; use proper tokenizer only for compaction threshold decisions (where accuracy matters)
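
A sketch of the tiered approach, with the precise tokenizer injected so the choice of library stays open:

type Tokenizer = (text: string) => number;

// Cheap estimate, good enough for display purposes.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Pay the tokenizer cost only when the cheap estimate is close to the threshold.
function shouldCompact(
  history: string,
  contextWindow: number,
  reserveTokens: number,
  countTokens: Tokenizer,
): boolean {
  const threshold = contextWindow - reserveTokens;
  if (estimateTokens(history) < threshold * 0.9) return false;
  return countTokens(history) >= threshold;
}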

Impact: More accurate compaction timing, fewer unnecessary compactions, slightly better COMPACTION_KEEP_RECENT_TOKENS boundary placement.


8. Format: Markdown over XML for Internal Context

Current state: The message pipeline uses <skill>, <summary>, <compaction> XML wrappers in several places. System prompt sections are largely prose Markdown.

Findings: XML tags carry 15–40% more tokens than equivalent Markdown for the same semantic content, due to paired open/close tags. However, Claude was optimized for XML and shows higher accuracy on tasks requiring precise section parsing.

Recommendation: Audit XML usage in the pipeline and convert to Markdown where the content is:

  • Non-nested (flat instructions, status messages)
  • Human-readable rather than machine-parsed by the model
  • Not requiring precise boundary detection

Keep XML for: few-shot examples with ambiguous boundaries, skill content (requires precise isolation from surrounding text), compaction summaries that the model must treat as authoritative history.

Estimated savings: 5–15% reduction in system prompt token count.


9. Dynamic Tool Set Delivery

Current state: All tool definitions are included in every LLM request. Tool descriptions consume 60–80% of input tokens in static configurations. As new extensions register tools, the baseline grows linearly.

Opportunity (higher complexity): Implement the three-function Dynamic Toolset pattern:

  1. search_tools(query) — semantic search over tool catalog
  2. describe_tools(ids[]) — fetch full schemas on demand
  3. execute_tool(id, params) — unchanged execution

Speakeasy measured 91–97% token reduction with 100% task success rate. Trade-off: 2–3x more tool calls, ~50% longer wall time. Net cost dramatically lower.

Feasibility for pi: The tool registry (packages/pi-coding-agent/src/core/tool-registry.ts) already stores tool metadata separately from definitions. The primary engineering work is the semantic search index and the describe_tools / search_tools tool implementations.
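
A rough sketch of the surface this would add, with names and shapes that are assumptions rather than existing pi types:

interface ToolCatalogEntry {
  id: string;
  summary: string;  // one-line description indexed by search_tools
  schema: object;   // full JSON schema, returned only by describe_tools
}

interface DynamicToolset {
  searchTools(query: string): Promise<Pick<ToolCatalogEntry, "id" | "summary">[]>;
  describeTools(ids: string[]): Promise<ToolCatalogEntry[]>;
  executeTool(id: string, params: unknown): Promise<unknown>; // execution path unchanged
}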


10. Cost Attribution and Per-Phase Reporting

Current state: SessionManager.getUsageTotals() accumulates cost across the entire session. No per-phase or per-agent breakdown is stored. Cost visibility is limited to the footer total and SF_SHOW_TOKEN_COST=1 per-turn display.

Opportunity: Emit structured cost events that extensions can subscribe to:

interface CostCheckpointEvent {
  type: "cost_checkpoint";
  label: string;          // "discuss-phase", "execute-slice-3"
  deltaTokens: Usage;     // tokens since last checkpoint
  cumulativeTokens: Usage;
  cumulativeCost: number;
}

The SF extension could consume these events to surface per-milestone cost in /sf stats and flag milestones that are disproportionately expensive — enabling budget-aware planning.
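
A usage sketch from the extension side; the event bus shown here is a stand-in for whatever subscription surface the engine exposes:

import { EventEmitter } from "node:events";

const bus = new EventEmitter();              // stand-in for the engine's event surface
const costByPhase = new Map<string, number>();

bus.on("cost_checkpoint", (event: CostCheckpointEvent) => {
  // Track the latest cumulative cost seen for each labelled phase, for /sf stats style reporting.
  costByPhase.set(event.label, event.cumulativeCost);
});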


Implementation Ordering (if pursued)

| Priority | Item | Effort | Expected Impact |
|----------|------|--------|-----------------|
| 1 | Prompt caching (cache_control) | Low | 80–90% input cost reduction |
| 2 | Earlier compaction threshold (70%) | Trivial | Reduces drift in long sessions |
| 3 | Tool result truncation at write time | Low | Reduces context bloat between compactions |
| 4 | Context file deduplication | Medium | Variable — high for multi-level AGENTS.md setups |
| 5 | Observation masking (default transformContext) | Medium | 50%+ on long-running agents |
| 6 | Token estimation (proper tokenizer) | Low | Accuracy improvement, minor cost impact |
| 7 | Markdown over XML audit | Low | 5–15% system prompt reduction |
| 8 | Skill caching with cache_control | Low | Meaningful for skill-heavy sessions |
| 9 | Dynamic tool set delivery | High | 90%+ on large tool catalogs; major architecture change |
| 10 | Per-phase cost attribution events | Medium | Visibility only; enables future budget routing |