Iouri Goussev a952391b33 chore: rename preferences.md to PREFERENCES.md for consistency (#2700) (#2738)

All other .gsd/ state files use uppercase naming (DECISIONS.md,
REQUIREMENTS.md, PROJECT.md, etc). This renames the canonical
preferences file to PREFERENCES.md while keeping a migration
fallback — the loader checks PREFERENCES.md first, then falls
back to lowercase preferences.md for existing installations.

Closes #2700

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-26 16:09:59 -06:00

Plan: Dynamic Model Routing for Token Optimization

Issue: #575 (Token Consumption Optimization through Dynamic Model Selection)
Status: Draft
Date: 2025-03-15

Problem Statement

Users on capped plans (e.g., Claude Pro) exhaust weekly token limits in 15-20 hours of GSD usage. Currently, GSD uses a single model per phase (research/planning/execution/completion), configured statically in preferences, so a simple task runs on the same model as a complex one and draws down the quota at the same per-token rate.

Current Architecture

What Exists

Phase-based model config: Users can set different models per phase via PREFERENCES.md (research, planning, execution, completion)
Fallback chains: Each phase supports fallbacks: [model1, model2] for error recovery
Pre-dispatch hooks: PreDispatchResult has a model field but it's never applied in auto.ts — this is a ready-made extension point
Model registry: ModelRegistry.getAvailable() provides all configured models with metadata
Per-unit metrics: Token counts (input/output/cacheRead/cacheWrite), cost, and model tracked per unit
Budget enforcement: Real-time cost tracking with alerts at 75%/90%/100%

Key Files

File	Role
`src/resources/extensions/gsd/auto.ts`	Dispatch logic, model switching (lines 1791-1879)
`src/resources/extensions/gsd/preferences.ts`	Model resolution, `resolveModelWithFallbacksForUnit()`
`src/resources/extensions/gsd/post-unit-hooks.ts`	Pre-dispatch hooks (model field defined but unused)
`src/resources/extensions/gsd/types.ts`	Type definitions for hooks and model config
`src/resources/extensions/gsd/metrics.ts`	Token tracking, aggregation, cost projection
`src/resources/extensions/gsd/auto-prompts.ts`	Prompt builders per unit type
`packages/pi-coding-agent/src/core/model-registry.ts`	Model availability and metadata

Proposed Design

Core Concept: Task Complexity Classification

Before each unit dispatch, classify the task into a complexity tier and route to an appropriate model. This sits between preference resolution and model dispatch — it can downgrade but never upgrade beyond the user's configured model.

Complexity Tiers

Tier	Complexity	Example Tasks	Default Model
Tier 1 — Light	Low cognitive load, structured output	File reads, search aggregation, simple summaries, completion/summary units	Haiku / cheapest available
Tier 2 — Standard	Moderate reasoning, some creativity	Research synthesis, plan formatting, routine code generation, UAT checks	Sonnet / mid-tier
Tier 3 — Heavy	Complex reasoning, architecture, novel code	Complex execution tasks, replanning, multi-file refactors, debugging	Opus / user's configured model

Classification Signals

The classifier uses heuristic signals available before dispatch (no LLM call needed):

Unit type (strongest signal):
- complete-slice, run-uat → Tier 1 (structured summarization)
- research-milestone, research-slice → Tier 2 (synthesis)
- plan-milestone, plan-slice → Tier 2-3 (depends on scope)
- execute-task → Tier 2-3 (depends on task complexity)
- replan-slice → Tier 3 (requires understanding of failure)
Task metadata (for execution units):
- Lines of code estimated to change (from task plan)
- Number of files involved
- Dependency count
- Whether task involves new file creation vs. modification
- Tags/labels if present (e.g., "refactor", "test", "docs")
Historical performance (adaptive, Phase 2):
- If a Tier 2 model failed and escalated on similar tasks before, default to Tier 3
- Track success rate per tier per unit-type pattern

Architecture

User Preferences (phase → model)
        │
        ▼
resolveModelWithFallbacksForUnit()     ← existing
        │
        ▼
classifyUnitComplexity()               ← NEW: returns Tier 1/2/3
        │
        ▼
resolveModelForComplexity()            ← NEW: maps tier → model from available set
        │
        ▼
maybeDowngradeModel()                  ← NEW: only downgrades from user's configured model
        │
        ▼
Model dispatch (existing auto.ts logic)

Key Design Decisions

Downgrade-only: The classifier can select a cheaper model than configured, never a more expensive one. The user's preference is the ceiling.
Opt-in with easy override: New preference key dynamic_routing.enabled: true|false (default: false). Users who want token savings enable it explicitly.
Escalation on failure: If a lower-tier model fails (tool errors, incomplete output, exceeds retries), automatically escalate to the next tier and retry the unit.
No LLM call for classification: Uses heuristics only — adding an LLM call to save tokens would be counterproductive.
Respects existing fallback chains: Dynamic routing integrates with existing fallbacks — if the dynamically selected model fails, it tries the fallback chain before escalating tiers.
Transparent to user: Dashboard shows which model was selected and why (tier badge in progress widget).
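
The downgrade-only rule can be sketched as a single comparison against the configured model. This is a minimal illustration, not the planned implementation: the `ModelCandidate` shape, its `costRank` field, and the rank values are assumptions, since the plan does not say how model cost is represented.

```typescript
// Hypothetical shape: the plan does not define how model cost is represented.
interface ModelCandidate {
  id: string;
  costRank: number; // assumption: lower rank means cheaper
}

// Keeps the candidate only when it is no more expensive than the configured
// model, so the user's preference acts as a ceiling.
function maybeDowngradeModel(configured: ModelCandidate, candidate: ModelCandidate): ModelCandidate {
  return candidate.costRank <= configured.costRank ? candidate : configured;
}

// Illustrative ranks only, not real prices.
const haikuModel: ModelCandidate = { id: "claude-haiku-4-5", costRank: 1 };
const sonnetModel: ModelCandidate = { id: "claude-sonnet-4-6", costRank: 2 };
const opusModel: ModelCandidate = { id: "claude-opus-4-6", costRank: 3 };

// Haiku is cheaper than the configured Sonnet, so it is kept.
const downgraded = maybeDowngradeModel(sonnetModel, haikuModel);
// Opus exceeds the ceiling, so the configured Sonnet is kept.
const capped = maybeDowngradeModel(sonnetModel, opusModel);
```

Expressing the ceiling as a pure function keeps the rule easy to test in isolation from the classifier and the router.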

Implementation Phases

Phase 1: Foundation — Complexity Classifier & Routing (Core)

Goal: Build the classification and routing system, wire it into dispatch.

1a. Define types and configuration

File: src/resources/extensions/gsd/types.ts

Add ComplexityTier type: 'light' | 'standard' | 'heavy'

Add DynamicRoutingConfig interface:

interface DynamicRoutingConfig {
  enabled: boolean;
  tier_models?: {
    light?: string;    // model ID for light tasks
    standard?: string; // model ID for standard tasks
    heavy?: string;    // model ID for heavy tasks (default: user's configured model)
  };
  escalate_on_failure?: boolean; // default: true
}

File: src/resources/extensions/gsd/preferences.ts

Add dynamic_routing to preference schema
Add validation for the new config
Add loadDynamicRoutingConfig() function
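
A possible shape for the validation step, assuming the preferences file has already been parsed into a plain object. The function name comes from the plan; its parameter, the choice to silently drop malformed values, and the fallback to defaults are assumptions made for this sketch.

```typescript
type ComplexityTier = "light" | "standard" | "heavy";

interface DynamicRoutingConfig {
  enabled: boolean;
  tier_models?: { light?: string; standard?: string; heavy?: string };
  escalate_on_failure?: boolean;
}

// Validates an already-parsed `dynamic_routing` preference value.
// Anything malformed falls back to the documented defaults
// (routing disabled, escalation enabled).
function loadDynamicRoutingConfig(raw: unknown): DynamicRoutingConfig {
  if (typeof raw !== "object" || raw === null) {
    return { enabled: false, escalate_on_failure: true };
  }
  const input = raw as { [key: string]: unknown };
  const config: DynamicRoutingConfig = {
    enabled: input["enabled"] === true,
    escalate_on_failure: input["escalate_on_failure"] !== false,
  };

  const tiers = input["tier_models"];
  if (typeof tiers === "object" && tiers !== null) {
    const tierInput = tiers as { [key: string]: unknown };
    const tierModels: { light?: string; standard?: string; heavy?: string } = {};
    const tierNames: ComplexityTier[] = ["light", "standard", "heavy"];
    for (const tier of tierNames) {
      const value = tierInput[tier];
      // Keep only non-empty string model IDs (assumption: bad entries are dropped).
      if (typeof value === "string" && value.length > 0) tierModels[tier] = value;
    }
    config.tier_models = tierModels;
  }
  return config;
}
```

A stricter variant could report malformed entries to the user instead of dropping them; the plan does not specify which behavior is intended.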

1b. Build complexity classifier

New file: src/resources/extensions/gsd/complexity-classifier.ts

classifyUnitComplexity(unitType, unitId, metadata?) → ComplexityTier
Heuristic rules:
- Unit type mapping (see Tiers table above)
- Task plan analysis: parse task plan file for file count, estimated scope
- Dependency analysis: tasks with 3+ dependencies → bump to heavy
Export getClassificationReason() for dashboard display
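
The heuristic rules above could look roughly like the following. The unit-type mapping follows the Classification Signals list, and where that list gives a range ("Tier 2-3") the sketch starts at the lower tier. The `UnitMetadata` field name, the `classifyWithReason` helper, and the treatment of unknown unit types are assumptions; the plan gives no file-count threshold, so only the dependency rule is shown.

```typescript
type ComplexityTier = "light" | "standard" | "heavy";

// Hypothetical metadata shape; the plan does not fix these field names.
interface UnitMetadata {
  dependencyCount?: number;
}

interface Classification {
  tier: ComplexityTier;
  reason: string; // the kind of text getClassificationReason() could surface
}

const BASE_TIERS: { [unitType: string]: ComplexityTier } = {
  "complete-slice": "light",
  "run-uat": "light",
  "research-milestone": "standard",
  "research-slice": "standard",
  "plan-milestone": "standard",
  "plan-slice": "standard",
  "execute-task": "standard",
  "replan-slice": "heavy",
};

function classifyWithReason(unitType: string, unitId: string, metadata?: UnitMetadata): Classification {
  const base: ComplexityTier | undefined =
    Object.prototype.hasOwnProperty.call(BASE_TIERS, unitType) ? BASE_TIERS[unitType] : undefined;
  if (base === undefined) {
    // Assumption: unknown unit types stay heavy so nothing is downgraded by accident.
    return { tier: "heavy", reason: unitId + ": unknown unit type " + unitType };
  }
  const deps = metadata && metadata.dependencyCount !== undefined ? metadata.dependencyCount : 0;
  // Dependency rule from the plan: tasks with 3+ dependencies are bumped to heavy.
  if (unitType === "execute-task" && deps >= 3) {
    return { tier: "heavy", reason: unitId + ": " + deps + " dependencies" };
  }
  return { tier: base, reason: unitId + ": unit type " + unitType };
}

function classifyUnitComplexity(unitType: string, unitId: string, metadata?: UnitMetadata): ComplexityTier {
  return classifyWithReason(unitType, unitId, metadata).tier;
}
```

Returning the reason alongside the tier means the dashboard text and the routing decision cannot drift apart.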

1c. Build model router

New file: src/resources/extensions/gsd/model-router.ts

resolveModelForComplexity(tier, phaseConfig, availableModels) → ResolvedModelConfig
Logic:
1. Get user's configured model for phase (ceiling)
2. If tier_models configured, use tier-specific model
3. If not configured, use smart defaults from available models (cheapest for light, mid for standard, configured for heavy)
4. Validate selected model is available
5. Return with fallback chain: [tier_model, ...configured_fallbacks, configured_primary]
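
A minimal sketch of steps 1, 2, 4, and 5, under stated assumptions: the `ResolvedModelConfig` and `PhaseModelConfig` shapes are guesses, `tierModels` is passed as an extra parameter that the plan's signature does not list, and `availableIds` stands in for whatever the model registry returns. Step 3 (smart defaults) and the cost-based ceiling check are omitted because they need cost metadata the plan has not defined yet.

```typescript
type ComplexityTier = "light" | "standard" | "heavy";

// Assumed shapes; the real types live in preferences.ts and may differ.
interface PhaseModelConfig {
  primary: string;
  fallbacks: string[];
}
interface ResolvedModelConfig {
  primary: string;
  fallbacks: string[];
}

function resolveModelForComplexity(
  tier: ComplexityTier,
  phaseConfig: PhaseModelConfig,
  tierModels: { light?: string; standard?: string },
  availableIds: string[]
): ResolvedModelConfig {
  const passthrough: ResolvedModelConfig = {
    primary: phaseConfig.primary,
    fallbacks: phaseConfig.fallbacks.slice(),
  };
  // Heavy work stays on the configured model (the ceiling).
  if (tier === "heavy") return passthrough;

  const tierModel = tierModels[tier];
  // No tier model configured, or it is already the configured model.
  if (tierModel === undefined || tierModel === phaseConfig.primary) return passthrough;
  // Step 4: an unavailable tier model falls back to the configured model.
  if (availableIds.indexOf(tierModel) === -1) return passthrough;

  // Step 5: [tier_model, ...configured_fallbacks, configured_primary]
  const fallbacks = phaseConfig.fallbacks
    .filter((id) => id !== tierModel)
    .concat([phaseConfig.primary]);
  return { primary: tierModel, fallbacks: fallbacks };
}
```

Placing the configured primary last in the chain means a failed downgrade still ends on the model the user originally asked for.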

1d. Wire into dispatch

File: src/resources/extensions/gsd/auto.ts

In the model resolution block (lines 1791-1879):
1. After resolveModelWithFallbacksForUnit(), call classifier
2. If dynamic routing enabled, call router to potentially downgrade
3. Log tier and model selection to metrics
4. On unit failure: if using downgraded model, escalate tier and retry
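
Step 4 can be sketched as a retry wrapper. This is one reading of the plan, combined with the "max 1 escalation per unit" cap from Risk Mitigation: a failed downgraded unit is retried once, directly on the configured model. The function name, the `DispatchOutcome` shape, and the synchronous `runUnit` callback are hypothetical (real dispatch is presumably asynchronous), and the fallback chain is left out for brevity.

```typescript
interface DispatchOutcome {
  model: string;
  ok: boolean;
  escalations: number;
}

// runUnit is a stand-in for the real dispatch call; it returns true on success.
function dispatchWithEscalation(
  routedModel: string,
  configuredModel: string,
  escalateOnFailure: boolean,
  runUnit: (model: string) => boolean
): DispatchOutcome {
  if (runUnit(routedModel)) {
    return { model: routedModel, ok: true, escalations: 0 };
  }
  const wasDowngraded = routedModel !== configuredModel;
  if (!escalateOnFailure || !wasDowngraded) {
    // Nothing to escalate to: report the failure as-is.
    return { model: routedModel, ok: false, escalations: 0 };
  }
  // Single escalation, straight to the user's configured model.
  const ok = runUnit(configuredModel);
  return { model: configuredModel, ok: ok, escalations: 1 };
}
```

Capping the retry at one escalation keeps a persistently failing unit from looping through tiers.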

1e. Wire the unused pre-dispatch hook model field

File: src/resources/extensions/gsd/auto.ts

Apply preDispatchResult.model when returned — this is already defined but unused
Allows hooks to override dynamic routing decisions

Tests

New file: src/resources/extensions/gsd/tests/complexity-classifier.test.ts

Test tier assignment for each unit type
Test metadata-based adjustments (file count, dependency count)
Test edge cases (missing metadata, unknown unit types)

New file: src/resources/extensions/gsd/tests/model-router.test.ts

Test downgrade-only behavior (never exceeds configured model)
Test tier-to-model mapping with various available model sets
Test fallback chain construction
Test when dynamic routing is disabled (passthrough)

New file: src/resources/extensions/gsd/tests/dynamic-routing-integration.test.ts

Test full flow: unit → classify → route → dispatch
Test escalation on failure
Test preference loading and validation

Phase 2: Observability & Dashboard

Goal: Make routing decisions visible to users.

2a. Metrics tracking

File: src/resources/extensions/gsd/metrics.ts

Add tier field to UnitMetrics
Add model_downgraded: boolean field
Add escalation_count field
Add aggregateByTier() function
Add formatTierSavings() — show estimated savings from downgrades
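
As an illustration of the aggregation, the sketch below groups units by tier. `TieredUnitMetrics` is a reduced, hypothetical view of `UnitMetrics` containing only the new fields plus a `cost` value; the real interface tracks more, and its field names may differ.

```typescript
type ComplexityTier = "light" | "standard" | "heavy";

// Reduced, hypothetical view of UnitMetrics: only the fields this sketch needs.
interface TieredUnitMetrics {
  tier: ComplexityTier;
  model_downgraded: boolean;
  escalation_count: number;
  cost: number; // assumption: per-unit cost in USD
}

interface TierAggregate {
  units: number;
  cost: number;
  downgraded: number;
  escalations: number;
}

function emptyAggregate(): TierAggregate {
  return { units: 0, cost: 0, downgraded: 0, escalations: 0 };
}

// Sums unit count, cost, downgrades, and escalations per tier.
function aggregateByTier(units: TieredUnitMetrics[]): Record<ComplexityTier, TierAggregate> {
  const totals: Record<ComplexityTier, TierAggregate> = {
    light: emptyAggregate(),
    standard: emptyAggregate(),
    heavy: emptyAggregate(),
  };
  for (const unit of units) {
    const bucket = totals[unit.tier];
    bucket.units += 1;
    bucket.cost += unit.cost;
    if (unit.model_downgraded) bucket.downgraded += 1;
    bucket.escalations += unit.escalation_count;
  }
  return totals;
}
```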

2b. Dashboard integration

File: src/resources/extensions/gsd/auto-dashboard.ts

Add tier badge to unit progress display (e.g., [L], [S], [H])
Add savings summary to completion stats: "Dynamic routing saved ~$X.XX (N units downgraded)"
Color-code tier in token widget
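
The badge and summary strings described above are simple to produce. The function names here are hypothetical; the output formats are taken from the examples in the list.

```typescript
type ComplexityTier = "light" | "standard" | "heavy";

// Maps a tier to its one-letter badge: [L], [S], or [H].
function tierBadge(tier: ComplexityTier): string {
  return "[" + tier.charAt(0).toUpperCase() + "]";
}

// Builds the completion-stats line, with the dollar amount shown to two decimals.
function formatSavingsSummary(savedUsd: number, downgradedUnits: number): string {
  return "Dynamic routing saved ~$" + savedUsd.toFixed(2) + " (" + downgradedUnits + " units downgraded)";
}
```

For example, `formatSavingsSummary(1.5, 3)` yields `Dynamic routing saved ~$1.50 (3 units downgraded)`.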

Tests

Test metrics aggregation by tier
Test savings calculation
Test dashboard formatting

Phase 3: Adaptive Learning (Future)

Goal: Improve classification accuracy over time based on outcomes.

3a. Outcome tracking

File: src/resources/extensions/gsd/complexity-classifier.ts

Track success/failure per tier per unit-type pattern
Store in .gsd/routing-history.json (project-level)
Simple structure: { "execute-task:docs": { "light": { "success": 12, "fail": 1 }, ... } }

3b. Adaptive thresholds

If a tier has >20% failure rate for a pattern, auto-bump default tier
Decay old data (rolling window of last 50 units)
User can reset learning: dynamic_routing_reset: true in preferences
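
The threshold and decay rules can be sketched as follows. One assumption to note: a rolling window needs outcomes in order, so this sketch keeps a list of booleans per pattern rather than the success/fail counters shown in 3a. The function names are hypothetical; the 20% threshold and the window of 50 come from the list above.

```typescript
type ComplexityTier = "light" | "standard" | "heavy";

const WINDOW_SIZE = 50;        // rolling window of last 50 units
const FAILURE_THRESHOLD = 0.2; // bump when the failure rate exceeds 20%

// Appends an outcome (true = success) and drops anything older than the window.
function recordOutcome(outcomes: boolean[], success: boolean): boolean[] {
  return outcomes.concat([success]).slice(-WINDOW_SIZE);
}

// True when more than 20% of the outcomes in the window are failures.
function shouldBumpTier(outcomes: boolean[]): boolean {
  if (outcomes.length === 0) return false;
  const failures = outcomes.filter((ok) => !ok).length;
  return failures / outcomes.length > FAILURE_THRESHOLD;
}

// Moves the default tier one step up; heavy is already the top.
function bumpTier(tier: ComplexityTier): ComplexityTier {
  if (tier === "light") return "standard";
  return "heavy";
}
```

Because the comparison is strict, a pattern sitting at exactly 20% failures is not bumped, which matches the ">20%" wording.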

Tests

Test learning updates on success/failure
Test threshold bumping
Test decay logic
Test reset behavior

Phase 4: Task Plan Introspection (Future)

Goal: Deeper classification using task plan content analysis.

Parse task plan markdown for complexity signals:
- "Create new file" vs. "modify existing"
- Number of code blocks in plan
- Presence of keywords: "refactor", "migration", "architecture", "test", "docs", "config"
- Estimated lines of change (if specified)
Weight these signals alongside unit-type heuristics
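
A rough sketch of the signal extraction, assuming the task plan is available as a markdown string. The `PlanSignals` shape and function name are hypothetical. Estimated lines of change are left out because the plan does not say how they would be written in a task plan, and how the signals are weighted is likewise left open.

```typescript
interface PlanSignals {
  createsNewFile: boolean;
  codeBlockCount: number;
  keywords: string[];
}

const COMPLEXITY_KEYWORDS = ["refactor", "migration", "architecture", "test", "docs", "config"];

function extractPlanSignals(markdown: string): PlanSignals {
  // Each fenced code block contributes an opening and a closing fence line.
  const fenceLines = markdown.match(/^`{3}/gm);
  // Whole-word, case-insensitive match so "test" does not match "latest".
  const keywords = COMPLEXITY_KEYWORDS.filter((kw) =>
    new RegExp("\\b" + kw + "\\b", "i").test(markdown)
  );
  return {
    createsNewFile: /create new file/i.test(markdown),
    codeBlockCount: fenceLines ? Math.floor(fenceLines.length / 2) : 0,
    keywords: keywords,
  };
}
```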

Preference Configuration (User-Facing)

---
version: 1
models:
  research: claude-sonnet-4-6
  planning: claude-opus-4-6
  execution: claude-sonnet-4-6
  completion: claude-sonnet-4-6
dynamic_routing:
  enabled: true
  tier_models:
    light: claude-haiku-4-5
    standard: claude-sonnet-4-6
    # heavy: inherits from phase config (ceiling)
  escalate_on_failure: true
---

Risk Mitigation

Risk	Mitigation
Cheaper model produces low-quality output	Downgrade-only design; escalation on failure; user can disable
Classification overhead adds latency	Heuristics-only, no LLM call; <1ms classification time
Complex preferences confuse users	Disabled by default; works with zero config if enabled (uses smart defaults)
Model not available in user's provider	Validation at preference load; falls back to configured model
Escalation loops	Max 1 escalation per unit; after that, use configured model

Estimated Token Savings

Based on typical GSD session patterns:

~30% of units are completion/summary (Tier 1 candidates)
~40% are research/standard planning (Tier 2 candidates)
~30% are complex execution (Tier 3, no downgrade)

If Haiku is ~10x cheaper than Opus and Sonnet is ~5x cheaper:

Conservative estimate: 20-30% cost reduction with dynamic routing enabled
Aggressive estimate: 40-50% for projects with many small tasks
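
For orientation, the unit mix and price ratios above can be combined into a blended cost. The sketch assumes every unit consumes the same number of tokens and that every phase is configured on Opus; neither assumption comes from the plan. Under those assumptions the blended cost is 0.3 × 0.1 + 0.4 × 0.2 + 0.3 × 1.0 = 0.41 of the all-Opus cost, a reduction of about 59%. The 20-30% and 40-50% estimates are lower, which is what one would expect if heavy units consume more tokens than light ones or if some phases are already configured on a cheaper model.

```typescript
interface TierShare {
  share: number;        // fraction of units in this tier
  relativeCost: number; // cost per unit relative to Opus (1.0)
}

// Weighted average cost per unit, relative to running everything on Opus.
function blendedCost(shares: TierShare[]): number {
  return shares.reduce((sum, s) => sum + s.share * s.relativeCost, 0);
}

const unitMix: TierShare[] = [
  { share: 0.3, relativeCost: 0.1 }, // Tier 1 on Haiku (~10x cheaper)
  { share: 0.4, relativeCost: 0.2 }, // Tier 2 on Sonnet (~5x cheaper)
  { share: 0.3, relativeCost: 1.0 }, // Tier 3 stays on Opus
];

const blended = blendedCost(unitMix); // approximately 0.41
const reduction = 1 - blended;        // approximately 0.59
```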

Resolved Design Decisions

All four open questions resolved as yes — folded into the plan as additional scope:

1. Post-unit hook classification — YES

Hooks get their own complexity classification. Most hooks are lightweight (validation, file checks) and should default to Tier 1. The existing model field on PostUnitHookConfig becomes the ceiling, same as phase models for units.

Implementation: Add to Phase 1d — extend classifyUnitComplexity() to accept hook metadata. Wire into hook dispatch at auto.ts lines 936-946.

2. Budget-pressure-aware routing — YES

As budget usage increases, the classifier becomes more aggressive about downgrading:

<50% budget used: Normal classification
50-75% budget used: Bump Tier 2 candidates down to Tier 1 where possible
75-90% budget used: Only Tier 3 tasks get the configured model; everything else goes to cheapest available
>90% budget used: Everything except replan-slice gets downgraded to cheapest

Implementation: Add to Phase 1b — classifyUnitComplexity() takes budgetPct parameter from existing getBudgetAlertLevel() logic. New function applyBudgetPressure(tier, budgetPct) adjusts the tier.
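
The budget bands can be sketched as below. Three things are assumptions rather than part of the plan: `budgetPct` is taken as a number from 0 to 100, the shared boundaries are resolved so that 50 falls in the second band, 75 in the third, and 90 in the third, and a third `unitType` parameter is added because the >90% rule needs to recognize replan-slice. "Cheapest available" is approximated by returning the light tier.

```typescript
type ComplexityTier = "light" | "standard" | "heavy";

function applyBudgetPressure(tier: ComplexityTier, budgetPct: number, unitType?: string): ComplexityTier {
  // <50% used: normal classification.
  if (budgetPct < 50) return tier;
  // 50-75% used: standard candidates drop to light.
  if (budgetPct < 75) return tier === "standard" ? "light" : tier;
  // 75-90% used: only heavy tasks keep the configured model.
  if (budgetPct <= 90) return tier === "heavy" ? "heavy" : "light";
  // >90% used: everything except replan-slice goes to the cheapest tier.
  return unitType === "replan-slice" ? tier : "light";
}
```

The plan's "where possible" qualifier for the 50-75% band is not modeled here; a fuller version would need a rule for which standard units may not be downgraded.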

3. Multi-provider cost routing — YES

When multiple providers are configured, the router should consider cost differences. If a user has both Anthropic and OpenRouter, pick the cheapest option for the resolved tier.

Implementation:

Add cost_per_1k_tokens metadata to model registry (or maintain a lookup table for known models)
New file: src/resources/extensions/gsd/model-cost-table.ts — static cost table for known models, updatable via preferences
resolveModelForComplexity() ranks available models by cost within a tier's capability range
Preference key: dynamic_routing.cross_provider: true|false (default: true when enabled)

Risk: Cost data goes stale. Mitigate with a bundled cost table that gets updated with GSD releases + user override capability.
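
Ranking by cost across providers could work along these lines. `ProviderOffer` and `cheapestOffer` are hypothetical names, the list of eligible model IDs stands in for the "tier's capability range", and the prices in the usage example are placeholders, not real rates.

```typescript
// Hypothetical cost-table entry: one model as offered by one provider.
interface ProviderOffer {
  provider: string;
  modelId: string;
  costPer1kTokens: number;
}

// Picks the cheapest offer among the models eligible for the resolved tier.
// Returns null when no eligible model is offered by any provider.
function cheapestOffer(offers: ProviderOffer[], eligibleModelIds: string[]): ProviderOffer | null {
  let best: ProviderOffer | null = null;
  for (const offer of offers) {
    if (eligibleModelIds.indexOf(offer.modelId) === -1) continue;
    if (best === null || offer.costPer1kTokens < best.costPer1kTokens) best = offer;
  }
  return best;
}

// Placeholder prices for illustration only.
const sampleOffers: ProviderOffer[] = [
  { provider: "anthropic", modelId: "claude-haiku-4-5", costPer1kTokens: 0.004 },
  { provider: "openrouter", modelId: "claude-haiku-4-5", costPer1kTokens: 0.003 },
  { provider: "anthropic", modelId: "claude-opus-4-6", costPer1kTokens: 0.05 },
];
const pick = cheapestOffer(sampleOffers, ["claude-haiku-4-5"]);
```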

4. User feedback loop — YES

After each unit completes, users can flag the output quality to improve future classification.

Implementation (Phase 3 — Adaptive Learning):

Post-unit prompt option: user can react with /gsd:rate-unit [over|under|ok]
- over = "this could have used a simpler model" → records downgrade signal
- under = "this needed a better model" → records upgrade signal
- ok = confirms current tier was appropriate
Feedback stored alongside outcome data in .gsd/routing-history.json
Classifier weights feedback signals 2x vs. automatic success/failure detection
Skill: gsd:rate-unit — simple command that tags the last completed unit
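
One way to apply the 2x weighting is to fold ratings into the same failure rate that drives adaptive thresholds. This is only a sketch: the `PatternOutcomes` shape and function name are hypothetical, and treating both `ok` and `over` as evidence that the tier was sufficient is an assumption (the plan says `over` records a downgrade signal but not how it is scored).

```typescript
type UnitRating = "over" | "under" | "ok";

interface PatternOutcomes {
  success: number;       // automatic success detections
  fail: number;          // automatic failure detections
  ratings: UnitRating[]; // user feedback from /gsd:rate-unit
}

const FEEDBACK_WEIGHT = 2; // user feedback counts 2x vs. automatic detection

function weightedFailureRate(outcomes: PatternOutcomes): number {
  let good = outcomes.success;
  let bad = outcomes.fail;
  for (const rating of outcomes.ratings) {
    if (rating === "under") bad += FEEDBACK_WEIGHT;
    else good += FEEDBACK_WEIGHT; // "ok" and "over": the tier was at least sufficient
  }
  const total = good + bad;
  return total === 0 ? 0 : bad / total;
}
```

With six automatic successes and a single `under` rating, the rate is 2 / 8 = 0.25, the same as two automatic failures out of eight units.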

Updated Preference Configuration

---
version: 1
models:
  research: claude-sonnet-4-6
  planning: claude-opus-4-6
  execution: claude-sonnet-4-6
  completion: claude-sonnet-4-6
dynamic_routing:
  enabled: true
  tier_models:
    light: claude-haiku-4-5
    standard: claude-sonnet-4-6
    # heavy: inherits from phase config (ceiling)
  escalate_on_failure: true
  budget_pressure: true        # more aggressive downgrading as budget fills
  cross_provider: true          # consider cost across providers
  hooks: true                   # classify hooks too
---

Updated Phase Summary

Phase	Scope	Includes
1 — Foundation	Classifier, router, dispatch, hook classification, budget pressure	Decisions 1 & 2
2 — Observability	Dashboard, tier badges, savings tracking, cost table	Decision 3
3 — Adaptive Learning	Outcome tracking, user feedback (`/gsd:rate-unit`), adaptive thresholds	Decision 4
4 — Task Introspection	Parse task plans for deeper complexity signals	—
