Ollama Extension — First-Class Local LLM Support

Status: DRAFT — Awaiting approval

Problem

Ollama support in GSD2 currently requires manual models.json configuration. Users must:

  1. Know the OpenAI-compatibility endpoint (localhost:11434/v1)
  2. Manually list every model they want to use
  3. Set compat flags (supportsDeveloperRole: false, etc.)
  4. Use a dummy API key

There's an ollama-cloud provider for hosted Ollama, and a discovery adapter that can list models, but no first-class local Ollama extension that "just works."
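
Concretely, a manual entry today looks roughly like this (a hypothetical models.json snippet; the field names follow the registration example later in this plan, and the exact top-level shape may differ):

{
  "models": [
    {
      "id": "llama3.1:8b",
      "provider": "ollama",
      "api": "openai-completions",
      "baseUrl": "http://localhost:11434/v1",
      "apiKey": "ollama",
      "contextWindow": 131072,
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false,
        "supportsUsageInStreaming": false
      }
    }
  ]
}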

Goal

Make Ollama the easiest way to use GSD2 — zero config when Ollama is running locally. All Ollama functionality lives in a single extension: src/resources/extensions/ollama/.

Architecture

Everything is a self-contained extension under src/resources/extensions/ollama/. The extension:

  • Auto-detects Ollama on startup via health check
  • Discovers and registers local models with the model registry
  • Provides native Ollama API streaming (not OpenAI shim)
  • Exposes /ollama slash commands for model management
  • Registers an LLM-callable tool for model pull/status

Minimal core changes — only KnownProvider and KnownApi type additions in pi-ai, and env-api-keys.ts for key resolution. Everything else is in the extension.

File Structure

src/resources/extensions/ollama/
├── index.ts                  # Extension entry — wires everything on session_start
├── ollama-client.ts          # HTTP client for Ollama REST API (/api/*)
├── ollama-discovery.ts       # Model discovery + capability detection
├── ollama-provider.ts        # Native /api/chat streaming provider (registers with pi-ai)
├── ollama-commands.ts        # /ollama slash commands (status, pull, list, remove, ps)
├── ollama-tool.ts            # LLM-callable tool for model management
├── model-capabilities.ts     # Known model capability table (context window, vision, reasoning)
└── types.ts                  # Shared types for Ollama API responses

Scope

Phase 1: Auto-Discovery + OpenAI-Compat Routing

What: An extension that auto-detects Ollama, discovers models, and registers them using the existing openai-completions API provider. Zero config needed.

Extension files:

  • ollama/index.ts — Main entry. On session_start (see the sketch after this list):
    1. Probe localhost:11434 (or OLLAMA_HOST) with 1.5s timeout
    2. If reachable, discover models via /api/tags
    3. Register discovered models with ctx.modelRegistry using correct defaults
    4. Show status widget if Ollama is detected
  • ollama/ollama-client.ts — Low-level HTTP client:
    • isRunning() → GET / health check
    • getVersion() → GET /api/version
    • listModels() → GET /api/tags
    • showModel(name) → POST /api/show (details, template, parameters, size)
    • getRunningModels() → GET /api/ps (loaded models, VRAM usage)
    • pullModel(name, onProgress) → POST /api/pull (streaming progress)
    • deleteModel(name) → DELETE /api/delete
    • copyModel(source, dest) → POST /api/copy
    • Respects OLLAMA_HOST env var for non-default endpoints
  • ollama/ollama-discovery.ts — Enhanced model discovery:
    • Calls /api/tags to get model list
    • Calls /api/show per model (batched, cached) to get:
      • details.parameter_size → estimate context window
      • details.families → detect vision (clip), reasoning (deepseek-r1)
      • modelfile → extract default parameters
    • Returns enriched DiscoveredModel[] with proper capabilities
  • ollama/model-capabilities.ts — Known model lookup table:
    • Maps well-known model families to capabilities
    • e.g., llama3.1 → { contextWindow: 131072, input: ["text"] }
    • e.g., llava → { contextWindow: 4096, input: ["text", "image"] }
    • e.g., deepseek-r1 → { reasoning: true, contextWindow: 131072 }
    • e.g., qwen2.5-coder → { contextWindow: 131072, input: ["text"] }
    • Fallback: estimate from parameter count if not in table
  • ollama/types.ts — Ollama API response types
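
To make the session_start flow concrete, here is a rough sketch of the Phase 1 wiring (a sketch only: ExtensionContext, register(), and the inline capability table are illustrative assumptions, not the final API):

// index.ts — sketch of the Phase 1 session_start flow (illustrative, not final API)
const DEFAULT_HOST = "http://localhost:11434";

interface ExtensionContext {
  modelRegistry: { register(model: Record<string, unknown>): void };
}

// Tiny known-model table with a conservative fallback; the full version lives in model-capabilities.ts.
const CAPABILITIES: Record<string, { contextWindow: number; input: string[]; reasoning?: boolean }> = {
  "llama3.1": { contextWindow: 131072, input: ["text"] },
  "llava": { contextWindow: 4096, input: ["text", "image"] },
  "deepseek-r1": { contextWindow: 131072, input: ["text"], reasoning: true },
};

function lookupCapabilities(name: string) {
  return CAPABILITIES[name.split(":")[0]] ?? { contextWindow: 8192, input: ["text"] };
}

// Short-timeout probe so a missing daemon never delays startup.
async function probeOllama(baseUrl: string, timeoutMs = 1500): Promise<boolean> {
  try {
    const res = await fetch(baseUrl, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false; // unreachable: register nothing, stay silent
  }
}

export function onSessionStart(ctx: ExtensionContext): void {
  const baseUrl = process.env.OLLAMA_HOST ?? DEFAULT_HOST;
  // Fire-and-forget: the TUI paints immediately; models appear once the probe resolves.
  void (async () => {
    if (!(await probeOllama(baseUrl))) return;
    const res = await fetch(`${baseUrl}/api/tags`);
    const { models } = (await res.json()) as { models: { name: string }[] };
    for (const m of models) {
      ctx.modelRegistry.register({
        id: m.name,
        provider: "ollama",
        api: "openai-completions",
        baseUrl: `${baseUrl}/v1`,
        cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
        ...lookupCapabilities(m.name),
      });
    }
  })();
}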

Core changes (minimal):

  • packages/pi-ai/src/types.ts — Add "ollama" to KnownProvider
  • packages/pi-ai/src/env-api-keys.ts — Add "ollama" key resolution (returns "ollama" placeholder — no real key needed)
  • src/onboarding.ts — Add "ollama" to provider selection list
  • src/wizard.ts — Add ollama entry (no key required)

Model registration details: Each discovered model registers as:

{
  id: "llama3.1:8b",           // from /api/tags
  name: "Llama 3.1 8B",        // humanized
  api: "openai-completions",    // uses existing provider
  provider: "ollama",
  baseUrl: "http://localhost:11434/v1",
  cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
  reasoning: false,             // from capabilities table
  input: ["text"],              // from capabilities table
  contextWindow: 131072,        // from capabilities table or /api/show
  maxTokens: 16384,             // conservative default
  compat: {
    supportsDeveloperRole: false,
    supportsReasoningEffort: false,
    supportsUsageInStreaming: false,
    maxTokensField: "max_tokens",
  },
}

Behavior:

  • gsd --list-models automatically shows all locally pulled Ollama models
  • /model ollama/llama3.1:8b works without any config file
  • If Ollama isn't running, the extension stays silent — no errors, no models listed
  • models.json overrides still work (user config wins over auto-discovery)

Phase 2: Native Ollama API Provider (/api/chat)

What: A dedicated streaming provider that talks Ollama's native protocol instead of the OpenAI compatibility shim.

Extension files:

  • ollama/ollama-provider.ts — Native /api/chat streaming:
    • Registers "ollama-chat" API with registerApiProvider()
    • Implements stream() and streamSimple():
      • Maps GSD Context → Ollama messages format
      • Maps GSD Tool[] → Ollama tool format
      • Streams NDJSON responses, maps back to AssistantMessage events (see the sketch after this list)
      • Extracts <think> blocks for reasoning models (deepseek-r1, qwq)
    • Ollama-specific options:
      • keep_alive — control model memory retention (default: "5m")
      • num_ctx — pass through model's context window
      • num_predict — max output tokens
      • Temperature, top_p, top_k
    • Response metadata:
      • eval_count / eval_duration → tokens/sec in usage stats
      • total_duration, load_duration → performance visibility
    • Vision support: converts image content to base64 for multimodal models
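
A minimal sketch of the native streaming loop, assuming Ollama's documented /api/chat NDJSON format (the yielded event shape is illustrative; the real provider would map onto pi-ai's AssistantMessage events):

// ollama-provider.ts — sketch of the /api/chat NDJSON streaming loop (illustrative)
interface OllamaChatChunk {
  message?: { role: string; content: string };
  done?: boolean;
  eval_count?: number;
  eval_duration?: number; // nanoseconds
}

export async function* streamChat(
  baseUrl: string,
  model: string,
  messages: { role: string; content: string }[],
): AsyncGenerator<{ type: "text" | "reasoning"; delta: string } | { type: "usage"; tokensPerSec: number }> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: true, keep_alive: "5m" }),
  });
  const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = "";
  let inThink = false;
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += value;
    let nl: number;
    // One JSON object per line (NDJSON).
    while ((nl = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, nl).trim();
      buffer = buffer.slice(nl + 1);
      if (!line) continue;
      const chunk: OllamaChatChunk = JSON.parse(line);
      const text = chunk.message?.content ?? "";
      // Simplified <think> handling; real code must handle tags split across chunks.
      if (text.includes("<think>")) inThink = true;
      if (text) yield { type: inThink ? "reasoning" : "text", delta: text };
      if (text.includes("</think>")) inThink = false;
      if (chunk.done && chunk.eval_count && chunk.eval_duration) {
        yield { type: "usage", tokensPerSec: chunk.eval_count / (chunk.eval_duration / 1e9) };
      }
    }
  }
}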

Core changes:

  • packages/pi-ai/src/types.ts — Add "ollama-chat" to KnownApi

Phase 1 models switch to api: "ollama-chat" by default. Users can force OpenAI-compat via models.json override if needed.

Why native over OpenAI-compat:

  • Full keep_alive / num_ctx control
  • Better error messages (Ollama-native vs generic OpenAI)
  • More reliable tool calling on Ollama's native format
  • Performance metrics in response (tokens/sec)
  • Foundation for model management commands

Phase 3: Local LLM Management UX

What: /ollama slash commands and an LLM tool for model management.

Extension files:

  • ollama/ollama-commands.ts — Slash commands registered via pi.registerCommand():
    • /ollama — Status overview:
      Ollama v0.5.7 — running (localhost:11434)
      
      Loaded:
        llama3.1:8b       4.7 GB VRAM   idle 3m
      
      Available:
        llama3.1:8b       (4.7 GB)
        qwen2.5-coder:7b  (4.4 GB)
        deepseek-r1:8b    (4.9 GB)
      
    • /ollama pull <model> — Pull with streaming progress via ctx.ui.setWidget()
    • /ollama list — List all local models with sizes and families
    • /ollama remove <model> — Delete a model (with confirmation)
    • /ollama ps — Running models + VRAM usage
  • ollama/ollama-tool.ts — LLM-callable tool registered via pi.registerTool() (sketched after this list):
    • ollama_manage tool — lets the agent pull/list/check models
    • Parameters: { action: "list" | "pull" | "status" | "ps", model?: string }
    • Use case: agent detects it needs a model, pulls it automatically
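
The tool registration could look roughly like this (only the tool name, actions, and OllamaClient methods come from this plan; the pi.registerTool() signature and JSON-schema shape are assumptions):

// ollama-tool.ts — illustrative shape of the ollama_manage tool (signature assumed)
import { OllamaClient } from "./ollama-client";

declare const pi: { registerTool(def: unknown): void }; // extension host object named in this plan

pi.registerTool({
  name: "ollama_manage",
  description: "Manage local Ollama models: list, pull, check status, or show running models.",
  parameters: {
    type: "object",
    properties: {
      action: { type: "string", enum: ["list", "pull", "status", "ps"] },
      model: { type: "string", description: "Model name (required for pull)" },
    },
    required: ["action"],
  },
  async execute({ action, model }: { action: "list" | "pull" | "status" | "ps"; model?: string }) {
    const client = new OllamaClient(process.env.OLLAMA_HOST ?? "http://localhost:11434");
    switch (action) {
      case "list":   return client.listModels();
      case "ps":     return client.getRunningModels();
      case "status": return { running: await client.isRunning(), version: await client.getVersion() };
      case "pull":   return client.pullModel(model!, () => { /* stream progress to a UI widget */ });
    }
  },
});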

UX Flow:

$ gsd
> /ollama
Ollama v0.5.7 — running (localhost:11434)
Loaded:
  llama3.1:8b    — 4.7 GB VRAM, idle 3m
Available:
  llama3.1:8b    (4.7 GB)
  qwen2.5-coder:7b (4.4 GB)
  deepseek-r1:8b (4.9 GB)

> /ollama pull codestral:22b
Pulling codestral:22b...
████████████████████████████░░░░ 78% (14.2 GB / 18.1 GB)
✓ codestral:22b ready

> /model ollama/codestral:22b
Switched to codestral:22b (local, Ollama)

Implementation Order

  1. Phase 1 — Auto-discovery with OpenAI-compat routing. Biggest user impact, smallest risk.
  2. Phase 3 — Management UX (/ollama commands). Valuable even before native API.
  3. Phase 2 — Native /api/chat provider. Optimization over OpenAI-compat; do last.

Core Changes Summary (minimal)

  • packages/pi-ai/src/types.ts — Add "ollama" to KnownProvider; add "ollama-chat" to KnownApi (Phase 2)
  • packages/pi-ai/src/env-api-keys.ts — Add "ollama" resolution (always returns the "ollama" placeholder)
  • src/onboarding.ts — Add "ollama" to the provider picker
  • src/wizard.ts — Add "ollama" key mapping (no key required)

Everything else lives in src/resources/extensions/ollama/.

Risks & Mitigations

  • Ollama not running (startup probe latency): 1.5s timeout; cache the result; probe asynchronously so it doesn't block TUI paint
  • Model capabilities unknown: known-model table + /api/show fallback + parameter_size estimation
  • Tool calling unreliable on small models: detect parameter count; warn on models under 7B
  • Ollama API changes between versions: detect version via /api/version; use stable endpoints only
  • Conflicts with models.json Ollama config: user config always wins; auto-discovered models merge beneath manual config
  • Extension disabled: no impact on core; the extension is additive, so disabling it removes all Ollama features cleanly

Testing Strategy

  • Unit tests: ollama-client.ts with mocked fetch responses (see the sketch after this list)
  • Unit tests: ollama-discovery.ts model capability parsing
  • Unit tests: ollama-provider.ts message format mapping + NDJSON stream parsing
  • Unit tests: model-capabilities.ts known model lookups
  • Integration test: mock HTTP server simulating Ollama /api/tags, /api/chat, /api/pull
  • Manual test: real Ollama instance with llama3.1, qwen2.5-coder, deepseek-r1
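
The mocked-fetch tests could look roughly like this (vitest is assumed as the runner, and listModels() is assumed to return objects with a name field; adjust to the repo's actual framework and types):

// ollama-client.test.ts — sketch of mocked-fetch unit tests (vitest assumed)
import { afterEach, describe, expect, it, vi } from "vitest";
import { OllamaClient } from "./ollama-client";

afterEach(() => vi.unstubAllGlobals());

describe("OllamaClient", () => {
  it("listModels() parses /api/tags", async () => {
    vi.stubGlobal("fetch", vi.fn(async () =>
      new Response(JSON.stringify({ models: [{ name: "llama3.1:8b", size: 4_700_000_000 }] })),
    ));
    const client = new OllamaClient("http://localhost:11434");
    const models = await client.listModels();
    expect(models.map((m) => m.name)).toEqual(["llama3.1:8b"]);
  });

  it("isRunning() returns false when the daemon is unreachable", async () => {
    vi.stubGlobal("fetch", vi.fn(async () => { throw new Error("ECONNREFUSED"); }));
    const client = new OllamaClient("http://localhost:11434");
    await expect(client.isRunning()).resolves.toBe(false);
  });
});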

Open Questions

  1. Startup probe — Probe Ollama on session_start (adds ~1.5s if not running) or lazy on first /model? Recommendation: async probe on session_start (non-blocking), eager if OLLAMA_HOST is set.
  2. Auto-start — Try to launch Ollama if installed but not running? Recommendation: no — too invasive. Show helpful message in /ollama status.
  3. Vision support — Support multimodal models (llava, etc.) in Phase 2 native API? Recommendation: yes, detected via capabilities table.
  4. Model refresh — How often to re-probe Ollama for new models? Recommendation: on /ollama list, on /model command, and every 5 min (existing TTL).