Ollama LLM provider

Local-inference LLM provider for Piko. It speaks to an Ollama daemon over HTTP, optionally spawns the daemon as a managed subprocess, and pulls missing models on demand. SHA256 digest pinning defends against model substitution.

Overview

This provider implements both llm.ProviderPort (completions) and llm.EmbeddingProviderPort (embeddings) against the Ollama HTTP API. One WithLLMProvider call wires up both. Piko auto-detects the embedding capability, so a separate WithEmbeddingProvider registration is not needed.

Ollama wraps llama.cpp and similar runtimes behind a uniform pull, run, and serve interface. A developer who runs ollama serve on their laptop can host Llama, Qwen, Mistral, Phi, Gemma, and embedding models like nomic-embed-text or all-minilm with a single command. The provider plugs that local daemon into Piko's LLM service.

Ollama is built for local inference. No data leaves the host, there is no per-token cost, and it works offline. Inference runs on whatever hardware you have, so the model quality and latency you can reach depend on that hardware. Quantised 7-billion and 8-billion parameter models run on a laptop CPU. Larger checkpoints need a GPU to be practical. Ollama serves only open-weight checkpoints, so closed frontier models are unavailable.

Two defaults determine where inference cost lands. With AutoStart=true (the default), the provider spawns ollama serve as a managed subprocess if the configured Host is not reachable. That keeps a single Go binary self-contained for local inference, but it places the CPU, GPU, and RAM cost of inference on the application host. With AutoPull=true (also the default), the provider downloads missing models the first time the application requests them, so the first request can stall while gigabytes of weights download. Call EnsureModels at startup to pull the defaults ahead of the first request (see Startup). Pin model versions via ModelWithDigest(name, digest) when supply chain integrity matters.

The provider is a separate Go module, piko.sh/piko/wdk/llm/llm_provider_ollama, with its own go.mod. Adding it with go get also brings in its github.com/ollama/ollama SDK dependency. The provider is pure Go, so it needs no build tag or CGO and runs in interpreted dev mode (dev-i).

Requirements

  • Either the ollama binary, found on $PATH or at the location given by BinaryPath. This is required when AutoStart is true and the daemon is not already running.
  • Or a reachable Ollama server at Host (default http://localhost:11434) with the desired models already pulled.
  • Disk space for model weights (3 to 70 GB per model depending on size and quantisation).
  • Optional GPU drivers (NVIDIA CUDA, Apple Metal). Without them Ollama runs CPU-only, which is slower.

Configuration

import (
    "time"

    "piko.sh/piko/wdk/llm/llm_provider_ollama"
)

autoStart := true
autoPull := true

provider, err := llm_provider_ollama.NewOllamaProvider(llm_provider_ollama.Config{
    Host:                  "http://localhost:11434",
    BinaryPath:            "", // auto-detect from $PATH when empty
    AutoStart:             &autoStart,
    AutoPull:              &autoPull,
    DefaultModel:          llm_provider_ollama.Model("llama3.2"),
    DefaultEmbeddingModel: llm_provider_ollama.Model("all-minilm"),
    HTTPTimeout:           10 * time.Minute,
    ImageFetch:            nil, // nil skips URL-referenced images
})
if err != nil {
    return err
}

Every field has a default. Host defaults to http://localhost:11434, DefaultModel to llama3.2, DefaultEmbeddingModel to all-minilm, HTTPTimeout to 10 minutes, and both AutoStart and AutoPull to true. The HTTPTimeout applies to non-streaming requests only. Streaming calls use per-request context cancellation.
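AutoStart and AutoPull are passed as pointers in the example above, which is the usual Go way to tell "unset" apart from "explicitly false". The sketch below shows how such a field could resolve to its default. The settings type and the boolOr helper are hypothetical names for illustration, not the provider's real code.

```go
package main

import "fmt"

// settings is a hypothetical stand-in for a config struct with optional
// boolean fields. It is not the provider's real Config type.
type settings struct {
	AutoStart *bool
	AutoPull  *bool
}

// boolOr returns *p when the caller set the field and fallback otherwise.
func boolOr(p *bool, fallback bool) bool {
	if p == nil {
		return fallback
	}
	return *p
}

// describe reports the effective values, assuming both fields default to true.
func describe(s settings) string {
	return fmt.Sprintf("autoStart=%t autoPull=%t",
		boolOr(s.AutoStart, true), boolOr(s.AutoPull, true))
}

func main() {
	off := false
	fmt.Println(describe(settings{}))                // autoStart=true autoPull=true
	fmt.Println(describe(settings{AutoStart: &off})) // autoStart=false autoPull=true
}
```

A plain bool could not express this, because its zero value (false) would be indistinguishable from a deliberate opt-out.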

ImageFetch stays nil (disabled) unless you set it, because enabling it makes the provider issue outbound HTTP requests to arbitrary URLs supplied in messages. When it is enabled, zero-valued fields default to a 20 MiB MaxBytes cap and a 30-second Timeout per image. While ImageFetch is nil, the provider skips image URL content parts.
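To make the two limits concrete, here is a minimal standard-library sketch of a download bounded by a byte cap and a deadline. It only illustrates the idea behind MaxBytes and Timeout. The function names are hypothetical, and the provider's actual fetch logic may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

var errTooLarge = errors.New("image exceeds size cap")

// readCapped reads at most maxBytes from r and fails if more data follows.
func readCapped(r io.Reader, maxBytes int64) ([]byte, error) {
	// Read one extra byte so an oversized body is detected rather than
	// silently truncated.
	data, err := io.ReadAll(io.LimitReader(r, maxBytes+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > maxBytes {
		return nil, errTooLarge
	}
	return data, nil
}

// fetchImage downloads url under a deadline and a size cap.
func fetchImage(ctx context.Context, url string, maxBytes int64, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return readCapped(resp.Body, maxBytes)
}

// sampleBody builds an n-byte body for the demonstration below.
func sampleBody(n int) io.Reader {
	return strings.NewReader(strings.Repeat("x", n))
}

func main() {
	data, err := readCapped(sampleBody(4), 4)
	fmt.Println(len(data), err) // 4 <nil>
	_, err = readCapped(sampleBody(5), 4)
	fmt.Println(err) // image exceeds size cap
	// With the documented defaults a real call would resemble:
	// fetchImage(ctx, imageURL, 20<<20, 30*time.Second)
}
```

Both limits matter because the URLs come from message content: the cap bounds memory per image, and the deadline bounds how long a slow or hostile server can hold a request open.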

For supply chain pinning, use ModelWithDigest:

DefaultModel: llm_provider_ollama.ModelWithDigest("llama3.2", "1b226e2802db"),

The provider verifies a pinned digest on every request that uses the default model. Each Complete and Embed call queries the installed model's digest through the Ollama List API and refuses a mismatch as a possible supply chain compromise. Matching is prefix-based, so the truncated digest shown by ollama list works. Per-request model overrides carry no digest and skip verification, so set the digest on the default model you want to lock down.
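Prefix-based matching can be illustrated with a short sketch. It assumes the installed digest is a hex string that may or may not carry a "sha256:" prefix, and it treats an empty pin as "not pinned". The function names and the sample digest are illustrative, not the provider's internals.

```go
package main

import (
	"fmt"
	"strings"
)

// normalise lower-cases a digest and drops an optional "sha256:" prefix.
func normalise(d string) string {
	return strings.TrimPrefix(strings.ToLower(strings.TrimSpace(d)), "sha256:")
}

// verifyDigest accepts the installed digest when it starts with the pinned one.
// An empty pin means "not pinned" and always passes.
func verifyDigest(installed, pinned string) error {
	want := normalise(pinned)
	if want == "" {
		return nil
	}
	if !strings.HasPrefix(normalise(installed), want) {
		return fmt.Errorf("digest mismatch: installed %q does not start with pinned %q",
			installed, pinned)
	}
	return nil
}

func main() {
	// Illustrative value only, not a real model digest.
	installed := "sha256:1b226e2802dbfeedfacecafe"
	fmt.Println(verifyDigest(installed, "1b226e2802db") == nil) // true
	fmt.Println(verifyDigest(installed, "deadbeef0000") == nil) // false
}
```

A 12-character hex prefix checks only 48 bits of the hash. Pinning the full 64-character digest checks all 256 bits, at the cost of a longer configuration value.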

Bootstrap

ssr := piko.New(
    piko.WithLLMProvider("ollama", provider),
    piko.WithDefaultLLMProvider("ollama"),
)

For mixed setups (Ollama for development, OpenAI for production), wire two providers and pick the default by environment:

// defaultProvider holds "ollama" or "openai", chosen from the environment
// before this call.
ssr := piko.New(
    piko.WithLLMProvider("ollama", ollamaProvider),
    piko.WithLLMProvider("openai", openaiProvider),
    piko.WithDefaultLLMProvider(defaultProvider),
)
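The mixed-setup snippet uses a defaultProvider value without showing where it comes from. One way to derive it is sketched below. The APP_ENV variable name and its "production" and "development" values are assumptions made for this example. Only the provider names "ollama" and "openai" come from the snippet above.

```go
package main

import (
	"fmt"
	"os"
)

// pickProvider maps an environment name to a registered provider name.
// Unknown values are rejected so a typo cannot silently change providers.
func pickProvider(env string) (string, error) {
	switch env {
	case "production":
		return "openai", nil
	case "", "development":
		return "ollama", nil
	default:
		return "", fmt.Errorf("unknown environment %q", env)
	}
}

func main() {
	// APP_ENV is a hypothetical variable name used for this sketch.
	defaultProvider, err := pickProvider(os.Getenv("APP_ENV"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("default LLM provider:", defaultProvider)
}
```

The resulting string is what the bootstrap would pass to piko.WithDefaultLLMProvider.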

Startup

EnsureModels pulls the default completion and embedding models if they are not already present, so the first request does not stall on a multi-gigabyte download:

if err := provider.EnsureModels(ctx); err != nil {
    return err
}

It also queries the embedding model's vector dimension, so EmbeddingDimensions returns the correct value right away. Call it once at boot.
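Because a first pull can take a long time, it may be worth bounding the boot-time call with a deadline. The sketch below assumes that EnsureModels honours context cancellation, which this page does not state. The modelEnsurer interface, warmUp helper, and slowPuller fake are hypothetical names that exist only to keep the example runnable without an Ollama daemon.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// modelEnsurer is the single method this sketch needs from the provider.
type modelEnsurer interface {
	EnsureModels(ctx context.Context) error
}

// warmUp runs EnsureModels under a deadline so a slow download cannot
// block startup indefinitely.
func warmUp(parent context.Context, p modelEnsurer, limit time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, limit)
	defer cancel()
	if err := p.EnsureModels(ctx); err != nil {
		return fmt.Errorf("ensure models: %w", err)
	}
	return nil
}

// slowPuller is a fake provider whose "download" takes a fixed duration.
type slowPuller struct{ takes time.Duration }

func (s slowPuller) EnsureModels(ctx context.Context) error {
	select {
	case <-time.After(s.takes):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runDemo wires the fake provider into warmUp.
func runDemo(takes, limit time.Duration) error {
	return warmUp(context.Background(), slowPuller{takes: takes}, limit)
}

func main() {
	fmt.Println(runDemo(10*time.Millisecond, time.Second))      // <nil>
	fmt.Println(runDemo(50*time.Millisecond, 10*time.Millisecond)) // ensure models: context deadline exceeded
}
```

In a real application the limit would need to be generous, since the weights can run to many gigabytes.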

When AutoStart is true and EnsureModels runs, the provider owns the ollama serve child process. On Close, the provider cancels in-flight stream goroutines, waits up to 30 seconds for them to drain, releases idle HTTP connections, and terminates the managed subprocess.

Every Complete, Stream, Embed, and model-pull path records OpenTelemetry counters and histograms under the piko/llm/llm_provider_ollama namespace, so local inference reports the same metrics as the hosted providers.
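The bounded wait that Close performs on stream goroutines follows a common Go pattern: wait on a sync.WaitGroup, but give up after a limit. The sketch below shows that pattern in isolation with fake workers. It is a guess at the general shape, not the provider's implementation, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitWithTimeout returns nil if wg finishes before limit, else an error.
// On timeout the helper goroutine stays alive until the workers finish.
func waitWithTimeout(wg *sync.WaitGroup, limit time.Duration) error {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	timer := time.NewTimer(limit)
	defer timer.Stop()
	select {
	case <-done:
		return nil
	case <-timer.C:
		return fmt.Errorf("streams still running after %s", limit)
	}
}

// drain starts n fake stream goroutines that each run for d,
// then waits up to limit for all of them to finish.
func drain(n int, d, limit time.Duration) error {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(d)
		}()
	}
	return waitWithTimeout(&wg, limit)
}

func main() {
	fmt.Println(drain(3, 10*time.Millisecond, time.Second)) // <nil>
	fmt.Println(drain(3, time.Second, 10*time.Millisecond)) // streams still running after 10ms
}
```

The limit keeps shutdown predictable: a stuck stream delays Close by a bounded amount instead of blocking it forever.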

Tradeoffs

The compute is yours. A 70-billion parameter model needs a large amount of GPU memory, and even a 7-billion-parameter quantised model loads the CPU when you push concurrent requests. Inference latency is also yours. Local Llama runs are slower per token than a hosted frontier API, especially without a GPU. Open-weight model quality has improved fast but still trails the frontier hosted models on hard reasoning, long-context document analysis, and tool-use accuracy. Use Ollama where its strengths (privacy, offline, no marginal cost) buy you something the hosted providers cannot.

See also

Other LLM providers:

  • Anthropic Claude: long-context, strong tool use.
  • OpenAI: widest model selection, native structured-output API.
  • Gemini: cheapest competent option, multimodal-native.
  • Mistral: open-weight European provider (also runs on Ollama).
  • Grok: xAI provider.
  • Voyage: embeddings-only specialist.
  • Zoltai: deterministic fake for tests.

External:

  • Ollama: installer, model library, and CLI.
  • Ollama API: the HTTP shape this provider speaks.