← Documents
Reference · AI Models

AI Models

Every AI model Sponic can call — the local stack on ALPUCA's external SSD plus every API-hosted model wired into the apps. One page to answer "what should I use for X, and what does it cost?"

Prepared 2026-05-06 · Sponic Gardens · Sibling of AI Dinner Host · Live Translation
Local: 168 GB on PortoSams2T Cloud: 7 providers wired (OpenRouter fronts ~300 more) Storage layout consolidated 2026-05-06
Contents
  1. Summary — what to use for what
  2. Local on ALPUCA (PortoSams2T)
  3. Cloud APIs
  4. Proposed storage consolidation
  5. How to refresh this doc

Summary — what to use for what

Pick the smallest, cheapest tool that hits the quality bar. Local is free at the margin but slower; cloud is faster and smarter but priced per token / per second / per image.

WorkloadFirst choiceWhyFallback
Catalog image generation (hero, audition, food)Azure gpt-image-2Already wired through image-gen.ts wrapper. $1k Azure sponsorship credit until 2026-07-31.Google gemini-2.5-flash-image-preview
Local image generation (private / no quota)FLUX.1-dev via mflux venv31 GB on PortoSams2T, runs on ALPUCA's M-series GPU.
Live dinner-host conversationAnthropic Claude Sonnet 4.61–2 sentence replies in 9 languages, persona-tuned.Workers AI Kimi K2.6
Wake-gate (should host respond?)Anthropic Claude Haiku 4.5~70% cost cut vs always-Sonnet.Local gemma4:e4b
Live translation (Android app + venue subtitles)Local Whisper (STT) → gemini-3-flash-preview (translation)Whisper-server runs on ALPUCA :8089/:8090; Gemini does language detection + translation in one call.Gemini 2.5-flash (was the prior model)
Real-time STT for AI dinner-host conversationDeepgram Nova-3Lower latency than local Whisper for sub-second response paths.Local whisper.cpp ggml-large-v3
Batch / offline transcriptionLocal Whisper large-v33 GB on PortoSams2T, runs on ALPUCA — zero per-minute cost.Deepgram batch
Photo embeddings & searchLocal siglip2-so400m4.3 GB. Drives the moondream-indexer photo DB.
Multimodal embeddings (text + image)Local gme-Qwen2-VL-2B8.2 GB. For unified text/image vector search.
Text-to-speech (AI host)gemini-3.1-flash-tts-previewAudio tags ([warmly], [whispers]) for per-line expressiveness; ~70 languages; reuses our existing Google AI key. Wired in apps/ai-host/agent.py via custom gemini_tts.py wrapper.Cartesia Sonic if Gemini TTS rate-limits
Coding assistant (long-context, agentic)Anthropic Claude Opus 4.7 (this CLI)1M context, tool use, sub-agents.Local qwen3-coder:30b
Frontier LLM experiments / researchOpenRouter — one key, ~300 modelsProvider cost + ~5–10% markup. Instant model swaps via slug change.Workers AI Kimi K2.6 · NVIDIA NIM (Nemotron)

Local on ALPUCA (PortoSams2T)

All local models live on the always-connected Samsung T7 external SSD /Volumes/PortoSams2T/. ALPUCA's internal 245 GB drive does not hold any models. Three runtimes load them: Ollama (LLMs), HuggingFace transformers / diffusers (image gen, embeddings), whisper.cpp (audio).

How to tell where models actually live (post-2026-05-06 consolidation): everything is under /Volumes/PortoSams2T/models/{llm,stt,embeddings,image-gen}/. Home-dir entry points: ~/.ollama/models is a directory hard-link to /Volumes/PortoSams2T/models/llm/ollama (same inode); ~/.cache/huggingface symlinks through the legacy back-compat path to /Volumes/PortoSams2T/models/image-gen/huggingface. The old paths (alpuca-offload/ollama/models, huggingface-cache/hub, alpuca-offload/huggingface/hub, models/whisper) still resolve as back-compat symlinks — see the layout section.

Ollama LLMs — 107 GB

Stored at /Volumes/PortoSams2T/models/llm/ollama/ (legacy alias alpuca-offload/ollama/models still resolves). List with ollama list. Run with ollama run <name> or via the OpenAI-compatible HTTP API on http://alpuca:11434/v1/.

ModelSizeParamsStrengthsNotes
deepseek-r1:32b19 GB32 B denseReasoning, chain-of-thoughtDistill from R1, no native vision/tools.
qwen3-vl:30b-a3b19 GB30 B MoE / 3 B activeVision, reasoning, toolsClosest local equivalent to Kimi K2.6 in capability matrix.
qwen3:30b-a3b18 GB30 B MoE / 3 B activeReasoning, function calling, 256k ctxDefault text workhorse.
qwen3-coder:30b18 GB30 BCode generation, repo-awareUse for offline pair-programming.
gemma4:26b17 GB26 BGeneral chatBase for the gemma4 finetune family below.
hermes-gemma4:latest17 GB26 BChat + tools (Nous Hermes finetune)Stronger instruction-following.
gemma4-opencode:latest17 GB26 BOpenCode-style code completion26 B variant.
qwen2.5-coder:14b9.0 GB14 BCode, smaller / fasterUse when 30 B coder is too slow.
g4f:latest · hermes-gemma4-fast9.6 GB ea~9 BFast chatSame blob ID — one is an alias.
gemma4:e4b · gemma4-e4b-opencode9.6 GB ea~4 B effectiveLightweight chat / codeGood wake-gate / classifier candidate.
glm-ocr:latest2.2 GB~2 BDocument OCRSmallest model in the cabinet.

Whisper.cpp — speech-to-text, 5.2 GB

Stored at /Volumes/PortoSams2T/models/stt/whisper/ (legacy alias models/whisper still resolves — the live-translation server's hardcoded path keeps working) as ggml-*.bin files. Loaded directly by whisper.cpp / pywhispercpp consumers. Pick the smallest tier that meets accuracy needs — quality climbs steeply, but so does latency on CPU.

ModelSizeSpeed (M-series)When to use
ggml-large-v3.bin3.0 GB~1× real-time on CPU, faster on MetalProduction transcription, multilingual events. Best accuracy.
ggml-medium.bin1.5 GB~2× real-timeGood accuracy / latency balance.
ggml-small.bin488 MB~5× real-timeQuick batch jobs, English-mostly.
ggml-base.bin148 MB~10× real-timeSmoke tests, embedded scenarios.

HuggingFace cache (active) — 12.5 GB

Stored at /Volumes/PortoSams2T/models/embeddings/huggingface/hub/ (legacy alias huggingface-cache/hub still resolves). These are the embedding models the photo-indexer venv loads at runtime.

RepoSizeTypeUsed by
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct8.2 GBMultimodal embeddings (text + image, unified)Image-text vector search.
google/siglip2-so400m-patch14-3844.3 GBImage embeddingsmoondream-indexer/photo_embeddings_siglip.db — the photo search index.

HuggingFace cache (offload, image gen) — 31 GB

Stored at /Volumes/PortoSams2T/models/image-gen/huggingface/hub/ (legacy alias alpuca-offload/huggingface/hub still resolves). Reached via the ~/.cache/huggingface symlink so any HF library on ALPUCA finds them transparently.

RepoSizeTypeStatus
black-forest-labs/FLUX.1-dev31 GBText-to-image (12 B params)Downloaded — runs via the mflux venv on Apple Silicon.
black-forest-labs/FLUX.1-schnell~0Fast T2I (4-step)Placeholder dir only — not downloaded.
briaai/FIBO~0Background removalPlaceholder dir only — not downloaded.

Image-tools venvs (Python runtimes that load the above)

Stored at /Volumes/PortoSams2T/image-tools/venvs/. Activate then run; full READMEs in each venv dir.

Cloud APIs

Every paid / hosted model the apps actually call. Bold rows are wired into production code; non-bold rows are accessible but not yet integrated.

Cloudflare Workers AI

Account wingsiebird (9cd3a280a54ce2a5b382602f0247b577). Catalog: developers.cloudflare.com/workers-ai/models. Pay-as-you-go, billed per million tokens.

ModelUsePricing (in / cached / out, per M tokens)Notes
@cf/moonshotai/kimi-k2.6Frontier LLM, vision, tools, 262k ctx$0.95 / $0.16 / $4.001 T params. Closest "premium" tier we have through Cloudflare.
@cf/meta/llama-4-* familyGeneral LLMvariesCheap workhorse tier on Workers AI.
@cf/openai/whisper-*STT (cloud-hosted)per-secondFallback if local Whisper is down.
TODO — we don't yet call Workers AI from any app. Add a thin client and document a pattern when the first real use case lands.

Azure OpenAI

Subscription 785e237b-… (Microsoft for Startups Founders Hub Phase 1, $1k credit, expires 2026-07-31). Resource group sponic-ai. Service principal claude-code-sp in BW item 40b98339-3dce-4cec-9eaf-b43f00f43e80.

ModelUsePricing (per image, hi-quality)Wired in
gpt-image-2All Sponic image generation (catalog, hero, food, audition)$0.167 (1024²) · $0.25 (1536×1024)apps/control/src/lib/image-gen.ts — all image generation MUST go through this wrapper.

Google AI (Gemini) — the most-used cloud LLM in this stack

Key in env var GOOGLE_AI_API_KEY. BW item TODO — key currently lives in each app's .env.local. Gemini is doing more work in Sponic than any other cloud LLM — live translation, OCR, voice-note transcription, email classification, image generation. Three families in active use: 3.x (newest, live translation), 2.5 (production workhorse), 2.0 (legacy classifiers).

ModelUsePricingWired in
gemini-3-flash-previewLive translation — ALPUCA subtitle server (powers Android live-translation app + venue subtitles)~$0.30 / $2.50 per M tokenslive-subtitles/server.js on ALPUCA, lines 250 & 301. Replaced gemini-2.5-flash on 2026-05-06.
gemini-3.1-pro-previewHighest-quality 3.1, REST-callable todaypreview tierAvailable; upgrade path from gemini-3-pro-preview. Also has a -customtools variant for tool use.
gemini-3.1-flash-lite-previewSmaller, faster, cheaper 3.1 flash — closest "3.1 flash" REST optionpreview tierCandidate upgrade for live translation; A/B against current gemini-3-flash-preview when ready.
gemini-3.1-flash-image-previewImage gen, Gemini 3.1preview tierREST-callable. Newer than gemini-2.5-flash-image-preview.
gemini-3.1-flash-tts-previewAI dinner-host TTS — audio tags ([warmly], [whispers], [excited]), ~70 languages, multi-speaker, SynthID watermarkingpreview tier (24 kHz mono PCM)apps/ai-host/agent.py via custom apps/ai-host/gemini_tts.py wrapper (no upstream LiveKit plugin yet). Replaced the never-provisioned ElevenLabs path on 2026-05-06.
gemini-3.1-flash-live-previewLive API streaming (WebSocket bidiGenerateContent only)preview tierFuture bidirectional voice path (e.g. AI host streaming). Not REST.
~~gemini-3.1-flash~~ · ~~gemini-3.1-flash-preview~~— not in Google's catalog —Google shipped 3.1 as lite/image/tts/live + a separate pro line, but no general-purpose flagship 3.1-flash yet. Verified via https://generativelanguage.googleapis.com/v1beta/models on 2026-05-06.
gemini-3-pro-previewHigher-quality reasoning when 3-flash isn't enough~$1.25 / $10 per M tokensAvailable, not yet called from any production path.
gemini-2.5-proReceipt OCR — payment record extraction~$1.25 / $10 per M tokensapps/control/supabase/functions/record-payment/gemini-client.ts.
gemini-2.5-flashVoice-note transcription, sponic-pai email reasoning, daily-fact gen, weather Q&A, identity verification, sonos control intent, edit-email-template, generate-whispers~$0.075 / $0.30 per M tokens~10 Supabase functions under apps/control/supabase/functions/.
gemini-2.5-flash-liteCheapest, fastest classifier path~$0.04 / $0.15 per M tokensUsed selectively in _shared/email-classifier.ts.
gemini-2.5-flash-preview-ttsNative TTS (preview)preview tierAvailable but not wired — could replace the (also-not-wired) ElevenLabs path for AI host.
gemini-2.5-flash-image-previewImage-gen fallback when Azure gpt-image-2 quota hits~$0.04 (1024²)Configured in config/project.config.ts; wrapper integration pending in image-gen.ts.
gemini-3-pro-image-previewImage gen, Gemini 3 generationpreview tierAvailable, not wired.
gemini-2.0-flashLegacy sender classification on inbound mail~$0.10 / $0.40 per M tokensresend-inbound-webhook · record-payment/tenant-matcher.ts. Migrate to 2.5-flash when next touched.
Why Gemini for so much of the stack? (a) Gemini 2.5 Flash is the cheapest competent multilingual model available — ~10× cheaper than Sonnet for the same OCR/classification quality. (b) Gemini's REST API speaks plain JSON — no streaming, no SSE, no tool-result loops — which makes it trivial to drop into Supabase edge functions. (c) Native multilingual: live translation works in 9+ languages without per-language tuning.

Anthropic (Claude)

Key in env var ANTHROPIC_API_KEY. Used by AI host, Vapi server, prompt-runner worker, and this Claude Code CLI.

ModelUsePricing (per M tokens, in / out)Wired in
claude-opus-4-7 (1M ctx)Claude Code CLI (this session)$15 / $75This terminal.
claude-sonnet-4-6AI dinner-host brain, prompt-runner default$3 / $15apps/ai-host/agent.py.
claude-haiku-4-5AI host wake-gate (Phase 3), Vapi classifier$1 / $5Pending Phase 3 of AI host.

Deepgram

Key in env var DEEPGRAM_API_KEY.

ModelUsePricingWired in
Nova-3 (streaming)Live STT for venue translation + AI host~$0.0043/min streamingapps/ai-host/agent.py; live-subtitles server on ALPUCA.

ElevenLabs Replaced by Gemini TTS

The AI dinner-host originally scaffolded around elevenlabs.TTS, but Sponic never provisioned an ElevenLabs account. On 2026-05-06 the TTS was swapped to gemini-3.1-flash-tts-preview (see Google AI section above) — reuses the existing Google AI key, supports inline audio tags, ~70 languages. The remaining ElevenLabs reference in apps/control/supabase/functions/vapi-server/index.ts:224 is just a config string passed to the Vapi platform (provider: "11labs"); Vapi handles its own ElevenLabs auth on their side — not used.

OpenRouter

One OpenAI-compatible endpoint (https://openrouter.ai/api/v1) that fronts ~300+ models from Anthropic, OpenAI, Google, Meta, Mistral, DeepSeek, Qwen, MiniMax, xAI/Grok, NVIDIA Nemotron, Cohere, Moonshot, and more — with one API key. Pricing is provider-cost + a small markup; many models have free tiers. This is the right place to reach for "I just want to try model X" without provisioning a new account.

Model slugProvider / familyPricing (in / out, per M tokens)Notes
minimax/minimax-m2MiniMax frontier reasoning~$0.30 / $1.20Likely the "2.5 minimax" reference — confirm slug at openrouter.ai/models.
moonshotai/kimi-k2Kimi K2 (1T params, 256k ctx)~$0.60 / $2.50Cheaper than the Cloudflare-hosted Kimi K2.6 above; check freshness.
deepseek/deepseek-r1DeepSeek R1 reasoning~$0.55 / $2.19Hosted full-fat R1 without running 600 GB locally.
deepseek/deepseek-v3.1DeepSeek V3~$0.27 / $1.10General-purpose, very cheap.
x-ai/grok-4xAI Grok 4~$3 / $15Long context, real-time-aware.
google/gemini-2.5-proGemini 2.5 Pro~$1.25 / $10Alternative to direct Google AI key when convenient.
qwen/qwen3-235b-a22bQwen3 flagship MoE~$0.20 / $0.60Hosted version of the Qwen3 family we run locally.
nvidia/llama-3.1-nemotron-70b-instructNVIDIA Nemotron~$0.12 / $0.30Often free on the OR free tier.
mistralai/mistral-large-2411Mistral Large~$2 / $6European-hosted option for data-residency-sensitive work.
Why route through OpenRouter even when a direct provider key exists? (a) one bill, one usage dashboard, one key to rotate; (b) instant model swaps via slug change — useful for the apps/control/worker/prompt-runner/ pattern; (c) automatic fallback routing when a primary provider is rate-limited. Drawbacks: ~5–10% markup vs direct, and sub-second-latency-sensitive paths (live AI host) should still go direct (Anthropic, Deepgram, ElevenLabs).
Action item: store the OpenRouter API key in BW under DevOps-sponicgarden and add an OPENROUTER_API_KEY entry to the relevant .env.example files. Update config/project.config.ts with a llmGateway section so apps have a documented fallback path.

NVIDIA build.nvidia.com (NIM)

NVIDIA also hosts a catalog of open-weight models directly behind an OpenAI-compatible API at integrate.api.nvidia.com/v1 — useful when you want NVIDIA-tuned variants (Nemotron) or want to avoid the OpenRouter middleman. Free dev tier, usage-based after.

ModelUseNotes
minimaxai/minimax-m2Frontier reasoning + tool-use, very long contextSame model also reachable via OpenRouter above; pick whichever has better latency / lower price for your case.
nvidia/llama-3.1-nemotron-*NVIDIA-tuned Llama variantsOnly on NVIDIA — not all are mirrored to OpenRouter.
nvidia/nemotron-4-340b-instructNVIDIA's largest open instruct modelFree dev tier, useful for batch synthetic-data generation.
Action item: create a BW item under DevOps-sponicgarden for the NVIDIA API key only if you need Nemotron-specific variants — otherwise OpenRouter covers the same MiniMax / DeepSeek / etc. surface with one key.

Storage layout — executed 2026-05-06

Models used to be spread across four sibling directories on PortoSams2T (organic growth, not design). On 2026-05-06 they were consolidated into a single canonical models/ root, with non-model state moved to a sibling runtime/ root. Every old path was preserved as a back-compat symlink so existing consumers (live-translation server, mflux, photo-indexer, anything with hardcoded paths) keep working unchanged.

Current layout (post-consolidation)

/Volumes/PortoSams2T/models/      ← canonical root for every model blob
├── llm/
│   └── ollama/                   107 GB — Ollama LLMs
├── stt/
│   └── whisper/                  4.9 GB — whisper.cpp ggml files
├── embeddings/
│   └── huggingface/{hub,modules,xet}    13 GB — siglip2 + gme-Qwen2-VL
└── image-gen/
    └── huggingface/{hub,xet,token,…}    31 GB — FLUX.1-dev

/Volumes/PortoSams2T/runtime/     ← non-model state
├── image-tools/                  2.9 GB — mflux/moondream/photo-indexer Python venvs
└── moondream-indexer/             10 GB — photo embedding DB

/Volumes/PortoSams2T/             ← back-compat symlinks at every old path
├── models/whisper                  → models/stt/whisper
├── huggingface-cache/{hub,modules,xet}    → models/embeddings/huggingface/*
├── alpuca-offload/ollama/models    → models/llm/ollama
├── alpuca-offload/huggingface/{hub,xet,token,stored_tokens} → models/image-gen/huggingface/*
├── alpuca-offload/moondream-indexer → runtime/moondream-indexer
└── image-tools                     → runtime/image-tools

What broke / what didn't

Why back-compat symlinks instead of env-var swaps? Every consumer that hardcoded an old path now keeps working without any further coordination. The canonical structure is the new models/ tree; the symlinks at old paths are explicit "this is legacy, find me at the new location" markers. Future code should use the new paths directly — the symlinks are a one-way ratchet, not a long-term API.

How to refresh this doc

Run these on ALPUCA when local-model state changes; paste new totals into the tables above.

ssh alpuca@alpuca '
  ollama list
  du -sh /Volumes/PortoSams2T/models/llm/ollama
  du -sh /Volumes/PortoSams2T/models/embeddings/huggingface/hub/*
  du -sh /Volumes/PortoSams2T/models/image-gen/huggingface/hub/*
  ls -la /Volumes/PortoSams2T/models/stt/whisper/
'

For cloud entries, check each provider's pricing page and the relevant BW item under DevOps-sponicgarden:

infra/bin/bw-sponic search anthropic
infra/bin/bw-sponic search azure
infra/bin/bw-sponic search nvidia
infra/bin/bw-sponic search deepgram
infra/bin/bw-sponic search elevenlabs