AI Models
Every AI model Sponic can call — the local stack on ALPUCA's external SSD plus every API-hosted model wired into the apps. One page to answer "what should I use for X, and what does it cost?"
Summary — what to use for what
Pick the smallest, cheapest tool that hits the quality bar. Local is free at the margin but slower; cloud is faster and smarter but priced per token / per second / per image.
| Workload | First choice | Why | Fallback |
|---|---|---|---|
| Catalog image generation (hero, audition, food) | Azure gpt-image-2 | Already wired through image-gen.ts wrapper. $1k Azure sponsorship credit until 2026-07-31. | Google gemini-2.5-flash-image-preview |
| Local image generation (private / no quota) | FLUX.1-dev via mflux venv | 31 GB on PortoSams2T, runs on ALPUCA's M-series GPU. | — |
| Live dinner-host conversation | Anthropic Claude Sonnet 4.6 | 1–2 sentence replies in 9 languages, persona-tuned. | Workers AI Kimi K2.6 |
| Wake-gate (should host respond?) | Anthropic Claude Haiku 4.5 | ~70% cost cut vs always-Sonnet. | Local gemma4:e4b |
| Live translation (Android app + venue subtitles) | Local Whisper (STT) → gemini-3-flash-preview (translation) | Whisper-server runs on ALPUCA :8089/:8090; Gemini does language detection + translation in one call. | Gemini 2.5-flash (was the prior model) |
| Real-time STT for AI dinner-host conversation | Deepgram Nova-3 | Lower latency than local Whisper for sub-second response paths. | Local whisper.cpp ggml-large-v3 |
| Batch / offline transcription | Local Whisper large-v3 | 3 GB on PortoSams2T, runs on ALPUCA — zero per-minute cost. | Deepgram batch |
| Photo embeddings & search | Local siglip2-so400m | 4.3 GB. Drives the moondream-indexer photo DB. | — |
| Multimodal embeddings (text + image) | Local gme-Qwen2-VL-2B | 8.2 GB. For unified text/image vector search. | — |
| Text-to-speech (AI host) | gemini-3.1-flash-tts-preview | Audio tags ([warmly], [whispers]) for per-line expressiveness; ~70 languages; reuses our existing Google AI key. Wired in apps/ai-host/agent.py via custom gemini_tts.py wrapper. | Cartesia Sonic if Gemini TTS rate-limits |
| Coding assistant (long-context, agentic) | Anthropic Claude Opus 4.7 (this CLI) | 1M context, tool use, sub-agents. | Local qwen3-coder:30b |
| Frontier LLM experiments / research | OpenRouter — one key, ~300 models | Provider cost + ~5–10% markup. Instant model swaps via slug change. | Workers AI Kimi K2.6 · NVIDIA NIM (Nemotron) |
Local on ALPUCA (PortoSams2T)
All local models live on the always-connected Samsung T7 external SSD /Volumes/PortoSams2T/. ALPUCA's internal 245 GB drive does not hold any models. Three runtimes load them: Ollama (LLMs), HuggingFace transformers / diffusers (image gen, embeddings), whisper.cpp (audio).
/Volumes/PortoSams2T/models/{llm,stt,embeddings,image-gen}/. Home-dir entry points: ~/.ollama/models is a directory hard-link to /Volumes/PortoSams2T/models/llm/ollama (same inode); ~/.cache/huggingface symlinks through the legacy back-compat path to /Volumes/PortoSams2T/models/image-gen/huggingface. The old paths (alpuca-offload/ollama/models, huggingface-cache/hub, alpuca-offload/huggingface/hub, models/whisper) still resolve as back-compat symlinks — see the layout section.
Ollama LLMs — 107 GB
Stored at /Volumes/PortoSams2T/models/llm/ollama/ (legacy alias alpuca-offload/ollama/models still resolves). List with ollama list. Run with ollama run <name> or via the OpenAI-compatible HTTP API on http://alpuca:11434/v1/.
| Model | Size | Params | Strengths | Notes |
|---|---|---|---|---|
deepseek-r1:32b | 19 GB | 32 B dense | Reasoning, chain-of-thought | Distill from R1, no native vision/tools. |
qwen3-vl:30b-a3b | 19 GB | 30 B MoE / 3 B active | Vision, reasoning, tools | Closest local equivalent to Kimi K2.6 in capability matrix. |
qwen3:30b-a3b | 18 GB | 30 B MoE / 3 B active | Reasoning, function calling, 256k ctx | Default text workhorse. |
qwen3-coder:30b | 18 GB | 30 B | Code generation, repo-aware | Use for offline pair-programming. |
gemma4:26b | 17 GB | 26 B | General chat | Base for the gemma4 finetune family below. |
hermes-gemma4:latest | 17 GB | 26 B | Chat + tools (Nous Hermes finetune) | Stronger instruction-following. |
gemma4-opencode:latest | 17 GB | 26 B | OpenCode-style code completion | 26 B variant. |
qwen2.5-coder:14b | 9.0 GB | 14 B | Code, smaller / faster | Use when 30 B coder is too slow. |
g4f:latest · hermes-gemma4-fast | 9.6 GB ea | ~9 B | Fast chat | Same blob ID — one is an alias. |
gemma4:e4b · gemma4-e4b-opencode | 9.6 GB ea | ~4 B effective | Lightweight chat / code | Good wake-gate / classifier candidate. |
glm-ocr:latest | 2.2 GB | ~2 B | Document OCR | Smallest model in the cabinet. |
Whisper.cpp — speech-to-text, 5.2 GB
Stored at /Volumes/PortoSams2T/models/stt/whisper/ (legacy alias models/whisper still resolves — the live-translation server's hardcoded path keeps working) as ggml-*.bin files. Loaded directly by whisper.cpp / pywhispercpp consumers. Pick the smallest tier that meets accuracy needs — quality climbs steeply, but so does latency on CPU.
| Model | Size | Speed (M-series) | When to use |
|---|---|---|---|
ggml-large-v3.bin | 3.0 GB | ~1× real-time on CPU, faster on Metal | Production transcription, multilingual events. Best accuracy. |
ggml-medium.bin | 1.5 GB | ~2× real-time | Good accuracy / latency balance. |
ggml-small.bin | 488 MB | ~5× real-time | Quick batch jobs, English-mostly. |
ggml-base.bin | 148 MB | ~10× real-time | Smoke tests, embedded scenarios. |
HuggingFace cache (active) — 12.5 GB
Stored at /Volumes/PortoSams2T/models/embeddings/huggingface/hub/ (legacy alias huggingface-cache/hub still resolves). These are the embedding models the photo-indexer venv loads at runtime.
| Repo | Size | Type | Used by |
|---|---|---|---|
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | 8.2 GB | Multimodal embeddings (text + image, unified) | Image-text vector search. |
google/siglip2-so400m-patch14-384 | 4.3 GB | Image embeddings | moondream-indexer/photo_embeddings_siglip.db — the photo search index. |
HuggingFace cache (offload, image gen) — 31 GB
Stored at /Volumes/PortoSams2T/models/image-gen/huggingface/hub/ (legacy alias alpuca-offload/huggingface/hub still resolves). Reached via the ~/.cache/huggingface symlink so any HF library on ALPUCA finds them transparently.
| Repo | Size | Type | Status |
|---|---|---|---|
black-forest-labs/FLUX.1-dev | 31 GB | Text-to-image (12 B params) | Downloaded — runs via the mflux venv on Apple Silicon. |
black-forest-labs/FLUX.1-schnell | ~0 | Fast T2I (4-step) | Placeholder dir only — not downloaded. |
briaai/FIBO | ~0 | Background removal | Placeholder dir only — not downloaded. |
Image-tools venvs (Python runtimes that load the above)
Stored at /Volumes/PortoSams2T/image-tools/venvs/. Activate then run; full READMEs in each venv dir.
venvs/mflux/— FLUX.1 image gen (Apple Silicon MLX port). Output to~/mflux-output/.venvs/moondream/— Moondream VLM for photo captioning.venvs/photo-indexer/— Face recognition + Flask search API. Loads siglip2 + facenet-pytorch.
Cloud APIs
Every paid / hosted model the apps actually call. Bold rows are wired into production code; non-bold rows are accessible but not yet integrated.
Cloudflare Workers AI
Account wingsiebird (9cd3a280a54ce2a5b382602f0247b577). Catalog: developers.cloudflare.com/workers-ai/models. Pay-as-you-go, billed per million tokens.
| Model | Use | Pricing (in / cached / out, per M tokens) | Notes |
|---|---|---|---|
@cf/moonshotai/kimi-k2.6 | Frontier LLM, vision, tools, 262k ctx | $0.95 / $0.16 / $4.00 | 1 T params. Closest "premium" tier we have through Cloudflare. |
@cf/meta/llama-4-* family | General LLM | varies | Cheap workhorse tier on Workers AI. |
@cf/openai/whisper-* | STT (cloud-hosted) | per-second | Fallback if local Whisper is down. |
Azure OpenAI
Subscription 785e237b-… (Microsoft for Startups Founders Hub Phase 1, $1k credit, expires 2026-07-31). Resource group sponic-ai. Service principal claude-code-sp in BW item 40b98339-3dce-4cec-9eaf-b43f00f43e80.
| Model | Use | Pricing (per image, hi-quality) | Wired in |
|---|---|---|---|
gpt-image-2 | All Sponic image generation (catalog, hero, food, audition) | $0.167 (1024²) · $0.25 (1536×1024) | apps/control/src/lib/image-gen.ts — all image generation MUST go through this wrapper. |
Google AI (Gemini) — the most-used cloud LLM in this stack
Key in env var GOOGLE_AI_API_KEY. BW item TODO — key currently lives in each app's .env.local. Gemini is doing more work in Sponic than any other cloud LLM — live translation, OCR, voice-note transcription, email classification, image generation. Three families in active use: 3.x (newest, live translation), 2.5 (production workhorse), 2.0 (legacy classifiers).
| Model | Use | Pricing | Wired in |
|---|---|---|---|
gemini-3-flash-preview | Live translation — ALPUCA subtitle server (powers Android live-translation app + venue subtitles) | ~$0.30 / $2.50 per M tokens | live-subtitles/server.js on ALPUCA, lines 250 & 301. Replaced gemini-2.5-flash on 2026-05-06. |
gemini-3.1-pro-preview | Highest-quality 3.1, REST-callable today | preview tier | Available; upgrade path from gemini-3-pro-preview. Also has a -customtools variant for tool use. |
gemini-3.1-flash-lite-preview | Smaller, faster, cheaper 3.1 flash — closest "3.1 flash" REST option | preview tier | Candidate upgrade for live translation; A/B against current gemini-3-flash-preview when ready. |
gemini-3.1-flash-image-preview | Image gen, Gemini 3.1 | preview tier | REST-callable. Newer than gemini-2.5-flash-image-preview. |
gemini-3.1-flash-tts-preview | AI dinner-host TTS — audio tags ([warmly], [whispers], [excited]), ~70 languages, multi-speaker, SynthID watermarking | preview tier (24 kHz mono PCM) | apps/ai-host/agent.py via custom apps/ai-host/gemini_tts.py wrapper (no upstream LiveKit plugin yet). Replaced the never-provisioned ElevenLabs path on 2026-05-06. |
gemini-3.1-flash-live-preview | Live API streaming (WebSocket bidiGenerateContent only) | preview tier | Future bidirectional voice path (e.g. AI host streaming). Not REST. |
~~gemini-3.1-flash~~ · ~~gemini-3.1-flash-preview~~ | — not in Google's catalog — | — | Google shipped 3.1 as lite/image/tts/live + a separate pro line, but no general-purpose flagship 3.1-flash yet. Verified via https://generativelanguage.googleapis.com/v1beta/models on 2026-05-06. |
gemini-3-pro-preview | Higher-quality reasoning when 3-flash isn't enough | ~$1.25 / $10 per M tokens | Available, not yet called from any production path. |
gemini-2.5-pro | Receipt OCR — payment record extraction | ~$1.25 / $10 per M tokens | apps/control/supabase/functions/record-payment/gemini-client.ts. |
gemini-2.5-flash | Voice-note transcription, sponic-pai email reasoning, daily-fact gen, weather Q&A, identity verification, sonos control intent, edit-email-template, generate-whispers | ~$0.075 / $0.30 per M tokens | ~10 Supabase functions under apps/control/supabase/functions/. |
gemini-2.5-flash-lite | Cheapest, fastest classifier path | ~$0.04 / $0.15 per M tokens | Used selectively in _shared/email-classifier.ts. |
gemini-2.5-flash-preview-tts | Native TTS (preview) | preview tier | Available but not wired — could replace the (also-not-wired) ElevenLabs path for AI host. |
gemini-2.5-flash-image-preview | Image-gen fallback when Azure gpt-image-2 quota hits | ~$0.04 (1024²) | Configured in config/project.config.ts; wrapper integration pending in image-gen.ts. |
gemini-3-pro-image-preview | Image gen, Gemini 3 generation | preview tier | Available, not wired. |
gemini-2.0-flash | Legacy sender classification on inbound mail | ~$0.10 / $0.40 per M tokens | resend-inbound-webhook · record-payment/tenant-matcher.ts. Migrate to 2.5-flash when next touched. |
Anthropic (Claude)
Key in env var ANTHROPIC_API_KEY. Used by AI host, Vapi server, prompt-runner worker, and this Claude Code CLI.
| Model | Use | Pricing (per M tokens, in / out) | Wired in |
|---|---|---|---|
claude-opus-4-7 (1M ctx) | Claude Code CLI (this session) | $15 / $75 | This terminal. |
claude-sonnet-4-6 | AI dinner-host brain, prompt-runner default | $3 / $15 | apps/ai-host/agent.py. |
claude-haiku-4-5 | AI host wake-gate (Phase 3), Vapi classifier | $1 / $5 | Pending Phase 3 of AI host. |
Deepgram
Key in env var DEEPGRAM_API_KEY.
| Model | Use | Pricing | Wired in |
|---|---|---|---|
| Nova-3 (streaming) | Live STT for venue translation + AI host | ~$0.0043/min streaming | apps/ai-host/agent.py; live-subtitles server on ALPUCA. |
ElevenLabs Replaced by Gemini TTS
The AI dinner-host originally scaffolded around elevenlabs.TTS, but Sponic never provisioned an ElevenLabs account. On 2026-05-06 the TTS was swapped to gemini-3.1-flash-tts-preview (see Google AI section above) — reuses the existing Google AI key, supports inline audio tags, ~70 languages. The remaining ElevenLabs reference in apps/control/supabase/functions/vapi-server/index.ts:224 is just a config string passed to the Vapi platform (provider: "11labs"); Vapi handles its own ElevenLabs auth on their side — not used.
OpenRouter
One OpenAI-compatible endpoint (https://openrouter.ai/api/v1) that fronts ~300+ models from Anthropic, OpenAI, Google, Meta, Mistral, DeepSeek, Qwen, MiniMax, xAI/Grok, NVIDIA Nemotron, Cohere, Moonshot, and more — with one API key. Pricing is provider-cost + a small markup; many models have free tiers. This is the right place to reach for "I just want to try model X" without provisioning a new account.
| Model slug | Provider / family | Pricing (in / out, per M tokens) | Notes |
|---|---|---|---|
minimax/minimax-m2 | MiniMax frontier reasoning | ~$0.30 / $1.20 | Likely the "2.5 minimax" reference — confirm slug at openrouter.ai/models. |
moonshotai/kimi-k2 | Kimi K2 (1T params, 256k ctx) | ~$0.60 / $2.50 | Cheaper than the Cloudflare-hosted Kimi K2.6 above; check freshness. |
deepseek/deepseek-r1 | DeepSeek R1 reasoning | ~$0.55 / $2.19 | Hosted full-fat R1 without running 600 GB locally. |
deepseek/deepseek-v3.1 | DeepSeek V3 | ~$0.27 / $1.10 | General-purpose, very cheap. |
x-ai/grok-4 | xAI Grok 4 | ~$3 / $15 | Long context, real-time-aware. |
google/gemini-2.5-pro | Gemini 2.5 Pro | ~$1.25 / $10 | Alternative to direct Google AI key when convenient. |
qwen/qwen3-235b-a22b | Qwen3 flagship MoE | ~$0.20 / $0.60 | Hosted version of the Qwen3 family we run locally. |
nvidia/llama-3.1-nemotron-70b-instruct | NVIDIA Nemotron | ~$0.12 / $0.30 | Often free on the OR free tier. |
mistralai/mistral-large-2411 | Mistral Large | ~$2 / $6 | European-hosted option for data-residency-sensitive work. |
apps/control/worker/prompt-runner/ pattern; (c) automatic fallback routing when a primary provider is rate-limited. Drawbacks: ~5–10% markup vs direct, and sub-second-latency-sensitive paths (live AI host) should still go direct (Anthropic, Deepgram, ElevenLabs).
DevOps-sponicgarden and add an OPENROUTER_API_KEY entry to the relevant .env.example files. Update config/project.config.ts with a llmGateway section so apps have a documented fallback path.
NVIDIA build.nvidia.com (NIM)
NVIDIA also hosts a catalog of open-weight models directly behind an OpenAI-compatible API at integrate.api.nvidia.com/v1 — useful when you want NVIDIA-tuned variants (Nemotron) or want to avoid the OpenRouter middleman. Free dev tier, usage-based after.
| Model | Use | Notes |
|---|---|---|
minimaxai/minimax-m2 | Frontier reasoning + tool-use, very long context | Same model also reachable via OpenRouter above; pick whichever has better latency / lower price for your case. |
nvidia/llama-3.1-nemotron-* | NVIDIA-tuned Llama variants | Only on NVIDIA — not all are mirrored to OpenRouter. |
nvidia/nemotron-4-340b-instruct | NVIDIA's largest open instruct model | Free dev tier, useful for batch synthetic-data generation. |
DevOps-sponicgarden for the NVIDIA API key only if you need Nemotron-specific variants — otherwise OpenRouter covers the same MiniMax / DeepSeek / etc. surface with one key.
Storage layout — executed 2026-05-06
Models used to be spread across four sibling directories on PortoSams2T (organic growth, not design). On 2026-05-06 they were consolidated into a single canonical models/ root, with non-model state moved to a sibling runtime/ root. Every old path was preserved as a back-compat symlink so existing consumers (live-translation server, mflux, photo-indexer, anything with hardcoded paths) keep working unchanged.
Current layout (post-consolidation)
/Volumes/PortoSams2T/models/ ← canonical root for every model blob
├── llm/
│ └── ollama/ 107 GB — Ollama LLMs
├── stt/
│ └── whisper/ 4.9 GB — whisper.cpp ggml files
├── embeddings/
│ └── huggingface/{hub,modules,xet} 13 GB — siglip2 + gme-Qwen2-VL
└── image-gen/
└── huggingface/{hub,xet,token,…} 31 GB — FLUX.1-dev
/Volumes/PortoSams2T/runtime/ ← non-model state
├── image-tools/ 2.9 GB — mflux/moondream/photo-indexer Python venvs
└── moondream-indexer/ 10 GB — photo embedding DB
/Volumes/PortoSams2T/ ← back-compat symlinks at every old path
├── models/whisper → models/stt/whisper
├── huggingface-cache/{hub,modules,xet} → models/embeddings/huggingface/*
├── alpuca-offload/ollama/models → models/llm/ollama
├── alpuca-offload/huggingface/{hub,xet,token,stored_tokens} → models/image-gen/huggingface/*
├── alpuca-offload/moondream-indexer → runtime/moondream-indexer
└── image-tools → runtime/image-tools
What broke / what didn't
- Ollama: nothing —
~/.ollama/modelsis a directory hard-link (same inode 1983236 asalpuca-offload/ollama/models). Moving the SSD-side directory entry to the new location preserved the inode, so the hard-link from home still resolves. Daemon restarted, all 13 models listed. - HF caches: nothing — both old paths are now symlinks to the new locations.
~/.cache/huggingfacestill resolves throughalpuca-offload/huggingface→ new image-gen path. - Whisper: nothing — the live-translation server's hardcoded
/Volumes/PortoSams2T/models/whisper/...path still resolves via the symlink tomodels/stt/whisper. - Python venvs (mflux, moondream, photo-indexer): nothing — the
image-toolstop-level path is still a valid entry (now a symlink).
models/ tree; the symlinks at old paths are explicit "this is legacy, find me at the new location" markers. Future code should use the new paths directly — the symlinks are a one-way ratchet, not a long-term API.
How to refresh this doc
Run these on ALPUCA when local-model state changes; paste new totals into the tables above.
ssh alpuca@alpuca '
ollama list
du -sh /Volumes/PortoSams2T/models/llm/ollama
du -sh /Volumes/PortoSams2T/models/embeddings/huggingface/hub/*
du -sh /Volumes/PortoSams2T/models/image-gen/huggingface/hub/*
ls -la /Volumes/PortoSams2T/models/stt/whisper/
'
For cloud entries, check each provider's pricing page and the relevant BW item under DevOps-sponicgarden:
infra/bin/bw-sponic search anthropic
infra/bin/bw-sponic search azure
infra/bin/bw-sponic search nvidia
infra/bin/bw-sponic search deepgram
infra/bin/bw-sponic search elevenlabs