Reference · AI Models

AI Models

Every AI model Sponic can call — the local stack on ALPUCA's external SSD plus every API-hosted model wired into the apps. One page to answer "what should I use for X, and what does it cost?"

Prepared 2026-05-06 · Sponic Gardens · Sibling of AI Dinner Host · Live Translation

Local: 168 GB on PortoSams2T Cloud: 7 providers wired (OpenRouter fronts ~300 more) Storage layout consolidated 2026-05-06

Contents

Summary — what to use for what
Local on ALPUCA (PortoSams2T)
Cloud APIs
Proposed storage consolidation
How to refresh this doc

Summary — what to use for what

Pick the smallest, cheapest tool that hits the quality bar. Local is free at the margin but slower; cloud is faster and smarter but priced per token / per second / per image.

Workload	First choice	Why	Fallback
Catalog image generation (hero, audition, food)	Azure `gpt-image-2`	Already wired through `image-gen.ts` wrapper. $1k Azure sponsorship credit until 2026-07-31.	Google `gemini-2.5-flash-image-preview`
Local image generation (private / no quota)	`FLUX.1-dev` via mflux venv	31 GB on PortoSams2T, runs on ALPUCA's M-series GPU.	—
Live dinner-host conversation	Anthropic Claude Sonnet 4.6	1–2 sentence replies in 9 languages, persona-tuned.	Workers AI Kimi K2.6
Wake-gate (should host respond?)	Anthropic Claude Haiku 4.5	~70% cost cut vs always-Sonnet.	Local `gemma4:e4b`
Live translation (Android app + venue subtitles)	Local Whisper (STT) → `gemini-3-flash-preview` (translation)	Whisper-server runs on ALPUCA :8089/:8090; Gemini does language detection + translation in one call.	Gemini `2.5-flash` (was the prior model)
Real-time STT for AI dinner-host conversation	Deepgram Nova-3	Lower latency than local Whisper for sub-second response paths.	Local `whisper.cpp ggml-large-v3`
Batch / offline transcription	Local Whisper large-v3	3 GB on PortoSams2T, runs on ALPUCA — zero per-minute cost.	Deepgram batch
Photo embeddings & search	Local `siglip2-so400m`	4.3 GB. Drives the moondream-indexer photo DB.	—
Multimodal embeddings (text + image)	Local `gme-Qwen2-VL-2B`	8.2 GB. For unified text/image vector search.	—
Text-to-speech (AI host)	`gemini-3.1-flash-tts-preview`	Audio tags (`[warmly]`, `[whispers]`) for per-line expressiveness; ~70 languages; reuses our existing Google AI key. Wired in `apps/ai-host/agent.py` via custom `gemini_tts.py` wrapper.	Cartesia Sonic if Gemini TTS rate-limits
Coding assistant (long-context, agentic)	Anthropic Claude Opus 4.7 (this CLI)	1M context, tool use, sub-agents.	Local `qwen3-coder:30b`
Frontier LLM experiments / research	OpenRouter — one key, ~300 models	Provider cost + ~5–10% markup. Instant model swaps via slug change.	Workers AI Kimi K2.6 · NVIDIA NIM (Nemotron)

Local on ALPUCA (PortoSams2T)

All local models live on the always-connected Samsung T7 external SSD /Volumes/PortoSams2T/. ALPUCA's internal 245 GB drive does not hold any models. Three runtimes load them: Ollama (LLMs), HuggingFace transformers / diffusers (image gen, embeddings), whisper.cpp (audio).

How to tell where models actually live (post-2026-05-06 consolidation): everything is under /Volumes/PortoSams2T/models/{llm,stt,embeddings,image-gen}/. Home-dir entry points: ~/.ollama/models is a directory hard-link to /Volumes/PortoSams2T/models/llm/ollama (same inode); ~/.cache/huggingface symlinks through the legacy back-compat path to /Volumes/PortoSams2T/models/image-gen/huggingface. The old paths (alpuca-offload/ollama/models, huggingface-cache/hub, alpuca-offload/huggingface/hub, models/whisper) still resolve as back-compat symlinks — see the layout section.

Ollama LLMs — 107 GB

Stored at /Volumes/PortoSams2T/models/llm/ollama/ (legacy alias alpuca-offload/ollama/models still resolves). List with ollama list. Run with ollama run <name> or via the OpenAI-compatible HTTP API on http://alpuca:11434/v1/.

Model	Size	Params	Strengths	Notes
`deepseek-r1:32b`	19 GB	32 B dense	Reasoning, chain-of-thought	Distill from R1, no native vision/tools.
`qwen3-vl:30b-a3b`	19 GB	30 B MoE / 3 B active	Vision, reasoning, tools	Closest local equivalent to Kimi K2.6 in capability matrix.
`qwen3:30b-a3b`	18 GB	30 B MoE / 3 B active	Reasoning, function calling, 256k ctx	Default text workhorse.
`qwen3-coder:30b`	18 GB	30 B	Code generation, repo-aware	Use for offline pair-programming.
`gemma4:26b`	17 GB	26 B	General chat	Base for the gemma4 finetune family below.
`hermes-gemma4:latest`	17 GB	26 B	Chat + tools (Nous Hermes finetune)	Stronger instruction-following.
`gemma4-opencode:latest`	17 GB	26 B	OpenCode-style code completion	26 B variant.
`qwen2.5-coder:14b`	9.0 GB	14 B	Code, smaller / faster	Use when 30 B coder is too slow.
`g4f:latest` · `hermes-gemma4-fast`	9.6 GB ea	~9 B	Fast chat	Same blob ID — one is an alias.
`gemma4:e4b` · `gemma4-e4b-opencode`	9.6 GB ea	~4 B effective	Lightweight chat / code	Good wake-gate / classifier candidate.
`glm-ocr:latest`	2.2 GB	~2 B	Document OCR	Smallest model in the cabinet.

Whisper.cpp — speech-to-text, 5.2 GB

Stored at /Volumes/PortoSams2T/models/stt/whisper/ (legacy alias models/whisper still resolves — the live-translation server's hardcoded path keeps working) as ggml-*.bin files. Loaded directly by whisper.cpp / pywhispercpp consumers. Pick the smallest tier that meets accuracy needs — quality climbs steeply, but so does latency on CPU.

Model	Size	Speed (M-series)	When to use
`ggml-large-v3.bin`	3.0 GB	~1× real-time on CPU, faster on Metal	Production transcription, multilingual events. Best accuracy.
`ggml-medium.bin`	1.5 GB	~2× real-time	Good accuracy / latency balance.
`ggml-small.bin`	488 MB	~5× real-time	Quick batch jobs, English-mostly.
`ggml-base.bin`	148 MB	~10× real-time	Smoke tests, embedded scenarios.

HuggingFace cache (active) — 12.5 GB

Stored at /Volumes/PortoSams2T/models/embeddings/huggingface/hub/ (legacy alias huggingface-cache/hub still resolves). These are the embedding models the photo-indexer venv loads at runtime.

Repo	Size	Type	Used by
`Alibaba-NLP/gme-Qwen2-VL-2B-Instruct`	8.2 GB	Multimodal embeddings (text + image, unified)	Image-text vector search.
`google/siglip2-so400m-patch14-384`	4.3 GB	Image embeddings	`moondream-indexer/photo_embeddings_siglip.db` — the photo search index.

HuggingFace cache (offload, image gen) — 31 GB

Stored at /Volumes/PortoSams2T/models/image-gen/huggingface/hub/ (legacy alias alpuca-offload/huggingface/hub still resolves). Reached via the ~/.cache/huggingface symlink so any HF library on ALPUCA finds them transparently.

Repo	Size	Type	Status
`black-forest-labs/FLUX.1-dev`	31 GB	Text-to-image (12 B params)	Downloaded — runs via the mflux venv on Apple Silicon.
`black-forest-labs/FLUX.1-schnell`	~0	Fast T2I (4-step)	Placeholder dir only — not downloaded.
`briaai/FIBO`	~0	Background removal	Placeholder dir only — not downloaded.

Image-tools venvs (Python runtimes that load the above)

Stored at /Volumes/PortoSams2T/image-tools/venvs/. Activate then run; full READMEs in each venv dir.

venvs/mflux/ — FLUX.1 image gen (Apple Silicon MLX port). Output to ~/mflux-output/.
venvs/moondream/ — Moondream VLM for photo captioning.
venvs/photo-indexer/ — Face recognition + Flask search API. Loads siglip2 + facenet-pytorch.

Cloud APIs

Every paid / hosted model the apps actually call. Bold rows are wired into production code; non-bold rows are accessible but not yet integrated.

Cloudflare Workers AI

Account wingsiebird (9cd3a280a54ce2a5b382602f0247b577). Catalog: developers.cloudflare.com/workers-ai/models. Pay-as-you-go, billed per million tokens.

Model	Use	Pricing (in / cached / out, per M tokens)	Notes
`@cf/moonshotai/kimi-k2.6`	Frontier LLM, vision, tools, 262k ctx	$0.95 / $0.16 / $4.00	1 T params. Closest "premium" tier we have through Cloudflare.
`@cf/meta/llama-4-*` family	General LLM	varies	Cheap workhorse tier on Workers AI.
`@cf/openai/whisper-*`	STT (cloud-hosted)	per-second	Fallback if local Whisper is down.

TODO — we don't yet call Workers AI from any app. Add a thin client and document a pattern when the first real use case lands.

Azure OpenAI

Subscription 785e237b-… (Microsoft for Startups Founders Hub Phase 1, $1k credit, expires 2026-07-31). Resource group sponic-ai. Service principal claude-code-sp in BW item 40b98339-3dce-4cec-9eaf-b43f00f43e80.

Model	Use	Pricing (per image, hi-quality)	Wired in
`gpt-image-2`	All Sponic image generation (catalog, hero, food, audition)	$0.167 (1024²) · $0.25 (1536×1024)	`apps/control/src/lib/image-gen.ts` — all image generation MUST go through this wrapper.

Google AI (Gemini) — the most-used cloud LLM in this stack

Key in env var GOOGLE_AI_API_KEY. BW item TODO — key currently lives in each app's .env.local. Gemini is doing more work in Sponic than any other cloud LLM — live translation, OCR, voice-note transcription, email classification, image generation. Three families in active use: 3.x (newest, live translation), 2.5 (production workhorse), 2.0 (legacy classifiers).

Model	Use	Pricing	Wired in
`gemini-3-flash-preview`	Live translation — ALPUCA subtitle server (powers Android live-translation app + venue subtitles)	~$0.30 / $2.50 per M tokens	`live-subtitles/server.js` on ALPUCA, lines 250 & 301. Replaced `gemini-2.5-flash` on 2026-05-06.
`gemini-3.1-pro-preview`	Highest-quality 3.1, REST-callable today	preview tier	Available; upgrade path from `gemini-3-pro-preview`. Also has a `-customtools` variant for tool use.
`gemini-3.1-flash-lite-preview`	Smaller, faster, cheaper 3.1 flash — closest "3.1 flash" REST option	preview tier	Candidate upgrade for live translation; A/B against current `gemini-3-flash-preview` when ready.
`gemini-3.1-flash-image-preview`	Image gen, Gemini 3.1	preview tier	REST-callable. Newer than `gemini-2.5-flash-image-preview`.
`gemini-3.1-flash-tts-preview`	AI dinner-host TTS — audio tags (`[warmly]`, `[whispers]`, `[excited]`), ~70 languages, multi-speaker, SynthID watermarking	preview tier (24 kHz mono PCM)	`apps/ai-host/agent.py` via custom `apps/ai-host/gemini_tts.py` wrapper (no upstream LiveKit plugin yet). Replaced the never-provisioned ElevenLabs path on 2026-05-06.
`gemini-3.1-flash-live-preview`	Live API streaming (WebSocket `bidiGenerateContent` only)	preview tier	Future bidirectional voice path (e.g. AI host streaming). Not REST.
~~`gemini-3.1-flash`~~ · ~~`gemini-3.1-flash-preview`~~	— not in Google's catalog —	—	Google shipped 3.1 as `lite`/`image`/`tts`/`live` + a separate `pro` line, but no general-purpose flagship 3.1-flash yet. Verified via `https://generativelanguage.googleapis.com/v1beta/models` on 2026-05-06.
`gemini-3-pro-preview`	Higher-quality reasoning when 3-flash isn't enough	~$1.25 / $10 per M tokens	Available, not yet called from any production path.
`gemini-2.5-pro`	Receipt OCR — payment record extraction	~$1.25 / $10 per M tokens	`apps/control/supabase/functions/record-payment/gemini-client.ts`.
`gemini-2.5-flash`	Voice-note transcription, sponic-pai email reasoning, daily-fact gen, weather Q&A, identity verification, sonos control intent, edit-email-template, generate-whispers	~$0.075 / $0.30 per M tokens	~10 Supabase functions under `apps/control/supabase/functions/`.
`gemini-2.5-flash-lite`	Cheapest, fastest classifier path	~$0.04 / $0.15 per M tokens	Used selectively in `_shared/email-classifier.ts`.
`gemini-2.5-flash-preview-tts`	Native TTS (preview)	preview tier	Available but not wired — could replace the (also-not-wired) ElevenLabs path for AI host.
`gemini-2.5-flash-image-preview`	Image-gen fallback when Azure `gpt-image-2` quota hits	~$0.04 (1024²)	Configured in `config/project.config.ts`; wrapper integration pending in `image-gen.ts`.
`gemini-3-pro-image-preview`	Image gen, Gemini 3 generation	preview tier	Available, not wired.
`gemini-2.0-flash`	Legacy sender classification on inbound mail	~$0.10 / $0.40 per M tokens	`resend-inbound-webhook` · `record-payment/tenant-matcher.ts`. Migrate to 2.5-flash when next touched.

Why Gemini for so much of the stack? (a) Gemini 2.5 Flash is the cheapest competent multilingual model available — ~10× cheaper than Sonnet for the same OCR/classification quality. (b) Gemini's REST API speaks plain JSON — no streaming, no SSE, no tool-result loops — which makes it trivial to drop into Supabase edge functions. (c) Native multilingual: live translation works in 9+ languages without per-language tuning.

Anthropic (Claude)

Key in env var ANTHROPIC_API_KEY. Used by AI host, Vapi server, prompt-runner worker, and this Claude Code CLI.

Model	Use	Pricing (per M tokens, in / out)	Wired in
`claude-opus-4-7` (1M ctx)	Claude Code CLI (this session)	$15 / $75	This terminal.
`claude-sonnet-4-6`	AI dinner-host brain, prompt-runner default	$3 / $15	`apps/ai-host/agent.py`.
`claude-haiku-4-5`	AI host wake-gate (Phase 3), Vapi classifier	$1 / $5	Pending Phase 3 of AI host.

Deepgram

Key in env var DEEPGRAM_API_KEY.

Model	Use	Pricing	Wired in
Nova-3 (streaming)	Live STT for venue translation + AI host	~$0.0043/min streaming	`apps/ai-host/agent.py`; live-subtitles server on ALPUCA.

ElevenLabs Replaced by Gemini TTS

The AI dinner-host originally scaffolded around elevenlabs.TTS, but Sponic never provisioned an ElevenLabs account. On 2026-05-06 the TTS was swapped to gemini-3.1-flash-tts-preview (see Google AI section above) — reuses the existing Google AI key, supports inline audio tags, ~70 languages. The remaining ElevenLabs reference in apps/control/supabase/functions/vapi-server/index.ts:224 is just a config string passed to the Vapi platform (provider: "11labs"); Vapi handles its own ElevenLabs auth on their side — not used.

OpenRouter

One OpenAI-compatible endpoint (https://openrouter.ai/api/v1) that fronts ~300+ models from Anthropic, OpenAI, Google, Meta, Mistral, DeepSeek, Qwen, MiniMax, xAI/Grok, NVIDIA Nemotron, Cohere, Moonshot, and more — with one API key. Pricing is provider-cost + a small markup; many models have free tiers. This is the right place to reach for "I just want to try model X" without provisioning a new account.

Model slug	Provider / family	Pricing (in / out, per M tokens)	Notes
`minimax/minimax-m2`	MiniMax frontier reasoning	~$0.30 / $1.20	Likely the "2.5 minimax" reference — confirm slug at openrouter.ai/models.
`moonshotai/kimi-k2`	Kimi K2 (1T params, 256k ctx)	~$0.60 / $2.50	Cheaper than the Cloudflare-hosted Kimi K2.6 above; check freshness.
`deepseek/deepseek-r1`	DeepSeek R1 reasoning	~$0.55 / $2.19	Hosted full-fat R1 without running 600 GB locally.
`deepseek/deepseek-v3.1`	DeepSeek V3	~$0.27 / $1.10	General-purpose, very cheap.
`x-ai/grok-4`	xAI Grok 4	~$3 / $15	Long context, real-time-aware.
`google/gemini-2.5-pro`	Gemini 2.5 Pro	~$1.25 / $10	Alternative to direct Google AI key when convenient.
`qwen/qwen3-235b-a22b`	Qwen3 flagship MoE	~$0.20 / $0.60	Hosted version of the Qwen3 family we run locally.
`nvidia/llama-3.1-nemotron-70b-instruct`	NVIDIA Nemotron	~$0.12 / $0.30	Often free on the OR free tier.
`mistralai/mistral-large-2411`	Mistral Large	~$2 / $6	European-hosted option for data-residency-sensitive work.

Why route through OpenRouter even when a direct provider key exists? (a) one bill, one usage dashboard, one key to rotate; (b) instant model swaps via slug change — useful for the apps/control/worker/prompt-runner/ pattern; (c) automatic fallback routing when a primary provider is rate-limited. Drawbacks: ~5–10% markup vs direct, and sub-second-latency-sensitive paths (live AI host) should still go direct (Anthropic, Deepgram, ElevenLabs).

Action item: store the OpenRouter API key in BW under DevOps-sponicgarden and add an OPENROUTER_API_KEY entry to the relevant .env.example files. Update config/project.config.ts with a llmGateway section so apps have a documented fallback path.

NVIDIA `build.nvidia.com` (NIM)

NVIDIA also hosts a catalog of open-weight models directly behind an OpenAI-compatible API at integrate.api.nvidia.com/v1 — useful when you want NVIDIA-tuned variants (Nemotron) or want to avoid the OpenRouter middleman. Free dev tier, usage-based after.

Model	Use	Notes
`minimaxai/minimax-m2`	Frontier reasoning + tool-use, very long context	Same model also reachable via OpenRouter above; pick whichever has better latency / lower price for your case.
`nvidia/llama-3.1-nemotron-*`	NVIDIA-tuned Llama variants	Only on NVIDIA — not all are mirrored to OpenRouter.
`nvidia/nemotron-4-340b-instruct`	NVIDIA's largest open instruct model	Free dev tier, useful for batch synthetic-data generation.

Action item: create a BW item under DevOps-sponicgarden for the NVIDIA API key only if you need Nemotron-specific variants — otherwise OpenRouter covers the same MiniMax / DeepSeek / etc. surface with one key.

Storage layout — executed 2026-05-06

Models used to be spread across four sibling directories on PortoSams2T (organic growth, not design). On 2026-05-06 they were consolidated into a single canonical models/ root, with non-model state moved to a sibling runtime/ root. Every old path was preserved as a back-compat symlink so existing consumers (live-translation server, mflux, photo-indexer, anything with hardcoded paths) keep working unchanged.

Current layout (post-consolidation)

/Volumes/PortoSams2T/models/      ← canonical root for every model blob
├── llm/
│   └── ollama/                   107 GB — Ollama LLMs
├── stt/
│   └── whisper/                  4.9 GB — whisper.cpp ggml files
├── embeddings/
│   └── huggingface/{hub,modules,xet}    13 GB — siglip2 + gme-Qwen2-VL
└── image-gen/
    └── huggingface/{hub,xet,token,…}    31 GB — FLUX.1-dev

/Volumes/PortoSams2T/runtime/     ← non-model state
├── image-tools/                  2.9 GB — mflux/moondream/photo-indexer Python venvs
└── moondream-indexer/             10 GB — photo embedding DB

/Volumes/PortoSams2T/             ← back-compat symlinks at every old path
├── models/whisper                  → models/stt/whisper
├── huggingface-cache/{hub,modules,xet}    → models/embeddings/huggingface/*
├── alpuca-offload/ollama/models    → models/llm/ollama
├── alpuca-offload/huggingface/{hub,xet,token,stored_tokens} → models/image-gen/huggingface/*
├── alpuca-offload/moondream-indexer → runtime/moondream-indexer
└── image-tools                     → runtime/image-tools

What broke / what didn't

Ollama: nothing — ~/.ollama/models is a directory hard-link (same inode 1983236 as alpuca-offload/ollama/models). Moving the SSD-side directory entry to the new location preserved the inode, so the hard-link from home still resolves. Daemon restarted, all 13 models listed.
HF caches: nothing — both old paths are now symlinks to the new locations. ~/.cache/huggingface still resolves through alpuca-offload/huggingface → new image-gen path.
Whisper: nothing — the live-translation server's hardcoded /Volumes/PortoSams2T/models/whisper/... path still resolves via the symlink to models/stt/whisper.
Python venvs (mflux, moondream, photo-indexer): nothing — the image-tools top-level path is still a valid entry (now a symlink).

Why back-compat symlinks instead of env-var swaps? Every consumer that hardcoded an old path now keeps working without any further coordination. The canonical structure is the new models/ tree; the symlinks at old paths are explicit "this is legacy, find me at the new location" markers. Future code should use the new paths directly — the symlinks are a one-way ratchet, not a long-term API.

How to refresh this doc

Run these on ALPUCA when local-model state changes; paste new totals into the tables above.

ssh alpuca@alpuca '
  ollama list
  du -sh /Volumes/PortoSams2T/models/llm/ollama
  du -sh /Volumes/PortoSams2T/models/embeddings/huggingface/hub/*
  du -sh /Volumes/PortoSams2T/models/image-gen/huggingface/hub/*
  ls -la /Volumes/PortoSams2T/models/stt/whisper/
'

For cloud entries, check each provider's pricing page and the relevant BW item under DevOps-sponicgarden:

infra/bin/bw-sponic search anthropic
infra/bin/bw-sponic search azure
infra/bin/bw-sponic search nvidia
infra/bin/bw-sponic search deepgram
infra/bin/bw-sponic search elevenlabs

AI Models

Summary — what to use for what

Local on ALPUCA (PortoSams2T)

Ollama LLMs — 107 GB

Whisper.cpp — speech-to-text, 5.2 GB

HuggingFace cache (active) — 12.5 GB

HuggingFace cache (offload, image gen) — 31 GB

Image-tools venvs (Python runtimes that load the above)

Cloud APIs

Cloudflare Workers AI

Azure OpenAI

Google AI (Gemini) — the most-used cloud LLM in this stack

Anthropic (Claude)

Deepgram

ElevenLabs Replaced by Gemini TTS

OpenRouter

NVIDIA build.nvidia.com (NIM)

Storage layout — executed 2026-05-06

Current layout (post-consolidation)

What broke / what didn't

How to refresh this doc

NVIDIA `build.nvidia.com` (NIM)