
AI host — LiveKit Agents + Simli avatar

A speaking, listening AI host for Agentic Gatherings — that doubles as an outbound phone interviewer over SIP. One agent codebase, two transports.

2026-05-06 · Architecture & build plan · Touches: LiveKit, Gemini Live, Simli, Twilio, ALPUCA, Sponic translation app

TL;DR

Current production stack: LiveKit Cloud rooms · self-hosted LiveKit Agents worker on Oracle Phoenix · Gemini Live native audio (gemini-3.1-flash-live-preview) · optional Simli avatar. The earlier Deepgram → Claude → ElevenLabs cascade below is historical planning context, not the current runtime.

Costs: Gemini Live + LiveKit participant minutes are about $0.24–$0.28 for a 10-minute interview / Agentic Gathering without Simli. Simli adds about $3.00 per 10 minutes per avatar session.

Polish support: current path depends on Gemini Live's native multilingual audio. Whisper usage in repo: none in the LiveKit host path; ALPUCA's subtitle server uses Whisper.cpp as a local STT fallback for live subtitles.

Build: 7 phases for the Agentic Gathering host (~1–2 weeks), Phase 8 adds outbound phone calls (~1 week).

Web client — no install, works on iPhone

A vanilla HTML/JS page at sponicgardens.com/gather/ mirrors the native Android DinnerHostScreen: Google Sign-In, silent token re‑auth, livekit-token Edge Function exchange, mic publish, animated mic-level ring, auto-end when the agent leaves. Source: apps/garden/gather/. Same Web OAuth client ID as Android (801803827261-259t…); the production origin must be authorized in GCP → Credentials.

This unblocks iPhone users immediately while the native iOS+macOS app is built. iOS Safari note: keep the tab in the foreground — locking the screen suspends WebRTC. AirPods or wired headphones recommended; built-in echo cancellation handles speakerphone OK.

Use cases

Agentic Gathering host — AI character at the head of the table, displayed on a TV/tablet, listens to all guest mics through the existing Sponic translation app, jumps in to ask questions, propose toasts, surface guest connections, narrate course transitions. Programmatically promptable from an iPhone control app (“now toast the chef”).
Phone interviewer (Phase 8) — the same agent persona makes outbound calls for staff recruiting, partner BD intake (MOST, Bujna Warszawa, etc.), member onboarding chats, and post-event follow-ups. Posts the structured transcript + summary back into the intranet's relations / recruiting tables.
Future: shared persona & memory across both modalities — a guest the agent met at an Agentic Gathering can be recalled in a follow-up phone interview.

Architecture — Agentic Gathering host

AGENTIC GATHERING ROOM

  [Mic 1] [Mic 2] ... [Mic N]              [TV / tablet] [Speaker]
              |                                  ^           ^
              v                                  |           |
  Phones / web / native clients            Host display device
  joining the same LiveKit room            (audio/video subscriber)

+--------------------------- LIVEKIT ROOM ----------------------------+
| Participants: guest-mic-1..N · host-display · ai-host-agent         |
| Data channel: programmatic prompts ("toast the chef now")           |
+------------------------------+--------------------------------------+
                               |
                               v
+-------------- AI HOST AGENT (Python on Oracle Phoenix) -------------+
|                                                                     |
| Gemini Live native audio (gemini-3.1-flash-live-preview)            |
|   speech in ⇄ reasoning ⇄ speech out                                |
|                                                                     |
| Prompt library persona + optional participant dossiers              |
| Transcript capture → Supabase                                       |
| Optional Simli AvatarSession → video track back into the room       |
+---------------------------------------------------------------------+

Component picks

Layer	Pick	Why
WebRTC SFU	LiveKit Cloud	Native multi-participant rooms + SIP bridge for Phase 8. Free tier covers prototyping; paid is ~$0.50 per 1000 participant-minutes.
Capture / display	Android, Apple, web, and table display clients	All join the same LiveKit room with per-room JWTs minted by Supabase Edge Functions.
Realtime AI	Gemini Live native audio	One model handles speech input, reasoning, and speech output. Current default is `gemini-3.1-flash-live-preview` with voice `Puck`.
Prompting	Supabase prompt library + dossiers	Persona prompts load at session start from `public.prompts`; Agentic Gathering sessions can append participant dossiers.
Avatar	Simli optional	Native LiveKit integration, driven from the generated Gemini Live audio. Current tracked rate is `SIMLI_USD_PER_MINUTE=$0.30`.
Programmatic channel	LiveKit data channels	JSON commands over the same WebRTC connection (sub-100 ms). HTTP webhook on the agent as a fallback for external integrations.
Subtitle integration	`https://subs.sponicgardens.com`	Separate live-subtitle backend for subtitles and recordings. It is no longer the core Agentic Gathering host STT path.
Compute	Oracle Phoenix VM (already paid)	Agent process is mostly API-call orchestration — 2 vCPU / 2–4 GB RAM is plenty. Local MacBook is fine for the first prototype.

Historical note: this section originally compared a separate STT → LLM → TTS cascade against OpenAI Realtime. Production has since moved to Gemini Live native audio, so cost/model decisions should use the current cost table below and the AI host reference.
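
The programmatic channel in the table above carries JSON commands over LiveKit data channels. A minimal sketch of how the agent might validate one incoming payload follows. `HostCommand` and `parse_command` are hypothetical names, and the only schema assumed is that a command is a JSON object with a string `cmd` key, as in the examples elsewhere in this doc.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class HostCommand:
    """One programmatic prompt, e.g. {"cmd": "toast", "language": "pl"}."""
    cmd: str
    args: dict[str, Any] = field(default_factory=dict)


def parse_command(payload: bytes) -> Optional[HostCommand]:
    """Decode one data-channel payload; return None if it is not a valid command."""
    try:
        message = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    # Assumption: a command is a JSON object whose "cmd" value is a string.
    if not isinstance(message, dict) or not isinstance(message.get("cmd"), str):
        return None
    args = {key: value for key, value in message.items() if key != "cmd"}
    return HostCommand(cmd=message["cmd"], args=args)
```

Returning `None` instead of raising keeps a malformed message from a control client from interrupting the session; the caller can log and ignore it.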

Translation-app integration

The existing Sponic live-translation pipeline already does per-mic ASR + translation and broadcasts subtitles over a WebSocket. We don’t need a parallel STT pipeline — the agent subscribes to the same WebSocket as a consumer.

What the existing app already gives us

Per-mic source language — each phone running the translation app picks its source language manually (no auto-detect). The selection lives in AudioSettings.kt and is sent to the server. See apps/mobile/app/src/main/kotlin/com/sponicgardens/sponic/network/SubtitleClient.kt:100.
Streaming transcripts with translation — the ALPUCA subtitle server (Mac mini in Poland, http://Alpuca.local:8910 on LAN, https://subs.sponicgardens.com public) runs Deepgram Nova-2 EU as primary STT (Whisper.cpp local fallback) and DeepL / Azure Translator as translator. It emits one WebSocket per output language.
9 supported languages: en, pl, es, fr, de, pt, it, hi, ar (see Models.kt:34-44).

WebSocket message schema (consumed verbatim)

// wss://subs.sponicgardens.com/subtitles?lang=en
{
  "id": "seg_001",
  "text": "Welcome to Sponic Gardens",   // already translated to ?lang=
  "lang": "en",
  "source_lang": "pl",                    // what the speaker actually said
  "source_text": "Witamy w Sponic Gardens",
  "speaker": "mic-3",                     // optional, present when known
  "timestamp": 1711800000,
  "is_partial": false
}

How the agent consumes it

The agent opens one consumer WebSocket per active language (typically ?lang=en + ?lang=pl for a Polish/English Agentic Gathering). Behavior:

Use source_text + source_lang + speaker to build the rolling transcript fed to Claude. (Translated text is reference, not ground truth for reasoning.)
Tag each entry: [mic-3, pl] Witamy w Sponic Gardens (en: "Welcome to Sponic Gardens").
De-dupe: drop is_partial: true messages once a final-segment message with the same id arrives.
When the agent speaks, it can optionally publish its reply text back into the translation pipeline (via the existing POST /subtitles/inject endpoint), so the translation app fans it out to non-English guests via their existing earpieces — no new transport needed.
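
The consumption rules above can be sketched as a small in-memory merger. This is an illustration, not the shipped agent code: `RollingTranscript` and `Entry` are hypothetical names, and the sketch assumes that a segment `id` identifies the same utterance across the per-language streams, which the schema does not confirm.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Entry:
    speaker: str
    source_lang: str
    source_text: str
    is_partial: bool
    translations: dict[str, str] = field(default_factory=dict)


class RollingTranscript:
    """Merges subtitle messages from several ?lang= streams, keyed by segment id."""

    def __init__(self) -> None:
        self._entries: dict[str, Entry] = {}  # keeps first-arrival order

    def ingest(self, raw: str) -> None:
        msg = json.loads(raw)
        entry = self._entries.get(msg["id"])
        if entry is not None and not entry.is_partial and msg["is_partial"]:
            return  # a final already arrived; ignore the late partial
        if entry is None or entry.is_partial != msg["is_partial"]:
            # First sighting, or a partial being superseded by its final.
            entry = Entry(
                speaker=msg.get("speaker", "unknown"),  # speaker is optional
                source_lang=msg["source_lang"],
                source_text=msg["source_text"],
                is_partial=msg["is_partial"],
            )
            self._entries[msg["id"]] = entry
        else:
            entry.source_text = msg["source_text"]
        if msg["lang"] != msg["source_lang"]:
            entry.translations[msg["lang"]] = msg["text"]

    def lines(self) -> list[str]:
        """Tagged transcript lines for final segments only."""
        out = []
        for e in self._entries.values():
            if e.is_partial:
                continue
            line = f"[{e.speaker}, {e.source_lang}] {e.source_text}"
            if e.translations:
                extra = "; ".join(
                    f'{lang}: "{text}"' for lang, text in sorted(e.translations.items())
                )
                line += f" ({extra})"
            out.append(line)
        return out
```

Fed the sample message from the schema section, `lines()` yields `[mic-3, pl] Witamy w Sponic Gardens (en: "Welcome to Sponic Gardens")`. Emitting only final segments is a design choice of this sketch; an agent that needs lower latency could also expose partials.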

Net result

Zero new STT spend if the Agentic Gathering is already running the translation app, because we reuse its transcripts. The agent only pays for LLM, TTS, and avatar.

Multi-language reply — how the agent picks a language

Three contexts the agent has to handle differently:

Context	Language rule	Implementation
Replying to a specific guest (their question, their name addressed)	Reply in their source language, identified by `source_lang` on their incoming segment.	System prompt rule: “Reply in the language of the most recent speaker you are addressing. Their language tag will be in the transcript.” Claude handles this natively.
Broadcast / table announcement (toasts, course transitions, opening remarks)	Reply in the configured `table_language` (default: dominant active language; configurable per Agentic Gathering).	Programmatic command from control app: `{ "cmd": "toast", "language": "pl", "subject": "Maria's promotion" }`. Or system prompt default if unspecified.
Bilingual moment (mixed table, host wants both languages)	Speak twice, once in each language, or pick one and let the translation app fan out the rest to earpieces.	Tool call `broadcast_in_languages(["en", "pl"])`. Generates two TTS clips back-to-back.
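
The three rules in the table can also be written as a deterministic fallback, useful for testing even if the model handles the first case from the system prompt. This is a sketch under assumptions: `pick_reply_languages`, the context labels `"direct"`, `"broadcast"`, `"bilingual"`, and the default values are all hypothetical.

```python
from typing import Optional, Sequence


def pick_reply_languages(
    context: str,
    *,
    addressed_source_lang: Optional[str] = None,
    command_language: Optional[str] = None,
    table_language: str = "en",
    bilingual_languages: Sequence[str] = ("en", "pl"),
) -> list[str]:
    """Return the language(s) the host should speak in, in speaking order."""
    if context == "direct":
        # Reply in the addressed guest's source_lang; fall back to the table language.
        return [addressed_source_lang or table_language]
    if context == "broadcast":
        # An explicit "language" on the command wins over the table default.
        return [command_language or table_language]
    if context == "bilingual":
        return list(bilingual_languages)
    raise ValueError(f"unknown context: {context!r}")
```

For example, `pick_reply_languages("broadcast", command_language="pl")` selects Polish for a toast command that carries `"language": "pl"`, while the same call without `command_language` falls back to `table_language`.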

The TTS voice stays the same across languages. ElevenLabs Multilingual v2 holds a single voice ID through Polish, English, Spanish, etc., so the host has one identity that switches language without sounding like a different person. (Cartesia Sonic was the cheaper TTS option, but its Polish quality is noticeably weaker; for a Polish-speaking household with Sonia, ElevenLabs is worth the spend.)

Polish support — quality assessment

Component	Polish quality	Notes
Deepgram Nova-3 STT (or Nova-2 via ALPUCA)	excellent	Both Nova-2 and Nova-3 are first-class on Polish. Nova-3 multilingual streaming is the path forward; Nova-2 EU is what ALPUCA currently uses.
DeepL translation (in the existing pipeline)	excellent	Polish is one of DeepL’s strongest languages — called out as “standout” in `LIVE-TRANSLATION.html`. We mostly bypass it because the agent reasons over `source_text` directly, but it stays useful for back-translating the agent’s replies to non-Polish guests.
Claude Sonnet 4.6 (generation)	excellent	Native Polish output is fluent, idiomatic, and culturally aware. No prompt-engineering tricks required — just “reply in Polish.”
ElevenLabs Multilingual v2 TTS	very good	Polish is officially supported, prosody is natural. Some slight stress-pattern errors on rare loanwords; 95% indistinguishable from native.
Cartesia Sonic TTS	limited	Polish support is recent and weaker than English. Mentioned only because it’s cheaper — not the recommended pick for a Polish-speaking host.
Simli avatar lip-sync	excellent	Lip-sync is phoneme-driven from the TTS audio waveform — language-agnostic. Works on Polish identically to English.

Whisper usage in the repo

Direct grep of the codebase:

No Whisper in the live-translation pipeline on the mobile app side. The Kotlin app uses Android SpeechRecognizer for local fallback and streams raw PCM to the ALPUCA backend for primary transcription. See apps/mobile/app/src/main/kotlin/com/sponicgardens/sponic/audio/AudioRecorder.kt:94-100.
ALPUCA subtitle server uses Whisper.cpp as a fallback — not as the primary STT. Deepgram Nova-2 EU is primary; Whisper.cpp (medium / large-v3) runs locally on the Mac mini when Deepgram is unreachable or the user wants offline mode. The server’s code lives outside this repo (on ALPUCA itself).
One unrelated hit: apps/control/supabase/functions/generate-whispers/index.ts — this generates “whisper templates” for the Pakucha spirit-AI feature. It uses Gemini, not OpenAI Whisper. The shared word is incidental.

Conclusion: Whisper is not on the critical path for either the existing translation app or the AI host. Don’t introduce it.

Current 10-minute cost snapshot

Scenario	Humans	Simli	Cost
Interview	1	No	$0.24
Agentic Gathering	4	No	$0.25
Agentic Gathering	10	No	$0.28
Agentic Gathering	4	Yes	$3.26
Agentic Gathering	10	Yes	$3.29

Assumptions: Gemini Live $0.005/min audio input + $0.018/min audio output, LiveKit WebRTC $0.0005/participant-min, Simli $0.30/min, and the worker remains self-hosted on Oracle Phoenix.
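
The snapshot can be approximately reproduced from the stated rates. The sketch below makes two assumptions that are not stated above: Gemini Live is billed for the full session on both the input and output side, and the LiveKit participant count is the humans plus the agent, plus one extra participant for the avatar when Simli is on. `session_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Per-minute rates from the assumptions line above (USD).
GEMINI_AUDIO_IN = Decimal("0.005")
GEMINI_AUDIO_OUT = Decimal("0.018")
LIVEKIT_PER_PARTICIPANT = Decimal("0.0005")
SIMLI = Decimal("0.30")


def session_cost(humans: int, minutes: int = 10, simli: bool = False) -> Decimal:
    """Estimated USD cost of one session, unrounded."""
    # Assumption: agent counts as one participant, the Simli avatar as another.
    participants = humans + 1 + (1 if simli else 0)
    cost = minutes * (GEMINI_AUDIO_IN + GEMINI_AUDIO_OUT)
    cost += minutes * participants * LIVEKIT_PER_PARTICIPANT
    if simli:
        cost += minutes * SIMLI
    return cost


for humans, simli in [(1, False), (4, False), (10, False), (4, True), (10, True)]:
    print(humans, simli, session_cost(humans, simli=simli))
```

Under these assumptions the one-human interview and both Simli rows match the table exactly (0.24, 3.26, 3.29); the 4- and 10-human rows without Simli come out at 0.255 and 0.285, within half a cent of the table's $0.25 and $0.28.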

Compute & room hardware

Agent process (where the Python program runs)

2 vCPU, 2–4 GB RAM, ~1 GB disk. All heavy lifting is API calls to STT/LLM/TTS/avatar; the agent itself is glue.
Where to run: Oracle Phoenix VM (already running other Sponic workers). Adds zero monthly cost.
Latency to providers: Phoenix is in US West — check round-trip to Deepgram (us-east) and Anthropic (us-west) before locking in. If Deepgram us-east is too slow, switch to Deepgram EU (the ALPUCA default).
Fallback: Fly.io / Railway machine (~$5/mo) or local MacBook during the actual Agentic Gathering.

Room hardware (per Agentic Gathering)

Mics: existing NEEWER CM31s + Android phones running the Sponic translation app (already validated path).
Display: any device that can run the LiveKit web client — iPad, Android tablet, Mac mini → TV, or a phone Chromecasting to a TV. Avatar video is 1–3 Mbps so any modern device works.
Speaker: built-in display speakers, or a Bluetooth speaker, or route the agent’s voice through the existing translation-app earpieces (handles per-language fan-out automatically).
Bandwidth: ~2.5 Mbps total (8 mics × 32 kbps up + 2 Mbps avatar down). Any home WiFi.

Latency budget (cascade path)

mic capture        ~30 ms
WebRTC up          ~50 ms
Deepgram STT       ~300 ms
Claude LLM         ~500 ms (first token, prompt cache hit)
ElevenLabs Flash   ~75 ms (first audio chunk)
Simli avatar       ~200 ms
WebRTC down        ~50 ms
─────────────────────────
total              ~1.2 s round-trip

Comfortable for Agentic Gathering banter (people pause that long anyway). Sub-1 s would require switching to OpenAI Realtime + avatar — pricier and not necessary.
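
As a quick check of the budget arithmetic, the stage figures can be summed directly. The numbers are copied from the budget above; nothing here is measured.

```python
# Stage latencies in milliseconds, copied from the cascade budget.
LATENCY_MS = {
    "mic capture": 30,
    "WebRTC up": 50,
    "Deepgram STT": 300,
    "Claude LLM": 500,
    "ElevenLabs Flash": 75,
    "Simli avatar": 200,
    "WebRTC down": 50,
}

total_ms = sum(LATENCY_MS.values())

# The three slowest stages, largest first.
slowest = sorted(LATENCY_MS.items(), key=lambda kv: kv[1], reverse=True)[:3]
slowest_ms = sum(ms for _, ms in slowest)

print(f"total: {total_ms} ms")
print(f"three slowest stages: {slowest_ms} ms")
```

The sum is 1205 ms, which the budget rounds to ~1.2 s. STT, LLM, and avatar alone account for 1000 ms, so a sub-1 s target cannot be reached by trimming the capture and transport legs.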

Build phases

~300–500 lines of Python on top of livekit-agents. Reference implementations exist for room joining, plugin wiring, and the SIP integration; the original work is the wake-gate, the persona prompt, the per-mic transcript merging, and the translation-app bridge.

#	Phase	Scope	Time
1 ✅	One-mic spike	Local laptop. Python agent joins a LiveKit room, transcribes one mic via Deepgram, replies via ElevenLabs. No avatar yet. ✅ scaffold + entrypoint shipped 2026-05-06 — ops guide at /docs/reference/AI-HOST.	½ day
2	Multi-participant	Subscribe to all participant audio tracks separately. Per-mic STT streams. Shared rolling transcript with `[speaker, lang]` tags.	1 day
3	Wake-gate	Rule-based first (silence ≥ N seconds + recent question, host name mentioned). Then upgrade to Haiku 4.5 classifier on the rolling transcript.	1–2 days
4	Programmatic prompt channel	Listen on LiveKit data channel for `{cmd, ...}` messages. Build a tiny iOS/web control UI (could live in the intranet at `/en/relations/dinner-host`) that joins the room as a control participant.	1 day
5	Avatar (Simli + Live2D-style art)	Generate or commission the character art (single PNG, front-facing, neutral expression). Plug Simli LiveKit plugin into the TTS audio stream. Iterate on persona system prompt and voice.	2–3 days
6	Translation app integration	Subscribe to `wss://subs.sponicgardens.com/subtitles?lang=...` consumers (one per active language). Wire `source_text` + `source_lang` into the rolling transcript. Optionally publish agent replies via `POST /subtitles/inject` for back-translation.	1 day
7	Production deploy	Move agent process to Oracle Phoenix. Systemd unit. Real Agentic Gathering dry-run with ~3 people. Record + review for persona polish.	1–2 days
8	Phone interview transport (SIP)	See dedicated section below.	~1 week

Estimated total to working Agentic Gathering host prototype (Phases 1–7): 1–2 weeks.
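
The rule-based wake-gate from Phase 3 could start as small as the sketch below. It reads the Phase 3 rule as "speak if the host is named, or if the table has been silent for at least N seconds after a question", which is one interpretation of the scope text. `Utterance`, `should_speak`, the default host name, and the 3-second threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    ended_at: float  # seconds on a monotonic clock


def should_speak(
    recent: list[Utterance],
    now: float,
    *,
    host_name: str = "host",      # hypothetical wake word
    min_silence_s: float = 3.0,   # the "N seconds" threshold, value assumed
    muted: bool = False,          # manual mute toggle from the control app
) -> bool:
    """Decide whether the host may take a turn right now."""
    if muted or not recent:
        return False
    last = recent[-1]
    if host_name.lower() in last.text.lower():
        return True  # addressed by name: respond without waiting
    silence = now - last.ended_at
    # Otherwise only fill a pause that follows an unanswered question.
    return silence >= min_silence_s and last.text.rstrip().endswith("?")
```

Keeping the gate a pure function of the rolling transcript and the clock makes it easy to replay recorded gatherings against different thresholds before swapping in a classifier.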

Phase 8 — outbound phone interviewer

Vapi shortcut — evaluate before building LiveKit SIP from scratch

The Sponic repo already contains a Vapi integration: apps/control/supabase/functions/vapi-server/index.ts, apps/control/public/spaces/admin/voice.html, and the Supabase tables vapi_config / voice_assistants. However, per the user (2026-05-06), this code was ported from the alpacapps repo and is currently tied to a different, lightly used Vapi account. Directive: Sponic must own all code and credentials, with no runtime dependencies on alpacapps. That leaves two options: (a) provision a Sponic-owned Vapi account and reuse the existing handler (faster path, ~2 days to wire), or (b) build LiveKit SIP from scratch (cleaner architecture, ~1 week per the plan below). Decide before starting Phase 8. The LiveKit-SIP plan below is unchanged but optional.

Same agent process, second transport. The agent doesn’t care whether audio comes from a WebRTC participant or a SIP call — LiveKit Agents abstracts both behind the same room model.

Components added

Component	Pick	Notes
SIP bridge	LiveKit SIP service (Cloud)	Bridges PSTN audio in/out of LiveKit rooms. Officially supported.
SIP trunk	Twilio Elastic SIP Trunking	Default choice: well documented, and operationally similar to hosted services Sponic already uses (e.g. Resend, for a different purpose). Telnyx is the cheaper alternative if call volume justifies switching.
Phone number	Twilio US local number	~$1.15/month for the number. Pick a US area code matching where most candidates live; for Poland-resident candidates, add a Polish local DID (~$3–5/mo).
Outbound dialer	Internal HTTP endpoint on the agent: `POST /interview/start`	Body: `{ phone, persona, prelude, candidate_id }`. Agent creates a fresh LiveKit room, places the SIP call, joins as the only other participant.
Inbound handler	LiveKit SIP “dispatch rule”	Incoming calls to our number create a room and the agent joins. Useful for “call us back” flows.
Recording & storage	LiveKit room recording → R2 bucket	WAV per call, one transcript JSON per call, deposited under `r2://sponic-call-recordings/<date>/<candidate_id>.{wav,json}`.
Post-call processing	Edge function: transcript → Claude summary → structured fields → intranet	Writes back to `relations_contacts` or `staff_recruiting` (depending on call type). Triggers a notification email to the assignee.
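
A minimal sketch of validating the `POST /interview/start` body before any call is placed. The four field names come from the table above; treating all of them as required strings and requiring an E.164 phone number are assumptions, and `InterviewRequest` / `parse_interview_request` are hypothetical names.

```python
import re
from dataclasses import dataclass

# E.164: "+" followed by up to 15 digits, first digit non-zero.
E164 = re.compile(r"^\+[1-9]\d{1,14}$")
REQUIRED_FIELDS = ("phone", "persona", "prelude", "candidate_id")


@dataclass(frozen=True)
class InterviewRequest:
    phone: str
    persona: str
    prelude: str
    candidate_id: str


def parse_interview_request(body: dict) -> InterviewRequest:
    """Validate the JSON body; raise ValueError with a readable reason."""
    bad = [name for name in REQUIRED_FIELDS if not isinstance(body.get(name), str)]
    if bad:
        raise ValueError(f"missing or non-string fields: {', '.join(bad)}")
    if not E164.match(body["phone"]):
        raise ValueError("phone must be in E.164 format, e.g. +48221234567")
    return InterviewRequest(**{name: body[name] for name in REQUIRED_FIELDS})
```

Rejecting malformed numbers at the endpoint keeps bad input from reaching the SIP trunk, where failures are slower to surface and may still be billed.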

Code changes to the agent

No avatar. Audio-only on phone calls: skip Simli entirely, saving ~$0.30/min at the current tracked Simli rate.
Tighter latency budget. Switch to ElevenLabs Flash v2.5 (lower latency, slightly lower quality but better for phone audio).
Different system prompt. Interviewer persona, structured question flow per call type (recruiting interview, partner intake, member onboarding).
New tools: mark_question_complete, move_to_next_topic, flag_for_human_review, save_interview_summary, schedule_followup.
Recording-consent prologue. Required first turn of every call: “This call will be recorded for note-taking purposes — is that OK?” Wait for affirmative before continuing.
Hangup tool. Agent calls end_call() when the interview is complete or the candidate asks to stop.

Per-call cost (30 minutes outbound)

Item	Calc	Cost
Twilio SIP outbound	30 min × $0.0085/min	$0.26
Twilio phone number	amortized $1.15/mo over say 20 calls/month	$0.06
LiveKit Cloud	2 ppt × 30 min	$0.03
Deepgram Nova-3 multilingual	30 min × $0.0077/min	$0.23
Claude Sonnet 4.6	~5K cached input + ~3K output (interviewer is more talkative than Agentic Gathering host)	~$0.10
ElevenLabs Flash v2.5	~1500 words spoken (~7.5K chars)	~$0.50
Total per 30-min interview		~$1.20

For comparison: a human interviewer at $50/hour costs $25 for the same 30-minute call, roughly 20× the cost of the AI agent.
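
The per-call table can be recomputed from its own line items. The sketch below copies the rates from the table, takes the LiveKit rate of $0.0005 per participant-minute from the component picks, and treats the Claude and ElevenLabs figures as the flat estimates given.

```python
from decimal import Decimal

MINUTES = 30

line_items = {
    "Twilio SIP outbound": MINUTES * Decimal("0.0085"),
    "Twilio phone number": Decimal("1.15") / 20,      # amortized over 20 calls/month
    "LiveKit Cloud": 2 * MINUTES * Decimal("0.0005"),  # 2 participants
    "Deepgram Nova-3": MINUTES * Decimal("0.0077"),
    "Claude Sonnet 4.6": Decimal("0.10"),              # flat estimate from the table
    "ElevenLabs Flash v2.5": Decimal("0.50"),          # flat estimate from the table
}

total = sum(line_items.values(), Decimal("0"))
human_cost = Decimal("50") * MINUTES / 60  # $50/hour interviewer

print(f"AI call:    ${total:.2f}")
print(f"human call: ${human_cost:.2f}")
print(f"ratio:      {human_cost / total:.0f}x")
```

The unrounded total is $1.1735, which the table rounds up to ~$1.20. ElevenLabs is the largest single item, so the TTS choice matters more for phone cost than the telephony rate does.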

Compliance & safety

Call recording consent. US two-party consent states (CA, FL, IL, MD, MA, MT, NH, PA, WA — among others) require both parties to consent. Always include the recording prologue. Polish/EU calls fall under GDPR — explicit consent + right-to-deletion compliant storage.
Time-of-day restrictions. US TCPA: no outbound calls before 8 AM or after 9 PM in the recipient’s local time. Encode this in the dialer; refuse to place calls outside the window.
Do Not Call list. Applies to cold outbound sales calls only. Sponic interviews are opt-in (recruiting candidates and known partner contacts), so DNC doesn’t apply, but maintain an internal opt-out list anyway.
Identification. First turn after consent: “Hi, this is the Sponic Gardens AI assistant calling on behalf of Rahul / Sonia —”. Do not pretend to be human if asked.
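
The time-of-day rule above is easy to encode in the dialer. A minimal sketch, assuming the dialer knows the recipient's time zone; `may_place_call` is a hypothetical name, and the window is the 8 AM to 9 PM local-time rule stated above.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

WINDOW_START = time(8, 0)   # earliest allowed local time
WINDOW_END = time(21, 0)    # calls must start before 9 PM local time


def may_place_call(now_utc: datetime, recipient_tz: tzinfo) -> bool:
    """True if an outbound call may start now in the recipient's local time."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local_time = now_utc.astimezone(recipient_tz).time()
    return WINDOW_START <= local_time < WINDOW_END


# Example with a fixed UTC-8 offset; 16:00 UTC is 08:00 local.
utc_minus_8 = timezone(timedelta(hours=-8))
print(may_place_call(datetime(2026, 5, 6, 16, 0, tzinfo=timezone.utc), utc_minus_8))
```

In a real dialer the fixed offset would be replaced by a named zone such as `zoneinfo.ZoneInfo("America/Los_Angeles")`, so that daylight-saving changes are handled. Refusing when the zone is unknown is the safer default.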

Phase 8 build steps

Set up Twilio Elastic SIP trunk + provision a US phone number. Store credentials in BW DevOps-sponicgarden. (~2 hours)
Configure LiveKit SIP integration (Cloud config + dispatch rule for inbound, outbound trunk for outbound). (~4 hours)
Add POST /interview/start endpoint on the agent. Refactor agent code to take a transport parameter (room-only vs SIP-call). (~1 day)
Build inbound handler (incoming call → new room → agent joins). (~½ day)
Author interviewer persona system prompt + tool definitions per call type. (~1 day)
Post-call processing edge function: transcript → Claude summary → relations_contacts/staff_recruiting update. (~1 day)
End-to-end testing with own phone numbers. Record consent prologue. Sanity-check time-of-day enforcement. (~1 day)
Rollout: first batch of 5 candidate interviews monitored manually, iterate. (~ongoing)

Total Phase 8: ~1 week of focused work, plus a monitored rollout.

Decisions locked

Avatar style: Simli + Live2D-style character art (single illustrated PNG, anime-leaning, distinct Sponic personality — not photoreal).
Voice: ElevenLabs Multilingual v2 (single voice ID across all 9 languages). Generic library voice for v1; can clone a custom voice once the persona stabilizes.
Persona swappability: per Agentic Gathering system prompt (e.g., “Agentic Gathering host: warm, curious, leans into food and gardens” vs “Interview host: structured, professional, fewer jokes”). Trivial to support and worth it.
Transport unification: Agentic Gathering host (Phase 1–7) and phone interviewer (Phase 8) share the same agent code, persona system, and memory layer. One codebase, two transports.
STT cost-share: reuse the existing translation app’s WebSocket transcripts during Agentic Gatherings; only pay for Deepgram directly during phone calls (where the translation app isn’t involved).

Risks & open questions

Wake-gate UX is the main hard problem. If the host responds to every utterance, it dominates the table. If it is too quiet, there is little reason for it to be at the table. Budget time for prompt iteration on the gating logic, and add a manual “mute / unmute” toggle on the control app.
Live2D-style illustrated avatars on Simli — verify before committing. Simli’s “custom character” mode is best on photoreal portraits; results on stylized art vary. Run a one-off test with a single character PNG before locking in Phase 5. If results are poor, fallbacks: (a) more realistic illustrated portraits, (b) Tavus photoreal humans, (c) Live2D-rigged VTuber model with audio-driven lip-sync (more work, full control).
Translation-app coupling. The agent depends on ALPUCA being up. If ALPUCA goes down mid-gathering, the agent loses the transcript stream. Mitigation: agent falls back to its own per-mic Deepgram subscription using the LiveKit audio tracks (slightly higher cost, same quality).
Voice cloning for the persona. If we want a specific Sponic-branded voice (matching Sonia or Rahul, or a fully synthetic personality), need ~30 minutes of clean recordings. Defer to v2.
Phone-call abuse / robocall liability. Twilio numbers used for outbound “AI calls” can get flagged as spam by carriers (STIR/SHAKEN attestation matters). Use Twilio Voice Insights to monitor flagging rate; rotate numbers if reputation degrades.
Memory across modalities. Storing “the agent met Maria at last week’s Agentic Gathering” for use in a phone interview requires a per-person memory layer. Likely a Supabase table ai_host_memory(person_id, fact, source_call_id, recorded_at) — deferred until basic voice loops work, but worth designing the schema early.

Future work

Group video calls (not just Agentic Gathering + phone). LiveKit handles this natively — same agent code joins a Zoom-like web meeting.
Slack/WhatsApp text bridge for follow-up messages from the agent (à la Boardy.ai’s multi-channel setup).
Cross-call memory layer — Supabase table referenced above.
Persona library — multiple named hosts (Mr. Sponic the gardener, Spirit-of-the-Garden Pakucha, Interview Mode, etc.) selectable per session.
Self-hosted LiveKit — if call volume justifies, run LiveKit OSS on Oracle Phoenix to drop the cloud SFU bill. Worth it past ~50K participant-minutes/mo.

Reference links

LIVE-TRANSLATION.html — existing translation pipeline spec (WebSocket schema, ALPUCA architecture, language list).
apps/mobile/app/src/main/kotlin/com/sponicgardens/sponic/network/SubtitleClient.kt — Kotlin client for the subtitle WebSocket (lines 19–88 for connect logic, 63–73 for message format).
apps/mobile/app/src/main/kotlin/com/sponicgardens/sponic/audio/AudioRecorder.kt — capture pipeline (lines 94–100 for mic source, 308–313 for AAC encoding).
LiveKit Agents docs: docs.livekit.io/agents
LiveKit SIP docs: docs.livekit.io/sip
Simli LiveKit plugin: docs.livekit.io/agents/integrations/avatar/simli
Boardy.ai (comparable product) — cascading STT/LLM/TTS architecture, OpenAI brain, undisclosed telephony stack. Good UX reference for phone-interview persona & multi-channel follow-up.

Doc owner: Rahul. Drafted 2026-05-06. Operational secrets in BW collection DevOps-sponicgarden (ALPU.CA org); full token / endpoint / ID index lives in the auto-memory at ~/.claude/projects/-Users-rahulio-Documents-CodingProjects-sponic/memory/service-access.md.