
AI host — LiveKit Agents + Simli avatar

A speaking, listening AI host for Agentic Gatherings that doubles as an outbound phone interviewer over SIP. One agent codebase, two transports.
2026-05-06 Architecture & build plan Touches: LiveKit · Gemini Live · Simli · Twilio · ALPUCA · Sponic translation app

TL;DR

Current production stack: LiveKit Cloud rooms · self-hosted LiveKit Agents worker on Oracle Phoenix · Gemini Live native audio (gemini-3.1-flash-live-preview) · optional Simli avatar. The earlier Deepgram → Claude → ElevenLabs cascade below is historical planning context, not the current runtime.

Costs: Gemini Live + LiveKit participant minutes are about $0.24–$0.28 for a 10-minute interview / Agentic Gathering without Simli. Simli adds about $3.00 per 10 minutes per avatar session.

Polish support: current path depends on Gemini Live's native multilingual audio. Whisper usage in repo: none in the LiveKit host path; ALPUCA's subtitle server uses Whisper.cpp as a local STT fallback for live subtitles.

Build: 7 phases for the Agentic Gathering host (~1–2 weeks); Phase 8 adds outbound phone calls (~1 week).

Web client — no install, works on iPhone

A vanilla HTML/JS page at sponicgardens.com/gather/ mirrors the native Android DinnerHostScreen: Google Sign-In, silent token re‑auth, livekit-token Edge Function exchange, mic publish, animated mic-level ring, auto-end when the agent leaves. Source: apps/garden/gather/. Same Web OAuth client ID as Android (801803827261-259t…); the production origin must be authorized in GCP → Credentials.

This unblocks iPhone users immediately while the native iOS+macOS app is built. iOS Safari note: keep the tab in the foreground — locking the screen suspends WebRTC. AirPods or wired headphones recommended; built-in echo cancellation handles speakerphone OK.

Use cases

Architecture — Agentic Gathering host

AGENTIC GATHERING ROOM

[Mic 1] [Mic 2] ... [Mic N]          [TV / tablet]   [Speaker]
           |                               ^             ^
           v                               |             |
Phones / web / native clients        Host display device
joining the same LiveKit room        (audio/video subscriber)

+--------------------------- LIVEKIT ROOM ----------------------------+
| Participants: guest-mic-1..N · host-display · ai-host-agent         |
| Data channel: programmatic prompts ("toast the chef now")           |
+------------------------------+--------------------------------------+
                               |
                               v
+-------------- AI HOST AGENT (Python on Oracle Phoenix) -------------+
|                                                                     |
| Gemini Live native audio (gemini-3.1-flash-live-preview)            |
| speech in ⇄ reasoning ⇄ speech out                                  |
|                                                                     |
| Prompt library persona + optional participant dossiers              |
| Transcript capture → Supabase                                       |
| Optional Simli AvatarSession → video track back into the room       |
+---------------------------------------------------------------------+

Component picks

Layer | Pick | Why
WebRTC SFU | LiveKit Cloud | Native multi-participant rooms + SIP bridge for Phase 8. Free tier covers prototyping; paid is ~$0.50 per 1000 participant-minutes.
Capture / display | Android, Apple, web, and table display clients | All join the same LiveKit room with per-room JWTs minted by Supabase Edge Functions.
Realtime AI | Gemini Live native audio | One model handles speech input, reasoning, and speech output. Current default is gemini-3.1-flash-live-preview with voice Puck.
Prompting | Supabase prompt library + dossiers | Persona prompts load at session start from public.prompts; Agentic Gathering sessions can append participant dossiers.
Avatar | Simli (optional) | Native LiveKit integration, driven from the generated Gemini Live audio. Current tracked rate is SIMLI_USD_PER_MINUTE=$0.30.
Programmatic channel | LiveKit data channels | JSON commands over the same WebRTC connection (sub-100 ms). HTTP webhook on the agent as a fallback for external integrations.
Subtitle integration | https://subs.sponicgardens.com | Separate live-subtitle backend for subtitles and recordings. It is no longer the core Agentic Gathering host STT path.
Compute | Oracle Phoenix VM (already paid) | Agent process is mostly API-call orchestration, so 2 vCPU / 2–4 GB RAM is plenty. Local MacBook is fine for the first prototype.
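
The programmatic channel row can be illustrated with a small dispatcher. This is a minimal sketch, assuming each data-channel packet is a UTF-8 JSON object with a `cmd` key; `dispatch_command`, `handle_toast`, and the instruction strings they return are hypothetical names for illustration, and the wiring into LiveKit's data-received callback is deliberately left out.

```python
import json
from typing import Any, Callable, Dict, Optional

Handler = Callable[[Dict[str, Any]], str]


def handle_toast(args: Dict[str, Any]) -> str:
    # Hypothetical handler: turns a command into an instruction for the host persona.
    subject = args.get("subject", "the table")
    language = args.get("language", "en")
    return f"[{language}] Propose a toast to {subject}."


# Registry of known commands; anything else is rejected rather than guessed at.
HANDLERS: Dict[str, Handler] = {"toast": handle_toast}


def dispatch_command(payload: bytes) -> Optional[str]:
    """Decode one data-channel packet and route it to its handler.

    Returns the instruction text, or None if the packet is malformed or unknown.
    """
    try:
        message = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(message, dict):
        return None
    cmd = message.get("cmd")
    if not isinstance(cmd, str) or cmd not in HANDLERS:
        return None
    return HANDLERS[cmd](message)
```

Returning None for bad packets, instead of raising, keeps a malformed control message from interrupting a live session; the HTTP webhook fallback could reuse the same function on the request body.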

Historical note: this section originally compared a separate STT → LLM → TTS cascade against OpenAI Realtime. Production has since moved to Gemini Live native audio, so cost/model decisions should use the current cost table below and the AI host reference.

Translation-app integration

The existing Sponic live-translation pipeline already does per-mic ASR + translation and broadcasts subtitles over a WebSocket. We don’t need a parallel STT pipeline — the agent subscribes to the same WebSocket as a consumer.

What the existing app already gives us

WebSocket message schema (consumed verbatim)

// wss://subs.sponicgardens.com/subtitles?lang=en
{
  "id": "seg_001",
  "text": "Welcome to Sponic Gardens",   // already translated to ?lang=
  "lang": "en",
  "source_lang": "pl",                    // what the speaker actually said
  "source_text": "Witamy w Sponic Gardens",
  "speaker": "mic-3",                     // optional, present when known
  "timestamp": 1711800000,
  "is_partial": false
}
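
A consumer of this schema might parse each frame into a typed record before merging it into the transcript. The sketch below assumes frames arrive as plain JSON text (without the explanatory comments shown above); `SubtitleSegment`, `parse_segment`, and `transcript_line` are hypothetical names, and skipping partial segments is an assumption rather than documented behavior.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubtitleSegment:
    id: str
    text: str
    lang: str
    source_lang: str
    source_text: str
    timestamp: int
    is_partial: bool
    speaker: Optional[str] = None  # optional in the schema


def parse_segment(raw: str) -> SubtitleSegment:
    """Parse one WebSocket text frame into a SubtitleSegment."""
    data = json.loads(raw)
    return SubtitleSegment(
        id=data["id"],
        text=data["text"],
        lang=data["lang"],
        source_lang=data["source_lang"],
        source_text=data["source_text"],
        timestamp=int(data["timestamp"]),
        is_partial=bool(data["is_partial"]),
        speaker=data.get("speaker"),
    )


def transcript_line(seg: SubtitleSegment) -> Optional[str]:
    """Format a final segment as a [speaker, lang] transcript line.

    Assumption: partial segments are dropped so the transcript only
    contains settled text.
    """
    if seg.is_partial:
        return None
    speaker = seg.speaker or "unknown"
    return f"[{speaker}, {seg.source_lang}] {seg.source_text}"
```

The line is built from `source_text` and `source_lang` rather than the translated `text`, so the agent sees what the speaker actually said.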

How the agent consumes it

The agent opens one consumer WebSocket per active language (typically ?lang=en + ?lang=pl for a Polish/English Agentic Gathering). Behavior:

Net result

Zero new STT spend if the Agentic Gathering is already running the translation app, because we reuse its transcripts. The agent only pays for LLM, TTS, and avatar.

Multi-language reply — how the agent picks a language

Three contexts the agent has to handle differently:

Context | Language rule | Implementation
Replying to a specific guest (their question, their name addressed) | Reply in their source language, identified by source_lang on their incoming segment. | System prompt rule: “Reply in the language of the most recent speaker you are addressing. Their language tag will be in the transcript.” Claude handles this natively.
Broadcast / table announcement (toasts, course transitions, opening remarks) | Reply in the configured table_language (default: dominant active language; configurable per Agentic Gathering). | Programmatic command from control app: { "cmd": "toast", "language": "pl", "subject": "Maria's promotion" }. Or system prompt default if unspecified.
Bilingual moment (mixed table, host wants both languages) | Speak twice (once in each language), or pick one and let the translation app fan out the rest to earpieces. | Tool call broadcast_in_languages(["en", "pl"]). Generates two TTS clips back-to-back.
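
The three rules in the table can be condensed into one selection function. This is a sketch only: the context labels (`direct_reply`, `broadcast`, `bilingual`) and the function name are hypothetical, and falling back to the table language when a guest's tag is missing is an assumption.

```python
from typing import List, Optional, Sequence


def pick_reply_languages(
    context: str,
    table_language: str,
    addressed_source_lang: Optional[str] = None,
    active_languages: Sequence[str] = (),
) -> List[str]:
    """Return the languages the host should speak in, in speaking order."""
    if context == "direct_reply":
        # Answer a guest in their own source language; fall back to the
        # table language if the segment carried no language tag.
        return [addressed_source_lang or table_language]
    if context == "broadcast":
        # Toasts and announcements use the configured table language.
        return [table_language]
    if context == "bilingual":
        # Speak once per active language, table language first, no repeats.
        ordered = [table_language, *active_languages]
        return list(dict.fromkeys(ordered))
    raise ValueError(f"unknown context: {context!r}")
```

For a Polish/English table with `table_language="en"`, a bilingual moment yields English first and Polish second, which corresponds to the two back-to-back clips described in the last row.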

The TTS voice stays the same across languages. ElevenLabs Multilingual v2 holds a single voice ID through Polish, English, Spanish, etc. — the host has one identity that switches language without sounding like a different person. (Cartesia Sonic was the cheaper TTS option but its Polish quality is noticeably weaker; on a Polish-speaking household with Sonia, ElevenLabs is worth the spend.)

Polish support — quality assessment

Component | Polish quality | Notes
Deepgram Nova-3 STT (or Nova-2 via ALPUCA) | excellent | Both Nova-2 and Nova-3 are first-class on Polish. Nova-3 multilingual streaming is the path forward; Nova-2 EU is what ALPUCA currently uses.
DeepL translation (in the existing pipeline) | excellent | Polish is one of DeepL’s strongest languages, called out as “standout” in LIVE-TRANSLATION.html. We mostly bypass it because the agent reasons over source_text directly, but it stays useful for back-translating the agent’s replies to non-Polish guests.
Claude Sonnet 4.6 (generation) | excellent | Native Polish output is fluent, idiomatic, and culturally aware. No prompt-engineering tricks required; “reply in Polish” is enough.
ElevenLabs Multilingual v2 TTS | very good | Polish is officially supported, prosody is natural. Some slight stress-pattern errors on rare loanwords; 95% indistinguishable from native.
Cartesia Sonic TTS | limited | Polish support is recent and weaker than English. Mentioned only because it’s cheaper; not the recommended pick for a Polish-speaking host.
Simli avatar lip-sync | excellent | Lip-sync is phoneme-driven from the TTS audio waveform, so it is language-agnostic. Works on Polish identically to English.

Whisper usage in the repo

Direct grep of the codebase:

Conclusion: Whisper is not on the critical path for either the existing translation app or the AI host. Don’t introduce it.

Current 10-minute cost snapshot

Scenario | Humans | Simli | Cost
Interview | 1 | No | $0.24
Agentic Gathering | 4 | No | $0.25
Agentic Gathering | 10 | No | $0.28
Agentic Gathering | 4 | Yes | $3.26
Agentic Gathering | 10 | Yes | $3.29

Assumptions: Gemini Live $0.005/min audio input + $0.018/min audio output, LiveKit WebRTC $0.0005/participant-min, Simli $0.30/min, and the worker remains self-hosted on Oracle Phoenix.
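
The snapshot can be reproduced from the listed rates. The sketch below rests on two assumptions that are not stated in the table: the agent is billed as one LiveKit participant, and an enabled Simli avatar is billed as one more. `session_cost` is a hypothetical helper name.

```python
from decimal import Decimal

GEMINI_IN = Decimal("0.005")   # USD per minute, audio input
GEMINI_OUT = Decimal("0.018")  # USD per minute, audio output
LIVEKIT = Decimal("0.0005")    # USD per participant-minute
SIMLI = Decimal("0.30")        # USD per minute per avatar session


def session_cost(humans: int, minutes: int = 10, simli: bool = False) -> Decimal:
    """Estimate the cost in USD of one session at the rates above."""
    # Assumption: the agent counts as one participant, the avatar as another.
    participants = humans + 1 + (1 if simli else 0)
    cost = (GEMINI_IN + GEMINI_OUT) * minutes
    cost += LIVEKIT * participants * minutes
    if simli:
        cost += SIMLI * minutes
    return cost
```

Under these assumptions the function gives 0.24 for the interview, 0.255 and 0.285 for the 4- and 10-person gatherings, and 3.26 and 3.29 with Simli, which agrees with the table to within a cent.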

Compute & room hardware

Agent process (where the Python program runs)

Room hardware (per Agentic Gathering)

Latency budget (cascade path)

mic capture        ~30 ms
WebRTC up          ~50 ms
Deepgram STT       ~300 ms
Claude LLM         ~500 ms (first token, prompt cache hit)
ElevenLabs Flash   ~75 ms (first audio chunk)
Simli avatar       ~200 ms
WebRTC down        ~50 ms
─────────────────────────
total              ~1.2 s round-trip

Comfortable for Agentic Gathering banter (people pause that long anyway). Sub-1 s would require switching to OpenAI Realtime + avatar — pricier and not necessary.
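
The budget above is a straight sum of per-stage estimates, which a few lines make easy to re-run when a stage changes. The figures are copied from the list; the dictionary and function names are illustrative only.

```python
from typing import Dict

# Per-stage latency estimates in milliseconds, taken from the budget above.
LATENCY_MS: Dict[str, int] = {
    "mic capture": 30,
    "WebRTC up": 50,
    "Deepgram STT": 300,
    "Claude LLM (first token)": 500,
    "ElevenLabs Flash (first audio chunk)": 75,
    "Simli avatar": 200,
    "WebRTC down": 50,
}


def total_latency_ms(budget: Dict[str, int]) -> int:
    """Sum all stages, assuming they run strictly one after another."""
    return sum(budget.values())


def slowest_stage(budget: Dict[str, int]) -> str:
    """Name the stage that contributes the most latency."""
    return max(budget, key=lambda stage: budget[stage])
```

The stages sum to 1205 ms, the ~1.2 s quoted above, with the LLM first token as the largest single contributor.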

Build phases

~300–500 lines of Python on top of livekit-agents. Reference implementations exist for room joining, plugin wiring, and the SIP integration; the original work is the wake-gate, the persona prompt, the per-mic transcript merging, and the translation-app bridge.

# | Phase | Scope | Time
1 ✅ | One-mic spike | Local laptop. Python agent joins a LiveKit room, transcribes one mic via Deepgram, replies via ElevenLabs. No avatar yet. ✅ scaffold + entrypoint shipped 2026-05-06; ops guide at /docs/reference/AI-HOST. | ½ day
2 | Multi-participant | Subscribe to all participant audio tracks separately. Per-mic STT streams. Shared rolling transcript with [speaker, lang] tags. | 1 day
3 | Wake-gate | Rule-based first (silence ≥ N seconds + recent question, host name mentioned). Then upgrade to Haiku 4.5 classifier on the rolling transcript. | 1–2 days
4 | Programmatic prompt channel | Listen on LiveKit data channel for {cmd, ...} messages. Build a tiny iOS/web control UI (could live in the intranet at /en/relations/dinner-host) that joins the room as a control participant. | 1 day
5 | Avatar (Simli + Live2D-style art) | Generate or commission the character art (single PNG, front-facing, neutral expression). Plug Simli LiveKit plugin into the TTS audio stream. Iterate on persona system prompt and voice. | 2–3 days
6 | Translation app integration | Subscribe to wss://subs.sponicgardens.com/subtitles?lang=... consumers (one per active language). Wire source_text + source_lang into the rolling transcript. Optionally publish agent replies via POST /subtitles/inject for back-translation. | 1 day
7 | Production deploy | Move agent process to Oracle Phoenix. Systemd unit. Real Agentic Gathering dry-run with ~3 people. Record + review for persona polish. | 1–2 days
8 | Phone interview transport (SIP) | See dedicated section below. | ~1 week

Estimated total to working Agentic Gathering host prototype (Phases 1–7): 1–2 weeks.
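
The rule-based wake-gate from Phase 3 could start as small as the sketch below. It reads the rule as two independent triggers (the host's name was mentioned, or the table went quiet for N seconds after a question), which is one interpretation of the phase description. The `Utterance` type, the default of 2 seconds for N, the 3-utterance lookback, and treating a trailing "?" as a question are all placeholder assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Utterance:
    speaker: str
    lang: str
    text: str
    end_time: float  # seconds on any monotonic clock


def should_speak(
    transcript: Sequence[Utterance],
    now: float,
    host_name: str,
    min_silence_s: float = 2.0,  # the "N" in the plan; placeholder value
    lookback: int = 3,           # how many recent utterances to inspect
) -> bool:
    """Decide whether the host should take a turn right now."""
    if not transcript:
        return False
    recent = transcript[-lookback:]
    # Trigger 1: someone recently addressed the host by name.
    if any(host_name.lower() in u.text.lower() for u in recent):
        return True
    # Trigger 2: a recent question followed by enough silence.
    silence = now - transcript[-1].end_time
    asked_question = any(u.text.rstrip().endswith("?") for u in recent)
    return silence >= min_silence_s and asked_question
```

Keeping the gate a pure function of the rolling transcript and a clock makes it easy to replay recorded sessions against it, and to swap in a classifier later behind the same signature.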

Phase 8 — outbound phone interviewer

Vapi shortcut — evaluate before building LiveKit SIP from scratch

The Sponic repo already contains a Vapi integration at apps/control/supabase/functions/vapi-server/index.ts + apps/control/public/spaces/admin/voice.html + Supabase tables vapi_config / voice_assistants. However: per the user (2026-05-06), this code was ported from the alpacapps repo and is currently tied to a different Vapi account that’s lightly used. Directive: Sponic must own all code and credentials — no runtime dependencies on alpacapps. So: either provision a Sponic-owned Vapi account and reuse the existing handler (faster path, ~2 days to wire), or build LiveKit SIP from scratch (cleaner architecture, ~1 week per the plan below). Decide before starting Phase 8. The LiveKit-SIP plan below is unchanged but optional.

Same agent process, second transport. The agent doesn’t care whether audio comes from a WebRTC participant or a SIP call — LiveKit Agents abstracts both behind the same room model.

Components added

Component | Pick | Notes
SIP bridge | LiveKit SIP service (Cloud) | Bridges PSTN audio in/out of LiveKit rooms. Officially supported.
SIP trunk | Twilio Elastic SIP Trunking | Default choice; great docs and the rest of Sponic already uses Resend (different, but same operational style). Telnyx is the cheaper alternative if call volume justifies switching.
Phone number | Twilio US local number | ~$1.15/month for the number. Pick a US area code matching where most candidates live; for Poland-resident candidates, a Polish local DID (~$3–5/mo).
Outbound dialer | Internal HTTP endpoint on the agent: POST /interview/start | Body: { phone, persona, prelude, candidate_id }. Agent creates a fresh LiveKit room, places the SIP call, joins as the only other participant.
Inbound handler | LiveKit SIP “dispatch rule” | Incoming calls to our number create a room and the agent joins. Useful for “call us back” flows.
Recording & storage | LiveKit room recording → R2 bucket | WAV per call, one transcript JSON per call, deposited under r2://sponic-call-recordings/<date>/<candidate_id>.{wav,json}.
Post-call processing | Edge function: transcript → Claude summary → structured fields → intranet | Writes back to relations_contacts or staff_recruiting (depending on call type). Triggers a notification email to the assignee.
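
The outbound dialer body and the recording layout from the table can be sketched as a validated request type plus a key builder. Only the four field names and the `r2://sponic-call-recordings/<date>/<candidate_id>.{wav,json}` pattern come from the table; the E.164 phone check, the ISO date format for `<date>`, and the names `InterviewStartRequest` and `recording_keys` are assumptions for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Any, Dict, Tuple

# Assumption: phone numbers are submitted in E.164 form, e.g. +15551234567.
E164 = re.compile(r"^\+[1-9]\d{6,14}$")
REQUIRED = ("phone", "persona", "prelude", "candidate_id")


@dataclass(frozen=True)
class InterviewStartRequest:
    phone: str
    persona: str
    prelude: str
    candidate_id: str

    @classmethod
    def from_json(cls, body: Dict[str, Any]) -> "InterviewStartRequest":
        """Validate a decoded POST /interview/start body."""
        missing = [key for key in REQUIRED if key not in body]
        if missing:
            raise ValueError(f"missing fields: {', '.join(missing)}")
        phone = str(body["phone"])
        if not E164.match(phone):
            raise ValueError(f"phone is not E.164: {phone!r}")
        return cls(phone, str(body["persona"]), str(body["prelude"]),
                   str(body["candidate_id"]))


def recording_keys(req: InterviewStartRequest, call_date: date) -> Tuple[str, str]:
    """Return the (wav, json) storage keys for one call."""
    # Assumption: <date> is rendered as YYYY-MM-DD.
    base = f"r2://sponic-call-recordings/{call_date.isoformat()}/{req.candidate_id}"
    return f"{base}.wav", f"{base}.json"
```

Rejecting a bad number before any room is created means no SIP call is attempted on invalid input.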

Code changes to the agent

Per-call cost (30 minutes outbound)

Item | Calc | Cost
Twilio SIP outbound | 30 min × $0.0085/min | $0.26
Twilio phone number | amortized $1.15/mo over say 20 calls/month | $0.06
LiveKit Cloud | 2 ppt × 30 min | $0.03
Deepgram Nova-3 multilingual | 30 min × $0.0077/min | $0.23
Claude Sonnet 4.6 | ~5K cached input + ~3K output (interviewer is more talkative than Agentic Gathering host) | ~$0.10
ElevenLabs Flash v2.5 | ~1500 words spoken (~7.5K chars) | ~$0.50
Total per 30-min interview | | ~$1.20

For comparison: a human interviewer at $50/hour costs $25 for the same call — 20× the AI agent.
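
The per-call arithmetic can be re-derived from the rates in the table. In this sketch the Claude and ElevenLabs figures are carried over as the flat estimates quoted above, not recomputed from token or character pricing; `interview_cost` is a hypothetical helper name.

```python
from decimal import Decimal
from typing import Dict


def interview_cost(minutes: int = 30, calls_per_month: int = 20) -> Dict[str, Decimal]:
    """Itemize the estimated cost in USD of one outbound interview."""
    return {
        "Twilio SIP outbound": Decimal("0.0085") * minutes,
        "Twilio phone number": Decimal("1.15") / calls_per_month,
        "LiveKit Cloud": Decimal("0.0005") * 2 * minutes,  # 2 participants
        "Deepgram Nova-3": Decimal("0.0077") * minutes,
        "Claude Sonnet 4.6": Decimal("0.10"),      # flat estimate from the table
        "ElevenLabs Flash v2.5": Decimal("0.50"),  # flat estimate from the table
    }


items = interview_cost()
total = sum(items.values(), Decimal("0"))
human = Decimal("50") * 30 / 60  # $50/hour interviewer for 30 minutes
ratio = human / total
```

The unrounded total is 1.1735, which the table rounds up to about $1.20; against a $25 human call the ratio lands between 20 and 22, in line with the "20×" comparison.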

Compliance & safety

Phase 8 build steps

  1. Set up Twilio Elastic SIP trunk + provision a US phone number. Store credentials in BW DevOps-sponicgarden. (~2 hours)
  2. Configure LiveKit SIP integration (Cloud config + dispatch rule for inbound, outbound trunk for outbound). (~4 hours)
  3. Add POST /interview/start endpoint on the agent. Refactor agent code to take a transport parameter (room-only vs SIP-call). (~1 day)
  4. Build inbound handler (incoming call → new room → agent joins). (~½ day)
  5. Author interviewer persona system prompt + tool definitions per call type. (~1 day)
  6. Post-call processing edge function: transcript → Claude summary → relations_contacts/staff_recruiting update. (~1 day)
  7. End-to-end testing with own phone numbers. Record consent prologue. Sanity-check time-of-day enforcement. (~1 day)
  8. Rollout: first batch of 5 candidate interviews monitored manually, iterate. (~ongoing)

Total Phase 8: ~1 week of focused work, plus a monitored rollout.

Decisions locked

Risks & open questions

Future work

Reference links

Doc owner: Rahul. Drafted 2026-05-06. Operational secrets in BW collection DevOps-sponicgarden (ALPU.CA org); full token / endpoint / ID index lives in the auto-memory at ~/.claude/projects/-Users-rahulio-Documents-CodingProjects-sponic/memory/service-access.md.