AI host — LiveKit Agents + Simli avatar
TL;DR
Current production stack: LiveKit Cloud rooms · self-hosted LiveKit Agents worker on Oracle Phoenix · Gemini Live native audio (gemini-3.1-flash-live-preview) · optional Simli avatar. The earlier Deepgram → Claude → ElevenLabs cascade below is historical planning context, not the current runtime.
Costs: Gemini Live + LiveKit participant minutes are about $0.24–$0.28 for a 10-minute interview / Agentic Gathering without Simli. Simli adds about $3.00 per 10 minutes per avatar session.
Polish support: current path depends on Gemini Live's native multilingual audio. Whisper usage in repo: none in the LiveKit host path; ALPUCA's subtitle server uses Whisper.cpp as a local STT fallback for live subtitles.
Build: 7 phases for the Agentic Gathering host (~1–2 weeks), Phase 8 adds outbound phone calls (~1 week).
Web client — no install, works on iPhone
A vanilla HTML/JS page at sponicgardens.com/gather/ mirrors the native Android DinnerHostScreen: Google Sign-In, silent token re‑auth, livekit-token Edge Function exchange, mic publish, animated mic-level ring, auto-end when the agent leaves. Source: apps/garden/gather/. Same Web OAuth client ID as Android (801803827261-259t…); the production origin must be authorized in GCP → Credentials.
This unblocks iPhone users immediately while the native iOS+macOS app is built. iOS Safari note: keep the tab in the foreground — locking the screen suspends WebRTC. AirPods or wired headphones recommended; built-in echo cancellation handles speakerphone OK.
Use cases
- Agentic Gathering host — AI character at the head of the table, displayed on a TV/tablet, listens to all guest mics through the existing Sponic translation app, jumps in to ask questions, propose toasts, surface guest connections, narrate course transitions. Programmatically promptable from an iPhone control app (“now toast the chef”).
- Phone interviewer (Phase 8) — the same agent persona makes outbound calls for staff recruiting, partner BD intake (MOST, Bujna Warszawa, etc.), member onboarding chats, and post-event follow-ups. Posts the structured transcript + summary back into the intranet's relations / recruiting tables.
- Future: shared persona & memory across both modalities — a guest the agent met at an Agentic Gathering can be recalled in a follow-up phone interview.
Architecture — Agentic Gathering host
Component picks
| Layer | Pick | Why |
|---|---|---|
| WebRTC SFU | LiveKit Cloud | Native multi-participant rooms + SIP bridge for Phase 8. Free tier covers prototyping; paid is ~$0.50 per 1000 participant-minutes. |
| Capture / display | Android, Apple, web, and table display clients | All join the same LiveKit room with per-room JWTs minted by Supabase Edge Functions. |
| Realtime AI | Gemini Live native audio | One model handles speech input, reasoning, and speech output. Current default is gemini-3.1-flash-live-preview with voice Puck. |
| Prompting | Supabase prompt library + dossiers | Persona prompts load at session start from public.prompts; Agentic Gathering sessions can append participant dossiers. |
| Avatar | Simli optional | Native LiveKit integration, driven from the generated Gemini Live audio. Current tracked rate is SIMLI_USD_PER_MINUTE=$0.30. |
| Programmatic channel | LiveKit data channels | JSON commands over the same WebRTC connection (sub-100 ms). HTTP webhook on the agent as a fallback for external integrations. |
| Subtitle integration | https://subs.sponicgardens.com | Separate live-subtitle backend for subtitles and recordings. It is no longer the core Agentic Gathering host STT path. |
| Compute | Oracle Phoenix VM (already paid) | Agent process is mostly API-call orchestration — 2 vCPU / 2–4 GB RAM is plenty. Local MacBook is fine for the first prototype. |
Historical note: this section originally compared a separate STT → LLM → TTS cascade against OpenAI Realtime. Production has since moved to Gemini Live native audio, so cost/model decisions should use the current cost table below and the AI host reference.
Translation-app integration
The existing Sponic live-translation pipeline already does per-mic ASR + translation and broadcasts subtitles over a WebSocket. We don’t need a parallel STT pipeline — the agent subscribes to the same WebSocket as a consumer.
What the existing app already gives us
- Per-mic source language. Each phone running the translation app picks its source language manually (no auto-detect). The selection lives in AudioSettings.kt and is sent to the server. See apps/mobile/app/src/main/kotlin/com/sponicgardens/sponic/network/SubtitleClient.kt:100.
- Streaming transcripts with translation. The ALPUCA subtitle server (Mac mini in Poland, http://Alpuca.local:8910 on LAN, https://subs.sponicgardens.com public) runs Deepgram Nova-2 EU as primary STT (Whisper.cpp local fallback) and DeepL / Azure Translator as translator. It emits one WebSocket per output language.
- 9 supported languages: en, pl, es, fr, de, pt, it, hi, ar (see Models.kt:34-44).
WebSocket message schema (consumed verbatim)
// wss://subs.sponicgardens.com/subtitles?lang=en
{
  "id": "seg_001",
  "text": "Welcome to Sponic Gardens",   // already translated to ?lang=
  "lang": "en",
  "source_lang": "pl",                    // what the speaker actually said
  "source_text": "Witamy w Sponic Gardens",
  "speaker": "mic-3",                     // optional, present when known
  "timestamp": 1711800000,
  "is_partial": false
}
How the agent consumes it
The agent opens one consumer WebSocket per active language (typically ?lang=en + ?lang=pl for a Polish/English Agentic Gathering). Behavior:
- Use source_text + source_lang + speaker to build the rolling transcript fed to Claude. (Translated text is reference, not ground truth for reasoning.)
- Tag each entry: [mic-3, pl] Witamy w Sponic Gardens (en: "Welcome to Sponic Gardens").
- De-dupe: drop is_partial: true messages once a final-segment message with the same id arrives.
- When the agent speaks, it can optionally publish its reply text back into the translation pipeline (via the existing POST /subtitles/inject endpoint), so the translation app fans it out to non-English guests via their existing earpieces, with no new transport needed.
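The consumption rules above can be sketched in a few lines of Python. This is a minimal illustration, not the shipped agent code: the class name RollingTranscript, the helper format_entry, and the max_entries cap are hypothetical, and the sketch assumes a single consumer stream (merging the same segment across several ?lang= sockets is not handled).

```python
from dataclasses import dataclass, field


def format_entry(msg: dict) -> str:
    """Render one segment as '[speaker, source_lang] source_text (lang: "text")'."""
    speaker = msg.get("speaker", "unknown")  # "speaker" is optional in the schema
    line = f'[{speaker}, {msg["source_lang"]}] {msg["source_text"]}'
    if msg["lang"] != msg["source_lang"]:
        # Translated text is kept only as a reference annotation.
        line += f' ({msg["lang"]}: "{msg["text"]}")'
    return line


@dataclass
class RollingTranscript:
    """Keeps one entry per segment id; a final segment replaces its partials."""
    max_entries: int = 200  # assumption: cap chosen for illustration only
    _segments: dict = field(default_factory=dict, repr=False)

    def ingest(self, msg: dict) -> None:
        seg_id = msg["id"]
        existing = self._segments.get(seg_id)
        # A late partial must never overwrite a final segment with the same id.
        if existing is not None and not existing["is_partial"] and msg["is_partial"]:
            return
        self._segments[seg_id] = msg  # same key keeps its original position
        while len(self._segments) > self.max_entries:
            del self._segments[next(iter(self._segments))]  # drop the oldest

    def lines(self, include_partials: bool = False) -> list:
        return [
            format_entry(m)
            for m in self._segments.values()
            if include_partials or not m["is_partial"]
        ]
```

Feeding a partial and then the final message for seg_001 leaves exactly one line in the transcript, in the tagged form shown in the list above.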
Net result
Zero new STT spend if the Agentic Gathering is already running the translation app, because we reuse its transcripts. The agent only pays for LLM, TTS, and avatar.
Multi-language reply — how the agent picks a language
Three contexts the agent has to handle differently:
| Context | Language rule | Implementation |
|---|---|---|
| Replying to a specific guest (their question, their name addressed) | Reply in their source language, identified by source_lang on their incoming segment. | System prompt rule: “Reply in the language of the most recent speaker you are addressing. Their language tag will be in the transcript.” Claude handles this natively. |
| Broadcast / table announcement (toasts, course transitions, opening remarks) | Reply in the configured table_language (default: dominant active language; configurable per Agentic Gathering). | Programmatic command from control app: { "cmd": "toast", "language": "pl", "subject": "Maria's promotion" }. Or system prompt default if unspecified. |
| Bilingual moment (mixed table, host wants both languages) | Speak twice, once in each language, or pick one and let the translation app fan out the rest to earpieces. | Tool call broadcast_in_languages(["en", "pl"]). Generates two TTS clips back-to-back. |
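The three rules in the table can be expressed as one small selection function. The sketch below is illustrative only: the function names, the context labels ("direct", "broadcast", "bilingual"), and the parameter names are assumptions, and the precedence for broadcasts (command language, then table_language, then the dominant active language) is one reading of the table rather than confirmed behavior.

```python
from collections import Counter


def dominant_language(recent_source_langs):
    """Most frequent source_lang among recent segments (ties: first seen wins)."""
    if not recent_source_langs:
        raise ValueError("no recent segments to infer a language from")
    return Counter(recent_source_langs).most_common(1)[0][0]


def pick_reply_languages(context, *, addressed_source_lang=None,
                         command_language=None, table_language=None,
                         recent_source_langs=(), bilingual=("en", "pl")):
    """Return the ordered list of languages the host should speak in."""
    if context == "direct":
        # Replying to one guest: use the source_lang of their segment.
        if addressed_source_lang is None:
            raise ValueError("direct replies need the addressed guest's source_lang")
        return [addressed_source_lang]
    if context == "broadcast":
        # Explicit command wins, then the configured table language,
        # then whatever language dominates the recent transcript.
        lang = (command_language or table_language
                or dominant_language(list(recent_source_langs)))
        return [lang]
    if context == "bilingual":
        # Speak once per language, in the configured order.
        return list(bilingual)
    raise ValueError(f"unknown context: {context!r}")
```

For example, a toast command carrying "language": "pl" yields ["pl"] even when table_language is "en", while a broadcast with no command and no table language falls back to the most common source_lang in the recent transcript.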
The TTS voice stays the same across languages. ElevenLabs Multilingual v2 holds a single voice ID through Polish, English, Spanish, etc., so the host has one identity that switches language without sounding like a different person. (Cartesia Sonic was the cheaper TTS option but its Polish quality is noticeably weaker; for a Polish-speaking household with Sonia, ElevenLabs is worth the spend.)
Polish support — quality assessment
| Component | Polish quality | Notes |
|---|---|---|
| Deepgram Nova-3 STT (or Nova-2 via ALPUCA) | excellent | Both Nova-2 and Nova-3 are first-class on Polish. Nova-3 multilingual streaming is the path forward; Nova-2 EU is what ALPUCA currently uses. |
| DeepL translation (in the existing pipeline) | excellent | Polish is one of DeepL’s strongest languages — called out as “standout” in LIVE-TRANSLATION.html. We mostly bypass it because the agent reasons over source_text directly, but it stays useful for back-translating the agent’s replies to non-Polish guests. |
| Claude Sonnet 4.6 (generation) | excellent | Native Polish output is fluent, idiomatic, and culturally aware. No prompt-engineering tricks required — just “reply in Polish.” |
| ElevenLabs Multilingual v2 TTS | very good | Polish is officially supported, prosody is natural. Some slight stress-pattern errors on rare loanwords; 95% indistinguishable from native. |
| Cartesia Sonic TTS | limited | Polish support is recent and weaker than English. Mentioned only because it’s cheaper — not the recommended pick for a Polish-speaking host. |
| Simli avatar lip-sync | excellent | Lip-sync is phoneme-driven from the TTS audio waveform — language-agnostic. Works on Polish identically to English. |
Whisper usage in the repo
Direct grep of the codebase:
- No Whisper in the live-translation pipeline on the mobile app side. The Kotlin app uses Android SpeechRecognizer for local fallback and streams raw PCM to the ALPUCA backend for primary transcription. See apps/mobile/app/src/main/kotlin/com/sponicgardens/sponic/audio/AudioRecorder.kt:94-100.
- ALPUCA subtitle server uses Whisper.cpp as a fallback, not as the primary STT. Deepgram Nova-2 EU is primary; Whisper.cpp (medium / large-v3) runs locally on the Mac mini when Deepgram is unreachable or the user wants offline mode. The server’s code lives outside this repo (on ALPUCA itself).
- One unrelated hit: apps/control/supabase/functions/generate-whispers/index.ts. This generates “whisper templates” for the Pakucha spirit-AI feature. It uses Gemini, not OpenAI Whisper. The shared word is incidental.
Conclusion: Whisper is not on the critical path for either the existing translation app or the AI host. Don’t introduce it.
Current 10-minute cost snapshot
| Scenario | Humans | Simli | Cost |
|---|---|---|---|
| Interview | 1 | No | $0.24 |
| Agentic Gathering | 4 | No | $0.25 |
| Agentic Gathering | 10 | No | $0.28 |
| Agentic Gathering | 4 | Yes | $3.26 |
| Agentic Gathering | 10 | Yes | $3.29 |
Assumptions: Gemini Live $0.005/min audio input + $0.018/min audio output, LiveKit WebRTC $0.0005/participant-min, Simli $0.30/min, and the worker remains self-hosted on Oracle Phoenix.
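The snapshot can be reproduced from the stated rates with a short calculation. The sketch below adds three assumptions that are not stated in the source and are inferred from the table values: the agent worker counts as one LiveKit participant, the Simli avatar counts as one more, and both Gemini audio directions are billed for the full session length. The function and constant names are hypothetical.

```python
# Rates taken from the assumptions line above (USD per minute).
GEMINI_IN_PER_MIN = 0.005
GEMINI_OUT_PER_MIN = 0.018
LIVEKIT_PER_PARTICIPANT_MIN = 0.0005
SIMLI_PER_MIN = 0.30


def session_cost(minutes: float, humans: int, simli: bool = False) -> float:
    """Estimated USD cost of one session on the self-hosted worker."""
    # Assumption: the agent is one participant, the avatar (if any) another.
    participants = humans + 1 + (1 if simli else 0)
    per_minute = (
        GEMINI_IN_PER_MIN
        + GEMINI_OUT_PER_MIN
        + LIVEKIT_PER_PARTICIPANT_MIN * participants
    )
    if simli:
        per_minute += SIMLI_PER_MIN
    return minutes * per_minute


if __name__ == "__main__":
    for humans, simli in [(1, False), (4, False), (10, False), (4, True), (10, True)]:
        print(humans, simli, session_cost(10, humans, simli))
```

Under these assumptions the five scenarios come out at roughly 0.24, 0.255, 0.285, 3.26, and 3.29 dollars, which agrees with the table to within a cent; the two middle values sit on a rounding boundary, so the displayed cents depend on how the figure is rounded.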
Compute & room hardware
Agent process (where the Python program runs)
- 2 vCPU, 2–4 GB RAM, ~1 GB disk. All heavy lifting is API calls to STT/LLM/TTS/avatar; the agent itself is glue.
- Where to run: Oracle Phoenix VM (already running other Sponic workers). Adds zero monthly cost.
- Latency to providers: Phoenix is in US West — check round-trip to Deepgram (us-east) and Anthropic (us-west) before locking in. If Deepgram us-east is too slow, switch to Deepgram EU (the ALPUCA default).
- Fallback: Fly.io / Railway machine (~$5/mo) or local MacBook during the actual Agentic Gathering.
Room hardware (per Agentic Gathering)
- Mics: existing NEEWER CM31s + Android phones running the Sponic translation app (already validated path).
- Display: any device that can run the LiveKit web client — iPad, Android tablet, Mac mini → TV, or a phone Chromecasting to a TV. Avatar video is 1–3 Mbps so any modern device works.
- Speaker: built-in display speakers, or a Bluetooth speaker, or route the agent’s voice through the existing translation-app earpieces (handles per-language fan-out automatically).
- Bandwidth: ~2.5 Mbps total (8 mics × 32 kbps up + 2 Mbps avatar down). Any home WiFi.
Latency budget (cascade path)
mic capture ~30 ms
WebRTC up ~50 ms
Deepgram STT ~300 ms
Claude LLM ~500 ms (first token, prompt cache hit)
ElevenLabs Flash ~75 ms (first audio chunk)
Simli avatar ~200 ms
WebRTC down ~50 ms
─────────────────────────
total ~1.2 s round-trip
Comfortable for Agentic Gathering banter (people pause that long anyway). Sub-1 s would require switching to OpenAI Realtime + avatar — pricier and not necessary.
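As a quick check on the budget, the stage estimates can be summed and the dominant stage identified. The figures are copied from the budget above; the dictionary and function names are only for illustration.

```python
# Per-stage latency estimates for the historical cascade path, in milliseconds.
CASCADE_BUDGET_MS = {
    "mic capture": 30,
    "WebRTC up": 50,
    "Deepgram STT": 300,
    "Claude LLM (first token)": 500,
    "ElevenLabs Flash (first chunk)": 75,
    "Simli avatar": 200,
    "WebRTC down": 50,
}


def total_ms(budget: dict) -> int:
    """Sum of all stage latencies."""
    return sum(budget.values())


def largest_stage(budget: dict) -> str:
    """Name of the stage contributing the most latency."""
    return max(budget, key=budget.get)


def total_without(budget: dict, stage: str) -> int:
    """Round-trip estimate if one stage were removed entirely."""
    return total_ms(budget) - budget[stage]
```

The stages add up to 1205 ms, with LLM first-token time as the largest single contributor. Dropping the avatar stage alone would still leave about 1005 ms, which is consistent with the note that sub-1 s needs a different model path rather than small tuning.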
Build phases
~300–500 lines of Python on top of livekit-agents. Reference implementations exist for room joining, plugin wiring, and the SIP integration; the original work is the wake-gate, the persona prompt, the per-mic transcript merging, and the translation-app bridge.
| # | Phase | Scope | Time |
|---|---|---|---|
| 1 ✓ | One-mic spike | Local laptop. Python agent joins a LiveKit room, transcribes one mic via Deepgram, replies via ElevenLabs. No avatar yet. ✓ scaffold + entrypoint shipped 2026-05-06; ops guide at /docs/reference/AI-HOST. | ½ day |
| 2 | Multi-participant | Subscribe to all participant audio tracks separately. Per-mic STT streams. Shared rolling transcript with [speaker, lang] tags. | 1 day |
| 3 | Wake-gate | Rule-based first (silence ≥ N seconds + recent question, host name mentioned). Then upgrade to Haiku 4.5 classifier on the rolling transcript. | 1–2 days |
| 4 | Programmatic prompt channel | Listen on LiveKit data channel for {cmd, ...} messages. Build a tiny iOS/web control UI (could live in the intranet at /en/relations/dinner-host) that joins the room as a control participant. | 1 day |
| 5 | Avatar (Simli + Live2D-style art) | Generate or commission the character art (single PNG, front-facing, neutral expression). Plug Simli LiveKit plugin into the TTS audio stream. Iterate on persona system prompt and voice. | 2–3 days |
| 6 | Translation app integration | Subscribe to wss://subs.sponicgardens.com/subtitles?lang=... consumers (one per active language). Wire source_text + source_lang into the rolling transcript. Optionally publish agent replies via POST /subtitles/inject for back-translation. | 1 day |
| 7 | Production deploy | Move agent process to Oracle Phoenix. Systemd unit. Real Agentic Gathering dry-run with ~3 people. Record + review for persona polish. | 1–2 days |
| 8 | Phone interview transport (SIP) | See dedicated section below. | ~1 week |
Estimated total to working Agentic Gathering host prototype (Phases 1–7): 1–2 weeks.
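The rule-based first version of the Phase 3 wake-gate could look roughly like the sketch below. It reads the rule as two triggers: the host name was just mentioned, or the table has been silent for at least N seconds after a recent question. That reading, the Utterance type, the default thresholds, and the "ends with a question mark" heuristic are all assumptions for illustration; the classifier upgrade is not shown.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    text: str
    ended_at: float  # seconds, on the same clock as `now`


def should_speak(history, now, host_name, silence_s=3.0, question_window_s=15.0):
    """Decide whether the host may take a turn, using transcript rules only."""
    if not history:
        return False
    last = history[-1]
    # Trigger 1: someone just addressed the host by name.
    if host_name.lower() in last.text.lower():
        return True
    # Trigger 2: the table has gone quiet...
    if now - last.ended_at < silence_s:
        return False
    # ...and a question was asked recently enough to still be open.
    return any(
        u.text.rstrip().endswith("?") and now - u.ended_at <= question_window_s
        for u in history
    )
```

Keeping the gate as a pure function of the transcript and the clock makes it easy to replay recorded gatherings against different thresholds before swapping in a model-based classifier.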
Phase 8 — outbound phone interviewer
Vapi shortcut — evaluate before building LiveKit SIP from scratch
The Sponic repo already contains a Vapi integration at apps/control/supabase/functions/vapi-server/index.ts + apps/control/public/spaces/admin/voice.html + Supabase tables vapi_config / voice_assistants. However: per the user (2026-05-06), this code was ported from the alpacapps repo and is currently tied to a different Vapi account that’s lightly used. Directive: Sponic must own all code and credentials — no runtime dependencies on alpacapps. So: either provision a Sponic-owned Vapi account and reuse the existing handler (faster path, ~2 days to wire), or build LiveKit SIP from scratch (cleaner architecture, ~1 week per the plan below). Decide before starting Phase 8. The LiveKit-SIP plan below is unchanged but optional.
Same agent process, second transport. The agent doesn’t care whether audio comes from a WebRTC participant or a SIP call — LiveKit Agents abstracts both behind the same room model.
Components added
| Component | Pick | Notes |
|---|---|---|
| SIP bridge | LiveKit SIP service (Cloud) | Bridges PSTN audio in/out of LiveKit rooms. Officially supported. |
| SIP trunk | Twilio Elastic SIP Trunking | Default choice; great docs and the rest of Sponic already uses Resend (different, but same operational style). Telnyx is the cheaper alternative if call volume justifies switching. |
| Phone number | Twilio US local number | ~$1.15/month for the number. Pick a US area code matching where most candidates live; for Poland-resident candidates, a Polish local DID (~$3–5/mo). |
| Outbound dialer | Internal HTTP endpoint on the agent: POST /interview/start | Body: { phone, persona, prelude, candidate_id }. Agent creates a fresh LiveKit room, places the SIP call, joins as the only other participant. |
| Inbound handler | LiveKit SIP “dispatch rule” | Incoming calls to our number create a room and the agent joins. Useful for “call us back” flows. |
| Recording & storage | LiveKit room recording → R2 bucket | WAV per call, one transcript JSON per call, deposited under r2://sponic-call-recordings/<date>/<candidate_id>.{wav,json}. |
| Post-call processing | Edge function: transcript → Claude summary → structured fields → intranet | Writes back to relations_contacts or staff_recruiting (depending on call type). Triggers a notification email to the assignee. |
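To make the dialer body and the recording layout concrete, here is a minimal validation sketch. The field names and the bucket path come from the table; everything else is an assumption: the InterviewRequest class, the requirement that phone numbers be in E.164 form, and the use of an ISO date (YYYY-MM-DD) for the <date> path segment.

```python
import re
from dataclasses import dataclass
from datetime import date

# Assumption: the dialer expects E.164 numbers ("+" then up to 15 digits).
E164 = re.compile(r"^\+[1-9]\d{1,14}$")
REQUIRED = ("phone", "persona", "prelude", "candidate_id")


@dataclass(frozen=True)
class InterviewRequest:
    phone: str
    persona: str
    prelude: str
    candidate_id: str

    @classmethod
    def from_json(cls, body: dict) -> "InterviewRequest":
        """Validate the POST /interview/start body and build a request."""
        missing = [k for k in REQUIRED if k not in body]
        if missing:
            raise ValueError(f"missing fields: {', '.join(missing)}")
        if not E164.match(str(body["phone"])):
            raise ValueError("phone must be in E.164 format, e.g. +14155550123")
        return cls(**{k: str(body[k]) for k in REQUIRED})


def recording_keys(req: InterviewRequest, call_date: date):
    """Return the (wav, json) object paths for one call."""
    # Assumption: <date> is rendered as an ISO date.
    base = f"r2://sponic-call-recordings/{call_date.isoformat()}/{req.candidate_id}"
    return base + ".wav", base + ".json"
```

Rejecting malformed bodies before a room is created avoids paying for a SIP leg that can never connect.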
Code changes to the agent
- No avatar. Audio-only on phone calls: skip Simli entirely, saving ~$0.30/min at the current tracked Simli rate.
- Tighter latency budget. Switch to ElevenLabs Flash v2.5 (lower latency, slightly lower quality but better for phone audio).
- Different system prompt. Interviewer persona, structured question flow per call type (recruiting interview, partner intake, member onboarding).
- New tools: mark_question_complete, move_to_next_topic, flag_for_human_review, save_interview_summary, schedule_followup.
- Recording-consent prologue. Required first turn of every call: “This call will be recorded for note-taking purposes. Is that OK?” Wait for affirmative before continuing.
- Hangup tool. Agent calls end_call() when the interview is complete or the candidate asks to stop.
Per-call cost (30 minutes outbound)
| Item | Calc | Cost |
|---|---|---|
| Twilio SIP outbound | 30 min × $0.0085/min | $0.26 |
| Twilio phone number | amortized $1.15/mo over say 20 calls/month | $0.06 |
| LiveKit Cloud | 2 participants × 30 min × $0.0005/participant-min | $0.03 |
| Deepgram Nova-3 multilingual | 30 min × $0.0077/min | $0.23 |
| Claude Sonnet 4.6 | ~5K cached input + ~3K output (interviewer is more talkative than Agentic Gathering host) | ~$0.10 |
| ElevenLabs Flash v2.5 | ~1500 words spoken (~7.5K chars) | ~$0.50 |
| Total per 30-min interview | | ~$1.20 |
For comparison: a human interviewer at $50/hour costs $25 for the same call — 20× the AI agent.
Compliance & safety
- Call recording consent. US two-party consent states (CA, FL, IL, MD, MA, MT, NH, PA, WA — among others) require both parties to consent. Always include the recording prologue. Polish/EU calls fall under GDPR — explicit consent + right-to-deletion compliant storage.
- Time-of-day restrictions. US TCPA: no outbound calls before 8 AM or after 9 PM in the recipient’s local time. Encode this in the dialer; refuse to place calls outside the window.
- Do Not Call list. For cold outbound only (sales). Sponic interviews are opt-in (recruiting candidates and known partner contacts), so DNC doesn’t apply — but maintain an internal opt-out list anyway.
- Identification. First turn after consent: “Hi, this is the Sponic Gardens AI assistant calling on behalf of Rahul / Sonia —”. Do not pretend to be human if asked.
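The time-of-day and opt-out rules above can be enforced with a small guard in the dialer. This is a sketch under assumptions: the function name and signature are hypothetical, the recipient's time zone is assumed to be known already (for example from the candidate record), and the window is treated conservatively as 8:00 inclusive to 21:00 exclusive.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

EARLIEST = time(8, 0)   # no calls before 8 AM recipient-local time
LATEST = time(21, 0)    # no calls from 9 PM onward (conservative boundary)


def may_place_call(now_utc: datetime, recipient_tz: tzinfo,
                   opted_out: bool = False) -> bool:
    """True only if an outbound call is allowed right now for this recipient."""
    if opted_out:
        return False  # internal opt-out list always wins
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(recipient_tz).time()
    return EARLIEST <= local < LATEST


if __name__ == "__main__":
    # Fixed UTC-8 offset used for illustration only.
    pacific = timezone(timedelta(hours=-8))
    print(may_place_call(datetime(2026, 1, 15, 16, 0, tzinfo=timezone.utc), pacific))
```

In practice a real zone object such as zoneinfo.ZoneInfo("America/Los_Angeles") would be passed instead of a fixed offset so that daylight saving time is handled; the fixed offset here keeps the example free of time-zone database dependencies.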
Phase 8 build steps
- Set up Twilio Elastic SIP trunk + provision a US phone number. Store credentials in BW DevOps-sponicgarden. (~2 hours)
- Configure LiveKit SIP integration (Cloud config + dispatch rule for inbound, outbound trunk for outbound). (~4 hours)
- Add POST /interview/start endpoint on the agent. Refactor agent code to take a transport parameter (room-only vs SIP-call). (~1 day)
- Build inbound handler (incoming call → new room → agent joins). (~½ day)
- Author interviewer persona system prompt + tool definitions per call type. (~1 day)
- Post-call processing edge function: transcript → Claude summary → relations_contacts / staff_recruiting update. (~1 day)
- End-to-end testing with own phone numbers. Record consent prologue. Sanity-check time-of-day enforcement. (~1 day)
- Rollout: first batch of 5 candidate interviews monitored manually, iterate. (~ongoing)
Total Phase 8: ~1 week of focused work, plus a monitored rollout.
Decisions locked
- Avatar style: Simli + Live2D-style character art (single illustrated PNG, anime-leaning, distinct Sponic personality — not photoreal).
- Voice: ElevenLabs Multilingual v2 (single voice ID across all 9 languages). Generic library voice for v1; can clone a custom voice once the persona stabilizes.
- Persona swappability: per Agentic Gathering system prompt (e.g., “Agentic Gathering host: warm, curious, leans into food and gardens” vs “Interview host: structured, professional, fewer jokes”). Trivial to support and worth it.
- Transport unification: Agentic Gathering host (Phase 1–7) and phone interviewer (Phase 8) share the same agent code, persona system, and memory layer. One codebase, two transports.
- STT cost-share: reuse the existing translation app’s WebSocket transcripts during Agentic Gatherings; only pay for Deepgram directly during phone calls (where the translation app isn’t involved).
Risks & open questions
- Wake-gate UX is the main hard problem. If the host responds to every utterance, it dominates the table. If it’s too quiet, why is it there. Budget time for prompt iteration on the gating logic, and add a manual “mute / unmute” toggle on the control app.
- Live2D-style illustrated avatars on Simli — verify before committing. Simli’s “custom character” mode is best on photoreal portraits; results on stylized art vary. Run a one-off test with a single character PNG before locking in Phase 5. If results are poor, fallbacks: (a) more realistic illustrated portraits, (b) Tavus photoreal humans, (c) Live2D-rigged VTuber model with audio-driven lip-sync (more work, full control).
- Translation-app coupling. The agent depends on ALPUCA being up. If ALPUCA goes down mid-gathering, the agent loses the transcript stream. Mitigation: agent falls back to its own per-mic Deepgram subscription using the LiveKit audio tracks (slightly higher cost, same quality).
- Voice cloning for the persona. If we want a specific Sponic-branded voice (matching Sonia or Rahul, or a fully synthetic personality), need ~30 minutes of clean recordings. Defer to v2.
- Phone-call abuse / robocall liability. Twilio numbers used for outbound “AI calls” can get flagged as spam by carriers (STIR/SHAKEN attestation matters). Use Twilio Voice Insights to monitor flagging rate; rotate numbers if reputation degrades.
- Memory across modalities. Storing “the agent met Maria at last week’s Agentic Gathering” for use in a phone interview requires a per-person memory layer. Likely a Supabase table ai_host_memory(person_id, fact, source_call_id, recorded_at). Deferred until basic voice loops work, but worth designing the schema early.
Future work
- Group video calls (not just Agentic Gathering + phone). LiveKit handles this natively — same agent code joins a Zoom-like web meeting.
- Slack/WhatsApp text bridge for follow-up messages from the agent (à la Boardy.ai’s multi-channel setup).
- Cross-call memory layer — Supabase table referenced above.
- Persona library — multiple named hosts (Mr. Sponic the gardener, Spirit-of-the-Garden Pakucha, Interview Mode, etc.) selectable per session.
- Self-hosted LiveKit — if call volume justifies, run LiveKit OSS on Oracle Phoenix to drop the cloud SFU bill. Worth it past ~50K participant-minutes/mo.
Reference links
- LIVE-TRANSLATION.html — existing translation pipeline spec (WebSocket schema, ALPUCA architecture, language list).
- apps/mobile/app/src/main/kotlin/com/sponicgardens/sponic/network/SubtitleClient.kt, the Kotlin client for the subtitle WebSocket (lines 19–88 for connect logic, 63–73 for message format).
- apps/mobile/app/src/main/kotlin/com/sponicgardens/sponic/audio/AudioRecorder.kt, the capture pipeline (lines 94–100 for mic source, 308–313 for AAC encoding).
- LiveKit Agents docs: docs.livekit.io/agents
- LiveKit SIP docs: docs.livekit.io/sip
- Simli LiveKit plugin: docs.livekit.io/agents/integrations/avatar/simli
- Boardy.ai (comparable product) — cascading STT/LLM/TTS architecture, OpenAI brain, undisclosed telephony stack. Good UX reference for phone-interview persona & multi-channel follow-up.
Doc owner: Rahul. Drafted 2026-05-06. Operational secrets in BW collection DevOps-sponicgarden (ALPU.CA org); full token / endpoint / ID index lives in the auto-memory at ~/.claude/projects/-Users-rahulio-Documents-CodingProjects-sponic/memory/service-access.md.