ADR-001: LiveKit over raw WebRTC
Status: Accepted
Date: 2026-06-10
Context
TheInterviews.ai needs real-time audio/video for two distinct surfaces:
- Human meetings — recruiter and candidate in a shared video room, with optional or mandatory recording depending on the job posting's recording policy.
- AI voice interviews — a candidate speaking with an AI interviewer (voice loop: streaming STT → LLM → neural TTS → lip-synced avatar), which also needs to be recordable.
The constraints that shaped the decision:
- Recordings must be server-authoritative. The platform's first AI-interview recorder was 100% client-side: a canvas composited the avatar and candidate picture-in-picture, WebAudio mixed TTS and microphone, and the browser's MediaRecorder uploaded to S3 via presigned multipart. This was fragile (a browser crash lost the recording entirely) and gave the server no authority over what was captured, which is untenable for hiring compliance and audit.
- The team had already lived the raw-WebRTC path. The legacy stack used custom socket.io signaling for peer-to-peer WebRTC connections, plus a headless-browser (Puppeteer) recording flow. Every concern (signaling, reconnection, recording, scaling beyond two peers) was bespoke code the team owned end to end.
- An AI participant has to join the call. The AI interviewer is not a human peer; it needs a programmatic way to be present in a room, receive audio, and publish audio back.
- Vendor secrets must stay server-side. Browser clients can never hold long-lived media-infrastructure credentials.
- Small team, production latency budget. The real-time path is latency-sensitive and effectively impossible to roll back mid-session, so the media layer needs to be boring and proven.
Decision
Build all real-time audio/video on LiveKit — an SFU (Selective Forwarding Unit) — rather than maintaining raw WebRTC peer connections with custom signaling and custom recording.
Concretely, the platform uses:
- LiveKit Cloud SFU as the media transport. Browsers connect with livekit-client; the connection URL is configured per environment (<LIVEKIT_WS_URL>).
- Token-based room access. Short-lived LiveKit JWTs are minted only on the server: the Java backend validates that the user is actually a participant of the meeting and that the meeting is in a joinable state before issuing a time-limited token, and the Node video-streaming service issues tokens for the AI-interview rooms. Token-grant logic is treated as security-critical code with restricted ownership.
- Server-side recording via LiveKit Egress. A room-composite egress writes an MP4 to S3 (<S3_BUCKET>); a finalize step then copies it to the final recordings location, sends notifications, and cleans up the Redis recording state. Egress webhooks and recovery logic live in the video-streaming service.
- The agents framework for the AI participant. The new AI worker (bot-backend, Python/FastAPI) is built on livekit-agents, which gives the AI interviewer a first-class way to join rooms, consume audio, and publish audio, instead of pretending to be a browser peer.
Alternatives considered
- Keep raw WebRTC + custom signaling + client-side recording. Rejected on lived experience, not theory: the socket.io-signaled P2P path and the client-side recorder are exactly what the platform is migrating away from. Client recording loses sessions on browser crashes and has no server authority; P2P does not scale past two peers without a media server anyway.
- Harden the client-side recorder instead of building server egress. Explicitly considered and rejected in the recording-v2 design: hardening cannot fix the fundamental problems (no server authority, fragile browser capture), and it bakes the licensed avatar likeness into stored files (see ADR-002).
- Per-track egress + server-side FFmpeg merge. This was the platform's earlier server-side recording approach: record individual tracks, then merge with FFmpeg. It was replaced by room-composite egress, which produces a single finished MP4 directly and eliminates the merge step and its failure modes. Track-composite egress remains documented as a fallback only for a future topology where it would be required.
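Once room-composite egress has produced its MP4, the finalize step named in the Decision (copy to the final recordings location, send notifications, clean up the Redis recording state) could look roughly like the sketch below. Everything here is hypothetical: the EgressResult type, the key formats, and the injected copy_object and notify callables stand in for the real S3, notification, and Redis clients in the video-streaming service, which this ADR does not describe. The ordering is one plausible design choice, not the confirmed one. State is cleared last, so a retried webhook after a partial failure still finds the state and can finish the job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressResult:
    # Hypothetical shape of a "recording finished" event.
    session_id: str
    staging_key: str  # object key where egress wrote the MP4


def finalize_recording(result, copy_object, notify, recording_state,
                       final_prefix="recordings/"):
    """Copy the MP4 to its final key, notify, then clear recording state.

    recording_state is any dict-like store (a stand-in for Redis here).
    Returns the final key, or None if there was nothing left to finalize.
    """
    state_key = f"recording:{result.session_id}"
    if state_key not in recording_state:
        # Already finalized (or never started): duplicate webhooks are a no-op.
        return None

    final_key = f"{final_prefix}{result.session_id}.mp4"
    copy_object(result.staging_key, final_key)
    notify(result.session_id, final_key)

    # Cleared last, so a failure above leaves the state in place for a retry.
    del recording_state[state_key]
    return final_key


# Usage with in-memory stand-ins for S3, notifications, and Redis.
copies, notifications = [], []
state = {"recording:s1": "active"}
event = EgressResult(session_id="s1", staging_key="staging/s1.mp4")

first = finalize_recording(
    event,
    lambda src, dst: copies.append((src, dst)),
    lambda sid, key: notifications.append((sid, key)),
    state,
)
second = finalize_recording(
    event,
    lambda src, dst: copies.append((src, dst)),
    lambda sid, key: notifications.append((sid, key)),
    state,
)
```

The trade-off of clearing state last is that a crash between notify and the state cleanup can produce a duplicate notification on retry; a real implementation would need to decide whether that is acceptable.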
Consequences
Easier:
- Recordings are server-authoritative: egress runs in infrastructure the platform controls, with consent gating enforced before recording starts.
- One transport for both human meetings and AI interviews; the AI worker joins a room the same way any participant does.
- Token issuance, room lifecycle, and egress are all driven from server code where secrets stay server-side.
Harder / new obligations:
- Operational ownership of the SFU integration. The SFU itself is currently vendor-operated (LiveKit Cloud), so the platform does not run SFU servers. It does own the integration and its operational envelope: capacity planning (rooms + egress per session), cost guardrails, egress monitoring, and webhook handling.
- LiveKit-specific APIs everywhere the media plane is touched — token grants, room service calls, egress orchestration, webhooks, and the agents framework are all LiveKit idioms, spread across three frontends, two backend services, and the Python worker.
- Failure modes introduced: egress can fail to start (mitigated by cleanup logic and, during migration, the legacy client recorder as fallback); a token-grant bug is a security incident (mitigated by treating the token module as fenced, security-critical code); the media path degrades rather than crashes — if the LiveKit module fails to load, its routes return 503 while the rest of the service keeps serving.
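The "degrades rather than crashes" behaviour in the last bullet can be illustrated with a small framework-free sketch. It assumes nothing about the actual service: the module name, the handler signatures, and the handle function on the media module are hypothetical, and Python is used only for consistency with the other examples here. The idea is that the media module is loaded optionally at startup, its routes answer 503 when it is missing, and unrelated routes are unaffected.

```python
import importlib
from http import HTTPStatus


def load_optional_module(name):
    """Import a module by name, returning None instead of raising on failure."""
    try:
        return importlib.import_module(name)
    except Exception:
        # Any load failure (missing package, bad config at import time) degrades
        # the media routes only; the caller decides how to report it.
        return None


def make_media_handler(media_module):
    """Build a handler that serves media requests, or 503 if the module is absent."""
    def handler(request):
        if media_module is None:
            return (HTTPStatus.SERVICE_UNAVAILABLE.value,
                    {"error": "media layer unavailable"})
        # Hypothetical entry point on the media module.
        return HTTPStatus.OK.value, media_module.handle(request)
    return handler


def health_handler(request):
    """A non-media route that keeps serving regardless of the media module."""
    return HTTPStatus.OK.value, {"status": "ok"}


# Usage: a module name that does not exist simulates the load failure.
media = load_optional_module("hypothetical_livekit_integration_module")
media_route = make_media_handler(media)
media_response = media_route({"path": "/rooms"})
health_response = health_handler({"path": "/health"})
```

Returning 503 rather than 500 tells callers the condition is expected to be temporary, so clients and load balancers can retry instead of treating it as a bug in the request.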
Reversibility
- Recordings are not locked in. Stored recordings are standard MP4 files in S3. Playback, retention, and audit tooling have no LiveKit dependency; the media archive survives any migration untouched.
- The integration code is locked in. Token issuance, room lifecycle, egress orchestration, webhook handling, and the entire livekit-agents-based AI worker are LiveKit-specific. Migrating to another SFU (or back to raw WebRTC) means rewriting the signaling/room layer in every client and server that touches media, plus re-solving server-side recording and the AI-participant problem on the new stack.
- Intermediate exit exists. LiveKit's server is open source, so the first off-ramp from LiveKit Cloud is self-hosting the same SFU: an operational migration, not an API rewrite. A full vendor change is the expensive path and should be justified by a dimension that matters (cost, latency, capability), not novelty.