ADR-001: LiveKit over raw WebRTC

Status: Accepted
Date: 2026-06-10

Context

TheInterviews.ai needs real-time audio/video for two distinct surfaces:

Human meetings — recruiter and candidate in a shared video room, with optional or mandatory recording depending on the job posting's recording policy.
AI voice interviews — a candidate speaking with an AI interviewer (voice loop: streaming STT → LLM → neural TTS → lip-synced avatar), which also needs to be recordable.

The constraints that shaped the decision:

Recordings must be server-authoritative. The platform's first AI-interview recorder was 100% client-side: a canvas composited the avatar and candidate picture-in-picture, WebAudio mixed TTS and microphone, and the browser's MediaRecorder uploaded to S3 via presigned multipart. This was fragile — a browser crash lost the recording entirely — and gave the server no authority over what was captured, which is untenable for hiring compliance and audit.
The team had already lived the raw-WebRTC path. The legacy stack used custom socket.io signaling for peer-to-peer WebRTC connections, plus a headless-browser (Puppeteer) recording flow. Every concern — signaling, reconnection, recording, scaling beyond two peers — was bespoke code the team owned end to end.
An AI participant has to join the call. The AI interviewer is not a human peer; it needs a programmatic way to be present in a room, receive audio, and publish audio back.
Vendor secrets must stay server-side. Browser clients can never hold long-lived media-infrastructure credentials.
Small team, production latency budget. The real-time path is latency-sensitive and effectively impossible to roll back mid-session, so the media layer needs to be boring and proven.

Decision

Build all real-time audio/video on LiveKit — an SFU (Selective Forwarding Unit) — rather than maintaining raw WebRTC peer connections with custom signaling and custom recording.

Concretely, the platform uses:

LiveKit Cloud SFU as the media transport. Browsers connect with livekit-client; the connection URL is configured per environment (<LIVEKIT_WS_URL>).
Token-based room access. Short-lived LiveKit JWTs are minted only on the server: the Java backend validates that the user is actually a participant of the meeting and that the meeting is in a joinable state before issuing a time-limited token, and the Node video-streaming service issues tokens for the AI-interview rooms. Token-grant logic is treated as security-critical code with restricted ownership.
Server-side recording via LiveKit Egress. A room-composite egress writes an MP4 to S3 (<S3_BUCKET>); a finalize step then copies it to the final recordings location, sends notifications, and cleans up the Redis recording state. Egress webhooks and recovery logic live in the video-streaming service.
The agents framework for the AI participant. The new AI worker (bot-backend, Python/FastAPI) is built on livekit-agents, which gives the AI interviewer a first-class way to join rooms, consume audio, and publish audio — instead of pretending to be a browser peer.

Alternatives considered

Keep raw WebRTC + custom signaling + client-side recording. Rejected on lived experience, not theory: the socket.io-signaled P2P path and the client-side recorder are exactly what the platform is migrating away from. Client recording loses sessions on browser crashes and has no server authority; P2P does not scale past two peers without a media server anyway.
Harden the client-side recorder instead of building server egress. Explicitly considered and rejected in the recording-v2 design: hardening cannot fix the fundamental problems (no server authority, fragile browser capture), and it bakes the licensed avatar likeness into stored files (see ADR-002).
Per-track egress + server-side FFmpeg merge. This was the platform's earlier server-side recording approach: record individual tracks, then merge with FFmpeg. It was replaced by room-composite egress, which produces a single finished MP4 directly and eliminates the merge step and its failure modes. Track-composite egress remains documented as a fallback only for a future topology where it would be required.

Consequences

Easier:

Recordings are server-authoritative: egress runs in infrastructure the platform controls, with consent gating enforced before recording starts.
One transport for both human meetings and AI interviews; the AI worker joins a room the same way any participant does.
Token issuance, room lifecycle, and egress are all driven from server code where secrets stay server-side.
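As an illustration of what server-authoritative recording with consent gating might look like in code, here is a small sketch of the recording lifecycle: egress starts only after consent is confirmed, and the finalize step runs copy, notify, and state cleanup in that order, as the Decision section describes. Everything here is hypothetical: `RecordingSession`, the injected callables (`start_egress`, `copy_to_final`, `notify`), and the in-memory `state` dict that stands in for the Redis recording state. No real LiveKit, S3, or Redis calls are made.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingSession:
    room: str
    consent_given: bool
    state: dict = field(default_factory=dict)  # stand-in for Redis recording state
    log: list = field(default_factory=list)    # records the order of side effects


def start_recording(session, start_egress):
    """Start egress only after consent is confirmed; the server decides."""
    if not session.consent_given:
        raise PermissionError("recording requires consent")
    egress_id = start_egress(session.room)
    session.state["egress_id"] = egress_id
    session.log.append("egress_started")
    return egress_id


def finalize_recording(session, copy_to_final, notify):
    """Run the finalize steps in order: copy, notify, then clear state."""
    egress_id = session.state.get("egress_id")
    if egress_id is None:
        # Nothing to finalize: never started, or already cleaned up.
        return False
    copy_to_final(egress_id)
    session.log.append("copied")
    notify(session.room)
    session.log.append("notified")
    session.state.clear()
    session.log.append("state_cleared")
    return True
```

Returning `False` when no recording state exists is a design choice of this sketch: it makes a repeated finalize call (for example, from a duplicate webhook delivery) a no-op rather than an error.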

Harder / new obligations:

Operational ownership of the SFU layer. Even on LiveKit Cloud, the platform owns capacity planning (rooms + egress per session), cost guardrails, egress monitoring, and webhook handling. Note: the SFU itself is currently vendor-operated (LiveKit Cloud), so "ownership" here means the integration and its operational envelope, not running SFU servers.
LiveKit-specific APIs everywhere the media plane is touched — token grants, room service calls, egress orchestration, webhooks, and the agents framework are all LiveKit idioms, spread across three frontends, two backend services, and the Python worker.
Failure modes introduced. Egress can fail to start (mitigated by cleanup logic and, during migration, by the legacy client recorder as a fallback). A token-grant bug is a security incident (mitigated by treating the token module as fenced, security-critical code). The media path degrades rather than crashes: if the LiveKit module fails to load, its routes return 503 while the rest of the service keeps serving.
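The "degrades rather than crashes" behavior can be sketched in a few lines. This is an assumption-laden illustration, not the service's real routing code: the `/livekit/` path prefix, the `Router` class, and `load_optional` are all hypothetical names. The idea is only that a failed module load is captured once at startup and turned into a 503 on the affected routes.

```python
import importlib
from http import HTTPStatus


def load_optional(module_name):
    """Import a module, returning None instead of raising if it fails to load."""
    try:
        return importlib.import_module(module_name)
    except Exception:
        # Broad on purpose: any load failure should degrade the media
        # routes, not take the whole service down.
        return None


class Router:
    """Toy router: media routes depend on an optional module."""

    def __init__(self, media_module):
        self._media = media_module

    def handle(self, path):
        if path.startswith("/livekit/"):  # hypothetical route prefix
            if self._media is None:
                return HTTPStatus.SERVICE_UNAVAILABLE, "media plane unavailable"
            return HTTPStatus.OK, "media route served"
        # Non-media routes never touch the optional module.
        return HTTPStatus.OK, "ok"
```

With a module that cannot be imported, `Router(load_optional("no_such_module")).handle("/livekit/token")` yields a 503 status while `handle("/health")` still returns 200.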

Reversibility

Recordings are not locked in. Stored recordings are standard MP4 files in S3. Playback, retention, and audit tooling have no LiveKit dependency; the media archive survives any migration untouched.
The integration code is locked in. Token issuance, room lifecycle, egress orchestration, webhook handling, and the entire livekit-agents-based AI worker are LiveKit-specific. Migrating to another SFU (or back to raw WebRTC) means rewriting the signaling/room layer in every client and server that touches media, plus re-solving server-side recording and the AI-participant problem on the new stack.
An intermediate exit exists. LiveKit's server is open source, so the first off-ramp from LiveKit Cloud is self-hosting the same SFU: an operational migration, not an API rewrite. A full vendor change is the expensive path and should be justified by a concrete gain in cost, latency, or capability, not by novelty.
