LiveKit Infrastructure

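Every live interview, AI-led or human-to-human, runs over LiveKit, a WebRTC media server that operates as a Selective Forwarding Unit (SFU). This page explains what the SFU does, who is allowed to mint the tokens that open a room, how a session becomes a recording file, how the avatar's voice and face reach the browser, and how failures are kept isolated.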

What the SFU does (and why not peer-to-peer)

In a plain peer-to-peer WebRTC call, every participant sends their audio/video directly to every other participant. That works for two people, but it scales poorly: in a room of N participants, each person uploads N−1 copies of their stream. Critically for this platform, there is also no server in the media path, so nothing can record the session or let an AI bot listen in.

A Selective Forwarding Unit (SFU) fixes both problems. Each participant uploads their stream once to the SFU; the SFU selectively forwards it to everyone else. That gives us:

One upload per participant, regardless of room size — predictable bandwidth.
A server in the media path, which is what makes server-side recording (egress) possible at all.
Non-human participants: the bot-backend AI worker — the AI interview brain that plans questions and drives evaluation — joins the room exactly like a person would, subscribing to the candidate's audio through the SFU.
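
The bandwidth argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the `"mesh"`/`"sfu"` labels are hypothetical, and it counts streams rather than bits per second.

```python
def upload_streams_per_participant(participants: int, topology: str) -> int:
    """Return how many copies of their own stream one participant uploads.

    topology is "mesh" (plain peer-to-peer) or "sfu" (one upload to the server).
    """
    if participants < 1:
        raise ValueError("a room needs at least one participant")
    if topology == "mesh":
        # Peer-to-peer: one copy goes directly to every other participant.
        return participants - 1
    if topology == "sfu":
        # SFU: a single upload, which the server forwards to everyone else.
        return 1
    raise ValueError(f"unknown topology: {topology!r}")


if __name__ == "__main__":
    for n in (2, 4, 8):
        mesh = upload_streams_per_participant(n, "mesh")
        sfu = upload_streams_per_participant(n, "sfu")
        print(f"{n} participants: mesh uploads={mesh}, sfu uploads={sfu}")
```

For two participants the two topologies cost the same on the uplink, which is why peer-to-peer is adequate for a two-person call; the gap grows linearly with every participant added.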

Browsers connect to the SFU over secure WebSocket at <LIVEKIT_WS_URL>. If the LiveKit layer is unavailable, the rest of the platform degrades gracefully — LiveKit-dependent endpoints return errors, but other services keep serving.

Tokens: who mints them, and the room lifecycle

Joining a LiveKit room requires a signed access token (a JWT carrying the room name, the participant identity, and grants such as publish/subscribe permissions). The cardinal rule: tokens are minted server-side only — never in the browser, because minting requires the LiveKit API secret.
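
To show why the secret has to stay on the server, here is a minimal sketch of minting such a token as an HS256-signed JWT using only the Python standard library. The claim layout (`iss`, `sub`, `nbf`, `exp`, and a `video` grant object) is modelled on LiveKit's documented access-token format and should be checked against the official server SDK, which is what a real service would normally use. The function name, parameters, and default TTL are assumptions made for illustration, not the platform's actual code.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_room_token(
    api_key: str,
    api_secret: str,
    room: str,
    identity: str,
    ttl_seconds: int = 3600,        # assumed default, not from the source
    can_publish: bool = True,
    can_subscribe: bool = True,
    now: Optional[float] = None,    # injectable clock for testing
) -> str:
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,             # which API key signed the token
        "sub": identity,            # participant identity
        "nbf": issued,
        "exp": issued + ttl_seconds,
        "video": {                  # grant object (assumed layout)
            "room": room,
            "roomJoin": True,
            "canPublish": can_publish,
            "canSubscribe": can_subscribe,
        },
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    # The signature is an HMAC over header.payload, keyed by the API secret.
    # Anyone holding the secret can mint a token for any room and identity.
    signature = hmac.new(
        api_secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return signing_input + "." + _b64url(signature)
```

Because the signature depends only on the secret and the claims, a browser that held the secret could grant itself any room, identity, or permission set, which is exactly what server-side minting prevents.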

There are two minting paths, matching the two kinds of interviews:

AI interview sessions — video-streaming-server

The session room (smart-interview-ui) requests a token from video-streaming-server, passing the room and identity. The token-issuance module is treated as security-critical code — changes to grant logic are coordinated, not casual. Beyond the basic participant token, the same service issues:

Observer tokens — subscriber-only tokens (no publish grant) for telemetry/observation, gated by a server-side shared secret.
Telemetry tokens — a participant can exchange their LiveKit token for a short-lived telemetry JWT, which then authorizes the recording-control and client-telemetry endpoints described below.
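
The observer path can be sketched as a gate followed by a restricted grant set. This is a hedged illustration: the function name, the grant field names, and the way the shared secret is presented are all assumptions. The only facts taken from the text above are that the gate is a server-side shared secret and that the resulting token carries no publish grant.

```python
import hmac


def observer_grants(presented_secret: str, expected_secret: str, room: str) -> dict:
    """Return subscriber-only grants if the caller knows the shared secret."""
    # compare_digest runs in constant time, so the comparison does not leak
    # how many leading characters of the secret were guessed correctly.
    if not hmac.compare_digest(
        presented_secret.encode("utf-8"), expected_secret.encode("utf-8")
    ):
        raise PermissionError("observer secret mismatch")
    return {
        "room": room,
        "roomJoin": True,
        "canPublish": False,   # observers never send media into the room
        "canSubscribe": True,  # they only receive it
    }
```

The grant dictionary would then be embedded in a signed token by the same minting code that issues participant tokens.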

Human-to-human meetings — user-management

For scheduled person-to-person meetings, the Java backend authorizes access: it validates that the requester is actually a participant of the meeting and that the meeting is in a joinable state, then returns a time-limited token (on the order of hours) plus room details. Identity is tied to the user–meeting pair so a URL can't be tampered into someone else's room.
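
The authorization steps in that paragraph can be sketched as follows. The real service is a Java backend; this Python version only illustrates the order of the checks. Every name here is hypothetical, including the meeting states, the room-name format, the identity format, and the two-hour TTL, which stands in for "on the order of hours".

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class MeetingState(Enum):
    SCHEDULED = "scheduled"
    IN_PROGRESS = "in_progress"
    ENDED = "ended"
    CANCELLED = "cancelled"


# Assumed definition of "joinable"; the source does not enumerate the states.
JOINABLE_STATES = {MeetingState.SCHEDULED, MeetingState.IN_PROGRESS}


@dataclass(frozen=True)
class Meeting:
    meeting_id: str
    participant_ids: FrozenSet[str]
    state: MeetingState


def authorize_join(meeting: Meeting, user_id: str) -> dict:
    """Return the parameters for a join token, or raise PermissionError."""
    if user_id not in meeting.participant_ids:
        raise PermissionError("requester is not a participant of this meeting")
    if meeting.state not in JOINABLE_STATES:
        raise PermissionError(f"meeting is not joinable ({meeting.state.value})")
    return {
        # Room and identity are derived server-side from the user-meeting pair,
        # never taken from the URL, so editing the URL cannot change the room.
        "room": f"meeting-{meeting.meeting_id}",
        "identity": f"{user_id}:{meeting.meeting_id}",
        "ttl_seconds": 2 * 60 * 60,
    }
```

Deriving both values on the server is the design choice that matters: the client supplies only its authenticated user and the meeting it wants to join, and everything that ends up inside the token is computed after the checks pass.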

Room lifecycle

Rooms are named per session/meeting. The bot-backend worker watches for rooms following the meeting naming convention and auto-joins as a silent participant; auto-join can be suppressed per-room via a name suffix and is additionally gated by a backend flag. LiveKit reports room and egress lifecycle events back to video-streaming-server through a signature-verified webhook endpoint.
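
The auto-join decision combines three conditions: the backend flag, the naming convention, and the opt-out suffix. A minimal sketch follows, assuming a hypothetical `meeting-` prefix and `-nobot` suffix; the source does not state the real values, so both are placeholders.

```python
def should_auto_join(
    room_name: str,
    auto_join_enabled: bool,
    meeting_prefix: str = "meeting-",   # placeholder naming convention
    opt_out_suffix: str = "-nobot",     # placeholder opt-out marker
) -> bool:
    """Decide whether the worker should join a newly observed room."""
    if not auto_join_enabled:
        # The backend flag gates everything else.
        return False
    if not room_name.startswith(meeting_prefix):
        # Rooms outside the meeting naming convention are ignored.
        return False
    # A per-room suffix suppresses auto-join for that room only.
    return not room_name.endswith(opt_out_suffix)
```

Checking the flag first means that turning the flag off suppresses auto-join everywhere, regardless of how individual rooms are named.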

The recording pipeline

Recording is policy-driven: a job posting carries a recording policy (off / optional / mandatory), and the session room enforces it, including the consent UX for the candidate. When recording runs, video-streaming-server orchestrates it end to end.

Step by step:

Start/stop control. The session room calls record start/stop endpoints on video-streaming-server, authenticated with the short-lived telemetry JWT. Stopping a recording can require approval from whoever started it.
Egress to storage. LiveKit egress writes the recorded media into <S3_BUCKET>.
Webhook → finalize. When egress completes, LiveKit calls video-streaming-server's webhook. The finalize step runs ffmpeg to merge the recorded tracks into the final recording and places it in a per-customer final-recordings folder in <S3_BUCKET> (ffmpeg is provisioned on the server hosts for exactly this purpose).
State and timeline. Throughout, recording state and a client-event timeline (logs, presence, session markers) are kept in Redis, which is what the recording-status endpoints serve and what makes sessions debuggable after the fact.
Metadata. For human meetings, user-management tracks recording status, timestamps, and the recording reference on the meeting record, with retry/backoff around the upload path.
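
The steps above amount to a small state machine with an event timeline attached. The sketch below keeps everything in memory where the platform uses Redis, and all names (states, event labels, method names) are hypothetical. Two behaviours are assumptions rather than documented facts: that a stop requested by someone other than the starter needs the starter's approval, and that the finalize handler tolerates a repeated webhook delivery.

```python
import time
from typing import List, Optional


class RecordingSession:
    """In-memory stand-in for the recording state and timeline kept in Redis."""

    def __init__(self) -> None:
        self.state = "idle"
        self.started_by: Optional[str] = None
        self.final_key: Optional[str] = None
        self.timeline: List[dict] = []

    def _mark(self, event: str, **details: str) -> None:
        # Every transition is appended to the timeline for later debugging.
        self.timeline.append({"ts": time.time(), "event": event, **details})

    def start(self, user: str) -> None:
        if self.state != "idle":
            raise RuntimeError(f"cannot start from state {self.state!r}")
        self.state, self.started_by = "recording", user
        self._mark("record_start", by=user)

    def stop(self, user: str, approved_by: Optional[str] = None) -> None:
        if self.state != "recording":
            raise RuntimeError(f"cannot stop from state {self.state!r}")
        # Assumed approval rule: only the starter may stop or approve a stop.
        if user != self.started_by and approved_by != self.started_by:
            raise PermissionError("stop requires approval from the starter")
        self.state = "finalizing"
        self._mark("record_stop", by=user)

    def on_egress_ended(self, final_key: str) -> None:
        # If the same webhook arrives twice, the second delivery is a no-op.
        if self.state == "ready":
            return
        if self.state != "finalizing":
            raise RuntimeError(f"unexpected egress end in state {self.state!r}")
        self.state, self.final_key = "ready", final_key
        self._mark("finalized", key=final_key)
```

A status endpoint in this model would simply return `state` and `timeline`, which is the property the text describes: the same store that drives the pipeline also explains it afterwards.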

Playback URLs for stored recordings should be signed/expiring rather than public.
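
To illustrate what "signed/expiring" means, the sketch below signs an object key together with an expiry timestamp using HMAC-SHA256 and verifies both on the way back in. It is a generic illustration of the idea, not the platform's mechanism; with recordings in S3, the storage service's own presigned URLs would be the more likely choice. The function names, query parameter names, and the assumption that `base_url` has no path component are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse


def _signature(object_key: str, expires: int, secret: bytes) -> str:
    # Key and expiry are signed together so neither can be changed alone.
    message = f"{object_key}\n{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_playback_url(
    base_url: str,                 # e.g. "https://media.example.com" (no path)
    object_key: str,
    secret: bytes,
    ttl_seconds: int = 300,
    now: Optional[float] = None,   # injectable clock for testing
) -> str:
    expires = int(time.time() if now is None else now) + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(object_key, expires, secret)})
    return f"{base_url}/{object_key}?{query}"


def verify_playback_url(url: str, secret: bytes, now: Optional[float] = None) -> bool:
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        presented = params["sig"][0]
    except (KeyError, IndexError, ValueError):
        return False
    if (time.time() if now is None else now) > expires:
        return False  # link has expired
    expected = _signature(parsed.path.lstrip("/"), expires, secret)
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

A link produced this way stops working once its expiry passes, and changing either the object key or the expiry invalidates the signature, so a leaked URL has a bounded lifetime and cannot be edited to reach another customer's recording.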

The avatar audio path

The AI interviewer's face and voice ride alongside the LiveKit media path rather than through it:

Voice (TTS). The conversation itself (what the interviewer asks and how answers are scored) is driven by bot-backend; video-streaming-server provides the media plumbing. The interviewer's speech is synthesized by OpenAI neural text-to-speech behind a video-streaming-server endpoint, so the OpenAI key stays server-side. The browser fetches the audio as an MP3 blob and plays it through a transient audio element. (Moving TTS into bot-backend is planned for a later TI-340 migration wave.)
Face (Simli). Paid tiers render a photo-real Simli avatar via the Simli client SDK over its own WebRTC connection, in "direct mode": the browser drives Simli using a short-lived session token minted by video-streaming-server (the vendor key never reaches the browser; tokens are pooled and queued under capacity pressure). Free tier uses an in-browser 3D avatar, with a simple animated fallback while loading.
Lip sync. Feeding the platform's own TTS audio into Simli for true synchronized lip movement is wired as a planned enhancement, not yet the live path.
A second avatar vendor (Spatius) is integrated behind the same mint-a-session-token pattern as an alternative provider.
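
The mint-a-session-token pattern with pooling can be sketched as below. This is an assumption-heavy illustration: the class name, the refill strategy, and the TTL handling are hypothetical, and the `mint` callable stands in for whatever vendor call video-streaming-server actually makes. Queuing under capacity pressure is not modelled; when the pool is empty the sketch simply mints on demand.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class SessionTokenPool:
    """Keeps a few pre-minted short-lived tokens ready to hand out."""

    def __init__(
        self,
        mint: Callable[[], str],          # server-side vendor call (stand-in)
        ttl_seconds: float,
        target_size: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._mint = mint
        self._ttl = ttl_seconds
        self._target_size = target_size
        self._clock = clock
        self._tokens: Deque[Tuple[str, float]] = deque()

    def refill(self) -> None:
        # Top the pool up to its target size; meant to run in the background.
        while len(self._tokens) < self._target_size:
            self._tokens.append((self._mint(), self._clock() + self._ttl))

    def acquire(self) -> str:
        # Hand out the oldest token that has not expired yet.
        while self._tokens:
            token, expires_at = self._tokens.popleft()
            if expires_at > self._clock():
                return token
        # Pool empty or fully expired: fall back to minting on demand.
        return self._mint()
```

Because the vendor key is only ever used inside `mint`, the browser receives nothing but a short-lived session token, whichever provider sits behind the callable.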

Routing the avatar through bot-backend (host-mode) instead of the browser belongs to a later migration wave; today the avatar path is browser-driven with video-streaming-server as the token authority.

Failure isolation

Two operating rules worth internalizing early:

Layers fail independently. An avatar vendor timeout must not cascade into the media or rendering layers; if an optional integration (STT, avatar) is unavailable, its endpoint degrades gracefully and the UI falls back (e.g. to a simple orb) instead of crashing the room.
Liveness is heartbeat-based. Clients heartbeat the session; cleanup of abandoned ("zombie") sessions is idempotent, and short reconnect gaps never reap a live session.
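
Heartbeat-based liveness with idempotent cleanup can be sketched as follows. The class and method names are hypothetical, and the 30-second timeout is an assumed value chosen only so that the example has a grace period longer than a brief reconnect.

```python
import time
from typing import Callable, Dict, List


class SessionRegistry:
    """Tracks the last heartbeat per session and reaps abandoned ones."""

    def __init__(
        self,
        timeout_seconds: float = 30.0,   # assumed grace period
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, session_id: str) -> None:
        # Each heartbeat refreshes the session's last-seen time.
        self._last_seen[session_id] = self._clock()

    def is_live(self, session_id: str) -> bool:
        return session_id in self._last_seen

    def reap(self) -> List[str]:
        """Remove sessions silent for longer than the timeout.

        Calling reap() again with no new expirations returns an empty list,
        so running it twice (or from two workers) is harmless.
        """
        cutoff = self._clock() - self._timeout
        dead = [sid for sid, seen in self._last_seen.items() if seen < cutoff]
        for sid in dead:
            self._last_seen.pop(sid, None)
        return dead
```

A client that drops for less than the timeout and then resumes heartbeating is never reaped, because its last-seen time is refreshed before the cutoff passes it.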
