Troubleshooting

Symptom → likely cause → where to look, for the failures new engineers hit most often. Each entry comes from a real incident or a documented repo gotcha.

Live session room (smart-interview-ui)

Room loads but never receives a `sessionId`

Likely cause: the session room is embedded by interviews-ui in an iframe — it is not a standalone destination. The parent app passes sessionId via query param or an origin-validated postMessage. If the embedding origin isn't listed in the room app's allowed parent-origins config, the postMessage is silently dropped.
Where to look: the parent-origins env var in the session-room app's config; src/livekit/correlation.js (how sessionId is received and forwarded as X-Session-Id). Remember the dev server intentionally disables host-checking to support iframe embedding — don't "fix" that.
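
The origin check described above can be sketched as follows. This is an illustration, not the contents of src/livekit/correlation.js: the function names, the message shape (`{ type: "session", sessionId }`), the `sessionId` query parameter name, and the example origin are all assumptions.

```javascript
// Hypothetical helper: returns the sessionId carried by a postMessage event,
// or null when the sender's origin is not on the allow-list.
function sessionIdFromMessage(event, allowedOrigins) {
  // event.origin is filled in by the browser, not by the sending page.
  if (!allowedOrigins.includes(event.origin)) {
    return null; // dropped without an error, so the room just keeps waiting
  }
  const data = event.data;
  if (!data || data.type !== "session" || typeof data.sessionId !== "string") {
    return null;
  }
  return data.sessionId;
}

// Hypothetical alternative path: read sessionId from the iframe's query string.
function sessionIdFromQuery(search) {
  return new URLSearchParams(search).get("sessionId");
}
```

In a browser the first helper would sit inside a `window.addEventListener("message", ...)` callback. Logging the rejected `event.origin` in dev builds makes a missing allow-list entry much easier to spot.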

Session room connects to the wrong (or no) media server

Likely cause: the LiveKit WebSocket URL env var is unset, so the code falls back to a hardcoded address in src/livekit/endpoints.js that is almost certainly wrong for your environment. The resolver also forces wss:// when the page is served over HTTPS.
Where to look: set the WebSocket URL env var explicitly per environment; never rely on the fallback. Check src/livekit/endpoints.js for resolution order.
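
A minimal sketch of the resolution order described above, assuming a two-step lookup (env var, then fallback) followed by the scheme upgrade. `resolveLivekitUrl` and `FALLBACK_URL` are hypothetical names, and the fallback value is a placeholder rather than the address actually hardcoded in src/livekit/endpoints.js.

```javascript
// Placeholder only; the real fallback lives in src/livekit/endpoints.js.
const FALLBACK_URL = "ws://localhost:7880";

// Hypothetical resolver: env var first, fallback second, then scheme upgrade.
function resolveLivekitUrl(envUrl, pageProtocol) {
  const configured = typeof envUrl === "string" ? envUrl.trim() : "";
  const usedFallback = configured === "";
  let url = usedFallback ? FALLBACK_URL : configured;
  // An HTTPS page cannot open an insecure ws:// socket, so upgrade the scheme.
  if (pageProtocol === "https:" && url.startsWith("ws://")) {
    url = "wss://" + url.slice("ws://".length);
  }
  return { url, usedFallback };
}
```

Returning `usedFallback` alongside the URL lets the caller log a loud warning whenever the env var was missing, which is usually the first thing to check.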

Tokens look wrong / auth fails in the room

Likely cause: the session room never mints tokens; they are minted by video-streaming-server and fetched via src/livekit/tokenService.js. A failure here usually traces back to the token API URL config or to the streaming server itself.
Where to look: the token API URL env var, then the streaming server's logs.

CORS and cross-service calls

Browser calls to the streaming server fail at preflight after adding a request header

Likely cause: the streaming server keeps an explicit CORS allowed-headers list. Adding a new custom header on the client (e.g. a request-id or session-id header) without adding it to that server-side list makes the OPTIONS preflight reject the request — the actual request never fires.
Where to look: the CORS configuration near the top of the streaming server's index.js (the allowed-headers array). Any time a client interceptor starts attaching a new header, update the list in the same change.
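
One way to catch this before it reaches a browser is to compare the headers the client attaches against the server's allow-list. The sketch below is illustrative: the helper name and both header lists are assumptions, and the real allow-list is whatever the streaming server's index.js contains.

```javascript
// Returns the client headers that the server's allow-list does not cover.
// Header names are case-insensitive, so compare them in lower case.
function headersMissingFromAllowList(clientHeaders, allowedHeaders) {
  const allowed = new Set(allowedHeaders.map((h) => h.toLowerCase()));
  return clientHeaders.filter((h) => !allowed.has(h.toLowerCase()));
}

// Example values only.
const allowedHeaders = ["Content-Type", "Authorization", "X-Session-Id"];
const clientHeaders = ["content-type", "X-Session-Id", "X-Request-Id"];

const missing = headersMissingFromAllowList(clientHeaders, allowedHeaders);
console.log(missing); // [ 'X-Request-Id' ]
```

Any header reported here would cause the OPTIONS preflight to fail, so the list should be empty before a client interceptor change is merged.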

`/api/internal/*` calls between services return 401/503

Likely cause: service-to-service calls use a shared HS256 JWT secret that must be byte-identical across bot-backend, video-streaming-server, and user-management, and at least 32 characters long (HS256 requires a key of at least 256 bits). A mismatched or missing value fails fast with a clear log line at boot.
Where to look: the internal-service JWT secret in each service's environment config; user-management logs (the auth filter validates at startup). Note: bot-backend's .env.example does not list this required variable — the app won't boot or will 401 upstream without it.
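
The constraints above (present, at least 32 characters, identical everywhere) can be checked with a small script before deploying. This is a sketch under assumptions: both function names are hypothetical, and how each service's secret is fetched for comparison is left out.

```javascript
// Hypothetical pre-flight check for the shared internal-service secret.
function checkInternalSecret(secret) {
  const problems = [];
  if (typeof secret !== "string" || secret.length === 0) {
    problems.push("missing");
    return problems;
  }
  if (secret.length < 32) problems.push("shorter than 32 characters");
  // Stray whitespace (often a trailing newline from a copy-paste) breaks
  // byte-identity with the other services.
  if (secret !== secret.trim()) problems.push("leading or trailing whitespace");
  return problems;
}

// True only when every service holds exactly the same value.
function secretsMatch(secretsByService) {
  const values = Object.values(secretsByService);
  return values.every((v) => v === values[0]);
}
```

A trailing newline is worth checking for explicitly because the value looks identical when printed but no longer matches the other services.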

Feature flags and cached config

A flag was flipped in platform config but behavior didn't change

Likely cause: some platform-config values are cached in the backend with no TTL. The database row is updated but the running app still serves the old value until the app server is restarted.
Where to look: verify the row in platform_config, then restart the backend app server in that environment and re-check via the public-flags endpoint.
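
The behavior is easier to reason about with a toy model of a cache that has no TTL. The class below is not the backend's code, and the flag key is a made-up example; it only shows why a database update stays invisible until the process restarts.

```javascript
// Toy model of a config cache with no TTL: the first read is stored forever.
class ConfigCache {
  constructor(loadFromDb) {
    this.loadFromDb = loadFromDb;
    this.cache = new Map();
  }
  get(key) {
    if (!this.cache.has(key)) {
      this.cache.set(key, this.loadFromDb(key));
    }
    return this.cache.get(key);
  }
}

const db = { "feedback.pdf.enabled": "false" }; // stand-in for platform_config
let backend = new ConfigCache((key) => db[key]);

const before = backend.get("feedback.pdf.enabled"); // "false"
db["feedback.pdf.enabled"] = "true";                // flip the flag in the database
const stale = backend.get("feedback.pdf.enabled");  // still "false"
backend = new ConfigCache((key) => db[key]);        // a restart begins with an empty cache
const fresh = backend.get("feedback.pdf.enabled");  // "true"
```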

A feature returns "Coming soon" or a 503-style response

Likely cause: that's usually not an outage — it's an env/flag gate doing its job. Several features (e.g. feedback PDF export) sit behind an *.enabled platform-config flag that returns a 503 / "coming soon" response when off. Separately, optional integrations in the streaming server degrade gracefully to 503 instead of crashing, and the frontend falls back (lite orb, no STT).
Where to look: the relevant *.enabled key in platform_config for that environment first. If the flag is on and you still get a 5xx, then it's a real error — read the response body and server logs (a real bug can hide behind the same UI message as a closed gate).

`/livekit/*` routes return 503 but the rest of the streaming server works

Likely cause: the LiveKit module failed to load at startup; the server deliberately keeps serving everything else.
Where to look: streaming server startup logs for the module-load error.
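
The "keep serving everything else" behavior can be sketched as an optional-module loader plus a per-route guard. The names `loadOptional` and `guard`, the log format, and the response body are assumptions for illustration; the streaming server's actual wiring may differ.

```javascript
// Load an optional module without letting its failure stop the server.
function loadOptional(name, loader) {
  try {
    return { name, ok: true, module: loader() };
  } catch (err) {
    console.error(`[startup] optional module "${name}" failed to load: ${err.message}`);
    return { name, ok: false, error: err };
  }
}

// Wrap a route handler: answer 503 when the module behind it did not load.
function guard(state, handler) {
  return (req, res) => {
    if (!state.ok) {
      res.statusCode = 503;
      res.end(JSON.stringify({ error: `${state.name} unavailable` }));
      return;
    }
    handler(req, res, state.module);
  };
}
```

With this shape, the startup log line written by `loadOptional` is the place to find the root cause; the 503 on the route only tells you that the module is absent.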

Recording and camera

Camera light stays on after leaving an interview

Likely cause: teardown ordering. Media tracks (including any cloned tracks created for recording) must be stopped and the room disconnected before any untimed "stop" network call — doing it after leaves the camera lit if that call hangs. A pagehide handler covers hard tab-closes.
Where to look: the publish/teardown hook in interviews-ui (useAiRoomPublish), and any recording code that clones tracks — clones need explicit stopping too.
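
The ordering rule above can be sketched as a single teardown function. This is not the body of useAiRoomPublish: `leaveInterview`, its parameters, and the 3-second budget are assumptions, and `room` stands for any object with a `disconnect()` method. Adding a timeout to the stop call is an extra safeguard chosen for this sketch.

```javascript
// Hypothetical teardown: release hardware first, talk to the network last.
async function leaveInterview({ tracks, room, notifyStop, timeoutMs = 3000 }) {
  // 1. Stop every track, including clones created for recording.
  for (const track of tracks) {
    track.stop();
  }
  // 2. Disconnect the room before any network call is made.
  room.disconnect();
  // 3. Only now send the "stop" request, bounded by a timeout.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await notifyStop(controller.signal);
  } catch (err) {
    // Best effort: the camera is already off, so a failed call is not fatal.
  } finally {
    clearTimeout(timer);
  }
}
```

Because steps 1 and 2 run before the first `await`, they complete even if the stop request never returns. The same function can be called from a `pagehide` listener to cover hard tab-closes.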

Avatar/streaming layer failure takes down the whole room

Likely cause: a known cascade, avatar vendor timeout → CDN fetch failure → WebGL context loss. Layers must be isolated defensively so that one layer's failure stays contained.
Where to look: the streaming server's external-call timeout middleware (tight per-vendor budgets are deliberate — tune them consciously, not casually) and the avatar route handlers.
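
As a rough illustration of the isolation principle (not of the timeout middleware itself), each layer can be started inside its own error boundary so that a failure is recorded instead of propagated. The function name and the layer names in the usage note are hypothetical.

```javascript
// Start each layer independently; a throwing layer is marked degraded
// and the remaining layers still get their turn.
function startLayers(layers) {
  const status = {};
  for (const [name, start] of Object.entries(layers)) {
    try {
      start();
      status[name] = "ok";
    } catch (err) {
      status[name] = `degraded: ${err.message}`;
    }
  }
  return status;
}
```

Called with, say, `{ avatar, cdnAssets, webgl }`, a throwing `avatar` starter yields a `degraded` entry for that layer while the other two still report `ok`, which is the point at which the frontend can switch to its fallback.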

Backend boot and schema

Backend refuses to boot after a deploy: schema validation error

Likely cause: the backend runs with schema validation on (ddl-auto=validate) and no automatic migrations — boot fails if any entity references a column missing from that environment's database. Classic cause: code deployed before its migration, or dev/prod schema drift.
Where to look: the boot log names the missing column. Apply the pending migration (manually, migration-first — see Deploy & Release), and verify dev↔prod schema parity before promoting.

Endpoint behaves strangely after returning a reactive type

Likely cause: the backend has the reactive web stack on the classpath only for its HTTP client. Returning Mono/Flux from a controller silently switches the whole stack to reactive.
Where to look: the controller's return type — return plain DTOs.

Environment files and local dev quirks

interviews-ui: your .env got overwritten. .env is a derived, transient file — the env-swapping npm scripts copy a target env over it. Canonical sources are .env.local, .env.dev, .env.production. Run npm run restore:local after a swapped run. Also note: production hosting does not read .env files at all — env vars are configured per-branch in the hosting service.
interviews-ui: build fails immediately. scripts/check-env.js validates required env vars before every build; the error names the missing var.
interviews-ui: app detects the wrong environment. Detection priority is NEXT_PUBLIC_ENVIRONMENT → hosting branch → hostname → NODE_ENV. Override explicitly rather than fighting the heuristics.
smart-interview-ui: npm test never exits. The test runner defaults to watch mode. Use CI=true npm test to run once.
bot-backend: app won't start or behaves differently between app and worker. Two separate env loaders exist: the FastAPI app uses pydantic-settings (.env, env.local); the worker uses dotenv (env.{ENV}). Keep both fed. Also: the Python 3.13 pin is load-bearing — don't bump the interpreter or livekit-agents casually.
bot-backend: a vendor call raises NotImplementedError. Most vendor adapters are intentional stubs pending later migration waves; only the OpenAI chat/TTS paths are live. Check the wave plan before assuming a bug.
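
The interviews-ui detection priority listed above (NEXT_PUBLIC_ENVIRONMENT → hosting branch → hostname → NODE_ENV) can be sketched as a chain of early returns. The branch and hostname mappings below are invented for illustration, as are the function and parameter names; only the ordering comes from this page.

```javascript
// Hypothetical mappings; the real branch and hostname rules live in the app.
const BRANCH_TO_ENV = new Map([["main", "production"], ["dev", "dev"]]);
const HOSTNAME_TO_ENV = new Map([["localhost", "local"]]);

function detectEnvironment({ explicit, branch, hostname, nodeEnv }) {
  // 1. NEXT_PUBLIC_ENVIRONMENT wins whenever it is set.
  if (explicit) return explicit;
  // 2. Hosting branch.
  if (BRANCH_TO_ENV.has(branch)) return BRANCH_TO_ENV.get(branch);
  // 3. Hostname.
  if (HOSTNAME_TO_ENV.has(hostname)) return HOSTNAME_TO_ENV.get(hostname);
  // 4. NODE_ENV as the last resort.
  return nodeEnv === "production" ? "production" : "local";
}
```

Since the explicit value short-circuits everything else, setting it is the quickest way to rule out a misdetection.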

Where logs live

Generic map — no console links here, ask a teammate for access:

Backend services (user-management, video-streaming-server, bot-backend): each runs in its own cloud app environment, which exposes the application log stream and recent log bundles per environment (dev and prod are separate environments). Boot failures (schema validation, missing required env vars) appear here first.
Frontend apps (interviews-ui, smart-interview-ui): the frontend hosting service keeps per-branch build logs (env validation and compile errors live there). Runtime issues are client-side — start with the browser devtools console and network tab, especially for CORS preflights and WebSocket connections.
Locally: every service logs to the terminal it runs in; the streaming server and bot-backend use structured logging, so grep by request-id / session-id to follow one interview across services.
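
Following one interview across services amounts to filtering each service's log by the same identifier. The sketch below assumes one JSON object per log line and a `sessionId` field; both are assumptions about the log format, and plain `grep` on the id works just as well when they do not hold.

```javascript
// Keep only the structured (JSON) log lines that belong to one interview.
function linesForSession(logText, sessionId) {
  return logText.split("\n").filter((line) => {
    try {
      const entry = JSON.parse(line);
      return entry !== null && typeof entry === "object" && entry.sessionId === sessionId;
    } catch {
      return false; // not JSON (for example a stack-trace line), so skip it
    }
  });
}
```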
