Troubleshooting
Symptom → likely cause → where to look, for the failures new engineers hit most often. Each entry comes from a real incident or a documented repo gotcha.
Live session room (smart-interview-ui)
Room loads but never receives a sessionId
- Likely cause: the session room is embedded by interviews-ui in an iframe; it is not a standalone destination. The parent app passes sessionId via query param or an origin-validated postMessage. If the embedding origin isn't listed in the room app's allowed parent-origins config, the postMessage is silently dropped.
- Where to look: the parent-origins env var in the session-room app's config; src/livekit/correlation.js (how sessionId is received and forwarded as X-Session-Id). Remember that the dev server intentionally disables host-checking to support iframe embedding, so don't "fix" that.
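For orientation, here is a minimal sketch of the two delivery paths described above. It is not the code in src/livekit/correlation.js: the allow-list constant, the "session-init" message type, and the payload shape are all hypothetical. It only illustrates why a message from an unlisted origin never yields a sessionId.

```javascript
// Hypothetical names throughout: ALLOWED_PARENT_ORIGINS, the "session-init"
// message type, and the payload shape are assumptions, not the repo's API.
const ALLOWED_PARENT_ORIGINS = ["https://interviews.example.com"];

// Read sessionId from a query string such as "?sessionId=abc123".
function sessionIdFromQuery(search) {
  return new URLSearchParams(search).get("sessionId");
}

// Return the sessionId carried by a message event, or null when the sender's
// origin is not allow-listed or the payload is not the expected shape.
function sessionIdFromMessage(event, allowedOrigins = ALLOWED_PARENT_ORIGINS) {
  if (!allowedOrigins.includes(event.origin)) {
    // Logging the rejected origin turns a silent drop into a visible one.
    console.warn("Ignoring postMessage from unlisted origin:", event.origin);
    return null;
  }
  const data = event.data;
  if (!data || data.type !== "session-init" || typeof data.sessionId !== "string") {
    return null;
  }
  return data.sessionId;
}

// Browser wiring (skipped outside a browser).
if (typeof window !== "undefined") {
  window.addEventListener("message", (event) => {
    const sessionId = sessionIdFromMessage(event);
    if (sessionId) console.log("Received sessionId", sessionId);
  });
}
```

When debugging, temporarily logging the rejected origin (as the sketch does) is usually the fastest way to confirm an allow-list mismatch.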
Session room connects to the wrong (or no) media server
- Likely cause: the LiveKit WebSocket URL env var is unset, so the code falls back to a hardcoded address in src/livekit/endpoints.js that is almost certainly wrong for your environment. The resolver also forces wss:// when the page is HTTPS.
- Where to look: set the WebSocket URL env var explicitly per environment; never rely on the fallback. Check src/livekit/endpoints.js for the resolution order.
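As an illustration of the two behaviors above, here is a small sketch of a stricter resolver. It is an assumption about how one could write it, not the logic in src/livekit/endpoints.js: the function name and arguments are hypothetical, and unlike the current code it throws instead of falling back.

```javascript
// Sketch only: the function name and arguments are hypothetical; the real
// resolution order lives in src/livekit/endpoints.js.
function resolveLivekitUrl(configuredUrl, pageProtocol) {
  if (!configuredUrl) {
    // Fail loudly rather than falling back to a hardcoded address.
    throw new Error("LiveKit WebSocket URL env var is not set for this environment");
  }
  const url = new URL(configuredUrl);
  if (url.protocol !== "ws:" && url.protocol !== "wss:") {
    throw new Error(`Expected a ws:// or wss:// URL, got ${url.protocol}//`);
  }
  // Browsers block insecure WebSockets from HTTPS pages, so upgrade ws:// to wss://.
  if (pageProtocol === "https:" && url.protocol === "ws:") {
    url.protocol = "wss:";
  }
  return url.toString();
}
```

In a browser the second argument would be window.location.protocol. The upgrade explains why a ws:// value that works on a local HTTP page can behave differently once the page is served over HTTPS.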
Tokens look wrong / auth fails in the room
- Likely cause: the session room never mints tokens; they are minted by video-streaming-server and fetched via src/livekit/tokenService.js. A failure here is usually the token API URL config or the streaming server itself.
- Where to look: the token API URL env var, then the streaming server's logs.
CORS and cross-service calls
Browser calls to the streaming server fail at preflight after adding a request header
- Likely cause: the streaming server keeps an explicit CORS allowed-headers list. Adding a new custom header on the client (e.g. a request-id or session-id header) without adding it to that server-side list makes the OPTIONS preflight reject the request, so the actual request never fires.
- Where to look: the CORS configuration near the top of the streaming server's index.js (the allowed-headers array). Any time a client interceptor starts attaching a new header, update the list in the same change.
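A quick way to reason about this failure is to diff the headers the client sends against the server's list. The sketch below does that with placeholder values; the actual array in index.js and the real header names may differ.

```javascript
// Hypothetical values: the real allowed-headers array lives in the streaming
// server's index.js, and these header names are placeholders.
const ALLOWED_HEADERS = ["Content-Type", "Authorization", "X-Session-Id"];

// Return the client headers that a preflight would reject.
// Header names are case-insensitive, so compare lowercased.
function findUnlistedHeaders(clientHeaders, allowedHeaders = ALLOWED_HEADERS) {
  const allowed = new Set(allowedHeaders.map((h) => h.toLowerCase()));
  return clientHeaders.filter((h) => !allowed.has(h.toLowerCase()));
}

console.log(findUnlistedHeaders(["content-type", "X-Request-Id"]));
// prints [ 'X-Request-Id' ]
```

In the browser, the same information shows up in the network tab: compare the preflight's Access-Control-Request-Headers with the response's Access-Control-Allow-Headers.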
/api/internal/* calls between services return 401/503
- Likely cause: service-to-service calls use a shared HS256 JWT secret that must be byte-identical across bot-backend, video-streaming-server, and user-management, and at least 32 characters. A mismatch or missing value fails fast with a clear log line at boot.
- Where to look: the internal-service JWT secret in each service's environment config; user-management logs (the auth filter validates at startup). Note: bot-backend's .env.example does not list this required variable, and without it the app won't boot or will 401 upstream.
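The constraints above (present, at least 32 characters, byte-identical everywhere) can be checked locally before you start chasing 401s. This is a sketch under assumptions: the env var name INTERNAL_JWT_SECRET is hypothetical, and the whitespace check is just one common way a copy-pasted secret stops being byte-identical.

```javascript
// Sketch: the env var name INTERNAL_JWT_SECRET is hypothetical; use whatever
// name each service's config actually reads.
function checkInternalJwtSecret(secret) {
  const problems = [];
  if (typeof secret !== "string" || secret.length === 0) {
    problems.push("secret is missing");
    return problems;
  }
  if (secret.length < 32) {
    problems.push(`secret is ${secret.length} characters; at least 32 are required`);
  }
  if (secret !== secret.trim()) {
    // Stray whitespace from copy-paste breaks byte-identity across services.
    problems.push("secret has leading or trailing whitespace");
  }
  return problems;
}

const secretProblems = checkInternalJwtSecret(process.env.INTERNAL_JWT_SECRET);
if (secretProblems.length > 0) {
  console.error("Internal JWT secret check failed:", secretProblems.join("; "));
}
```

Run the same check against each service's environment; never print the secret itself when comparing values across services.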
Feature flags and cached config
A flag was flipped in platform config but behavior didn't change
- Likely cause: some platform-config values are cached in the backend with no TTL. The database row is updated but the running app still serves the old value until the app server is restarted.
- Where to look: verify the row in platform_config, then restart the backend app server in that environment and re-check via the public-flags endpoint.
A feature returns "Coming soon" or a 503-style response
- Likely cause: that's usually not an outage; it's an env/flag gate doing its job. Several features (e.g. feedback PDF export) sit behind an *.enabled platform-config flag that returns a 503 / "coming soon" response when off. Separately, optional integrations in the streaming server degrade gracefully to 503 instead of crashing, and the frontend falls back (lite orb, no STT).
- Where to look: the relevant *.enabled key in platform_config for that environment first. If the flag is on and you still get a 5xx, then it's a real error: read the response body and server logs (a real bug can hide behind the same UI message as a closed gate).
/livekit/* routes return 503 but the rest of the streaming server works
- Likely cause: the LiveKit module failed to load at startup; the server deliberately keeps serving everything else.
- Where to look: streaming server startup logs for the module-load error.
Recording and camera
Camera light stays on after leaving an interview
- Likely cause: teardown ordering. Media tracks (including any cloned tracks created for recording) must be stopped and the room disconnected before any untimed "stop" network call; doing it after leaves the camera lit if that call hangs. A pagehide handler covers hard tab-closes.
- Where to look: the publish/teardown hook in interviews-ui (useAiRoomPublish), and any recording code that clones tracks; clones need explicit stopping too.
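The ordering rule can be sketched as follows. This is not the body of useAiRoomPublish: the function and parameter names are hypothetical, tracks are assumed to expose stop() like a MediaStreamTrack, and room is assumed to expose disconnect() like a LiveKit Room.

```javascript
// Sketch of the ordering only. `tracks` are MediaStreamTrack-like objects,
// `room` is a LiveKit Room-like object, and `notifyStop` is a hypothetical
// function wrapping the "stop" network call.
async function teardownSession({ tracks, clonedTracks = [], room, notifyStop }) {
  // 1. Release the camera and microphone first, clones included.
  for (const track of [...tracks, ...clonedTracks]) {
    track.stop();
  }
  // 2. Leave the room.
  await room.disconnect();
  // 3. Only now make the network call; if it hangs, the camera is already off.
  try {
    await notifyStop();
  } catch (err) {
    console.warn("stop call failed after local teardown:", err);
  }
}
```

The point of the design is that nothing the user can see (the camera light) depends on a network call completing.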
Avatar/streaming layer failure takes down the whole room
- Likely cause: a known cascade pattern: avatar vendor timeout → CDN fetch failure → WebGL context loss. The layers must be isolated defensively; one layer's failure must not cascade.
- Where to look: the streaming server's external-call timeout middleware (tight per-vendor budgets are deliberate — tune them consciously, not casually) and the avatar route handlers.
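To make the isolation idea concrete, here is a generic sketch of a per-vendor timeout budget with a fallback. It is not the streaming server's middleware: the vendor names, budget values, and the fallback shape are all hypothetical.

```javascript
// Hypothetical budgets and vendor names, for illustration only.
const VENDOR_TIMEOUT_MS = { avatar: 3000, cdn: 2000 };

// Reject if `work` does not settle within the vendor's budget.
function withVendorTimeout(vendor, work, budgets = VENDOR_TIMEOUT_MS) {
  const ms = budgets[vendor];
  if (typeof ms !== "number") {
    return Promise.reject(new Error(`No timeout budget configured for ${vendor}`));
  }
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${vendor} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Isolate the layer: a failed avatar call yields a fallback instead of throwing.
async function loadAvatarOrFallback(fetchAvatar) {
  try {
    return await withVendorTimeout("avatar", fetchAvatar());
  } catch (err) {
    console.warn("avatar layer degraded:", err.message);
    return { mode: "lite" };
  }
}
```

A tight budget bounds how long a slow vendor can stall the request, and the fallback keeps that failure from propagating to the next layer.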
Backend boot and schema
Backend refuses to boot after a deploy: schema validation error
- Likely cause: the backend runs with schema validation on (ddl-auto=validate) and no automatic migrations, so boot fails if any entity references a column missing from that environment's database. Classic cause: code deployed before its migration, or dev/prod schema drift.
- Where to look: the boot log names the missing column. Apply the pending migration (manually, migration-first; see Deploy & Release), and verify dev↔prod schema parity before promoting.
Endpoint behaves strangely after returning a reactive type
- Likely cause: the backend has the reactive web stack on the classpath only for its HTTP client. Returning Mono/Flux from a controller silently switches the whole stack to reactive.
- Where to look: the controller's return type; return plain DTOs.
Environment files and local dev quirks
- interviews-ui: your .env got overwritten. .env is a derived, transient file; the env-swapping npm scripts copy a target env over it. Canonical sources are .env.local, .env.dev, .env.production. Run npm run restore:local after a swapped run. Also note: production hosting does not read .env files at all; env vars are configured per-branch in the hosting service.
- interviews-ui: build fails immediately. scripts/check-env.js validates required env vars before every build; the error names the missing var.
- interviews-ui: app detects the wrong environment. Detection priority is NEXT_PUBLIC_ENVIRONMENT → hosting branch → hostname → NODE_ENV. Override explicitly rather than fighting the heuristics.
- smart-interview-ui: npm test never exits. The test runner defaults to watch mode. Use CI=true npm test to run once.
- bot-backend: app won't start or behaves differently between app and worker. Two separate env loaders exist: the FastAPI app uses pydantic-settings (.env, env.local); the worker uses dotenv (env.{ENV}). Keep both fed. Also: the Python 3.13 pin is load-bearing, so don't bump the interpreter or livekit-agents casually.
- bot-backend: a vendor call raises NotImplementedError. Most vendor adapters are intentional stubs pending later migration waves; only the OpenAI chat/TTS paths are live. Check the wave plan before assuming a bug.
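The interviews-ui detection priority listed above can be read as a chain of early returns. The sketch below shows only the order; the branch names, hostnames, and environment labels are hypothetical, and the real rules live in interviews-ui.

```javascript
// Sketch of the priority order only. The branch and hostname mappings below
// are hypothetical; the real rules live in interviews-ui.
function detectEnvironment({ explicit, branch, hostname, nodeEnv }) {
  // 1. An explicit NEXT_PUBLIC_ENVIRONMENT always wins.
  if (explicit) return explicit;
  // 2. Hosting branch (hypothetical mapping).
  if (branch === "main") return "production";
  if (branch === "dev") return "dev";
  // 3. Hostname (hypothetical mapping).
  if (hostname === "localhost" || hostname === "127.0.0.1") return "local";
  // 4. NODE_ENV as the last resort.
  return nodeEnv === "production" ? "production" : "local";
}

console.log(
  detectEnvironment({ explicit: "dev", branch: "main", hostname: "localhost", nodeEnv: "production" })
);
// prints dev
```

Because the explicit variable short-circuits everything else, setting NEXT_PUBLIC_ENVIRONMENT is the reliable override when the heuristics disagree with you.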
Where logs live
Generic map only. There are no console links here; ask a teammate for access:
- Backend services (user-management, video-streaming-server, bot-backend): each runs in its own cloud app environment, which exposes the application log stream and recent log bundles per environment (dev and prod are separate environments). Boot failures (schema validation, missing required env vars) appear here first.
- Frontend apps (interviews-ui, smart-interview-ui): the frontend hosting service keeps per-branch build logs (env validation and compile errors live there). Runtime issues are client-side — start with the browser devtools console and network tab, especially for CORS preflights and WebSocket connections.
- Locally: every service logs to the terminal it runs in; the streaming server and bot-backend use structured logging, so grep by request-id / session-id to follow one interview across services.
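If plain grep gets noisy, a few lines of script can follow one interview across services. This sketch assumes one JSON object per log line with a sessionId field; both the line format and the field name are assumptions about the structured logs, so adjust them to what the services actually emit.

```javascript
// Assumes one JSON object per log line with a `sessionId` field; both the
// format and the field name are assumptions about the structured logs.
function filterLogLines(text, sessionId) {
  const matches = [];
  for (const line of text.split("\n")) {
    if (!line.trim()) continue;
    try {
      const entry = JSON.parse(line);
      if (entry && entry.sessionId === sessionId) matches.push(entry);
    } catch {
      // Skip lines that are not JSON (stack traces, banners).
    }
  }
  return matches;
}
```

Concatenate the streaming server and bot-backend output before filtering, then sort the result by timestamp if the entries carry one.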