Deployment reliability on LiveKit Cloud
Understand rolling deployments, graceful draining, signal handling, and production runbook patterns for LiveKit Agents on Cloud.
When you run lk agent deploy or lk agent rollback, LiveKit Cloud replaces your agent instances using a rolling deployment. New instances start serving new sessions immediately. Old instances stop accepting work and wait for active sessions to finish. This is called graceful draining.
Most deployment issues come down to three things: not understanding when drain starts, not knowing how long old instances wait, and not verifying that routing actually cut over. This guide covers all three.
How rolling deployments work
When you run lk agent deploy, LiveKit Cloud follows this process:
- Build: Your code is uploaded and a container image is built from your Dockerfile.
- Deploy: New instances with your updated code start alongside existing ones.
- Route: New session requests go to new instances.
- Drain: Old instances stop accepting new sessions but remain active for up to 1 hour to complete active sessions.
- Scale: New instances autoscale to meet demand.
lk agent rollback operates in the same rolling manner — the same drain behavior applies in reverse.
Two clocks govern drain timing
There are two independent grace periods that control how long old instances wait before shutting down. They operate at different layers and both matter.
Cloud rollout policy (up to 1 hour)
LiveKit Cloud gives old instances up to 1 hour to complete active sessions during a rolling deployment. This is the outer boundary — Cloud will terminate instances after this window regardless of job state.
Reference: Deployment management docs
Runtime drain_timeout (default 30 minutes)
Inside the agent process, drain_timeout controls how long the runtime waits for active jobs to finish after receiving SIGTERM or SIGINT. It defaults to 1800 seconds (30 minutes).
```python
from livekit.agents import AgentServer

# Default: 1800 seconds (30 minutes)
server = AgentServer(
    drain_timeout=2400,  # extend to 40 minutes if your sessions run long
)
```
When a drain starts, the runtime sets the worker status to WS_FULL (unavailable) and waits for all running jobs to complete. If jobs don't finish within drain_timeout, the process raises asyncio.TimeoutError and exits.
Reference: Server options docs · Source: worker.py
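The shape of this sequence can be modeled with a small asyncio sketch. This is illustrative only, not the SDK's actual worker code: a toy worker that refuses new jobs once draining and waits up to `drain_timeout` for running jobs.

```python
import asyncio


class DrainingWorker:
    """Toy model of the drain sequence described above (not SDK code)."""

    def __init__(self, drain_timeout: float = 1800.0):
        self.drain_timeout = drain_timeout
        self.draining = False
        self._jobs: set[asyncio.Task] = set()

    def accept(self, coro) -> bool:
        """Mirrors WS_FULL: once draining, no new jobs are accepted."""
        if self.draining:
            coro.close()  # refuse the job cleanly
            return False
        task = asyncio.ensure_future(coro)
        self._jobs.add(task)
        task.add_done_callback(self._jobs.discard)
        return True

    async def drain(self) -> None:
        """Stop accepting work, then wait up to drain_timeout for jobs.

        Raises asyncio.TimeoutError if jobs outlive the window."""
        self.draining = True
        if self._jobs:
            await asyncio.wait_for(
                asyncio.gather(*self._jobs), timeout=self.drain_timeout
            )


async def demo() -> tuple[bool, bool]:
    worker = DrainingWorker(drain_timeout=0.5)
    accepted = worker.accept(asyncio.sleep(0.1))  # finishes inside the window
    await worker.drain()                          # returns once the job completes
    rejected = worker.accept(asyncio.sleep(0.1))  # refused: worker is draining
    return accepted, rejected
```

The real worker adds registration, job dispatch, and load reporting on top, but the accept/refuse/wait skeleton is the part that matters for drain debugging.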
How the two clocks interact
Your effective drain window is the minimum of these two clocks. With defaults, the runtime's 30-minute timeout is the binding constraint. If you extend drain_timeout beyond 1 hour, the Cloud policy becomes the binding constraint instead.
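In code terms, the binding constraint is a simple `min()` over the two clocks:

```python
CLOUD_ROLLOUT_SECONDS = 60 * 60  # Cloud's outer bound: up to 1 hour


def effective_drain_window(drain_timeout: float) -> float:
    """Old instances get whichever grace window expires first."""
    return min(CLOUD_ROLLOUT_SECONDS, drain_timeout)


effective_drain_window(1800)  # default: the runtime timeout binds (1800s)
effective_drain_window(7200)  # extended past 1h: the Cloud policy binds (3600s)
```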
What signals drain has started
Routing behavior (most reliable operational signal)
Drain has started for old instances when new sessions are no longer routed to them, while existing sessions on those instances continue. You can verify this during deployment by watching session assignment.
Process-level signal
The runtime handles SIGTERM and SIGINT as controlled shutdown triggers. When either signal is received:
- The worker sets `self._draining = True`
- Worker status updates to `WS_FULL` (stops accepting new jobs)
- The runtime waits for in-flight job launches to finish
- Then waits for all running jobs to complete (up to `drain_timeout`)
- Process exits
Reference: Source: cli.py signal handling · Source: worker.py drain method
Failure modes and mitigations
Wrapper scripts swallow TERM/INT
Symptom: Agents continue running unexpectedly during shutdown. Drain looks stuck.
Why: The runtime expects to receive SIGTERM/SIGINT directly. If a custom startup wrapper intercepts signals without forwarding them, the SDK never enters drain mode.
Fix: Run LiveKit entrypoints directly. If a wrapper is required, use exec so the agent process becomes the signal-receiving process:
```shell
#!/usr/bin/env bash
set -euo pipefail

# any preflight work here

exec python agent.py start
```
Without exec, the shell process receives the signal and the agent process never drains.
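You can demonstrate the difference locally with a small harness. This sketch uses a stand-in Python child in place of a real agent: it sends SIGTERM to the wrapper's PID and reports whether the child actually observed the signal.

```python
import signal
import subprocess
import textwrap
import time

# Stand-in "agent": exits 0 if it receives SIGTERM, 1 otherwise.
CHILD = textwrap.dedent("""
    import signal, sys, time
    signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))
    time.sleep(10)
    sys.exit(1)
""")


def term_reaches_child(wrapper_template: str) -> bool:
    """Run the child under a bash wrapper, TERM the wrapper's PID,
    and report whether the child shut down cleanly."""
    cmd = wrapper_template.format(child=CHILD)
    proc = subprocess.Popen(["bash", "-c", cmd])
    time.sleep(1.0)                   # let the interpreter start
    proc.send_signal(signal.SIGTERM)  # what the platform sends on drain
    try:
        return proc.wait(timeout=5) == 0
    except subprocess.TimeoutExpired:
        proc.kill()
        return False


# With exec, the child replaces bash and receives the signal itself.
with_exec = term_reaches_child("exec python3 -c '{child}'")

# Without exec, bash keeps the PID; the trailing `:` stops bash from
# silently exec-ing its last command, so the child never sees SIGTERM.
without_exec = term_reaches_child("python3 -c '{child}'; :")
```

The same harness works against a container: replace the stand-in child with your real startup command and check that the agent process, not the shell, is the one handling TERM.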
Session duration exceeds grace windows
Symptom: Long-running calls or sessions are terminated abruptly near deployment deadlines.
Why: Both the Cloud rollout window and the runtime drain_timeout are bounded. Sessions that exceed them will be cut off.
Fix:
- Design sessions to be resumable or checkpointable.
- Increase `drain_timeout` if your sessions regularly exceed 30 minutes (but remember the 1-hour Cloud ceiling).
- Align deployment timing with expected session patterns — don't deploy during peak hours if sessions are long.
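As a sketch of the first point, a session can persist enough state after each turn for a replacement instance to pick up where it left off. The class and field names below are hypothetical, not a LiveKit API:

```python
import json
import os
import tempfile


class CheckpointedSession:
    """Hypothetical pattern: persist progress so a session cut off at a
    grace-window deadline can be resumed by a new instance."""

    def __init__(self, path: str):
        self.path = path
        self.state = {"turn": 0, "transcript": []}

    def advance(self, utterance: str) -> None:
        self.state["turn"] += 1
        self.state["transcript"].append(utterance)
        self.checkpoint()  # persist after every turn

    def checkpoint(self) -> None:
        # Write-then-rename so a kill mid-write can't corrupt the file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)

    @classmethod
    def resume(cls, path: str) -> "CheckpointedSession":
        session = cls(path)
        with open(path) as f:
            session.state = json.load(f)
        return session


path = os.path.join(tempfile.mkdtemp(), "session.json")
first = CheckpointedSession(path)
first.advance("hello")
first.advance("how can I help?")

resumed = CheckpointedSession.resume(path)  # e.g. on a new instance
```

In production the checkpoint store would be external (a database or object store) rather than local disk, since the replacement instance runs elsewhere.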
Non-backward-compatible changes during overlap
Symptom: In-flight sessions on the old version fail after new version deploys (integration calls break, state transitions fail).
Why: During rolling deployment, old and new versions coexist. If the new version changes metadata schemas, tool contracts, or protocol fields in breaking ways, old sessions can break.
Fix: Roll out schema changes in additive phases. Add new fields first, deploy, then deprecate old fields in a subsequent deploy.
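A minimal sketch of the additive pattern for job metadata (the field names are hypothetical):

```python
import json


def parse_metadata(raw: str) -> dict:
    """Phase 1: the new version reads a field it just added with a
    default, so payloads from still-draining old instances keep working."""
    data = json.loads(raw)
    return {
        "caller_id": data["caller_id"],          # present in both versions
        "language": data.get("language", "en"),  # new field: default for old payloads
    }


old_style = parse_metadata('{"caller_id": "abc123"}')
new_style = parse_metadata('{"caller_id": "abc123", "language": "fr"}')
```

Only after every instance writes the new field should a later deploy start requiring it or drop the old one.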
Cold starts mistaken for drain issues
Symptom: Elevated first-connect latency after deployment is interpreted as a draining failure.
Why: On certain plans, agents can scale to zero. The first connection triggers a cold start, which is unrelated to drain behavior.
Fix: Track assignment latency and startup latency as separate metrics from drain behavior.
Routing cutover not verified
Symptom: Unclear whether old instances are actually draining or still receiving new work.
Why: No explicit validation step in the deployment workflow.
Fix: During rollout, confirm all three conditions:
- New sessions land on the new version
- Old instances stop receiving new sessions
- Active sessions on old instances drain to zero
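The three conditions above are easy to express as a check over whatever per-version metrics you collect. The data shapes here are hypothetical placeholders for your own observability:

```python
def cutover_complete(recent_assignments: list[dict],
                     old_active_sessions: dict[str, int],
                     new_version: str) -> bool:
    """True when recent new sessions land only on the new version and
    every old instance has drained to zero active sessions."""
    routed_to_new = all(a["version"] == new_version
                        for a in recent_assignments)
    old_drained = all(count == 0
                      for count in old_active_sessions.values())
    return routed_to_new and old_drained


mid_rollout = cutover_complete(
    recent_assignments=[{"version": "v2"}, {"version": "v2"}],
    old_active_sessions={"v1-instance-a": 3},  # still draining
    new_version="v2",
)
done = cutover_complete(
    recent_assignments=[{"version": "v2"}],
    old_active_sessions={"v1-instance-a": 0},
    new_version="v2",
)
```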
Production runbook
Pre-deploy
- Validate backward compatibility between old and new versions for any shared state, metadata, or tool contracts.
- Avoid wrapping the LiveKit startup command. If a wrapper is unavoidable, verify it forwards `SIGTERM`/`SIGINT` correctly with `exec`.
- Ensure your observability distinguishes session assignment by version, active session count per version, and shutdown events with reasons.
During deploy
```shell
lk agent deploy
```
- Watch that new sessions route to new instances.
- Verify old instances accept zero new sessions.
- Track active sessions on old instances trending toward zero.
- Monitor the agent logs for `"draining worker"` entries confirming drain initiation.
During rollback
```shell
lk agent rollback
```
Rollback uses the same rolling drain behavior. Re-validate routing and session counts on the replaced generation.
Post-deploy
- Confirm no abrupt session terminations attributable to signal handling.
- Review outliers: long sessions, reconnect patterns, startup latency.
- If drain appears stuck, verify the worker process is actually receiving `SIGTERM` (check for wrapper scripts masking signals).
Troubleshooting drain issues
Quick triage
- Confirm a rollout is in progress (`lk agent deploy` or rollback event).
- Check routing: are new sessions going to the new version? Are old instances no longer accepting new work?
- If old instances remain active unexpectedly, test the signal path: does `SIGTERM` reach the LiveKit agent process?
Validate process tree
When testing in a local Docker container, make sure the LiveKit agent runtime is the signal-receiving process. Inspect the container's process tree:
```shell
# Find the agent process
ps -ef | grep "agent" | grep -v grep

# Send TERM and observe behavior
kill -TERM <pid>
```
Expected: the process logs "draining worker", active sessions drain, then the process exits.
Controlled TERM test in staging
If you are self-hosting agents, you should also run the following test:
- Start one or more active sessions against a staging agent.
- Send `SIGTERM` to the agent process.
- Verify:
  - Process enters drain (logs `"draining worker"`)
  - Active sessions continue and complete
  - Process exits when sessions finish (or `drain_timeout` elapses)
Correlate timing
When debugging timing-related drain issues, compare:
| Layer | Default window | Controls |
|---|---|---|
| Cloud rollout policy | Up to 1 hour | How long Cloud waits before terminating old instances |
| Runtime `drain_timeout` | 1800s (30 min) | How long the process waits for jobs after SIGTERM |
If observed behavior doesn't match expectations, check for: wrapper scripts masking signals, external supervisors with shorter stop deadlines, or sessions exceeding both grace windows.
What to collect during incidents
- Deployment event timestamp (when deploy/rollback started)
- Per-version active session counts over time
- New-session assignment by version during rollout
- Worker/container stop events and exit reasons
- Evidence of SIGTERM receipt in agent logs (look for `"draining worker"` entries)
For more on accessing agent logs, see the Agent logs field guide and the log collection docs.
Related resources
- Deployment management docs — rolling deploy, rollback, cold starts
- Server lifecycle docs — registration, job dispatch, graceful drain
- Server options docs — `drain_timeout`, permissions, load functions
- Agent deployment CLI reference — full CLI command reference
- Deploy agents with GitHub Actions — CI/CD automation
- Checklist for regional deployments — region pinning and compliance
- Agent logs field guide — accessing and interpreting agent logs
- Source: `worker.py` — drain implementation, `WS_FULL` status, `_is_available()` check
- Source: `cli.py` — signal handling (SIGTERM, SIGINT)