Deployment reliability on LiveKit Cloud
Understand rolling deployments, graceful draining, signal handling, and production runbook patterns for LiveKit Agents on Cloud.
When you run lk agent deploy or lk agent rollback, LiveKit Cloud replaces your agent instances using a rolling deployment. New instances start serving new sessions immediately. Old instances stop accepting work and wait for active sessions to finish. This is called graceful draining.
Most deployment issues come down to three things: not understanding when drain starts, not knowing how long old instances wait, and not verifying that routing actually cut over. This guide covers all three.
How rolling deployments work
When you run lk agent deploy, LiveKit Cloud follows this process:
- Build: Your code is uploaded and a container image is built from your Dockerfile.
- Deploy: New instances with your updated code start alongside existing ones.
- Route: New session requests go to new instances.
- Drain: Old instances stop accepting new sessions but remain active for up to 1 hour to complete active sessions.
- Scale: New instances autoscale to meet demand.
lk agent rollback operates in the same rolling manner — the same drain behavior applies in reverse.
Two clocks govern drain timing
There are two independent grace periods that control how long old instances wait before shutting down. They operate at different layers and both matter.
Cloud rollout policy (up to 1 hour)
LiveKit Cloud gives old instances up to 1 hour to complete active sessions during a rolling deployment. This is the outer boundary — Cloud will terminate instances after this window regardless of job state.
Reference: Deployment management docs
Runtime drain_timeout (default 30 minutes)
Inside the agent process, drain_timeout controls how long the runtime waits for active jobs to finish after receiving SIGTERM or SIGINT. It defaults to 1800 seconds (30 minutes).
```python
from livekit.agents import AgentServer

# Default: 1800 seconds (30 minutes)
server = AgentServer(
    drain_timeout=2400,  # extend to 40 minutes if your sessions run long
)
```
When a drain starts, the runtime sets the worker status to WS_FULL (unavailable) and waits for all running jobs to complete. If jobs don't finish within drain_timeout, the process raises asyncio.TimeoutError and exits.
Reference: Server options docs · Source: worker.py
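The shape of this sequence can be modeled with a small asyncio sketch. This is illustrative only, not the SDK's actual worker code: a toy worker that refuses new jobs once draining and waits up to `drain_timeout` for running jobs.

```python
import asyncio


class DrainingWorker:
    """Toy model of the drain sequence described above (not SDK code)."""

    def __init__(self, drain_timeout: float = 1800.0):
        self.drain_timeout = drain_timeout
        self.draining = False
        self._jobs: set[asyncio.Task] = set()

    def accept(self, coro) -> bool:
        """Mirrors WS_FULL: once draining, no new jobs are accepted."""
        if self.draining:
            coro.close()  # refuse the job cleanly
            return False
        task = asyncio.ensure_future(coro)
        self._jobs.add(task)
        task.add_done_callback(self._jobs.discard)
        return True

    async def drain(self) -> None:
        """Stop accepting work, then wait up to drain_timeout for jobs.

        Raises asyncio.TimeoutError if jobs outlive the window."""
        self.draining = True
        if self._jobs:
            await asyncio.wait_for(
                asyncio.gather(*self._jobs), timeout=self.drain_timeout
            )


async def demo() -> tuple[bool, bool]:
    worker = DrainingWorker(drain_timeout=0.5)
    accepted = worker.accept(asyncio.sleep(0.1))  # finishes inside the window
    await worker.drain()                          # returns once the job completes
    rejected = worker.accept(asyncio.sleep(0.1))  # refused: worker is draining
    return accepted, rejected
```

The real worker adds registration, job dispatch, and load reporting on top, but the accept/refuse/wait skeleton is the part that matters for drain debugging.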
How the two clocks interact
Your effective drain window is the minimum of these two clocks. With defaults, the runtime's 30-minute timeout is the binding constraint. If you extend drain_timeout beyond 1 hour, the Cloud policy becomes the binding constraint instead.
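In code terms, the binding constraint is a simple `min()` over the two clocks:

```python
CLOUD_ROLLOUT_SECONDS = 60 * 60  # Cloud's outer bound: up to 1 hour


def effective_drain_window(drain_timeout: float) -> float:
    """Old instances get whichever grace window expires first."""
    return min(CLOUD_ROLLOUT_SECONDS, drain_timeout)


effective_drain_window(1800)  # default: the runtime timeout binds (1800s)
effective_drain_window(7200)  # extended past 1h: the Cloud policy binds (3600s)
```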
What signals drain has started
Routing behavior (most reliable operational signal)
Drain has started for old instances when new sessions are no longer routed to them, while existing sessions on those instances continue. You can verify this during deployment by watching session assignment.
Process-level signal
The runtime handles SIGTERM and SIGINT as controlled shutdown triggers. When either signal is received:
- The worker sets `self._draining = True`
- Worker status updates to `WS_FULL` (stops accepting new jobs)
- The runtime waits for in-flight job launches to finish
- Then waits for all running jobs to complete (up to `drain_timeout`)
- Process exits
Reference: Source: cli.py signal handling · Source: worker.py drain method
Failure modes and mitigations
Wrapper scripts swallow TERM/INT
Symptom: Agents continue running unexpectedly during shutdown. Drain looks stuck.
Why: The runtime expects to receive SIGTERM/SIGINT directly. If a custom startup wrapper intercepts signals without forwarding them, the SDK never enters drain mode.
Fix: Run LiveKit entrypoints directly. If a wrapper is required, use exec so the agent process becomes the signal-receiving process:
```shell
#!/usr/bin/env bash
set -euo pipefail

# any preflight work here

exec python agent.py start
```
Without exec, the shell process receives the signal and the agent process never drains.
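You can demonstrate the difference locally with a small harness. This sketch uses a stand-in Python child in place of a real agent: it sends SIGTERM to the wrapper's PID and reports whether the child actually observed the signal.

```python
import signal
import subprocess
import textwrap
import time

# Stand-in "agent": exits 0 if it receives SIGTERM, 1 otherwise.
CHILD = textwrap.dedent("""
    import signal, sys, time
    signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))
    time.sleep(10)
    sys.exit(1)
""")


def term_reaches_child(wrapper_template: str) -> bool:
    """Run the child under a bash wrapper, TERM the wrapper's PID,
    and report whether the child shut down cleanly."""
    cmd = wrapper_template.format(child=CHILD)
    proc = subprocess.Popen(["bash", "-c", cmd])
    time.sleep(1.0)                   # let the interpreter start
    proc.send_signal(signal.SIGTERM)  # what the platform sends on drain
    try:
        return proc.wait(timeout=5) == 0
    except subprocess.TimeoutExpired:
        proc.kill()
        return False


# With exec, the child replaces bash and receives the signal itself.
with_exec = term_reaches_child("exec python3 -c '{child}'")

# Without exec, bash keeps the PID; the trailing `:` stops bash from
# silently exec-ing its last command, so the child never sees SIGTERM.
without_exec = term_reaches_child("python3 -c '{child}'; :")
```

The same harness works against a container: replace the stand-in child with your real startup command and check that the agent process, not the shell, is the one handling TERM.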
Session duration exceeds grace windows
Symptom: Long-running calls or sessions are terminated abruptly near deployment deadlines.
Why: Both the Cloud rollout window and the runtime drain_timeout are bounded. Sessions that exceed them will be cut off.
Fix:
- Design sessions to be resumable or checkpointable.
- Increase `drain_timeout` if your sessions regularly exceed 30 minutes (but remember the 1-hour Cloud ceiling).
- Align deployment timing with expected session patterns — don't deploy during peak hours if sessions are long.
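As a sketch of the first point, a session can persist enough state after each turn for a replacement instance to pick up where it left off. The class and field names below are hypothetical, not a LiveKit API:

```python
import json
import os
import tempfile


class CheckpointedSession:
    """Hypothetical pattern: persist progress so a session cut off at a
    grace-window deadline can be resumed by a new instance."""

    def __init__(self, path: str):
        self.path = path
        self.state = {"turn": 0, "transcript": []}

    def advance(self, utterance: str) -> None:
        self.state["turn"] += 1
        self.state["transcript"].append(utterance)
        self.checkpoint()  # persist after every turn

    def checkpoint(self) -> None:
        # Write-then-rename so a kill mid-write can't corrupt the file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)

    @classmethod
    def resume(cls, path: str) -> "CheckpointedSession":
        session = cls(path)
        with open(path) as f:
            session.state = json.load(f)
        return session


path = os.path.join(tempfile.mkdtemp(), "session.json")
first = CheckpointedSession(path)
first.advance("hello")
first.advance("how can I help?")

resumed = CheckpointedSession.resume(path)  # e.g. on a new instance
```

In production the checkpoint store would be external (a database or object store) rather than local disk, since the replacement instance runs elsewhere.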
Non-backward-compatible changes during overlap
Symptom: In-flight sessions on the old version fail after new version deploys (integration calls break, state transitions fail).
Why: During rolling deployment, old and new versions coexist. If the new version changes metadata schemas, tool contracts, or protocol fields in breaking ways, old sessions can break.
Fix: Roll out schema changes in additive phases. Add new fields first, deploy, then deprecate old fields in a subsequent deploy.
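A minimal sketch of the additive pattern for job metadata (the field names are hypothetical):

```python
import json


def parse_metadata(raw: str) -> dict:
    """Phase 1: the new version reads a field it just added with a
    default, so payloads from still-draining old instances keep working."""
    data = json.loads(raw)
    return {
        "caller_id": data["caller_id"],          # present in both versions
        "language": data.get("language", "en"),  # new field: default for old payloads
    }


old_style = parse_metadata('{"caller_id": "abc123"}')
new_style = parse_metadata('{"caller_id": "abc123", "language": "fr"}')
```

Only after every instance writes the new field should a later deploy start requiring it or drop the old one.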
Cold starts mistaken for drain issues
Symptom: Elevated first-connect latency after deployment is interpreted as a draining failure.
Why: On certain plans, agents can scale to zero. The first connection triggers a cold start, which is unrelated to drain behavior.
Fix: Track assignment latency and startup latency as separate metrics from drain behavior.
Routing cutover not verified
Symptom: Unclear whether old instances are actually draining or still receiving new work.
Why: No explicit validation step in the deployment workflow.
Fix: During rollout, confirm all three conditions:
- New sessions land on the new version
- Old instances stop receiving new sessions
- Active sessions on old instances drain to zero
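The three conditions above are easy to express as a check over whatever per-version metrics you collect. The data shapes here are hypothetical placeholders for your own observability:

```python
def cutover_complete(recent_assignments: list[dict],
                     old_active_sessions: dict[str, int],
                     new_version: str) -> bool:
    """True when recent new sessions land only on the new version and
    every old instance has drained to zero active sessions."""
    routed_to_new = all(a["version"] == new_version
                        for a in recent_assignments)
    old_drained = all(count == 0
                      for count in old_active_sessions.values())
    return routed_to_new and old_drained


mid_rollout = cutover_complete(
    recent_assignments=[{"version": "v2"}, {"version": "v2"}],
    old_active_sessions={"v1-instance-a": 3},  # still draining
    new_version="v2",
)
done = cutover_complete(
    recent_assignments=[{"version": "v2"}],
    old_active_sessions={"v1-instance-a": 0},
    new_version="v2",
)
```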
Production runbook
Pre-deploy
- Validate backward compatibility between old and new versions for any shared state, metadata, or tool contracts.
- Avoid wrapping the LiveKit startup command. If a wrapper is unavoidable, verify it forwards `SIGTERM`/`SIGINT` correctly with `exec`.
- Ensure your observability distinguishes session assignment by version, active session count per version, and shutdown events with reasons.
During deploy
```shell
lk agent deploy
```
- Watch that new sessions route to new instances.
- Verify old instances accept zero new sessions.
- Track active sessions on old instances trending toward zero.
- Monitor the agent logs for `"draining worker"` entries confirming drain initiation.
During rollback
```shell
lk agent rollback
```
Rollback uses the same rolling drain behavior. Re-validate routing and session counts on the replaced generation.
Post-deploy
- Confirm no abrupt session terminations attributable to signal handling.
- Review outliers: long sessions, reconnect patterns, startup latency.
- If drain appears stuck, verify the worker process is actually receiving `SIGTERM` (check for wrapper scripts masking signals).
Troubleshooting drain issues
Quick triage
- Confirm a rollout is in progress (`lk agent deploy` or rollback event).
- Check routing: are new sessions going to the new version? Are old instances no longer accepting new work?
- If old instances remain active unexpectedly, test the signal path: does `SIGTERM` reach the LiveKit agent process?
Validate process tree
When testing in a local Docker container, make sure the LiveKit agent runtime is the signal-receiving process. Inspect the container's process tree:
```shell
# Find the agent process
ps -ef | grep "agent" | grep -v grep

# Send TERM and observe behavior
kill -TERM <pid>
```
Expected: the process logs "draining worker", active sessions drain, then the process exits.
Controlled TERM test in staging
If you are self-hosting agents, you should also run the following test:
- Start one or more active sessions against a staging agent.
- Send `SIGTERM` to the agent process.
- Verify:
  - Process enters drain (logs `"draining worker"`)
  - Active sessions continue and complete
  - Process exits when sessions finish (or `drain_timeout` elapses)
Correlate timing
When debugging timing-related drain issues, compare:
| Layer | Default window | Controls |
|---|---|---|
| Cloud rollout policy | Up to 1 hour | How long Cloud waits before terminating old instances |
| Runtime `drain_timeout` | 1800s (30 min) | How long the process waits for jobs after SIGTERM |
If observed behavior doesn't match expectations, check for: wrapper scripts masking signals, external supervisors with shorter stop deadlines, or sessions exceeding both grace windows.
What to collect during incidents
- Deployment event timestamp (when deploy/rollback started)
- Per-version active session counts over time
- New-session assignment by version during rollout
- Worker/container stop events and exit reasons
- Evidence of SIGTERM receipt in agent logs (look for `"draining worker"` entries)
For more on accessing agent logs, see the Agent logs field guide and the log collection docs.
Related resources
- Deployment management docs — rolling deploy, rollback, cold starts
- Server lifecycle docs — registration, job dispatch, graceful drain
- Server options docs — `drain_timeout`, permissions, load functions
- Agent deployment CLI reference — full CLI command reference
- Deploy agents with GitHub Actions — CI/CD automation
- Checklist for regional deployments — region pinning and compliance
- Agent logs field guide — accessing and interpreting agent logs
- Source: `worker.py` — drain implementation, `WS_FULL` status, `_is_available()` check
- Source: `cli.py` — signal handling (SIGTERM, SIGINT)