How much can LiveKit scale?
Understand how LiveKit media transport and agent hosting scale and how current limits protect you from cost surprises.
LiveKit Media Transport and LiveKit Agent Hosting share the same design goal: scale to any workload without you managing infrastructure. The realtime network is a mesh of global edge points of presence (POPs) with automatic traffic steering, and the managed agent platform schedules workloads across elastic compute pools. Whether you need to serve millions of WebRTC minutes or thousands of concurrent agents, the core systems expand horizontally and isolate noisy neighbors, keeping latency low and quality consistent.
Scale media transport without new architecture
- Global ingress and egress: WebRTC sessions land on the closest edge and hop across our private backbone, keeping RTT under 250 ms even during regional spikes.
- Capacity isolation: Each POP runs multiple SFU clusters so broadcasts, 1:1 calls, and PSTN bridges do not compete for the same CPU budget.
- Automated capacity management: Traffic forecasts plus saturation telemetry trigger proactive node additions, so every paid tier sees 99.99% uptime because capacity expands before it is exhausted.
You do not need to shard rooms or hand-place sessions. Keep media in a single project, pin regions only when regulatory rules demand it, and file a support ticket if you expect traffic patterns that require dedicated scaling rehearsals.
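To illustrate how little this asks of your client code, here is a minimal sketch assuming the LiveKit Python realtime SDK (the `livekit` and `livekit-api` packages) with placeholder project URL and credentials: a participant connects to the single project URL, and edge selection happens server-side.

```python
import asyncio
from livekit import api, rtc

# Placeholder project URL and credentials for illustration only.
LIVEKIT_URL = "wss://my-project.livekit.cloud"
API_KEY = "my-api-key"
API_SECRET = "my-api-secret"

async def main():
    # Mint an access token for one participant in one room.
    token = (
        api.AccessToken(API_KEY, API_SECRET)
        .with_identity("caller-01")
        .with_grants(api.VideoGrants(room_join=True, room="support-call"))
        .to_jwt()
    )

    # Connect to the project URL; routing to the nearest edge is handled
    # by LiveKit, so there is no region or shard to choose here.
    room = rtc.Room()
    await room.connect(LIVEKIT_URL, token)
    print(f"connected to {room.name} via the nearest edge")
    await room.disconnect()

asyncio.run(main())
```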
Scale agent hosting as usage spikes
Agent Hosting uses containerized workers backed by GPU-optional pools for TTS/STT fan-out, so deployments stay hot and can burst to the concurrency you provisioned. Separate scheduler queues handle observability, inference, and telephony adapters, which prevents a slow downstream LLM from blocking audio capture or playback. When you roll out new releases, blue/green deployments keep both versions available until traffic drains, so you can test at production scale without downtime.
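As a rough sketch of what the platform actually schedules, assuming the livekit-agents Python SDK: a deployed agent is a worker process with an entrypoint, and the pool of such workers is what the hosting layer scales and what blue/green rollouts drain.

```python
from livekit.agents import JobContext, WorkerOptions, cli

async def entrypoint(ctx: JobContext):
    # Called once per job the scheduler assigns to this worker.
    await ctx.connect()  # join the room for this session
    # ... wire up your STT/LLM/TTS pipeline here ...

if __name__ == "__main__":
    # Agent Hosting runs many copies of this worker and scales the pool;
    # a rollout keeps old and new versions running until traffic drains.
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```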
If you need deterministic startup latency, enable cold-start prevention on the deployment and the platform will keep the necessary workers pre-warmed. Enterprise customers can also request dedicated regional capacity or private networking if compliance teams require it.
Respect guardrails that cap runaway costs
Because everything scales automatically, LiveKit enforces a few quotas to prevent accidental spend. Most limits (agent session minutes, WebRTC minutes, concurrent sessions, telephony minutes) can be lifted through support as soon as you share the traffic profile. The dashboard exposes current usage alongside remaining plan allotments so you can see when you are approaching soft limits.
Inference concurrency is the gating factor
Inference concurrency defines how many simultaneous calls you can make to LiveKit’s managed inference service (LLM, STT, and TTS). It is the one limit that is hard-capped on self-serve plans:
- Build: 5 concurrent inference sessions
- Ship: 20 concurrent inference sessions
- Scale: 50 concurrent inference sessions
Enterprise plans are custom; concurrency is provisioned to match your contract. If you need more concurrency than your self-serve plan allows, upgrade to the next tier rather than requesting a temporary bump. On Enterprise, contact your account team and we will reserve the additional GPU capacity ahead of your launch.
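One practical consequence: if your agent fans out inference calls, keep in-flight requests at or below your plan's cap on the client side so work queues instead of failing. A minimal sketch using a plain asyncio semaphore follows; the `run_inference` coroutine is hypothetical, standing in for whatever LLM, STT, or TTS call your agent makes.

```python
import asyncio

# Hard cap from the Build plan; raise this after upgrading to Ship or Scale.
MAX_INFERENCE_CONCURRENCY = 5
inference_slots = asyncio.Semaphore(MAX_INFERENCE_CONCURRENCY)

async def run_inference(prompt: str) -> str:
    # Hypothetical placeholder for a call routed through managed inference.
    await asyncio.sleep(0.1)
    return f"response to {prompt!r}"

async def bounded_inference(prompt: str) -> str:
    # No more than MAX_INFERENCE_CONCURRENCY calls are in flight at once,
    # so extra requests wait rather than exceeding the plan's limit.
    async with inference_slots:
        return await run_inference(prompt)

async def main():
    prompts = [f"utterance {i}" for i in range(20)]
    results = await asyncio.gather(*(bounded_inference(p) for p in prompts))
    print(len(results), "calls completed without exceeding the concurrency cap")

asyncio.run(main())
```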
Other limits you can raise
- Agent session concurrency: Build (5), Ship (20), Scale (up to 600). Enterprise can set any number; support can pre-allocate capacity if you provide load-test dates.
- WebRTC concurrent connections: Build (100), Ship (1,000), Scale (5,000). Enterprise tiers allocate higher numbers or unlimited based on SLA.
- Telephony minutes and trunks: Included minutes guard against surprises, but you can provision more numbers or SIP minutes by filing a ticket; billing simply follows the published per-minute rates.
Watch the quota widgets in the dashboard during load tests. If you ever see throttling, collect the timestamps and reach out—support can review the metrics, confirm whether you hit a cost-control limit, and schedule increases before your next push.
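During a load test, one way to track concurrent usage alongside the dashboard widgets is to poll the server API. A small sketch, assuming the livekit-api Python package and credentials provided via the LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET environment variables:

```python
import asyncio
from livekit import api

async def snapshot_concurrency():
    # LiveKitAPI reads LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET
    # from the environment when constructed without arguments.
    lkapi = api.LiveKitAPI()
    try:
        resp = await lkapi.room.list_rooms(api.ListRoomsRequest())
        participants = sum(room.num_participants for room in resp.rooms)
        print(f"active rooms: {len(resp.rooms)}, concurrent participants: {participants}")
    finally:
        await lkapi.aclose()

asyncio.run(snapshot_concurrency())
```

Comparing these snapshots against your plan's connection and session allotments makes it easier to spot when a test is approaching a soft limit before throttling kicks in.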