Troubleshooting STT not picking up utterances
Learn how to diagnose and resolve speech-to-text detection failures when STT fails to register user speech, including noise cancellation issues, STT provider limitations, and configuration recommendations.
The speech-to-text (STT) pipeline may fail to detect user input even when the user is speaking. This guide covers symptoms, potential causes, and troubleshooting steps based on real-world customer interactions.
Symptoms
- No errors reported from the STT pipeline (the on_error hook is not triggered)
- Agent state changes correctly from speaking → listening
- User state does not change from listening → speaking, despite audio being sent from the user's microphone
- Issue persists even after amplifying audio with a gain function
- STT resumes working after several retries, and the conversation continues normally
- Observed on both short and longer utterances
Example log excerpt showing STT working correctly:
{"message":"STT turn metrics: 0.401 seconds"}
Typical STT latency: 400–500ms.
Setup
Noise Cancellation: noise_cancellation.BVCTelephony() (agent side) + krisp_enabled=True on participant creation
TTS: 11Labs (no issues observed)
STT Providers Tested:
- Deepgram → good results for English
- Azure → best results for Arabic (used in production)
Agent Config:
self.session = AgentSession[Any](
    turn_detection="vad",
    vad=self.context.proc.userdata["vad"],
    stt=self._stt,
    llm=self._llm,
    tts=self._tts,
    max_endpointing_delay=1,
    user_away_timeout=30,
)
STT Configuration (Azure):
azure.STT(
    speech_key=settings.AZURE_SPEECH_KEY,
    speech_region=settings.AZURE_SPEECH_REGION,
    segmentation_silence_timeout_ms=100,
    segmentation_max_time_ms=20000,
    segmentation_strategy="Default",
    language=conf_.language,  # e.g., "ar-SA" or "en-US"
)
Custom Audio Gain Function:
def _amplify_audio_frame(self, base_frame: rtc.AudioFrame, gain: float) -> rtc.AudioFrame:
    audio_data = np.frombuffer(base_frame.data, dtype=np.int16).astype(np.float32)
    # Clip before casting back to int16; a bare astype wraps around on loud
    # input and distorts the signal further instead of just saturating it.
    audio_data = np.clip(audio_data * gain, -32768, 32767).astype(np.int16)
    return rtc.AudioFrame(
        data=audio_data.tobytes(),
        sample_rate=base_frame.sample_rate,
        num_channels=base_frame.num_channels,
        samples_per_channel=base_frame.samples_per_channel,
    )
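A standalone, runnable sketch of the same gain logic (pure NumPy, no rtc dependency; amplify_pcm16 is a hypothetical helper name) illustrates why clipping before the int16 cast matters: without it, loud input wraps around to negative values.

```python
import numpy as np

def amplify_pcm16(pcm: bytes, gain: float) -> bytes:
    """Amplify 16-bit PCM audio, clipping to avoid integer wraparound."""
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32)
    amplified = np.clip(samples * gain, -32768, 32767).astype(np.int16)
    return amplified.tobytes()

# A quiet 3-sample frame amplified 4x
quiet = np.array([1000, -2000, 3000], dtype=np.int16).tobytes()
print(np.frombuffer(amplify_pcm16(quiet, 4.0), dtype=np.int16).tolist())  # [4000, -8000, 12000]

# Near-full-scale input saturates instead of wrapping to a negative value
hot = np.array([30000], dtype=np.int16).tobytes()
print(np.frombuffer(amplify_pcm16(hot, 2.0), dtype=np.int16).tolist())  # [32767]
```

Note that gain only rescales the waveform; if noise cancellation has already removed the speech energy, amplification cannot bring it back, which matches the observation that the gain function did not resolve the issue.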
Investigation & Findings
Noise Cancellation May Suppress Speech
Removing BVCTelephony noise cancellation improved STT detection. Aggressive noise cancellation can suppress valid speech, especially in noisy environments or when audio quality is already degraded.
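One way to confirm suppression rather than guess is to compare frame energy before and after noise cancellation. A minimal sketch (pure NumPy; the helper names and the 20 dB drop threshold are assumptions to tune for your environment):

```python
import numpy as np

def frame_dbfs(pcm: bytes) -> float:
    """RMS level of 16-bit PCM audio, in dB relative to full scale."""
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float64)
    rms = np.sqrt(np.mean(samples ** 2))
    if rms == 0:
        return float("-inf")
    return 20 * np.log10(rms / 32768.0)

def likely_suppressed(raw: bytes, denoised: bytes, drop_db: float = 20.0) -> bool:
    """Flag frames where noise cancellation removed most of the energy."""
    return bool(frame_dbfs(raw) - frame_dbfs(denoised) > drop_db)

# Synthetic speech-like frame vs. the same frame heavily attenuated
rng = np.random.default_rng(0)
raw = rng.normal(0, 8000, 480).astype(np.int16)
denoised = (raw * 0.01).astype(np.int16)
print(likely_suppressed(raw.tobytes(), denoised.tobytes()))  # True
```

Logging this comparison on live frames makes it easy to see whether audio reaching the STT stage still carries speech energy.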
Short Utterances Are Harder to Detect
This is a known issue with many STT providers—brief responses (1–3 words) may not reliably trigger detection.
Workaround: Encourage users to respond in full sentences.
Azure STT Known Issues
Issue #1003 on Azure Speech Known Issues describes similar behavior. Review Azure's known issues to check for ongoing bugs affecting your language or configuration.
LLM Response Failures Unrelated to STT
In some cases, Gemini 2.5 Flash returned MAX_TOKENS errors due to low max_output_tokens. This can appear as an STT issue but is actually an LLM configuration problem.
Resolution: Increase or omit the max_output_tokens parameter to avoid hitting hidden thought_tokens limits.
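As a sketch of that fix using the google-genai SDK directly (the model name, token budget, and prompt are illustrative; apply the same idea through whatever LLM wrapper your agent uses):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the user's last utterance.",
    # Omit max_output_tokens entirely, or set it high enough that
    # internal thinking tokens don't exhaust the budget.
    config=types.GenerateContentConfig(max_output_tokens=4096),
)

if response.candidates[0].finish_reason == types.FinishReason.MAX_TOKENS:
    print("Hit the token ceiling; raise or remove max_output_tokens")
```

Checking finish_reason explicitly makes this failure mode visible as an LLM problem instead of letting it masquerade as missing STT input.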
Resolution
- Remove or disable noise cancellation (BVCTelephony, krisp_enabled) during testing to confirm whether suppression is the root cause.
- Test with different STT providers (Azure vs. Deepgram) to see if the behavior is provider-specific.
- Encourage longer utterances to improve STT pickup reliability, especially for brief responses.
- Review Azure STT known issues to check for ongoing bugs affecting your language or configuration.
- For Gemini LLM users: omit max_output_tokens or set it to a higher value to avoid hitting hidden thought_tokens limits.
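The first step, disabling noise cancellation for a test run, can be sketched as a toggle at session start (RoomInputOptions usage is assumed from current LiveKit Agents releases, and session/ctx come from your entrypoint; adapt the names to your version):

```python
from livekit.agents import RoomInputOptions
from livekit.plugins import noise_cancellation

USE_NOISE_CANCELLATION = False  # flip off while testing for speech suppression

await session.start(
    room=ctx.room,
    room_input_options=RoomInputOptions(
        noise_cancellation=(
            noise_cancellation.BVCTelephony() if USE_NOISE_CANCELLATION else None
        ),
    ),
)
```

If STT reliably picks up speech with the toggle off, noise cancellation suppression is confirmed as the root cause.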
Key Takeaways
- Noise cancellation can suppress valid speech—test without it to isolate the issue
- Short utterances (1–3 words) are less reliably detected by most STT providers
- STT provider-specific issues may require switching providers or reviewing known issues
- LLM token limits can manifest as STT failures—verify your LLM configuration separately