Troubleshooting STT not picking up utterances
Learn how to diagnose and resolve speech-to-text detection failures when STT fails to register user speech, including noise cancellation issues, STT provider limitations, and configuration recommendations.
The speech-to-text (STT) pipeline may fail to detect user input even when the user is speaking. This guide covers symptoms, potential causes, and troubleshooting steps based on real-world customer interactions.
Symptoms
- No errors reported from the STT pipeline (the on_error hook is not triggered)
- Agent state changes correctly from speaking → listening
- User state does not change from listening → speaking, despite audio being sent from the user's microphone
- Issue persists even after amplifying audio with a gain function
- STT resumes working after several retries, and the conversation continues normally
- Observed on both short and longer utterances
Example log excerpt showing STT working correctly:
{"message":"STT turn metrics: 0.401 seconds"}
Typical STT latency: 400–500ms.
Setup
Noise Cancellation: noise_cancellation.BVCTelephony() (agent side) + krisp_enabled=True on participant creation
TTS: 11Labs (no issues observed)
STT Providers Tested:
- Deepgram → good results for English
- Azure → best results for Arabic (used in production)
Agent Config:
self.session = AgentSession[Any](
    turn_detection="vad",
    vad=self.context.proc.userdata["vad"],
    stt=self._stt,
    llm=self._llm,
    tts=self._tts,
    max_endpointing_delay=1,
    user_away_timeout=30,
)
STT Configuration (Azure):
azure.STT(
    speech_key=settings.AZURE_SPEECH_KEY,
    speech_region=settings.AZURE_SPEECH_REGION,
    segmentation_silence_timeout_ms=100,
    segmentation_max_time_ms=20000,
    segmentation_strategy="Default",
    language=conf_.language,  # e.g., "ar-SA" or "en-US"
)
Custom Audio Gain Function:
def _amplify_audio_frame(self, base_frame: rtc.AudioFrame, gain: float) -> rtc.AudioFrame:
    audio_data = np.frombuffer(base_frame.data, dtype=np.int16).astype(np.float32)
    # Clip before casting back to int16; a bare astype wraps around on loud
    # input and distorts the signal further instead of just saturating it.
    audio_data = np.clip(audio_data * gain, -32768, 32767).astype(np.int16)
    return rtc.AudioFrame(
        data=audio_data.tobytes(),
        sample_rate=base_frame.sample_rate,
        num_channels=base_frame.num_channels,
        samples_per_channel=base_frame.samples_per_channel,
    )
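A standalone, runnable sketch of the same gain logic (pure NumPy, no rtc dependency; amplify_pcm16 is a hypothetical helper name) illustrates why clipping before the int16 cast matters: without it, loud input wraps around to negative values.

```python
import numpy as np

def amplify_pcm16(pcm: bytes, gain: float) -> bytes:
    """Amplify 16-bit PCM audio, clipping to avoid integer wraparound."""
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32)
    amplified = np.clip(samples * gain, -32768, 32767).astype(np.int16)
    return amplified.tobytes()

# A quiet 3-sample frame amplified 4x
quiet = np.array([1000, -2000, 3000], dtype=np.int16).tobytes()
print(np.frombuffer(amplify_pcm16(quiet, 4.0), dtype=np.int16).tolist())  # [4000, -8000, 12000]

# Near-full-scale input saturates instead of wrapping to a negative value
hot = np.array([30000], dtype=np.int16).tobytes()
print(np.frombuffer(amplify_pcm16(hot, 2.0), dtype=np.int16).tolist())  # [32767]
```

Note that gain only rescales the waveform; if noise cancellation has already removed the speech energy, amplification cannot bring it back, which matches the observation that the gain function did not resolve the issue.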
Investigation & Findings
Noise Cancellation May Suppress Speech
Removing BVCTelephony noise cancellation improved STT detection. Aggressive noise cancellation can suppress valid speech, especially in noisy environments or when audio quality is already degraded.
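One way to confirm suppression rather than guess is to compare frame energy before and after noise cancellation. A minimal sketch (pure NumPy; the helper names and the 20 dB drop threshold are assumptions to tune for your environment):

```python
import numpy as np

def frame_dbfs(pcm: bytes) -> float:
    """RMS level of 16-bit PCM audio, in dB relative to full scale."""
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float64)
    rms = np.sqrt(np.mean(samples ** 2))
    if rms == 0:
        return float("-inf")
    return 20 * np.log10(rms / 32768.0)

def likely_suppressed(raw: bytes, denoised: bytes, drop_db: float = 20.0) -> bool:
    """Flag frames where noise cancellation removed most of the energy."""
    return bool(frame_dbfs(raw) - frame_dbfs(denoised) > drop_db)

# Synthetic speech-like frame vs. the same frame heavily attenuated
rng = np.random.default_rng(0)
raw = rng.normal(0, 8000, 480).astype(np.int16)
denoised = (raw * 0.01).astype(np.int16)
print(likely_suppressed(raw.tobytes(), denoised.tobytes()))  # True
```

Logging this comparison on live frames makes it easy to see whether audio reaching the STT stage still carries speech energy.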
Short Utterances Are Harder to Detect
This is a known issue with many STT providers—brief responses (1–3 words) may not reliably trigger detection.
Workaround: Encourage users to respond in full sentences.
Azure STT Known Issues
Issue #1003 on Azure Speech Known Issues describes similar behavior. Review Azure's known issues to check for ongoing bugs affecting your language or configuration.
LLM Response Failures Unrelated to STT
In some cases, Gemini 2.5 Flash returned MAX_TOKENS errors due to low max_output_tokens. This can appear as an STT issue but is actually an LLM configuration problem.
Resolution: Increase or omit the max_output_tokens parameter to avoid hitting hidden thought_tokens limits.
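As a sketch of that fix using the google-genai SDK directly (the model name, token budget, and prompt are illustrative; apply the same idea through whatever LLM wrapper your agent uses):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the user's last utterance.",
    # Omit max_output_tokens entirely, or set it high enough that
    # internal thinking tokens don't exhaust the budget.
    config=types.GenerateContentConfig(max_output_tokens=4096),
)

if response.candidates[0].finish_reason == types.FinishReason.MAX_TOKENS:
    print("Hit the token ceiling; raise or remove max_output_tokens")
```

Checking finish_reason explicitly makes this failure mode visible as an LLM problem instead of letting it masquerade as missing STT input.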
Resolution
- Remove or disable noise cancellation (BVCTelephony, krisp_enabled) during testing to confirm whether suppression is the root cause.
- Test with different STT providers (Azure vs. Deepgram) to see if the behavior is provider-specific.
- Encourage longer utterances to improve STT pickup reliability, especially for brief responses.
- Review Azure STT known issues to check for ongoing bugs affecting your language or configuration.
- For Gemini LLM users: omit max_output_tokens or set it to a higher value to avoid hitting hidden thought_tokens limits.
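The first step, disabling noise cancellation for a test run, can be sketched as a toggle at session start (RoomInputOptions usage is assumed from current LiveKit Agents releases, and session/ctx come from your entrypoint; adapt the names to your version):

```python
from livekit.agents import RoomInputOptions
from livekit.plugins import noise_cancellation

USE_NOISE_CANCELLATION = False  # flip off while testing for speech suppression

await session.start(
    room=ctx.room,
    room_input_options=RoomInputOptions(
        noise_cancellation=(
            noise_cancellation.BVCTelephony() if USE_NOISE_CANCELLATION else None
        ),
    ),
)
```

If STT reliably picks up speech with the toggle off, noise cancellation suppression is confirmed as the root cause.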
Key Takeaways
- Noise cancellation can suppress valid speech—test without it to isolate the issue
- Short utterances (1–3 words) are less reliably detected by most STT providers
- STT provider-specific issues may require switching providers or reviewing known issues
- LLM token limits can manifest as STT failures—verify your LLM configuration separately