Field Guides

Troubleshooting STT not picking up utterances

Learn how to diagnose and resolve speech-to-text detection failures when STT fails to register user speech, including noise cancellation issues, STT provider limitations, and configuration recommendations.

Troubleshooting

The speech-to-text (STT) pipeline may fail to detect user input even when the user is speaking. This guide covers symptoms, potential causes, and troubleshooting steps based on real-world customer interactions.

Symptoms

  • No errors reported from the STT pipeline (on_error hook not triggered)
  • Agent state changes correctly from speaking → listening
  • User state does not change from listening → speaking, despite audio being sent from the user's microphone
  • Issue persists even after amplifying audio with a gain function
  • STT resumes working after several retries, and the conversation continues normally
  • Observed on both short and longer utterances

Example log excerpt showing STT working correctly:

{"message":"STT turn metrics: 0.401 seconds"}

Typical STT latency: 400–500ms.

Setup

Noise Cancellation: noise_cancellation.BVCTelephony() (agent side) + krisp_enabled=True on participant creation

TTS: 11Labs (no issues observed)

STT Providers Tested:

  • Deepgram → good results for English
  • Azure → best results for Arabic (used in production)

Agent Config:

self.session = AgentSession[Any](
    turn_detection="vad",
    vad=self.context.proc.userdata["vad"],
    stt=self._stt,
    llm=self._llm,
    tts=self._tts,
    max_endpointing_delay=1,
    user_away_timeout=30,
)

STT Configuration (Azure):

azure.STT(
    speech_key=settings.AZURE_SPEECH_KEY,
    speech_region=settings.AZURE_SPEECH_REGION,
    segmentation_silence_timeout_ms=100,
    segmentation_max_time_ms=20000,
    segmentation_strategy="Default",
    language=conf_.language,  # e.g., "ar-SA" or "en-US"
)

Custom Audio Gain Function:

import numpy as np
from livekit import rtc

def _amplify_audio_frame(self, base_frame: rtc.AudioFrame, gain: float) -> rtc.AudioFrame:
    audio_data = np.frombuffer(base_frame.data, dtype=np.int16)
    # Scale in a wider dtype and clamp to the int16 range: a plain
    # (audio_data * gain).astype(np.int16) wraps around on overflow,
    # which distorts the signal and can make detection worse.
    amplified = np.clip(audio_data.astype(np.int32) * gain, -32768, 32767).astype(np.int16)
    return rtc.AudioFrame(
        data=amplified.tobytes(),
        sample_rate=base_frame.sample_rate,
        num_channels=base_frame.num_channels,
        samples_per_channel=base_frame.samples_per_channel,
    )
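Before amplifying (or blaming the STT provider), it is worth confirming that the frames leaving your pipeline actually carry speech-level energy after noise cancellation. A minimal RMS-level sketch; the -50 dBFS threshold is an illustrative assumption, not a tuned value:

```python
import math
import numpy as np

def frame_dbfs(pcm: bytes) -> float:
    """RMS level of an int16 PCM buffer, in dBFS (0 dBFS = full scale)."""
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float64)
    rms = math.sqrt(float(np.mean(samples**2))) if samples.size else 0.0
    return 20 * math.log10(rms / 32768) if rms > 0 else float("-inf")

def looks_like_silence(pcm: bytes, threshold_dbfs: float = -50.0) -> bool:
    """Heuristic: frames this quiet are unlikely to trigger VAD or STT."""
    return frame_dbfs(pcm) < threshold_dbfs
```

If frames consistently read as near-silence, the problem is upstream of STT (capture device, noise cancellation, or routing), and gain or provider changes will not fix it.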

Investigation & Findings

Noise Cancellation May Suppress Speech

Removing BVCTelephony noise cancellation improved STT detection. Aggressive noise cancellation can suppress valid speech, especially in noisy environments or when audio quality is already degraded.
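To isolate this, start a test session with noise cancellation disabled on both sides. A sketch based on the setup described above; option names are assumptions, so verify them against your SDK version:

```python
# Test configuration: no BVC noise cancellation on the agent side,
# so suppressed speech can be ruled in or out as the cause.
await self.session.start(
    room=ctx.room,
    room_input_options=RoomInputOptions(
        noise_cancellation=None,  # instead of noise_cancellation.BVCTelephony()
    ),
)
# Also create the participant with krisp_enabled=False for the same test.
```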

Short Utterances Are Harder to Detect

This is a known issue with many STT providers—brief responses (1–3 words) may not reliably trigger detection.

Workaround: Encourage users to respond in full sentences.

Azure STT Known Issues

Issue #1003 on Azure Speech Known Issues describes similar behavior. Review Azure's known issues to check for ongoing bugs affecting your language or configuration.

LLM Response Failures Unrelated to STT

In some cases, Gemini 2.5 Flash returned MAX_TOKENS errors due to low max_output_tokens. This can appear as an STT issue but is actually an LLM configuration problem.

Resolution: Increase or omit the max_output_tokens parameter to avoid hitting hidden thought_tokens limits.
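If you use Gemini through an LLM plugin, the fix belongs in the LLM configuration, not the STT setup. A hedged sketch; the plugin class and parameter names are assumptions, so check your plugin version's signature:

```python
# Omit max_output_tokens entirely, or raise it well above the expected
# response length: Gemini 2.5 Flash spends part of the output budget on
# internal thought tokens before emitting any visible text.
self._llm = google.LLM(
    model="gemini-2.5-flash",
    # max_output_tokens=64,  # too low: MAX_TOKENS surfaces as empty replies
)
```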

Resolution

  1. Remove or disable noise cancellation (BVCTelephony, krisp_enabled) during testing to confirm if suppression is the root cause.

  2. Test with different STT providers (Azure vs. Deepgram) to see if behavior is provider-specific.

  3. Encourage longer utterances to improve STT pickup reliability, especially for brief responses.

  4. Review Azure STT known issues to check for ongoing bugs affecting your language or configuration.

  5. For Gemini LLM users: Omit max_output_tokens or set it to a higher value to avoid hitting hidden thought_tokens limits.

Key Takeaways

  • Noise cancellation can suppress valid speech—test without it to isolate the issue
  • Short utterances (1–3 words) are less reliably detected by most STT providers
  • STT provider-specific issues may require switching providers or reviewing known issues
  • LLM token limits can manifest as STT failures—verify your LLM configuration separately