Natural voice conversations depend on two things: knowing when a user is speaking and letting them interrupt naturally. Without reliable noise detection and barge‑in, voice agents feel rigid and frustrating—especially in real‑world environments like call centers or mobile scenarios.
Azure Voice Live API addresses this by providing built‑in noise handling, server‑side Voice Activity Detection (VAD), and native barge‑in support—all configurable with a single session update.
How Voice Live Handles Noise and Interruption
Voice Live performs server‑side speech detection on the incoming audio stream. Instead of relying on simple volume thresholds, it can use Azure Semantic VAD, which is more resilient to background noise and conversational fillers.
When enabled:
- Background noise is ignored
- Speech start and stop are detected automatically
- User speech can interrupt the assistant mid‑response (barge‑in)
All of this happens without stitching together separate STT, silence detection, or TTS cancellation logic.
The Key Configuration: session.update
Noise detection and barge‑in are configured using the session.update event, typically sent immediately after opening the Voice Live WebSocket session from the client side, such as Azure Container Apps (ACA) or an Azure Function.
Below is a recommended baseline configuration:
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_sampling_rate": 24000,
    "turn_detection": {
      "type": "azure_semantic_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500,
      "interrupt_response": true,
      "auto_truncate": true
    }
  }
}
What this configuration enables:
- Azure Semantic VAD for robust speech detection: uses Azure's semantic model to detect actual speech, not just sound energy, which dramatically reduces false positives from background noise
- Noise‑tolerant turn detection
- True barge‑in via interrupt_response: true
- Immediate stop and truncation of AI audio when interrupted
threshold
Controls how sensitive speech detection is:
- Lower value → more sensitive
- Higher value → less sensitive
Typical values:
- 0.3–0.4 for quiet environments
- 0.5–0.7 for noisy call centers
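As a concrete illustration, the tuning guidance above could be wrapped in a small helper. The environment labels and exact values here are examples drawn from the ranges above, not part of the API; tune against real audio from your deployment:

```javascript
// Illustrative helper: pick a starting VAD threshold by environment.
// Labels and values are examples only; tune against real recordings.
function vadThresholdFor(environment) {
  switch (environment) {
    case "quiet-room":
      return 0.35; // more sensitive: picks up soft speech
    case "call-center":
      return 0.6;  // less sensitive: rejects background chatter
    default:
      return 0.5;  // balanced baseline, as in the config above
  }
}

console.log(vadThresholdFor("call-center")); // 0.6
```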
What Happens at Runtime
Once configured:
- The assistant begins speaking
- The user starts talking
- Voice Live detects speech server‑side
- AI audio stops immediately
- Unplayed audio is discarded
- The user’s speech becomes the active turn
No custom interruption logic is required: your application simply reacts to speech start events.
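In practice, reacting to a speech start event can be as simple as clearing the local playback queue. A minimal sketch, assuming the server emits an input_audio_buffer.speech_started event in the Realtime style; verify the exact event name against your session's event stream:

```javascript
// Sketch: react to server-side speech detection by dropping unplayed
// assistant audio. The event name "input_audio_buffer.speech_started"
// is an assumption; confirm it against the events your session emits.
function handleServerEvent(event, playbackQueue) {
  if (event.type === "input_audio_buffer.speech_started") {
    playbackQueue.length = 0; // discard queued assistant audio (barge-in)
    return "interrupted";
  }
  return "ignored";
}
```

Wire this into ws.on("message", ...) after parsing the JSON payload.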
Common Mistakes to Avoid
- Sending audio before session.update
- Forgetting interrupt_response: true
- Using overly aggressive thresholds
- Ignoring speech start events on the client
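The first mistake above can be avoided with a small gate that buffers microphone chunks until session.update has been sent. This is a generic buffering pattern, not a Voice Live API:

```javascript
// Generic buffering gate: hold microphone chunks until session.update
// has been sent, then flush them in order. Illustrative pattern only.
function createAudioGate(send) {
  let configured = false;
  const pending = [];
  return {
    // Call once, right after sending session.update.
    markConfigured() {
      configured = true;
      while (pending.length) send(pending.shift());
    },
    // Call for every microphone chunk.
    push(chunk) {
      if (configured) send(chunk);
      else pending.push(chunk);
    }
  };
}
```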
Best Practices
- Use Semantic VAD in noisy environments (call centers, mobile)
- Tune the threshold (higher for noisy spaces, lower for quiet rooms)
- Enable echo cancellation and noise suppression on the client microphone
- Always enable auto_truncate when using barge‑in
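On the client, echo cancellation and noise suppression are standard getUserMedia audio constraints. A browser-side sketch; autoGainControl is an extra (standard) constraint not mentioned above, and sampleRate is only a hint that mirrors input_audio_sampling_rate:

```javascript
// Standard MediaTrackConstraints for the client microphone.
const micConstraints = {
  audio: {
    echoCancellation: true, // keeps the mic from re-capturing assistant audio
    noiseSuppression: true, // reduces steady background noise
    autoGainControl: true,  // keeps speech level consistent
    sampleRate: 24000       // hint to match input_audio_sampling_rate
  }
};

// In a browser context:
// const stream = await navigator.mediaDevices.getUserMedia(micConstraints);
```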
Sample JavaScript code:
import WebSocket from "ws";

const VOICELIVE_URL =
  "wss://<your-resource>.services.ai.azure.com/voice-live/realtime" +
  "?api-version=2025-10-01&model=<model>";

// Use an Entra ID token or api-key header
const ws = new WebSocket(VOICELIVE_URL, {
  headers: {
    // Recommended:
    // Authorization: `Bearer ${process.env.AZURE_AI_TOKEN}`
    // or api-key for non-browser clients:
    // "api-key": process.env.AZURE_VOICELIVE_API_KEY
  }
});

ws.on("open", () => {
  console.log("Connected to Voice Live");
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      input_audio_sampling_rate: 24000,
      turn_detection: {
        type: "azure_semantic_vad",
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500,
        interrupt_response: true,
        auto_truncate: true
      }
    }
  }));
  console.log("session.update sent");
});
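After session.update, microphone audio is streamed over the same socket. A hedged sketch of building the audio message, assuming the Realtime-style input_audio_buffer.append event with a base64-encoded PCM16 payload; confirm both names against the Voice Live API reference:

```javascript
// Build an input audio message from a raw PCM16 chunk.
// "input_audio_buffer.append" and the base64 "audio" field follow the
// Realtime-style protocol; verify against the Voice Live reference.
function audioAppendMessage(pcm16Chunk) {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: Buffer.from(pcm16Chunk).toString("base64")
  });
}

// Usage: ws.send(audioAppendMessage(chunk)) for each microphone chunk.
```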