Healthcare and Life Sciences Blog

Configuring Noise Detection and Barge‑In with Azure Voice Live API

Yan_Liang
Microsoft
Mar 30, 2026

Natural voice conversations depend on two things: knowing when a user is speaking and letting them interrupt naturally. Without reliable noise detection and barge‑in, voice agents feel rigid and frustrating—especially in real‑world environments like call centers or mobile scenarios.

Azure Voice Live API addresses this by providing built‑in noise handling, server‑side Voice Activity Detection (VAD), and native barge‑in support—all configurable with a single session update.

How Voice Live Handles Noise and Interruption

Voice Live performs server‑side speech detection on the incoming audio stream. Instead of relying on simple volume thresholds, it can use Azure Semantic VAD, which is more resilient to background noise and conversational fillers.

When enabled:

Background noise is ignored

Speech start and stop are detected automatically

User speech can interrupt the assistant mid‑response (barge‑in)

All of this happens without stitching together separate STT, silence detection, or TTS cancellation logic.

The Key Configuration: session.update

Noise detection and barge‑in are configured using the session.update event, typically sent immediately after opening the Voice Live WebSocket session from the client side, such as from Azure Container Apps (ACA) or an Azure Function.

Below is a recommended baseline configuration:

{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_sampling_rate": 24000,
    "turn_detection": {
      "type": "azure_semantic_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500,
      "interrupt_response": true,
      "auto_truncate": true
    }
  }
}

 

What this configuration enables:

  • Azure Semantic VAD for robust speech detection

Uses Azure’s semantic model to detect actual speech, not just sound energy. This dramatically reduces false positives from background noise.

  • Noise‑tolerant turn detection
  • True barge‑in via interrupt_response: true
  • Immediate stop and truncation of AI audio when interrupted

 

  • threshold

Controls how sensitive speech detection is:

  • Lower value → more sensitive
  • Higher value → less sensitive

Typical values:

  • 0.3–0.4 for quiet environments
  • 0.5–0.7 for noisy call centers
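As a hypothetical illustration of that tuning guidance, a client could map its deployment environment to a starting threshold. The function name and environment labels below are our own, not part of the Voice Live API:

```javascript
// Map a deployment environment to a starting VAD threshold,
// following the tuning guidance above (hypothetical helper).
function pickThreshold(environment) {
  switch (environment) {
    case "quiet-room":
      return 0.35; // more sensitive: picks up soft speech
    case "call-center":
    case "mobile":
      return 0.6;  // less sensitive: tolerates background chatter
    default:
      return 0.5;  // the baseline from the configuration above
  }
}

// The returned value goes into turn_detection.threshold.
console.log(pickThreshold("call-center"));
```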

 

What Happens at Runtime

Once configured:

  • The assistant begins speaking
  • The user starts talking
  • Voice Live detects speech server‑side
  • AI audio stops immediately
  • Unplayed audio is discarded
  • The user’s speech becomes the active turn

No custom interruption logic is required; your application simply reacts to speech start events.
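The runtime flow above can be sketched as a minimal client-side handler. The event name (input_audio_buffer.speech_started) and the playback queue are assumptions for illustration; with interrupt_response and auto_truncate set, the service stops the assistant's turn server-side, and the client only needs to drop any audio chunks it has buffered but not yet played:

```javascript
// Sketch: react to server-side speech detection by discarding any
// assistant audio the client has buffered but not yet played.
// Event name and queue shape are illustrative assumptions.
function handleServerEvent(event, playbackQueue) {
  if (event.type === "input_audio_buffer.speech_started") {
    const discarded = playbackQueue.length;
    playbackQueue.length = 0; // drop unplayed assistant audio
    return { interrupted: true, discarded };
  }
  return { interrupted: false, discarded: 0 };
}
```

In practice this would run inside the WebSocket message handler, e.g. ws.on("message", ...) after parsing the JSON event.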

Common Mistakes to Avoid

  • Sending audio before session.update
  • Forgetting interrupt_response: true
  • Using overly aggressive thresholds
  • Ignoring speech start events on the client
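The first mistake above can be guarded against with a small gate that drops microphone audio until the session has been configured. The class, the "session.updated" acknowledgement event name, and the append message shape are illustrative assumptions, not a definitive implementation:

```javascript
// Guard: only stream microphone audio after the session has been
// configured. Class name and ack event name are assumptions.
class AudioGate {
  constructor() {
    this.configured = false;
  }
  onServerEvent(event) {
    if (event.type === "session.updated") this.configured = true;
  }
  trySend(ws, base64Chunk) {
    if (!this.configured) return false; // drop (or buffer) until ready
    ws.send(JSON.stringify({ type: "input_audio_buffer.append", audio: base64Chunk }));
    return true;
  }
}
```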

 

Best Practices

Use Semantic VAD in noisy environments (call centers, mobile)

Tune the threshold (higher for noisy spaces, lower for quiet rooms)

Enable echo cancellation and noise suppression on the client microphone

Always enable auto_truncate when using barge‑in
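In a browser client, the echo-cancellation and noise-suppression recommendation maps to standard getUserMedia audio constraints. Shown here as a constraints object (getUserMedia itself is browser-only):

```javascript
// Standard MediaTrackConstraints asking the browser to apply echo
// cancellation and noise suppression before audio reaches the stream.
const micConstraints = {
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
    sampleRate: 24000 // match input_audio_sampling_rate above
  }
};

// In a browser: navigator.mediaDevices.getUserMedia(micConstraints)
```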

 

Sample JavaScript code:

import WebSocket from "ws";

const VOICELIVE_URL =
  "wss://<your-resource>.services.ai.azure.com/voice-live/realtime" +
  "?api-version=2025-10-01&model=<model>";

// Use an Entra ID token or api-key header
const ws = new WebSocket(VOICELIVE_URL, {
  headers: {
    // Recommended:
    // Authorization: `Bearer ${process.env.AZURE_AI_TOKEN}`
    // or api-key for non-browser clients:
    // "api-key": process.env.AZURE_VOICELIVE_API_KEY
  }
});

ws.on("open", () => {
  console.log("Connected to Voice Live");

  // Configure noise detection and barge-in before sending any audio
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      input_audio_sampling_rate: 24000,
      turn_detection: {
        type: "azure_semantic_vad",
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500,
        interrupt_response: true,
        auto_truncate: true
      }
    }
  }));
  console.log("session.update sent");
});

 

Published Mar 30, 2026
Version 1.0