Azure AI Foundry Blog

Using the Voice Live API in Azure AI Foundry

PabloCastano
Oct 12, 2025

Enable real-time speech-to-speech conversational experiences through a unified API powered by generative AI models

In this blog post, we’ll explore the Voice Live API from Azure AI Foundry. Officially released for general availability on October 1, 2025, this API unifies speech recognition, generative AI, and text-to-speech capabilities into a single, streamlined interface. It removes the complexity of manually orchestrating multiple components and ensures a consistent developer experience across all models, making it easy to switch between models and experiment.

What sets Voice Live API apart are its advanced conversational enhancements, including:

  • Semantic Voice Activity Detection (VAD) that’s robust against background noise and accurately detects when a user intends to speak.
  • Semantic end-of-turn detection that supports natural pauses in conversation.
  • Server-side audio processing features like noise suppression and echo cancellation, simplifying client-side development (see the sketch after this list).
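
These enhancements are switched on through the session configuration rather than client-side code. Here is a minimal sketch; the field names input_audio_noise_reduction and input_audio_echo_cancellation follow the service’s session schema, but treat them as illustrative and check the Voice Live reference for your API version:

# Hedged sketch: enabling server-side audio processing in the session payload.
# Field names are assumptions based on the Voice Live session schema.
session_audio_options = {
    "turn_detection": {"type": "azure_semantic_vad"},                        # semantic VAD
    "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
    "input_audio_echo_cancellation": {"type": "server_echo_cancellation"},
}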

Let’s get started.

1. Getting Started with Voice Live API

The Voice Live API ships with an SDK that lets you open a single realtime WebSocket connection and do everything over it: stream microphone audio up and receive synthesized audio, text, and function-call events down, all without writing any of the low-level networking plumbing. This is how the connection is opened with the Python SDK:

from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential

async with connect(
        endpoint=VOICE_LIVE_ENDPOINT,          # https://<your-foundry-resource>.cognitiveservices.azure.com/
        credential=AzureKeyCredential(VOICE_LIVE_KEY),
        model="gpt-4o-realtime",
        connection_options={
            "max_msg_size": 10 * 1024 * 1024,  # allow streamed PCM
            "heartbeat": 20,                   # keep socket alive
            "timeout": 20,                     # network resilience
        },
    ) as connection:

Notice that you don't need to create a model deployment or manage any generative AI models yourself; the API handles all the underlying infrastructure.
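
If you prefer keyless authentication, the same connect call should also accept a Microsoft Entra ID token credential, as is standard across Azure SDK clients; treat that as an assumption and confirm it against the SDK reference for your version. A minimal sketch:

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential  # assumes async token credentials are accepted

async with connect(
        endpoint=VOICE_LIVE_ENDPOINT,
        credential=DefaultAzureCredential(),   # Microsoft Entra ID instead of an API key
        model="gpt-4o-realtime",
    ) as connection:
        ...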

Immediately after connecting, declare what kind of conversation you want. This is where you “teach” the session: the model instructions, which voice to synthesize, which tool functions it may call, and how to detect speech turns:

from azure.ai.voicelive.models import (
    RequestSession, Modality,
    AzureStandardVoice, InputAudioFormat, OutputAudioFormat,
    AzureSemanticVad, ToolChoiceLiteral,
    AudioInputTranscriptionOptions
)

session_config = RequestSession(
        modalities=[Modality.TEXT, Modality.AUDIO],
        instructions="Assist the user with account questions succinctly.",
        voice=AzureStandardVoice(name="alloy", type="azure-standard"),
        input_audio_format=InputAudioFormat.PCM16,
        output_audio_format=OutputAudioFormat.PCM16,
        turn_detection=AzureSemanticVad(
            threshold=0.5,
            prefix_padding_ms=300,
            silence_duration_ms=500
        ),
        tools=[  # optional
            {
                "name": "get_user_information",
                "description": "Retrieve profile and limits for a user",
                "input_schema": {
                    "type": "object",
                    "properties": {"user_id": {"type": "string"}},
                    "required": ["user_id"]
                }
            }
        ],
        tool_choice=ToolChoiceLiteral.AUTO,
        input_audio_transcription=AudioInputTranscriptionOptions(model="whisper-1"),
    )

await connection.session.update(session=session_config)

After session setup, it is pure event-driven flow:

from azure.ai.voicelive.models import ItemType, ServerEventType

async for event in connection:
    if event.type == ServerEventType.RESPONSE_AUDIO_DELTA:
        playback_queue.put(event.delta)          # queue synthesized audio chunks for playback
    elif event.type == ServerEventType.CONVERSATION_ITEM_CREATED and event.item.type == ItemType.FUNCTION_CALL:
        handle_function_call(event)              # the model wants to call one of your tools

That’s the core: one connection, one session config message, then pure event-driven flow.
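
The upstream direction is symmetric: capture PCM16 from the microphone and append it to the input audio buffer as base64-encoded chunks. A minimal sketch, assuming the SDK exposes an input_audio_buffer.append operation mirroring the realtime protocol (check the SDK reference for the exact helper in your version):

import base64

async def stream_microphone(connection, mic_chunks):
    """Send 16-bit PCM chunks from the microphone to the service."""
    async for chunk in mic_chunks:               # e.g. 20 ms PCM16 frames from your audio library
        audio_b64 = base64.b64encode(chunk).decode("ascii")
        # Assumed operation mirroring the realtime protocol's input_audio_buffer.append
        await connection.input_audio_buffer.append(audio=audio_b64)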

2. Deep Dive: Tool (Function) Handling in the Voice Live SDK

In the Voice Live context, “tools” are model-invocable functions you expose with a JSON schema. The SDK streams a structured function call request (name + incrementally streamed arguments), you execute real code locally, then feed the JSON result back so the model can incorporate it into its next spoken (and/or textual) turn. Let’s unpack the full lifecycle.

First, the model emits a CONVERSATION_ITEM_CREATED event whose item.type is FUNCTION_CALL:

if event.item.type == ItemType.FUNCTION_CALL:
    await self._handle_function_call(event, connection)

Arguments stream in (possibly token by token) until the SDK signals RESPONSE_FUNCTION_CALL_ARGUMENTS_DONE. Optionally, the SDK may also complete the “response” segment with RESPONSE_DONE before you run the tool. We then execute the local Python function, send its output back as a conversation item, and explicitly request a new model response via connection.response.create(), telling the model to incorporate the tool result into a natural-language (and audio) answer.

import json

from azure.ai.voicelive.models import FunctionCallOutputItem, ServerEventType

async def _handle_function_call(self, created_evt, connection):
    # _wait_for_event is a small sample helper that drains events until one of
    # the requested event types arrives.
    call_item = created_evt.item      # ResponseFunctionCallItem
    name = call_item.name
    call_id = call_item.call_id
    prev_id = call_item.id

    # 1. Wait until arguments are fully streamed
    args_done = await _wait_for_event(
        connection, {ServerEventType.RESPONSE_FUNCTION_CALL_ARGUMENTS_DONE}
    )
    assert args_done.call_id == call_id
    arguments = args_done.arguments   # JSON string

    # 2. (Optional) Wait for RESPONSE_DONE to avoid race with model finishing segment
    await _wait_for_event(connection, {ServerEventType.RESPONSE_DONE})

    # 3. Execute
    func = self.available_functions.get(name)
    if not func:
        # Optionally send an error function output
        return

    result = await func(arguments)  # Implementations are async in this sample

    # 4. Send output
    output_item = FunctionCallOutputItem(call_id=call_id, output=json.dumps(result))
    await connection.conversation.item.create(
        previous_item_id=prev_id,
        item=output_item
    )

    # 5. Trigger follow-up model response
    await connection.response.create()
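
For completeness, here is a hedged sketch of the local side: a registry that self.available_functions could point to, plus an async implementation that parses the streamed JSON arguments. The get_user_information lookup and its return shape are illustrative, not part of the SDK:

import json

async def get_user_information(arguments: str) -> dict:
    """Illustrative tool: parse the streamed JSON arguments and return a JSON-serializable result."""
    args = json.loads(arguments)                 # arguments arrive as a JSON string
    user_id = args["user_id"]
    # Replace with a real lookup (database, Azure AI Search, CRM, ...)
    return {"user_id": user_id, "plan": "premium", "credit_limit": 5000}

# Registry consulted by the handler via self.available_functions.get(name)
available_functions = {"get_user_information": get_user_information}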

 

3. Sample App

Try the sample app repo we have created, which also automates all of the required infrastructure.

This sample app simulates a friendly real‑time contact‑center rep who can listen continuously, understand you as you speak, instantly look up things like your credit card’s upcoming due date or a product detail via function calls, and then answer back naturally in a Brazilian Portuguese neural voice with almost no lag.

Behind the scenes it streams your microphone audio to Azure’s Voice Live (GPT‑4o realtime) model, transcribes and reasons on the fly, selectively triggers lightweight “get user information” or “get product information” lookups to Azure AI Search, and speaks responses right back to you.
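
If you want to reproduce the Brazilian Portuguese voice, the session’s voice configuration is the place to change it. A minimal sketch, using pt-BR-FranciscaNeural as an example Azure neural voice name (any pt-BR neural voice from the Speech voice gallery should work; the exact name and the Portuguese instructions below are illustrative):

from azure.ai.voicelive.models import AzureStandardVoice, Modality, RequestSession

session_config = RequestSession(
    modalities=[Modality.TEXT, Modality.AUDIO],
    instructions="Você é um atendente cordial de contact center. Responda em português do Brasil.",
    voice=AzureStandardVoice(name="pt-BR-FranciscaNeural", type="azure-standard"),
)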

 

Happy Coding!
