We are thrilled to announce that the Voice Live API now supports WebRTC (Web Real-Time Communication) connections, enabling low‑latency, real‑time voice interactions directly from web and mobile clients.
Why WebRTC matters for real-time voice agents
WebRTC enables real‑time, bi‑directional audio and video streaming directly in the browser without plugins or native installs. Unlike WebSockets, which treat audio as generic data and require custom buffering and timing logic, WebRTC is purpose‑built for the media‑aware streaming that responsive conversational experiences need. Specifically:
- Lower latency: WebRTC is designed to minimize delay, making it more suitable for audio and video communication where low latency is critical for maintaining quality and synchronization.
- Built-in media handling: WebRTC has built-in support for audio and video codecs, providing optimized handling of media streams.
- Network resilience: WebRTC includes mechanisms for handling packet loss and jitter, which are essential for maintaining the quality of audio streams over unpredictable networks.
How to set up WebRTC in Voice Live
In a typical setup, the client establishes a WebSocket‑based control channel with the Voice Live API to exchange SDP offer and answer messages required for WebRTC session negotiation. Once negotiation completes, audio is transmitted over WebRTC RTP media tracks.
Non‑audio events, such as voice activity and response lifecycle signals, are exchanged over WebRTC data channels alongside the media streams. Session configuration, control‑plane messages, and error notifications are delivered through the WebSocket control channel.
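To make the split between the three transports easier to see, here is a minimal sketch that encodes the description above as a lookup table. The category labels (such as "voice_activity" or "sdp_offer") and the transport names are hypothetical, chosen only for illustration; they are not Voice Live event names or API identifiers.

```typescript
// Illustrative only: these labels are assumptions made for this sketch,
// not event names defined by the Voice Live API.
type TrafficKind =
  | "audio"
  | "voice_activity"
  | "response_lifecycle"
  | "sdp_offer"
  | "sdp_answer"
  | "session_configuration"
  | "control_message"
  | "error_notification";

type Transport = "rtp_media_track" | "data_channel" | "websocket_control";

// Mapping taken from the prose: audio flows over RTP media tracks,
// non-audio events over data channels, and everything else
// (including SDP negotiation) over the WebSocket control channel.
const TRANSPORT_BY_KIND: Record<TrafficKind, Transport> = {
  audio: "rtp_media_track",
  voice_activity: "data_channel",
  response_lifecycle: "data_channel",
  sdp_offer: "websocket_control",
  sdp_answer: "websocket_control",
  session_configuration: "websocket_control",
  control_message: "websocket_control",
  error_notification: "websocket_control",
};

// Returns the transport that carries a given kind of traffic.
function transportFor(kind: TrafficKind): Transport {
  return TRANSPORT_BY_KIND[kind];
}
```

A client could use a table like this to decide where to attach its handlers, for example listening for errors on the WebSocket rather than on the data channel.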
When initiating a WebRTC call session, simply use the voice-live/realtime/calls endpoint instead of voice-live/realtime. For example:
wss://<your-ai-foundry-resource-name>.services.ai.azure.com/voice-live/realtime/calls?api-version=2026-01-01-preview&model=gpt-realtime
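As a minimal sketch, the endpoint URL above could be assembled with a small helper like the one below. The function name buildCallsUrl and its option names are hypothetical, introduced only for this example; authentication and any additional query parameters are not covered here.

```typescript
// Hypothetical helper for illustration; not part of any Voice Live SDK.
interface CallsUrlOptions {
  resourceName: string; // your AI Foundry resource name
  model: string;        // e.g. "gpt-realtime"
  apiVersion: string;   // e.g. "2026-01-01-preview"
}

// Builds the WebSocket URL for the voice-live/realtime/calls endpoint,
// following the URL pattern shown in the example above.
function buildCallsUrl(options: CallsUrlOptions): string {
  const { resourceName, model, apiVersion } = options;
  const base = `wss://${resourceName}.services.ai.azure.com/voice-live/realtime/calls`;
  // Encode query values so unexpected characters cannot break the URL.
  const query =
    `api-version=${encodeURIComponent(apiVersion)}` +
    `&model=${encodeURIComponent(model)}`;
  return `${base}?${query}`;
}

const exampleUrl = buildCallsUrl({
  resourceName: "my-resource",
  model: "gpt-realtime",
  apiVersion: "2026-01-01-preview",
});
```

Keeping the path in one place makes it harder to accidentally connect to voice-live/realtime when a WebRTC call session is intended.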
For more information, see the step-by-step instructions.
Learn more
Voice Live API is transforming how developers build voice-enabled agent systems by providing an integrated, scalable, and efficient solution. By combining speech recognition, generative AI, and text-to-speech functionalities into a unified interface, it addresses the challenges of traditional implementations, enabling faster development and superior user experiences. From streamlining customer service to enhancing education and public services, the opportunities are endless. The future of voice-first solutions is here—let’s build it together!
Voice Live API introduction (video)