Microsoft Foundry Blog

From Zero to Hero: Building a Production-Ready SIP Gateway for Azure Voice Live

mrajguru
Nov 27, 2025

Introduction

Voice technology is transforming how we interact with machines, making conversations with AI feel more natural than ever before. With the public beta release of the Voice Live API, developers now have the tools to create low-latency, multimodal voice experiences in their apps, opening up endless possibilities for innovation.

Gone are the days when building a voice bot required stitching together multiple models for transcription, inference, and text-to-speech conversion. With the Voice Live API, developers can streamline the entire process with a single API call, enabling fluid, natural speech-to-speech conversations. This is a game-changer for industries like customer support, education, and real-time language translation, where fast, seamless interactions are crucial.

In this comprehensive technical blog, we'll explore the architecture, implementation, and production deployment of a Python-based SIP gateway that bridges traditional telephony infrastructure with Azure's cutting-edge Voice Live real-time conversation API. This gateway enables callers from any SIP endpoint—whether a desk phone, mobile softphone, or PSTN connection—to engage in natural, AI-powered voice conversations.

By the end of this article, you'll understand:

  • The architectural design of a production-grade SIP-to-WebSocket bridge
  • Audio transcoding and resampling strategies for seamless media conversion
  • Real-world deployment with both cloud-based telephony and on-premises platforms such as Avaya and Genesys
  • Step-by-step setup instructions for both local testing and enterprise production environments

Architecture Overview

High-Level Design

How Voice Live API Works

Traditionally, building a voice assistant required chaining together several models: an automatic speech recognition (ASR) model like Whisper for transcribing audio, a text-based model for processing responses, and a text-to-speech (TTS) model for generating audio outputs. This multi-step process often led to delays and a loss of emotional nuance.

The Voice Live API revolutionizes this by consolidating these functionalities into a single API call. By establishing a persistent WebSocket connection, developers can stream audio inputs and outputs directly, significantly reducing latency and enhancing the naturalness of conversations. Additionally, the API's function calling capability allows the voice bot to perform actions such as placing orders or retrieving customer information on the fly.

The gateway acts as a bidirectional media proxy between the SIP/RTP world (telephony) and Azure Voice Live's WebSocket-based real-time API. It translates both signaling protocols and audio formats, enabling seamless integration between legacy VoIP infrastructure and modern AI-powered conversational agents.

Key Design Principles

  1. Asynchronous-First Architecture: Built on asyncio for non-blocking I/O, ensuring low latency and high concurrency potential
  2. Separation of Concerns: Modular design with distinct layers for SIP, media, and Voice Live integration
  3. Production-Ready Error Handling: Graceful degradation with silence insertion when audio buffers underrun
  4. Structured Logging: structlog provides machine-parseable, contextual logs for observability
  5. Type Safety: Leverages pydantic for settings validation and mypy for static type checking
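To give a flavor of the pydantic-based settings layer, here is a minimal sketch of what settings.py might look like. The field names mirror the environment variables used later in this post; the actual repository's model may differ:

```python
import os

from pydantic import BaseModel, Field


class SipSettings(BaseModel):
    server: str = ""
    port: int = Field(default=5060, ge=1, le=65535)  # validated at load time
    user: str = ""
    register_with_server: bool = False


class Settings(BaseModel):
    sip: SipSettings = SipSettings()


def load_settings() -> Settings:
    """Build Settings from the environment variables used later in this post."""
    return Settings(
        sip=SipSettings(
            server=os.environ.get("SIP_SERVER", ""),
            port=int(os.environ.get("SIP_PORT", "5060")),
            user=os.environ.get("SIP_USER", ""),
            register_with_server=os.environ.get(
                "REGISTER_WITH_SIP_SERVER", "false"
            ).lower() == "true",
        )
    )
```

Invalid values (say, SIP_PORT=99999) fail fast at startup with a clear validation error instead of surfacing mid-call.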

Core Components

 

Key roles of the Session Border Controller (SBC):

  • Terminate carrier SIP and interwork toward Azure SIP/RTC endpoints.
  • Normalize SIP headers, strip unsupported options, and map codecs.
  • Provide security: TLS signaling, SRTP media, topology hiding, and ACLs.
  • Optional media anchoring for lawful intercept, recording, or QoS smoothing.

Project Structure

src/voicelive_sip_gateway/
├── config/
│   ├── __init__.py
│   └── settings.py          # Pydantic-based configuration
├── gateway/
│   └── main.py              # Application entry point & lifecycle
├── logging/
│   ├── __init__.py
│   └── setup.py             # Structlog configuration
├── media/
│   ├── __init__.py
│   ├── stream_bridge.py     # Bidirectional audio queue manager
│   └── transcode.py         # μ-law ↔ PCM16 + resampling
├── sip/
│   ├── __init__.py
│   ├── agent.py             # pjsua2 wrapper & call handling
│   ├── rtp.py               # RTP utilities
│   └── sdp.py               # SDP parsing/generation
└── voicelive/
    ├── __init__.py
    ├── client.py            # Azure Voice Live SDK wrapper
    └── events.py            # Event type mapping

 

Call Flow

  1. Customer calls your number
  2. Audiocodes SBC receives from PSTN, forwards SIP INVITE to Asterisk
  3. Asterisk routing logic sends INVITE to voicelive-bot@gateway.example.com
  4. Voice Live gateway registers with Asterisk, answers the call
  5. RTP audio flows: Caller ↔ SBC ↔ Asterisk ↔ Gateway ↔ Azure

Audio Pipeline: From μ-law to PCM16

The Challenge

Traditional telephony uses G.711 μ-law codec at 8 kHz sampling rate for bandwidth efficiency. Azure Voice Live expects PCM16 (16-bit linear PCM) at 24 kHz. Our gateway must perform both codec conversion and sample rate conversion in real-time with minimal latency.
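In the gateway, pjsua2 handles the μ-law codec itself, and the production code uses scipy.signal.resample_poly for rate conversion (see the latency table later). To illustrate what a resample_pcm16 helper does, here is a dependency-free linear-interpolation version; it is a sketch, not the shipped implementation:

```python
import array


def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Linear-interpolation resampler for 16-bit mono PCM (illustrative only)."""
    if src_rate == dst_rate or not data:
        return data
    samples = array.array("h")       # signed 16-bit samples
    samples.frombytes(data)
    n_src = len(samples)
    n_dst = int(n_src * dst_rate / src_rate)
    out = array.array("h", bytes(2 * n_dst))
    for i in range(n_dst):
        # Map each output index to a fractional position in the input.
        pos = i * (n_src - 1) / max(n_dst - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, n_src - 1)
        frac = pos - lo
        out[i] = int(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out.tobytes()
```

A 20 ms frame at 8 kHz (160 samples, 320 bytes) becomes 480 samples (960 bytes) at 24 kHz. A polyphase filter like resample_poly adds proper anti-aliasing, which matters for speech quality; linear interpolation is shown here only for clarity.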

Audio Flow Diagram

 

Caller (SIP)                    Gateway                     Azure Voice Live
─────────────                   ───────                     ────────────────
    │                              │                               │
    │ RTP: μ-law 8kHz              │                               │
    ├──────────────────────────────►                               │
    │                              │                               │
    │                         ┌────▼────┐                          │
    │                         │ pjsua2  │ (decodes to PCM16 8kHz)  │
    │                         └────┬────┘                          │
    │                              │                               │
    │                         ┌────▼────────┐                      │
    │                         │ Resample    │ (8kHz → 24kHz)       │
    │                         │ PCM16       │                      │
    │                         └────┬────────┘                      │
    │                              │                               │
    │                              ├───────────────────────────────►
    │                              │   WebSocket: PCM16 24kHz      │
    │                              │                               │
    │                              │◄──────────────────────────────┤
    │                              │   Response: PCM16 24kHz       │
    │                              │                               │
    │                         ┌────▼────────┐                      │
    │                         │ Resample    │ (24kHz → 8kHz)       │
    │                         │ PCM16       │                      │
    │                         └────┬────────┘                      │
    │                              │                               │
    │                         ┌────▼────┐                          │
    │◄────────────────────────┤ pjsua2  │ (encodes to μ-law)       │
    │  RTP: μ-law 8kHz        └─────────┘                          │

 

Let's walk through the key components of the code to make the overall flow clearer.

Audio Stream Bridge: stream_bridge.py

The AudioStreamBridge class orchestrates bidirectional audio flow using asyncio queues:

class AudioStreamBridge:
    """Bidirectional audio pump between SIP (μ-law) and Voice Live (PCM16 24kHz)."""
    
    VOICELIVE_SAMPLE_RATE = 24000
    SIP_SAMPLE_RATE = 8000
    
    def __init__(self, settings: Settings):
        self._inbound_queue: asyncio.Queue[bytes] = asyncio.Queue()   # SIP → Voice Live
        self._outbound_queue: asyncio.Queue[bytes] = asyncio.Queue()  # Voice Live → SIP

Inbound Path (Caller → AI):

async def _flush(self) -> None:
    """Process inbound audio: PCM16 8kHz from SIP → PCM16 24kHz to Voice Live."""
    while True:
        pcm16_8k = await self._inbound_queue.get()
        pcm16_24k = resample_pcm16(pcm16_8k, self.SIP_SAMPLE_RATE, self.VOICELIVE_SAMPLE_RATE)
        if self._voicelive_client:
            await self._voicelive_client.send_audio_chunk(pcm16_24k)

Outbound Path (AI → Caller):

async def emit_audio_to_sip(self, pcm_chunk: bytes) -> None:
    """Resample Voice Live audio down to 8 kHz PCM frames for SIP playback."""
    pcm_8k = resample_pcm16(pcm_chunk, self.VOICELIVE_SAMPLE_RATE, self.SIP_SAMPLE_RATE)
    
    # Split into 20ms frames (160 samples @ 8kHz = 320 bytes)
    frame_size_bytes = 320
    for offset in range(0, len(pcm_8k), frame_size_bytes):
        frame = pcm_8k[offset : offset + frame_size_bytes]
        if frame:
            await self._outbound_queue.put(frame)

Frame Timing: VoIP typically uses 20ms frames (ptime=20). At 8 kHz: 8000 samples/sec × 0.020 sec = 160 samples = 320 bytes (PCM16).
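The frame-timing arithmetic above, spelled out in code:

```python
# 20 ms of 16-bit mono PCM at the SIP sample rate
SIP_SAMPLE_RATE = 8000   # Hz
PTIME_MS = 20            # RTP packetization time (ptime)
BYTES_PER_SAMPLE = 2     # PCM16

samples_per_frame = SIP_SAMPLE_RATE * PTIME_MS // 1000   # 160 samples
frame_size_bytes = samples_per_frame * BYTES_PER_SAMPLE  # 320 bytes
```

This is why emit_audio_to_sip slices the resampled buffer into 320-byte chunks before queueing.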

SIP Signaling with pjsua2

Why pjsua2?

pjproject is the gold standard for open-source SIP/RTP implementation; it powers commercial products and Asterisk's PJSIP channel driver. The pjsua2 API provides:

  • Full SIP stack (INVITE, ACK, BYE, REGISTER, etc.)
  • RTP/RTCP media handling
  • Built-in codec support (G.711, G.722, Opus, etc.)
  • NAT traversal (STUN/TURN/ICE)
  • Thread-safe C++ API with Python bindings

Custom Media Ports

To bridge pjsua2 with our asyncio-based audio queues, we implement custom AudioMediaPort subclasses:

Receiving Audio from Caller

def onFrameReceived(self, frame: pj.MediaFrame) -> None:
    """Called by pjsua when it receives audio from caller (to Voice Live)."""
    if self._direction == "to_voicelive" and self._loop:
        if frame.type == pj.PJMEDIA_FRAME_TYPE_AUDIO and frame.buf:
            try:
                asyncio.run_coroutine_threadsafe(
                    self._bridge.enqueue_sip_audio(bytes(frame.buf)),
                    self._loop
                )
            except Exception as e:
                self._logger.warning("media.enqueue_failed", error=str(e))

Threading Considerations: pjsua2 runs its event loop in a dedicated thread. We use asyncio.run_coroutine_threadsafe() to safely enqueue data into the main asyncio event loop.
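This hand-off pattern can be demonstrated in isolation. In the sketch below, a plain thread stands in for the pjsua2 callback thread and pushes frames into an asyncio.Queue owned by the main event loop:

```python
import asyncio
import threading

received: list[bytes] = []


async def main() -> None:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue[bytes] = asyncio.Queue()

    def pjsua_thread() -> None:
        # Stands in for the pjsua2 media callback thread delivering frames.
        for frame in (b"a", b"b", b"c"):
            future = asyncio.run_coroutine_threadsafe(queue.put(frame), loop)
            future.result()  # block the worker until the loop accepts the frame

    worker = threading.Thread(target=pjsua_thread)
    worker.start()
    for _ in range(3):
        received.append(await queue.get())
    worker.join()


asyncio.run(main())
```

Calling queue.put_nowait() directly from the foreign thread would not be safe; run_coroutine_threadsafe is the supported way to schedule work onto a loop from outside it.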

Sending Audio to Caller

def onFrameRequested(self, frame: pj.MediaFrame) -> None:
    """Called by pjsua when it needs audio to send to caller (from Voice Live)."""
    if self._direction == "from_voicelive" and self._loop:
        try:
            future = asyncio.run_coroutine_threadsafe(
                self._bridge.dequeue_sip_audio_nonblocking(),
                self._loop
            )
            pcm_data = future.result(timeout=0.050)
            
            # Ensure exactly 320 bytes (160 samples @ 8kHz)
            if len(pcm_data) != 320:
                pcm_data = (pcm_data + b'\x00' * 320)[:320]
            
            frame.type = pj.PJMEDIA_FRAME_TYPE_AUDIO
            frame.buf = pj.ByteVector(pcm_data)
            frame.size = len(pcm_data)
        except Exception:
            # Return silence on timeout/error
            frame.type = pj.PJMEDIA_FRAME_TYPE_AUDIO
            frame.buf = pj.ByteVector(b'\x00' * 320)
            frame.size = 320

Graceful Degradation: If the outbound queue is empty (AI hasn't generated audio yet), we inject silence frames to prevent RTP jitter/dropout.

Call State Management

class GatewayCall(pj.Call):
    """Handles SIP call lifecycle and connects media bridge."""
    
    def onCallState(self, prm: pj.OnCallStateParam) -> None:
        ci = self.getInfo()
        self._logger.info("sip.call_state", remote_uri=ci.remoteUri, state=ci.stateText)
        
        if ci.state == pj.PJSIP_INV_STATE_DISCONNECTED:
            self._cleanup()
            self._account.current_call = None
    
    def onCallMediaState(self, prm: pj.OnCallMediaStateParam) -> None:
        ci = self.getInfo()
        for mi in ci.media:
            if mi.type == pj.PJMEDIA_TYPE_AUDIO and mi.status == pj.PJSUA_CALL_MEDIA_ACTIVE:
                media = self.getMedia(mi.index)
                aud_media = pj.AudioMedia.typecastFromMedia(media)
                
                # Create bidirectional media bridge
                self._to_voicelive_port = CustomAudioMediaPort(self._bridge, "to_voicelive", self._logger)
                self._from_voicelive_port = CustomAudioMediaPort(self._bridge, "from_voicelive", self._logger)
                
                # Connect: Caller → to_voicelive_port → Voice Live
                aud_media.startTransmit(self._to_voicelive_port)
                # Connect: Voice Live → from_voicelive_port → Caller
                self._from_voicelive_port.startTransmit(aud_media)
                
                # Start AI conversation with greeting
                asyncio.run_coroutine_threadsafe(
                    self._voicelive_client.request_response(interrupt=False),
                    self._loop
                )

 

Account Registration

For production deployments with Asterisk:

class SipAgent:
    def _run_pjsua_thread(self, loop: asyncio.AbstractEventLoop) -> None:
        self._ep = pj.Endpoint()
        self._ep.libCreate()
        self._ep.libInit(ep_cfg)
        self._ep.transportCreate(pj.PJSIP_TRANSPORT_UDP, transport_cfg)
        self._ep.libStart()
        
        self._account = GatewayAccount(self._logger, self._bridge, self._voicelive_client, loop)
        
        if self._settings.sip.register_with_server and self._settings.sip.server:
            acc_cfg = pj.AccountConfig()
            acc_cfg.idUri = f"sip:{self._settings.sip.user}@{self._settings.sip.server}"
            acc_cfg.regConfig.registrarUri = f"sip:{self._settings.sip.server}"
            
            # Digest authentication credentials
            cred = pj.AuthCredInfo()
            cred.scheme = "digest"
            cred.realm = self._settings.sip.auth_realm or "*"
            cred.username = self._settings.sip.auth_user
            cred.data = self._settings.sip.auth_password
            cred.dataType = pj.PJSIP_CRED_DATA_PLAIN_PASSWD
            acc_cfg.sipConfig.authCreds.append(cred)
            
            self._account.create(acc_cfg)

 

Voice Live Integration

Azure Voice Live Overview

Azure Voice Live is a real-time, bidirectional conversational AI service that combines:

  • GPT-4o Realtime Preview: Ultra-low latency language model optimized for spoken conversations
  • Streaming Speech Recognition: Continuous transcription with word-level timestamps
  • Neural Text-to-Speech: Natural-sounding synthesis with emotional expressiveness
  • Server-side VAD: Voice Activity Detection for turn-taking without explicit prompts

Client Implementation: client.py

class VoiceLiveClient:
    """Manages lifecycle of an Azure Voice Live WebSocket connection."""
    
    async def connect(self) -> None:
        if self._settings.azure.api_key:
            credential = AzureKeyCredential(self._settings.azure.api_key)
        else:
            # Use AAD authentication (Managed Identity or Service Principal)
            self._aad_credential = DefaultAzureCredential()
            credential = await self._aad_credential.__aenter__()
        
        self._connection_cm = connect(
            endpoint=self._settings.azure.endpoint,
            credential=credential,
            model=self._settings.azure.model,
        )
        self._connection = await self._connection_cm.__aenter__()
        
        # Configure session parameters
        session = RequestSession(
            model=self._settings.azure.model,
            modalities=[Modality.TEXT, Modality.AUDIO],
            instructions=self._settings.azure.instructions,
            input_audio_format=InputAudioFormat.PCM16,
            output_audio_format=OutputAudioFormat.PCM16,
            input_audio_transcription=AudioInputTranscriptionOptions(model="azure-speech"),
            turn_detection=ServerVad(
                threshold=0.5,
                prefix_padding_ms=200,
                silence_duration_ms=400
            ),
            voice=AzureStandardVoice(name=self._settings.azure.voice)
        )
        await self._connection.session.update(session=session)

Key Configuration Choices:

  • turn_detection=ServerVad(...): Azure detects when the user stops speaking and automatically triggers AI response generation. No need for wake words or explicit prompts.
  • prefix_padding_ms=200: Include 200ms of audio before speech detection for natural cutoff
  • silence_duration_ms=400: Wait 400ms of silence before considering turn complete

Streaming Response Audio: AI-generated speech arrives as RESPONSE_AUDIO_DELTA events containing base64-encoded PCM16 chunks. We decode and push these through the audio bridge immediately for low-latency playback.
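Here is a self-contained sketch of that decode-and-forward path, using stand-in event tuples and a fake bridge. The real SDK delivers typed ServerEvent objects, so the shapes here are illustrative only:

```python
import asyncio
import base64


class FakeBridge:
    """Stand-in for AudioStreamBridge, collecting decoded PCM chunks."""

    def __init__(self) -> None:
        self.chunks: list[bytes] = []

    async def emit_audio_to_sip(self, pcm: bytes) -> None:
        self.chunks.append(pcm)


async def handle_audio_deltas(events, bridge: FakeBridge) -> None:
    # Decode base64-encoded PCM16 deltas and hand them to the SIP side.
    async for event_type, payload in events:
        if event_type == "response.audio.delta":
            await bridge.emit_audio_to_sip(base64.b64decode(payload))


async def fake_events():
    pcm = b"\x00\x01" * 10  # 20 bytes of fake PCM16
    yield "response.audio.delta", base64.b64encode(pcm).decode()
    yield "response.done", ""


bridge = FakeBridge()
asyncio.run(handle_audio_deltas(fake_events(), bridge))
```

The key property is that each delta is forwarded as soon as it arrives, rather than buffering the full response, which keeps first-audio latency low.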

Production Deployment: Audiocodes + Asterisk

Why This Topology?

| Component | Role | Benefits |
|---|---|---|
| Audiocodes SBC | Session Border Controller | NAT/firewall traversal; security (DoS protection, encryption); protocol normalization; topology hiding; media anchoring (optional transcoding) |
| Asterisk PBX | SIP Server | Call routing & IVR logic; user directory & authentication; advanced call features (transfer, hold, conference); CDR/analytics; integration with enterprise phone systems |
| Voice Live Gateway | AI Conversation Endpoint | Real-time AI conversations; natural language understanding; dynamic response generation; multilingual support |

Network Architecture

                    Internet
                       │
                       │ SIP/RTP
                       ▼
         ┌─────────────────────────┐
         │   Audiocodes SBC        │
         │   Public IP: X.X.X.X    │
         │   Ports: 5060, 10000+   │
         └────────────┬────────────┘
                      │ Private Network
         ┌────────────▼────────────┐
         │   Asterisk PBX          │
         │   Internal: 10.0.1.10   │
         │   Port: 5060            │
         └────────────┬────────────┘
                      │
         ┌────────────▼────────────┐
         │  Voice Live Gateway     │
         │  Internal: 10.0.1.20    │
         │  Port: 5060             │
         │  ┌───────────────────┐  │
         │  │ Outbound HTTPS    │  │
         │  │ to Azure          │  │
         │  └─────────┬─────────┘  │
         └────────────┼────────────┘
                      │ Internet (HTTPS/WSS)
                      ▼
         ┌────────────────────────┐
         │  Azure Voice Live API  │
         │  *.cognitiveservices   │
         └────────────────────────┘

 

Audiocodes SBC Configuration

1. SIP Trunk to Asterisk (IP Group)

IP Group Settings:
- Name: Asterisk-Trunk
- Type: Server
- SIP Group Name: asterisk.internal.example.com
- Media Realm: Private
- Proxy Set: Asterisk-ProxySet
- Classification: Classify by Proxy Set
- SBC Operation Mode: SBC-Only
- Topology Location: Internal Network

 

2. Proxy Set for Asterisk

Proxy Set Name: Asterisk-ProxySet
Proxy Address: 10.0.1.10:5060
Transport Type: UDP
Load Balancing Method: Parking Lot

 

3. IP-to-IP Routing Rule

Rule Name: PSTN-to-Gateway
Source IP Group: PSTN-Trunk
Destination IP Group: Asterisk-Trunk
Call Trigger: Any
Destination Prefix Manipulation: None
Message Manipulation: None

 

4. Media Settings

Media Realm: Private
IPv4 Interface: LAN1 (10.0.1.1)
Media Security: None (or SRTP if required)
Codec Preference Order: G711Ulaw, G711Alaw
Transcoding: Disabled (pass-through)
RTP Port Range: 10000-20000

 

Asterisk Configuration

/etc/asterisk/pjsip.conf

;=====================================
; Transport Configuration
;=====================================
[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0:5060
local_net=10.0.0.0/8

;=====================================
; Voice Live Gateway Endpoint
;=====================================
[voicelive-gateway]
type=endpoint
context=voicelive-routing
aors=voicelive-gateway
auth=voicelive-auth
disallow=all
allow=ulaw
allow=alaw
direct_media=no
force_rport=yes
rewrite_contact=yes
rtp_symmetric=yes
ice_support=no

[voicelive-gateway]
type=aor
contact=sip:10.0.1.20:5060
qualify_frequency=30

[voicelive-auth]
type=auth
auth_type=userpass
username=
password=
realm=sip.example.com

;=====================================
; SBC Trunk (for inbound calls)
;=====================================
[audiocodes-sbc]
type=endpoint
context=from-sbc
aors=audiocodes-sbc
disallow=all
allow=ulaw
allow=alaw

[audiocodes-sbc]
type=aor
contact=sip:10.0.1.1:5060

 

/etc/asterisk/extensions.conf

;=====================================
; Incoming calls from SBC
;=====================================
[from-sbc]
; Example: Route calls to 800-AI-VOICE to the gateway
exten => 8002486423,1,NoOp(Routing to Voice Live Gateway)
 same => n,Set(CALLERID(name)=AI Assistant)
 same => n,Dial(PJSIP/voicelive-bot@voicelive-gateway,30)
 same => n,Hangup()

; Default handler for unmatched numbers
exten => _X.,1,NoOp(Unrouted call: ${EXTEN})
 same => n,Playback(invalid)
 same => n,Hangup()

;=====================================
; Voice Live Gateway routing context
;=====================================
[voicelive-routing]
exten => _X.,1,NoOp(Call from gateway: ${CALLERID(num)})
 same => n,Hangup()

 

Gateway Configuration

Create /opt/voicelive-gateway/.env:

#=====================================
# SIP Configuration
#=====================================
SIP_SERVER=asterisk.internal.example.com
SIP_PORT=5060
SIP_USER=voicelive-bot@sip.example.com
AUTH_USER=voicelive-bot
AUTH_REALM=sip.example.com
AUTH_PASSWORD=
REGISTER_WITH_SIP_SERVER=true
DISPLAY_NAME=Voice Live Bot

#=====================================
# Network Configuration
#=====================================
SIP_LOCAL_ADDRESS=0.0.0.0
SIP_VIA_ADDR=10.0.1.20
MEDIA_ADDRESS=10.0.1.20
MEDIA_PORT=10000
MEDIA_PORT_COUNT=1000

#=====================================
# Azure Voice Live
#=====================================
AZURE_VOICELIVE_ENDPOINT=wss://xxxxxx.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY=
VOICE_LIVE_MODEL=gpt-4o
VOICE_LIVE_VOICE=en-US-AvaNeural
VOICE_LIVE_INSTRUCTIONS=You are a helpful customer service assistant for Contoso Inc. Answer questions about account balances, order status, and general inquiries. Be friendly and concise.
VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS=200
VOICE_LIVE_PROACTIVE_GREETING_ENABLED=true
VOICE_LIVE_PROACTIVE_GREETING=Thank you for calling Contoso customer service. How can I help you today?

#=====================================
# Logging
#=====================================
LOG_LEVEL=INFO
VOICE_LIVE_LOG_FILE=/var/log/voicelive-gateway/gateway.log

 

Systemd Service (Linux)

Create /etc/systemd/system/voicelive-gateway.service:

[Unit]
Description=Voice Live SIP Gateway
After=network.target

[Service]
Type=simple
User=voicelive
Group=voicelive
WorkingDirectory=/opt/voicelive-gateway
EnvironmentFile=/opt/voicelive-gateway/.env
ExecStart=/opt/voicelive-gateway/.venv/bin/python3 -m voicelive_sip_gateway.gateway.main
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/log/voicelive-gateway

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable voicelive-gateway
sudo systemctl start voicelive-gateway
sudo systemctl status voicelive-gateway

Configuration Guide

Environment Variables Reference

| Variable | Description | Example | Required |
|---|---|---|---|
| **Azure Voice Live** | | | |
| AZURE_VOICELIVE_ENDPOINT | WebSocket endpoint URL | wss://myresource.cognitiveservices.azure.com/openai/realtime | ✅ |
| AZURE_VOICELIVE_API_KEY | API key (or use AAD) | abc123... | ✅* |
| VOICE_LIVE_MODEL | Model identifier | gpt-4o | ❌ (default: gpt-4o) |
| VOICE_LIVE_VOICE | TTS voice name | en-US-AvaNeural | ❌ |
| VOICE_LIVE_INSTRUCTIONS | System prompt | You are a helpful assistant | ❌ |
| VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS | Max tokens per response | 200 | ❌ |
| VOICE_LIVE_PROACTIVE_GREETING_ENABLED | Enable greeting on connect | true | ❌ |
| VOICE_LIVE_PROACTIVE_GREETING | Greeting message | Hello! How can I help? | ❌ |
| **SIP Configuration** | | | |
| SIP_SERVER | Asterisk/PBX hostname | asterisk.example.com | ✅** |
| SIP_PORT | SIP port | 5060 | ❌ (default: 5060) |
| SIP_USER | SIP user URI | bot@sip.example.com | ✅** |
| AUTH_USER | Auth username | bot | ✅** |
| AUTH_REALM | Auth realm | sip.example.com | ✅** |
| AUTH_PASSWORD | Auth password | securepass | ✅** |
| REGISTER_WITH_SIP_SERVER | Enable registration | true | ❌ (default: false) |
| DISPLAY_NAME | Caller ID name | Voice Live Bot | ❌ |
| **Network Settings** | | | |
| SIP_LOCAL_ADDRESS | Local bind address | 0.0.0.0 | ❌ (default: 127.0.0.1) |
| SIP_VIA_ADDR | Via header IP | 10.0.1.20 | ❌ |
| MEDIA_ADDRESS | RTP bind address | 10.0.1.20 | ❌ |
| MEDIA_PORT | RTP port range start | 10000 | ❌ |
| MEDIA_PORT_COUNT | RTP port range count | 1000 | ❌ |
| **Logging** | | | |
| LOG_LEVEL | Log verbosity | INFO | ❌ (default: INFO) |
| VOICE_LIVE_LOG_FILE | Log file path | logs/gateway.log | ❌ |

*Either AZURE_VOICELIVE_API_KEY or AAD environment variables (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET) are required.

**Required only when REGISTER_WITH_SIP_SERVER=true.

 

Local Testing Setup

For developers wanting to test without SBC/PBX infrastructure:

Prerequisites

  1. Install pjproject with Python bindings:

git clone https://github.com/pjsip/pjproject.git
cd pjproject
./configure CFLAGS="-O2 -DNDEBUG" && make dep && make
cd pjsip-apps/src/swig
make
export PYTHONPATH=$PWD:$PYTHONPATH

  2. Install gateway dependencies:

cd /path/to/azure-voicelive-sip-python
python3 -m venv .venv
source .venv/bin/activate
pip install -e .[dev]

  3. Install PortAudio (for optional speaker playback):

# macOS
brew install portaudio

# Ubuntu/Debian
sudo apt-get install portaudio19-dev

Local Configuration

Create .env:

# No SIP server - direct connection
REGISTER_WITH_SIP_SERVER=false
SIP_LOCAL_ADDRESS=127.0.0.1
SIP_VIA_ADDR=127.0.0.1
MEDIA_ADDRESS=127.0.0.1

# Azure Voice Live
AZURE_VOICELIVE_ENDPOINT=wss://your-resource.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY=your-api-key

# Logging
LOG_LEVEL=DEBUG
VOICE_LIVE_LOG_FILE=logs/gateway.log

 

Run the Gateway

# Load environment variables
set -a && source .env && set +a

# Start gateway
make run

# Or manually:
PYTHONPATH=src python3 -m voicelive_sip_gateway.gateway.main

Expected output:

2025-11-27 10:15:32 [info ] voicelive.connected endpoint=wss://...
2025-11-27 10:15:32 [info ] sip.transport_created port=5060
2025-11-27 10:15:32 [info ] sip.agent_started address=127.0.0.1 port=5060

Connect a SIP Softphone

Download a free SIP client (for example, Zoiper, Linphone, or MicroSIP).

Configure the softphone:

Account Settings:
- Username: test
- Domain: 127.0.0.1
- Port: 5060
- Transport: UDP
- Registration: Disabled (direct connection)

To call: sip:test@127.0.0.1:5060

Expected Call Flow:

  1. Dial sip:test@127.0.0.1:5060
  2. Gateway answers with 200 OK
  3. Hear AI greeting: "Thank you for calling. How can I help you today?"
  4. Speak your question
  5. Hear AI-generated response

Latency Budget

| Component | Typical Latency | Notes |
|---|---|---|
| Network (caller → gateway) | 10-50ms | Depends on ISP, distance |
| SIP/RTP processing | <5ms | pjproject is highly optimized |
| Audio resampling | <2ms | scipy.resample_poly is efficient |
| WebSocket (gateway → Azure) | 20-80ms | Depends on region, network |
| Voice Live processing | 200-500ms | STT + LLM inference + TTS |
| Total round-trip | 250-650ms | Perceived as near real-time |

Optimization Tips:

  1. Deploy gateway in same Azure region as Voice Live resource
  2. Use expedited routing on Audiocodes SBC (disable unnecessary media anchoring)
  3. Minimize SIP hops: Direct SBC → Gateway (skip Asterisk for simple scenarios)
  4. Monitor queue depths: Log warnings if _inbound_queue or _outbound_queue exceed 10 items
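Tip 4 can be implemented with a small helper (hypothetical, not part of the gateway) that checks asyncio.Queue.qsize() against the suggested threshold:

```python
import asyncio

QUEUE_DEPTH_WARN = 10  # threshold suggested in the tip above


def queue_backlogged(queue: asyncio.Queue, name: str, logger=None) -> bool:
    """Return True (and optionally log) when a bridge queue is backing up."""
    depth = queue.qsize()
    if depth > QUEUE_DEPTH_WARN:
        if logger is not None:
            # structlog-style event with contextual fields
            logger.warning("bridge.queue_backlog", queue=name, depth=depth)
        return True
    return False
```

Calling this periodically (or on each enqueue) for _inbound_queue and _outbound_queue turns silent audio drift into an actionable log event.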

Key Log Events

| Event | Meaning | Action |
|---|---|---|
| sip.agent_started | Gateway listening | ✅ Normal |
| sip.incoming_call | New call received | ✅ Normal |
| sip.media_active | Audio bridge established | ✅ Normal |
| voicelive.connected | WebSocket connected | ✅ Normal |
| voicelive.audio_delta | AI audio chunk received | ✅ Normal |
| voicelive.event_error | Voice Live API error | ⚠️ Check API key, quota |
| media.enqueue_failed | Audio queue full | ⚠️ CPU overload or slow network |
| sip.thread_error | pjsua crash | 🔴 Restart gateway |

Common Issues

1. No audio from caller

Symptoms: AI doesn't respond to speech

Diagnosis:

# Check RTP packets arriving
sudo tcpdump -i any -n udp portrange 10000-11000

Solutions:

  • Verify MEDIA_ADDRESS matches gateway's reachable IP
  • Check firewall rules (allow UDP 10000-11000)
  • Ensure Asterisk direct_media=no and rtp_symmetric=yes
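Before digging into firewall rules, one quick sanity check is confirming the gateway host can bind a UDP socket on the configured media address at all (a hypothetical helper, not part of the gateway):

```python
import socket


def can_bind_udp(host: str, port: int) -> bool:
    """Quick local check that an RTP address/port is free and bindable."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.bind((host, port))
        return True
    except OSError:
        return False
```

Run it against MEDIA_ADDRESS and the start of the RTP port range; a False result points at an address mismatch or a port already in use rather than a network problem.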

2. Choppy audio

Symptoms: Robotic voice, dropouts

Diagnosis: Check queue depths in logs

Solutions:

  • Increase CPU allocation
  • Reduce VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS to lower processing time
  • Check network jitter (target <30ms)

3. SIP registration fails

Symptoms: Asterisk shows Unreachable in pjsip show endpoints

Diagnosis:

# On the Asterisk console
pjsip set logger on
# Watch for REGISTER attempts

Solutions:

  • Verify AUTH_USER, AUTH_PASSWORD, AUTH_REALM match Asterisk pjsip.conf
  • Check network connectivity: ping asterisk.example.com
  • Ensure REGISTER_WITH_SIP_SERVER=true

Conclusion

We've built a production-grade bridge between the telephony world and Azure's AI-powered Voice Live service. Key achievements:

✅ Real-time audio transcoding with <5ms overhead
✅ Robust SIP stack using industry-standard pjproject
✅ Asynchronous architecture for high concurrency potential
✅ Enterprise deployment with Audiocodes SBC + Asterisk PBX
✅ Comprehensive observability via structured logging

Next Steps

Enhancements to Consider:

  1. Multi-call support: Refactor to support multiple concurrent calls per instance
  2. DTMF handling: Implement RFC 2833 for interactive voice response (IVR)
  3. Call transfer: Add SIP REFER support for transferring to human agents
  4. Recording: Save conversations to Azure Blob Storage for compliance
  5. Metrics: Expose Prometheus metrics (call duration, error rates, latency percentiles)
  6. Kubernetes deployment: Helm chart for auto-scaling gateway pods
  7. TLS/SRTP: Encrypt SIP signaling and RTP media end-to-end

Resources

 

Get the Code

The complete source code for this project is available on GitHub:

https://github.com/monuminu/azure-voicelive-sip-python

 

Thanks

Manoranjan Rajguru

AI Global Black Belt

https://www.linkedin.com/in/manoranjan-rajguru/

Updated Nov 27, 2025
Version 2.0