Microsoft Foundry Blog

From Zero to Hero: Building a Production-Ready SIP Gateway for Azure Voice Live

mrajguru
Nov 27, 2025

Introduction

Voice technology is transforming how we interact with machines, making conversations with AI feel more natural than ever before. With the public beta release of the Voice Live API, developers now have the tools to create low-latency, multimodal voice experiences in their apps, opening up endless possibilities for innovation.

Gone are the days when building a voice bot required stitching together multiple models for transcription, inference, and text-to-speech conversion. With the Voice Live API, developers can streamline the entire process with a single API call, enabling fluid, natural speech-to-speech conversations. This is a game-changer for industries like customer support, education, and real-time language translation, where fast, seamless interactions are crucial.

In this comprehensive technical blog, we'll explore the architecture, implementation, and production deployment of a Python-based SIP gateway that bridges traditional telephony infrastructure with Azure's cutting-edge Voice Live real-time conversation API. This gateway enables callers from any SIP endpoint—whether a desk phone, mobile softphone, or PSTN connection—to engage in natural, AI-powered voice conversations.

By the end of this article, you'll understand:

  • The architectural design of a production-grade SIP-to-WebSocket bridge
  • Audio transcoding and resampling strategies for seamless media conversion
  • Real-world deployment with both cloud-based telephony and on-premises platforms such as Avaya and Genesys
  • Step-by-step setup instructions for both local testing and enterprise production environments

Architecture Overview

High-Level Design

How Voice Live API Works

Traditionally, building a voice assistant required chaining together several models: an automatic speech recognition (ASR) model like Whisper for transcribing audio, a text-based model for processing responses, and a text-to-speech (TTS) model for generating audio outputs. This multi-step process often led to delays and a loss of emotional nuance.

The Voice Live API revolutionizes this by consolidating these functionalities into a single API call. By establishing a persistent WebSocket connection, developers can stream audio inputs and outputs directly, significantly reducing latency and enhancing the naturalness of conversations. Additionally, the API's function calling capability allows the voice bot to perform actions such as placing orders or retrieving customer information on the fly.

The gateway acts as a bidirectional media proxy between the SIP/RTP world (telephony) and Azure Voice Live's WebSocket-based real-time API. It translates both signaling protocols and audio formats, enabling seamless integration between legacy VoIP infrastructure and modern AI-powered conversational agents.

Key Design Principles

  1. Asynchronous-First Architecture: Built on asyncio for non-blocking I/O, ensuring low latency and high concurrency potential
  2. Separation of Concerns: Modular design with distinct layers for SIP, media, and Voice Live integration
  3. Production-Ready Error Handling: Graceful degradation with silence insertion when audio buffers underrun
  4. Structured Logging: structlog provides machine-parseable, contextual logs for observability
  5. Type Safety: Leverages pydantic for settings validation and mypy for static type checking
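To give a flavor of the pydantic-based settings layer, here is a minimal sketch of what settings.py might look like. The field names mirror the environment variables used later in this post; the actual repository's model may differ:

```python
import os

from pydantic import BaseModel, Field


class SipSettings(BaseModel):
    server: str = ""
    port: int = Field(default=5060, ge=1, le=65535)  # validated at load time
    user: str = ""
    register_with_server: bool = False


class Settings(BaseModel):
    sip: SipSettings = SipSettings()


def load_settings() -> Settings:
    """Build Settings from the environment variables used later in this post."""
    return Settings(
        sip=SipSettings(
            server=os.environ.get("SIP_SERVER", ""),
            port=int(os.environ.get("SIP_PORT", "5060")),
            user=os.environ.get("SIP_USER", ""),
            register_with_server=os.environ.get(
                "REGISTER_WITH_SIP_SERVER", "false"
            ).lower() == "true",
        )
    )
```

Invalid values (say, SIP_PORT=99999) fail fast at startup with a clear validation error instead of surfacing mid-call.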

Core Components

 

Key roles of the Session Border Controller (SBC):

  • Terminate carrier SIP and interwork toward Azure SIP/RTC endpoints.
  • Normalize SIP headers, strip unsupported options, and map codecs.
  • Provide security: TLS signaling, SRTP media, topology hiding, and ACLs.
  • Optional media anchoring for lawful intercept, recording, or QoS smoothing.

Project Structure

src/voicelive_sip_gateway/
├── config/
│   ├── __init__.py
│   └── settings.py          # Pydantic-based configuration
├── gateway/
│   └── main.py              # Application entry point & lifecycle
├── logging/
│   ├── __init__.py
│   └── setup.py             # Structlog configuration
├── media/
│   ├── __init__.py
│   ├── stream_bridge.py     # Bidirectional audio queue manager
│   └── transcode.py         # μ-law ↔ PCM16 + resampling
├── sip/
│   ├── __init__.py
│   ├── agent.py             # pjsua2 wrapper & call handling
│   ├── rtp.py               # RTP utilities
│   └── sdp.py               # SDP parsing/generation
└── voicelive/
    ├── __init__.py
    ├── client.py            # Azure Voice Live SDK wrapper
    └── events.py            # Event type mapping

 

Call Flow

  1. Customer calls your number
  2. Audiocodes SBC receives from PSTN, forwards SIP INVITE to Asterisk
  3. Asterisk routing logic sends INVITE to voicelive-bot@gateway.example.com
  4. Voice Live gateway registers with Asterisk, answers the call
  5. RTP audio flows: Caller ↔ SBC ↔ Asterisk ↔ Gateway ↔ Azure

Audio Pipeline: From μ-law to PCM16

The Challenge

Traditional telephony uses G.711 μ-law codec at 8 kHz sampling rate for bandwidth efficiency. Azure Voice Live expects PCM16 (16-bit linear PCM) at 24 kHz. Our gateway must perform both codec conversion and sample rate conversion in real-time with minimal latency.
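In the gateway, pjsua2 handles the μ-law codec itself, and the production code uses scipy.signal.resample_poly for rate conversion (see the latency table later). To illustrate what a resample_pcm16 helper does, here is a dependency-free linear-interpolation version; it is a sketch, not the shipped implementation:

```python
import array


def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Linear-interpolation resampler for 16-bit mono PCM (illustrative only)."""
    if src_rate == dst_rate or not data:
        return data
    samples = array.array("h")       # signed 16-bit samples
    samples.frombytes(data)
    n_src = len(samples)
    n_dst = int(n_src * dst_rate / src_rate)
    out = array.array("h", bytes(2 * n_dst))
    for i in range(n_dst):
        # Map each output index to a fractional position in the input.
        pos = i * (n_src - 1) / max(n_dst - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, n_src - 1)
        frac = pos - lo
        out[i] = int(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out.tobytes()
```

A 20 ms frame at 8 kHz (160 samples, 320 bytes) becomes 480 samples (960 bytes) at 24 kHz. A polyphase filter like resample_poly adds proper anti-aliasing, which matters for speech quality; linear interpolation is shown here only for clarity.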

Audio Flow Diagram

 

Caller (SIP)                    Gateway                     Azure Voice Live
─────────────                   ───────                     ────────────────
    │                              │                               │
    │ RTP: μ-law 8kHz              │                               │
    ├──────────────────────────────►                               │
    │                              │                               │
    │                         ┌────▼────┐                          │
    │                         │ pjsua2  │ (decodes to PCM16 8kHz)  │
    │                         └────┬────┘                          │
    │                              │                               │
    │                         ┌────▼────────┐                      │
    │                         │ Resample    │ (8kHz → 24kHz)       │
    │                         │ PCM16       │                      │
    │                         └────┬────────┘                      │
    │                              │                               │
    │                              ├───────────────────────────────►
    │                              │   WebSocket: PCM16 24kHz      │
    │                              │                               │
    │                              │◄──────────────────────────────┤
    │                              │   Response: PCM16 24kHz       │
    │                              │                               │
    │                         ┌────▼────────┐                      │
    │                         │ Resample    │ (24kHz → 8kHz)       │
    │                         │ PCM16       │                      │
    │                         └────┬────────┘                      │
    │                              │                               │
    │                         ┌────▼────┐                          │
    │◄────────────────────────┤ pjsua2  │ (encodes to μ-law)       │
    │  RTP: μ-law 8kHz        └─────────┘                          │

 

Let's walk through the key components of the code to make the overall flow clearer.

Audio Stream Bridge: stream_bridge.py

The AudioStreamBridge class orchestrates bidirectional audio flow using asyncio queues:

class AudioStreamBridge:
    """Bidirectional audio pump between SIP (μ-law) and Voice Live (PCM16 24kHz)."""
    
    VOICELIVE_SAMPLE_RATE = 24000
    SIP_SAMPLE_RATE = 8000
    
    def __init__(self, settings: Settings):
        self._inbound_queue: asyncio.Queue[bytes] = asyncio.Queue()   # SIP → Voice Live
        self._outbound_queue: asyncio.Queue[bytes] = asyncio.Queue()  # Voice Live → SIP

Inbound Path (Caller → AI):

async def _flush(self) -> None:
    """Process inbound audio: PCM16 8kHz from SIP → PCM16 24kHz to Voice Live."""
    while True:
        pcm16_8k = await self._inbound_queue.get()
        pcm16_24k = resample_pcm16(pcm16_8k, self.SIP_SAMPLE_RATE, self.VOICELIVE_SAMPLE_RATE)
        if self._voicelive_client:
            await self._voicelive_client.send_audio_chunk(pcm16_24k)

Outbound Path (AI → Caller):

async def emit_audio_to_sip(self, pcm_chunk: bytes) -> None:
    """Resample Voice Live audio down to 8 kHz PCM frames for SIP playback."""
    pcm_8k = resample_pcm16(pcm_chunk, self.VOICELIVE_SAMPLE_RATE, self.SIP_SAMPLE_RATE)
    
    # Split into 20ms frames (160 samples @ 8kHz = 320 bytes)
    frame_size_bytes = 320
    for offset in range(0, len(pcm_8k), frame_size_bytes):
        frame = pcm_8k[offset : offset + frame_size_bytes]
        if frame:
            await self._outbound_queue.put(frame)

Frame Timing: VoIP typically uses 20ms frames (ptime=20). At 8 kHz: 8000 samples/sec × 0.020 sec = 160 samples = 320 bytes (PCM16).
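The frame-timing arithmetic above, spelled out in code:

```python
# 20 ms of 16-bit mono PCM at the SIP sample rate
SIP_SAMPLE_RATE = 8000   # Hz
PTIME_MS = 20            # RTP packetization time (ptime)
BYTES_PER_SAMPLE = 2     # PCM16

samples_per_frame = SIP_SAMPLE_RATE * PTIME_MS // 1000   # 160 samples
frame_size_bytes = samples_per_frame * BYTES_PER_SAMPLE  # 320 bytes
```

This is why emit_audio_to_sip slices the resampled buffer into 320-byte chunks before queueing.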

SIP Signaling with pjsua2

Why pjsua2?

pjproject is the gold standard for open-source SIP/RTP implementation; it powers commercial products and Asterisk's PJSIP channel driver. The pjsua2 API provides:

  • Full SIP stack (INVITE, ACK, BYE, REGISTER, etc.)
  • RTP/RTCP media handling
  • Built-in codec support (G.711, G.722, Opus, etc.)
  • NAT traversal (STUN/TURN/ICE)
  • Thread-safe C++ API with Python bindings

Custom Media Ports

To bridge pjsua2 with our asyncio-based audio queues, we implement custom AudioMediaPort subclasses:

Receiving Audio from Caller

def onFrameReceived(self, frame: pj.MediaFrame) -> None:
    """Called by pjsua when it receives audio from caller (to Voice Live)."""
    if self._direction == "to_voicelive" and self._loop:
        if frame.type == pj.PJMEDIA_FRAME_TYPE_AUDIO and frame.buf:
            try:
                asyncio.run_coroutine_threadsafe(
                    self._bridge.enqueue_sip_audio(bytes(frame.buf)),
                    self._loop
                )
            except Exception as e:
                self._logger.warning("media.enqueue_failed", error=str(e))

Threading Considerations: pjsua2 runs its event loop in a dedicated thread. We use asyncio.run_coroutine_threadsafe() to safely enqueue data into the main asyncio event loop.
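This hand-off pattern can be demonstrated in isolation. In the sketch below, a plain thread stands in for the pjsua2 callback thread and pushes frames into an asyncio.Queue owned by the main event loop:

```python
import asyncio
import threading

received: list[bytes] = []


async def main() -> None:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue[bytes] = asyncio.Queue()

    def pjsua_thread() -> None:
        # Stands in for the pjsua2 media callback thread delivering frames.
        for frame in (b"a", b"b", b"c"):
            future = asyncio.run_coroutine_threadsafe(queue.put(frame), loop)
            future.result()  # block the worker until the loop accepts the frame

    worker = threading.Thread(target=pjsua_thread)
    worker.start()
    for _ in range(3):
        received.append(await queue.get())
    worker.join()


asyncio.run(main())
```

Calling queue.put_nowait() directly from the foreign thread would not be safe; run_coroutine_threadsafe is the supported way to schedule work onto a loop from outside it.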

Sending Audio to Caller

def onFrameRequested(self, frame: pj.MediaFrame) -> None:
    """Called by pjsua when it needs audio to send to caller (from Voice Live)."""
    if self._direction == "from_voicelive" and self._loop:
        try:
            future = asyncio.run_coroutine_threadsafe(
                self._bridge.dequeue_sip_audio_nonblocking(),
                self._loop
            )
            pcm_data = future.result(timeout=0.050)
            
            # Ensure exactly 320 bytes (160 samples @ 8kHz)
            if len(pcm_data) != 320:
                pcm_data = (pcm_data + b'\x00' * 320)[:320]
            
            frame.type = pj.PJMEDIA_FRAME_TYPE_AUDIO
            frame.buf = pj.ByteVector(pcm_data)
            frame.size = len(pcm_data)
        except Exception:
            # Return silence on timeout/error
            frame.type = pj.PJMEDIA_FRAME_TYPE_AUDIO
            frame.buf = pj.ByteVector(b'\x00' * 320)
            frame.size = 320

Graceful Degradation: If the outbound queue is empty (AI hasn't generated audio yet), we inject silence frames to prevent RTP jitter/dropout.

Call State Management

class GatewayCall(pj.Call):
    """Handles SIP call lifecycle and connects media bridge."""
    
    def onCallState(self, prm: pj.OnCallStateParam) -> None:
        ci = self.getInfo()
        self._logger.info("sip.call_state", remote_uri=ci.remoteUri, state=ci.stateText)
        
        if ci.state == pj.PJSIP_INV_STATE_DISCONNECTED:
            self._cleanup()
            self._account.current_call = None
    
    def onCallMediaState(self, prm: pj.OnCallMediaStateParam) -> None:
        ci = self.getInfo()
        for mi in ci.media:
            if mi.type == pj.PJMEDIA_TYPE_AUDIO and mi.status == pj.PJSUA_CALL_MEDIA_ACTIVE:
                media = self.getMedia(mi.index)
                aud_media = pj.AudioMedia.typecastFromMedia(media)
                
                # Create bidirectional media bridge
                self._to_voicelive_port = CustomAudioMediaPort(self._bridge, "to_voicelive", self._logger)
                self._from_voicelive_port = CustomAudioMediaPort(self._bridge, "from_voicelive", self._logger)
                
                # Connect: Caller → to_voicelive_port → Voice Live
                aud_media.startTransmit(self._to_voicelive_port)
                # Connect: Voice Live → from_voicelive_port → Caller
                self._from_voicelive_port.startTransmit(aud_media)
                
                # Start AI conversation with greeting
                asyncio.run_coroutine_threadsafe(
                    self._voicelive_client.request_response(interrupt=False),
                    self._loop
                )

 

Account Registration

For production deployments with Asterisk:

class SipAgent:
    def _run_pjsua_thread(self, loop: asyncio.AbstractEventLoop) -> None:
        self._ep = pj.Endpoint()
        self._ep.libCreate()
        self._ep.libInit(ep_cfg)
        self._ep.transportCreate(pj.PJSIP_TRANSPORT_UDP, transport_cfg)
        self._ep.libStart()
        
        self._account = GatewayAccount(self._logger, self._bridge, self._voicelive_client, loop)
        
        if self._settings.sip.register_with_server and self._settings.sip.server:
            acc_cfg = pj.AccountConfig()
            acc_cfg.idUri = f"sip:{self._settings.sip.user}@{self._settings.sip.server}"
            acc_cfg.regConfig.registrarUri = f"sip:{self._settings.sip.server}"
            
            # Digest authentication credentials
            cred = pj.AuthCredInfo()
            cred.scheme = "digest"
            cred.realm = self._settings.sip.auth_realm or "*"
            cred.username = self._settings.sip.auth_user
            cred.data = self._settings.sip.auth_password
            cred.dataType = pj.PJSIP_CRED_DATA_PLAIN_PASSWD
            acc_cfg.sipConfig.authCreds.append(cred)
            
            self._account.create(acc_cfg)

 

Voice Live Integration

Azure Voice Live Overview

Azure Voice Live is a real-time, bidirectional conversational AI service that combines:

  • GPT-4o Realtime Preview: Ultra-low latency language model optimized for spoken conversations
  • Streaming Speech Recognition: Continuous transcription with word-level timestamps
  • Neural Text-to-Speech: Natural-sounding synthesis with emotional expressiveness
  • Server-side VAD: Voice Activity Detection for turn-taking without explicit prompts

Client Implementation: client.py

class VoiceLiveClient:
    """Manages lifecycle of an Azure Voice Live WebSocket connection."""
    
    async def connect(self) -> None:
        if self._settings.azure.api_key:
            credential = AzureKeyCredential(self._settings.azure.api_key)
        else:
            # Use AAD authentication (Managed Identity or Service Principal)
            self._aad_credential = DefaultAzureCredential()
            credential = await self._aad_credential.__aenter__()
        
        self._connection_cm = connect(
            endpoint=self._settings.azure.endpoint,
            credential=credential,
            model=self._settings.azure.model,
        )
        self._connection = await self._connection_cm.__aenter__()
        
        # Configure session parameters
        session = RequestSession(
            model=self._settings.azure.model,
            modalities=[Modality.TEXT, Modality.AUDIO],
            instructions=self._settings.azure.instructions,
            input_audio_format=InputAudioFormat.PCM16,
            output_audio_format=OutputAudioFormat.PCM16,
            input_audio_transcription=AudioInputTranscriptionOptions(model="azure-speech"),
            turn_detection=ServerVad(
                threshold=0.5,
                prefix_padding_ms=200,
                silence_duration_ms=400
            ),
            voice=AzureStandardVoice(name=self._settings.azure.voice)
        )
        await self._connection.session.update(session=session)

Key Configuration Choices:

  • turn_detection=ServerVad(...): Azure detects when the user stops speaking and automatically triggers AI response generation. No need for wake words or explicit prompts.
  • prefix_padding_ms=200: Include 200ms of audio before speech detection for natural cutoff
  • silence_duration_ms=400: Wait 400ms of silence before considering turn complete

Streaming Response Audio: AI-generated speech arrives as RESPONSE_AUDIO_DELTA events containing base64-encoded PCM16 chunks. We decode and push these through the audio bridge immediately for low-latency playback.
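Here is a self-contained sketch of that decode-and-forward path, using stand-in event tuples and a fake bridge. The real SDK delivers typed ServerEvent objects, so the shapes here are illustrative only:

```python
import asyncio
import base64


class FakeBridge:
    """Stand-in for AudioStreamBridge, collecting decoded PCM chunks."""

    def __init__(self) -> None:
        self.chunks: list[bytes] = []

    async def emit_audio_to_sip(self, pcm: bytes) -> None:
        self.chunks.append(pcm)


async def handle_audio_deltas(events, bridge: FakeBridge) -> None:
    # Decode base64-encoded PCM16 deltas and hand them to the SIP side.
    async for event_type, payload in events:
        if event_type == "response.audio.delta":
            await bridge.emit_audio_to_sip(base64.b64decode(payload))


async def fake_events():
    pcm = b"\x00\x01" * 10  # 20 bytes of fake PCM16
    yield "response.audio.delta", base64.b64encode(pcm).decode()
    yield "response.done", ""


bridge = FakeBridge()
asyncio.run(handle_audio_deltas(fake_events(), bridge))
```

The key property is that each delta is forwarded as soon as it arrives, rather than buffering the full response, which keeps first-audio latency low.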

Production Deployment: Audiocodes + Asterisk

Why This Topology?

| Component | Role | Benefits |
|---|---|---|
| Audiocodes SBC | Session Border Controller | NAT/firewall traversal; security (DoS protection, encryption); protocol normalization; topology hiding; media anchoring (optional transcoding) |
| Asterisk PBX | SIP Server | Call routing & IVR logic; user directory & authentication; advanced call features (transfer, hold, conference); CDR/analytics; integration with enterprise phone systems |
| Voice Live Gateway | AI Conversation Endpoint | Real-time AI conversations; natural language understanding; dynamic response generation; multilingual support |

Network Architecture

                    Internet
                       │
                       │ SIP/RTP
                       ▼
         ┌─────────────────────────┐
         │   Audiocodes SBC        │
         │   Public IP: X.X.X.X    │
         │   Ports: 5060, 10000+   │
         └────────────┬────────────┘
                      │ Private Network
         ┌────────────▼────────────┐
         │   Asterisk PBX          │
         │   Internal: 10.0.1.10   │
         │   Port: 5060            │
         └────────────┬────────────┘
                      │
         ┌────────────▼────────────┐
         │  Voice Live Gateway     │
         │  Internal: 10.0.1.20    │
         │  Port: 5060             │
         │  ┌───────────────────┐  │
         │  │ Outbound HTTPS    │  │
         │  │ to Azure          │  │
         │  └─────────┬─────────┘  │
         └────────────┼────────────┘
                      │ Internet (HTTPS/WSS)
                      ▼
         ┌────────────────────────┐
         │  Azure Voice Live API  │
         │  *.cognitiveservices   │
         └────────────────────────┘

 

Audiocodes SBC Configuration

1. SIP Trunk to Asterisk (IP Group)

IP Group Settings:
- Name: Asterisk-Trunk
- Type: Server
- SIP Group Name: asterisk.internal.example.com
- Media Realm: Private
- Proxy Set: Asterisk-ProxySet
- Classification: Classify by Proxy Set
- SBC Operation Mode: SBC-Only
- Topology Location: Internal Network

 

2. Proxy Set for Asterisk

Proxy Set Name: Asterisk-ProxySet
Proxy Address: 10.0.1.10:5060
Transport Type: UDP
Load Balancing Method: Parking Lot

 

3. IP-to-IP Routing Rule

Rule Name: PSTN-to-Gateway
Source IP Group: PSTN-Trunk
Destination IP Group: Asterisk-Trunk
Call Trigger: Any
Destination Prefix Manipulation: None
Message Manipulation: None

 

4. Media Settings

Media Realm: Private
IPv4 Interface: LAN1 (10.0.1.1)
Media Security: None (or SRTP if required)
Codec Preference Order: G711Ulaw, G711Alaw
Transcoding: Disabled (pass-through)
RTP Port Range: 10000-20000

 

Asterisk Configuration

/etc/asterisk/pjsip.conf

;=====================================
; Transport Configuration
;=====================================
[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0:5060
local_net=10.0.0.0/8

;=====================================
; Voice Live Gateway Endpoint
;=====================================
[voicelive-gateway]
type=endpoint
context=voicelive-routing
aors=voicelive-gateway
auth=voicelive-auth
disallow=all
allow=ulaw
allow=alaw
direct_media=no
force_rport=yes
rewrite_contact=yes
rtp_symmetric=yes
ice_support=no

[voicelive-gateway]
type=aor
contact=sip:10.0.1.20:5060
qualify_frequency=30

[voicelive-auth]
type=auth
auth_type=userpass
username=
password=
realm=sip.example.com

;=====================================
; SBC Trunk (for inbound calls)
;=====================================
[audiocodes-sbc]
type=endpoint
context=from-sbc
aors=audiocodes-sbc
disallow=all
allow=ulaw
allow=alaw

[audiocodes-sbc]
type=aor
contact=sip:10.0.1.1:5060

 

/etc/asterisk/extensions.conf

;=====================================
; Incoming calls from SBC
;=====================================
[from-sbc]
; Example: Route calls to 800-AI-VOICE to the gateway
exten => 8002486423,1,NoOp(Routing to Voice Live Gateway)
 same => n,Set(CALLERID(name)=AI Assistant)
 same => n,Dial(PJSIP/voicelive-bot@voicelive-gateway,30)
 same => n,Hangup()

; Default handler for unmatched numbers
exten => _X.,1,NoOp(Unrouted call: ${EXTEN})
 same => n,Playback(invalid)
 same => n,Hangup()

;=====================================
; Voice Live Gateway routing context
;=====================================
[voicelive-routing]
exten => _X.,1,NoOp(Call from gateway: ${CALLERID(num)})
 same => n,Hangup()

 

Gateway Configuration

Create /opt/voicelive-gateway/.env:

#=====================================
# SIP Configuration
#=====================================
SIP_SERVER=asterisk.internal.example.com
SIP_PORT=5060
SIP_USER=voicelive-bot@sip.example.com
AUTH_USER=voicelive-bot
AUTH_REALM=sip.example.com
AUTH_PASSWORD=
REGISTER_WITH_SIP_SERVER=true
DISPLAY_NAME=Voice Live Bot

#=====================================
# Network Configuration
#=====================================
SIP_LOCAL_ADDRESS=0.0.0.0
SIP_VIA_ADDR=10.0.1.20
MEDIA_ADDRESS=10.0.1.20
MEDIA_PORT=10000
MEDIA_PORT_COUNT=1000

#=====================================
# Azure Voice Live
#=====================================
AZURE_VOICELIVE_ENDPOINT=wss://xxxxxx.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY=
VOICE_LIVE_MODEL=gpt-4o
VOICE_LIVE_VOICE=en-US-AvaNeural
VOICE_LIVE_INSTRUCTIONS=You are a helpful customer service assistant for Contoso Inc. Answer questions about account balances, order status, and general inquiries. Be friendly and concise.
VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS=200
VOICE_LIVE_PROACTIVE_GREETING_ENABLED=true
VOICE_LIVE_PROACTIVE_GREETING=Thank you for calling Contoso customer service. How can I help you today?

#=====================================
# Logging
#=====================================
LOG_LEVEL=INFO
VOICE_LIVE_LOG_FILE=/var/log/voicelive-gateway/gateway.log

 

Systemd Service (Linux)

Create /etc/systemd/system/voicelive-gateway.service:

[Unit]
Description=Voice Live SIP Gateway
After=network.target

[Service]
Type=simple
User=voicelive
Group=voicelive
WorkingDirectory=/opt/voicelive-gateway
EnvironmentFile=/opt/voicelive-gateway/.env
ExecStart=/opt/voicelive-gateway/.venv/bin/python3 -m voicelive_sip_gateway.gateway.main
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/log/voicelive-gateway

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable voicelive-gateway
sudo systemctl start voicelive-gateway
sudo systemctl status voicelive-gateway

Configuration Guide

Environment Variables Reference

| Variable | Description | Example | Required |
|---|---|---|---|
| **Azure Voice Live** | | | |
| AZURE_VOICELIVE_ENDPOINT | WebSocket endpoint URL | wss://myresource.cognitiveservices.azure.com/openai/realtime | ✅ |
| AZURE_VOICELIVE_API_KEY | API key (or use AAD) | abc123... | ✅* |
| VOICE_LIVE_MODEL | Model identifier | gpt-4o | ❌ (default: gpt-4o) |
| VOICE_LIVE_VOICE | TTS voice name | en-US-AvaNeural | ❌ |
| VOICE_LIVE_INSTRUCTIONS | System prompt | You are a helpful assistant | ❌ |
| VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS | Max tokens per response | 200 | ❌ |
| VOICE_LIVE_PROACTIVE_GREETING_ENABLED | Enable greeting on connect | true | ❌ |
| VOICE_LIVE_PROACTIVE_GREETING | Greeting message | Hello! How can I help? | ❌ |
| **SIP Configuration** | | | |
| SIP_SERVER | Asterisk/PBX hostname | asterisk.example.com | ✅** |
| SIP_PORT | SIP port | 5060 | ❌ (default: 5060) |
| SIP_USER | SIP user URI | bot@sip.example.com | ✅** |
| AUTH_USER | Auth username | bot | ✅** |
| AUTH_REALM | Auth realm | sip.example.com | ✅** |
| AUTH_PASSWORD | Auth password | securepass | ✅** |
| REGISTER_WITH_SIP_SERVER | Enable registration | true | ❌ (default: false) |
| DISPLAY_NAME | Caller ID name | Voice Live Bot | ❌ |
| **Network Settings** | | | |
| SIP_LOCAL_ADDRESS | Local bind address | 0.0.0.0 | ❌ (default: 127.0.0.1) |
| SIP_VIA_ADDR | Via header IP | 10.0.1.20 | ❌ |
| MEDIA_ADDRESS | RTP bind address | 10.0.1.20 | ❌ |
| MEDIA_PORT | RTP port range start | 10000 | ❌ |
| MEDIA_PORT_COUNT | RTP port range count | 1000 | ❌ |
| **Logging** | | | |
| LOG_LEVEL | Log verbosity | INFO | ❌ (default: INFO) |
| VOICE_LIVE_LOG_FILE | Log file path | logs/gateway.log | ❌ |

*Either AZURE_VOICELIVE_API_KEY or AAD environment variables (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET) are required.

**Required only when REGISTER_WITH_SIP_SERVER=true.

 

Local Testing Setup

For developers wanting to test without SBC/PBX infrastructure:

Prerequisites

  1. Install pjproject with Python bindings:

git clone https://github.com/pjsip/pjproject.git
cd pjproject
./configure CFLAGS="-O2 -DNDEBUG" && make dep && make
cd pjsip-apps/src/swig
make
export PYTHONPATH=$PWD:$PYTHONPATH

  2. Install gateway dependencies:

cd /path/to/azure-voicelive-sip-python
python3 -m venv .venv
source .venv/bin/activate
pip install -e .[dev]

  3. Install PortAudio (for optional speaker playback):

# macOS
brew install portaudio

# Ubuntu/Debian
sudo apt-get install portaudio19-dev

Local Configuration

Create .env:

# No SIP server - direct connection
REGISTER_WITH_SIP_SERVER=false
SIP_LOCAL_ADDRESS=127.0.0.1
SIP_VIA_ADDR=127.0.0.1
MEDIA_ADDRESS=127.0.0.1

# Azure Voice Live
AZURE_VOICELIVE_ENDPOINT=wss://your-resource.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY=your-api-key

# Logging
LOG_LEVEL=DEBUG
VOICE_LIVE_LOG_FILE=logs/gateway.log

 

Run the Gateway

# Load environment variables
set -a && source .env && set +a

# Start gateway
make run

# Or manually:
PYTHONPATH=src python3 -m voicelive_sip_gateway.gateway.main

Expected output:

2025-11-27 10:15:32 [info ] voicelive.connected endpoint=wss://...
2025-11-27 10:15:32 [info ] sip.transport_created port=5060
2025-11-27 10:15:32 [info ] sip.agent_started address=127.0.0.1 port=5060

Connect a SIP Softphone

Download a free SIP client (for example, Zoiper, Linphone, or MicroSIP).

Configure the softphone:

Account Settings:
- Username: test
- Domain: 127.0.0.1
- Port: 5060
- Transport: UDP
- Registration: Disabled (direct connection)

To call: sip:test@127.0.0.1:5060

Expected Call Flow:

  1. Dial sip:test@127.0.0.1:5060
  2. Gateway answers with 200 OK
  3. Hear AI greeting: "Thank you for calling. How can I help you today?"
  4. Speak your question
  5. Hear AI-generated response

Latency Budget

| Component | Typical Latency | Notes |
|---|---|---|
| Network (caller → gateway) | 10-50ms | Depends on ISP, distance |
| SIP/RTP processing | <5ms | pjproject is highly optimized |
| Audio resampling | <2ms | scipy.resample_poly is efficient |
| WebSocket (gateway → Azure) | 20-80ms | Depends on region, network |
| Voice Live processing | 200-500ms | STT + LLM inference + TTS |
| Total round-trip | 250-650ms | Perceived as near real-time |

Optimization Tips:

  1. Deploy gateway in same Azure region as Voice Live resource
  2. Use expedited routing on Audiocodes SBC (disable unnecessary media anchoring)
  3. Minimize SIP hops: Direct SBC → Gateway (skip Asterisk for simple scenarios)
  4. Monitor queue depths: Log warnings if _inbound_queue or _outbound_queue exceed 10 items
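Tip 4 can be implemented with a small helper (hypothetical, not part of the gateway) that checks asyncio.Queue.qsize() against the suggested threshold:

```python
import asyncio

QUEUE_DEPTH_WARN = 10  # threshold suggested in the tip above


def queue_backlogged(queue: asyncio.Queue, name: str, logger=None) -> bool:
    """Return True (and optionally log) when a bridge queue is backing up."""
    depth = queue.qsize()
    if depth > QUEUE_DEPTH_WARN:
        if logger is not None:
            # structlog-style event with contextual fields
            logger.warning("bridge.queue_backlog", queue=name, depth=depth)
        return True
    return False
```

Calling this periodically (or on each enqueue) for _inbound_queue and _outbound_queue turns silent audio drift into an actionable log event.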

Key Log Events

| Event | Meaning | Action |
|---|---|---|
| sip.agent_started | Gateway listening | ✅ Normal |
| sip.incoming_call | New call received | ✅ Normal |
| sip.media_active | Audio bridge established | ✅ Normal |
| voicelive.connected | WebSocket connected | ✅ Normal |
| voicelive.audio_delta | AI audio chunk received | ✅ Normal |
| voicelive.event_error | Voice Live API error | ⚠️ Check API key, quota |
| media.enqueue_failed | Audio queue full | ⚠️ CPU overload or slow network |
| sip.thread_error | pjsua crash | 🔴 Restart gateway |

Common Issues

1. No audio from caller

Symptoms: AI doesn't respond to speech

Diagnosis:

# Check RTP packets arriving
sudo tcpdump -i any -n udp portrange 10000-11000

Solutions:

  • Verify MEDIA_ADDRESS matches gateway's reachable IP
  • Check firewall rules (allow UDP 10000-11000)
  • Ensure Asterisk direct_media=no and rtp_symmetric=yes
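Before digging into firewall rules, one quick sanity check is confirming the gateway host can bind a UDP socket on the configured media address at all (a hypothetical helper, not part of the gateway):

```python
import socket


def can_bind_udp(host: str, port: int) -> bool:
    """Quick local check that an RTP address/port is free and bindable."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.bind((host, port))
        return True
    except OSError:
        return False
```

Run it against MEDIA_ADDRESS and the start of the RTP port range; a False result points at an address mismatch or a port already in use rather than a network problem.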

2. Choppy audio

Symptoms: Robotic voice, dropouts

Diagnosis: Check queue depths in logs

Solutions:

  • Increase CPU allocation
  • Reduce VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS to lower processing time
  • Check network jitter (target <30ms)

3. SIP registration fails

Symptoms: Asterisk shows Unreachable in pjsip show endpoints

Diagnosis:

# On the Asterisk console
pjsip set logger on
# Watch for REGISTER attempts

Solutions:

  • Verify AUTH_USER, AUTH_PASSWORD, AUTH_REALM match Asterisk pjsip.conf
  • Check network connectivity: ping asterisk.example.com
  • Ensure REGISTER_WITH_SIP_SERVER=true

Conclusion

We've built a production-grade bridge between the telephony world and Azure's AI-powered Voice Live service. Key achievements:

✅ Real-time audio transcoding with <5ms overhead
✅ Robust SIP stack using industry-standard pjproject
✅ Asynchronous architecture for high concurrency potential
✅ Enterprise deployment with Audiocodes SBC + Asterisk PBX
✅ Comprehensive observability via structured logging

Next Steps

Enhancements to Consider:

  1. Multi-call support: Refactor to support multiple concurrent calls per instance
  2. DTMF handling: Implement RFC 2833 for interactive voice response (IVR)
  3. Call transfer: Add SIP REFER support for transferring to human agents
  4. Recording: Save conversations to Azure Blob Storage for compliance
  5. Metrics: Expose Prometheus metrics (call duration, error rates, latency percentiles)
  6. Kubernetes deployment: Helm chart for auto-scaling gateway pods
  7. TLS/SRTP: Encrypt SIP signaling and RTP media end-to-end

Resources

 

Get the Code

The complete source code for this project is available on GitHub:

https://github.com/monuminu/azure-voicelive-sip-python

 

Thanks

Manoranjan Rajguru

AI Global Black Belt

https://www.linkedin.com/in/manoranjan-rajguru/

Updated Nov 27, 2025
Version 2.0