Introduction
Voice technology is transforming how we interact with machines, making conversations with AI feel more natural than ever before. With the public beta release of the Voice Live API, developers now have the tools to create low-latency, multimodal voice experiences in their apps, opening up endless possibilities for innovation.
Gone are the days when building a voice bot required stitching together separate models for transcription, inference, and text-to-speech. With the Voice Live API, developers can streamline the entire pipeline through a single API, enabling fluid, natural speech-to-speech conversations. This is a game-changer for industries like customer support, education, and real-time language translation, where fast, seamless interactions are crucial.
In this comprehensive technical blog, we'll explore the architecture, implementation, and production deployment of a Python-based SIP gateway that bridges traditional telephony infrastructure with Azure's cutting-edge Voice Live real-time conversation API. This gateway enables callers from any SIP endpoint—whether a desk phone, mobile softphone, or PSTN connection—to engage in natural, AI-powered voice conversations.
By the end of this article, you'll understand:
- The architectural design of a production-grade SIP-to-WebSocket bridge
- Audio transcoding and resampling strategies for seamless media conversion
- Real-world deployment options covering both cloud-based telephony and on-premises platforms such as Avaya and Genesys
- Step-by-step setup instructions for both local testing and enterprise production environments
Architecture Overview
High-Level Design
How Voice Live API Works
Traditionally, building a voice assistant required chaining together several models: an automatic speech recognition (ASR) model like Whisper for transcribing audio, a text-based model for processing responses, and a text-to-speech (TTS) model for generating audio outputs. This multi-step process often led to delays and a loss of emotional nuance.
The Voice Live API revolutionizes this by consolidating these functionalities into a single API call. By establishing a persistent WebSocket connection, developers can stream audio inputs and outputs directly, significantly reducing latency and enhancing the naturalness of conversations. Additionally, the API's function calling capability allows the voice bot to perform actions such as placing orders or retrieving customer information on the fly.
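For example, function calling works by advertising a tool schema to the session. The sketch below uses the realtime-style JSON shape; the tool name and fields are purely illustrative (the gateway described in this post does not define any tools):

```python
# Hypothetical tool definition in the realtime-style function-calling schema.
get_order_status_tool = {
    "type": "function",
    "name": "get_order_status",
    "description": "Look up the status of a customer's order by order number.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_number": {"type": "string", "description": "The customer's order number."},
        },
        "required": ["order_number"],
    },
}
# Once advertised in the session configuration, the model can request this function
# mid-conversation and your code returns the result for it to speak back.
```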
The gateway acts as a bidirectional media proxy between the SIP/RTP world (telephony) and Azure Voice Live's WebSocket-based real-time API. It translates both signaling protocols and audio formats, enabling seamless integration between legacy VoIP infrastructure and modern AI-powered conversational agents.
Key Design Principles
- Asynchronous-First Architecture: Built on asyncio for non-blocking I/O, ensuring low latency and high concurrency potential
- Separation of Concerns: Modular design with distinct layers for SIP, media, and Voice Live integration
- Production-Ready Error Handling: Graceful degradation with silence insertion when audio buffers underrun
- Structured Logging: structlog provides machine-parseable, contextual logs for observability (a configuration sketch follows this list)
- Type Safety: Leverages pydantic for settings validation and mypy for static type checking
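As promised above, here is a minimal structlog configuration sketch along the lines of logging/setup.py. It produces the key-value event logs shown later in this post; the project's actual setup module may differ:

```python
import logging
import structlog

def setup_logging(level: str = "INFO") -> None:
    """Configure structlog for contextual, machine-parseable event logging."""
    numeric_level = getattr(logging, level.upper(), logging.INFO)
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,      # carry per-call context (e.g. call ID)
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="%Y-%m-%d %H:%M:%S"),
            structlog.dev.ConsoleRenderer(),              # swap for JSONRenderer() in production
        ],
        wrapper_class=structlog.make_filtering_bound_logger(numeric_level),
    )

# Usage: structlog.get_logger().info("sip.agent_started", address="127.0.0.1", port=5060)
```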
Core Components
Key roles of the SBC:
- Terminate carrier SIP and interwork toward Azure SIP/RTC endpoints.
- Normalize SIP headers, strip unsupported options, and map codecs.
- Provide security: TLS signaling, SRTP media, topology hiding, and ACLs.
- Optional media anchoring for lawful intercept, recording, or QoS smoothing.
Project Structure
src/voicelive_sip_gateway/
├── config/
│   ├── __init__.py
│   └── settings.py        # Pydantic-based configuration
├── gateway/
│   └── main.py            # Application entry point & lifecycle
├── logging/
│   ├── __init__.py
│   └── setup.py           # Structlog configuration
├── media/
│   ├── __init__.py
│   ├── stream_bridge.py   # Bidirectional audio queue manager
│   └── transcode.py       # μ-law ↔ PCM16 + resampling
├── sip/
│   ├── __init__.py
│   ├── agent.py           # pjsua2 wrapper & call handling
│   ├── rtp.py             # RTP utilities
│   └── sdp.py             # SDP parsing/generation
└── voicelive/
    ├── __init__.py
    ├── client.py          # Azure Voice Live SDK wrapper
    └── events.py          # Event type mapping
Call Flow
- Customer calls your number
- Audiocodes SBC receives from PSTN, forwards SIP INVITE to Asterisk
- Asterisk routing logic sends INVITE to voicelive-bot@gateway.example.com
- Voice Live gateway registers with Asterisk, answers the call
- RTP audio flows: Caller ↔ SBC ↔ Asterisk ↔ Gateway ↔ Azure
Audio Pipeline: From μ-law to PCM16
The Challenge
Traditional telephony uses G.711 μ-law codec at 8 kHz sampling rate for bandwidth efficiency. Azure Voice Live expects PCM16 (16-bit linear PCM) at 24 kHz. Our gateway must perform both codec conversion and sample rate conversion in real-time with minimal latency.
Audio Flow Diagram
Inbound (caller → Azure Voice Live):
  RTP: μ-law 8 kHz  →  pjsua2 decodes to PCM16 8 kHz  →  resample to PCM16 24 kHz  →  WebSocket: PCM16 24 kHz

Outbound (Azure Voice Live → caller):
  WebSocket: PCM16 24 kHz  →  resample to PCM16 8 kHz  →  pjsua2 encodes to μ-law  →  RTP: μ-law 8 kHz
Let's walk through some key components of the code to make the overall flow easier to follow.
Audio Stream Bridge: stream_bridge.py
The AudioStreamBridge class orchestrates bidirectional audio flow using asyncio queues:
class AudioStreamBridge:
    """Bidirectional audio pump between SIP (μ-law) and Voice Live (PCM16 24kHz)."""

    VOICELIVE_SAMPLE_RATE = 24000
    SIP_SAMPLE_RATE = 8000

    def __init__(self, settings: Settings):
        self._inbound_queue: asyncio.Queue[bytes] = asyncio.Queue()   # SIP → Voice Live
        self._outbound_queue: asyncio.Queue[bytes] = asyncio.Queue()  # Voice Live → SIP
Inbound Path (Caller → AI):
async def _flush(self) -> None:
    """Process inbound audio: PCM16 8kHz from SIP → PCM16 24kHz to Voice Live."""
    while True:
        pcm16_8k = await self._inbound_queue.get()
        pcm16_24k = resample_pcm16(pcm16_8k, self.SIP_SAMPLE_RATE, self.VOICELIVE_SAMPLE_RATE)
        if self._voicelive_client:
            await self._voicelive_client.send_audio_chunk(pcm16_24k)
Outbound Path (AI → Caller):
async def emit_audio_to_sip(self, pcm_chunk: bytes) -> None:
    """Resample Voice Live audio down to 8 kHz PCM frames for SIP playback."""
    pcm_8k = resample_pcm16(pcm_chunk, self.VOICELIVE_SAMPLE_RATE, self.SIP_SAMPLE_RATE)

    # Split into 20ms frames (160 samples @ 8kHz = 320 bytes)
    frame_size_bytes = 320
    for offset in range(0, len(pcm_8k), frame_size_bytes):
        frame = pcm_8k[offset : offset + frame_size_bytes]
        if frame:
            await self._outbound_queue.put(frame)
Frame Timing: VoIP typically uses 20ms frames (ptime=20). At 8 kHz: 8000 samples/sec × 0.020 sec = 160 samples = 320 bytes (PCM16).
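For reference, the resampling helper used by the bridge could be implemented roughly like this with scipy.signal.resample_poly. This is a sketch only: the actual transcode.py may use a different approach, and the signature is inferred from the calls above.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample 16-bit mono PCM between rates (e.g. 8 kHz ↔ 24 kHz) using polyphase filtering."""
    if not pcm or src_rate == dst_rate:
        return pcm
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32)
    g = gcd(dst_rate, src_rate)
    up, down = dst_rate // g, src_rate // g          # 8 kHz → 24 kHz is up=3, down=1
    resampled = resample_poly(samples, up, down)
    return np.clip(resampled, -32768, 32767).astype(np.int16).tobytes()
```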
SIP Signaling with pjsua2
Why pjsua2?
pjproject is a battle-tested, widely deployed SIP/RTP stack; it powers platforms such as Asterisk. The pjsua2 API provides:
- Full SIP stack (INVITE, ACK, BYE, REGISTER, etc.)
- RTP/RTCP media handling
- Built-in codec support (G.711, G.722, Opus, etc.)
- NAT traversal (STUN/TURN/ICE)
- Thread-safe C++ API with Python bindings
Custom Media Ports
To bridge pjsua2 with our asyncio-based audio queues, we implement custom AudioMediaPort subclasses:
Receiving Audio from Caller
def onFrameReceived(self, frame: pj.MediaFrame) -> None:
    """Called by pjsua when it receives audio from caller (to Voice Live)."""
    if self._direction == "to_voicelive" and self._loop:
        if frame.type == pj.PJMEDIA_FRAME_TYPE_AUDIO and frame.buf:
            try:
                asyncio.run_coroutine_threadsafe(
                    self._bridge.enqueue_sip_audio(bytes(frame.buf)),
                    self._loop
                )
            except Exception as e:
                self._logger.warning("media.enqueue_failed", error=str(e))
Threading Considerations: pjsua2 runs its event loop in a dedicated thread. We use asyncio.run_coroutine_threadsafe() to safely enqueue data into the main asyncio event loop.
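The bridge method being scheduled here can be as simple as the following (a sketch; the real enqueue_sip_audio may add buffering or back-pressure handling):

```python
async def enqueue_sip_audio(self, pcm16_8k: bytes) -> None:
    """Runs on the asyncio loop: queue one frame of caller audio for forwarding to Voice Live."""
    await self._inbound_queue.put(pcm16_8k)
```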
Sending Audio to Caller
def onFrameRequested(self, frame: pj.MediaFrame) -> None:
    """Called by pjsua when it needs audio to send to caller (from Voice Live)."""
    if self._direction == "from_voicelive" and self._loop:
        try:
            future = asyncio.run_coroutine_threadsafe(
                self._bridge.dequeue_sip_audio_nonblocking(),
                self._loop
            )
            pcm_data = future.result(timeout=0.050)

            # Ensure exactly 320 bytes (160 samples @ 8kHz)
            if len(pcm_data) != 320:
                pcm_data = (pcm_data + b'\x00' * 320)[:320]

            frame.type = pj.PJMEDIA_FRAME_TYPE_AUDIO
            frame.buf = pj.ByteVector(pcm_data)
            frame.size = len(pcm_data)
        except Exception:
            # Return silence on timeout/error
            frame.type = pj.PJMEDIA_FRAME_TYPE_AUDIO
            frame.buf = pj.ByteVector(b'\x00' * 320)
            frame.size = 320
Graceful Degradation: If the outbound queue is empty (AI hasn't generated audio yet), we inject silence frames to prevent RTP jitter/dropout.
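A corresponding non-blocking dequeue on the bridge side might look like this (a sketch; the real method may differ):

```python
async def dequeue_sip_audio_nonblocking(self) -> bytes:
    """Return the next 20 ms PCM16 frame for SIP playback, or silence if none is queued."""
    try:
        return self._outbound_queue.get_nowait()
    except asyncio.QueueEmpty:
        return b"\x00" * 320   # 160 samples @ 8 kHz, 16-bit → one 20 ms silence frame
```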
Call State Management
class GatewayCall(pj.Call):
    """Handles SIP call lifecycle and connects media bridge."""

    def onCallState(self, prm: pj.OnCallStateParam) -> None:
        ci = self.getInfo()
        self._logger.info("sip.call_state", remote_uri=ci.remoteUri, state=ci.stateText)
        if ci.state == pj.PJSIP_INV_STATE_DISCONNECTED:
            self._cleanup()
            self._account.current_call = None

    def onCallMediaState(self, prm: pj.OnCallMediaStateParam) -> None:
        ci = self.getInfo()
        for mi in ci.media:
            if mi.type == pj.PJMEDIA_TYPE_AUDIO and mi.status == pj.PJSUA_CALL_MEDIA_ACTIVE:
                media = self.getMedia(mi.index)
                aud_media = pj.AudioMedia.typecastFromMedia(media)

                # Create bidirectional media bridge
                self._to_voicelive_port = CustomAudioMediaPort(self._bridge, "to_voicelive", self._logger)
                self._from_voicelive_port = CustomAudioMediaPort(self._bridge, "from_voicelive", self._logger)

                # Connect: Caller → to_voicelive_port → Voice Live
                aud_media.startTransmit(self._to_voicelive_port)
                # Connect: Voice Live → from_voicelive_port → Caller
                self._from_voicelive_port.startTransmit(aud_media)

                # Start AI conversation with greeting
                asyncio.run_coroutine_threadsafe(
                    self._voicelive_client.request_response(interrupt=False),
                    self._loop
                )
Account Registration
For production deployments with Asterisk:
class SipAgent:
    def _run_pjsua_thread(self, loop: asyncio.AbstractEventLoop) -> None:
        self._ep = pj.Endpoint()
        self._ep.libCreate()

        ep_cfg = pj.EpConfig()                 # default endpoint configuration
        self._ep.libInit(ep_cfg)

        transport_cfg = pj.TransportConfig()   # UDP transport; port is taken from settings in the real code
        self._ep.transportCreate(pj.PJSIP_TRANSPORT_UDP, transport_cfg)
        self._ep.libStart()

        self._account = GatewayAccount(self._logger, self._bridge, self._voicelive_client, loop)

        if self._settings.sip.register_with_server and self._settings.sip.server:
            acc_cfg = pj.AccountConfig()
            acc_cfg.idUri = f"sip:{self._settings.sip.user}@{self._settings.sip.server}"
            acc_cfg.regConfig.registrarUri = f"sip:{self._settings.sip.server}"

            # Digest authentication credentials
            cred = pj.AuthCredInfo()
            cred.scheme = "digest"
            cred.realm = self._settings.sip.auth_realm or "*"
            cred.username = self._settings.sip.auth_user
            cred.data = self._settings.sip.auth_password
            cred.dataType = pj.PJSIP_CRED_DATA_PLAIN_PASSWD
            acc_cfg.sipConfig.authCreds.append(cred)

            self._account.create(acc_cfg)
Voice Live Integration
Azure Voice Live Overview
Azure Voice Live is a real-time, bidirectional conversational AI service that combines:
- GPT-4o Realtime Preview: Ultra-low latency language model optimized for spoken conversations
- Streaming Speech Recognition: Continuous transcription with word-level timestamps
- Neural Text-to-Speech: Natural-sounding synthesis with emotional expressiveness
- Server-side VAD: Voice Activity Detection for turn-taking without explicit prompts
Client Implementation: client.py
class VoiceLiveClient:
    """Manages lifecycle of an Azure Voice Live WebSocket connection."""

    async def connect(self) -> None:
        if self._settings.azure.api_key:
            credential = AzureKeyCredential(self._settings.azure.api_key)
        else:
            # Use AAD authentication (Managed Identity or Service Principal)
            self._aad_credential = DefaultAzureCredential()
            credential = await self._aad_credential.__aenter__()

        self._connection_cm = connect(
            endpoint=self._settings.azure.endpoint,
            credential=credential,
            model=self._settings.azure.model,
        )
        self._connection = await self._connection_cm.__aenter__()

        # Configure session parameters
        session = RequestSession(
            model="gpt-4o",
            modalities=[Modality.TEXT, Modality.AUDIO],
            instructions=self._settings.azure.instructions,
            input_audio_format=InputAudioFormat.PCM16,
            output_audio_format=OutputAudioFormat.PCM16,
            input_audio_transcription=AudioInputTranscriptionOptions(model="azure-speech"),
            turn_detection=ServerVad(
                threshold=0.5,
                prefix_padding_ms=200,
                silence_duration_ms=400
            ),
            voice=AzureStandardVoice(name=self._settings.azure.voice)
        )
        await self._connection.session.update(session=session)
Key Configuration Choices:
- turn_detection=ServerVad(...): Azure detects when the user stops speaking and automatically triggers AI response generation. No need for wake words or explicit prompts.
- prefix_padding_ms=200: Include 200 ms of audio captured just before speech is detected, so the start of each utterance isn't clipped
- silence_duration_ms=400: Wait for 400 ms of silence before considering the turn complete
Streaming Response Audio: AI-generated speech arrives as RESPONSE_AUDIO_DELTA events containing base64-encoded PCM16 chunks. We decode and push these through the audio bridge immediately for low-latency playback.
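A simplified receive loop illustrates the idea. This is a sketch: the event-type strings follow the realtime wire protocol, and the exact attributes exposed by the SDK's event objects (and by the gateway's events.py mapping) may differ from what is shown here.

```python
import base64

async def _receive_events(self) -> None:
    """Consume Voice Live server events and forward audio deltas to the SIP bridge."""
    # Assumes the connection is async-iterable and yields events with `type` / `delta` fields.
    async for event in self._connection:
        if event.type == "response.audio.delta":
            pcm_24k = base64.b64decode(event.delta)        # base64 → raw PCM16 @ 24 kHz
            await self._bridge.emit_audio_to_sip(pcm_24k)  # bridge resamples to 8 kHz and frames it
        elif event.type == "error":
            self._logger.error("voicelive.event_error", error=str(event))
```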
Production Deployment: Audiocodes + Asterisk
Why This Topology?
| Component | Role | Benefits |
|---|---|---|
| Audiocodes SBC | Session Border Controller | NAT/firewall traversal; security (DoS protection, encryption); protocol normalization; topology hiding; media anchoring (optional transcoding) |
| Asterisk PBX | SIP Server | Call routing & IVR logic; user directory & authentication; advanced call features (transfer, hold, conference); CDR/analytics; integration with enterprise phone systems |
| Voice Live Gateway | AI Conversation Endpoint | Real-time AI conversations; natural language understanding; dynamic response generation; multilingual support |
Network Architecture
          Internet
             │
             │ SIP/RTP
             ▼
┌─────────────────────────┐
│     Audiocodes SBC      │
│   Public IP: X.X.X.X    │
│   Ports: 5060, 10000+   │
└────────────┬────────────┘
             │ Private Network
┌────────────▼────────────┐
│      Asterisk PBX       │
│   Internal: 10.0.1.10   │
│       Port: 5060        │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│   Voice Live Gateway    │
│   Internal: 10.0.1.20   │
│       Port: 5060        │
│  ┌───────────────────┐  │
│  │  Outbound HTTPS   │  │
│  │     to Azure      │  │
│  └─────────┬─────────┘  │
└────────────┼────────────┘
             │ Internet (HTTPS/WSS)
             ▼
┌────────────────────────┐
│  Azure Voice Live API  │
│  *.cognitiveservices   │
└────────────────────────┘
Audiocodes SBC Configuration
1. SIP Trunk to Asterisk (IP Group)
IP Group Settings:
- Name: Asterisk-Trunk
- Type: Server
- SIP Group Name: asterisk.internal.example.com
- Media Realm: Private
- Proxy Set: Asterisk-ProxySet
- Classification: Classify by Proxy Set
- SBC Operation Mode: SBC-Only
- Topology Location: Internal Network
2. Proxy Set for Asterisk
Proxy Set Name: Asterisk-ProxySet
Proxy Address: 10.0.1.10:5060
Transport Type: UDP
Load Balancing Method: Parking Lot
3. IP-to-IP Routing Rule
Rule Name: PSTN-to-Gateway
Source IP Group: PSTN-Trunk
Destination IP Group: Asterisk-Trunk
Call Trigger: Any
Destination Prefix Manipulation: None
Message Manipulation: None
4. Media Settings
Media Realm: Private
IPv4 Interface: LAN1 (10.0.1.1)
Media Security: None (or SRTP if required)
Codec Preference Order: G711Ulaw, G711Alaw
Transcoding: Disabled (pass-through)
RTP Port Range: 10000-20000
Asterisk Configuration
/etc/asterisk/pjsip.conf
;=====================================
; Transport Configuration
;=====================================
[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0:5060
local_net=10.0.0.0/8
;=====================================
; Voice Live Gateway Endpoint
;=====================================
[voicelive-gateway]
type=endpoint
context=voicelive-routing
aors=voicelive-gateway
auth=voicelive-auth
disallow=all
allow=ulaw
allow=alaw
direct_media=no
force_rport=yes
rewrite_contact=yes
rtp_symmetric=yes
ice_support=no
[voicelive-gateway]
type=aor
contact=sip:10.0.1.20:5060
qualify_frequency=30
[voicelive-auth]
type=auth
auth_type=userpass
username=
password=
realm=sip.example.com
;=====================================
; SBC Trunk (for inbound calls)
;=====================================
[audiocodes-sbc]
type=endpoint
context=from-sbc
aors=audiocodes-sbc
disallow=all
allow=ulaw
allow=alaw
[audiocodes-sbc]
type=aor
contact=sip:10.0.1.1:5060
/etc/asterisk/extensions.conf
;=====================================
; Incoming calls from SBC
;=====================================
[from-sbc]
; Example: Route calls to 800-AI-VOICE to the gateway
exten => 8002486423,1,NoOp(Routing to Voice Live Gateway)
same => n,Set(CALLERID(name)=AI Assistant)
same => n,Dial(PJSIP/voicelive-bot@voicelive-gateway,30)
same => n,Hangup()
; Default handler for unmatched numbers
exten => _X.,1,NoOp(Unrouted call: ${EXTEN})
same => n,Playback(invalid)
same => n,Hangup()
;=====================================
; Voice Live Gateway routing context
;=====================================
[voicelive-routing]
exten => _X.,1,NoOp(Call from gateway: ${CALLERID(num)})
same => n,Hangup()
Gateway Configuration
Create /opt/voicelive-gateway/.env:
#=====================================
# SIP Configuration
#=====================================
SIP_SERVER=asterisk.internal.example.com
SIP_PORT=5060
SIP_USER=voicelive-bot@sip.example.com
AUTH_USER=voicelive-bot
AUTH_REALM=sip.example.com
AUTH_PASSWORD=
REGISTER_WITH_SIP_SERVER=true
DISPLAY_NAME=Voice Live Bot
#=====================================
# Network Configuration
#=====================================
SIP_LOCAL_ADDRESS=0.0.0.0
SIP_VIA_ADDR=10.0.1.20
MEDIA_ADDRESS=10.0.1.20
MEDIA_PORT=10000
MEDIA_PORT_COUNT=1000
#=====================================
# Azure Voice Live
#=====================================
AZURE_VOICELIVE_ENDPOINT=wss://xxxxxx.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY=
VOICE_LIVE_MODEL=gpt-4o
VOICE_LIVE_VOICE=en-US-AvaNeural
VOICE_LIVE_INSTRUCTIONS=You are a helpful customer service assistant for Contoso Inc. Answer questions about account balances, order status, and general inquiries. Be friendly and concise.
VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS=200
VOICE_LIVE_PROACTIVE_GREETING_ENABLED=true
VOICE_LIVE_PROACTIVE_GREETING=Thank you for calling Contoso customer service. How can I help you today?
#=====================================
# Logging
#=====================================
LOG_LEVEL=INFO
VOICE_LIVE_LOG_FILE=/var/log/voicelive-gateway/gateway.log
Systemd Service (Linux)
Create /etc/systemd/system/voicelive-gateway.service:
[Unit]
Description=Voice Live SIP Gateway
After=network.target
[Service]
Type=simple
User=voicelive
Group=voicelive
WorkingDirectory=/opt/voicelive-gateway
EnvironmentFile=/opt/voicelive-gateway/.env
ExecStart=/opt/voicelive-gateway/.venv/bin/python3 -m voicelive_sip_gateway.gateway.main
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/log/voicelive-gateway
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable voicelive-gateway
sudo systemctl start voicelive-gateway
sudo systemctl status voicelive-gateway
Configuration Guide
Environment Variables Reference
| Variable | Description | Example | Required |
|---|---|---|---|
| Azure Voice Live | |||
| AZURE_VOICELIVE_ENDPOINT | WebSocket endpoint URL | wss://myresource.cognitiveservices.azure.com/openai/realtime | ✅ |
| AZURE_VOICELIVE_API_KEY | API key (or use AAD) | abc123... | ✅* |
| VOICE_LIVE_MODEL | Model identifier | gpt-4o | ❌ (default: gpt-4o) |
| VOICE_LIVE_VOICE | TTS voice name | en-US-AvaNeural | ❌ |
| VOICE_LIVE_INSTRUCTIONS | System prompt | You are a helpful assistant | ❌ |
| VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS | Max tokens per response | 200 | ❌ |
| VOICE_LIVE_PROACTIVE_GREETING_ENABLED | Enable greeting on connect | true | ❌ |
| VOICE_LIVE_PROACTIVE_GREETING | Greeting message | Hello! How can I help? | ❌ |
| SIP Configuration | |||
| SIP_SERVER | Asterisk/PBX hostname | asterisk.example.com | ✅** |
| SIP_PORT | SIP port | 5060 | ❌ (default: 5060) |
| SIP_USER | SIP user URI | bot@sip.example.com | ✅** |
| AUTH_USER | Auth username | bot | ✅** |
| AUTH_REALM | Auth realm | sip.example.com | ✅** |
| AUTH_PASSWORD | Auth password | securepass | ✅** |
| REGISTER_WITH_SIP_SERVER | Enable registration | true | ❌ (default: false) |
| DISPLAY_NAME | Caller ID name | Voice Live Bot | ❌ |
| Network Settings | |||
| SIP_LOCAL_ADDRESS | Local bind address | 0.0.0.0 | ❌ (default: 127.0.0.1) |
| SIP_VIA_ADDR | Via header IP | 10.0.1.20 | ❌ |
| MEDIA_ADDRESS | RTP bind address | 10.0.1.20 | ❌ |
| MEDIA_PORT | RTP port range start | 10000 | ❌ |
| MEDIA_PORT_COUNT | RTP port range count | 1000 | ❌ |
| Logging | |||
| LOG_LEVEL | Log verbosity | INFO | ❌ (default: INFO) |
| VOICE_LIVE_LOG_FILE | Log file path | logs/gateway.log | ❌ |
*Either AZURE_VOICELIVE_API_KEY or AAD environment variables (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET) are required.
**Required only when REGISTER_WITH_SIP_SERVER=true.
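For reference, a flat pydantic-settings model can map these variables onto typed fields. This is a sketch only: the project's config/settings.py appears to group settings into nested objects such as settings.azure and settings.sip, but the loading principle is the same.

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class GatewaySettings(BaseSettings):
    """Typed view over the .env variables documented above (subset shown)."""
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")  # env names are case-insensitive

    # Azure Voice Live
    azure_voicelive_endpoint: str
    azure_voicelive_api_key: str | None = None
    voice_live_model: str = "gpt-4o"
    voice_live_voice: str = "en-US-AvaNeural"

    # SIP
    sip_server: str | None = None
    sip_port: int = 5060
    register_with_sip_server: bool = False

    # Logging
    log_level: str = "INFO"

settings = GatewaySettings()   # e.g. SIP_PORT in .env populates settings.sip_port
```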
Local Testing Setup
For developers wanting to test without SBC/PBX infrastructure:
Prerequisites
- Install pjproject with Python bindings:
git clone https://github.com/pjsip/pjproject.git
cd pjproject
./configure CFLAGS="-O2 -DNDEBUG" && make dep && make
cd pjsip-apps/src/swig
make
export PYTHONPATH=$PWD:$PYTHONPATH
- Install gateway dependencies:
cd /path/to/azure-voicelive-sip-python
python3 -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
- Install PortAudio (for optional speaker playback):
# macOS
brew install portaudio

# Ubuntu/Debian
sudo apt-get install portaudio19-dev
Local Configuration
Create .env:
# No SIP server - direct connection
REGISTER_WITH_SIP_SERVER=false
SIP_LOCAL_ADDRESS=127.0.0.1
SIP_VIA_ADDR=127.0.0.1
MEDIA_ADDRESS=127.0.0.1
# Azure Voice Live
AZURE_VOICELIVE_ENDPOINT=wss://your-resource.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY=your-api-key
# Logging
LOG_LEVEL=DEBUG
VOICE_LIVE_LOG_FILE=logs/gateway.log
Run the Gateway
# Load environment variables
set -a && source .env && set +a

# Start gateway
make run

# Or manually:
PYTHONPATH=src python3 -m voicelive_sip_gateway.gateway.main
Expected output:
2025-11-27 10:15:32 [info     ] voicelive.connected endpoint=wss://...
2025-11-27 10:15:32 [info     ] sip.transport_created port=5060
2025-11-27 10:15:32 [info     ] sip.agent_started address=127.0.0.1 port=5060
Connect a SIP Softphone
Download a free SIP client of your choice.
Configure the softphone:
Account Settings:
- Username: test
- Domain: 127.0.0.1
- Port: 5060
- Transport: UDP
- Registration: Disabled (direct connection)

To call: sip:test@127.0.0.1:5060
Expected Call Flow:
- Dial sip:test@127.0.0.1:5060
- Gateway answers with 200 OK
- Hear AI greeting: "Thank you for calling. How can I help you today?"
- Speak your question
- Hear AI-generated response
Latency Budget
| Component | Typical Latency | Notes |
|---|---|---|
| Network (caller → gateway) | 10-50ms | Depends on ISP, distance |
| SIP/RTP processing | <5ms | pjproject is highly optimized |
| Audio resampling | <2ms | scipy.resample_poly is efficient |
| WebSocket (gateway → Azure) | 20-80ms | Depends on region, network |
| Voice Live processing | 200-500ms | STT + LLM inference + TTS |
| Total round-trip | 250-650ms | Perceived as near real-time |
Optimization Tips:
- Deploy gateway in same Azure region as Voice Live resource
- Use expedited routing on Audiocodes SBC (disable unnecessary media anchoring)
- Minimize SIP hops: Direct SBC → Gateway (skip Asterisk for simple scenarios)
- Monitor queue depths: Log warnings if _inbound_queue or _outbound_queue exceed 10 items (see the sketch below)
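A small watchdog task implements that last tip (a sketch; the threshold, event name, and task wiring are assumptions rather than part of the existing gateway):

```python
import asyncio
import structlog

async def monitor_queue_depth(bridge, threshold: int = 10, interval_s: float = 1.0) -> None:
    """Periodically warn when either audio queue starts backing up."""
    logger = structlog.get_logger()
    while True:
        await asyncio.sleep(interval_s)
        for name, queue in (("inbound", bridge._inbound_queue), ("outbound", bridge._outbound_queue)):
            depth = queue.qsize()
            if depth > threshold:
                logger.warning("media.queue_depth_high", queue=name, depth=depth)

# Started alongside the bridge, e.g.: asyncio.create_task(monitor_queue_depth(bridge))
```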
Key Log Events
| Event | Meaning | Action |
|---|---|---|
| sip.agent_started | Gateway listening | ✅ Normal |
| sip.incoming_call | New call received | ✅ Normal |
| sip.media_active | Audio bridge established | ✅ Normal |
| voicelive.connected | WebSocket connected | ✅ Normal |
| voicelive.audio_delta | AI audio chunk received | ✅ Normal |
| voicelive.event_error | Voice Live API error | ⚠️ Check API key, quota |
| media.enqueue_failed | Audio queue full | ⚠️ CPU overload or slow network |
| sip.thread_error | pjsua crash | 🔴 Restart gateway |
Common Issues
1. No audio from caller
Symptoms: AI doesn't respond to speech
Diagnosis:
# Check RTP packets arriving
sudo tcpdump -i any -n udp portrange 10000-11000
Solutions:
- Verify MEDIA_ADDRESS matches gateway's reachable IP
- Check firewall rules (allow UDP 10000-11000)
- Ensure Asterisk direct_media=no and rtp_symmetric=yes
2. Choppy audio
Symptoms: Robotic voice, dropouts
Diagnosis: Check queue depths in logs
Solutions:
- Increase CPU allocation
- Reduce VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS to lower processing time
- Check network jitter (target <30ms)
3. SIP registration fails
Symptoms: Asterisk shows Unreachable in pjsip show endpoints
Diagnosis:
# On Asterisk
pjsip set logger on
# Watch for REGISTER attempts
Solutions:
- Verify AUTH_USER, AUTH_PASSWORD, AUTH_REALM match Asterisk pjsip.conf
- Check network connectivity: ping asterisk.example.com
- Ensure REGISTER_WITH_SIP_SERVER=true
Conclusion
We've built a production-grade bridge between the telephony world and Azure's AI-powered Voice Live service. Key achievements:
✅ Real-time audio transcoding with <5ms overhead
✅ Robust SIP stack using industry-standard pjproject
✅ Asynchronous architecture for high concurrency potential
✅ Enterprise deployment with Audiocodes SBC + Asterisk PBX
✅ Comprehensive observability via structured logging
Next Steps
Enhancements to Consider:
- Multi-call support: Refactor to support multiple concurrent calls per instance
- DTMF handling: Implement RFC 2833 for interactive voice response (IVR)
- Call transfer: Add SIP REFER support for transferring to human agents
- Recording: Save conversations to Azure Blob Storage for compliance
- Metrics: Expose Prometheus metrics (call duration, error rates, latency percentiles)
- Kubernetes deployment: Helm chart for auto-scaling gateway pods
- TLS/SRTP: Encrypt SIP signaling and RTP media end-to-end
Resources
- Azure Voice Live Documentation: https://learn.microsoft.com/azure/ai-services/speech-service/voice-live
Get the Code
The complete source code for this project is available on GitHub:
https://github.com/monuminu/azure-voicelive-sip-python
Thanks
Manoranjan Rajguru
AI Global Black Belt