Voice technology is transforming how we interact with machines, making conversations with AI feel more natural than ever before. With the public beta release of the Realtime API powered by GPT-4o, developers now have the tools to create low-latency, multimodal voice experiences in their apps, opening endless possibilities for innovation.
For building voice AI solutions, the introduction of GPT-4o-Realtime was game changing: it handles key capabilities like interruption, language switching, and emotion out of the box, with low latency and an optimized architecture.
GPT-4o-Realtime-based voice bots are the simplest to implement because they use a foundational speech model: a model that takes speech directly as input and generates speech as output, with no text as an intermediate step. The architecture is very simple: the speech byte array goes straight to the foundational speech model, which processes it, reasons over it, and responds with speech as a byte array.
Strengths:
- Simplest architecture with no processing hops, making it easier to implement.
- Low latency and high reliability
- Well suited to use cases with complex conversational requirements.
- Switching between languages is very easy
- Captures the user's emotion.
Let's see a simple demo of the capability.
While there are many strengths, there are some weaknesses as well. In this blog, we'll walk you through some of the best practices for using the GPT-4o-Realtime model to overcome these challenges.
OK, so let's get started!
You can also implement a voice bot with a duplex architecture, i.e. an STT >> LLM >> TTS based approach, but the scope of this document is limited to GPT-4o-Realtime. You can learn more about the duplex approach in my blog:
My Journey of Building a Voice Bot from Scratch
7 Best Practices to Address the Top Challenges of Building with GPT-4o-Realtime
1. Reducing Background Noise Sensitivity:
Interruption handling is a key feature of the GPT-4o-Realtime model. GPT-4o-Realtime does this through Voice Activity Detection (VAD): in server_vad mode the server sends the callback event input_audio_buffer.speech_started to indicate that speech has been detected in the audio buffer. This can happen any time audio is added to the buffer (unless speech is already detected). The client can use this event to interrupt audio playback or provide visual feedback to the user. However, due to background noise sensitivity, we sometimes see a lot of unintentional interruptions. Two approaches, described below, work well to handle this.
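For reference, this is roughly how a client consumes that callback to cut off playback. A minimal sketch, assuming server events arrive as dictionaries and a hypothetical stop_playback() helper in your own playback layer:

async def handle_server_event(event):
    # Sketch only: event names follow the Realtime API; stop_playback() is a
    # hypothetical helper in your own playback layer.
    if event["type"] == "input_audio_buffer.speech_started":
        # The caller started speaking: stop the bot's audio so it does not talk over them.
        stop_playback()
    elif event["type"] == "input_audio_buffer.speech_stopped":
        # In server_vad mode the server commits the buffer and responds on its own.
        pass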
A. Optimizing VAD Parameters
These parameters live in the session's turn_detection settings; a configuration sketch follows the list below.
- Prefix Padding (prefix_padding_ms): The amount of time (in milliseconds) added before detected speech to ensure that the beginning of the audio is captured.
- Increasing:
- Advantages: Captures more speech at the beginning of utterances, reducing the risk of clipping initial phonemes and improving overall speech quality.
- Disadvantages: May introduce unnecessary delays in processing, leading to a less responsive system and increased latency.
- Decreasing:
- Advantages: Reduces latency, making the system more responsive and quicker to react to speech.
- Disadvantages: Higher risk of clipping the start of speech, which can result in loss of important information and reduced speech quality.
- Threshold (threshold): The sensitivity level that determines whether the audio signal is classified as speech or silence, typically ranging from 0 to 1.
- Increasing:
- Advantages: Reduces false positives by requiring a stronger signal to classify as speech, which can improve accuracy in noisy environments.
- Disadvantages: May lead to missed detections (false negatives) if the speech signal is weak, resulting in lost segments of speech.
- Decreasing:
- Advantages: Increases sensitivity, allowing softer speech to be detected, which can be beneficial in quiet environments.
- Disadvantages: Higher likelihood of false positives, where background noise may be incorrectly classified as speech, leading to unnecessary processing.
- Silence Duration (silence_duration_ms): The minimum length of silence (in milliseconds) required to consider the audio as non-speech or to trigger a pause in detection.
- Increasing:
- Advantages: Helps to avoid brief pauses being classified as silence, maintaining continuity in detected speech segments.
- Disadvantages: Can lead to longer periods of silence being classified as active speech, potentially causing delays in response or processing.
- Decreasing:
- Advantages: Allows for quicker transitions between speech and silence detection, making the system more dynamic and responsive.
- Disadvantages: May result in frequent interruptions in detected speech during natural pauses, affecting the flow and comprehension of conversations.
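These thresholds are tuned through the session's turn_detection settings. A minimal sketch of such an update over a plain WebSocket client; the numeric values are illustrative, not recommendations:

import json

# Sketch: tuning server-side VAD via a session.update event (values are illustrative).
vad_session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.6,             # raise to make detection stricter in noisy rooms
            "prefix_padding_ms": 300,     # audio retained before the detected speech onset
            "silence_duration_ms": 700,   # silence required before the turn is considered over
        }
    },
}
# await ws.send(json.dumps(vad_session_update))  # send over the Realtime WebSocket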
B. Custom VAD workaround to handle background noise sensitivity
GPT-4o-Realtime provides flexibility in handling voice activity detection (VAD) through configurable settings. By default, server-side VAD is enabled, allowing the system to automatically detect the end of a user's speech and generate responses accordingly. However, you can customize this behavior by disabling server-side VAD and implementing your own client-side VAD or manual controls.
Disabling Server-Side VAD:
To turn off server-side VAD, you can set the turn_detection type to none in your session configuration. This configuration requires the client to manage the flow of the conversation manually. Specifically, the client must:
- Append Audio: Send audio data to the server using the input_audio_buffer.append event.
- Commit Audio: Indicate that the input is complete by sending the input_audio_buffer.commit event.
- Request Response: Initiate the generation of a response by sending the response.create event.
This approach is beneficial for applications that use push-to-talk functionality or have external mechanisms for controlling audio flow, such as a client-side VAD component. When server-side VAD is disabled, these manual controls can be employed to manage the conversation flow effectively. A sketch of this configuration is shown below.
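A minimal sketch of that configuration, assuming the same realtime.send(event, payload) helper used in the snippet that follows:

async def disable_server_vad(realtime):
    # Sketch: disable server-side VAD so the client owns turn-taking.
    await realtime.send("session.update", {
        "session": {
            "turn_detection": None  # no server-side VAD; the client decides when a turn ends
        }
    })
    # The client then drives the conversation explicitly, e.g.:
    # await realtime.send("input_audio_buffer.append", {"audio": base64_audio_chunk})
    # await realtime.send("input_audio_buffer.commit", {})
    # await realtime.send("response.create", {})

The following client-side snippet shows this pattern combined with a custom Silero-based VAD: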
async def append_input_audio(self, array_buffer):
    if len(array_buffer) > 0:
        if self.custom_vad:
            # Run the client-side VAD over 1024-byte chunks before forwarding audio
            for i in range(0, len(array_buffer), 1024):
                chunk = array_buffer[i:i+1024]
                chunk = np.frombuffer(chunk, dtype=np.int16)
                vad_output = self.vad_iterator(torch.from_numpy(int2float(chunk)))
                if vad_output is not None and vad_output == "INTERRUPT_TTS":
                    # Speech onset detected: interrupt any audio playback
                    print("Speech Detected")
                    self.dispatch("conversation.interrupted", None)
                    continue
                if vad_output is not None and len(vad_output) != 0:
                    # End of utterance: forward the buffered speech to the Realtime API
                    print("vad output going to Realtime")
                    array = np.concatenate(vad_output)
                    await self.realtime.send("input_audio_buffer.append", {
                        "audio": array_buffer_to_base64(array),
                    })
                    self.input_audio_buffer.extend(array)
                    await self.create_response()
        else:
            # Server-side VAD enabled: forward the raw audio directly
            await self.realtime.send("input_audio_buffer.append", {
                "audio": array_buffer_to_base64(np.array(array_buffer)),
            })
            self.input_audio_buffer.extend(array_buffer)
    return True
And here is the code for the VAD iterator:
import copy
import torch
import numpy as np


class VADIterator:
    def __init__(
        self,
        model,
        threshold: float = 0.5,
        sampling_rate: int = 16000,
        min_silence_duration_ms: int = 100,
        speech_pad_ms: int = 30,
    ):
        """
        Mainly taken from https://github.com/snakers4/silero-vad
        Class for stream imitation

        Parameters
        ----------
        model: preloaded .jit/.onnx silero VAD model
        threshold: float (default - 0.5)
            Speech threshold. Silero VAD outputs speech probabilities for each audio chunk;
            probabilities ABOVE this value are considered SPEECH. It is better to tune this
            parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
        sampling_rate: int (default - 16000)
            Currently silero VAD models support 8000 and 16000 sample rates
        min_silence_duration_ms: int (default - 100 milliseconds)
            At the end of each speech chunk, wait for min_silence_duration_ms before separating it
        speech_pad_ms: int (default - 30 milliseconds)
            Final speech chunks are padded by speech_pad_ms on each side
        """
        self.model = model
        self.threshold = threshold
        self.sampling_rate = sampling_rate
        self.is_speaking = False
        self.buffer = []
        self.start_pad_buffer = []
        if sampling_rate not in [8000, 16000]:
            raise ValueError(
                "VADIterator does not support sampling rates other than [8000, 16000]"
            )
        self.min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
        self.speech_pad_samples = sampling_rate * speech_pad_ms / 1000
        self.reset_states()

    def reset_states(self):
        self.model.reset_states()
        self.triggered = False
        self.temp_end = 0
        self.current_sample = 0

    @torch.no_grad()
    def __call__(self, x):
        """
        x: torch.Tensor
            audio chunk (see examples in repo)
        return_seconds: bool (default - False)
            whether to return timestamps in seconds (default - samples)
        """
        if not torch.is_tensor(x):
            try:
                x = torch.Tensor(x)
            except Exception:
                raise TypeError("Audio cannot be cast to tensor. Cast it manually")

        window_size_samples = len(x[0]) if x.dim() == 2 else len(x)
        self.current_sample += window_size_samples

        speech_prob = self.model(x, self.sampling_rate).item()

        if (speech_prob >= self.threshold) and self.temp_end:
            self.temp_end = 0
        if (speech_prob >= self.threshold) and not self.triggered:
            # Speech onset: seed the buffer with the padded lead-in and signal an interruption
            self.triggered = True
            self.buffer = copy.deepcopy(self.start_pad_buffer)
            self.buffer.append(x)
            return "INTERRUPT_TTS"
        if (speech_prob < self.threshold - 0.15) and self.triggered:
            if not self.temp_end:
                self.temp_end = self.current_sample
            if self.current_sample - self.temp_end >= self.min_silence_samples:
                # if self.current_sample - self.temp_end > self.speech_pad_samples:
                #     return None
                # else:
                # end of speech: return the buffered utterance
                self.temp_end = 0
                self.triggered = False
                spoken_utterance = self.buffer
                self.buffer = []
                return spoken_utterance
        if self.triggered:
            self.buffer.append(x)
        self.start_pad_buffer.append(x)
        self.start_pad_buffer = self.start_pad_buffer[-int(self.speech_pad_samples // window_size_samples):]
        return None


def int2float(sound):
    """
    Taken from https://github.com/snakers4/silero-vad
    """
    sound = sound.astype("float32")
    sound *= 1 / 32768
    # sound = sound.squeeze()  # depends on the use case
    return sound


def float2int(sound):
    """
    Taken from
    """
    # sound = sound.squeeze()  # depends on the use case
    sound *= 32768
    sound = np.clip(sound, -32768, 32767)
    return sound.astype("int16")
2. Synchronization to reduce Hallucinations
Synchronization is also key when building a voice AI bot: the AI should know exactly how much the user has heard and when it was interrupted. This removes a major source of hallucination we see with GPT-4o-Realtime. In GPT-4o-Realtime, the conversation.item.truncate event allows clients to manually shorten or truncate a message within a conversation. Upon receiving a conversation.item.truncate event, the server processes the truncation and responds with a conversation.item.truncated event. This ensures that both the client and server maintain a synchronized state regarding the conversation's content.
Truncating audio deletes the server-side text transcript, so there is no text in the context that the user hasn't actually heard.
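A minimal sketch of issuing the truncate event at the moment of interruption; the item id, playback counters, and the realtime.send helper are assumed to come from your own session bookkeeping:

async def truncate_interrupted_item(realtime, assistant_item_id, played_samples, sample_rate):
    # Sketch: truncate the assistant's audio item at the point where the user interrupted.
    audio_end_ms = int(played_samples / sample_rate * 1000)  # audio the user actually heard
    await realtime.send("conversation.item.truncate", {
        "item_id": assistant_item_id,  # assistant message currently being played back
        "content_index": 0,            # index of the audio content part within the item
        "audio_end_ms": audio_end_ms,  # drop everything after this point
    })
    # The server replies with conversation.item.truncated, keeping both sides in sync.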
3. Reducing WebSocket Connection Delay to address latency
A WebSocket connection pool acts as a crucial performance optimization when handling high-volume telephony applications by maintaining a pre-established set of WebSocket connections ready for immediate use. Instead of creating a new WebSocket connection with Azure OpenAI GPT-4o-Realtime for each incoming call—which can lead to timeouts during high load due to connection initialization overhead—the pool contains multiple pre-warmed connections. When a user initiates a phone call, the server can immediately allocate an available connection from the pool, eliminating the latency and potential timeout issues associated with establishing a new WebSocket connection. The pool automatically manages these connections, replenishing them as they're used and maintaining a healthy number of available connections based on traffic patterns. This ensures that voice data can flow instantly between the client and the WebSocket server without users experiencing delays or dropped calls due to connection timeouts. Additionally, the pool can implement features like connection health checks and automatic reconnection strategies, further improving the reliability of the voice communication system.
import asyncio
import websockets
import logging
from typing import List, Optional
from collections import deque
import time
from aiohttp import ClientSession, WSMsgType, WSServerHandshakeError, ClientTimeout
import os
class WebSocketPool:
    def __init__(self, pool_size: int = 5, max_retries: int = 3):
        self.pool_size = pool_size
        self.max_retries = max_retries
        self.available_connections: deque = deque()
        self.in_use_connections = set()
        self.lock = asyncio.Lock()
        self.default_url = 'wss://api.openai.com'
        self.url = os.environ["AZURE_OPENAI_ENDPOINT"]
        self._is_azure_openai = self.url is not None
        self.api_key = os.environ.get("AZURE_OPENAI_API_KEY")
        self.api_version = "2024-10-01-preview"
        self.azure_deployment = os.environ["AZURE_OPENAI_DEPLOYMENT"]
        self.logger = logging.getLogger(__name__)

    async def initialize_pool(self):
        """Initialize the connection pool with the specified number of connections."""
        self.logger.info(f"Initializing pool with {self.pool_size} connections")
        tasks = [self._create_connection() for _ in range(self.pool_size)]
        await asyncio.gather(*tasks)

    async def _create_connection(self) -> Optional[websockets.WebSocketClientProtocol]:
        """Create a single WebSocket connection with retry logic."""
        for attempt in range(self.max_retries):
            try:
                self._session = ClientSession(base_url=self.url)
                headers = {"api-key": self.api_key}
                connection = await self._session.ws_connect(
                    "/openai/realtime",
                    headers=headers,
                    params={"api-version": self.api_version, "deployment": self.azure_deployment},
                )
                self.logger.info("Successfully created new WebSocket connection")
                async with self.lock:
                    self.available_connections.append(connection)
                return connection
            except Exception as e:
                self.logger.error(f"Failed to create connection (attempt {attempt + 1}): {str(e)}")
                if attempt == self.max_retries - 1:
                    self.logger.error("Max retries reached for connection creation")
                    return None
                await asyncio.sleep(1 * (attempt + 1))  # Exponential backoff

    async def get_connection(self) -> Optional[websockets.WebSocketClientProtocol]:
        """Get an available connection from the pool."""
        async with self.lock:
            while len(self.available_connections) == 0:
                # If no connections are available, create a new one
                if len(self.in_use_connections) < self.pool_size * 2:  # Allow pool to grow up to 2x
                    connection = await self._create_connection()
                    if connection:
                        break
                await asyncio.sleep(0.1)  # Prevent tight loop
            if not self.available_connections:
                return None
            connection = self.available_connections.popleft()
            self.in_use_connections.add(connection)
            return connection

    async def release_connection(self, connection: websockets.WebSocketClientProtocol):
        """Return a connection to the pool."""
        async with self.lock:
            if connection in self.in_use_connections:
                self.in_use_connections.remove(connection)
                if connection.open:
                    self.available_connections.append(connection)
                else:
                    # Replace closed connection with a new one
                    await self._create_connection()

    async def health_check(self):
        """Periodically check and replace unhealthy connections."""
        while True:
            async with self.lock:
                connections_to_check = list(self.available_connections)
                for conn in connections_to_check:
                    try:
                        pong = await conn.ping()
                        await asyncio.wait_for(pong, timeout=5)
                    except Exception:
                        self.logger.warning("Unhealthy connection detected, replacing...")
                        self.available_connections.remove(conn)
                        await conn.close()
                        await self._create_connection()
            await asyncio.sleep(30)  # Run health check every 30 seconds

    async def close_all(self):
        """Close all connections in the pool."""
        async with self.lock:
            all_connections = list(self.available_connections) + list(self.in_use_connections)
            for connection in all_connections:
                await connection.close()
            self.available_connections.clear()
            self.in_use_connections.clear()


# Example usage
async def handle_voice_call(pool: WebSocketPool, call_id: str):
    connection = await pool.get_connection()
    if not connection:
        raise Exception("Failed to get WebSocket connection from pool")
    try:
        # Handle voice data
        await connection.send(f"Starting call {call_id}")
        response = await connection.recv()
        # Process voice data...
    finally:
        await pool.release_connection(connection)


async def main():
    # Initialize the pool
    pool = WebSocketPool(pool_size=5)
    await pool.initialize_pool()

    # Start health check in background
    health_check_task = asyncio.create_task(pool.health_check())

    # Simulate multiple concurrent calls
    calls = [handle_voice_call(pool, f"call_{i}") for i in range(10)]
    await asyncio.gather(*calls)

    # Cleanup
    health_check_task.cancel()
    await pool.close_all()


if __name__ == "__main__":
    asyncio.run(main())
4. Creating ‘Human-like’ voice | Realistic Voice:
As noted earlier, GPT-4o-Realtime is a foundational speech-to-speech model: speech bytes go in, and speech bytes come out, with no text as an intermediate step. The flip side is that if you want to customize the speech synthesis, there is no fine-tuning option for the built-in voices. Hence we came up with an option where we plug GPT-4o-Realtime into Azure TTS, which offers advanced voice options such as built-in neural voices across a range of Indic languages, and also lets you fine-tune a custom neural voice (CNV).
Custom neural voice (CNV) is a text to speech feature that lets you create a one-of-a-kind, customized, synthetic voice for your applications. With custom neural voice, you can build a highly natural-sounding voice for your brand or characters by providing human speech samples as training data.
Out of the box, text to speech can be used with prebuilt neural voices for each supported language. The prebuilt neural voices work well in most text to speech scenarios if a unique voice isn't required. Custom neural voice is based on the neural text to speech technology and the multilingual, multi-speaker, universal model. You can create synthetic voices that are rich in speaking styles, or adaptable cross languages. The realistic and natural sounding voice of custom neural voice can represent brands, personify machines, and allow users to interact with applications conversationally. See the supported languages for custom neural voice.
# Process the streaming response
print("\nStreaming response:")
collected_messages = []
tts_sentence_end = [".", "!", "?", ";", "。", "!", "?", ";", "\n", "।"]
async for event in connection:
    delta = event.get("delta")
    if event.type == 'response.text.delta':
        chunk_message = delta['transcript']
        collected_messages.append(chunk_message)  # save the message
        if chunk_message in tts_sentence_end:  # sentence end found
            sent_transcript = ''.join(collected_messages).strip()
            collected_messages.clear()
            input, output = tts_client.text_to_speech_streaming_input()

            async def read_output():
                audio = b''
                async for chunk in output:
                    playAudio(chunk)

            async def put_input():
                input.write(sent_transcript)
                input.close()

            await asyncio.gather(read_output(), put_input())
    elif event.type == 'response.text.done':
        print()
You can look at the full code in this repo: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/realtime-api-plus/README.md
5. Handling Number Pronunciation Issues in Regional Languages:
GPT-4o-Realtime often struggles with numbers, specifically in non-English languages: we have seen cases where the model mispronounces a number, so the audio spoken by the model does not match the text coming out in the response.text.delta event. For the financial services industry this mismatch has been a big issue. To solve it, the trick is to spell the numbers out in the prompt, or to use a TTS plugin as described above.
For example, instead of writing a prompt like this:
“
You are a loan seller agent for XYZ company. Below is the context provided to you.
Customer Name: Raj
Approved Loan Amount: 4500
”
Recommended Prompt:
“
You are a loan seller agent for XYZ company. Below is the context provided to you.
Customer Name: Raj
Approved Loan Amount: चार हज़ार पाँच सौ
”
Here is a sample utility class that converts numbers to words (a usage example follows the class). This one is for Hindi, but you can write one for your own language.
class Num2wordshindi:
    low_num_dict = {'1': 'एक', '2': 'दौ', '3': 'तीन', '4': 'चार', '5': 'पाँच',
                    '6': 'छः', '7': 'सात', '8': 'आठ', '9': 'नौ', '0': 'शून्य'}
    mid_num_dict = {'10': 'दस', '11': 'ग्यारह', '12': 'बारह', '13': 'तेरह', '14': 'चौदह', '15': 'पंद्रह', '16': 'सोलह', '17': 'सत्रह', '18': 'अठारह', '19': 'उन्नीस',
                    '20': 'बीस', '21': 'इक्कीस', '22': 'बाईस', '23': 'तेईस', '24': 'चौबीस', '25': 'पच्चीस', '26': 'छब्बीस', '27': 'सत्ताईस', '28': 'अट्ठाईस', '29': 'उनतीस',
                    '30': 'तीस', '31': 'इकतीस', '32': 'बत्तीस', '33': 'तैंतीस', '34': 'चौंतीस', '35': 'पैंतीस', '36': 'छतीस', '37': 'सैंतीस', '38': 'अड़तीस', '39': 'उनतालीस',
                    '40': 'चालीस', '41': 'इकतालीस', '42': 'बयालीस', '43': 'तैंतालीस', '44': 'चवालीस', '45': 'पैंतालीस', '46': 'छियालीस', '47': 'सैंतालीस', '48': 'अड़तालीस', '49': 'उड़ंचास',
                    '50': 'पचास', '51': 'इक्यावन', '52': 'बावन', '53': 'तिरेपन', '54': 'चौवन', '55': 'पचपन', '56': 'छप्पन', '57': 'सत्तावन', '58': 'अट्ठावन', '59': 'उनसठ',
                    '60': 'साठ', '61': 'इकसठ', '62': 'बासठ', '63': 'तिरेसठ', '64': 'चौसठ', '65': 'पैंसठ', '66': 'छियासठ', '67': 'सड़सठ', '68': 'अड़सठ', '69': 'उनहत्तर',
                    '70': 'सत्तर', '71': 'इकहत्तर', '72': 'बहत्तर', '73': 'तिहत्तर', '74': 'चौहत्तर', '75': 'पिचहत्तर', '76': 'छिहत्तर', '77': 'सतत्तर', '78': 'अठहत्तर', '79': 'उनासी',
                    '80': 'अस्सी', '81': 'इक्यासी', '82': 'बियासी', '83': 'तिरासी', '84': 'चौरासी', '85': 'पिचासी', '86': 'छियासी', '87': 'सत्तासी', '88': 'अट्ठासी', '89': 'नवासी',
                    '90': 'नब्बे', '91': 'इक्यानवे', '92': 'बानवे', '93': 'तिरानवे', '94': 'चौरानवे', '95': 'पिचानवे', '96': 'छियानवे', '97': 'सत्तानवे', '98': 'अट्ठानवे', '99': 'निन्यानवे',
                    '100': 'सौ', '00': ' '}

    def __init__(self, number):
        self.nummber_to_change = number

    def change_to_lst(self):
        my_lst = str(self.nummber_to_change).split('.')
        return my_lst

    def lst1str(self, lst):
        if lst == '0':
            return ''
        else:
            return self.low_num_dict.get(lst)

    def lst2str(self, lst):
        if lst == '00':
            return ''
        elif lst[0] == '0':
            return self.low_num_dict.get(lst[1])
        else:
            return self.mid_num_dict.get(lst)

    def lst3str(self, lst):
        if lst == '000':
            return ''
        elif lst[0] == '0':
            return self.lst2str(lst[1:])
        else:
            return f'{self.lst1str(lst[0])} सौ {self.lst2str(lst[1:])}'

    def lst4str(self, lst):
        if lst == '0000':
            return ''
        elif lst[0] == '0':
            return self.lst3str(lst[1:])
        else:
            return f'{self.lst1str(lst[0])} हजार {self.lst3str(lst[1:])}'

    def lst5str(self, lst):
        if lst == '00000':
            return ''
        elif lst[0] == '0':
            return self.lst4str(lst[1:])
        else:
            return f'{self.lst2str(lst[0])} हजार {self.lst3str(lst[1:])}'

    def lst_to_str(self, lst):
        length = len(lst)
        name_list = ['हज़ार', 'हज़ार', 'लाख', 'लाख', 'करोड़', 'करोड़', 'अरब', 'अरब', 'खरब', 'खरब',
                     'नील', 'नील', 'पद्म', 'पद्म', 'शंख', 'शंख', 'महाशंख', 'महाशंख', 'महाउपाध', 'महाउपाध', 'जलद',
                     'जलद', 'माध', 'माध', 'परार्ध', 'परार्ध', 'अंत', 'अंत', 'महा अंत', 'महा अंत', 'शिष्ट', 'शिष्ट', 'सिंघर', 'सिंघर',
                     'महा सिंघर', 'महा सिंघर', 'अदंत सिंघर', 'अदंत सिंघर']
        if lst == '0':
            return self.low_num_dict.get(lst)
        elif length == 1:
            return self.lst1str(lst)
        elif length == 2:
            return self.lst2str(lst)
        elif length == 3:
            return self.lst3str(lst)
        elif 42 > length > 3:  # e.g. 35 23 548
            n = length - 3
            lst2 = lst[:n]
            return_str = ''
            while length > 3:
                if length % 2 == 0:
                    if lst2[0] == '0':
                        length -= 1
                        lst2 = lst2[1:]
                    else:
                        return_str = return_str + f'{self.lst1str(lst2[0])} {name_list[length-3]} '
                        length -= 1
                        lst2 = lst2[1:]
                else:
                    if lst2[0:2] == '00':
                        length -= 2
                        lst2 = lst2[2:]
                    else:
                        return_str = return_str + f'{self.lst2str(lst2[0:2])} {name_list[length-4]} '
                        length -= 2
                        lst2 = lst2[2:]
            return_str = return_str + self.lst3str(lst[n:])
            return return_str
        else:
            return 'Number Too Long must be <= pow(10, 41)'

    def to_currency(self):
        length = len(self.change_to_lst())
        lst1 = self.change_to_lst()[0]
        if length == 1:
            if lst1 == '1':
                return 'एक रुपया'
            else:
                return f'{self.lst_to_str(lst1)} रूपये'
        elif length == 2:
            lst2 = self.change_to_lst()[1]
            if lst1 == '1' and lst2 == '01':
                return 'एक रुपया, एक पैसा'
            elif lst2 == '01':
                return f'{self.lst_to_str(lst1)} रूपये, एक पैसा'
            elif lst2 == '00':
                return f'{self.lst_to_str(lst1)} रूपये, शून्य पैसे'
            elif len(lst2) == 1:
                lst2 = lst2 + '0'
                return f'{self.lst_to_str(lst1)} रूपये, {self.lst_to_str(lst2)} पैसे'
            else:
                return f'{self.lst_to_str(lst1)} रूपये, {self.lst_to_str(lst2)} पैसे'

    def to_words(self):
        length = len(self.change_to_lst())
        lst1 = self.change_to_lst()[0]
        if length == 1:
            return self.lst_to_str(lst1)
        elif length == 2:
            lst2 = self.change_to_lst()[1]
            return f'{self.lst_to_str(lst1)} दशमलव {self.lst_to_str(lst2)}'
        elif length == 3:
            lst2 = self.change_to_lst()[1]
            lst3 = self.change_to_lst()[2]
            return f'{self.lst_to_str(lst1)} दशमलव {self.lst_to_str(lst2)} दशमलव {self.lst_to_str(lst3)}'
        else:
            return None
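For example, the loan amount from the recommended prompt above can be pre-processed before it is injected into the system prompt:

# Example usage: convert the numeric amount to Hindi words before prompting.
amount_in_words = Num2wordshindi(4500).to_words()  # -> "चार हज़ार पाँच सौ"
prompt = (
    "You are a loan seller agent for XYZ company. Below is the context provided to you.\n"
    "Customer Name: Raj\n"
    f"Approved Loan Amount: {amount_in_words}"
)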
6. Reduce middleware between model and telephony to optimize performance
GPT-4o Realtime API is designed to handle real-time, low-latency conversational interactions, making it suitable for applications like customer support agents, voice assistants, and real-time translators. To ensure compatibility and optimal performance, the API supports specific audio formats and sample rates.
Supported Audio Formats and Sample Rates:
- PCM 16-bit: This is a raw audio format that provides uncompressed audio data, ensuring high-quality sound.
- G.711 a-law: A commonly used audio compression format in telephony systems, which balances quality and bandwidth efficiency.
- G.711 u-law: A commonly used audio compression format in telephony systems, which balances quality and bandwidth efficiency.
Unlike other realtime models, GPT-4o-Realtime supports an 8 kHz sample rate. It is important not to place middleware between the SIP telephony layer and GPT-4o-Realtime; send the audio directly in G.711 a-law / G.711 u-law format.
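The audio format is declared once in the session configuration. A minimal sketch, assuming u-law audio from the SIP trunk and the same realtime.send helper used earlier:

async def configure_telephony_audio(realtime):
    # Sketch: accept and emit G.711 u-law directly, so no transcoding middleware is
    # needed between SIP telephony and the model.
    await realtime.send("session.update", {
        "session": {
            "input_audio_format": "g711_ulaw",   # audio arriving from the phone call
            "output_audio_format": "g711_ulaw",  # audio returned to the caller
        }
    })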
7. Instruction Following Issue/ Prompt Best Practices:
One of the primary challenges users face when working with GPT-4o-Realtime, compared to previous OpenAI models such as GPT-4 and GPT-4o-mini, is the distinct way prompts need to be structured. This difference in prompt engineering stems from several factors related to the model's architecture, capabilities, and intended use cases. GPT-4o-Realtime has been noted to not follow instructions as effectively as its predecessors. Prompts for GPT-4o-Realtime require a higher degree of specificity: while previous models could sometimes infer intent from vague or broadly framed prompts, GPT-4o-Realtime produces more relevant outputs when given clear, concise instructions.
Here is the prompt that works well for GPT-4o-Realtime.
You should define five key modules within the prompt:
- Personality and Tone
- Context
- Reference Pronunciations
- Overall Instruction
- Conversation States
Context
In this module we give the model all the background context, such as customer name, business working hours, company name, etc.
# Context
- Business name: Snowy Peak Boards
- Hours: Monday to Friday, 8:00 AM - 6:00 PM; Saturday, 9:00 AM - 1:00 PM; Closed on Sundays
- Locations (for returns and service centers):
- 123 Alpine Avenue, Queenstown 9300, New Zealand
- 456 Glacier Road, Wanaka 9305, New Zealand
- Products & Services:
- Wide variety of snowboards for all skill levels
- Snowboard accessories and gear (boots, bindings, helmets, goggles)
- Online fitting consultations
- Loyalty program offering discounts and early access to new product lines
Personality and Tone
# Personality and Tone
## Identity
You are a knowledgeable and patient tech support specialist with a background in computer engineering and a passion for helping people solve complex technological challenges. Your experience spans over a decade of working with various tech ecosystems, from consumer electronics to enterprise solutions. You've seen technologies evolve and have a deep understanding of both hardware and software intricacies.
## Task
Your primary goal is to guide customers through technical issues, providing clear, step-by-step solutions while ensuring they feel supported and understood. You aim to demystify technology, making complex problems seem manageable and less intimidating.
## Demeanor
You maintain a calm, methodical approach to problem-solving. Your demeanor is professional yet approachable, similar to a trusted mentor who can break down complex technical concepts into digestible information. You're genuinely invested in helping customers succeed, not just in solving their immediate problem.
## Tone
Your voice is steady and reassuring, with a hint of technical precision. You speak with confidence but never condescension. When explaining technical concepts, you use analogies that make sense to people without a technical background, helping them understand without feeling overwhelmed.
## Level of Enthusiasm
Your enthusiasm is intellectual and measured. You get excited about solving problems and discovering innovative solutions, but your excitement manifests as a calm, focused energy rather than high-pitched excitement. Think of a detective who's genuinely thrilled about cracking a complex case.
## Level of Formality
Your communication style is professionally conversational. You use technical terminology when necessary but always explain it in layman's terms. It's like having a conversation with a highly skilled colleague who happens to be great at explaining things.
## Level of Emotion
You are empathetic and understanding. When customers are frustrated, you acknowledge their feelings and focus on finding a solution. Your emotional support is practical—you validate their experience while simultaneously working towards resolving their issue.
## Filler Words
Occasionally, you use filler words like "hmm," "let's see," or "interesting" to show you're actively processing information. These words help humanize your technical expertise and make the interaction feel more natural.
## Pacing
Your pacing is deliberate and measured. You speak at a speed that allows for comprehension, pausing after explaining complex steps to ensure the customer is following along. When explaining technical processes, you break them down into clear, digestible segments.
## Other Details
You always have a backup plan or alternative approach. If one solution doesn't work, you're quick to suggest another method. You're also prone to sharing quick, interesting tech tips that might help the customer in the future, showing that your support goes beyond just fixing the immediate issue.
## Communication Nuances
- Use technical accuracy balanced with accessibility
- Demonstrate patience with users of all technical skill levels
- Provide context for why certain troubleshooting steps are necessary
- Always offer a clear path forward, even if the solution isn't immediate
- Maintain a problem-solving mindset that feels collaborative
Reference Pronunciations:
This is a key prompting technique for pronouncing specific words. If set up properly, it can make a clear difference in pronunciation.
# Reference Pronunciations
- “Snowy Peak Boards”: SNOW-ee Peek Bords
- “Schedule”: SHED-yool
- “Noah”: NOW-uh
Overall Instructions: Here you add the overall instructions for the model.
# Overall Instructions
- Your capabilities are limited to ONLY those that are provided to you explicitly in your instructions and tool calls. You should NEVER claim abilities not granted here.
- Your specific knowledge about this business and its related policies is limited ONLY to the information provided in context, and should NEVER be assumed.
- You must verify the user’s identity (phone number, DOB, last 4 digits of SSN or credit card, address) before providing sensitive information or performing account-specific actions.
- Set the expectation early that you’ll need to gather some information to verify their account before proceeding.
- Don't say "I'll repeat it back to you to confirm" beforehand, just do it.
- Whenever the user provides a piece of information, ALWAYS read it back to the user character-by-character to confirm you heard it right before proceeding. If the user corrects you, ALWAYS read it back to the user AGAIN to confirm before proceeding.
- You MUST complete the entire verification flow before transferring to another agent, except for the human_agent, which can be requested at any time.
Conversation States: Voice conversations for an outbound call are typically flow based, so you usually want the conversation to follow a specific flow. In that case, put the flow in the prompt. Here is a sample of how you can do that; a sketch of wiring all five modules into the session instructions follows the example.
[
{
"id": "1_greeting",
"description": "Initial contact and warm welcome for TechNest Smart Home support",
"instructions": [
"Use the company name 'TechNest Smart Home Support'",
"Provide a friendly initial greeting",
"Mention available support channels"
],
"examples": [
"Welcome to TechNest Smart Home Support! I'm here to help you resolve any issues with your smart home devices. How can I assist you today?"
],
"transitions": [{
"next_step": "2_device_identification",
"condition": "Once initial greeting is complete"
}]
},
{
"id": "2_device_identification",
"description": "Identify the specific smart home device experiencing issues",
"instructions": [
"Ask the user to specify which TechNest device is having problems",
"Request model number and serial number",
"Confirm device details"
],
"examples": [
"Could you tell me which TechNest device you're having trouble with? If possible, please provide the model number and serial number located on the device."
],
"transitions": [{
"next_step": "3_problem_description",
"condition": "Device details are confirmed"
}]
},
{
"id": "3_problem_description",
"description": "Gather detailed information about the device issue",
"instructions": [
"Ask for a comprehensive description of the problem",
"Request specific error messages or behaviors",
"Clarify any ambiguous details"
],
"examples": [
"Can you describe the specific issue you're experiencing with your device? Please include any error messages, unusual behaviors, or specific symptoms."
],
"transitions": [{
"next_step": "4_troubleshooting_steps",
"condition": "Detailed problem description is obtained"
}]
},
{
"id": "4_troubleshooting_steps",
"description": "Provide initial troubleshooting guidance",
"instructions": [
"Offer a series of standard troubleshooting steps",
"Ask the user to attempt these steps",
"Request feedback after each step"
],
"examples": [
"I'm going to guide you through some standard troubleshooting steps. Let's start by:",
"1. Unplugging the device for 30 seconds and plugging it back in",
"2. Checking your home Wi-Fi connection",
"3. Verifying the device's firmware is up to date"
],
"transitions": [{
"next_step": "5_advanced_support",
"condition": "Initial troubleshooting steps are completed"
}]
},
{
"id": "5_advanced_support",
"description": "Escalate to advanced support if initial steps fail",
"instructions": [
"Determine if issue requires advanced technical support",
"Collect additional diagnostic information",
"Prepare for potential device replacement or repair"
],
"examples": [
"I understand the initial troubleshooting steps didn't resolve your issue. Let's collect some additional diagnostic information to determine the next best course of action."
],
"transitions": [{
"next_step": "6_warranty_check",
"condition": "Advanced support assessment is complete"
}]
},
{
"id": "6_warranty_check",
"description": "Verify device warranty status",
"instructions": [
"Request purchase date or serial number",
"Check warranty coverage",
"Explain repair or replacement options"
],
"examples": [
"Could you provide me with the purchase date of your device? This will help me determine your warranty coverage."
],
"transitions": [{
"next_step": "7_support_resolution",
"condition": "Warranty status is confirmed"
}]
},
{
"id": "7_support_resolution",
"description": "Finalize support interaction and offer additional assistance",
"instructions": [
"Summarize the support interaction",
"Provide next steps",
"Offer additional support resources"
],
"examples": [
"Based on our conversation, here's what we'll do next...",
"Would you like me to email you a detailed support summary?"
],
"transitions": [{
"next_step": "8_customer_satisfaction",
"condition": "Support resolution is communicated"
}]
},
{
"id": "8_customer_satisfaction",
"description": "Collect customer feedback and satisfaction rating",
"instructions": [
"Request customer satisfaction rating",
"Invite feedback on support experience",
"Thank the customer"
],
"examples": [
"On a scale of 1-5, how would you rate your support experience today?",
"We're always looking to improve our service. Do you have any additional feedback?"
],
"transitions": [{
"next_step": "end",
"condition": "Feedback is collected"
}]
}
]
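One way to wire these five modules together is to concatenate them into the session instructions at startup. A minimal sketch; the helper and variable names are illustrative, not part of any SDK:

import json

# Sketch: assemble the five prompt modules into a single instruction string.
def build_instructions(personality_and_tone, context, pronunciations, overall, conversation_states):
    return "\n\n".join([
        personality_and_tone,
        context,
        pronunciations,
        overall,
        "# Conversation States\n" + json.dumps(conversation_states, indent=2, ensure_ascii=False),
    ])

# instructions = build_instructions(personality_md, context_md, pronunciations_md,
#                                   overall_md, conversation_states)
# await realtime.send("session.update", {"session": {"instructions": instructions}})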
The blog on GPT-4o-Realtime Best Practices provides an overview of the strengths and weaknesses of using the GPT-4o-Realtime model for voice bots. It highlights the simplicity of the architecture, low latency, and high reliability, making it suitable for complex conversational requirements. The document also discusses issues such as background noise sensitivity, interruption handling, and number pronunciation, and offers best practices for overcoming these challenges. Additionally, it covers the importance of synchronization, the use of custom neural voices, and the optimal handling of audio formats and sample rates for telephony applications.
All opinions are personal. I hope you liked this blog. If you did, please follow me on LinkedIn.
Thanks
Manoranjan Rajguru
AI Global Belt Asia
https://www.linkedin.com/in/manoranjan-rajguru/