Today at Microsoft Ignite, we are excited to announce the upcoming preview of bidirectional audio streaming for the Azure Communication Services Call Automation SDK, which unlocks new possibilities for developers and businesses. When integrated with services like Azure OpenAI and its real-time voice APIs, this capability enables seamless, low-latency, real-time communication, significantly enhancing how businesses build and deploy conversational AI solutions.
With the advent of new AI technologies, companies are developing solutions to reduce customer wait times and improve the overall customer experience. To achieve this, many businesses are turning to AI-powered agents. These AI-based agents must be capable of having conversations with customers in a human-like manner while maintaining very low latencies to ensure smooth interactions. This is especially critical in the voice channel, where any delay can significantly impact the fluidity and natural feel of the conversation.
With bidirectional streaming, businesses can now elevate their voice solutions to low-latency, human-like, interactive conversational AI agents. Our bidirectional streaming APIs enable developers to stream audio from an ongoing Azure Communication Services call to their web server in real time.
On the server, powerful language models interpret the caller's query and stream responses back to the caller, all while maintaining latencies low enough that the caller feels like they are speaking to a human. One example is taking the audio streams, processing them through Azure OpenAI's real-time voice API, and streaming the responses back into the call.
With the integration of bidirectional streaming into Azure Communication Services Call Automation SDK, developers have new tools to innovate:
- Build conversational AI solutions: Develop sophisticated customer support virtual agents that interact with customers in real time, providing immediate responses and solutions.
- Personalize customer experiences: By harnessing real-time data, businesses can offer more dynamic and tailored interactions, leading to increased satisfaction and loyalty.
- Reduce customer wait times: By combining bidirectional audio streams with Large Language Models (LLMs), you can build virtual agents that serve as the first point of contact, reducing the need for customers to wait for a human agent to become available.
Integrating with real-time voice-based Large Language Models (LLMs)
With the advancements in voice-based LLMs, developers want to take advantage of services like bidirectional streaming and send audio directly between the caller and the LLM. Today we'll show you how to start audio streaming through Azure Communication Services.
Developers can start bidirectional streaming at the time of answering the call by providing the WebSocket URL.
//Answer the call with bidirectional streaming enabled
var websocketUri = appBaseUrl.Replace("https", "wss") + "/ws";
var options = new AnswerCallOptions(incomingCallContext, callbackUri)
{
    MediaStreamingOptions = new MediaStreamingOptions(
        transportUri: new Uri(websocketUri),
        contentType: MediaStreamingContent.Audio,
        audioChannelType: MediaStreamingAudioChannel.Mixed,
        startMediaStreaming: true)
    {
        EnableBidirectional = true,
        AudioFormat = AudioFormat.Pcm24KMono
    }
};
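Setting EnableBidirectional to true is what allows your service to stream audio back into the call, while AudioFormat.Pcm24KMono selects the 24 kHz PCM format discussed later in this post, which avoids resampling when working with voice-based LLMs.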
At the same time, you should open your connection with the Azure OpenAI real-time voice API. Once the WebSocket connection is set up, Azure Communication Services starts streaming audio to your web server, and from there you can relay the audio to the Azure OpenAI voice API and vice versa. Once the LLM reasons over the content of the audio, it streams audio back to your service, which you can then stream into the Azure Communication Services call. (More information about how to set this up will be made available after Ignite.)
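As a rough sketch of the model side, the connection can be opened with a standard ClientWebSocket. The endpoint shape, api-version value, and api-key header below are assumptions for illustration; verify them against the Azure OpenAI real-time API documentation for your deployment.
// Sketch: connect to the Azure OpenAI real-time voice API over WebSocket.
// The endpoint format, api-version, and "api-key" header are assumptions
// to verify against the current Azure OpenAI documentation.
private async Task<ClientWebSocket> ConnectToAzureOpenAIAsync(
    string resourceName, string deploymentName, string apiKey, CancellationToken token)
{
    var aiWebSocket = new ClientWebSocket();
    aiWebSocket.Options.SetRequestHeader("api-key", apiKey);
    var uri = new Uri($"wss://{resourceName}.openai.azure.com/openai/realtime" +
        $"?api-version=2024-10-01-preview&deployment={deploymentName}");
    await aiWebSocket.ConnectAsync(uri, token);
    return aiWebSocket;
}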
//Receive streaming data from Azure Communication Services over the WebSocket
private async Task StartReceivingFromAcsMediaWebSocket()
{
    if (m_webSocket == null) return;
    try
    {
        while (m_webSocket.State == WebSocketState.Open)
        {
            byte[] receiveBuffer = new byte[2048];
            WebSocketReceiveResult receiveResult = await m_webSocket.ReceiveAsync(
                new ArraySegment<byte>(receiveBuffer), m_cts.Token);
            // Stop reading once the remote side closes the connection
            if (receiveResult.MessageType == WebSocketMessageType.Close) break;
            // Decode only the bytes actually received in this frame
            var data = Encoding.UTF8.GetString(receiveBuffer, 0, receiveResult.Count);
            if (StreamingData.Parse(data) is AudioData audioData)
            {
                using var ms = new MemoryStream(audioData.Data);
                await m_aiServiceHandler.SendAudioToExternalAI(ms);
            }
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Exception -> {ex}");
    }
}
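The m_aiServiceHandler.SendAudioToExternalAI call above belongs to your own service layer, not the Call Automation SDK. A minimal sketch, assuming the AI WebSocket opened earlier and the real-time voice API's input_audio_buffer.append event shape, could base64-encode the PCM chunk and forward it to the model:
// Hypothetical helper: forward a PCM chunk from the call to the AI WebSocket.
// The "input_audio_buffer.append" event name and fields are assumptions to
// verify against the current real-time voice API reference.
public async Task SendAudioToExternalAI(MemoryStream audioStream)
{
    if (m_aiWebSocket?.State != WebSocketState.Open) return;
    var payload = new
    {
        type = "input_audio_buffer.append",
        audio = Convert.ToBase64String(audioStream.ToArray())
    };
    string json = System.Text.Json.JsonSerializer.Serialize(payload);
    await m_aiWebSocket.SendAsync(new ArraySegment<byte>(Encoding.UTF8.GetBytes(json)),
        WebSocketMessageType.Text, endOfMessage: true, CancellationToken.None);
}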
Streaming audio data back into Azure Communication Services
//Create and serialize streaming data
private void ConvertToAcsAudioPacketAndForward(byte[] audioData)
{
    var audio = new OutStreamingData(MediaKind.AudioData)
    {
        AudioData = new AudioData(audioData)
    };
    // Serialize the streaming data object to a JSON string
    string jsonString = System.Text.Json.JsonSerializer.Serialize<OutStreamingData>(audio);
    // Queue the async send operation for later execution
    try
    {
        m_channel.Writer.TryWrite(async () => await m_mediaStreaming.SendMessageAsync(jsonString));
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Exception received on ConvertToAcsAudioPacketAndForward {ex}");
    }
}
//Send the serialized data over the WebSocket to Azure Communication Services
public async Task SendMessageAsync(string message)
{
    if (m_webSocket?.State == WebSocketState.Open)
    {
        byte[] jsonBytes = Encoding.UTF8.GetBytes(message);
        // Send the JSON-encoded audio packet over the WebSocket as a text frame
        await m_webSocket.SendAsync(new ArraySegment<byte>(jsonBytes),
            WebSocketMessageType.Text, endOfMessage: true, CancellationToken.None);
    }
}
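The m_channel used in ConvertToAcsAudioPacketAndForward is an outbound work queue that serializes sends, since ClientWebSocket allows only one outstanding send at a time. A minimal sketch, assuming a Channel<Func<Task>> from System.Threading.Channels, drains and awaits the queued operations on a background task:
// Sketch: background consumer that executes queued send operations one at a
// time, so only a single SendAsync is ever in flight on the WebSocket.
private readonly Channel<Func<Task>> m_channel = Channel.CreateUnbounded<Func<Task>>();

private async Task ProcessOutboundQueueAsync(CancellationToken token)
{
    await foreach (Func<Task> sendOperation in m_channel.Reader.ReadAllAsync(token))
    {
        await sendOperation();
    }
}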
To reduce developer overhead when integrating with voice-based LLMs, Azure Communication Services supports a new sample rate of 24 kHz, eliminating the need for developers to resample audio data and helping preserve audio quality in the process.
Next steps
The SDK and documentation will be available in the next few weeks after this announcement, offering tools and information to integrate bidirectional streaming and utilize voice-based LLMs in your applications.
Stay tuned and check our blog for updates!