Azure Communication Services Blog

Build your own real-time voice agent - Announcing preview of bidirectional audio streaming APIs

MilanKaur, Microsoft
Dec 05, 2024

We are pleased to announce the public preview of bidirectional audio streaming, enhancing the capabilities of voice-based conversational AI. 

During Satya Nadella’s keynote at Ignite, Seth Juarez demonstrated a voice agent engaging in a live phone conversation with a customer. You can now create similar experiences using the Azure Communication Services bidirectional audio streaming APIs and the GPT-4o Realtime API. 

 

 

In our recent Ignite blog post, we announced the upcoming preview of our audio streaming APIs. Now that the preview is publicly available, this blog describes how to use the bidirectional audio streaming APIs in the Azure Communication Services Call Automation SDK to build low-latency voice agents powered by the GPT-4o Realtime API. 

 

How does the bidirectional audio streaming API enhance the quality of voice-driven agent experiences? 

AI-powered agents facilitate seamless, human-like interactions and can engage with users through various channels such as chat or voice. In voice communication, low latency in conversational responses is crucial; delays can make the agent seem unresponsive and disrupt the flow of conversation. 

Gone are the days when building a voice bot required stitching together multiple models for transcription, inference, and text-to-speech conversion. Developers can now stream live audio from an ongoing call (VoIP or telephony) to their backend server logic using the bidirectional audio streaming APIs, leverage GPT-4o to process the audio input, and deliver responses back to the caller with minimal latency. 

 

Building Your Own Real-Time Voice Agent 

In this section, we walk you through a quickstart that uses Call Automation’s audio streaming APIs to build a voice agent. 

 

Before you begin, ensure you have the following: 

    • An Azure account with an active subscription 
    • An Azure Communication Services resource and a phone number that can receive calls 
    • An Azure OpenAI resource with a GPT-4o Realtime model deployment 
    • The .NET SDK and the Azure Dev Tunnels CLI installed 
    • A local clone of the sample repository: 

git clone https://github.com/Azure-Samples/communication-services-dotnet-quickstarts.git 

 

After completing the prerequisites, open the cloned project and follow these setup steps.  

 Environment Setup 

Before running this sample, you need to set up the previously mentioned resources with the following configuration updates:

 

  1. Set up and host your Azure dev tunnel 

Azure Dev tunnels is an Azure service that enables you to expose locally hosted web services to the internet. Use the following commands to connect your local development environment to the public internet. This creates a tunnel with a persistent endpoint URL and enables anonymous access. We use this endpoint to notify your application of calling events from the Azure Communication Services Call Automation service. 

devtunnel create --allow-anonymous 
devtunnel port create -p 5165 
devtunnel host 

 

2. Navigate to the CallAutomation_AzOpenAI_Voice quickstart in the project you cloned. 

3. Add the required API keys and endpoints 

Open the appsettings.json file and add values for the following settings: 

    • DevTunnelUri: Your dev tunnel endpoint 
    • AcsConnectionString: Azure Communication Services resource connection string 
    • AzureOpenAIServiceKey: Azure OpenAI service key 
    • AzureOpenAIServiceEndpoint: Azure OpenAI service endpoint 
    • AzureOpenAIDeploymentModelName: Azure OpenAI model deployment name 
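
For reference, here is a minimal sketch of what the completed appsettings.json might look like. All values are placeholders (the deployment name in particular is whatever you named your GPT-4o Realtime deployment); leave any other settings in the sample file unchanged. 

{ 
  "DevTunnelUri": "https://<your-tunnel-id>.devtunnels.ms", 
  "AcsConnectionString": "endpoint=https://<your-acs-resource>.communication.azure.com/;accesskey=<your-access-key>", 
  "AzureOpenAIServiceKey": "<your-azure-openai-key>", 
  "AzureOpenAIServiceEndpoint": "https://<your-azure-openai-resource>.openai.azure.com/", 
  "AzureOpenAIDeploymentModelName": "<your-gpt-4o-realtime-deployment-name>" 
} 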

Run the Application 

  1. Ensure your Azure Dev Tunnel URI is active and points to the correct port of your localhost application. 
  2. Run the command dotnet run to build and run the sample application. 
  3. Register an Event Grid webhook for the IncomingCall event that points to your DevTunnel URI (https://<your-devtunnel-uri>/api/incomingCall). You can create the subscription in the Azure portal or with the Azure CLI, as sketched below. For more information, see Incoming call concepts. 
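
If you prefer to script this step, the following Azure CLI sketch (shown with bash-style line continuations) shows one way to create the Event Grid subscription. The subscription name and resource IDs are placeholders that you would replace with your own values. 

az eventgrid event-subscription create \
  --name incoming-call-webhook \
  --source-resource-id /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Communication/communicationServices/<acs-resource-name> \
  --included-event-types Microsoft.Communication.IncomingCall \
  --endpoint-type webhook \
  --endpoint https://<your-devtunnel-uri>/api/incomingCall 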

Test the app 

Once the application is running: 

Call your Azure Communication Services number: Dial the number set up in your Azure Communication Services resource. A voice agent answers, enabling you to converse naturally. 

View the transcription: See a live transcription in the console window. 

QuickStart Walkthrough 

Now that the app is running and testable, let’s explore the quick start code snippet and how to use the new APIs. 

Within the program.cs file, the endpoint /api/incomingCall handles inbound calls. 

app.MapPost("/api/incomingCall", async ( 
    [FromBody] EventGridEvent[] eventGridEvents, 
    ILogger<Program> logger) => 
{ 
    foreach (var eventGridEvent in eventGridEvents) 
    { 
        Console.WriteLine($"Incoming Call event received."); 
 
        // Handle system events 
        if (eventGridEvent.TryGetSystemEventData(out object eventData)) 
        { 
            // Handle the subscription validation event. 
            if (eventData is SubscriptionValidationEventData subscriptionValidationEventData) 
            { 
                var responseData = new SubscriptionValidationResponse 
                { 
                    ValidationResponse = subscriptionValidationEventData.ValidationCode 
                }; 
                return Results.Ok(responseData); 
            } 
        } 
 
        var jsonObject = Helper.GetJsonObject(eventGridEvent.Data); 
        var callerId = Helper.GetCallerId(jsonObject); 
        var incomingCallContext = Helper.GetIncomingCallContext(jsonObject); 
        var callbackUri = new Uri(new Uri(appBaseUrl), $"/api/callbacks/{Guid.NewGuid()}?callerId={callerId}"); 
        logger.LogInformation($"Callback Url: {callbackUri}"); 
        var websocketUri = appBaseUrl.Replace("https", "wss") + "/ws"; 
        logger.LogInformation($"WebSocket Url: {websocketUri}"); 
 
        var mediaStreamingOptions = new MediaStreamingOptions( 
                new Uri(websocketUri), 
                MediaStreamingContent.Audio, 
                MediaStreamingAudioChannel.Mixed, 
                startMediaStreaming: true 
                ) 
        { 
            EnableBidirectional = true, 
            AudioFormat = AudioFormat.Pcm24KMono 
        }; 
 
        var options = new AnswerCallOptions(incomingCallContext, callbackUri) 
        { 
            MediaStreamingOptions = mediaStreamingOptions, 
        }; 
 
        AnswerCallResult answerCallResult = await client.AnswerCallAsync(options); 
        logger.LogInformation($"Answered call for connection id: {answerCallResult.CallConnection.CallConnectionId}"); 
    } 
    return Results.Ok(); 
}); 

 

In the preceding code, MediaStreamingOptions encapsulates all the configurations for bidirectional streaming.  

  • WebSocketUri: We use the dev tunnel URI with the WebSocket protocol (wss), appending the path /ws. This endpoint handles the WebSocket messages. 
  • MediaStreamingContent: The current version of the API supports only audio. 
  • AudioChannel: Supported channel types include: 
      • Mixed: Contains the combined audio streams of all participants on the call, flattened into one stream. 
      • Unmixed: Contains a single audio stream per participant per channel, with support for up to four channels for the most dominant speakers at any given time. You also get a participantRawID to identify the speaker. 
  • StartMediaStreaming: When set to true, this flag starts the bidirectional stream automatically once the call is established. 
  • EnableBidirectional: This enables both sending and receiving of audio. By default, the stream only delivers audio from Azure Communication Services to your application. 
  • AudioFormat: This can be either 16 kHz pulse code modulation (PCM) mono or 24 kHz PCM mono. 
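
To illustrate these options, here is a sketch of an alternative configuration that requests unmixed, per-participant audio at 16 kHz instead of the mixed 24 kHz setup used in this quickstart (it assumes the SDK's AudioFormat.Pcm16KMono value; the rest of the walkthrough keeps the configuration shown earlier). 

var unmixedOptions = new MediaStreamingOptions( 
        new Uri(websocketUri),              // same /ws WebSocket endpoint as before 
        MediaStreamingContent.Audio,        // audio is currently the only supported content type 
        MediaStreamingAudioChannel.Unmixed, // one stream per participant, up to four dominant speakers 
        startMediaStreaming: true) 
{ 
    EnableBidirectional = true, 
    AudioFormat = AudioFormat.Pcm16KMono    // 16 kHz PCM mono instead of 24 kHz 
}; 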

Once you configure these settings, pass them to AnswerCallOptions when answering the call. Now that the call is established, let's dive into handling the WebSocket messages. 

This code snippet handles the audio data received over the WebSocket. The WebSocket's path is specified as /ws, which corresponds to the WebSocketUri provided in the configuration. 

 

app.Use(async (context, next) => 
{ 
    if (context.Request.Path == "/ws") 
    { 
        if (context.WebSockets.IsWebSocketRequest) 
        { 
            try 
            { 
                var webSocket = await context.WebSockets.AcceptWebSocketAsync(); 
                var mediaService = new AcsMediaStreamingHandler(webSocket, builder.Configuration); 
 
                // Set the single WebSocket connection 
                await mediaService.ProcessWebSocketAsync(); 
            } 
            catch (Exception ex) 
            { 
                Console.WriteLine($"Exception received {ex}"); 
            } 
        } 
        else 
        { 
            context.Response.StatusCode = StatusCodes.Status400BadRequest; 
        } 
    } 
    else 
    { 
        await next(context); 
    } 
}); 

The call to await mediaService.ProcessWebSocketAsync() processes all incoming messages. The method establishes a connection with OpenAI, initiates a conversation session, and waits for responses from OpenAI, ensuring seamless communication between the application and OpenAI for real-time audio processing and interaction. 

 

// Method to receive messages from WebSocket 
public async Task ProcessWebSocketAsync() 
{ 
    if (m_webSocket == null) { return; } 
 
    // Start forwarder to AI model 
    m_aiServiceHandler = new AzureOpenAIService(this, m_configuration); 
 
    try 
    { 
        m_aiServiceHandler.StartConversation(); 
        await StartReceivingFromAcsMediaWebSocket(); 
    } 
    catch (Exception ex) 
    { 
        Console.WriteLine($"Exception -> {ex}"); 
    } 
    finally 
    { 
        m_aiServiceHandler.Close(); 
        this.Close(); 
    } 
} 

 

Once the application receives data from Azure Communication Services, it parses the incoming JSON payload to extract the audio data segment. The application then forwards the segment to OpenAI for further processing, with the parsing step ensuring data integrity before the audio is sent for analysis. 

// Receive messages from WebSocket 
private async Task StartReceivingFromAcsMediaWebSocket() 
{ 
    if (m_webSocket == null) { return; } 
 
    try 
    { 
        while (m_webSocket.State == WebSocketState.Open) 
        { 
            // Buffer for one WebSocket frame of streaming data from Azure Communication Services 
            byte[] receiveBuffer = new byte[2048]; 
            WebSocketReceiveResult receiveResult = await m_webSocket.ReceiveAsync(new ArraySegment<byte>(receiveBuffer), m_cts.Token); 
 
            if (receiveResult.MessageType != WebSocketMessageType.Close) 
            { 
                string data = Encoding.UTF8.GetString(receiveBuffer).TrimEnd('\0'); 
                await WriteToAzOpenAIServiceInputStream(data); 
            } 
        } 
    } 
    catch (Exception ex) 
    { 
        Console.WriteLine($"Exception -> {ex}"); 
    } 
} 

 

Here is how the application parses and forwards the data segment to OpenAI using the established session: 

private async Task WriteToAzOpenAIServiceInputStream(string data) 
{ 
    var input = StreamingData.Parse(data); 
 
    if (input is AudioData audioData) 
    { 
        using (var ms = new MemoryStream(audioData.Data)) 
        { 
            await m_aiServiceHandler.SendAudioToExternalAI(ms); 
        } 
    } 
} 

 

Once the application receives a response from OpenAI, it formats the data to be forwarded to Azure Communication Services and relays the response into the call. If the application detects voice activity while OpenAI is talking, it sends a barge-in message to Azure Communication Services to stop the audio currently playing in the call. 

// Loop and wait for the AI response 

private async Task GetOpenAiStreamResponseAsync() 
{ 
    try 
    { 
        await m_aiSession.StartResponseAsync(); 
        await foreach (ConversationUpdate update in m_aiSession.ReceiveUpdatesAsync(m_cts.Token)) 
        { 
            if (update is ConversationSessionStartedUpdate sessionStartedUpdate) 
            { 
                Console.WriteLine($"<<< Session started. ID: {sessionStartedUpdate.SessionId}"); 
                Console.WriteLine(); 
            } 

            if (update is ConversationInputSpeechStartedUpdate speechStartedUpdate) 
            { 
                Console.WriteLine($"  -- Voice activity detection started at {speechStartedUpdate.AudioStartTime} ms"); 
                // Barge-in, send stop audio 
                var jsonString = OutStreamingData.GetStopAudioForOutbound(); 
                await m_mediaStreaming.SendMessageAsync(jsonString); 
            } 

            if (update is ConversationInputSpeechFinishedUpdate speechFinishedUpdate) 
            { 
                Console.WriteLine($"  -- Voice activity detection ended at {speechFinishedUpdate.AudioEndTime} ms"); 
            } 

            if (update is ConversationItemStreamingStartedUpdate itemStartedUpdate) 
            { 
                Console.WriteLine($"  -- Begin streaming of new item"); 
            } 

            // Audio transcript updates contain the incremental text matching the generated output audio. 
            if (update is ConversationItemStreamingAudioTranscriptionFinishedUpdate outputTranscriptDeltaUpdate) 
            { 
                Console.Write(outputTranscriptDeltaUpdate.Transcript); 
            } 

            // Audio delta updates contain the incremental binary audio data of the generated output audio
            // matching the output audio format configured for the session. 
            if (update is ConversationItemStreamingPartDeltaUpdate deltaUpdate) 
            { 
                if (deltaUpdate.AudioBytes != null) 
                { 
                    var jsonString = OutStreamingData.GetAudioDataForOutbound(deltaUpdate.AudioBytes.ToArray()); 
                    await m_mediaStreaming.SendMessageAsync(jsonString); 
                } 
            } 

            if (update is ConversationItemStreamingTextFinishedUpdate itemFinishedUpdate) 
            { 
                Console.WriteLine(); 
                Console.WriteLine($"  -- Item streaming finished, response_id={itemFinishedUpdate.ResponseId}"); 
            } 

            if (update is ConversationInputTranscriptionFinishedUpdate transcriptionCompletedUpdate) 
            { 
                Console.WriteLine(); 
                Console.WriteLine($"  -- User audio transcript: {transcriptionCompletedUpdate.Transcript}"); 
                Console.WriteLine(); 
            }  

            if (update is ConversationResponseFinishedUpdate turnFinishedUpdate) 
            { 
                Console.WriteLine($"  -- Model turn generation finished. Status: {turnFinishedUpdate.Status}"); 
            }  

            if (update is ConversationErrorUpdate errorUpdate) 
            { 
                Console.WriteLine(); 
                Console.WriteLine($"ERROR: {errorUpdate.Message}"); 
                break; 
            } 
        } 
    } 

    catch (OperationCanceledException e) 
    { 
        Console.WriteLine($"{nameof(OperationCanceledException)} thrown with message: {e.Message}"); 
    } 

    catch (Exception ex) 
    { 
        Console.WriteLine($"Exception during AI streaming -> {ex}"); 
    } 
} 

 

Once the data is prepared for Azure Communication Services, the application sends the data over the WebSocket: 

public async Task SendMessageAsync(string message) 
{ 
    if (m_webSocket?.State == WebSocketState.Open) 
    { 
        byte[] jsonBytes = Encoding.UTF8.GetBytes(message); 

        // Send the serialized JSON payload (audio data or stop-audio command) over the WebSocket 
        await m_webSocket.SendAsync(new ArraySegment<byte>(jsonBytes), WebSocketMessageType.Text, endOfMessage: true, CancellationToken.None); 
    } 
} 

 

This wraps up our QuickStart overview. We hope you create outstanding voice agents with the new audio streaming APIs. Happy coding!  

For more information about the Azure Communication Services bidirectional audio streaming APIs, check out the audio streaming documentation on Microsoft Learn. 
