The new Azure OpenAI gpt-4o-realtime-preview model opens the door for even more natural application user interfaces with its speech-to-speech capability.
This new voice-based interface also brings an interesting new challenge with it: how do you implement retrieval-augmented generation (RAG), the prevailing pattern for combining language models with your own data, in a system that uses audio for input and output?
In this blog post we present a simple architecture for voice-based generative AI applications that enables RAG on top of the real-time audio API, with full-duplex audio streaming from client devices and secure handling of access to both the model and the retrieval system.
Architecting for real-time voice + RAG
Supporting RAG workflows
We use two key building blocks to make voice work with RAG:
- Function calling: the gpt-4o-realtime-preview model supports function calling, allowing us to include “tools” for searching and grounding in the session configuration. The model listens to audio input and directly invokes these tools with parameters that describe what it’s looking to retrieve from the knowledge base (see the configuration sketch after this list).
- Real-time middle tier: we need to separate what needs to happen in the client from what cannot be done client-side. The full-duplex, real-time audio content needs to go to/from the client device’s speakers/microphone. On the other hand, the model configuration (system message, max tokens, temperature, etc.) and access to the knowledge base for RAG need to be handled on the server, since we don’t want the client to have credentials for these resources, and don’t want to require the client to have network line-of-sight to these components. To accomplish this, we introduce a middle tier component that proxies audio traffic, while keeping aspects such as model configuration and function calling entirely on the backend.
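To make this concrete, here is a minimal sketch of the configuration side of that middle tier, assuming a Python backend: it intercepts the client’s session.update message and injects the server-side system message, sampling options, and the “search” tool definition. The tool schema and values below are illustrative assumptions, not the sample’s exact code.

```python
# Minimal sketch: the client connects to our backend over a WebSocket, the backend connects
# to the Azure OpenAI real-time API, and every "session.update" coming from the client is
# rewritten so the system message, temperature, and tool definitions are always set server-side.
import json

SYSTEM_MESSAGE = "Answer questions using only information returned by the knowledge base."

SEARCH_TOOL = {
    "type": "function",
    "name": "search",
    "description": "Search the knowledge base for passages relevant to the user's question.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "Search query"}},
        "required": ["query"],
    },
}

def rewrite_session_update(raw_message: str) -> str:
    """Force server-side configuration onto any session.update sent by the client."""
    message = json.loads(raw_message)
    if message.get("type") == "session.update":
        session = message.setdefault("session", {})
        session["instructions"] = SYSTEM_MESSAGE  # system prompt stays on the server
        session["temperature"] = 0.6              # illustrative value
        session["tools"] = [SEARCH_TOOL]          # lets the model emit "search" calls
        session["tool_choice"] = "auto"
    return json.dumps(message)
```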
These two building blocks work in coordination: the real-time API knows not to move a conversation forward if there are outstanding function calls. When the model needs information from the knowledge base to respond to input, it emits a “search” function call. We turn that function call into an Azure AI Search hybrid query (vector + keyword search + semantic reranking), get the content passages that best relate to what the model needs to know, and send them back to the model as the function’s output. Once the model sees that output, it responds via the audio channel, moving the conversation forward.
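As an illustration, here is a hedged sketch of how the backend might turn the model’s “search” call into that query with the azure-search-documents SDK. The index name, field names (chunk_id, chunk, text_vector), and the semantic configuration name are assumptions about your index schema; adjust them to your own.

```python
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=DefaultAzureCredential(),
)

def run_search(query: str) -> str:
    """Run a hybrid query with semantic reranking and return passages as the tool output."""
    results = search_client.search(
        search_text=query,  # keyword part of the hybrid query
        # VectorizableTextQuery lets the service vectorize the query text itself (requires an
        # integrated vectorizer on the index); otherwise embed the query and use VectorizedQuery.
        vector_queries=[VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")],
        query_type="semantic",
        semantic_configuration_name="default",
        top=5,
    )
    # Concatenate the best passages; this string becomes the function call output
    # that is sent back to the model before it answers over the audio channel.
    return "\r\n".join(f"[{doc['chunk_id']}]: {doc['chunk']}" for doc in results)
```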
A critical element in this picture is fast and accurate retrieval. The search call happens between the user turn and the model response in the audio channel, a latency-sensitive point in time. Azure AI Search is the perfect fit for this, with its low latency for vector and hybrid queries and its support for semantic reranking to maximize relevance of responses.
Generating grounded responses
Using function calling addresses the question of how to coordinate search queries against the knowledge base, but this inversion of control creates a new problem: we don’t know which of the passages retrieved from the knowledge base were used to ground each response. In a typical RAG application that interacts with the model API through text, we can ask in the prompt for citations in a special notation and render them appropriately in the UX; but when the model is generating audio, we don’t want it to read file names or URLs out loud. Since it’s critical for generative AI applications to be transparent about what grounding data was used to respond to any given input, we need a different mechanism for identifying and showing citations in the user experience.
We also use function calling to accomplish this. We introduce a second tool called “report_grounding”, and as part of the system prompt we include instructions along these lines:
Use the following step-by-step instructions to respond with short and concise answers using a knowledge base:
Step 1 - Always use the 'search' tool to check the knowledge base before answering a question.
Step 2 - Always use the 'report_grounding' tool to report the source of information from the knowledge base.
Step 3 - Produce an answer that's as short as possible. If the answer isn't in the knowledge base, say you don't know.
We experimented with different ways to formulate this prompt and found that explicitly listing this as a step-by-step process is particularly effective.
With these two tools in place, we now have a system that streams audio to the model, lets the model call back into app logic on the backend both to search and to tell us which pieces of grounding data were used, and then streams audio back to the client along with extra messages carrying the grounding information (you can see this in the UI as citations to documents that appear as the answer is spoken).
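Here is a sketch of what that second tool and its handler could look like. The tool schema, the out-of-band message type, and the handler wiring are illustrative assumptions about the client/middle-tier protocol, not a fixed contract from the sample.

```python
# The "report_grounding" tool the model can call with the IDs of the sources it actually used,
# plus a handler that forwards those citations to the client as an extra, non-audio message.
import json

REPORT_GROUNDING_TOOL = {
    "type": "function",
    "name": "report_grounding",
    "description": "Report the knowledge base sources that were actually used in the answer.",
    "parameters": {
        "type": "object",
        "properties": {
            "sources": {
                "type": "array",
                "items": {"type": "string"},
                "description": "IDs of the source chunks used to ground the answer",
            }
        },
        "required": ["sources"],
    },
}

async def handle_report_grounding(sources: list[str], client_ws) -> str:
    """Push citation info to the client UI; return a short acknowledgement to the model."""
    await client_ws.send(json.dumps({
        "type": "extension.grounding_sources",  # out-of-band message the UI renders as citations
        "sources": sources,
    }))
    return "Sources reported."
```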
Using any Real-Time API-enabled client
Note that the middle tier completely suppresses tool-related interactions and overrides system configuration options, but otherwise maintains the same protocol. This means that any client that works directly against the Azure OpenAI API will “just work” against the real-time middle tier, since the RAG process is entirely encapsulated on the backend.
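A minimal sketch of that pass-through behavior, assuming the websockets package on the backend: tool-related events are consumed server-side, and everything else is relayed to the client unchanged. The dispatch details are an illustrative assumption rather than the sample’s exact code.

```python
import json

TOOL_EVENT_TYPES = {
    "response.function_call_arguments.delta",
    "response.function_call_arguments.done",
}

async def relay_server_events(server_ws, client_ws, tool_handlers):
    async for raw in server_ws:  # frames from the Azure OpenAI real-time API
        event = json.loads(raw)
        if event.get("type") in TOOL_EVENT_TYPES:
            if event["type"] == "response.function_call_arguments.done":
                # Dispatch to the matching tool handler ("search" or "report_grounding");
                # correlate via call_id if the event doesn't carry the tool name directly.
                handler = tool_handlers.get(event.get("name", ""))
                if handler:
                    await handler(json.loads(event["arguments"]))
            continue  # the client never sees tool traffic
        await client_ws.send(raw)  # audio deltas, transcripts, turn events flow through unchanged
```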
Creating secure generative AI apps
We keep all configuration elements (system prompt, max tokens, etc.) and all credentials (for Azure OpenAI, Azure AI Search, etc.) in the backend, securely separated from clients. Furthermore, Azure OpenAI and Azure AI Search include extensive security capabilities to further harden the backend: network isolation so the API endpoints of both models and search indexes aren’t reachable from the public internet, Microsoft Entra ID to avoid key-based authentication across services, and options for multiple layers of encryption for the indexed content.
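For illustration, here is a minimal sketch of keyless (Entra ID) authentication from the backend using DefaultAzureCredential; the endpoint and API version are placeholders, and the same credential object also works for the SearchClient shown earlier, so no API keys need to be stored anywhere.

```python
# Keyless access from the backend: managed identity when deployed to Azure,
# developer sign-in when running locally.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

credential = DefaultAzureCredential()

openai_client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    azure_ad_token_provider=get_bearer_token_provider(
        credential, "https://cognitiveservices.azure.com/.default"
    ),
    api_version="2024-10-01-preview",
)
```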
Try it today
The code and data for everything discussed in this blog post are available in this GitHub repo: Azure-Samples/aisearch-openai-rag-audio. You can use it as-is, or easily swap in your own data and talk to that instead.
The code in the repo above and the description in this blog post are more of a pattern than a specific solution. You’ll need to experiment to get the prompts right, perhaps expand the RAG workflow, and certainly assess it for security and AI safety.
To learn more about the Azure OpenAI gpt-4o-realtime-preview model and real-time API you can go here. For Azure AI Search you’ll find plenty of resources here, and the documentation here.
Looking forward to seeing new “talk to your data” scenarios!