Automating Real-Time Multi-Modal Customer Service with AI

JamesN

Microsoft

Dec 19, 2024

In today's fast-paced world, customer service is a crucial touchpoint for businesses, but traditional methods often lead to inefficiencies such as long wait times and repetitive information exchanges. These issues arise from the limited capacity of human agents and the complexity of managing various systems. Automating customer service with intelligent AI offers a solution that can efficiently scale and integrate with necessary systems via APIs. By providing real-time support, AI can address these challenges, offering consistent and immediate assistance that meets the high expectations of modern consumers.

Introducing Real-Time Multi-Modal Customer Service Solution Accelerator

This blog post introduces a solution accelerator specifically designed to automate customer service using AI. This innovative tool brings several key capabilities to the table:

Multi-Modal Real-Time Communication: The accelerator supports text chat and voice and has plans to incorporate video interactions, providing customers with a comprehensive communication suite.

Scalable Framework: It offers a framework that can expand across various domains, simulating the expertise of multiple human agents in a real-world customer service environment.
Microservice Architecture: With a stateful and scalable design, the architecture clearly separates the stateful agent service from the front-end layer, ensuring efficient operation.
Customization and Configuration: The solution is highly configurable, allowing businesses to easily customize agent workflows, system interactions, and introduce new domain areas as needed.
Advanced Technology Integration: Utilizing the latest real-time voice capability API from OpenAI and robust open-source SLM models, the accelerator is built on cutting-edge technology.
Error Handling and Recovery: It incorporates error handling and recovery techniques to maintain conversation memory, ensuring a seamless customer experience even in the event of disruptions.

Challenges in Building AI for Customer Service

While the benefits are clear, building AI for customer service presents several challenges:

Human-Like Communication: To truly mimic a human agent, AI must communicate with all the nuances of human interaction while adhering to customer support processes and guidelines. Current chatbot technologies are often limited to predefined flows and simple intent detection.
Integration with External Systems: Performing business transactions often requires complex interactions with external systems. This can involve multiple technical operations to complete a single business task.
Maintaining Context: Keeping track of conversation context across multiple interactions is crucial, especially when technical disruptions occur. This challenge is also present in human interactions, where customers may lose context if transferred between agents.
Complex Business Processes: Real-world customer service scenarios can be complex, requiring deep domain expertise that AI must emulate to be effective.
Wide Coverage Needs: Customer service must encompass a broad range of products and business domains, each with its own complexities and process flows, necessitating a flexible and adaptable AI solution.

By addressing these challenges, our AI-powered solution accelerator is poised to transform customer service, delivering efficient, scalable, and intelligent support that meets the demands of today's consumers. As technology continues to advance, businesses that embrace AI-driven customer service will be well-positioned to enhance customer satisfaction and drive success.

Key Solution Designs in the Accelerator

There are a number of key design innovations developed in this solution accelerator to overcome the challenges described above. Most of these design features are shared between both the text and voice modalities, but there are some aspects that are unique to each modality.

Common Elements:

Multi-Domain Agent Framework: To achieve a multi-domain agent solution, we have designed two patterns (one per modality) to orchestrate multiple individual agents—one for the hotel domain and another for the flight domain—so they can work seamlessly together. From the customer's perspective, these agents appear as a single customer service entity. Each domain agent is defined by a profile, which includes the system prompt and other agent specific data, and their associated tools, which are used to interact with source systems. This design means you can easily adapt the solution to your own use cases by replacing the sample agent profiles and tools with your own.
Stateful & Memory: State and memory are maintained across interactions, ensuring a coherent user experience both during agent transfers and in the event of connectivity issues. The solution provides an integration with Azure Redis to durably save session state along with an option for local in-memory storage for development.
Process Flow Definition: Clearly defined process flows guide the agent's actions, ensuring consistency and adherence to guidelines. These flows are defined in the agent profiles and their tools and are completely customizable to model your own workflows.
Source System Interactivity (Tool Calls): The ability to call external tools and systems is integrated into the service, allowing for seamless execution of complex tasks.
Headless Service: Operates independently of specific user interfaces, allowing flexibility in deployment.

Text Agent Specifics:

- Domain Agent Orchestration: To enable a seamless customer experience when interacting with multiple domain agents we’ve implemented a robust process to handle agent transfers.
  - Underlying the individual agents is the Agent Runner, which manages the transfer process. Each agent is equipped with a `get_help` tool, which is called when the agent detects the conversation topic has moved out of its domain. When the agent calls for help the Agent Runner takes over to route the conversation to the appropriate agent.
  - To transfer the conversation, first the Agent Runner classifies the intent of the new topic by comparing the user’s request against the available agent’s domain descriptions. The classifier returns the intended agent’s name, which the Agent Runner then checks for validity and ensures that it differs from the current agent. This process repeats up to three times and if a valid agent is still not identified then the default agent is assigned to handle the user’s request.
  - Finally, once a valid agent is identified, the Agent Runner assigns it as the active agent and supplies the conversation history to the new agent to ensure context is preserved through the transfer.
- History Management: Includes capabilities for limiting and restoring conversation history as needed to stay within the context window limits of the model. Specifically, the solution provides the `clean_up_history` function to limit the conversation history only to user questions and agent responses, reducing the clutter of tool calls. Additionally, the function `reset_history_to_last_question` is provided to restrict the history to the last user question. These functions can be used to effectively manage the size of the history while maintaining appropriate context.

Voice Agent Specifics:

Realtime API Capabilities: The voice modality, which is enabled by the GPT-4o Realtime API for speech and audio (Preview), unlocks exciting new possibilities for customer support scenarios. The model provides a number of features to address the challenges of using AI voice for customer service in real time.
- Session State Lifecycle: Manages the state of voice sessions throughout their lifecycle through the WebSocket connection.
- Voice Streaming Handling: Efficiently processes live voice streams for real-time interaction.
- Interruption Handling: Capable of managing interruptions seamlessly, maintaining conversation flow.
- Tool Calls: Enables integrations with external systems for task execution during voice interactions.
- Transcription Handling: Accurately transcribes voice interactions to enable conversation history tracking.
Session Management: Solution maintains session history using WebSocket sessions, the management of chat history is handled to ensure seamless interaction and continuity. Each session uniquely identifies and associated with a specific client, allowing aggregation of messages exchanged during that session. It ensures that the conversation can be resumed or referenced as needed within the same session.
Decoupled Architecture: Our architecture is an extension of the pattern developed in VoiceRAG, which introduced a simple decoupled architecture for implementing RAG with the Azure OpenAI gpt-4o-realtime-preview model. We leveraged that pattern to provide multi-modal agentic capabilities for customer service scenarios where the voice-to-voice capabilities show huge value in improved customer experiences through highly personalized and responsive engagements. As in the original VoiceRAG pattern, the front-end client is decoupled from the middle tier which handles all interactions with the real-time model. This provides the benefits including:
- Easy compatibility with any client that can work with Azure OpenAI API.
- Enhanced security by preventing the client from accessing the model directly along with any configuration and credentials.
Domain Agent Orchestration (Taming the chatter): In our experience with the real-time API we discovered that a different approach was needed for agent orchestration from the text agent, where the agent itself was able to detect changes in the conversation topic and raise a request for assistance with a tool call. The real-time model is just too chatty to reliably respect that check and often prefers to respond rather than delegate via tool call. To mitigate this tendency, we’ve introduced an asynchronous intent monitoring process to identify when topic changes occur and assign the new agent before the existing agent can respond.
- To enable this intent detection process, we leveraged the real-time API’s capability to provide transcriptions of the input user audio with the `input_audio_transcription` parameter. Once the transcription of the user’s request is returned by the model, we append it to the full conversation history, which is then sent to the intent detection model to determine which domain agent should be assigned to respond.
- This intent detection process is powered by a low latency small language model (SLM). We’re presently using a fine-tuned Mistral-7B model for this task but are testing even lower latency SLMs like Phi-4. The SLM has been fine-tuned on a large, generated dataset of conversations representative of the domains and includes transitions between them to ensure these can be accurately detected by the model. When this model receives the conversation transcript it classifies the intent of the most recent user request and returns the name of the appropriate domain agent.
- If the intended agent doesn’t match the currently assigned agent, then a transfer is initiated. The application resets the session with the real-time API using the profile and tools of the new agent. It also transfers the full conversation history by sending all prior messages as ` conversation.item.create` items to maintain conversation context with the customer.

History Management: The middle tier includes features to limit conversation history to a predefined limit to keep conversations within the context window limits.

These key design elements ensure that the solution accelerator delivers a robust, scalable, and versatile customer service AI capable of handling a wide range of interactions across different modalities and domains.

See how AI can enhance your customer service experience today

This solution is available on GitHub today to explore adding multi-modal agent capabilities to your customer service use cases: microsoft/multi-modal-customer-service-agent. Begin by exploring the travel agent use cases included in the repo sample and easily replace the agent profiles and capabilities with your own personas and systems to create a multi-modal customer service agentic system for your business.