Building Knowledge-Grounded Conversational AI Agents with Azure Speech Photo Avatars

mhadiputro

Microsoft

Feb 23, 2026

While LLM-powered chat agents have advanced conversational intelligence, many still struggle to feel engaging or relatable over time. Text- and voice-only interactions often lack the visual cues needed to establish presence, trust, and emotional connection, creating a disconnect between technical capability and user perception. Avatars help bridge this gap by giving AI agents a visible identity, enabling more natural, human-like, and sustained conversational experiences.

From Chat to Presence: The Next Step in Conversational AI

Chat agents are now embedded across nearly every industry, from customer support on websites to direct integrations inside business applications designed to boost efficiency and productivity. As these agents become more capable and more visible, user expectations are also rising: conversations should feel natural, trustworthy, and engaging.

While text‑only chat agents work well for many scenarios, voice‑enabled agents take a meaningful step forward by introducing a clearer persona and a stronger sense of presence, making interactions feel more human and intuitive (see healow Genie success story). In domains such as Retail, Healthcare, Education, and Corporate Training, adding a visual dimension through AI avatars further elevates the experience. Pairing voice with a lifelike visual representation improves inclusiveness, reduces interaction friction, and helps users better contextualize conversations—especially in scenarios that rely on trust, guidance, or repeated engagement.

To support these experiences, Microsoft offers two AI avatar options through Azure Speech: Video Avatars, which are generally available and provide full‑ or partial‑body immersive representations, and Photo Avatars, currently in public preview, which deliver a headshot‑style visual well suited for web‑based agents and digital twin scenarios. Both options support custom avatars, enabling organizations to reflect their brand identity rather than relying solely on generic representations (see W2M custom video avatar).

Choosing between Video Avatars and Photo Avatars is less about preference and more about intent. Video Avatars offer higher visual fidelity and immersion but require more extensive onboarding, such as high-quality recorded video of an avatar talent. Photo Avatars, by contrast, can be created from a single image, enabling a lighter‑weight onboarding process while still delivering a human‑centered experience. The right choice depends on the desired interaction style, visual presence, and target deployment scenario.

What this solution demonstrates

In this post, I walk through how to integrate Azure Speech Photo Avatars — powered by Microsoft Research's VASA-1 model — into a knowledge‑grounded conversational AI agent built on Azure AI Search. The goal is to show how voice, visuals, and retrieval‑augmented generation (RAG) can come together to create a more natural and engaging agent experience.

The solution exposes a web‑based interface where users can speak naturally to the AI agent using their voice. The agent responds in real time using synthesized speech, while live transcriptions of the conversation are displayed in the UI to improve clarity and accessibility. To help compare different interaction patterns, the sample application supports three modes:

1) Photo Avatar mode, which adds a lifelike visual presence.
2) Video Avatar mode, which provides a more immersive, full‑motion experience.
3) Voice‑only mode, which focuses purely on speech‑to‑speech interaction.

Key architectural components

An end‑to‑end architecture for the solution is shown in the diagram below.

The solution is composed of the following core services and building blocks:

Microsoft Foundry: provides the platform for deploying, managing, and accessing the foundation models used by the application.
Azure OpenAI: provides the Realtime API for speech‑to‑speech interaction in the voice‑only mode and the Chat Completions API used by backend services for reasoning and conversational responses.

gpt-4.1: LLM used for reasoning tasks such as deciding when to invoke tool calls and summarizing responses.
gpt-realtime-mini: LLM used for speech‑to‑speech interaction in the voice‑only mode.
text-embedding-3-large: embedding model used to generate the vector embeddings for retrieval‑augmented generation.

Azure Speech: delivers the real‑time speech‑to‑text (STT), text‑to‑speech (TTS), and AI avatar capabilities for both Photo Avatar and Video Avatar experiences.
Azure Document Intelligence: extracts structured text, layout, and key information from source documents used to build the knowledge base.
Azure AI Search: provides vector‑based retrieval to ground the language model with relevant, context‑aware content.
Azure Container Apps: hosts the web UI frontend, backend services, and MCP server within a managed container runtime.
Azure Container Apps Environment: defines a secure and isolated boundary for networking, scaling, and observability of the containerized workloads.
Azure Container Registry: stores and manages Docker images used by the container applications.
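
To make the grounding step more concrete, here is a minimal sketch of vector retrieval followed by prompt assembly. It is not the sample's implementation: in the deployed solution the embeddings would come from text-embedding-3-large and the lookup would be served by Azure AI Search, whereas this sketch uses hand-written three-dimensional vectors and an in-memory list. The names `DOCS`, `retrieve`, and `build_grounded_prompt` are hypothetical.

```python
import math

# Toy corpus. The vectors are invented for illustration; a real index would
# hold embeddings produced by an embedding model.
DOCS = [
    {"id": "income", "text": "Net income figures from the annual report.",
     "vector": [0.9, 0.1, 0.0]},
    {"id": "priorities", "text": "Priorities described in the shareholder letter.",
     "vector": [0.1, 0.9, 0.0]},
    {"id": "leadership", "text": "Leadership section of the annual report.",
     "vector": [0.0, 0.2, 0.9]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, docs, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vector, d["vector"]), reverse=True)
    return ranked[:k]


def build_grounded_prompt(question, passages):
    """Place the retrieved passages ahead of the question so the model can cite them."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # Stand-in for the embedding of the user's question.
    query_vector = [0.8, 0.2, 0.1]
    top = retrieve(query_vector, DOCS, k=2)
    print(build_grounded_prompt("How much was net income?", top))
```

The prompt built this way would then be sent to the reasoning model, so the answer draws on retrieved passages rather than on the model's own recall.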

How you can try it yourself

The complete sample implementation is available in the LiveChat AI Voice Assistant repository, which includes instructions for deploying the solution into your Azure environment. The repository uses Infrastructure as Code (IaC) deployment via Azure Developer CLI (azd) to orchestrate Azure resource provisioning and application deployment.

Prerequisites: An Azure subscription with sufficient quota for the services and models listed above, plus Git and the Azure Developer CLI (azd) installed locally.

Getting the solution up and running takes three steps:

Clone the repository and navigate to the project

```
git clone https://github.com/mardianto-msft/azure-speech-ai-avatars.git
cd azure-speech-ai-avatars
```

Authenticate with Azure
```
azd auth login
```
Initialize and deploy the solution
```
azd up
```

Once deployed, you can access the sample application by opening the frontend service URL in a web browser. To demonstrate knowledge grounding, the sample includes source documents derived from Microsoft’s 2025 Annual Report and Shareholder Letter. These grounding documents can optionally be replaced with your own data, allowing the same architecture to be reused for domain‑specific or enterprise scenarios.

When using the provided sample documents, you can ask questions such as: “How much was Microsoft’s net income in 2025?”, “What are Microsoft’s priorities according to the shareholder letter?”, “Who is Microsoft’s CEO?”
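
Questions like these are the kind the reasoning model would be expected to answer by calling a retrieval tool rather than from its own recall. The sketch below shows one way a backend might declare such a tool and dispatch a call to it. The tool name `search_knowledge_base`, its parameters, and the stub search function are hypothetical and not taken from the sample repository, which includes an MCP server whose tool definitions are not shown in this post. The schema follows the function-tool shape accepted by the Chat Completions API, where tool-call arguments arrive as a JSON string.

```python
import json

# Hypothetical tool declaration in the Chat Completions function-tool shape.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Look up passages from the grounding documents.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search text."},
                "top": {"type": "integer", "description": "Max passages to return."},
            },
            "required": ["query"],
        },
    },
}


def search_knowledge_base(query, top=3):
    """Stub standing in for a real search call; returns placeholder passages."""
    return [f"passage {i + 1} for: {query}" for i in range(top)]


# Maps tool names the model may request to the Python functions that serve them.
TOOL_HANDLERS = {"search_knowledge_base": search_knowledge_base}


def dispatch_tool_call(name, arguments_json):
    """Run the handler for a requested tool and return its result as a JSON string."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps({"results": handler(**args)})


if __name__ == "__main__":
    print(dispatch_tool_call("search_knowledge_base",
                             '{"query": "net income 2025", "top": 2}'))
```

In a full loop, the JSON string returned by `dispatch_tool_call` would be appended to the conversation as a tool message so the model can summarize the retrieved passages in its spoken reply.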

Bringing Conversational AI Agents to Life

This implementation of Azure Speech Photo Avatars serves as a practical starting point for building more engaging, knowledge‑grounded conversational AI agents. By combining voice interaction, visual presence, and retrieval‑augmented generation, Photo Avatars offer a lightweight yet powerful way to make AI agents feel more approachable, trustworthy, and human‑centered — especially in web‑based and enterprise scenarios.

From here, the solution can be extended over time with capabilities such as long‑term memory, richer personalization, or more advanced multi‑agent orchestration. Whether used as a reference architecture or as the foundation for a production system, this approach demonstrates how Azure Speech Photo Avatars can help bridge the gap between conversational intelligence and meaningful user experience. By emphasizing accessibility, trust, and human‑centered design, it reflects Microsoft’s broader mission to empower every person and every organization on the planet to achieve more.

Updated Feb 20, 2026

Version 1.0
