Azure AI Speech announces public preview of text to speech avatar

Melinda Ma · Nov 15 2023

We are excited to announce the public preview release of Azure AI Speech text to speech avatar, a new feature that enables users to create talking avatar videos with text input, and to build real-time interactive bots trained using human images. In this blog post, we will introduce the features, benefits, and technical details of this feature, and show you some examples of how you can use it for various scenarios.

What is text to speech avatar?

The text to speech avatar system is a text to speech feature with vision capabilities that allows customers to create synthetic videos of a 2D photorealistic avatar speaking. The neural text to speech avatar models are trained by deep neural networks on video recordings of human talent, and the voice of the avatar is provided by a text to speech voice model.

Why do we build avatars? There are two main reasons:

Traditional video content creation requires a lot of time and budget, including setting up a video shooting environment, filming, and editing. With text to speech avatar, users can create video more efficiently. Users can use the avatar to build training videos, product introductions, customer testimonials, and more, simply with text input.
With the release of Azure OpenAI Service and neural text to speech, interactive conversation is more natural than before. With text to speech avatar, users can create more engaging digital interactions. You can use the avatar to build conversational agents, virtual assistants, chatbots, and more.

There are three components in an avatar content generation workflow: the text analyzer, the TTS audio synthesizer, and the TTS avatar video synthesizer. To generate an avatar video, text is first passed to the text analyzer, which outputs a phoneme sequence. Then, the TTS audio synthesizer predicts the acoustic features of the input text and synthesizes the voice. These two parts are provided by text to speech voice models. Next, the neural text to speech avatar model predicts the lip-synced images from the acoustic features, so that the synthetic video is generated.
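To make the data flow between the three components concrete, here is a toy sketch in Python. It is not the service's implementation: every function name and data type below is hypothetical, letters stand in for phonemes, and a single number stands in for the acoustic features. It only illustrates that each stage consumes the output of the previous one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AvatarVideo:
    """Toy container for the intermediate and final outputs of the pipeline."""
    phonemes: List[str]
    acoustic_features: List[float]
    frames: List[str]


def analyze_text(text: str) -> List[str]:
    """Stage 1 (toy text analyzer): each ASCII letter stands in for one phoneme."""
    return [ch for ch in text.lower() if "a" <= ch <= "z"]


def synthesize_audio(phonemes: List[str]) -> List[float]:
    """Stage 2 (toy audio synthesizer): one made-up acoustic feature per phoneme."""
    return [(ord(p) - ord("a")) / 25.0 for p in phonemes]


def synthesize_video(features: List[float]) -> List[str]:
    """Stage 3 (toy avatar synthesizer): pick a mouth shape per acoustic feature."""
    return ["open" if f >= 0.5 else "closed" for f in features]


def generate_avatar_video(text: str) -> AvatarVideo:
    """Chain the three stages: text -> phonemes -> acoustic features -> frames."""
    phonemes = analyze_text(text)
    features = synthesize_audio(phonemes)
    frames = synthesize_video(features)
    return AvatarVideo(phonemes, features, frames)


if __name__ == "__main__":
    video = generate_avatar_video("Hello, avatar!")
    print(len(video.phonemes), "phonemes ->", len(video.frames), "frames")
```

In the real system the first two stages are handled by the text to speech voice model and the last one by the avatar model; the sketch only mirrors that ordering.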

Below is an overview of the workflow:

What’s in this release?

We offer two separate text to speech avatar features at this time: prebuilt text to speech avatar and custom text to speech avatar.

Prebuilt text to speech avatar

Microsoft offers prebuilt text to speech avatars as out-of-the-box products on Azure for its subscribers. These avatars can speak different languages and voices based on the text input. Customers can select an avatar from a variety of options and use it to create video content or interactive applications with real-time avatar responses.

Custom text to speech avatar

The custom text to speech avatar feature enables customers to create a personalized avatar for their product or brand. Customers can upload their own video recording of avatar talent, which the feature uses to train a model that generates synthetic video of the custom avatar speaking. Customers can choose either a prebuilt or a custom neural voice for their avatar. If the same person's voice and likeness are used for both the custom neural voice and the custom text to speech avatar, the avatar will closely resemble that person.

As part of Microsoft's commitment to responsible AI, text to speech avatar is designed with the intention of protecting the rights of individuals and society, fostering transparent human-computer interaction, and counteracting the proliferation of harmful deepfakes and misleading content. For this reason, custom avatar is a Limited Access feature available by registration only, and only for certain use cases. To access and use the feature in your business applications, register your use case and apply for access.

We support both a UI tool in Azure AI Speech Studio and API access.

The Text to speech Avatar tool for video content creation on Speech Studio

A Live chat avatar demo tool on Speech Studio

What can text to speech avatar do?

With text to speech avatar, you can create engaging videos, such as training videos and presentation videos, with a prebuilt avatar or your own custom avatar.

You can also create engaging experiences for customers, employees, and other audiences by providing applications enriched with an interactive avatar.

Batch video content creation

Training video for enterprise
Product introduction or advertisement materials
CEO digital twin to present at a conference

Real-time interactive application

A chatbot for a travel website
A virtual sales agent in a live commercial
An AI teacher who teaches online and can answer questions
A virtual HR assistant to respond to employees’ questions

Here are examples of video content creation with a custom avatar and a virtual sales application powered by text to speech avatar and Azure OpenAI Service. For each sample, we provide an introduction to how it was created, a demo video of the result, and the sample code.

Video content creation

Engaging avatar video experiences are typically composed of several elements, including the talking avatar video, background images or videos, ambient music, and other elements that make the video more polished.

Here is a simple workflow of creating rich avatar videos:

Start with a talking script for your avatar using either plaintext format or the Speech Synthesis Markup Language (SSML). SSML allows you to fine-tune the voice of your avatar including pronunciation, and the expression of special terms such as brand names, coupled with specific gestures like a hand wave or pointing to an item. The Audio Content Creation tool of the Speech Studio provides an intuitive user interface to create an SSML input file for your avatar video.
With your talking script ready, you can use the Azure TTS 3.1 API to synthesize your avatar video. Besides the SSML input, you can specify the character and style of the avatar (such as standing or sitting) and the desired video format (such as transparent background). The Text to speech avatar tool in Speech Studio also provides a no-code option to create avatar video.
In many cases, you probably want to add a content image or a video with text, illustrations, animations etc. to the final avatar video. In this sample we exported an animated PowerPoint presentation as a high-resolution video for this purpose.
Finally, combine your assets, including the avatar video, content, and optional elements like background music, to compose your rich video experience. This can be done either with the FFmpeg tool or with a video editor like Microsoft Clipchamp for more control. Using a video editor provides an intuitive way to fine-tune the timings of the video and add engaging effects and animations.
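The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the official sample: the helper names are hypothetical, the request field names (`talkingAvatarCharacter`, `talkingAvatarStyle`, `videoFormat`, and so on) and the avatar values `lisa` and `graceful-sitting` are assumptions modeled on the preview batch synthesis API and should be checked against the API reference, and the FFmpeg arguments assume a WebM/VP9 avatar video with a transparent background. The code only builds the SSML, the request body, and the FFmpeg argument list; it does not call the service or run FFmpeg.

```python
import json
from typing import Dict, List
from xml.sax.saxutils import escape


def build_ssml(text: str, voice: str = "en-US-JennyNeural", lang: str = "en-US") -> str:
    """Step 1: wrap a plain-text script in a minimal single-voice SSML document."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">{escape(text)}</voice>'
        "</speak>"
    )


def build_avatar_request(ssml: str, character: str = "lisa",
                         style: str = "graceful-sitting") -> Dict:
    """Step 2: assemble a request body for an avatar synthesis job.

    ASSUMPTION: field names and values are modeled on the preview API and
    may differ from the current API reference.
    """
    return {
        "displayName": "avatar video sketch",
        "textType": "SSML",
        "inputs": [{"text": ssml}],
        "properties": {
            "talkingAvatarCharacter": character,
            "talkingAvatarStyle": style,
            "videoFormat": "webm",
            "videoCodec": "vp9",
            "backgroundColor": "transparent",
        },
    }


def build_ffmpeg_overlay_cmd(content_video: str, avatar_video: str,
                             output: str) -> List[str]:
    """Step 4: FFmpeg arguments that overlay the avatar on the content video."""
    return [
        "ffmpeg",
        "-i", content_video,
        "-c:v", "libvpx-vp9",  # decode the next input with libvpx to keep VP9 alpha
        "-i", avatar_video,
        # Place the avatar in the bottom-right corner of the content video.
        "-filter_complex", "[0:v][1:v]overlay=W-w:H-h[v]",
        "-map", "[v]",
        "-map", "1:a",  # take the audio track from the avatar video
        output,
    ]


if __name__ == "__main__":
    ssml = build_ssml("Welcome to our product tour.")
    print(json.dumps(build_avatar_request(ssml), indent=2))
    print(" ".join(build_ffmpeg_overlay_cmd("slides.mp4", "avatar.webm", "final.mp4")))
```

The argument list can be passed to `subprocess.run` once FFmpeg is installed; a video editor remains the better choice when timings and effects need manual adjustment.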

The following video was generated using the above workflow with a custom text to speech avatar.

Check out our notebook to create your avatar video today: https://github.com/Azure/gen-cv/tree/main/avatar/video

Interactive Avatar Experience

Here is an example with an avatar acting as a virtual sales agent of an outdoor equipment online shop. She answers customer questions in real time about products or customer accounts and can also place an order.

This outdoor shop demo harnesses the capabilities of the text to speech avatar, Azure OpenAI Service, Azure AI Search, and Azure SQL Database to offer the following features:

Customers can engage in verbal dialogues with the shopping assistant avatar in multiple languages.
The interactive avatar utilizes the Azure OpenAI Service GPT-3.5 model to respond to customer queries.
Beyond leveraging Azure OpenAI Service GPT-3.5, it accesses the outdoor shop’s data sources to answer questions about the product portfolio or customer accounts, such as order status and available loyalty points. Using function calling, Azure OpenAI Service automatically determines when to initiate a search in the product knowledge base or execute a database transaction.
Business transactions, such as ordering products, are processed in real time, provided there is sufficient stock.
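The function calling pattern described above can be illustrated with a small, self-contained Python sketch. With function calling, the model returns the name of a function and its arguments as a JSON string, and the application executes the matching function. Everything else here is an assumption for illustration: the function names `get_stock` and `place_order`, the in-memory `STOCK` dictionary standing in for the database, and the response shapes are hypothetical and are not taken from the demo's code.

```python
import json
from typing import Callable, Dict, List

# Hypothetical in-memory stand-ins for the shop's database.
STOCK: Dict[str, int] = {"tent": 3, "stove": 0}
ORDERS: List[Dict] = []


def get_stock(product: str) -> Dict:
    """Look up how many units of a product are available."""
    return {"product": product, "in_stock": STOCK.get(product, 0)}


def place_order(product: str, quantity: int) -> Dict:
    """Place an order only if the quantity is valid and stock is sufficient."""
    available = STOCK.get(product, 0)
    if quantity < 1:
        return {"status": "rejected", "reason": "invalid quantity"}
    if quantity > available:
        return {"status": "rejected", "reason": "insufficient stock"}
    STOCK[product] = available - quantity
    ORDERS.append({"product": product, "quantity": quantity})
    return {"status": "confirmed", "product": product, "quantity": quantity}


# Registry of functions the model is allowed to call.
TOOLS: Dict[str, Callable[..., Dict]] = {
    "get_stock": get_stock,
    "place_order": place_order,
}


def dispatch(name: str, arguments_json: str) -> str:
    """Run the function the model asked for and return its result as JSON."""
    handler = TOOLS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown function: {name}"})
    arguments = json.loads(arguments_json)
    return json.dumps(handler(**arguments))


if __name__ == "__main__":
    # Simulated model output: a function name plus JSON-encoded arguments.
    print(dispatch("place_order", '{"product": "tent", "quantity": 2}'))
    print(dispatch("place_order", '{"product": "stove", "quantity": 1}'))
```

In a real application, the JSON result would be sent back to the model as a function message so that it can phrase the final answer spoken by the avatar.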

The demo application is a static Azure Web App with a JavaScript user interface that communicates with Azure AI Speech and other components. The Python-based backend orchestrates the communication between Azure OpenAI Service and Azure AI Search, which serves as the product knowledge base, as well as Azure Storage for product images and Azure SQL Database for customer data management.

Here's a glimpse into the outdoor shopping demo experience, showcasing the multilingual capabilities of the avatar feature:

You can find the required resources for creating your own application based on the outdoor shop example here: https://github.com/Azure/gen-cv/tree/main/avatar/interactive. You can customize the solution for your specific needs.

Customer story

We are happy to have a number of customers already working with text to speech avatar, and to share their testimonials at public preview.

“We are using Azure AI Services for our AI Banking Avatar due to the unique combination of leading-edge AI and Visualization services in one platform. By using different Azure AI Speech text to speech avatar we will be able to generate a next level customer experience and really simplify banking and banking interactions.” - Gerald Ertl, Managing Director, Commerzbank AG

“We believe that AI-powered brand assistants will transform the way businesses interact with their customers and manage their brands. It’s for this reason that we are excited by the potential of the text to speech avatar. Whether it’s providing the answers to customer questions, assisting a transaction, or providing entertaining content, the use cases that this technology unlocks are numerous. We’re privileged to be working with Microsoft on this program, as we shape the future of digital experiences together.” - Alex Hamilton, Head of Innovation, UK, Dentsu

Get Started

To learn more and get started, you can first try out the prebuilt text to speech avatars with the no-code tool provided in Speech Studio, which allows you to explore the avatar feature with an intuitive user interface. You need an Azure account and an Azure AI Speech resource before you can use Speech Studio. Please refer to Quick Start to set up.

We are committed to ensuring that our AI solutions are used in a responsible manner, as this is essential for our and our customers' long-term success. Please read the Responsible AI introduction for text to speech avatar at https://aka.ms/TTS-TN

For more information

Apply for access to Custom text to speech avatar
Apply for access to Custom neural voice
Try sample code, Real-time synthesis (SDK), Live chat with Azure OpenAI behind the scenes (SDK)
Join Discord to collaborate and share feedback
