Microsoft Foundry Blog

Create a custom text to speech avatar through self-service

QinyingLiao

Microsoft

Jan 14, 2025

AI avatars are revolutionizing the way we interact with technology. They serve as virtual sales agents assisting customers, personalized service assistants providing 24/7 support, digital teachers bringing lessons to life, and brand representatives in advertising.

Today we are excited to announce that Azure AI Speech service has released a self-service portal in public preview for custom training of text to speech avatars. Creating an avatar for your business that supports both real-time live chats and video generation is now easier than ever. All you need to get started is a consent video and at least 5 minutes of video recordings in total as training data. A state-of-the-art avatar model is just a click away.

Check out the video below for an overview of the public preview of the custom text to speech avatar self-service portal.

In this article, we provide a comprehensive step-by-step guide for developing a custom text-to-speech avatar tailored to your business needs.

Steps in the journey

Before we begin, here is a step-by-step visualization for creating a custom text to speech avatar. We will explain each of these steps in detail below.

Prepare
    Meet Responsible AI requirements
        ∙ Read
        ∙ Fill out custom avatar application
    Cast an avatar performer
        ∙ Define avatar persona
        ∙ Find an avatar performer
    Record performer
        ∙ Record permission statement
        ∙ Record videos
Create in Speech Studio
    Start a new project in Speech Studio
        ∙ Log into Speech Studio using Azure account
        ∙ Create a new custom avatar project
    Upload video data
        ∙ Upload permission statement recording
        ∙ Upload video training data
    Train avatar model
        ∙ Confirm enough data to train
        ∙ Check the quality
    Deploy your avatar model
        ∙ Deploy trained model
Integrate
    Integrate your avatar for video creation or creating your own apps
        ∙ Use your avatar via the TTS avatar tool for video generation, or
        ∙ Build your app with your avatar using the Speech SDK

Responsible AI

Prioritizing responsible AI is fundamental to our text to speech avatar capability. Custom avatar was developed in strict adherence to our responsible AI principles and is offered as a limited access service with eligibility and use case requirements through a controlled registration and review process. To learn more about the responsible AI considerations in the development and usage of this service, review our Azure Text to Speech Transparency Note.

The first step to creating your own custom avatar is filling out a registration form to gain access to the technology. Please make sure to read through the registration form and fill it out completely.

You will be granted access once you have registered, your eligibility has been confirmed, and you have committed to using the feature in alignment with our responsible AI principles.

At this stage we are not offering the service to individual users for personal use.

Persona design

Persona refers to the attributes that make your imaginary character come to life in a way that will resonate with your customers. For example, you may want a woman of around 40 who performs with authority and confidence and who is directly engaging, thoughtful, and unbiased. Think carefully about your persona, because it will represent your company when speaking to customers.

Once the persona is defined, you can cast your performer (avatar talent) for data collection. Make sure your avatar talent has experience with the persona and is comfortable with the gestures or movements you would like to capture. Most importantly, once you have chosen an avatar talent, make sure they are willing to sign a contract with you stating that they will offer their likeness to create an avatar, and a synthetic voice if you would like to use the avatar together with a voice that sounds like the person.

Keep in mind that the look and feel of the avatar created heavily depends on the persona you have designed.

Recording

When choosing an avatar talent, it’s a good idea to consider where your recording will take place. We recommend recording in a professional video recording studio or a well-lit place.

If you need a commercial, multi-scene avatar, the background of the video should be clean, smooth, and a single solid color. A green screen is the best choice.

If your avatar only needs to be used in a single scene, you can record in that specific scene (such as your office), but the background can't then be removed or changed.

The custom text to speech avatar doesn't support customization of clothes or looks. Therefore, it's essential to carefully design and prepare the avatar's appearance when recording the training data.

At least three video clips are required:

Consent video: The consent video must show the same avatar talent speaking the consent statement, following the requirements of that statement.
Naturally speaking: The actor speaks in status 0 (the posture the performer can naturally maintain most of the time while speaking), with natural hand gestures from time to time. Minimum 5 minutes, maximum 30 minutes in total.
Silent status: A 1-minute video clip of the actor holding status 0, relaxed and without speaking. This clip is used as the main template for both the speaking and listening states of a chatbot.

If you would like to add custom gestures, prepare two additional video clips:

Gestures: One 10-second video clip for each gesture. Each custom avatar model can support no more than 10 gestures.
Status 0 speaking: A video clip with the performer speaking for 3 to 5 minutes, representing the posture that the performer can naturally maintain most of the time while speaking. For example, arms crossed in front of the body or hanging down naturally at the sides.

The quality of your avatar model heavily depends on the quality of the recorded videos used for training, so it is critical that the data is collected according to the requirements. For more detailed instructions, best practices, and sample data, check this document.
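Before booking studio time, it can help to check a planned set of clips against the durations listed above. The following is a minimal sketch, not part of Speech Studio: `RecordingPlan`, `check_recording_plan`, and the 10% tolerance applied to the "1-minute" and "10-second" clips are all assumptions made for illustration, and the portal's own validation remains authoritative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RecordingPlan:
    """Planned clip durations in seconds (hypothetical helper type)."""
    has_consent_video: bool
    natural_speaking_clips: List[float]            # one entry per clip
    silent_status_clip: float
    gesture_clips: Dict[str, float] = field(default_factory=dict)
    status0_speaking_clip: Optional[float] = None  # only needed with gestures


def _close_to(value: float, target: float, rel_tol: float) -> bool:
    """True if value is within rel_tol (a fraction) of target."""
    return abs(value - target) <= target * rel_tol


def check_recording_plan(plan: RecordingPlan, rel_tol: float = 0.1) -> List[str]:
    """Return a list of problems; an empty list means the plan looks fine.

    rel_tol is an assumed tolerance, not a documented limit.
    """
    problems = []
    if not plan.has_consent_video:
        problems.append("consent video is missing")

    total = sum(plan.natural_speaking_clips)
    if not 5 * 60 <= total <= 30 * 60:
        problems.append(
            f"naturally speaking footage totals {total / 60:.1f} min; "
            "expected 5 to 30 min"
        )

    if not _close_to(plan.silent_status_clip, 60, rel_tol):
        problems.append("silent status clip should be about 1 minute")

    if plan.gesture_clips:
        if len(plan.gesture_clips) > 10:
            problems.append("no more than 10 gestures are supported per model")
        for name, seconds in plan.gesture_clips.items():
            if not _close_to(seconds, 10, rel_tol):
                problems.append(f"gesture clip '{name}' should be about 10 seconds")
        if plan.status0_speaking_clip is None:
            problems.append("status 0 speaking clip is needed when adding gestures")
        elif not 3 * 60 <= plan.status0_speaking_clip <= 5 * 60:
            problems.append("status 0 speaking clip should be 3 to 5 minutes")

    return problems
```

Calling `check_recording_plan(RecordingPlan(True, [200.0, 200.0], 60.0))` returns an empty list, because the two speaking clips total more than 5 minutes and no gestures are requested.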

Uploading data

Go to the Speech Studio portal and log in with your Azure account. Select Custom avatar (preview) and then Create a project. Open the project, select Set up avatar talent, and then Upload consent video. Next, navigate to Prepare training data and select Upload data to add the data that you prepared in the previous steps. You can upload data from local files on your computer or provide access to Azure Blob Storage.

Data files are automatically validated when you select Submit. Data validation includes a series of checks on the video files to verify their file format, size, and total volume. If there are any errors, fix them and submit again.

After you upload the data, you can check the data overview which indicates whether you have provided enough data to start training. Below is an example of enough data added for training an avatar without additional gestures.

Training your avatar model

Once you have uploaded enough data, you can start training a model. Enter a Name to identify the model, and choose it carefully: the model name is used as the avatar name in synthesis requests made through the SDK and in SSML input. Only letters, numbers, hyphens, and underscores are allowed.

The avatar model name must also be unique. Duplicate names are not allowed under the same Speech or Azure AI resource.
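The naming rules above are easy to check before you submit a training job. This is a small sketch, assuming you keep your own list of names already used under the resource; `validate_model_name` is a hypothetical helper, and the exact-match (case-sensitive) duplicate check is an assumption, since the article does not say how the service compares names.

```python
import re
from typing import Iterable, Optional

# Letters, numbers, hyphens, and underscores only, as stated above.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def validate_model_name(name: str, existing_names: Iterable[str]) -> Optional[str]:
    """Return an error message, or None if the name looks acceptable.

    existing_names is whatever list of names you track yourself for the
    same Speech or Azure AI resource (assumption: compared exactly).
    """
    if not _NAME_PATTERN.fullmatch(name):
        return "name may contain only letters, numbers, hyphens, and underscores"
    if name in set(existing_names):
        return "name is already used under this resource"
    return None
```

For example, `validate_model_name("brand avatar", [])` returns an error message because of the space, while `validate_model_name("brand-avatar_01", [])` returns `None`.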

Training duration varies depending on how much data you use. Training a custom avatar typically takes 20 to 40 compute hours. Check the pricing note on how training is charged.

Deploying your avatar model

After you've successfully created your avatar model, you deploy it to your endpoint.

When a model is deployed, you pay for the continuous uptime of the endpoint regardless of how much you interact with it. Check the pricing note on how model deployment is charged. To reduce spending and conserve resources, you can delete a deployment when the model is not in use.
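Because training is billed by compute hours and deployment by endpoint uptime, a rough budget is simple arithmetic. The sketch below only illustrates that arithmetic: the function names are hypothetical and the rates passed in are placeholders, not actual Azure prices, so take real figures from the pricing note.

```python
from typing import Tuple


def training_cost_range(rate_per_compute_hour: float,
                        min_hours: float = 20,
                        max_hours: float = 40) -> Tuple[float, float]:
    """Low and high training cost for the typical 20 to 40 compute hours."""
    return (min_hours * rate_per_compute_hour,
            max_hours * rate_per_compute_hour)


def deployment_cost(rate_per_hour: float, hours_deployed: float) -> float:
    """Endpoint cost depends on uptime only, not on how often it is called."""
    return rate_per_hour * hours_deployed


# Placeholder rate: compare a full 30-day month of uptime with an endpoint
# that is deleted outside a 10-hour working day.
always_on = deployment_cost(rate_per_hour=1.0, hours_deployed=24 * 30)
working_hours_only = deployment_cost(rate_per_hour=1.0, hours_deployed=10 * 30)
```

With any fixed hourly rate, the second figure is 10/24 of the first, which is the saving the paragraph above refers to when it suggests deleting unused deployments.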

Custom avatar training is currently only available in some regions. After your avatar model is trained in a supported region, you can copy it to a Speech resource in another region for deployment as needed. For more information, see the Speech service regions.

Integrate your avatar model into your chosen platform

After you deploy your custom avatar, it's available to use in Speech Studio or via API:

The avatar appears in the avatar list of the text to speech avatar tool on Speech Studio.
The avatar appears in the avatar list of the live chat avatar tool on Speech Studio.
You can call the avatar from the API by specifying the avatar model name.
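As an illustration of the last option, the sketch below builds a request body for video generation that refers to a custom avatar by its model name. The field names (`inputKind`, `inputs`, `synthesisConfig`, `avatarConfig`, `talkingAvatarCharacter`, `customized`) reflect the batch synthesis REST API as I understand it and should be treated as assumptions to verify against the current API reference. `build_avatar_request` and the avatar name `my-brand-avatar` are hypothetical. No request is sent here.

```python
import json
from typing import Any, Dict


def build_avatar_request(text: str, avatar_model_name: str, voice: str) -> Dict[str, Any]:
    """Build a JSON-serializable body for an avatar video synthesis request.

    Field names are assumptions based on the batch synthesis API; check
    them against the official reference before use.
    """
    return {
        "inputKind": "PlainText",
        "inputs": [{"content": text}],
        "synthesisConfig": {"voice": voice},
        "avatarConfig": {
            # Marks the avatar as a custom model rather than a prebuilt one.
            "customized": True,
            # The model name chosen at training time is the avatar name.
            "talkingAvatarCharacter": avatar_model_name,
        },
    }


payload = build_avatar_request(
    text="Welcome to our store. How can I help you today?",
    avatar_model_name="my-brand-avatar",   # hypothetical model name
    voice="en-US-JennyNeural",
)
body = json.dumps(payload, indent=2)
```

The point to note is that the name entered when training the model is the only identifier the request needs to select your avatar, which is why the name has to be unique within the resource.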

Check out the sample code on GitHub for integrating your avatar with the latest generative AI models, such as Azure OpenAI GPT-4o or the real-time API.

If you're also creating a custom neural voice for the actor, the avatar can be highly realistic. For more information, see custom neural voice overview.

Note that custom neural voice and custom text to speech avatar are separate features. You can use them independently or together.

Customer cases

Custom text to speech avatar has enabled many customers and partners around the world to develop engaging customer service solutions for a variety of industries. These include KPMG, Fujifilm, MAPFRE, Dentsu Digital, Bank SinoPac, Herbalife, Coca-Cola, and more. (Check out their testimonials here.)

In addition, read the story of how CDW is leveraging Azure text to speech avatar in their business solutions.

Get started

Azure text to speech (TTS) avatar is a powerful tool for developers looking to enhance customer engagement and improve overall experience. With a variety of use cases and customer references, it's clear that Azure TTS avatar is paving the way for a new era of customer engagement and innovation. As developers, you can use Azure TTS avatar to create personalized and engaging experiences for your customers and employees with a rich choice of prebuilt avatars and voices available. You can also leverage custom avatar and custom neural voice to create custom synthetic voices and images that represent your brand.

With responsible AI features that promote transparency and fairness, Azure TTS avatar helps you create inclusive and ethical applications that serve a diverse range of users. For more basics of the text to speech avatar service and its responsible AI considerations, check out this blog.

Learn more:

Create a video using prebuilt avatars

Try our live chat demo with prebuilt avatars

Learn how to create a custom avatar

Try our TTS voice demo

Apply for access to custom avatar and custom neural voice

Updated Jan 14, 2025

Version 2.0
