We are pleased to announce two updates to the Azure TTS Avatar offering: the introduction of a high visual fidelity avatar model, enabling customers to create more immersive experiences, and the addition of expressive capabilities for users developing avatar content.
📺 4K Support: Higher Resolution, More Possibilities
Our standard avatar offering is 1080p, which performs effectively across both web and mobile applications. However, to accommodate customers who require higher-resolution avatars suited for larger displays, we have introduced a 4K custom avatar option.
As a quick recap, a custom avatar allows you to create a personalized digital persona using your own assets and configurations, giving you full control over appearance and behavior. The custom avatars are built through self-service workflow, making it easy to generate and deploy avatars tailored to your needs. For a detailed walkthrough of creating a custom avatar, check out our previous blog: Create a Custom Text-to-Speech Avatar Through Self-Service
A 4K avatar is created using video training data with 4K resolution, resulting in higher detail accuracy than a lower-resolution model. The fidelity differences between a 4K avatar and a 1080p avatar can be observed in the samples below.
| 4K detail zoom-in video sample | 1080p detail zoom-in video sample |
| --- | --- |
And here are videos showcasing how customers such as W2M and ServiceNow are using the 4K custom avatar solution on the big screen.
And here is the ServiceNow video.
The 4K avatar works in both landscape and portrait modes. Landscape 4K suits bigger screens and presentations, while Portrait 4K is ideal for mobile devices, vertical signs, and social media.
Training a 4K custom avatar
You can train a 4K avatar using the self-serve tool in Azure AI Foundry.
- Go to Azure AI Foundry, sign in with your Azure account, and create an AI Foundry resource in one of these three service regions: West US 2, West Europe, or Southeast Asia.
- Under Fine-tuning, go to the AI Service fine-tuning tab, where you can create a custom avatar by clicking the Fine-tune button and selecting Custom Avatar in the pop-up list.
- Create a new avatar task, set up the avatar talent, and during the upload data step, ensure the videos are at least 2160p so they can train a 4K avatar.
- During the model training phase, you may select either landscape or portrait orientation and choose between 1080p and 4K resolution when submitting a new model for training.
Tips for 4K avatar training: the 4K avatar model requires 4K video footage, while the portrait model requires portrait-oriented video footage for training. When recording 4K videos, make sure body movements remain centered within the frame.
How to Synthesize 4K Video
By default, the 4K model produces video at a resolution of 3840x2160. However, you may adjust the output resolution while maintaining the original aspect ratio to suit your requirements. For instance, setting the output resolution to 1920x1080 during streaming can help reduce network bandwidth consumption.
Code example:
// Request 1080p output from the 4K avatar model; the original 16:9 aspect ratio is preserved.
const videoFormat = new SpeechSDK.AvatarVideoFormat();
videoFormat.width = 1920;
videoFormat.height = 1080;
For further details, please refer to the following link:
https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/0d852b2115e780cfb4b65343e6c23e67953e8f4e/samples/js/browser/avatar/js/basic.js#L224C1-L225C1
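For context, here is a minimal sketch of how a video format object like the one above feeds into the avatar configuration when building a real-time avatar session with the Speech SDK for JavaScript. The key, region, and avatar character/style names below are placeholders rather than values from this announcement:
// Assumes the Speech SDK browser bundle is loaded (exposing the global SpeechSDK object)
// and that videoFormat was created as shown above. Key, region, and the avatar
// character/style names ("my-custom-avatar", "my-style") are placeholders.
const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<your-speech-key>", "westus2");
const avatarConfig = new SpeechSDK.AvatarConfig("my-custom-avatar", "my-style", videoFormat);
const avatarSynthesizer = new SpeechSDK.AvatarSynthesizer(speechConfig, avatarConfig);
// The session is then started over WebRTC, e.g. avatarSynthesizer.startAvatarAsync(peerConnection).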
Pricing
See our pricing page, Azure AI Speech Pricing | Microsoft Azure, for details on 4K avatar charges.
🎭 Avatar Emotion Control: Voice & Visuals in Sync
We have added a new avatar emotion control feature. It enables users to set the avatar's emotion, which is reflected in both the voice and the facial expression.
This new feature can be experienced with the lisa-casual-standing avatar in batch mode and accessed through the following methods:
- Access through the AI Foundry portal
In the text to speech avatar tool in Azure AI Foundry, users can select the lisa-casual-standing avatar. Because this avatar supports emotion control, an Emotion drop-down menu appears above the text box, listing all emotions available for the voice. Changing the voice updates the emotion options according to the emotions each voice supports.
- Access through SSML
Example to specify emotion during synthesis in SSML:
<speak xmlns=\"http://www.w3.org/2001/10/synthesis\" xmlns:mstts=\"http://www.w3.org/2001/mstts\" xmlns:emo=\"http://www.w3.org/2009/10/emotionml\" version=\"1.0\" xml:lang=\"en-US\">
<voice name=\"en-US-SaraNeural\">
<mstts:express-as style=\"angry\">I'm angry because I care, and it hurts to see things fall apart like this.</mstts:express-as>
</voice>
</speak>
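Since the emotion-enabled avatar currently runs in batch mode, here is a minimal sketch of submitting SSML like the example above as a batch avatar synthesis job. The endpoint shape, api-version, and request fields are assumptions based on the public batch synthesis reference and may differ from the latest API, so verify them against the current documentation:
// Sketch: submit an SSML input as a batch avatar synthesis job.
// URL, api-version, and body fields are assumptions; check the current
// batch synthesis API reference for the exact contract.
const ssml = `<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" version="1.0" xml:lang="en-US">
  <voice name="en-US-SaraNeural">
    <mstts:express-as style="angry">I'm angry because I care.</mstts:express-as>
  </voice>
</speak>`;
const response = await fetch(
  "https://westus2.api.cognitive.microsoft.com/avatar/batchsyntheses/emotion-demo?api-version=2024-08-01",
  {
    method: "PUT",
    headers: {
      "Ocp-Apim-Subscription-Key": "<your-speech-key>",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      inputKind: "SSML",
      inputs: [{ content: ssml }],
      avatarConfig: {
        talkingAvatarCharacter: "lisa",
        talkingAvatarStyle: "casual-standing",
      },
    }),
  }
);
console.log(response.status); // a 2xx status indicates the job was accepted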
Facial expressions are driven by the selected voice emotion, so users need to specify the desired voice emotion before synthesis. If the chosen voice does not support any of the available emotions, or if no emotion is selected, the system defaults to "neutral" and the avatar speaks with a neutral expression.
The following voice emotions are currently supported for synchronizing with avatar expressions:
- Cheerful
- Shouting
- Friendly
- Unfriendly
- Hopeful
- Disgruntled
- Funny
- Surprised
- Angry
- Excited
Samples
Watch the demo video featuring Lisa Casual Standing with emotion control enabled. The video demonstrates how adjustments in tone and facial expression affect the avatar's appearance. Several samples are included to show the outcome of different emotion settings.
Custom avatar emotion support
Next, we will enable emotion control for custom avatars and preview the feature with a few customers first. Please contact us if you want to build emotion-controlled avatar models.
Subjective evaluation
An Expressiveness Mean Opinion Score test was used to evaluate perceptual quality, measuring how effectively the speaker conveys emotions, intentions, and natural expressions through facial movements and tone. This score indicates how engaging and lifelike the speaker appears in the video.
| Solution | Expressiveness MOS↑ |
| --- | --- |
| Lisa-casual-standing | 4.15 |
| Company A | 3.88 |
This update represents substantial progress toward making avatars more realistic, expressive, and adaptable to varied content requirements. We look forward to seeing the innovative applications users will develop with these advancements.