Azure Custom Neural Voice introduces new emotional styles to support brand voices
Published Dec 13 2022 12:43 AM
Microsoft

Custom Neural Voice, a feature of Azure Cognitive Services for Speech, lets you create a highly natural synthetic voice that sounds almost identical to your voice actor. This synthetic voice can then be used in a variety of scenarios, including voice assistants, customer service bots, audiobooks, language learning, news reading, and many more. Since its launch, Custom Neural Voice has empowered organizations such as AT&T, Duolingo, Progressive, and Swisscom to develop branded speech solutions that delight users. (For more details, read the Innovation Stories blog.)

 

Microsoft has made it simple for customers to train a branded voice or character voice by using a small set of recordings from their voice actor (from 300 to 2,000 sentences or short phrases, which is about 30 minutes to 3 hours of speech data). During the past months, we’ve received more and more customer requests to support multiple speaking styles for their voice. For example, customers using the Custom Neural Voice capability to create voices for NPCs (non-player characters) in video games would like to enable their voice to express different emotions like happiness, anger, fear, etc. in different scenarios. Call-center solution providers would like their voice to speak in cheerful and empathetic styles for different customer responses.

 

Today we are glad to introduce the multi-style capability of Custom Neural Voice, a new feature in public preview that enables users to create a brand or character voice that speaks with different emotions. This new feature makes it easy for customers to create and expand their custom voices to sound more natural with emotions and to support more scenarios, with no additional training data required.

 

Custom Neural Voice with multiple speaking styles

 

The multi-style capability of Custom Neural Voice enables two use cases: 1) you can use it to extend your voice to 10 preset styles without adding emotional training data; 2) you can create a custom style by providing your own training data in that speaking style.

 

In our earlier blogs, we introduced what Custom Neural Voice is, how it can benefit customers' businesses, and the steps for creating a custom neural voice in detail. To create a voice that speaks with multiple styles, you only need to select the ‘Neural – multi style’ training option.

 


 

Then select the styles you want to enable for your voice. You can select up to 10 preset styles.

 


 

By providing your own training data, you can create a custom style beyond these 10 preset styles. Click “Add a custom style” and specify a training set for each of your custom styles. Name your custom styles carefully, because the name is what you will use in SSML in your code to specify the style of your voice. Only letters, digits, underscores ('_'), and hyphens ('-') are allowed in a style name.
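
The naming rule for custom styles can be checked before you submit a name. The following is a minimal sketch, not the service's own validation: it assumes the rule refers to ASCII letters and digits, and it enforces no length limit because none is stated here. The function name `is_valid_style_name` and the sample names are hypothetical.

```python
import re

# Assumption: "letters, digits, underscores, and hyphens" means ASCII
# characters. No length limit is enforced because none is stated.
_STYLE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_style_name(name: str) -> bool:
    """Return True if `name` uses only letters, digits, '_' and '-'."""
    return _STYLE_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical style names, for illustration only.
    for candidate in ["calm_support", "news-2", "very happy", "sad!"]:
        print(candidate, is_valid_style_name(candidate))
```

A check like this catches names containing spaces or punctuation early, before they end up in SSML attributes in your code.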

 


 

Once the voice is created, you’ll be presented with a set of default samples for each style, so you can easily listen to the voice and check how it sounds in each speaking style. Below are a few samples from a female voice created with 300 utterances as training data, in its default (general) speaking style and with the 10 preset emotions enabled.

 

Samples (female voice): training data (human voice); default style (general); and the 10 preset styles: angry, cheerful, excited, friendly, hopeful, sad, shouting, terrified, unfriendly, and whispering.

 

After you've tested the voice and styles, you can deploy the voice to a custom neural voice endpoint (see how to deploy and use your voice model) and use the Audio Content Creation tool to create audio with your deployed voice in different speaking styles. Alternatively, specify the speaking styles using SSML in your code via the Speech SDK (see details here).
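
To illustrate the SSML route, here is a minimal sketch that builds an SSML document selecting a speaking style through the `mstts:express-as` element. The helper name `build_styled_ssml` and the voice name `MyBrandVoice` are hypothetical placeholders; substitute the name of your own deployed voice and one of the styles you enabled during training.

```python
from xml.sax.saxutils import escape, quoteattr


def build_styled_ssml(voice_name: str, style: str, text: str,
                      lang: str = "en-US") -> str:
    """Build an SSML string that speaks `text` in the given style.

    `voice_name` is the name of your deployed custom voice (a placeholder
    in the demo below). Text and attribute values are XML-escaped.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice_name)}>'
        f'<mstts:express-as style={quoteattr(style)}>'
        f'{escape(text)}'
        '</mstts:express-as>'
        '</voice></speak>'
    )


if __name__ == "__main__":
    # "MyBrandVoice" is a hypothetical voice name, for illustration only.
    print(build_styled_ssml("MyBrandVoice", "cheerful", "Thanks & welcome!"))
```

The resulting string can then be passed to the Speech SDK, for example to the synthesizer's `speak_ssml_async` method, with the speech configuration pointing at your custom voice endpoint.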

 

The multi-style capability of Custom Neural Voice currently supports only English (US). Additional language support is coming soon.

 

Some early customers, such as Adthos, have already adopted this capability to enable speaking styles for their voices. Check out a sample of the voices with different styles below.

 

Technology behind: ‘Style Transfer’

 

To enable style support for Custom Neural Voice without requiring customer training data for each style, we have applied a technology called “Style Transfer” to build speaking styles efficiently. Style Transfer is a method for applying the speaking tone and prosody (i.e., pace, intonation, rhythm) of one speaker (the source speaker) to another speaker (the target speaker). The result is that the target speaker adopts the tone and prosody of the source speaker yet keeps their own voice timbre.

 

Conventionally, to build a voice style for TTS, we need to collect style recordings (e.g., emotional speech) from the original voice actor. However, sometimes we are unable to gather enough emotional data because of voice actor availability or gaps in the voice actor's emotional range.

 

Style Transfer addresses this challenge. (See our Interspeech 2021 paper for details.) The main idea is that we 1) learn the high-level style representation that contributes to the style expression, 2) predict the fine-grained prosody features that largely embody the expression, 3) post-process the style representation and prosody features to make them compatible with the target speaker, accounting for basic differences between the source and target speakers, and 4) feed them to a TTS decoder designed to respond well to the input style representation and prosody features.

 

In this way, the speaking style of the source speaker can be learned and transferred to any target speaker, even for styles that differ markedly in vocal effort from normal styles, such as whispering and shouting. The style transfer TTS model leverages the building blocks of UniTTS v4. Given as few as 100 recorded utterances from the source speaker, we can transfer the style to a target speaker with very good naturalness (MOS gap to source emotional recording < 0.2) and speaker similarity (SMOS 4.2 to target speaker). This technique has been widely adopted in expanding the styles of the en-US platform voices.

 


 

With this technology, we can also transfer the speaking styles from the Microsoft voice library to customers’ own voices, allowing Custom Neural Voice customers to create voices that can speak in multiple styles without providing style-specific training data.

 

Get started

 

Interested in building a custom neural voice? Sign up for the Speech service on Azure and get started in Speech Studio. Learn about responsible use of Custom Neural Voice and submit your request form to gain access to the service.

 

Besides the capability to create your own brand voice, Microsoft offers over 400 neural voices covering more than 140 languages and locales. With these Text-to-Speech voices, you can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users.

 

For more information:

Learn more about Responsible Use of Custom Neural Voice/Guidelines/Terms

 

Last update: Dec 13 2022 01:59 AM