Azure Neural TTS releases 5 new voices and expands emotions in American English
Published May 09 2022 06:00 AM
Microsoft

 This post is co-authored with Peter Pan, Andy Beatman and Qinying Liao

 

Neural Text to Speech (Neural TTS) is a powerful speech synthesis capability of Azure Cognitive Services. You can convert text to speech in ways that mirror natural speaking styles.

 

Businesses use Neural TTS in many scenarios, including voice assistants, video games, online learning, and accessibility tools that read content aloud. Check out these customer stories featuring companies like Vodafone, Vegas and Pearson that are using Neural TTS to transform their business. To support these diverse use cases and make voice experiences sound more natural, customers need a richer selection of voices and a wider variety of speaking styles, especially emotions.

 

Today we are excited to announce the release of 5 new neural voices in American English (en-US) and 10 new speaking styles: 8 emotions, plus shouting and whispering. Customers can use the new speaking styles with nine en-US voices, including the 5 new ones. With these updates, Azure TTS helps customers build apps that sound closer to human voices and can express emotions. The new voices and styles are currently in preview.

 

5 new neural TTS voices in en-US

With the 5 new voices added to the portfolio, Neural TTS now supports 20 voices in American English, giving customers a wider choice of voice personas for more scenarios.

The table below lists the new members of the en-US voice family, with samples so you can hear how they sound. You can also try your own text with these voices on this demo.

 

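| Voice | Gender | Sample  |
|-------|--------|---------|
| Davis | Male   | (audio) |
| Jane  | Female | (audio) |
| Jason | Male   | (audio) |
| Nancy | Female | (audio) |
| Tony  | Male   | (audio) |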

 

Check out a full list of American English voices here.  

 

New styles added to en-US voices

With this release, we extend speaking styles to more voices. Azure TTS now offers 8 emotions, plus shouting and whispering, on nine en-US voices: Aria, Davis, Guy, Jane, Jason, Jenny, Nancy, Sara, and Tony.

 

Emotions

We have built a number of new emotional styles for both male and female voices. The emotions currently enabled in en-US voices are cheerful, sad, angry, excited, friendly, unfriendly, hopeful, and terrified.

Below are samples from Jenny, one of the voices with emotions enabled. Hear how each emotion differs from the others:

 

 

| Style      | Style description | Sample (Jenny) |
|------------|-------------------|----------------|
| Cheerful   | Expresses a positive and happy tone. | (audio) |
| Sad        | Expresses a sorrowful tone. | (audio) |
| Angry      | Expresses an angry and annoyed tone. | (audio) |
| Excited    | Expresses an upbeat and hopeful tone. It sounds like something great is happening and the speaker is really happy about it. | (audio) |
| Friendly   | Expresses a pleasant, inviting and warm tone. It sounds sincere and caring. | (audio) |
| Unfriendly | Expresses a cold and indifferent tone. | (audio) |
| Hopeful    | Expresses a warm and yearning tone. It sounds like something good will happen to the speaker. | (audio) |
| Terrified  | Expresses a very scared tone, with a faster pace and a shakier voice. It sounds like the speaker is in an unsteady and frantic state. | (audio) |

 

 

Check out more samples of these emotions on en-US voices:

 

[Audio samples, one male and one female voice per style: Cheerful, Sad, Angry, Excited, Hopeful, Friendly, Unfriendly, Terrified]

 

Shouting and whispering

Azure TTS supports shouting and whispering styles for the first time. With the shouting style, the voice sounds like someone speaking from far away or trying to be heard clearly in a noisy place. With the whispering style, the voice sounds like it is speaking in private or telling a secret. These 2 styles let characters in video games, audiobooks, and films speak more vividly with Azure TTS.

Here are shouting and whispering TTS samples from the Jenny voice.

 

| Style      | Style description | Sample  |
|------------|-------------------|---------|
| Shouting   | Speaks as if from far away or from outside, trying to be heard clearly. | (audio) |
| Whispering | Speaks very softly, with a quiet and gentle sound. | (audio) |

 

More samples from other voices:

[Audio samples, one male and one female voice per style: Shouting, Whispering]

 

Check out a full list of speaking styles we support for en-US voices here.

 

The technology behind: Style Transfer to build voice styles at scale

To enrich style support and keep style parity across as many TTS voices as possible, we have applied a technology called “Style Transfer” to build speaking styles efficiently. Style Transfer is a method for applying the speaking tone and prosody (i.e., pace, intonation, rhythm) of one speaker (the source speaker) to another speaker (the target speaker). As a result, the target speaker adopts the tone and prosody of the source speaker while keeping their own voice timbre.

Conventionally, to build a voice style for TTS, we need to collect style recordings, e.g., emotional speech, from the original voice actor. However, sometimes we cannot gather enough emotional data because of voice actor availability or gaps in the voice actor's emotional range.

Style Transfer addresses this challenge effectively (see our Interspeech 2021 paper for details). With as few as 100 recorded utterances, we can learn a speaking style and apply it to a target speaker with good quality on top of UniTTS v4: the mean opinion score (MOS) gap to the source emotion recordings is less than 0.2. We have used this technique widely to expand the styles of these en-US platform voices.


 

How to use

The new voices and style expansion are in public preview. The 5 new voices (Davis, Jane, Jason, Nancy, and Tony) are available in only three service regions: East US, West Europe, and Southeast Asia. For the existing voices, including Aria, Jenny, Guy and Sara, the new and expanded speaking styles are accessible in all service regions.
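The availability rules above can be captured in a small lookup, which may be handy for failing fast before a synthesis call. This is only a sketch of the rules as stated in this post: the function name is hypothetical, and the short region identifiers (`eastus`, `westeurope`, `southeastasia`) are assumed to correspond to the region display names given above.

```python
# Availability as described in this post (public preview); not an official API.
NEW_VOICES = {"Davis", "Jane", "Jason", "Nancy", "Tony"}
EXISTING_STYLED_VOICES = {"Aria", "Guy", "Jenny", "Sara"}

# Assumption: short identifiers for East US, West Europe, and Southeast Asia.
NEW_VOICE_REGIONS = {"eastus", "westeurope", "southeastasia"}


def is_available(voice: str, region: str) -> bool:
    """Return True if the voice and its new styles are usable in the region."""
    if voice in NEW_VOICES:
        # The 5 new voices are limited to three regions during the preview.
        return region in NEW_VOICE_REGIONS
    if voice in EXISTING_STYLED_VOICES:
        # Existing voices get the new styles in all service regions.
        return True
    raise ValueError(f"{voice!r} is not one of the nine voices named in this post")
```

Because the restriction applies only during the preview, a table like this would need to be revisited once the voices become generally available.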

 

Below is a short SSML snippet that uses the 'mstts:express-as' element to apply a speaking style:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <mstts:express-as style="cheerful">
            That'd be just amazing!
        </mstts:express-as>
    </voice>
</speak>
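If you generate SSML programmatically, building it with an XML library avoids escaping mistakes and makes it easy to switch styles within one voice. The sketch below uses only the Python standard library; `build_ssml` and `STYLES` are hypothetical names, and the style values are assumed to be the lowercase forms of the ten styles named in this post.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"  # as written in the snippet above
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Assumption: style attribute values are the lowercase style names from this post.
STYLES = {
    "cheerful", "sad", "angry", "excited", "friendly",
    "unfriendly", "hopeful", "terrified", "shouting", "whispering",
}

# Serialize with the same prefixes as the hand-written snippet.
ET.register_namespace("", SSML_NS)
ET.register_namespace("mstts", MSTTS_NS)


def build_ssml(voice: str, segments, lang: str = "en-US") -> str:
    """Build an SSML document with one express-as element per (style, text) pair."""
    speak = ET.Element(
        f"{{{SSML_NS}}}speak", {"version": "1.0", f"{{{XML_NS}}}lang": lang}
    )
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    for style, text in segments:
        if style not in STYLES:
            raise ValueError(f"unknown style: {style!r}")
        seg = ET.SubElement(voice_el, f"{{{MSTTS_NS}}}express-as", {"style": style})
        seg.text = text  # ElementTree escapes &, < and > on output
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml(
    "en-US-JennyNeural",
    [("cheerful", "That'd be just amazing!"), ("whispering", "But keep it a secret.")],
)
```

Validating the style name locally is a design choice for this sketch; the service itself decides which styles a given voice supports.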

 

You can also easily create audio files with these voices and styles using our Audio Content Creation tool, without writing a single line of code.
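If you would rather call the service from code, one option is the Speech service REST endpoint. The sketch below only constructs the HTTP request and does not send it. The endpoint pattern and header names reflect our understanding of the text-to-speech REST API rather than anything stated in this post, so check them against the current documentation; `build_tts_request` and the `User-Agent` value are hypothetical.

```python
import urllib.request


def build_tts_request(ssml: str, region: str, key: str) -> urllib.request.Request:
    """Prepare (but do not send) a text-to-speech request carrying an SSML body."""
    # Assumption: regional endpoint pattern of the Speech service REST API.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,          # Speech resource key
        "Content-Type": "application/ssml+xml",    # the body is SSML
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",  # WAV output
        "User-Agent": "style-demo",                # hypothetical app name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


# Sending requires a real key and network access, for example:
# request = build_tts_request(ssml, "eastus", "<your-key>")
# with urllib.request.urlopen(request) as response:
#     with open("out.wav", "wb") as f:
#         f.write(response.read())
```

Keeping request construction separate from sending makes the function easy to inspect or unit test without a subscription key.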

 

What’s next

We are inspired by how Style Transfer technology allows us to bring new voice styles to our customers. In the future, we expect to apply this process to other languages to improve the global reach and accessibility of TTS. In addition, we are evaluating the potential to bring Style Transfer to Custom Neural Voice (CNV). With this in place, a custom neural voice would be able to feature multiple styles without needing additional recording data. If you are interested in learning more about Style Transfer for CNV, please send us a note at mstts[at]microsoft.com.

 

Get started

For more information, please visit below:

 

 

Last update: Jan 17 2023 12:20 AM