Azure AI Foundry Blog

Creating a branded AI voice that conveys emotions and speaks multiple languages

Microsoft

Jul 18, 2023

Today at Microsoft Inspire 2023, we're excited to announce the general availability (GA) of the new multi-style and multi-lingual custom neural voice (CNV) features inside Text to Speech, part of the Azure AI Speech capability. This new technology allows you to create a natural branded voice capable of expressing different emotions and speaking different languages.

Custom neural voice, a feature of Azure AI Speech, is a great way to create a one-of-a-kind voice that is natural and sounds identical to your voice actor. Since its launch, custom neural voice has empowered organizations such as AT&T, Progressive, Vodafone, and Swisscom to develop branded speech solutions that delight users in scenarios including voice assistants, customer service bots, audiobooks, language learning, news reading, and many more. (For more details, read the Innovation Stories blog.)

To answer customer requests for more expressive voices across the globe, we previously released a preview version of the CNV multi-style and multi-lingual capabilities. Now, with the general availability of these features, we are upgrading the models to support more languages and improve voice naturalness.

In this blog post, we introduce what’s new and share a step-by-step guide to help you harness the power of this new feature.

Multi-style CNV GA: enable your voice to convey emotions

Branded voices have become increasingly popular in various scenarios, such as voice assistants, news reading, and audiobook creation. However, customers often request support for voice emotions and styles that would enhance their end-users' experience. Using the multi-style CNV feature, customers can create voices that speak in multiple styles/emotions without adding new training data, through Style Transfer.

Style Transfer is a method for applying the speaking tone and prosody (i.e., pace, intonation, rhythm) of one speaker (the source speaker) to another speaker (the target speaker). As a result, the target speaker adopts the tone and prosody of the source speaker while keeping their own voice timbre.

With its GA, we have updated the Style Transfer model for English (US) and expanded its support to Chinese (Mandarin, Simplified) and Japanese.

English (US)

The prebuilt styles for en-US include Angry, Cheerful, Excited, Friendly, Hopeful, Sad, Shouting, Terrified, Unfriendly, and Whispering. You can review the previous samples in this blog post. With the GA update, we released a more robust English (US) multi-style model that improves the naturalness of the generated speaking styles.

Chinese (Mandarin, Simplified)

Chinese is a new language we support for CNV with multiple styles. For zh-CN, the prebuilt styles are Angry, Calm, Chat, Cheerful, Disgruntled, Fearful, Sad, and Serious.

Sample	Male voice	Female voice
Training data (human voice)
Default style (general)
Angry
Calm
Chat
Cheerful
Disgruntled
Fearful
Sad
Serious

Japanese

Finally, for ja-JP, another new language that we support for multi-style CNV, the prebuilt styles are Angry, Cheerful, and Sad.

Sample	Male voice	Female voice
Training data (human voice)
Default style (general)
Angry
Cheerful
Sad

Beyond these styles, you can also create your own speaking style for the same voice, with your style training data available.
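The prebuilt styles listed above can be captured in a small lookup table, which is handy for checking a requested style before training or synthesis. The sketch below is only an illustration: the names `PREBUILT_STYLES`, `is_prebuilt_style`, and `unsupported_styles` are hypothetical, and writing the styles in lowercase is an assumption about how they are referenced later in SSML. The ja-JP entry follows the sample table.

```python
# Prebuilt multi-style presets per locale, transcribed from this post.
# Lowercase spelling is an assumption, not something the post specifies.
PREBUILT_STYLES = {
    "en-US": ["angry", "cheerful", "excited", "friendly", "hopeful",
              "sad", "shouting", "terrified", "unfriendly", "whispering"],
    "zh-CN": ["angry", "calm", "chat", "cheerful", "disgruntled",
              "fearful", "sad", "serious"],
    "ja-JP": ["angry", "cheerful", "sad"],
}


def is_prebuilt_style(locale: str, style: str) -> bool:
    """Return True if `style` is a prebuilt preset for `locale` (case-insensitive)."""
    return style.strip().lower() in PREBUILT_STYLES.get(locale, [])


def unsupported_styles(locale: str, requested: list) -> list:
    """Return the requested styles that are not prebuilt for `locale`.

    Anything returned here would need your own style training data.
    """
    return [s for s in requested if not is_prebuilt_style(locale, s)]


needs_custom_data = unsupported_styles("zh-CN", ["Calm", "Excited"])
```

Here `needs_custom_data` holds the styles that are not in the zh-CN preset list and would therefore require custom style data.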

How to create a multi-style voice

To create a multi-style voice, you only need to prepare a small set of voice samples (at least 300 utterances) recorded in the voice's default style.

After your data is imported to the Speech Studio portal, select ‘Neural – multi style’ as the training method.

Then select the target speaking styles you want to enable from the preset style list. At this step you can also provide your own style data to create new speaking styles for the same voice.

Training can take 20 to 40 hours, depending on the size of the training data, the language, and the styles you select. Once the model is created, you can review the system-generated audio samples to test the voice quality. Check more detailed instructions here.

After you've tested the voice and styles, you can deploy the voice to a custom neural voice endpoint (see how to deploy and use your voice model) and use the Audio Content Creation tool to create audio with your deployed voice in different speaking styles. Alternatively, specify the speaking styles using SSML in your code via the Speech SDK (see details here).
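As a rough sketch of the SSML route, the snippet below builds a request that wraps the text in an `mstts:express-as` element and, optionally, sends it to a deployed endpoint with the Speech SDK for Python. The voice name `MyBrandVoice`, the helper names, and the key, region, and endpoint ID parameters are placeholders, not values from this post. The `synthesize` function assumes the `azure-cognitiveservices-speech` package is installed and is not called here.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_style_ssml(text, voice_name, style, lang="en-US"):
    """Wrap `text` in SSML that asks `voice_name` to speak in `style`."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xmlns:mstts="{MSTTS_NS}" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice_name)}>'
        f'<mstts:express-as style={quoteattr(style)}>'
        f'{escape(text)}'
        f'</mstts:express-as></voice></speak>'
    )


def synthesize(ssml, key, region, endpoint_id):
    """Send SSML to a deployed custom voice endpoint and return audio bytes.

    Assumes `pip install azure-cognitiveservices-speech`; all arguments are
    placeholders for your own resource key, region, and deployment ID.
    """
    import azure.cognitiveservices.speech as speechsdk  # imported lazily

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    config.endpoint_id = endpoint_id  # points the request at the custom voice deployment
    # audio_config=None keeps the audio in memory instead of playing it.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=None)
    result = synthesizer.speak_ssml_async(ssml).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Synthesis did not complete: {result.reason}")
    return result.audio_data


# Hypothetical voice name; use the name of your own deployed model.
ssml = build_style_ssml("Your order has shipped!", "MyBrandVoice", "cheerful")
```

Escaping the text and attribute values keeps the SSML well-formed when the input contains characters such as `&` or `<`.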

Multi-lingual CNV GA: adapt your voice to speak different languages

In today's interconnected world, developers are expected to build voice-enabled applications that can reach a global audience. Powered by cross-lingual adaptation technology, the CNV feature lets you create a custom voice that can speak dozens of languages without adding language-specific training data.

The cross-lingual model is a single unified model that is trained with data from different speakers and languages. This allows the model to transfer the voice of a speaker from one language to another. The backbone of the cross-lingual model is based on Conformer, which combines convolutional neural networks and transformers to efficiently model both local and global dependencies in a data sequence. To address the data imbalance across locales, we adopted a data-balanced training strategy that improves model performance in low-resource locales. Moreover, we jointly trained the model with a speaker classifier, which minimizes the cross-lingual speaker similarity loss and improves speaker similarity in cross-lingual scenarios. Additionally, the new model can leverage information from L1 (first language) speakers to further improve the naturalness of cross-lingual speech.
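The post does not say how the data-balanced training strategy is implemented. One common way to rebalance multilingual training data, shown here purely as an illustration and not as the method used for CNV, is exponent-smoothed sampling: each locale is sampled with probability proportional to its data size raised to a power `alpha` between 0 and 1. The function name, the `alpha` value, and the hours per locale are all hypothetical.

```python
def balanced_sampling_probs(hours_per_locale, alpha=0.5):
    """Return per-locale sampling probabilities proportional to n ** alpha.

    alpha=1.0 reproduces the raw data distribution; smaller values shift
    probability mass toward low-resource locales.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    weights = {loc: n ** alpha for loc, n in hours_per_locale.items()}
    total = sum(weights.values())
    return {loc: w / total for loc, w in weights.items()}


# Hypothetical hours of training audio per locale (illustrative numbers only).
hours = {"en-US": 900.0, "de-DE": 90.0, "id-ID": 10.0}

raw = balanced_sampling_probs(hours, alpha=1.0)       # proportional to data size
balanced = balanced_sampling_probs(hours, alpha=0.5)  # flattened distribution
```

With these made-up numbers, the low-resource locale is drawn more often under `balanced` than under `raw`, at the expense of the largest locale.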

With the general availability of the cross-lingual CNV feature, you can create your voice in one of the languages listed and adapt it to speak your desired languages in the list: Chinese (Mandarin, Simplified), Dutch (Netherlands), English (Australia), English (UK), English (US), French (Canada), French (France), German (Germany), Indonesian, Italian, Japanese, Korean, Portuguese (Brazil), Russian, Spanish (Mexico), Spanish (Spain). More languages are available upon request.

Below is an example of an English (US) voice and its localized versions in other languages.

Sample	Script	Voice sample
Training data (human voice, English - US)	While birds might be happy, they sing in order to communicate.
Chinese (Mandarin, Simplified)	你可以把刚才的话再重复一遍吗？
Dutch (Netherlands)	Het is hartstikke leuk om op een jonge leeftijd te beginnen met duiken.
English (UK)	I can chill out a little in such a hot summer. Are you good at swimming?
French (Canada)	Quel est le légume le plus explosif? La roquette.
German (Germany)	Das macht mich neugierig. Danke für deine Hilfe.
Italian	Che bello, sono molto contento. Indosserò dei pantaloncini adatti.
Japanese	春の桜は、とてもきれいなので、写真を撮っています。
Korean	아니요. 저는 동물도 광물도 아니고 채소는 더더욱 아니에요.
Portuguese (Brazil)	Aquele bebê é muito engraçado!
Russian	Питайтесь разумно, и вы всегда будете в отличной форме.
Spanish (Mexico)	Bueno, haré mi mejor esfuerzo. Naranja dulce limón partido, dame un abrazo que yo te pido, si fueran falsos mis juramentos, en otro tiempo se olvidarán.

How to localize your voice to different languages

To create a voice that can speak multiple languages, select ‘Neural – cross lingual’ as the training method.

Then select the target language you want the voice to speak. The CNV platform will adapt your voice to that language.

Training takes around 20 hours, depending on the size of your training data and the language. Once the model is created, you can evaluate its quality by listening to the test samples in the target language. See more instructions here.

Once the model is deployed, you can provide the text input in the target language and generate your content with the voice, again through the Audio Content Creation tool or through the synthesis service via the Speech SDK (see details here).
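For the Speech SDK route, a minimal sketch of preparing target-language input might look like the following. It maps the languages named in this post to standard locale codes and builds plain SSML with `xml:lang` set to the target locale. The helper name and the voice name `MyBrandVoiceGerman` are hypothetical, and whether your deployed cross-lingual voice requires `xml:lang` to match the target locale is an assumption to verify against the service documentation.

```python
from xml.sax.saxutils import escape, quoteattr

# Languages named in this post, mapped to standard locale codes.
CROSS_LINGUAL_LOCALES = {
    "Chinese (Mandarin, Simplified)": "zh-CN",
    "Dutch (Netherlands)": "nl-NL",
    "English (Australia)": "en-AU",
    "English (UK)": "en-GB",
    "English (US)": "en-US",
    "French (Canada)": "fr-CA",
    "French (France)": "fr-FR",
    "German (Germany)": "de-DE",
    "Indonesian": "id-ID",
    "Italian": "it-IT",
    "Japanese": "ja-JP",
    "Korean": "ko-KR",
    "Portuguese (Brazil)": "pt-BR",
    "Russian": "ru-RU",
    "Spanish (Mexico)": "es-MX",
    "Spanish (Spain)": "es-ES",
}


def build_localized_ssml(text, voice_name, language):
    """Build SSML for `text` written in `language`, spoken by `voice_name`."""
    if language not in CROSS_LINGUAL_LOCALES:
        raise ValueError(f"Language not in the list from this post: {language}")
    locale = CROSS_LINGUAL_LOCALES[language]
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(locale)}>'
        f'<voice name={quoteattr(voice_name)}>{escape(text)}</voice>'
        '</speak>'
    )


# Hypothetical name of a cross-lingual model adapted to German.
ssml_de = build_localized_ssml(
    "Das macht mich neugierig. Danke für deine Hilfe.",
    "MyBrandVoiceGerman",
    "German (Germany)",
)
```

The resulting string can be passed to the synthesis service in the same way as any other SSML request.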

Get started

With the multi-style and multi-lingual custom neural voice feature, Microsoft has made significant strides towards creating highly natural branded voices capable of expressing different emotions and speaking multiple languages. This new technology is a game-changer for developers who are looking to build voice-enabled applications that can communicate with a global audience seamlessly.

Custom neural voice is a limited-access service, part of Microsoft’s commitment to responsible AI. Fill out an application to gain access to the technology, and follow the responsible deployment guidelines when you use it.

For more information:

Add Text-to-Speech to your apps today

Try our demo to listen to existing neural voices

Custom neural voice portal

Apply for access to custom neural voice

Learn more about Responsible Use of Custom Neural Voice/Guidelines/Terms

Updated Nov 27, 2024

Version 2.0