Microsoft Foundry Blog

Ignite 2020 Neural TTS updates: new language support, more voices and flexible deployment options

QinyingLiao

Microsoft

Sep 22, 2020

This post was co-authored by Garfield He, Melinda Ma, Yueying Liu and Yinhe Wei

Neural Text-to-Speech (Neural TTS), a powerful speech synthesis capability of Cognitive Services on Azure, enables you to convert text to lifelike speech that is close to human parity. Since its launch, many Azure customers have adopted it in a variety of scenarios, from voice assistants to audio content creation. We continue to push the envelope to enable more developers to add natural-sounding voices to their applications and solutions.

Today, we are happy to announce a series of updates to Neural TTS that extend its reach globally and allow developers to deploy it wherever their data resides. These updates include newly supported languages, new voices with rich personas, and on-premises deployment through Docker containers.

**18 new languages/locales supported**

Neural TTS has now been extended to support 18 new languages/locales. They are Bulgarian, Czech, German (Austria), German (Switzerland), Greek, English (Ireland), French (Switzerland), Hebrew, Croatian, Hungarian, Indonesian, Malay, Romanian, Slovak, Slovenian, Tamil, Telugu and Vietnamese.

You can hear samples of these voices below.

Locale	Language	Gender	Voice	Sample
bg-BG	Bulgarian	Female	Kalina	Архитектурното културно наследство в България е в опасност.
cs-CZ	Czech	Female	Vlasta	Policisté většinou chodí v uniformě a jsou označeni hodnostmi.
de-AT	German (Austria)	Female	Ingrid	Ab Herbst werden Lehrer, die sich dafür interessieren, eigens ausgebildet.
de-CH	German (Switzerland)	Female	Leni	Dreizehn Millionen Liter mehr als im Vorjahr.
el-GR	Greek	Female	Athina	Για να βρεις ποιος σε εξουσιάζει, απλώς σκέψου ποιος είναι αυτός που δεν επιτρέπεται να κριτικάρεις .
en-IE	English (Ireland)	Female	Emily	Now we have seventy members and two dragon boats.
fr-CH	French (Switzerland)	Female	Ariane	Chaque équipe jouera donc 5 matchs de 20 minutes dans sa poule.
he-IL	Hebrew (Israel)	Female	Hila	הכל פתוח במאבק על המקום האחרון לפלייאוף העליון של ליגת העל בכדורגל.
hr-HR	Croatian	Female	Gabrijela	Idemo na pobjedu u Maksimiru, pred našem publikom dat ćemo sto posto.
hu-HU	Hungarian	Female	Noemi	A macska felmászott a tetőre és leugrott.
id-ID	Indonesian	Male	Ardi	Inflasi dapat digolongkan menjadi empat golongan, yaitu inflasi ringan, sedang, berat, dan hiperinflasi.
ms-MY	Malay	Female	Yasmin	Beg berkenaan dibawa ke hospital untuk menjalankan proses pengenalan.
ro-RO	Romanian	Female	Alina	Temperaturile maxime se vor încadra între 15 şi 23 de grade Celsius.
sk-SK	Slovak	Female	Viktoria	Kúzelné miesta nájdete aj za jej hranicami, v malebnej prírode.
sl-SI	Slovenian	Female	Petra	Predlagani zakon vključuje tudi načrt nadaljnjega ukrepanja.
ta-IN	Tamil	Female	Pallavi	உச்சிமீது வானிடிந்து வீழுகின்ற போதினும், அச்சமில்லை அச்சமில்லை அச்சமென்பதில்லையே
te-IN	Telugu	Female	Shruti	అందం ముఖంలో ఉండదు. సహాయం చేసే మనసులో ఉంటుంది
vi-VN	Vietnamese	Female	HoaiMy	Hà Nội là thủ đô của Việt Nam.

With these new voices, Microsoft Azure Neural TTS supports 49 languages/locales in total.

**14 additional voices released to enrich the variety**

Customers use TTS for different scenarios, and their requirements for voice personas vary. To give developers more options, we continue to create more voices in each language. In addition to supporting new locales, we are releasing 14 new voices to enrich the variety in existing languages.

Hear samples of these voices below.

Locale	Language	Gender	Voice	Sample
de-DE	German	Male	Conrad	Je würziger das Fleisch, desto würziger und kräftiger sollte auch der Wein sein.
en-AU	English (Australia)	Male	William	They have told me nothing, and probably cannot tell me anything to the purpose.
en-GB	English (UK)	Male	Ryan	Today’s temperature was a record 26.5 degrees Celsius.
en-US	English (US)	Female	Jenny	For example, we place a session cookie on your computer each time you visit our Website.
es-ES	Spanish (Spain)	Male	Alvaro	Dos helicópteros medicalizados tuvieron que acudir al lugar a rescatar a los heridos.
es-MX	Spanish (Mexico)	Male	Jorge	El niño mencionó que si pudiera caminar, pediría un balón para poder patearlo o una cuerda para poder saltar.
fr-CA	French (Canada)	Male	Jean	Ce jour tant attendu arrive enfin!
fr-FR	French (France)	Male	Henri	Jusqu'ici, nous vous avons toujours fait confiance et accordé le bénefice du doute.
it-IT	Italian	Female	Isabella	I gel igienizzanti sono aumentati di prezzo.
it-IT	Italian	Male	Diego	Domani preparerò dei biscotti con le gocce di cioccolato.
ja-JP	Japanese	Male	Keita	キャッシュレス決済を利用して、支払いを簡単にする。
ko-KR	Korean	Male	InJoon	규모가 더욱 확대되었다.
pt-BR	Portuguese (Brazil)	Male	Antonio	O que você quer ganhar de presente de natal?
th-TH	Thai	Female	Premwadee	วิกฤตแบบนี้บริษัทยิ่งต้องการคนที่พร้อมเผชิญปัญหา

With these updates, Microsoft Azure Text-to-Speech service offers 68 neural voices. Hear all these neural voices saying 'Thank you' in 49 languages/locales in the video below.

Across standard and neural TTS capabilities, we now offer 140+ voices in total. Check the 70+ standard voices.

**More than 15 speaking styles available in en-US and zh-CN voices**

Today, we’re building upon our Neural TTS capabilities in English (US) and Chinese (CN) with new voice styles. By default, the Text-to-Speech service synthesizes text using a neutral speaking style. With neural voices, you can adjust the speaking style to express different emotions like cheerfulness, empathy, and calm, or optimize the voice for scenarios like customer service, newscasting and voice assistants to fit your needs.

The new English (US) voice, Jenny, has a friendly, warm and comforting persona focused on conversational scenarios. For Jenny, we provide additional speaking styles including chat, customer service, and assistant.

You can hear the different speaking styles in Jenny’s voice below:

Style	Style description	Sample
General	Expresses a neutral tone and available for general use	Valentino Lazaro scored a late winner for Austria to deny Northern Ireland a first Nations League point.
Chat	Expresses a casual and relaxed tone in conversation	Oh, well, that's quite a change from California to Utah.
Customer service	Expresses a friendly and helpful tone for customer support	Okay, great. In the meantime, see if you can reach out to Verizon and let them know your issue. And Randy should be calling you back shortly.
Assistant	Expresses a warm and relaxed tone for digital assistants	United States spans 2 time zones. In Nashville, it's 9:45 PM.

A new speaking style is also available for the en-US male voice, Guy. Guy’s newscast style is a good choice when you need a male voice to read professional and news-related content.
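As a rough illustration of how a speaking style is selected, the sketch below builds an SSML document that wraps the text in an mstts:express-as element. The helper name build_styled_ssml is hypothetical, and the voice name "en-US-JennyNeural" and style identifier "chat" used in the usage note are assumptions; check the SSML documentation for the exact identifiers each voice accepts.

```python
import xml.etree.ElementTree as ET

# Namespaces taken from the SSML snippet in this post.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize SSML as the default namespace and mstts with its usual prefix.
ET.register_namespace("", SSML_NS)
ET.register_namespace("mstts", MSTTS_NS)


def build_styled_ssml(voice: str, style: str, text: str, lang: str = "en-US") -> str:
    """Return an SSML string that asks `voice` to read `text` in `style`.

    Hypothetical helper: the caller is responsible for passing a voice
    name and style identifier that the service actually supports.
    """
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    express = ET.SubElement(voice_el, f"{{{MSTTS_NS}}}express-as", {"style": style})
    express.text = text
    return ET.tostring(speak, encoding="unicode")
```

For example, build_styled_ssml("en-US-JennyNeural", "chat", "Oh, well, that's quite a change.") would produce a document whose express-as element carries style="chat", assuming those identifiers are valid for your resource. Building the document with an XML library rather than string concatenation keeps special characters in the text properly escaped.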

In addition, 10 new speaking styles are available with our zh-CN voice, Xiaoxiao. These new styles are optimized for audio content creators and intelligent bot developers who want to create more engaging interactive audio that expresses rich emotions.

You can hear the new speaking styles in Xiaoxiao’s voice below:

Style	Sample
Calm	那，那我再问你，你之前有养过宠物嘛？
Affectionate	老公，把灯打开好吗，好黑呀，我很怕。
Angry	没想到，我们八年的感情真的完了！
Disgruntled	这你都不明白吗？真是个榆木脑袋。
Fearful	先生，你没事吧？要不要我叫医生过来？
Gentle	我今天运气特别好,如果没有遇到您,还不知道会怎么样呢！
Cheerful	太好了，恭喜你顺利通过考核。
Serious	不要恋战，等待时机，随时准备突围。
Sad	没想到，你居然是这么一个无情无义的的人！

For the Chinese voice Xiaoxiao, the intensity (‘style degree’) of a speaking style can be further adjusted to better fit your use case. You can specify a stronger or softer style with 'style degree' to make the speech more expressive or more subdued.

Audio samples: the sentence 没想到，你居然是这么一个无情无义的的人！ spoken in the sad style at style degrees 0.5, 1.0, 1.5 and 2.0.

The style degree can be adjusted from 0.01 to 2 inclusive, in steps of 0.01. The default value is 1, which applies the predefined style intensity. The lowest value, 0.01, softens the style toward a flatter tone. The highest value, 2, makes the style noticeably stronger than the default.
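The range rule above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the service or SDK: normalize_style_degree is a hypothetical helper that rejects values outside 0.01 to 2 and rounds accepted values to two decimal places so they can be written into the SSML attribute.

```python
MIN_STYLE_DEGREE = 0.01  # lowest value accepted, per the range above
MAX_STYLE_DEGREE = 2.0   # highest value accepted, per the range above


def normalize_style_degree(value: float) -> str:
    """Validate a style degree and format it for use as an SSML attribute.

    Raises ValueError when the value falls outside the documented range.
    Accepted values are rounded to two decimal places (steps of 0.01).
    """
    number = float(value)
    if not (MIN_STYLE_DEGREE <= number <= MAX_STYLE_DEGREE):
        raise ValueError(
            f"style degree must be between {MIN_STYLE_DEGREE} and "
            f"{MAX_STYLE_DEGREE}, got {value!r}"
        )
    # The 'g' format drops trailing zeros, so 2.0 becomes "2" and 0.5 stays "0.5".
    return f"{round(number, 2):g}"
```

Failing early with a clear error is a design choice here; an application could instead clamp out-of-range values to the nearest bound if silently adjusting the intensity is acceptable.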

The SSML snippet below illustrates how the 'style degree' attribute is used to change the intensity of a speaking style.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
    <voice name="zh-CN-XiaoxiaoNeural">
        <mstts:express-as style="sad" styledegree="2">
            快走吧，路上一定要注意安全，早去早回。
        </mstts:express-as>
    </voice>
</speak>

The 'style degree' feature currently applies only to the Chinese voice Xiaoxiao and will come to more languages and voices soon.

Check SSML for the details on how to use these speaking styles, together with other rich voice tuning capabilities.

**Neural TTS Container is in public preview with 16 voices available in 14 languages**

We have launched Neural TTS Container in public preview, as we are seeing a clear trend towards a future powered by the intelligent cloud and intelligent edge. With Neural TTS Container, developers can run speech synthesis with the most natural digital voices in their own environment for specific security and data governance requirements. Their Speech apps are portable and scalable with greater consistency whether they run on the edge or in Azure.

Currently 14 languages/locales are supported with 16 voices in Neural TTS Containers, as listed below.

Locale	Voice
de-de	KatjaNeural
en-au	NatashaNeural
en-ca	ClaraNeural
en-gb	LibbyNeural
en-gb	MiaNeural
en-us	AriaNeural
en-us	GuyNeural
es-es	ElviraNeural
es-mx	DaliaNeural
fr-ca	SylvieNeural
fr-fr	DeniseNeural
it-it	ElsaNeural
ja-jp	NanamiNeural
ko-kr	SunHiNeural
pt-br	FranciscaNeural
zh-cn	XiaoxiaoNeural

To get started, fill out and submit the request form to request access to the container. Currently Neural TTS Containers are gated: access is approved only for enterprises (EA customers) and Microsoft partners, and only for customers who qualify.

Azure Cognitive Services Containers including Neural TTS Containers aren't licensed to run without being connected to the metering / billing endpoint. You must enable the containers to communicate billing information with the billing endpoint at all times. Cognitive Services containers don't send customer data, such as the image or text that's being analyzed, to Microsoft. Queries to the container are billed at the pricing tier of the Azure resource that's used for the ApiKey.
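To make the billing requirement concrete, the sketch below assembles a docker run command that passes the billing endpoint and API key to the container as start-up arguments. It is an illustration under assumptions: docker_run_command is a hypothetical helper, the Eula, Billing and ApiKey argument names follow the usual Cognitive Services container convention, and the image name, endpoint and key are placeholders that you must replace with the values for your own approved resource.

```python
import shlex


def docker_run_command(image: str, billing_endpoint: str, api_key: str,
                       host_port: int = 5000) -> list:
    """Build the argument list for starting a Cognitive Services container.

    Hypothetical helper: `image`, `billing_endpoint` and `api_key` are
    supplied by the caller; nothing here is looked up or validated.
    """
    return [
        "docker", "run", "--rm", "-it",
        "-p", f"{host_port}:5000",       # expose the container's port 5000
        image,
        "Eula=accept",                    # accept the license terms
        f"Billing={billing_endpoint}",    # metering / billing endpoint
        f"ApiKey={api_key}",              # key of the Azure resource that is billed
    ]


if __name__ == "__main__":
    # Placeholder values only; substitute your own image tag, endpoint and key.
    command = docker_run_command(
        image="<neural-tts-image>:<tag>",
        billing_endpoint="<billing-endpoint-uri>",
        api_key="<api-key>",
    )
    print(shlex.join(command))
```

The script only prints the command so that you can review it first. Hardware options such as memory and CPU limits are left out on purpose; set them according to the hardware requirements for the container.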

To install and run the container:

1. Make sure the machine that hosts the container meets the hardware requirements.
2. Get the container image with docker pull. For the supported locales and corresponding voices of the Neural TTS container, see Neural Text-to-speech image tags.
3. Run the container with docker run.
4. Validate that the container is running.
5. Query the container’s endpoint. For example, with the AriaNeural voice, you can send the following HTTP POST request to get the TTS output audio:

curl -s -v -X POST http://localhost:5000/speech/synthesize/cognitiveservices/v1 \
  -H 'Accept: audio/*' \
  -H 'Content-Type: application/ssml+xml' \
  -H 'X-Microsoft-OutputFormat: riff-24khz-16bit-mono-pcm' \
  -d '<speak version="1.0" xml:lang="en-US"><voice name="en-US-AriaNeural">This is a test, only a test.</voice></speak>' > output.wav
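If you prefer to call the container from application code, the same request can be sketched in Python with only the standard library. The function names below are hypothetical; the URL path, headers and output format are copied from the curl command above, and the default host assumes the container is listening on localhost port 5000.

```python
import urllib.request

SYNTHESIS_PATH = "/speech/synthesize/cognitiveservices/v1"


def build_synthesis_request(ssml: str,
                            host: str = "http://localhost:5000") -> urllib.request.Request:
    """Build (but do not send) the POST request used to synthesize `ssml`."""
    return urllib.request.Request(
        url=f"{host}{SYNTHESIS_PATH}",
        data=ssml.encode("utf-8"),
        headers={
            "Accept": "audio/*",
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        },
        method="POST",
    )


def synthesize_to_file(ssml: str, path: str = "output.wav",
                       host: str = "http://localhost:5000") -> None:
    """Send the request to a running container and save the audio it returns."""
    request = build_synthesis_request(ssml, host)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


if __name__ == "__main__":
    # Requires a running container; this is the same SSML as the curl example.
    synthesize_to_file(
        '<speak version="1.0" xml:lang="en-US">'
        '<voice name="en-US-AriaNeural">This is a test, only a test.</voice>'
        '</speak>'
    )
```

Separating request construction from sending makes it possible to inspect or unit test the request without a container running.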

Learn more about Container support in Cognitive Services and visit the Frequently Asked Questions on Azure Cognitive Services Containers.

**Get started**

With these updates, we’re excited to power natural and intuitive voice experiences for more customers globally, with flexible deployment options. For more information, see the resources below.