Ignite 2020 Neural TTS updates: new language support, more voices and flexible deployment options

QinyingLiao
Sep 22, 2020

 

This post was co-authored by Garfield He, Melinda Ma, Yueying Liu and Yinhe Wei  

   

Neural Text to Speech (Neural TTS), a powerful speech synthesis capability of Cognitive Services on Azure, enables you to convert text to lifelike speech that is close to human parity. Since its launch, many Azure customers have adopted it in a variety of scenarios, from voice assistants to audio content creation. We continue to push the envelope to enable more developers to add natural-sounding voices to their applications and solutions.

 

Today, we are happy to announce a series of updates to Neural TTS that extend its reach globally and allow developers to deploy it wherever their data resides. These updates include newly supported languages, new voices with rich personas, and on-premises deployment through Docker containers.

 

18 new languages/locales supported

 

Neural TTS now supports 18 new languages/locales: Bulgarian, Czech, German (Austria), German (Switzerland), Greek, English (Ireland), French (Switzerland), Hebrew, Croatian, Hungarian, Indonesian, Malay, Romanian, Slovak, Slovenian, Tamil, Telugu and Vietnamese.

 

You can hear samples of these voices below.

 

Locale | Language | Gender | Voice | Sample
bg-BG | Bulgarian | Female | Kalina | Архитектурното културно наследство в България е в опасност.
cs-CZ | Czech | Female | Vlasta | Policisté většinou chodí v uniformě a jsou označeni hodnostmi.
de-AT | German (Austria) | Female | Ingrid | Ab Herbst werden Lehrer, die sich dafür interessieren, eigens ausgebildet.
de-CH | German (Switzerland) | Female | Leni | Dreizehn Millionen Liter mehr als im Vorjahr.
el-GR | Greek | Female | Athina | Για να βρεις ποιος σε εξουσιάζει, απλώς σκέψου ποιος είναι αυτός που δεν επιτρέπεται να κριτικάρεις.
en-IE | English (Ireland) | Female | Emily | Now we have seventy members and two dragon boats.
fr-CH | French (Switzerland) | Female | Ariane | Chaque équipe jouera donc 5 matchs de 20 minutes dans sa poule.
he-IL | Hebrew (Israel) | Female | Hila | הכל פתוח במאבק על המקום האחרון לפלייאוף העליון של ליגת העל בכדורגל.
hr-HR | Croatian | Female | Gabrijela | Idemo na pobjedu u Maksimiru, pred našem publikom dat ćemo sto posto.
hu-HU | Hungarian | Female | Noemi | A macska felmászott a tetőre és leugrott.
id-ID | Indonesian | Male | Ardi | Inflasi dapat digolongkan menjadi empat golongan, yaitu inflasi ringan, sedang, berat, dan hiperinflasi.
ms-MY | Malay | Female | Yasmin | Beg berkenaan dibawa ke hospital untuk menjalankan proses pengenalan.
ro-RO | Romanian | Female | Alina | Temperaturile maxime se vor încadra între 15 şi 23 de grade Celsius.
sk-SK | Slovak | Female | Viktoria | Kúzelné miesta nájdete aj za jej hranicami, v malebnej prírode.
sl-SI | Slovenian | Female | Petra | Predlagani zakon vključuje tudi načrt nadaljnjega ukrepanja.
ta-IN | Tamil | Female | Pallavi | உச்சிமீது வானிடிந்து வீழுகின்ற போதினும், அச்சமில்லை அச்சமில்லை அச்சமென்பதில்லையே
te-IN | Telugu | Female | Shruti | అందం ముఖంలో ఉండదు. సహాయం చేసే మనసులో ఉంటుంది
vi-VN | Vietnamese | Female | HoaiMy | Hà Nội là thủ đô của Việt Nam.

 

With these new voices, Microsoft Azure Neural TTS supports 49 languages/locales in total.

 

14 additional voices released to enrich the variety

 

Customers use TTS for different scenarios and their requirements for voice personas can vary. To provide more options to developers, we continue to create more voices in each language. Besides the extension to support new locales, we’ve announced 14 new voices to enrich the variety in the existing languages.

 

Hear samples of these voices below.

 

Locale | Language | Gender | Voice | Sample
de-DE | German | Male | Conrad | Je würziger das Fleisch, desto würziger und kräftiger sollte auch der Wein sein.
en-AU | English (Australia) | Male | William | They have told me nothing, and probably cannot tell me anything to the purpose.
en-GB | English (UK) | Male | Ryan | Today’s temperature was a record 26.5 degrees Celsius.
en-US | English (US) | Female | Jenny | For example, we place a session cookie on your computer each time you visit our Website.
es-ES | Spanish (Spain) | Male | Alvaro | Dos helicópteros medicalizados tuvieron que acudir al lugar a rescatar a los heridos.
es-MX | Spanish (Mexico) | Male | Jorge | El niño mencionó que si pudiera caminar, pediría un balón para poder patearlo o una cuerda para poder saltar.
fr-CA | French (Canada) | Male | Jean | Ce jour tant attendu arrive enfin!
fr-FR | French (France) | Male | Henri | Jusqu'ici, nous vous avons toujours fait confiance et accordé le bénefice du doute.
it-IT | Italian | Female | Isabella | I gel igienizzanti sono aumentati di prezzo.
it-IT | Italian | Male | Diego | Domani preparerò dei biscotti con le gocce di cioccolato.
ja-JP | Japanese | Male | Keita | キャッシュレス決済を利用して、支払いを簡単にする。
ko-KR | Korean | Male | InJoon | 규모가 더욱 확대되었다.
pt-BR | Portuguese (Brazil) | Male | Antonio | O que você quer ganhar de presente de natal?
th-TH | Thai | Female | Premwadee | วิกฤตแบบนี้บริษัทยิ่งต้องการคนที่พร้อมเผชิญปัญหา

 

 

With these updates, Microsoft Azure Text-to-Speech service offers 68 neural voices.  Hear all these neural voices saying 'Thank you' in 49 languages/locales in the video below. 

 

 

Across standard and neural TTS capabilities, we now offer 140+ voices in total. Check the 70+ standard voices.

 

More than 15 speaking styles available in en-US and zh-CN voices

 

Today, we’re building upon our Neural TTS capabilities in English (US) and Chinese (CN) with new voice styles. By default, the Text-to-Speech service synthesizes text in a neutral speaking style. With neural voices, you can adjust the speaking style to express emotions such as cheerfulness, empathy, and calm, or optimize the voice for scenarios such as customer service, newscasting and voice assistants, to fit your needs.

 

The new English (US) voice, Jenny, is created with a friendly, warm and comforting persona focused on conversational scenarios. For Jenny, we provide additional speaking styles including chat, customer service, and assistant.

 

You can hear the different speaking styles in Jenny’s voice below:

 

Style | Style description | Sample
General | Expresses a neutral tone and is available for general use | Valentino Lazaro scored a late winner for Austria to deny Northern Ireland a first Nations League point.
Chat | Expresses a casual and relaxed tone in conversation | Oh, well, that's quite a change from California to Utah.
Customer service | Expresses a friendly and helpful tone for customer support | Okay, great. In the meantime, see if you can reach out to Verizon and let them know your issue. And Randy should be calling you back shortly.
Assistant | Expresses a warm and relaxed tone for digital assistants | United States spans 2 time zones. In Nashville, it's 9:45 PM.

 

A new speaking style is also available for the en-US male voice, Guy. Guy’s newscast style can be a great choice when you need a male voice to read professional and news-related content.


In addition, 10 new speaking styles are available with our zh-CN voice, Xiaoxiao. These new styles are optimized for audio content creators and intelligent bot developers who want to create more engaging, interactive audio that expresses rich emotions.

 

You can hear the new speaking styles in Xiaoxiao’s voice below:

 

Style | Sample
Calm | 那,那我再问你,你之前有养过宠物嘛?
Affectionate | 老公,把灯打开好吗,好黑呀,我很怕。
Angry | 没想到,我们八年的感情真的完了!
Disgruntled | 这你都不明白吗?真是个榆木脑袋。
Fearful | 先生,你没事吧?要不要我叫医生过来?
Gentle | 我今天运气特别好,如果没有遇到您,还不知道会怎么样呢!
Cheerful | 太好了,恭喜你顺利通过考核。
Serious | 不要恋战,等待时机,随时准备突围。
Sad | 没想到,你居然是这么一个无情无义的的人!

 

For the Chinese voice Xiaoxiao, the intensity (‘style degree’) of a speaking style can be further adjusted to better fit your use case. You can specify a stronger or softer style with 'style degree' to make the speech more expressive or more subdued.

 

Sample sentence: 没想到,你居然是这么一个无情无义的的人!

Style degree samples: Sad=0.5 | Sad=1.0 | Sad=1.5 | Sad=2.0

 

The style degree can be set from 0.01 to 2, inclusive. The default value is 1, which applies the predefined style intensity. The minimum value, 0.01, softens the style toward a flatter tone. The maximum value, 2, makes the style noticeably stronger than the default.
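As a rough illustration of the range described above, the sketch below clamps a requested value into 0.01 to 2 before it would be written into SSML. The function name and the decision to clamp rather than reject are assumptions made for this example; the post does not say how the service treats out-of-range values.

```python
# Bounds and default taken from the text above.
MIN_STYLE_DEGREE = 0.01
MAX_STYLE_DEGREE = 2.0
DEFAULT_STYLE_DEGREE = 1.0


def normalize_style_degree(value=DEFAULT_STYLE_DEGREE):
    """Clamp a requested style degree into the range [0.01, 2].

    Hypothetical helper: clamping is a client-side choice for this sketch,
    not documented service behavior.
    """
    return min(max(float(value), MIN_STYLE_DEGREE), MAX_STYLE_DEGREE)


if __name__ == "__main__":
    for requested in (0, 0.5, 1, 1.5, 3):
        print(requested, "->", normalize_style_degree(requested))
```

Clamping keeps a slider or config value usable even when it drifts out of range. Raising an error instead would be an equally reasonable choice if you prefer to surface bad input.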

 

The SSML snippet below illustrates how the styledegree attribute ('style degree') is used to change the intensity of a speaking style.

 

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
    <voice name="zh-CN-XiaoxiaoNeural">
        <mstts:express-as style="sad" styledegree="2">
            快走吧,路上一定要注意安全,早去早回。
        </mstts:express-as>
    </voice>
</speak>
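If you generate SSML in code, building the document with an XML library avoids escaping mistakes in the text. The following is a minimal sketch using only Python's standard library. The helper name `build_styled_ssml` and its parameters are made up for this example; the element names, attribute names and namespace URIs are copied from the snippet above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the SSML snippet in this post.
SYNTHESIS_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize with a default namespace and the "mstts" prefix.
ET.register_namespace("", SYNTHESIS_NS)
ET.register_namespace("mstts", MSTTS_NS)


def build_styled_ssml(text, voice, lang, style, style_degree=None):
    """Return an SSML string with an mstts:express-as element (hypothetical helper)."""
    speak = ET.Element(
        f"{{{SYNTHESIS_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    voice_el = ET.SubElement(speak, f"{{{SYNTHESIS_NS}}}voice", {"name": voice})
    attrs = {"style": style}
    if style_degree is not None:
        # Only emitted when requested; omitting it leaves the default intensity.
        attrs["styledegree"] = str(style_degree)
    express = ET.SubElement(voice_el, f"{{{MSTTS_NS}}}express-as", attrs)
    express.text = text  # ElementTree escapes &, < and > on serialization
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_styled_ssml(
        "快走吧,路上一定要注意安全,早去早回。",
        voice="zh-CN-XiaoxiaoNeural",
        lang="zh-CN",
        style="sad",
        style_degree=2,
    ))
```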

 

The 'style degree' feature currently applies only to the Chinese voice Xiaoxiao and will come to more languages and voices soon.

 

Check SSML for the details on how to use these speaking styles, together with other rich voice tuning capabilities.

 

Neural TTS Container is in public preview with 16 voices available in 14 languages

 

We have launched Neural TTS Container in public preview, as we are seeing a clear trend towards a future powered by the intelligent cloud and intelligent edge. With Neural TTS Container, developers can run speech synthesis with the most natural digital voices in their own environment for specific security and data governance requirements. Their Speech apps are portable and scalable with greater consistency whether they run on the edge or in Azure.

 

Currently 14 languages/locales are supported with 16 voices in Neural TTS Containers, as listed below. 

 

Locale | Voice
de-de | KatjaNeural
en-au | NatashaNeural
en-ca | ClaraNeural
en-gb | LibbyNeural
en-gb | MiaNeural
en-us | AriaNeural
en-us | GuyNeural
es-es | ElviraNeural
es-mx | DaliaNeural
fr-ca | SylvieNeural
fr-fr | DeniseNeural
it-it | ElsaNeural
ja-jp | NanamiNeural
ko-kr | SunHiNeural
pt-br | FranciscaNeural
zh-cn | XiaoxiaoNeural
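For client code that needs to pick a container voice by locale, the list above can be kept as a small lookup table. The sketch below is illustrative only. The full voice name format (for example `en-US-AriaNeural`) is inferred from the voice names used elsewhere in this post and is an assumption for the other locales, and the function names are made up for this example.

```python
# Locale -> voices, transcribed from the list above (lowercase tags as shown there).
CONTAINER_VOICES = {
    "de-de": ["KatjaNeural"],
    "en-au": ["NatashaNeural"],
    "en-ca": ["ClaraNeural"],
    "en-gb": ["LibbyNeural", "MiaNeural"],
    "en-us": ["AriaNeural", "GuyNeural"],
    "es-es": ["ElviraNeural"],
    "es-mx": ["DaliaNeural"],
    "fr-ca": ["SylvieNeural"],
    "fr-fr": ["DeniseNeural"],
    "it-it": ["ElsaNeural"],
    "ja-jp": ["NanamiNeural"],
    "ko-kr": ["SunHiNeural"],
    "pt-br": ["FranciscaNeural"],
    "zh-cn": ["XiaoxiaoNeural"],
}


def full_voice_name(locale_tag, voice):
    """Compose a name like 'en-US-AriaNeural' (assumed naming pattern)."""
    language, region = locale_tag.split("-")
    return f"{language.lower()}-{region.upper()}-{voice}"


def voices_for(locale_tag):
    """Return full voice names for a locale, or an empty list if unsupported."""
    voices = CONTAINER_VOICES.get(locale_tag.lower(), [])
    return [full_voice_name(locale_tag, voice) for voice in voices]


if __name__ == "__main__":
    print(voices_for("en-US"))
    print(voices_for("nl-NL"))
```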

 

To get started, fill out and submit the request form for access to the container. Neural TTS Containers are currently gated: access is approved only for enterprises (EA customers) and Microsoft partners, and only for those that qualify.

 

Azure Cognitive Services Containers including Neural TTS Containers aren't licensed to run without being connected to the metering / billing endpoint. You must enable the containers to communicate billing information with the billing endpoint at all times. Cognitive Services containers don't send customer data, such as the image or text that's being analyzed, to Microsoft. Queries to the container are billed at the pricing tier of the Azure resource that's used for the ApiKey.

 

To install and run the container:

  1. Make sure the machine that will host the container meets the hardware requirements.
  2. Get the container image with docker pull. For all the supported locales and corresponding voices of the neural text-to-speech container, see Neural Text-to-speech image tags.
  3. Run the container with docker run.
  4. Validate that the container is running.
  5. Query the container’s endpoint. For example, with the AriaNeural voice, the following HTTP POST request returns the synthesized audio:

curl -s -v -X POST http://localhost:5000/speech/synthesize/cognitiveservices/v1 \
 -H 'Accept: audio/*' \
 -H 'Content-Type: application/ssml+xml' \
 -H 'X-Microsoft-OutputFormat: riff-24khz-16bit-mono-pcm' \
 -d '<speak version="1.0" xml:lang="en-US"><voice name="en-US-AriaNeural">This is a test, only a test.</voice></speak>' > output.wav
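The same request can be issued from application code. Below is a minimal sketch with Python's standard library that mirrors the curl command: same path, headers and SSML body. The function names and the default host are assumptions for this example, and it presumes a container is already listening on localhost:5000.

```python
import urllib.request
from xml.sax.saxutils import escape

SYNTHESIS_PATH = "/speech/synthesize/cognitiveservices/v1"


def build_synthesis_request(text, voice="en-US-AriaNeural", lang="en-US",
                            host="http://localhost:5000"):
    """Build (but do not send) the POST request shown in the curl example."""
    ssml = (
        f'<speak version="1.0" xml:lang="{lang}">'
        f'<voice name="{voice}">{escape(text)}</voice></speak>'
    )
    return urllib.request.Request(
        host + SYNTHESIS_PATH,
        data=ssml.encode("utf-8"),
        headers={
            "Accept": "audio/*",
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        },
        method="POST",
    )


def synthesize_to_file(text, path="output.wav"):
    """Send the request to a running container and save the returned audio."""
    request = build_synthesis_request(text)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling `synthesize_to_file("This is a test, only a test.")` should produce the same `output.wav` as the curl command, provided the container is running and reachable.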

 

Learn more about Container support in Cognitive Services and visit the Frequently Asked Questions on Azure Cognitive Services Containers.    

 

Get started

 

With these updates, we’re excited to be powering natural and intuitive voice experiences for more customers globally with flexible deployment options. For more information, visit below. 

 

Updated Sep 25, 2020
Version 3.0
  • serhat1141

    Hi Sir, I am really enjoying the audio content creation so far, but I have to ask a question, as there seems to be a problem within the audio content creation page. I mostly use neural voices. For the last few days, I have been adjusting the RATE first, then the INTONATION, to make the pronunciation closer to real speech. But when I ADJUST THE INTONATION, THE RATE GOES BACK TO THE BASE VALUE OF 1.00. Also, when adjusting the INTONATION, IF I HAVE SET A RATE BEFORE, IT DOES NOT PREVIEW THE ADJUSTED INTONATION. I have been having this problem for some 3-5 days, and it is an annoying problem because I can't create the voices I have been creating for the last few weeks. I would appreciate it if you could correct this problem. Thank you so much.

  • Hi serhat1141, thank you for reporting the issue. We are investigating the cause and will fix it if it's a bug.

     

    If you have a support plan and you need technical help, you can create a support request.

    1. For Issue type, select “Technical”.
    2. For Subscription, select your subscription.
    3. For Service, click My services, then select “Cognitive Services”.
    4. For Summary, type a description of your issue.
    5. For Problem type, select “Text to Speech”.
    6. For problem subtype, select “accuracy of speech output”.