This post is co-authored with Melinda Ma, Yueying Liu, Garfield He and Sheng Zhao
Neural Text to Speech (Neural TTS), a powerful speech synthesis capability of Cognitive Services on Azure, enables you to convert text to lifelike speech that is close to human parity.
Since its launch, Azure Neural TTS has been applied to a wide range of scenarios, from voice assistants to news reading and audiobook creation. We have also seen more customer requests to support natural conversations that are casual and less formal. Today we are glad to announce a few updates to Neural TTS, with a focus on new voices that are optimized for casual conversation scenarios.
Conversational voices: scenarios and challenges
TTS voices are increasingly used to support human-machine conversations or machine-facilitated interpersonal communication (e.g., human conversations supported by speech-to-speech translation). In these scenarios, a more relaxed and casual speaking style is usually expected. We outline three typical scenarios for conversational voices or conversational styles below.
Customer service bot
Many enterprises use voice-enabled chatbots or IVR systems to provide more efficient customer service and transform their traditional customer care. For example, Vodafone created a natural-sounding customer service bot, TOBi, and used the AI and natural language processing capabilities in Azure to give TOBi a clear personality that makes conversations natural and fun, which drives better customer engagement. After a customer gives their name, instead of a dry request like, “Now tell me your address,” TOBi might say, “Hey, that’s a great name. Now I’d like to know where you live.” In such scenarios, the AI voice is usually expected to sound comforting, friendly, and warm while remaining professional. Besides answering customer inquiries, the AI voice is also frequently used to give cheerful greetings and show empathy to customers.
Personal assistant
With the emergence of virtual assistant and virtual reality technology, we’ve seen more customers using Neural TTS to support chitchat and daily conversations. One challenge in making AI-human chat more natural is for the bot to understand chat language, which often contains special characters, modal particles like “hehe”, “haha”, and “ouch”, emojis, and repeated letters like “soooo good”, and to provide instant responses in natural tones. In addition, expressing different emotions for different messages is a frequently requested capability, so that the chatbot can better resonate with human feelings.
Simultaneous speech translation
Speech-to-speech translation is another typical scenario where a conversational AI voice can be used. With broad coverage of over 70 languages and variants, Azure Neural TTS has been used to provide speech output for various translations. During translation, however, it has been challenging to keep the original speaker’s style when their speech is translated into another language. Especially in casual speech scenarios, the simultaneous speaking tones often carry the subtle nuances of the speech and help the audience build an emotional connection with the speaker. In such cases, an AI voice that can support simultaneous speech and capture casual speaking styles can make speech-to-speech translation more vivid and engaging.
Next, we introduce the latest updates in Azure Neural TTS conversational voices in different languages.
Sara: a new chatbot voice in English (US)
Sara, a new conversational voice in English (US), represents a young adult female who talks more casually and is best suited to chatbot scenarios. At release, she comes with three built-in emotional styles: cheerful, sad, and angry. In addition, she can read emojis, produce laughter, sighs, and special angry sounds, and express emphasis such as “soooo good”, just as a human being would.
Hear what those sound effects are like in the examples below.
|  | Text input |
| --- | --- |
| With emoji support | That's great. I'm not working right now. |
|  | Uhhh, let me ... let me think, I eat hamburger for dinner. |
Below is an example of Sara in a chatbot scenario, engaged in natural conversation with a human user. (This sample comes from chitchat between the bot and a human user; the dialogue is casual and may contain errors.)
With the new Sara voice, you can also adjust the speaking style using SSML and switch between the neutral, cheerful, sad, and angry tones.
| Style | Script | TTS output |
| --- | --- | --- |
| Cheerful | I’m so happy to see you. | (audio sample) |
| Sad | She felt disheartened when she was not chosen to be on the school team. | (audio sample) |
| Angry | Jack’s father was fuming with anger when he could not find Jack in his room. | (audio sample) |
| Chat | File this under missed connections cuz i'm lost | (audio sample) |
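To illustrate the style switching described above, here is a minimal Python sketch that builds an SSML document with an `mstts:express-as` element. It assumes the full voice identifier is `en-US-SaraNeural` and that the style names are the lowercase forms of the table entries; neither is stated in this post, so check the documentation before relying on them. The helper `build_ssml` is hypothetical and only produces the SSML string; it does not call the service.

```python
from xml.sax.saxutils import escape, quoteattr

# Style names assumed from the table above (lowercased); None means neutral.
SARA_STYLES = {"cheerful", "sad", "angry", "chat"}


def build_ssml(text, style=None, voice="en-US-SaraNeural", lang="en-US"):
    """Return an SSML string that speaks `text`, optionally in a given style."""
    if style is not None and style not in SARA_STYLES:
        raise ValueError(f"unsupported style: {style!r}")
    body = escape(text)  # escape &, < and > so the text is valid XML
    if style is not None:
        body = (
            f"<mstts:express-as style={quoteattr(style)}>"
            f"{body}</mstts:express-as>"
        )
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


# Cheerful example from the table; omit `style` for the neutral tone.
print(build_ssml("I’m so happy to see you.", style="cheerful"))
```

Leaving out the `express-as` wrapper for the neutral tone keeps the default voice behavior, so the same helper covers all of the tones mentioned above.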
Xiaochen and Xiaoyan: new voices in Chinese optimized for spontaneous speech and customer service scenarios
Two new conversational voices are released in Chinese (Mandarin, simplified): Xiaochen, best used for creating spontaneous speech, and Xiaoyan, best used for customer service scenarios.
These two voices have the following characteristics:
- A more relaxed and casual speaking style
Conversational voices are different from voices for reading, broadcasting, or storytelling. In conversations, voices are usually more relaxed and casual, and the prosody changes often. When people talk casually, the pronunciation of each word may not be complete, the sentence may not be accurate, and the control of the voice does not need to be perfect or professional. The new voices, Xiaochen and Xiaoyan, are designed to reproduce this casual speaking style.
- More natural oral expressions
In spontaneous speech, sentences are often short, and the structure can be simple or even incomplete. Repetitions, disconnections, supplements, interruptions, disfluency, and redundancy are common. Thanks to our advanced modeling technology, both the Xiaochen and Xiaoyan voices handle these forms of expression well. The imperfections in human expression are carefully designed and modeled so that the AI voices can learn from these imperfect features and sound more realistic.
The following is a simulated conversation demo in a customer service scenario. In this sample, Xiaoyan acts as a customer service assistant, and Xiaochen acts as a customer. Hear how relaxed and natural Xiaochen and Xiaoyan are when talking to each other.
Xiaoyan: 喂,你好。
Xiaochen: 喂,你好,我刚才接到这个电话打来的,然后我想问一下是有什么包裹吗,还是什么东西。
Xiaoyan: 哦,您是要查包裹对吗?
Xiaochen: 呃对,刚接到这个电话他说我有个包裹,但是我不确定,因为我没有寄东西。
Xiaoyan: 嗯,我这里是总机,刚刚可能是分机给您去的电话吧?
Xiaochen: 对,然后他叫我打这个电话。
Xiaoyan: 嗯,那这样吧,麻烦您提供一下姓名,我帮您查一下。
Xiaochen: 晓辰。
Xiaoyan: 哪个辰?
Xiaochen: 星辰的辰,晓是那个破晓的晓。
Xiaoyan: 嗯好的,您稍等一下好吗?我刚才帮您看了一下,确实有一份由晓辰姓名签收的包裹。号码是一二三四五六七八九八七,这是您本人吗?
Xiaochen: 是我本人。
Xiaoyan: 嗯,因为这个包裹当时是由于地址不详,没有办法准确投递。这样您把这个详细地址跟我讲一下,我马上安排工作人员给您送过去好吗?
Xiaochen: 哦,我现在在出差。不过也没关系,我到时候找人帮我签收,然后写我名字就可以了,是吧?
Xiaoyan: 嗯,对的。
Xiaochen: 寄到鼓楼大街1号吧。那能查到是谁寄的吗?
Xiaoyan: 上面没有写的。
Xiaochen: 啊那好吧。
Xiaoyan: 哦,不过这个包裹显示是从北京寄出的。
Xiaochen: 呃您稍等一下哈。诶,是从中关村寄出的吗?
Xiaoyan: 嗯,是的。
Xiaochen: 啊,那我知道了。就是我可不可以报一个电话号码给你,然后叫派送的工作人员直接跟这个人联系,可以吗?
Xiaoyan: 您说的这个人是也是在原来的地址是吧?
Xiaochen: 对,你到时候跟她联系的话,就直接送过去,拿给她就行。
Xiaoyan: 嗯,好的。
Xiaochen: 好,谢谢你呀,那有什么问题我还是可以打这个电话对吗?
Xiaoyan: 对的,没问题。
Xiaochen: 行,谢谢哈,给您添麻烦了。
Xiaoyan: 嗯,不客气。
Xiaochen: 好,那再见。
Xiaoyan: 麻烦您对我的服务进行评价,再见。
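A two-speaker dialogue like the one above can be expressed as a single SSML document with one `voice` element per turn. The following Python sketch shows one way to generate it. The voice identifiers `zh-CN-XiaoyanNeural` and `zh-CN-XiaochenNeural` are assumptions (the post names the voices only as Xiaoyan and Xiaochen), and `dialogue_to_ssml` is a hypothetical helper that builds the markup without calling the service.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Assumed full voice identifiers for the two speakers.
VOICES = {
    "Xiaoyan": "zh-CN-XiaoyanNeural",
    "Xiaochen": "zh-CN-XiaochenNeural",
}


def dialogue_to_ssml(turns, lang="zh-CN"):
    """Build one SSML document with a <voice> element per (speaker, text) turn."""
    ET.register_namespace("", SSML_NS)  # serialize SSML as the default namespace
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    for speaker, text in turns:
        voice = ET.SubElement(
            speak, f"{{{SSML_NS}}}voice", {"name": VOICES[speaker]}
        )
        voice.text = text  # ElementTree escapes special characters on output
    return ET.tostring(speak, encoding="unicode")


# First three turns of the conversation above.
turns = [
    ("Xiaoyan", "喂,你好。"),
    ("Xiaochen", "喂,你好,我刚才接到这个电话打来的,然后我想问一下是有什么包裹吗,还是什么东西。"),
    ("Xiaoyan", "哦,您是要查包裹对吗?"),
]
ssml = dialogue_to_ssml(turns)
print(ssml)
```

Keeping the turns as plain `(speaker, text)` pairs makes it easy to reuse the same transcript with different voices by changing only the `VOICES` mapping.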
New styles for Nanami in Japanese
Nanami is a popular Japanese voice. Three new styles are now available with Nanami: chat, customer service, and cheerful. These styles can be used to make your voice experience more engaging in various scenarios.
| Voice | Style | Description |
| --- | --- | --- |
| ja-JP-NanamiNeural | style="customerservice" | Expresses a friendly and helpful tone for customer support |
|  | style="chat" | Expresses a casual and relaxed tone |
|  | style="cheerful" | Expresses a positive and happy tone |
Try the samples below:
| Style | Script | TTS output |
| --- | --- | --- |
| style="customerservice" | 注文番号もありますか? | (audio sample) |
| style="chat" | 家賃はとても安いと思います。 | (audio sample) |
| style="cheerful" | みなさんお楽しみに! | (audio sample) |
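When style names come from user input or configuration, it can help to normalize them against the styles listed for `ja-JP-NanamiNeural` above before putting them into SSML. The sketch below is one possible approach, not part of the service: `pick_style` is a hypothetical helper, and falling back to `None` (meaning the default speaking style) for unknown names is a design assumption.

```python
# Styles and descriptions as listed for ja-JP-NanamiNeural in this post.
NANAMI_STYLES = {
    "customerservice": "Expresses a friendly and helpful tone for customer support",
    "chat": "Expresses a casual and relaxed tone",
    "cheerful": "Expresses a positive and happy tone",
}


def pick_style(requested, supported=NANAMI_STYLES):
    """Return a supported style name, or None to use the default style."""
    if requested is None:
        return None
    # Normalize case and spacing, e.g. "Customer Service" -> "customerservice".
    key = requested.strip().lower().replace(" ", "")
    return key if key in supported else None


for name in ["Customer Service", "chat", "angry"]:
    print(name, "->", pick_style(name))
```

Returning `None` instead of raising keeps synthesis working when a style is not available for a voice, at the cost of silently dropping the requested tone; raising an error is a reasonable alternative if that matters for your application.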
Updates on other languages
As more customers adopt Azure Neural TTS, we have collected more feedback on the pronunciation accuracy of our voices in different cases. With our latest release, five voices have been updated with significant improvements in accuracy and naturalness, bringing better pronunciation and a more natural tone in four languages: id-ID, th-TH, da-DK, and vi-VN.
Hear the improvements in the samples below.
| Locale | Voice | Improvement | Sample script | Before | After |
| --- | --- | --- | --- | --- | --- |
| id-ID | Ardi | Overall quality | La lahir pada dua April seribu sembilan ratus sembilan puluh di Surakarta, Indonesia. | (audio sample) | (audio sample) |
| th-TH | Premwadee | Overall quality | เริ่มจ่ายเงินผ่าน ธ.ก.ส.ถึงมือชาวนาได้ตั้งแต่วันที่ 6 ธ.ค. 62 – 30 ก.ย.63 | (audio sample) | (audio sample) |
| da-DK | Christel | Overall quality | Sagde du noget til mig? | (audio sample) | (audio sample) |
| vi-VN | HoaiMy | Pronunciation with the Southern accent | Năm 1990, Liên Xô tan rã. | (audio sample) | (audio sample) |
| vi-VN | NamMinh | Pronunciation with the Southern accent | Năm 1990, Liên Xô tan rã. | (audio sample) | (audio sample) |
Get started
With these updates, we’re excited to be powering natural and intuitive voice experiences for more customers. Text to Speech offers over 170 neural voices across over 70 languages. In addition, the Custom Neural Voice capability enables organizations to create a unique brand voice in multiple languages and styles.
For more information:
- Try the demo
- See our documentation
- Check out our sample code