Since its launch, Azure Neural TTS has been applied to a wide range of scenarios, from voice assistants to news reading and audiobook creation. We have also seen more customer requests to support natural conversations that are casual and less formal. Today we are glad to announce a few updates to Neural TTS, with a focus on new voices that are optimized for casual conversation scenarios.
Conversational voices: scenarios and challenges
More TTS voices are being used to support human-machine conversations or machine-facilitated interpersonal communication (e.g., human conversations supported by speech-to-speech translation). In these scenarios, a more relaxed and casual speaking style is usually expected. We outline three typical scenarios for conversational voices or conversational styles below.
Customer service bot
Many enterprises are using voice-enabled chatbots or IVR systems to provide more efficient customer service and transform their traditional customer care. For example, Vodafone successfully created a natural-sounding customer service bot, TOBi, and used the AI and natural language processing capabilities in Azure to give TOBi a clear personality that makes conversations natural and fun, which drives better customer engagement. After a customer gives their name, instead of a dry request like, “Now tell me your address,” TOBi might say, “Hey, that’s a great name. Now I’d like to know where you live.” In such scenarios, the AI voice is usually expected to sound comforting, friendly, and warm while remaining professional. Besides answering customer inquiries, the AI voice is also frequently used to give cheerful greetings and show empathy to customers.
With the emergence of virtual assistant and virtual reality technology, we’ve seen more customers using neural TTS to support chit-chat and daily conversations. One challenge in making AI-human chat more natural is for the bot to understand chat language, which often contains special characters, modal particles like “hehe”, “haha”, and “ouch”, emojis, and repeated letters like “soooo good”, and to respond instantly in natural tones. In addition, expressing different emotions for different messages is in high demand, so that the chat bot can better resonate with human feelings.
Simultaneous speech translation
Speech-to-speech translation is another typical scenario where a conversational AI voice can be used. With broad coverage of over 70 languages and variants, Azure Neural TTS has been used to provide speech output for various translations. It has been challenging, however, to keep the original speaker’s style when their speech is translated into another language. Especially in casual speech scenarios, the simultaneous speaking tones often carry the subtle nuances of the speech and help the audience build emotional connections with the speaker. In such cases, an AI voice that can support simultaneous speech and capture casual speaking styles can make speech-to-speech translation more vivid and engaging.
Next, we introduce the latest updates in Azure Neural TTS conversational voices in different languages.
Sara: a new chatbot voice in English (US)
Sara, a new conversational voice in English (US), represents a young female adult who talks more casually and is best suited to chatbot scenarios. At launch, she comes with three built-in emotional styles: cheerful, sad, and angry. In addition, she can read emojis, produce laughter, sighs, or special angry sounds, and express emphasis such as “soooo good”, just like a human being would.
Check out what those sound effects sound like with the examples below.
With emoji support
That's great. I'm not working right now.
Uhhh, let me ... let me think, I eat hamburger for dinner.
Below is an example of Sara used in a chat bot scenario engaged in natural conversations with a human user. (This sample comes from a chitchat between the bot and the human user, and the dialogue is casual and may contain errors.)
With the new Sara voice, you can also adjust the speaking style using SSML and switch between the neutral, cheerful, sad, and angry tones.
I’m so happy to see you.
She felt disheartened when she was not chosen to be on the school team.
Jack’s father was fuming with anger when he could not find Jack in his room.
File this under missed connections cuz i'm lost
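As a rough illustration of switching styles through SSML, the sketch below builds an SSML document that wraps the text in an `mstts:express-as` element. The voice identifier `en-US-SaraNeural`, the style identifiers `cheerful`, `sad`, and `angry`, and the helper name `build_styled_ssml` are assumptions for this example, not confirmed by this post; check the current Azure Speech SSML documentation for the exact names supported by each voice.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespaces as used in Azure's SSML conventions (assumed here; verify in the docs).
SYNTHESIS_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_styled_ssml(text, style=None, voice="en-US-SaraNeural", lang="en-US"):
    """Return an SSML string for `text`, optionally wrapped in a speaking style.

    `style=None` produces the neutral tone (no express-as element).
    The default voice name and the style names are assumptions for illustration.
    """
    body = escape(text)  # escape &, <, > so user text cannot break the XML
    if style is not None:
        body = (
            f"<mstts:express-as style={quoteattr(style)}>"
            f"{body}</mstts:express-as>"
        )
    return (
        f'<speak version="1.0" xmlns="{SYNTHESIS_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    # One document per tone, reusing the sample sentence from above.
    for tone in (None, "cheerful", "sad", "angry"):
        print(build_styled_ssml("I'm so happy to see you.", style=tone))
```

The resulting string can be passed to whichever synthesis client you use that accepts SSML input. Building the markup with escaping keeps chat text containing characters such as `<` or `&` from producing invalid XML.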
Xiaochen and Xiaoyan: new voices in Chinese optimized for spontaneous speech and customer service scenarios
Two new conversational voices are released in Chinese (Mandarin, simplified): Xiaochen, best used for creating spontaneous speech, and Xiaoyan, best used for customer service scenarios.
These two voices have the following characteristics:
A more relaxed and casual speaking style
Conversational voices are different from voices for reading, broadcasting, or storytelling. In conversations, voices are usually more relaxed and casual, and the prosody changes often. When people talk casually, the pronunciation of each word may not be complete, the sentence may not be accurate, and the control of the voice does not need to be perfect or professional. The new voices, Xiaochen and Xiaoyan, are built to reproduce this casual speaking style well.
More natural oral expressions
In spontaneous speech, sentences are often short, and the structure can be simple or even incomplete. Repetitions, disconnections, supplements, interruptions, disfluency, and redundancy are often observed. Both the Xiaochen and Xiaoyan voices handle these patterns well, thanks to our advanced modeling technology. The imperfections in human expression are carefully designed and modeled so that the AI voices can learn from these imperfect features and sound more realistic.
The following is a simulated conversation demo in a customer service scenario. In this sample, Xiaoyan acts as a customer service assistant, and Xiaochen acts as a customer. Hear how relaxed and natural Xiaochen and Xiaoyan are when talking to each other.
Nanami is a popular Japanese voice. Three new styles are now available with Nanami: chat, customer service, and cheerful. These styles can be used to make your voice experience more engaging in various scenarios.
Customer service: Expresses a friendly and helpful tone for customer support
Chat: Expresses a casual and relaxed tone
Cheerful: Expresses a positive and happy tone
Try the samples below:
Updates on other languages
With more customers adopting Azure Neural TTS, we have also collected more feedback on the pronunciation accuracy of our voices in different cases. With our latest release, 5 voices have been updated with significant improvements in accuracy and naturalness. This brings you better pronunciation and a more natural tone in four languages: id-ID, th-TH, da-DK, and vi-VN.
Hear the improvements in the samples below.
Ia lahir pada dua April seribu sembilan ratus sembilan puluh di Surakarta, Indonesia.