This post is co-authored with Yulin Li, Yinhe Wei, Qinying Liao, Yueying Liu, Sheng Zhao
Voice is becoming increasingly popular for providing useful and engaging experiences for customers and employees. The Text-to-Speech (TTS) capability of Speech on Azure Cognitive Services allows you to quickly create intelligent read-aloud experiences for your scenarios.
In this blog, we’ll walk through an exercise, which you can complete in under two hours, to get started using Azure neural TTS voices and enable your apps to read content aloud. We’ll provide high-level guidance and sample code to get you started, and we encourage you to play around with the code and get creative with your solution!
Read-aloud is a modern way to help people read and consume content such as emails and Word documents more easily. It is a popular feature in many Microsoft products and has received highly positive user feedback. A few recent examples:
With all these examples and more, we’ve seen a clear trend toward voice experiences for users who consume content on the go, while multitasking, or who simply prefer to listen rather than read. With Azure neural TTS, it is easy to implement your own read-aloud experience that is pleasant for your users to listen to.
Azure neural TTS allows you to choose from more than 140 highly realistic voices across 60 languages and variants that enable fluid, natural-sounding speech, along with rich customization capabilities.
Why is neural TTS so much better? Traditional TTS is a complex, multi-step pipeline. Each step could involve human effort, expert rules, or individual models, with no end-to-end optimization across them, so the quality is not optimal. AI-based neural TTS simplifies the pipeline into three major components, each modeled by deep neural networks: a neural text analysis module, which generates more accurate pronunciations for TTS to speak; a neural acoustic model, like uni-TTS, which predicts prosody much better than traditional TTS; and a neural vocoder, like HiFiNet, which creates audio in higher fidelity.
With all these components, Azure neural TTS makes the listening experience much more enjoyable than traditional TTS. Our studies repeatedly show that a read-aloud experience built on the highly natural voices of the Azure neural TTS platform significantly increases the time people spend listening continuously to the synthetic speech, and greatly improves the effectiveness of their consumption of the audio content.
Reading content often comes in many different languages. To read aloud more content and reach more users, TTS needs to support many locales. Azure neural TTS now supports more than 60 languages off the shelf. Check out the details in the full language list.
By offering more voices across more languages and locales, we anticipate developers across the world will be able to build applications that change experiences for millions. With our innovative voice models in the low-resource setting, we can also extend to new languages much faster than ever.
Azure neural TTS provides a rich choice of speaking styles that resonate with your content. For example, the newscast style is optimized for reading news content in a professional tone. The customer service style helps you create a friendlier reading experience for conversational content focused on customer support. In addition, various emotional styles and role-play capabilities can be used to create vivid audiobooks in synthetic voices.
Here are some examples of the voices and styles used for different types of content.
| Language | Content type | Sample | Note |
| --- | --- | --- | --- |
| English (US) | Newscast | Aria, in the newscast style | |
| English (US) | Newscast | Guy, in the general/default style | |
| English (US) | Conversational | Jenny, in the chat style | |
| English (US) | Audiobook | Jenny, in multiple styles | |
| Chinese (Mandarin, simplified) | Newscast | Yunyang, in the newscast style | |
| Chinese (Mandarin, simplified) | Conversational | Yunxi, in the assistant style | |
| Chinese (Mandarin, simplified) | Audiobook | Multiple voices: Xiaoxiao and Yunxi | Styles used: lyrical, calm, angry, disgruntled, embarrassed, with different style degrees applied |
These styles can be adjusted using SSML, together with other tuning capabilities, including rate, pitch, pronunciation, pauses, and more.
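As a minimal sketch, an SSML payload for the newscast style might look like the snippet below; the voice name, prosody values, and sample text are illustrative assumptions, and speakSsmlAsync is the Speech SDK method that accepts SSML in place of plain text.
// a minimal sketch of style tuning via SSML; voice, style, prosody, and text are examples
const ssml = `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="newscast">
      <prosody rate="-5%" pitch="+2%">The morning briefing starts in five minutes.</prosody>
    </mstts:express-as>
  </voice>
</speak>`;
// pass SSML instead of plain text (synthesizer created as in the walkthrough below)
synthesizer.speakSsmlAsync(ssml, result => synthesizer.close(), error => synthesizer.close());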
Besides the rich choice of prebuilt neural voices, Azure TTS provides a powerful capability to create a one-of-a-kind custom voice that can differentiate your brand from others. Using Custom Neural Voice, you can build a highly realistic voice with less than 30 minutes of audio as training data. You can then use your customized voice to create a unique read-aloud experience that reflects your brand identity or resonates with the characteristics of your content.
Next, we’ll walk you through the coding exercise of developing the read-aloud feature with Azure neural TTS.
It is incredibly easy to add a read-aloud capability powered by Azure neural TTS to your application with the Speech SDK. Below we describe two typical designs to enable read-aloud for different scenarios.
If you don't have an Azure subscription, create a free account before you begin. If you have a subscription, log in to the Azure Portal and create a Speech resource.
In this design, the client interacts directly with Azure TTS using the Speech SDK. The following steps, with JavaScript code samples, walk you through the basic process of implementing read-aloud.
First, create the synthesizer with the selected language and voices. Make sure you select a neural voice to get the best quality.
const config = SpeechSDK.SpeechConfig.fromAuthorizationToken("YourAuthorizationToken", "YourSubscriptionRegion");
config.speechSynthesisVoiceName = voice; // e.g. "en-US-JennyNeural"
config.speechSynthesisOutputFormat = SpeechSDK.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm;
// set the endpoint id if you are using custom voice
// config.endpointId = "YourEndpointId";
// route the synthesized audio to the default speaker
const player = new SpeechSDK.SpeakerAudioDestination();
const audioConfig = SpeechSDK.AudioConfig.fromSpeakerOutput(player);
var synthesizer = new SpeechSDK.SpeechSynthesizer(config, audioConfig);
Then hook up the events from the synthesizer and the player. These events can be used to update the UX while read-aloud is in progress.
player.onAudioEnd = function (_) {
    window.console.log("playback finished");
    // update your UX
};
The word boundary event fires during synthesis. Usually, synthesis runs much faster than playback, so a word boundary event fires before the corresponding audio chunk is played. The application can collect these events and the timestamp information of the audio for the next step.
var wordBoundaryList = [];
synthesizer.wordBoundary = function (s, e) {
    window.console.log(e);
    wordBoundaryList.push(e);
};
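With the events hooked up, start the synthesis itself. The sketch below reads the text from the same synthesisText input element used in the highlighting code that follows.
// kick off synthesis; wordBoundary events fire as audio is produced,
// and playback starts as soon as the first audio chunk is available
synthesizer.speakTextAsync(
    synthesisText.value,
    function (result) {
        window.console.log("synthesis finished");
        synthesizer.close();
        synthesizer = undefined;
    },
    function (err) {
        window.console.log(err);
        synthesizer.close();
        synthesizer = undefined;
    });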
You can then highlight the word as the audio plays, using the code sample below.
setInterval(function () {
    if (player !== undefined) {
        const currentTime = player.currentTime; // playback position, in seconds
        var wordBoundary;
        for (const e of wordBoundaryList) {
            // audioOffset is in ticks (100 ns); divide by 10,000 to get milliseconds
            if (currentTime * 1000 > e.audioOffset / 10000) {
                wordBoundary = e;
            } else {
                break;
            }
        }
        if (wordBoundary !== undefined) {
            // wrap the word currently being spoken in a highlight element (CSS class assumed)
            highlightDiv.innerHTML = synthesisText.value.substr(0, wordBoundary.textOffset) +
                "<span class='highlight'>" + wordBoundary.text + "</span>" +
                synthesisText.value.substr(wordBoundary.textOffset + wordBoundary.wordLength);
        } else {
            highlightDiv.innerHTML = synthesisText.value;
        }
    }
}, 50);
See the full example here for more details.
In this design, the client interacts with a middle-layer service, which in turn interacts with Azure TTS through the Speech SDK. It is suitable for the following scenarios:
Below is a reference architecture for such a design:
The roles of each component in this architecture are described below.
Check here for the sample code to call the Azure TTS API from the server.
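As an illustration only (not the official sample), a middle-layer endpoint might look like the sketch below, assuming Node.js with Express; the route, variable names, voice, and error handling are illustrative assumptions.
// a minimal sketch of a middle-layer TTS endpoint, assuming Node.js with Express
const express = require("express");
const sdk = require("microsoft-cognitiveservices-speech-sdk");

const app = express();

app.get("/readaloud", (req, res) => {
    const config = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    config.speechSynthesisVoiceName = "en-US-JennyNeural"; // example voice
    config.speechSynthesisOutputFormat = sdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3;

    // passing null for the audio config keeps the audio in memory instead of playing it
    const synthesizer = new sdk.SpeechSynthesizer(config, null);
    synthesizer.speakTextAsync(
        req.query.text,
        result => {
            synthesizer.close();
            res.set("Content-Type", "audio/mpeg");
            res.send(Buffer.from(result.audioData)); // audioData is an ArrayBuffer
        },
        err => {
            synthesizer.close();
            res.status(500).send(err);
        });
});

app.listen(3000);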
Compared to the client-side design, server-side read-aloud is a more advanced solution. It can cost more, but it is more powerful for handling complicated requirements.
The section above shows you how to build a read-aloud feature in the client and service scenarios. Below are some recommended practices that can help to make your development more efficient and improve your service experience.
When the content to read is long, it’s a good practice to segment the content into sentences or short paragraphs and send one segment per request. Such segmentation has several benefits; a simple segmentation sketch is shown below.
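The regular expression below is a naive assumption for illustration, not a production-grade sentence tokenizer, and longText, onResult, and onError are hypothetical names.
// naive sentence segmentation, for illustration only;
// real content may need a locale-aware tokenizer
function segmentText(text) {
    return text.match(/[^.!?]+[.!?]+\s*/g) || [text];
}

// issue one synthesis request per sentence; the SDK queues the requests in order
for (const sentence of segmentText(longText)) {
    synthesizer.speakTextAsync(sentence, onResult, onError);
}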
Using the Speech SDK’s PullAudioOutputStream, the synthesized audio from each request can easily be merged into one stream.
Streaming is critical for lowering latency. When the first audio chunk is available, you can start playback or immediately forward the chunks to your clients. The Speech SDK provides PullAudioOutputStream, PushAudioOutputStream, the Synthesizing event, and AudioDataStream for streaming. You can select the one that best suits the architecture of your application. Find the samples here.
In addition, with the stream objects of the Speech SDK, you can get a seekable in-memory audio stream, which works easily with any downstream services. An illustrative pull-stream sketch follows.
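The buffer size and the pump loop below are illustrative assumptions; the stream and synthesizer calls are from the Speech SDK for JavaScript.
// create a pull stream so synthesized audio accumulates in memory as it arrives
const stream = SpeechSDK.AudioOutputStream.createPullStream();
const streamConfig = SpeechSDK.AudioConfig.fromStreamOutput(stream);
const streamSynthesizer = new SpeechSDK.SpeechSynthesizer(config, streamConfig);

streamSynthesizer.speakTextAsync(text,
    () => streamSynthesizer.close(),
    () => streamSynthesizer.close());

// read chunks off the stream as they become available and forward them downstream
async function pump() {
    const buffer = new ArrayBuffer(4096);
    let bytesRead = await stream.read(buffer);
    while (bytesRead > 0) {
        // forward buffer.slice(0, bytesRead) to the player or to your clients
        bytesRead = await stream.read(buffer);
    }
}
pump();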
Whether you are building a voice-enabled chatbot or IoT device, an IVR solution, adding read-aloud features to your app, converting e-books to audio books, or even adding Speech to a translation app, you can make all these experiences natural sounding and fun with Neural TTS.
Let us know how you are using or plan to use Neural TTS voices in this form. If you prefer, you can also contact us at mstts [at] microsoft.com. We look forward to hearing about your experience and developing more compelling services together with you for the developers around the world.
Add voice to your app in 15 minutes
Explore the available voices in this demo
Deploy Azure TTS voices on prem with Speech Containers
Learn more about other Speech scenarios