Neural Text-to-Speech (Neural TTS), a powerful speech synthesis capability of Azure Cognitive Services, enables users to convert text to lifelike speech. It is used in scenarios such as voice assistants, content read-aloud, and accessibility tools. Azure Neural TTS has been incorporated into Microsoft’s flagship products such as Edge Read Aloud, Immersive Reader, and Word Read Aloud, and has been adopted by many customers, including AT&T, Duolingo, and Progressive. Users can choose from multiple pre-set voices or record and upload their own samples to create custom voices. Over 120 languages and language variants, also known as locales, are supported.
Although TTS quality has improved greatly in recent years, customer expectations keep rising: TTS voices are expected to sound more natural when reading different kinds of content and when holding dynamic conversations. To continue advancing the state of the art in neural TTS, we work closely with research scientists to develop speech synthesis models that mirror human speech, and we roll out these new model architectures to the Azure Neural TTS service so that all developers benefit.
Since last December, Azure Neural TTS has been updated with the UniTTSv4 model, which shows no significant difference from natural human recordings at the sentence level when measured by MOS. In this blog, we introduce a new research innovation, code-named NaturalSpeech, which marks a new milestone for neural TTS: for the first time, it achieves no significant difference from natural human recordings on a popular TTS dataset (LJSpeech) when measured by side-by-side CMOS.
Going forward, the technical innovations in this research will be integrated into Azure Neural TTS and shipped through the Azure TTS API for all the voices we support.
Text-to-speech quality is usually measured by the Mean Opinion Score (MOS), a widely recognized scoring method for speech quality evaluation. For MOS studies, participants rate speech characteristics for both recordings of people's voices and TTS voices on a five-point scale. These characteristics include sound quality, pronunciation, speaking rate, and articulation.
While MOS can be used to compare the quality of two systems, it is not very sensitive to small differences in voice quality, because voice samples from the two systems are not paired during rating. Comparative MOS (CMOS) tests, by contrast, compare each utterance from the two systems side by side, and raters score the difference on a 7-point scale (-3 to 3).
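To make the two metrics concrete, here is a minimal Python sketch of how MOS and CMOS could be computed from raw listener ratings. The function names and the rating values are hypothetical and are not data from our study; the 95% confidence interval uses a simple normal approximation, which is an assumption rather than a description of our evaluation pipeline.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for ratings on the 1-5 scale.

    The interval uses a normal approximation (1.96 standard errors).
    """
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("MOS ratings must lie between 1 and 5")
    mean = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def comparative_mos(paired_ratings):
    """Average the side-by-side ratings given on the -3..3 scale.

    Each rating compares one TTS utterance with the matching recording;
    negative values mean the rater preferred the recording.
    """
    if any(not -3 <= r <= 3 for r in paired_ratings):
        raise ValueError("CMOS ratings must lie between -3 and 3")
    return statistics.mean(paired_ratings)


# Hypothetical ratings, for illustration only.
tts_ratings = [5, 4, 4, 5, 4, 5, 4, 4]       # each system is rated on its own
human_ratings = [4, 5, 5, 4, 5, 4, 5, 4]
cmos_ratings = [0, -1, 1, 0, 0, 1, -1, 0]    # one rating per paired utterance

tts_mos, tts_ci = mean_opinion_score(tts_ratings)
human_mos, human_ci = mean_opinion_score(human_ratings)
cmos = comparative_mos(cmos_ratings)

print(f"TTS MOS:   {tts_mos:.3f} +/- {tts_ci:.3f}")
print(f"Human MOS: {human_mos:.3f} +/- {human_ci:.3f}")
print(f"CMOS (TTS vs. human): {cmos:+.2f}")
```

Because every CMOS rating is tied to a specific pair of utterances, rater-to-rater and sentence-to-sentence variation largely cancels out, which is why the paired test can detect smaller differences than two independent MOS averages.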
During our research on the end-to-end TTS model NaturalSpeech, we conducted both MOS and CMOS tests to compare the TTS-generated output with human recordings. Evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level, with a Wilcoxon signed rank test p-value of p >> 0.05. This demonstrates, for the first time on this dataset, no statistically significant difference from human recordings. NaturalSpeech is also much better than previous TTS systems on this dataset.
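For readers unfamiliar with the Wilcoxon signed rank test, the following is a minimal pure-Python sketch of the two-sided test applied to paired CMOS ratings. It uses the normal approximation with a tie correction and drops zero differences; the function name and the ratings are hypothetical, and this is not the code used in our evaluation. In practice one would more likely call an established statistics package such as `scipy.stats.wilcoxon`.

```python
import math


def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    `diffs` holds paired differences, e.g. CMOS ratings on the -3..3 scale.
    Zero differences are dropped. Returns (W_plus, p_value).
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        group_size = j - i + 1
        tie_term += group_size ** 3 - group_size
        i = j + 1

    # W_plus is the sum of ranks that belong to positive differences.
    w_plus = sum(rank for rank, d in zip(ranks, nonzero) if d > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, p_value


# Hypothetical CMOS ratings (TTS minus recording), for illustration only.
balanced = [0, -1, 1, 0, 0, 1, -1, 0, 2, -2]
w_plus, p_value = wilcoxon_signed_rank(balanced)
print(f"W+ = {w_plus}, p = {p_value:.3f}")
```

A p-value well above 0.05, as in this balanced toy example, means the test finds no evidence that raters systematically prefer one system over the other. The normal approximation is rough for small samples, so an exact test is preferable when only a few ratings are available.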
Previously, neural TTS systems were mainly split into two models, an acoustic model and a vocoder, which were trained separately. The resulting mismatches between training and inference could lead to suboptimal results.
We developed a fully end-to-end text-to-waveform generation system called NaturalSpeech to reduce the mismatch and bridge the quality gap to recordings (see Figure 1). The whole system is based on a variational auto-encoder (VAE), with several designs:
Figure 1: System overview of NaturalSpeech.
With the above designs, NaturalSpeech has several advantages:
We conducted an experimental evaluation on the LJSpeech dataset to measure the voice quality of the NaturalSpeech system. We first compared the speech generated by NaturalSpeech with recordings under MOS and CMOS evaluations, as shown in Tables 1 and 2. The NaturalSpeech system achieves quality scores similar to those of human recordings in both MOS and CMOS. Importantly, our system achieves −0.01 CMOS compared to recordings, with a Wilcoxon p-value of p >> 0.05, which demonstrates that the speech generated by our system has no statistically significant difference from human recordings.
Table 1: MOS comparison between NaturalSpeech and human recordings. The Wilcoxon rank sum test is used to measure the p-value in the MOS evaluation.
Table 2: CMOS comparison between NaturalSpeech and human recordings. The Wilcoxon signed rank test is used to measure the p-value in the CMOS evaluation.
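The MOS comparison uses the rank sum test rather than the signed rank test because MOS ratings for the two systems are collected independently and are not paired. As a rough illustration, here is a minimal pure-Python sketch of a two-sided Wilcoxon rank sum test using the normal approximation with a tie correction. The function name and ratings are hypothetical, and this is not the code behind Table 1; library routines such as `scipy.stats.mannwhitneyu` are the usual choice in practice.

```python
import math


def wilcoxon_rank_sum(sample_a, sample_b):
    """Two-sided Wilcoxon rank sum test (normal approximation, tie-corrected).

    The two samples are unpaired, e.g. MOS ratings collected separately
    for each system. Returns (rank sum of sample_a, p_value).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b
    # Pool both samples, remembering which sample each rating came from.
    pooled = sorted(
        [(value, 0) for value in sample_a] + [(value, 1) for value in sample_b],
        key=lambda item: item[0],
    )

    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied ratings share their average rank
        count_a = sum(1 for k in range(i, j + 1) if pooled[k][1] == 0)
        rank_sum_a += average_rank * count_a
        group_size = j - i + 1
        tie_term += group_size ** 3 - group_size
        i = j + 1

    mean_w = n_a * (n + 1) / 2
    var_w = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:  # every rating is identical, so nothing distinguishes the samples
        return rank_sum_a, 1.0
    z = (rank_sum_a - mean_w) / math.sqrt(var_w)
    return rank_sum_a, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical 1-5 MOS ratings, for illustration only.
tts_ratings = [5, 4, 4, 5, 4, 5, 4, 4]
human_ratings = [4, 5, 5, 4, 5, 4, 5, 4]
rank_sum, p_value = wilcoxon_rank_sum(tts_ratings, human_ratings)
print(f"rank sum = {rank_sum}, p = {p_value:.3f}")
```

MOS ratings on a five-point scale contain many ties, so the tie correction in the variance matters here; without it the test would overstate the variance and be overly conservative.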
Listen to the samples below to compare the TTS output generated using NaturalSpeech vs. the human recordings:
Script: Maltby and Co. would issue warrants on them deliverable to the importer, and the goods were then passed to be stored in neighboring warehouses.
NaturalSpeech | Human recording |
---|---|
We also conducted an ablation study, which shows that the design of each component helps improve quality in the CMOS test. As the next step, we will work on shipping the innovations from this research model in future Azure TTS service updates. At the same time, while the new model achieves great quality, TTS is not a fully solved problem. More challenging scenarios remain, including making voices expressive, natural in long-form content reading, and spontaneous, which will require more advanced modeling techniques to capture the expressiveness and variation of human speech.
We are excited about the future of Neural TTS with human-centric and natural-sounding quality under the XYZ-Code AI framework. Like other publicly available models, Neural TTS models are trained with billions of pages of publicly available text, and hence may have picked up biases around gender, race, and more from these public documents. Mitigating negative effects from these biases is a difficult, industry-wide issue, and Microsoft is committed to the advancement and use of AI grounded in principles that put people first and benefit society. We are putting these Microsoft AI principles into practice throughout the company and have taken extensive precautionary measures to prevent these implicit biases from being exhibited when the models are used in our products. We strongly encourage developers to do the same by putting appropriate guardrails and mitigations in place before taking these models to production.
Azure AI Neural TTS offers over 340 neural voices across over 120 languages and locales. In addition, the capability enables organizations to create a unique brand voice in multiple languages and styles. To explore the capabilities of Neural TTS with some of its different voice offerings, try the demo.
For more information: