Neural Text-to-Speech (Neural TTS), a powerful speech synthesis capability of Azure Cognitive Services, enables users to convert text to lifelike speech. It is used in a variety of scenarios, including voice assistants, content read-aloud capabilities, and accessibility tools. Azure Neural TTS has been incorporated into Microsoft’s flagship products such as Edge Read Aloud, Immersive Reader, and Word Read Aloud. It has also been adopted by many customers, including AT&T, Duolingo, and Progressive. Users can choose from multiple pre-set voices or record and upload their own samples to create custom voices. Over 120 languages and language variants, also known as locales, are supported.
Although TTS quality has improved considerably in recent years, customers increasingly expect TTS voices to sound more natural in scenarios such as reading different kinds of content and holding dynamic conversations. To keep advancing the state of the art in neural TTS, we work closely with research scientists to develop speech synthesis models that mirror human speech and roll these new model architectures out to the Azure Neural TTS service, so that all developers benefit.
Since last December, Azure Neural TTS has been updated with the UniTTSv4 model, which shows no significant difference from natural human recordings at the sentence level using MOS as the metric. In this blog, we introduce a new research innovation, code-named NaturalSpeech, which marks a new milestone for neural TTS: for the first time, it achieves no significant difference from natural human recordings on a popular TTS dataset (LJSpeech) using side-by-side CMOS as the metric.
The new technical innovations in this research will be further integrated and shipped to Azure Neural TTS through the Azure TTS API for all voices that we support moving forward.
Measuring TTS quality: MOS and CMOS
Text-to-speech quality is usually measured by the Mean Opinion Score (MOS), a widely recognized scoring method for speech quality evaluation. For MOS studies, participants rate speech characteristics for both recordings of people’s voices and TTS voices on a five-point scale. These characteristics include sound quality, pronunciation, speaking rate, and articulation.
While MOS can be used to compare quality across systems, it is not sensitive to small differences in voice quality because voice samples from the two systems are not paired during rating. Comparative MOS (CMOS) tests, by contrast, compare each utterance from two systems side by side, and a 7-point scale (-3 to 3) is used to measure the difference.
During research on the end-to-end TTS model NaturalSpeech, we conducted both MOS and CMOS tests to compare the TTS-generated output with human recordings. Experimental evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level, with a Wilcoxon signed rank test p-value of p >> 0.05. This demonstrates no statistically significant difference from human recordings for the first time on this dataset. It is also much better than previous TTS systems on this dataset.
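To make the difference between the two metrics concrete, here is a minimal Python sketch of how each score could be computed from listener ratings. The function names and all ratings are hypothetical and invented for illustration; they are not data from the NaturalSpeech evaluation.

```python
from statistics import mean


def mos(ratings):
    """Mean Opinion Score: average of absolute ratings (1 to 5) for one system."""
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("MOS ratings must be on the 1-5 scale")
    return mean(ratings)


def cmos(side_by_side_scores):
    """Comparative MOS: average of paired scores on the -3 to 3 scale.

    Each score is one listener's judgement of system A relative to
    system B for the same utterance; negative means A sounded worse.
    """
    if any(not -3 <= s <= 3 for s in side_by_side_scores):
        raise ValueError("CMOS scores must be on the -3 to 3 scale")
    return mean(side_by_side_scores)


# Hypothetical ratings, invented for illustration only.
tts_ratings = [5, 4, 5, 4, 4, 5]          # unpaired ratings of TTS samples
recording_ratings = [5, 4, 4, 5, 5, 4]    # unpaired ratings of recordings
side_by_side = [0, -1, 1, 0, 0, 1, -1, 0]  # TTS relative to the recording

print("MOS (TTS):", mos(tts_ratings))
print("MOS (recordings):", mos(recording_ratings))
print("CMOS (TTS vs. recordings):", cmos(side_by_side))
```

In the MOS case the two lists are rated independently, while every CMOS score already encodes a direct comparison on the same utterance, which is why CMOS is the more sensitive of the two.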
NaturalSpeech: end-to-end text-to-waveform neural TTS model
Previously, neural TTS systems were mainly split into two models: an acoustic model and a vocoder. The two models are trained separately, so there are mismatches between training and inference that can lead to suboptimal results.
We developed a fully end-to-end text-to-waveform generation system called NaturalSpeech to reduce the mismatch and bridge the quality gap to recordings (see Figure 1).
Figure 1: System overview of NaturalSpeech.
The whole system is based on a variational auto-encoder (VAE), with several key designs:
We leverage large-scale pre-training on the phoneme encoder to extract better representations from the phoneme sequence.
We leverage a fully differentiable durator that consists of a duration predictor and an upsampling layer to improve the duration modeling.
We design a bidirectional prior/posterior module based on flow models and a memory-based VAE to improve representation and mapping capabilities.
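To illustrate why a differentiable upsampling step matters for duration modeling, the sketch below contrasts hard repetition of phoneme states with a soft alternative. Note that the soft version uses Gaussian upsampling, a technique swapped in here purely for illustration; it is not the learnable upsampling layer used in NaturalSpeech, and the function names, the `sigma` parameter, and the toy inputs are all assumptions.

```python
import math


def hard_upsample(phoneme_states, durations):
    """Repeat each phoneme state round(duration) times.

    The rounding step has no useful gradient, so the duration
    predictor cannot be trained through this operation.
    """
    frames = []
    for state, d in zip(phoneme_states, durations):
        frames.extend([state] * int(round(d)))
    return frames


def soft_upsample(phoneme_states, durations, sigma=1.0):
    """Gaussian upsampling (illustrative stand-in, not the paper's layer).

    Each output frame is a softmax-weighted mix of all phoneme states,
    with weights that change smoothly as the durations change.
    """
    ends, total = [], 0.0
    for d in durations:
        total += d
        ends.append(total)
    centers = [e - d / 2 for e, d in zip(ends, durations)]
    dim = len(phoneme_states[0])
    frames = []
    for t in range(int(round(total))):
        pos = t + 0.5  # middle of frame t
        logits = [-((pos - c) ** 2) / (2 * sigma ** 2) for c in centers]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        frames.append(
            [sum(w * s[k] for w, s in zip(weights, phoneme_states)) for k in range(dim)]
        )
    return frames


# Toy inputs: three one-hot "phoneme states" and hypothetical durations in frames.
states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
durations = [2.0, 3.0, 1.0]

hard = hard_upsample(states, durations)
soft = soft_upsample(states, durations)
print(len(hard), len(soft))
```

Because the soft weights depend smoothly on the predicted durations, a loss computed on the output frames can, in principle, be backpropagated to the duration predictor, which is the property the durator design relies on.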
With the above designs, NaturalSpeech has several advantages:
Reduce training-inference mismatch. In the previous cascaded acoustic model/vocoder pipeline with explicit duration prediction, both the mel-spectrogram and the duration suffer from training-inference mismatch, since ground-truth values are used when training the vocoder and mel-spectrogram decoder while predicted values are used at inference. Our fully end-to-end text-to-waveform generation and differentiable durator avoid this mismatch.
Alleviate the one-to-many mapping problem. One text sequence can correspond to multiple speech utterances with different variation information (e.g., pitch, duration, speed, pauses, and prosody). Previous works that use only a variance adaptor to predict pitch and duration cannot handle the one-to-many mapping problem well. Our memory-based VAE and bidirectional prior/posterior reduce the complexity of the posterior and enhance the prior, which helps relieve the one-to-many mapping problem.
Improve representation capacity. Previous models are not powerful enough to extract good representations from the phoneme sequence and learn the complicated data distribution in speech. Our large-scale phoneme pre-training, memory mechanism, and powerful generative models such as flow and VAE can learn better text representations and speech data distributions.
We conducted an experimental evaluation on the LJSpeech dataset to measure the voice quality of the NaturalSpeech system. We first compare the speech generated by NaturalSpeech with recordings under MOS and CMOS evaluations, as shown in Tables 1 and 2. NaturalSpeech achieves quality scores similar to human recordings in both MOS and CMOS. Importantly, our system achieves −0.01 CMOS compared to recordings, with a Wilcoxon p-value of p >> 0.05, which demonstrates that the speech generated by our system has no statistically significant difference from human recordings.
Table 1: MOS comparison between NaturalSpeech and human recordings. Wilcoxon rank sum test is used to measure the p-value in MOS evaluation.
Table 2: CMOS comparison between NaturalSpeech and human recordings. Wilcoxon signed rank test is used to measure the p-value in CMOS evaluation.
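For readers who want to see what the signed rank test on CMOS scores involves, here is a minimal standard-library sketch. It assumes a two-sided test using the normal approximation with tie correction and no continuity correction, which is only reasonable for larger samples; the scores are hypothetical, and this is not the evaluation script used in the research.

```python
import math
from collections import Counter


def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed rank test via the normal approximation.

    `diffs` are paired differences (here, CMOS scores of one system
    relative to the other). Returns (w_plus, w_minus, p_value).
    Zero differences are discarded; tied magnitudes share an average rank.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        raise ValueError("all differences are zero")

    # Assign average ranks to the absolute differences.
    counts = Counter(abs(d) for d in nonzero)
    rank_of, start = {}, 1
    for value in sorted(counts):
        ties = counts[value]
        rank_of[value] = start + (ties - 1) / 2
        start += ties

    w_plus = sum(rank_of[abs(d)] for d in nonzero if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in nonzero if d < 0)

    # Mean and tie-corrected variance of W+ under the null hypothesis.
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24
    var_w -= sum(t ** 3 - t for t in counts.values()) / 48

    z = (w_plus - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value


# Hypothetical side-by-side scores, invented for illustration only.
scores = [1, -1, 2, -2, 0, 1, -1, 0, 1, -1]
w_plus, w_minus, p = wilcoxon_signed_rank(scores)
print(w_plus, w_minus, p)
```

A p-value well above 0.05, as reported for NaturalSpeech, means the test finds no evidence that listeners systematically prefer one system over the other.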
Listen to the samples below to compare the TTS output generated using NaturalSpeech vs. the human recordings:
Script: Maltby and Co. would issue warrants on them deliverable to the importer, and the goods were then passed to be stored in neighboring warehouses.
We have also done an ablation study, which shows that the design of each component helps improve quality in the CMOS test. As the next step, we will work on shipping the innovations in this research model into future Azure TTS service updates. At the same time, although the new model achieves great quality, TTS is not a fully solved problem. More challenging scenarios remain, such as making the voice expressive, natural in long-form content reading, and spontaneous, which will require more advanced modeling techniques to capture the expressiveness and variability of human speech.
Working to advance AI with XYZ-code in a responsible way
We are excited about the future of Neural TTS with human-centric and natural-sounding quality under the XYZ-code AI framework. Like other publicly available models, Neural TTS models are trained with billions of pages of publicly available text, and hence may have picked up biases around gender, race, and more from these public documents. Mitigating negative effects from these biases is a difficult, industrywide issue, and Microsoft is committed to the advancement and use of AI grounded in principles that put people first and benefit society. We are putting these Microsoft AI principles into practice throughout the company and have taken extensive precautionary measures to prevent these implicit biases from being exhibited when the models are used in our products. We strongly encourage developers to do the same by putting appropriate guardrails and mitigations in place before taking these models to production.
Get started with Azure Neural TTS
Azure AI Neural TTS offers over 340 neural voices across over 120 languages and locales. In addition, the capability enables organizations to create a unique brand voice in multiple languages and styles. To explore the capabilities of Neural TTS with some of its different voice offerings, try the demo.
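As a small starting point, the sketch below builds an SSML document that selects a neural voice, which can then be sent to the service (for example through the Speech SDK or the REST API). The helper name `build_ssml` is hypothetical, and the voice `en-US-JennyNeural` and locale `en-US` are example values assumed here rather than taken from this post; check the current voice list before relying on them.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Wrap plain text in a minimal SSML document for one neural voice.

    The text is XML-escaped so characters such as '&' and '<' are safe.
    The default voice and locale are example values (assumptions).
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


ssml = build_ssml("Adopted by AT&T, Duolingo, and Progressive.")
print(ssml)
```

Escaping the input matters in practice, because unescaped characters such as the ampersand in "AT&T" would otherwise produce invalid XML.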