This post is co-authored with Bohan Li, Yanqing Liu, Xu Tan and Sheng Zhao.
At the //Build conference in 2021, Microsoft announced a few key updates to Neural TTS, a Speech capability of Cognitive Services. These updates include a multilingual voice (JennyMultilingualNeural) that can speak 14 languages, and a new preview feature in Custom Neural Voice that allows customers to create a brand voice that speaks different languages.
In this blog, we introduce the technology advancement behind these feature updates: Uni-TTSv3.
Recent challenges in Neural TTS
Neural TTS converts text into lifelike speech. The more natural the voice is, the more convincing the AI becomes. With the Uni-TTS model introduced in July 2020, we were able to produce near-human-parity voices with high performance. In November 2020, the new LR-Uni-TTS model let us expand neural TTS to more locales quickly, including low-resource languages. Applying the same Uni-TTS technology to the Custom Neural Voice feature made it possible for customers including BBC, Progressive, Swisscom, AT&T and more to create a natural brand voice for their business.
With a growing number of customers adopting neural TTS and Custom Neural Voice for more scenarios, we saw new challenges that urge us to bring the technology to the next level.
A single voice that can speak multiple languages. Now more than ever, developers are expected to build voice-enabled applications that can reach a global audience. For example, NPCs in virtual games can talk to users with the same voice in different languages. Customer service bots can switch languages and respond to their user inquiries in different markets seamlessly.
A more reliable and efficient platform that can run on different voice data, including data from customers. To support more customization scenarios, neural TTS must be robust enough to handle varied customer data, because customers may not record their voice samples in an ideal environment or provide consistent, clean audio as training data.
Uni-TTSv3, the next generation of neural TTS, was released to address these challenges and empower these features.
In previous blogs, we introduced the three key components of neural TTS. Neural text analysis converts text to pronunciations. The neural acoustic model predicts acoustic features, such as the mel spectrogram, from text. The neural vocoder converts acoustic features into wave samples.
Acoustic features represent the content, speaker, and prosody information of the utterances.
Speaker identity usually represents the brand of the voice, so it is important to preserve. Prosody, such as tone, breaks or emphasis, affects the naturalness of synthetic speech. Neural acoustic models, like Microsoft Transformer TTS and FastSpeech models, learn from the recording data and predict acoustic features much better than traditional acoustic models. As a result, they generate better prosody and speaker similarity.
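To make the data flow between the three components concrete, here is a minimal interface sketch. It is not the production system: the three functions are toy stand-ins that only produce arrays of plausible shapes, and `N_MELS`, `HOP_LENGTH` and `frames_per_phoneme` are assumed values chosen for illustration.

```python
import numpy as np

N_MELS = 80       # assumed number of mel bins per frame
HOP_LENGTH = 256  # assumed number of audio samples per mel frame


def text_analysis(text):
    """Stand-in for neural text analysis: text -> pronunciation units (here, letters)."""
    return [ch for ch in text.lower() if ch.isalpha()]


def acoustic_model(phonemes, frames_per_phoneme=5):
    """Stand-in acoustic model: phonemes -> mel-spectrogram of shape (frames, N_MELS)."""
    return np.zeros((len(phonemes) * frames_per_phoneme, N_MELS))


def vocoder(mel):
    """Stand-in vocoder: mel frames -> waveform samples."""
    return np.zeros(mel.shape[0] * HOP_LENGTH)


phonemes = text_analysis("Hello")
mel = acoustic_model(phonemes)
wave = vocoder(mel)
print(len(phonemes), mel.shape, wave.shape)  # 5 (25, 80) (6400,)
```

The point of the sketch is only the chain of representations: symbols, then frame-level acoustic features, then samples.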
The previous version of Uni-TTS (v2) is a teacher-student model. It requires three training stages: training a teacher model, fine-tuning the teacher, and training student models. This is complex and costly, particularly for Custom Neural Voice, which allows customers to create a voice model through complete self-service.
To scale model training to a larger number of voices, we recently upgraded the neural acoustic model to Uni-TTSv3, which is more robust and cost-effective.
How Uni-TTSv3 works
The diagram below describes the training pipeline of Uni-TTSv3.
Training pipeline of Uni-TTSv3
The Uni-TTSv3 model is trained on 3,000 hours of human recording data covering multiple speakers and multiple locales. This well-trained model can speak multiple languages with different speaker identities and serves as the Uni-TTSv3 base model (or source model). The base model is then fine-tuned on the target speaker's data to create a TTS model for that speaker. In the fine-tuning stage, a denoising module is used to eliminate the effect of noise in the recordings, which improves the quality of the output model.
Uni-TTSv3 models are based on FastSpeech 2 with additional enhancements. The diagram below describes the model structure:
Uni-TTSv3 model structure
The Uni-TTSv3 model is a non-autoregressive text-to-speech model that is trained directly from recordings, so it does not need a teacher-student training process.
The encoder module takes phonemes as input (a phoneme is the unit of sound that can distinguish one word from another in a particular language) and outputs a representation of the input text. The duration and pitch predictors predict the total time taken by each phoneme (the phoneme duration) and the relative highness or lowness of a tone as perceived by humans (the pitch). During training, the phoneme duration and pitch are extracted by an external tool; during synthesis, they are predicted by the predictor modules. After pitch and duration prediction, the encoder outputs are expanded according to the duration and fed into the mel-spectrogram decoder. The decoder then outputs the mel-spectrogram, which is the final set of acoustic features.
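The expansion step described above, where encoder outputs are stretched according to phoneme duration, can be sketched as follows. This is a minimal illustration in the style of a FastSpeech length regulator, not the Uni-TTSv3 implementation; the function name `length_regulate`, the hidden size of 2 and the toy durations are all assumptions.

```python
import numpy as np


def length_regulate(encoder_out, durations):
    """Repeat each phoneme's hidden vector durations[i] times along the time axis.

    encoder_out: array of shape (num_phonemes, hidden)
    durations:   integer array of shape (num_phonemes,), in mel frames
    returns:     array of shape (sum(durations), hidden)
    """
    if encoder_out.shape[0] != durations.shape[0]:
        raise ValueError("one duration per phoneme is required")
    return np.repeat(encoder_out, durations, axis=0)


# Toy input: 3 phonemes with a hidden size of 2.
encoder_out = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
durations = np.array([2, 1, 3])

frames = length_regulate(encoder_out, durations)
print(frames.shape)  # (6, 2)
```

After expansion, the sequence has one vector per mel frame, so the decoder can produce the frame-level mel-spectrogram in parallel rather than one frame at a time.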
In addition, the Uni-TTSv3 model has a phoneme-level fine-grained style embedding, which helps boost the naturalness of synthesized speech. In the training stage, the fine-grained style module extracts fine-grained prosody from the mel-spectrogram; during speech synthesis, the style is predicted from the encoder output instead. To train the model on multi-speaker, multi-lingual data, a speaker ID and a locale ID are added to control the timbre and accent of the synthesized speech. Finally, the encoder output, speaker information, locale information and fine-grained style are fed into the decoder to generate acoustic features.
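One common way to feed this kind of conditioning into a decoder is to look up an embedding vector for each ID and add it to the encoder output. The sketch below illustrates that idea only. The post does not say how Uni-TTSv3 combines these signals, so the additive combination, the table sizes, the hidden size and the function name `condition` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8  # assumed hidden size

# Hypothetical lookup tables: one vector per speaker ID and per locale ID.
speaker_table = rng.normal(size=(4, HIDDEN))
locale_table = rng.normal(size=(3, HIDDEN))


def condition(encoder_out, style, speaker_id, locale_id):
    """Add utterance-level speaker and locale vectors, plus phoneme-level style,
    to the encoder output. All per-phoneme inputs have shape (num_phonemes, HIDDEN)."""
    if encoder_out.shape != style.shape:
        raise ValueError("style needs one vector per phoneme")
    # The (HIDDEN,) speaker and locale vectors broadcast over every phoneme.
    return encoder_out + style + speaker_table[speaker_id] + locale_table[locale_id]


encoder_out = rng.normal(size=(5, HIDDEN))  # 5 phonemes
style = rng.normal(size=(5, HIDDEN))        # fine-grained style, one vector per phoneme

out_a = condition(encoder_out, style, speaker_id=0, locale_id=2)
out_b = condition(encoder_out, style, speaker_id=1, locale_id=2)
print(out_a.shape)  # (5, 8)
```

Here `out_a` and `out_b` share the same text, style and locale and differ only in the speaker vector, which mirrors how a speaker ID can change timbre while the content stays fixed.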
Benefits of Uni-TTSv3
The Uni-TTSv3 model has empowered the Azure Text-to-Speech platform and Custom Neural Voice to support multi-lingual voices. With Uni-TTSv3, the Custom Neural Voice training pipeline has also been upgraded to allow customers to create a high-quality voice model with less training time.
Less training time for a new voice
Uni-TTSv3 simplifies the training process compared to teacher-student training. It significantly reduces the time spent on the acoustic training parts, to around 50%. With this improvement, Custom Neural Voice now allows customers to create a custom voice model in a much shorter time, which helps them save significantly on voice training costs.
More robust to handle customer training data at different quality levels
Extensive tests of Uni-TTSv3 in more than 40 languages demonstrated that it achieves better, or at least the same, voice quality compared to the teacher-student model (Uni-TTSv2), while training time is reduced. The whole training process is stable because it is based on phoneme alignments. Bad cases are also reduced, including the skipping and repeating issues seen in some attention-based models.
In addition, the denoising module helps to remove the typical noise patterns in the training data while keeping the voice fidelity. This makes the Custom Neural Voice platform more robust in handling different customer data.
[Table: comparison of two models whose average training compute is 50 hours and 30 hours, respectively; surviving row label: Chinese (Mandarin, Simplified)]
More support for cross-lingual and multi-lingual voices
Trained on multi-lingual and multi-speaker datasets, Uni-TTSv3 can enable a voice to speak multiple languages even without training data from the same human speaker in those target languages.
With Uni-TTSv3, a powerful multilingual voice, JennyMultilingualNeural, is released to the Azure TTS platform and enables developers to keep their AI voice consistent while serving different locales. Check the samples here. The JennyMultilingualNeural voice is, as far as we know, the first real-time production voice in the industry that can speak multiple languages naturally with the same timbre. The average MOS score of Jenny across all supported languages is above 4.2 on a scale of 5.
Uni-TTSv3 has also been integrated into Custom Neural Voice to power cross-lingual features. It allows customers to create a voice that speaks different languages by providing speech samples collected in just one language as training data. This saves customers effort and cost when creating voices for different markets, which usually requires selecting a new voice talent and making new recordings through a lengthy process for each language.
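As a hedged illustration of how a developer might ask one voice to speak several languages, the sketch below builds an SSML document that wraps each text segment in a `lang` element. The full voice name `en-US-JennyMultilingualNeural`, the helper `build_multilingual_ssml` and the sample sentences are assumptions that do not appear in this post; check the Speech service documentation for the exact markup and supported locales. The code only builds and validates the XML string and does not call any service.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_multilingual_ssml(voice_name, segments, default_locale="en-US"):
    """Return an SSML string in which each (locale, text) pair gets its own <lang> element."""
    parts = [
        f"<lang xml:lang={quoteattr(locale)}>{escape(text)}</lang>"
        for locale, text in segments
    ]
    body = "".join(parts)
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(default_locale)}>'
        f"<voice name={quoteattr(voice_name)}>{body}</voice>"
        "</speak>"
    )


ssml = build_multilingual_ssml(
    "en-US-JennyMultilingualNeural",  # assumed full voice name
    [("en-US", "Welcome back."), ("de-DE", "Guten Tag & willkommen.")],
)

# Parsing raises ParseError if the generated markup is not well-formed XML.
root = ET.fromstring(ssml)
print(ssml)
```

Escaping the text with `escape` keeps characters such as `&` from breaking the markup, which matters when the segments come from user input.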
Whether you are building a voice-enabled chatbot or IoT device, an IVR solution, adding read-aloud features to your app, converting e-books to audiobooks, or even adding Speech to a translation app, you can make all these experiences natural sounding and fun with Neural TTS.
Let us know how you are using or plan to use Neural TTS voices in this form. You can also open issues here for speech SDK issues.
We look forward to hearing about your experience and developing more compelling services together with you for developers around the world.