This post is co-authored with Bohan Li, Yanqing Liu, Xu Tan and Sheng Zhao.
At the //Build conference in 2021, Microsoft announced a few key updates to Neural TTS, a Speech capability of Cognitive Services. These updates include a multilingual voice (JennyMultilingualNeural) that can speak 14 languages, and a new preview feature in Custom Neural Voice that allows customers to create a brand voice that speaks different languages.
In this blog, we introduce the technology advancement behind these feature updates: Uni-TTSv3.
Neural TTS converts text into lifelike speech. The more natural the voice is, the more convincing the AI becomes. With the Uni-TTS model introduced in July 2020, we were able to produce near-human-parity voices with high performance. Later, in November 2020, the new LR-Uni-TTS model let us quickly expand neural TTS to more locales, including low-resource languages. Applying the same Uni-TTS technology to the Custom Neural Voice feature made it possible for customers including BBC, Progressive, Swisscom, AT&T, and more to create natural brand voices for their businesses.
With a growing number of customers adopting neural TTS and Custom Neural Voice for more scenarios, we saw new challenges that urged us to take the technology to the next level.
Uni-TTSv3, the next generation of neural TTS, was released to address these challenges and empower these features.
In previous blogs, we introduced the three key components of neural TTS: neural text analysis converts text to pronunciations; the neural acoustic model predicts acoustic features, such as the mel spectrogram, from the text; and the neural vocoder converts acoustic features into waveform samples.
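The three-stage pipeline can be sketched as follows. This is an illustrative toy, not the actual Microsoft implementation: each function stands in for a trained neural network, and the lexicon, frame size, and hop size are invented for the example.

```python
def text_analysis(text: str) -> list[str]:
    """Neural text analysis: convert text to a phoneme sequence.
    (A toy lexicon lookup stands in for a learned grapheme-to-phoneme model.)"""
    toy_lexicon = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(toy_lexicon.get(word, ["SPN"]))  # SPN = unknown token
    return phonemes

def acoustic_model(phonemes: list[str]) -> list[list[float]]:
    """Neural acoustic model: predict acoustic features (e.g. mel-spectrogram
    frames) from phonemes. Here each phoneme maps to one dummy 80-dim frame."""
    return [[0.0] * 80 for _ in phonemes]

def vocoder(mel_frames: list[list[float]]) -> list[float]:
    """Neural vocoder: convert acoustic features into waveform samples.
    Here each frame expands into hop_size dummy samples."""
    hop_size = 256
    return [0.0] * (len(mel_frames) * hop_size)

phonemes = text_analysis("hello world")
mel = acoustic_model(phonemes)
wave = vocoder(mel)
print(len(phonemes), len(mel), len(wave))  # 8 phonemes -> 8 frames -> 2048 samples
```

In the real system each stage is a neural model and the acoustic model produces many frames per phoneme, but the data flow is the same.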
Acoustic features represent the content, speaker, and prosody information of the utterances.
Speaker identity usually represents the brand of the voice, so it is important to preserve. Prosody, such as tone, breaks, and emphasis, affects the naturalness of synthetic speech. Neural acoustic models, like Microsoft's Transformer TTS and FastSpeech models, predict acoustic features much better than traditional acoustic models by learning from the recording data, and thus generate better prosody and speaker similarity.
The previous version of Uni-TTS (v2) is a teacher-student model. It requires three training stages: training a teacher model, fine-tuning the teacher, and training student models. This is complex and costly, particularly for Custom Neural Voice, which allows customers to create a voice model through complete self-service.
To scale model training to support a larger number of voices, we recently upgraded the neural acoustic model to Uni-TTSv3, which is more robust and cost-effective.
The diagram below describes the training pipeline of Uni-TTSv3.
The Uni-TTSv3 model is trained on 3,000 hours of human recording data covering multiple speakers and locales. This well-trained model can speak multiple languages with different speaker identities, and serves as the Uni-TTSv3 base model (or source model). The base model is then fine-tuned with data from a target speaker to create a TTS model for that speaker. In the fine-tuning stage, a denoising module eliminates the effect of noise in the recordings, so the quality of the output model improves over what the raw data alone would allow.
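The two-stage pipeline can be sketched as below. All names, data structures, and "gradient steps" here are hypothetical stand-ins; real training uses large multi-speaker, multi-locale corpora and a learned denoising module.

```python
def train_base_model(multilingual_corpus):
    """Stage 1: train one base (source) model on recordings spanning
    many speakers and locales (~3,000 hours in the real system)."""
    model = {"weights": 0.0, "speakers": set(), "locales": set()}
    for utterance in multilingual_corpus:
        model["speakers"].add(utterance["speaker"])
        model["locales"].add(utterance["locale"])
        model["weights"] += 0.01  # stand-in for a gradient step
    return model

def denoise(mel):
    """Stand-in for the denoising module: suppress background noise
    in the target-speaker recordings before fine-tuning."""
    return [max(frame, 0.0) for frame in mel]

def fine_tune(base_model, target_speaker_data):
    """Stage 2: adapt the base model to a single target speaker."""
    model = dict(base_model)  # start from the base model's weights
    for utterance in target_speaker_data:
        clean_mel = denoise(utterance["mel"])
        model["weights"] += 0.001 * len(clean_mel)  # stand-in gradient step
    return model

corpus = [{"speaker": "A", "locale": "en-US"}, {"speaker": "B", "locale": "zh-CN"}]
base = train_base_model(corpus)
target = [{"mel": [0.5, -0.2, 0.3]}]
voice = fine_tune(base, target)
print(len(base["locales"]), round(voice["weights"], 3))  # 2 0.023
```

The key design point is that every custom voice starts from the shared multilingual base model rather than being trained from scratch, which is what makes self-service voice creation tractable.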
Uni-TTSv3 models are based on FastSpeech 2 with additional enhancements. The diagram below describes the model structure:
The Uni-TTSv3 model is a non-autoregressive text-to-speech model trained directly on recordings, so it does not need a teacher-student training process.
The encoder module takes phonemes, the units of sound that distinguish one word from another in a particular language, as input and outputs a representation of the input text. The duration and pitch predictors predict the total time taken by each phoneme (the phoneme duration) and the relative highness or lowness of a tone as perceived by humans (the pitch). During training, the phoneme durations and pitch are extracted by an external tool; during synthesis, they are produced by the predictor modules. After pitch and duration prediction, the encoder outputs are expanded according to the durations and fed into the mel-spectrogram decoder, which outputs the mel spectrogram, the final acoustic features.
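The expansion step, often called length regulation in FastSpeech-style models, can be shown concretely. The shapes, durations, and pitch values below are illustrative only:

```python
import numpy as np

def length_regulate(encoder_out: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Expand (num_phonemes, hidden) encoder outputs to (num_frames, hidden):
    each phoneme vector is repeated for as many frames as its predicted duration."""
    return np.repeat(encoder_out, durations, axis=0)

num_phonemes, hidden = 4, 8
encoder_out = np.arange(num_phonemes * hidden, dtype=float).reshape(num_phonemes, hidden)
durations = np.array([3, 5, 2, 4])             # predicted frames per phoneme
pitch = np.array([120.0, 180.0, 150.0, 90.0])  # predicted pitch (Hz) per phoneme;
                                               # in the full model this would be
                                               # embedded and added to the frames

frames = length_regulate(encoder_out, durations)
print(frames.shape)  # (14, 8): 3 + 5 + 2 + 4 = 14 decoder input frames
```

Because the decoder sees all 14 frames at once rather than generating them one by one, synthesis is non-autoregressive and fast.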
In addition, the Uni-TTSv3 model has a phoneme-level fine-grained style embedding, which helps boost the naturalness of synthesized speech. The fine-grained style module extracts fine-grained prosody from the mel spectrogram during training, and the style is predicted by the encoder during synthesis. To train the model on multi-speaker, multilingual data, a speaker ID and a locale ID are added to control the timbre and accent of the synthesized speech. Finally, the encoder output, speaker information, locale information, and fine-grained style are fed into the decoder to generate the acoustic features.
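A loose sketch of this conditioning: learned speaker and locale embeddings plus a phoneme-level style embedding are combined with the encoder output before decoding. The table sizes and the simple additive combination are assumptions for illustration, not the exact Uni-TTSv3 design.

```python
import numpy as np

rng = np.random.default_rng(0)
num_speakers, num_locales, num_styles, hidden = 100, 40, 64, 8

# Learned lookup tables (randomly initialized here for the sketch).
speaker_table = rng.normal(size=(num_speakers, hidden))
locale_table = rng.normal(size=(num_locales, hidden))
style_table = rng.normal(size=(num_styles, hidden))

def condition(encoder_out, speaker_id, locale_id, style_ids):
    """Add utterance-level speaker/locale embeddings and a per-phoneme
    fine-grained style embedding to the encoder output."""
    out = encoder_out + speaker_table[speaker_id] + locale_table[locale_id]
    return out + style_table[style_ids]  # style varies per phoneme

encoder_out = np.zeros((5, hidden))      # 5 phonemes
decoder_in = condition(encoder_out, speaker_id=7, locale_id=3,
                       style_ids=np.array([1, 1, 4, 9, 9]))
print(decoder_in.shape)  # (5, 8): same phoneme-level shape, now conditioned
```

Because speaker and locale enter as separate embeddings, the same speaker identity can in principle be combined with any locale, which is the mechanism behind the cross-lingual capability described later.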
The Uni-TTSv3 model has empowered the Azure Text-to-Speech platform and Custom Neural Voice to support multilingual voices. With Uni-TTSv3, the Custom Neural Voice training pipeline is also upgraded to allow customers to create a high-quality voice model with less training time.
Uni-TTSv3 simplifies the training process compared to teacher-student training, cutting acoustic model training time by around 50%. With this improvement, Custom Neural Voice now allows customers to create a custom voice model in a much shorter time, helping them save significantly on voice training costs.
Extensive tests of Uni-TTSv3 in more than 40 languages demonstrated voice quality that is better than, or at least on par with, the teacher-student model (Uni-TTSv2), while training time is reduced. The whole training process is stable thanks to the phoneme alignments. Bad cases are also reduced, including the word-skipping and repeating issues seen in some attention-based models.
In addition, the denoising module helps to remove the typical noise patterns in the training data while keeping the voice fidelity. This makes the Custom Neural Voice platform more robust in handling different customer data.
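The actual denoising module is a learned component of the pipeline; as a loose illustration of the underlying idea, the classic spectral-gating approach estimates a per-band noise floor from low-energy frames and subtracts it from the spectrogram:

```python
import numpy as np

def spectral_gate(mel: np.ndarray, floor_percentile: float = 10.0) -> np.ndarray:
    """Estimate a per-mel-band noise floor from the quietest frames and
    subtract it, clipping at zero. mel has shape (frames, mel_bands)."""
    noise_floor = np.percentile(mel, floor_percentile, axis=0)
    return np.clip(mel - noise_floor, 0.0, None)

speech = np.zeros((20, 4))
speech[5:10] = 5.0            # a burst of speech energy in frames 5-9
noisy = speech + 0.3          # constant background noise everywhere
cleaned = spectral_gate(noisy)
print(float(cleaned[:5].sum()))  # leading noise-only frames are zeroed out
```

A learned denoiser can remove far more complex, non-stationary noise patterns than this, while preserving speaker fidelity, which is what makes it robust to varied customer recording conditions.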
(Average compute hours of training: 50 with the previous pipeline vs. 30 with Uni-TTSv3)
Trained on multi-lingual and multi-speaker datasets, Uni-TTSv3 can enable a voice to speak multiple languages even without training data from the same human speaker in those target languages.
With Uni-TTSv3, a powerful multilingual voice, JennyMultilingualNeural, is released to the Azure TTS platform, enabling developers to keep their AI voice consistent while serving different locales. Check the samples here. The JennyMultilingualNeural voice is, as far as we know, the first real-time production voice in the industry that can speak multiple languages naturally with the same timbre. The average MOS (mean opinion score) of Jenny across all supported languages is above 4.2 on a scale of 5.
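To have the multilingual voice switch languages within a single request, the Speech service accepts SSML with nested lang elements inside the voice element. A short example (the sentences themselves are just illustrative):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyMultilingualNeural">
    Hello, this is Jenny.
    <lang xml:lang="es-ES">Hola, soy Jenny.</lang>
    <lang xml:lang="fr-FR">Bonjour, je suis Jenny.</lang>
  </voice>
</speak>
```

All three sentences are rendered with the same timbre, only the language changes.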
Uni-TTSv3 has also been integrated into Custom Neural Voice to power cross-lingual features. It allows customers to create a voice that speaks different languages by providing speech samples collected in just one language as training data. This helps customers save effort and cost in creating voices for different markets, as it usually requires selecting a new voice talent and making new recordings through a lengthy process for each language.
Whether you are building a voice-enabled chatbot or IoT device, an IVR solution, adding read-aloud features to your app, converting e-books to audiobooks, or even adding Speech to a translation app, you can make all these experiences natural sounding and fun with Neural TTS.
We look forward to hearing about your experience and developing more compelling services together with you for developers around the world.