Low-resource technology updates for Azure Neural Text-to-Speech

Microsoft

Jan 20, 2023

Azure Neural Text-to-Speech (Neural TTS) is a powerful tool that allows users to turn text into lifelike speech. It has a wide range of applications, including voice assistants, content read-aloud capabilities, and accessibility tools. In the past few months, Azure Neural TTS has made significant improvements in naturalness of speech, variety of voices, language support, and more.

In this blog, we summarize the latest updates on our low-resource technology, which has enabled Azure Neural TTS to expand to global languages quickly and allows speakers of under-represented languages to equally benefit from our product.

Since the first low-resource voices were released in November 2020, language coverage has grown steadily. To date, the low-resource technology has added more than 80 voices across over 40 languages to the Azure Neural TTS portfolio, with an average MOS (Mean Opinion Score) higher than 4.3. At the same time, the low-resource Neural TTS technology itself has continued to advance, bringing higher voice quality and richer capabilities. This improvement has been demonstrated in voice tests and public contests based on open data. A recent achievement is winning first place in the 2022 NCMMSC Inner Mongolian TTS contest using the low-resource TTS technology as the voice backbone (NCMMSC 2022).

Technique upgrade

The low-resource TTS technology has been upgraded to a new generation, with three key updates: a simplified workflow, a higher-quality model structure, and newly supported features.

First, we have simplified the workflow of the low-resource TTS technology, moving from the previous teacher-student training strategy to the new speaker/prosody transfer strategy (a description of the speaker/prosody transfer method can be found in this blog). With this update, the training process is reduced to a single fine-tuning step. Low-quality parallel data in the new language, together with high-quality parallel data from inside or outside the new language, is used to fine-tune the acoustic model for the new voice. Compared to the previous low-resource locale TTS training method established two years ago, training with the new technology is 4x faster.

Updated workflow diagram of the low-resource TTS technology

Second, the higher-quality Uni-TTS V4 model is adopted to improve the naturalness of low-resource voices. With Uni-TTS V4, a multi-lingual, multi-speaker source model is pretrained on more than 3,000 hours of speech data, using phone/character sequences as input. The phone/character alignment is produced by a GMM-HMM model. The Uni-TTS V4 source model is then fine-tuned to build the new voice model in the target low-resource language, a process similar to that of Uni-TTS V3. In addition, a high-fidelity 48 kHz vocoder is used to improve the fidelity of the voices created.

Uni-TTS V4 model diagram

Last but not least, the new technology enables more use cases for low-resource languages, such as English word reading and letter spell-out. During low-resource model training, both target-language data and English data are used so that the model can speak the target language and also read English script. When synthesizing, Neural TTS automatically detects English words and letter sequences, then reads the words or spells out the letters with the correct pronunciation. This helps low-resource voices read better in more contexts.

English word reading and letter spell-out schematic diagram
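
The post describes English word reading and spell-out as automatic. Where explicit control is preferred, letter-by-letter reading can also be requested through SSML. The sketch below is illustrative only and is not the service's internal mechanism: build_ssml is a hypothetical helper, and the voice name cy-GB-NiaNeural is an assumption formed from the Welsh locale and voice name listed later in this post. It relies on the standard SSML say-as element with interpret-as="characters".

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(parts, voice="cy-GB-NiaNeural", lang="cy-GB"):
    """Build an SSML document from (kind, text) pairs.

    kind is "text" for normal reading or "spell" for letter-by-letter reading.
    The default voice name is an assumption, not something this post specifies.
    """
    chunks = []
    for kind, text in parts:
        safe = escape(text)  # escape &, <, > so the XML stays well-formed
        if kind == "spell":
            chunks.append(f'<say-as interpret-as="characters">{safe}</say-as>')
        else:
            chunks.append(safe)
    body = "".join(chunks)
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="{lang}">'
        f'<voice name="{voice}">{body}</voice></speak>'
    )


# Welsh sentence taken from the sample table, followed by a spelled-out acronym.
ssml = build_ssml([
    ("text", "Mae'r ysgol ar agor drwy'r wythnos. "),
    ("spell", "TTS"),
])
ET.fromstring(ssml)  # raises ParseError if the document is not well-formed
print(ssml)
```

In most cases the plain text can be sent as-is and the voice decides how to read it; the explicit markup is only useful when the automatic choice is not the desired one.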

Quality improvement

Alongside these technology advances, low-resource voices have proven to be highly human-like, as demonstrated in both our own voice tests and public contests based on open data.

According to a set of MOS tests, the average MOS of our low-resource voices is higher than 4.30, which is on par with the average MOS of voices in high-resource languages. This allows speakers of under-represented languages to benefit equally from Azure Neural TTS products.

Check out a few samples below to hear how natural these low-resource voices sound, or try them with your own text in our demo.

Locale	Language	MOS	Voice name	Audio sample
cy-GB	Welsh (UK)	4.61	Nia	Mae'r ysgol ar agor drwy'r wythnos.
fil-PH	Filipino (Philippines)	4.36	Blessica	Ano ang kasalukuyang nangungunang palabas sa telebisyon?
is-IS	Icelandic (Iceland)	4.47	Gudrun	Gífurlegir vatnavextir eru í ám og vötnum á svæðinu.
kk-KZ	Kazakh (Kazakhstan)	4.48	Daulet	Біз отанымыздан алыс жерде кездесеміз деп кім ойлаған!
sw-KE	Swahili (Kenya)	4.50	Rafiki	Furahia kutumia maandishi kwa hotuba!
zu-ZA	Zulu (South Africa)	4.14	Themba	Umbuso wakwaZulu nawo uhamba ngegama lamaZulu umbuso noma uMbuso wakwaZulu.
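
As a quick worked check, the six sample scores above can be averaged directly. This is a sketch over the listed samples only; the overall average quoted earlier covers all low-resource voices, not just these six, so the two numbers are not expected to match.

```python
from statistics import mean

# MOS values copied from the sample table above, keyed by locale.
SAMPLE_MOS = {
    "cy-GB": 4.61,
    "fil-PH": 4.36,
    "is-IS": 4.47,
    "kk-KZ": 4.48,
    "sw-KE": 4.50,
    "zu-ZA": 4.14,
}

average = mean(SAMPLE_MOS.values())
lowest = min(SAMPLE_MOS, key=SAMPLE_MOS.get)
highest = max(SAMPLE_MOS, key=SAMPLE_MOS.get)

print(f"mean MOS over {len(SAMPLE_MOS)} sample voices: {average:.2f}")
print(f"lowest: {lowest} ({SAMPLE_MOS[lowest]}), highest: {highest} ({SAMPLE_MOS[highest]})")
```

The sum of the six scores is 26.56, so the mean is about 4.43. One listed voice (zu-ZA at 4.14) sits below 4.30 on its own, which is consistent with 4.30 being a statement about the average rather than about every voice.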

The leading position of our low-resource TTS technology has been further demonstrated in a recent public TTS voice benchmark challenge. Among dozens of participants from academia and industry, Microsoft won first place in the contest to build natural and accurate Mongolian TTS from limited data (only 1,000 sentences). Our voice scored a naturalness MOS of 4.468 and an intelligibility MOS of 4.496, higher than the other models (NCMMSC 2022 Inner Mongolian TTS contest).

All these results show that the low-resource recipe is powerful and competitive, enabling Azure Neural TTS to reach a broader global audience at the same level of quality across different languages.

What’s next

Leveraging the low-resource TTS technology, Custom Neural Voice (CNV) will allow customers to create one-of-a-kind brand or character voices in any new language with just 500 to 2,000 recorded sentences. Submit a request here if you want to create a voice with the CNV capability.

In the future, we also plan to support more low-resource languages, in line with the mission to empower ‘every organization’ and ‘every person’ on the planet to achieve more. We believe that every language has its unique value. With AI Speech expanding to more languages, we can help lower the language barrier for more people and narrow the digital divide for society. Contact us if you want to request a new TTS language or are interested in contributing to the language expansion work.

Get started

Microsoft offers over 400 neural voices covering more than 140 languages and locales. With these Text-to-Speech voices, you can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users. In addition, with the Custom Neural Voice capability, you can easily create a brand voice for your business.
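
To make the read-aloud scenario concrete, here is a minimal sketch of calling the text-to-speech REST endpoint with only the Python standard library. It rests on several assumptions that should be checked against the current Speech service documentation: the endpoint pattern and header names are as I understand them from the REST reference, the voice name sw-KE-RafikiNeural is formed from the Swahili locale and voice name in the table above, the region and key are placeholders, and build_tts_request and synthesize are hypothetical helpers.

```python
import urllib.request


def build_tts_request(ssml, region, key,
                      output_format="audio-48khz-192kbitrate-mono-mp3"):
    """Prepare (but do not send) a POST request carrying SSML to the TTS endpoint."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,          # placeholder credential
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,  # assumed 48 kHz MP3 format name
        "User-Agent": "tts-sketch",
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


def synthesize(request, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as f:
        f.write(resp.read())


# Swahili sample sentence from the table; the voice name is an assumption.
SSML = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xml:lang="sw-KE"><voice name="sw-KE-RafikiNeural">'
    "Furahia kutumia maandishi kwa hotuba!"
    "</voice></speak>"
)

request = build_tts_request(SSML, region="westeurope", key="YOUR_SPEECH_KEY")
print(request.get_method(), request.full_url)
# With a real key and region, the audio could then be fetched with:
# synthesize(request, "rafiki.mp3")
```

Building the request separately from sending it keeps the example runnable without credentials. For production use, the official Speech SDK is likely the more convenient route.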