Azure AI Neural TTS enhanced expressiveness for Chinese voices with upgraded prosody breaks
Published Jun 28 2023

Co-authors: Zhihang Xu, Yuchao Zhang, Zheng Niu, Xi Wang, Lei He, Sheng Zhao


Chinese is one of the most widely spoken languages worldwide. Azure Neural TTS offers dozens of highly human-like digital voices in Chinese for numerous customers. With the emerging trend of using TTS to generate enjoyable AI content, we see higher requirements for more natural and expressive voices.

Today we are excited to announce our latest update to the prosody model which enhances the synthetic voice experience for Chinese.
This update is our response to two key challenges we identified: frontend prosody prediction errors, particularly in the Chinese (zh-CN) language, and the growing demand from our valued customers for improved prosody. Notably, effective prosody plays a crucial role in making synthetic speech sound natural and expressive, while poor prosody can unintentionally alter the intended semantic meaning of a sentence.

We have dedicated our efforts to addressing these issues by introducing a semi-supervised method that utilizes automatic annotation with pre-trained text-to-speech machine-learning models. With this approach, our prosody model can now make predictions based on both text and wave information, going beyond the limitations of relying solely on text input, which largely improves the prosody results.

This update is released to 20+ voices in Chinese (Mandarin, simplified).


Prosody break in Chinese and the challenges


We recognize the vital importance of correctly employing Chinese prosodic breaks in text-to-speech (TTS) tasks.

Just as for humans learning Chinese, achieving fluency in the language is difficult in many ways. One challenge is producing correct prosody boundaries based on word breaks. Since each Chinese character can represent an entire word, different breaks can convey different meanings or emotions. Prosodic breaks indicate the boundaries between words and phrases.

Developing a robust TTS prosody model for Chinese is thus crucial. It improves pronunciation accuracy and makes speech easier to understand, especially around third-tone sandhi and polyphone boundaries. Take “只有孙四海[BREAK]顶着不肯喝” as an example: if the BREAK is long, it means “孙四海” (a person's name) is not willing to drink. But if the BREAK is short, it sounds like “只有孙四[BREAK]还顶着不肯喝”, where “孙四” is a named entity and “还” means “still”, so the whole sentence means “孙四” is still not willing to drink.


只有孙四海[BREAK]顶着不肯喝 - Long break
只有孙四海[BREAK]顶着不肯喝 - Short break


The biggest challenge we encountered while developing a prosody model for Chinese TTS systems was the absence of real prosody labels, which caused overfitting and poor generalization across domains for break prediction itself. Additionally, there were discrepancies between the prosodic break labels and the acoustic model, leading to issues such as missing prosody breaks within words, unnatural pronunciation, and muffled pauses, all of which impact the overall quality of synthesized speech.

To address these challenges, our Azure TTS team has introduced an automatic prosody annotation model that leverages a self-supervised method to use more unlabeled audio data in model training. By incorporating both text and wave information, this automatic prosody annotation model provides more accurate prosody predictions. These nuances are essential for achieving authentic and meaningful synthesized speech. Our enhanced prosody model ensures that the output closely aligns with the intended semantic meaning of the original text, resulting in a more natural and expressive TTS output.


New model quality

Accurately identifying prosodic boundaries across different levels presents a challenge in prosody prediction. Mandarin speech synthesis systems often employ a hierarchical prosodic structure to improve accuracy. This structure classifies prosodic boundaries into five levels: Character (CC), Lexicon Word (LW), Prosodic Word (PW), Prosodic Phrase (PPH), and Intonational Phrase (IPH). To illustrate this classification, consider the following Chinese example:

[Figure: an example Chinese sentence annotated with its CC, LW, PW, PPH, and IPH boundaries]

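The five-level hierarchy can be captured in a small data structure. The sketch below is illustrative rather than the Azure implementation; the token annotations and the `boundaries_at` helper are hypothetical:

```python
from enum import IntEnum

class BreakLevel(IntEnum):
    """Hierarchical prosodic boundary levels, smallest to largest."""
    CC = 0    # Character
    LW = 1    # Lexicon Word
    PW = 2    # Prosodic Word
    PPH = 3   # Prosodic Phrase
    IPH = 4   # Intonational Phrase

# Hypothetical annotation: each token is paired with the level of the
# boundary that follows it. The levels nest, so a PPH boundary is also
# a PW and LW boundary.
annotated = [
    ("只有", BreakLevel.LW),
    ("孙四海", BreakLevel.PPH),
    ("顶着", BreakLevel.PW),
    ("不肯", BreakLevel.LW),
    ("喝", BreakLevel.IPH),
]

def boundaries_at(tokens, level):
    """Tokens followed by a break of at least the given level."""
    return [tok for tok, lvl in tokens if lvl >= level]
```

Querying at `BreakLevel.PPH` returns only the tokens ending a prosodic phrase, while `BreakLevel.LW` returns every lexicon-word boundary.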
In our objective test on a third-party dataset, we observed notable improvements in prosody performance. The F1 score increased by 1% for Lexicon Words (LW), 8% for Prosodic Words (PW), and 6% for Prosodic Phrases (PPH) compared to the previous model.
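For intuition, boundary F1 can be computed by treating each level's breaks as a set of character offsets. This is a minimal sketch of the standard metric, not the evaluation harness used in the test; `break_f1` is a hypothetical helper:

```python
def break_f1(predicted, reference):
    """F1 between predicted and reference break positions.

    Both arguments are sets of character offsets at which a break of
    the level under evaluation (e.g. PW) occurs.
    """
    if not predicted or not reference:
        return 0.0
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model finds 3 of 4 reference breaks plus one spurious one,
# giving precision 0.75 and recall 0.75, hence F1 0.75.
score = break_f1({2, 5, 9, 12}, {2, 5, 9, 14})
```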

As part of our standard model update process, we conducted a side-by-side comparative MOS (CMOS) test to measure the improvement. We evaluated the enhanced prosody with two popular voices, YunXi and XiaoXiao, across various scenarios, including audiobook, dubbing, and in-car. The CMOS scores demonstrated significant gains, indicating a clear preference for the new model.


CMOS results of zh-CN-Yunxi
[Table: CMOS gain for each text domain, with mean score]

CMOS results of zh-CN-Xiaoxiao
[Table: CMOS gain for each text domain, including long sentences, with mean score]

(Note: CMOS is a well-accepted method in the speech industry for comparing the voice quality of two TTS systems. A CMOS test is similar to an A/B test: participants listen to pairs of audio samples generated by the two systems and give subjective opinions on how A compares to B. Normally in one test, we recruit 30-60 anonymous testers with qualified language expertise to evaluate around 50 pairs of audio samples side by side. The result is reported as a CMOS gap, the average difference in opinion score between the two systems. When the absolute value of the CMOS gap is <0.1, we consider systems A and B on par. When it is >=0.1, one system is reported better than the other. When it is >=0.2, we say one system is significantly better than the other.)
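The scoring rule in the note above can be sketched in a few lines; `cmos_gap` and `interpret_cmos` are hypothetical names, with the 0.1 and 0.2 thresholds taken directly from the note:

```python
def cmos_gap(score_diffs):
    """Average per-pair opinion difference (system A minus system B)."""
    return sum(score_diffs) / len(score_diffs)

def interpret_cmos(gap, on_par=0.1, significant=0.2):
    """Map a CMOS gap to the verdict ranges described in the note."""
    if abs(gap) < on_par:
        return "on par"
    winner = "A" if gap > 0 else "B"
    if abs(gap) >= significant:
        return f"{winner} significantly better"
    return f"{winner} better"
```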



To provide you with a more intuitive understanding of the improvements in prosody, we have included some TTS samples below. It is important to note that these samples are not part of the training set, showcasing the effectiveness of the enhanced prosody model in generating natural and expressive speech for new content.


[Audio samples: enhanced prosody across scenarios, including voice assistant in car]


The new prosody model: How it is achieved

The prosody annotation model consists of an audio encoder, a text encoder, and a fusion decoder, working together to achieve high-quality labeling that resembles human-like prosody. With this model, Azure Neural TTS can achieve improved prosody prediction even without relying on real prosody labels.

Our objective is to develop a prosody model for Chinese TTS that addresses various challenges, including increasing the amount of training data, benefiting acoustic model training, and resolving mismatches between text processor and acoustic models. To achieve this, we employ an automatic annotation model that extracts pseudo-prosody labels from unlabeled clean and high-quality data. By annotating break tags according to real audio data during data preparation, we can alleviate issues caused by mismatches.

The framework of our work is illustrated in the accompanying diagram below. The process involves acquiring high-quality text-to-speech paired data, using an annotation model to obtain pseudo prosodic labels, augmenting the human-labeled dataset with the pseudo labels, and fine-tuning the pre-trained NLP model. Through extensive evaluation on the development set, we select a reliable checkpoint to obtain a proficient prosodic break prediction model for inference.
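The data-preparation steps above can be sketched as a small pipeline. This is a toy illustration with stubbed components, not the production code: `annotate_breaks` here just uses punctuation as a stand-in for the real text-plus-audio annotation model, and `build_training_set` shows only the augmentation step.

```python
def annotate_breaks(tokens, audio):
    """Stub for the annotation model: infer pseudo break tags from
    text and audio. This toy version places a break after any token
    ending in clause-final punctuation; the real model consumes the
    paired waveform as well."""
    return [(tok, tok.endswith("，") or tok.endswith("。")) for tok in tokens]

def build_training_set(human_labeled, unlabeled_pairs):
    """Augment the human-labeled dataset with pseudo-labeled
    text-to-speech pairs before fine-tuning the NLP model."""
    pseudo = [annotate_breaks(tokens, audio) for tokens, audio in unlabeled_pairs]
    return human_labeled + pseudo
```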


[Figure: framework of the automatic prosody annotation and break prediction pipeline]

To enhance performance, our solution incorporates data augmentation techniques and fine-tuning of a pre-trained Chinese RoBERTa model on our training data. We further improve efficiency through knowledge distillation and ONNX transformer-based optimization methods, which add little runtime cost compared to the previous production system.
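As an illustration of the knowledge-distillation idea (not the production training code), a smaller student model can be trained against the teacher's temperature-softened output distribution; all names and the toy logits below are hypothetical:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the student's distribution and the
    teacher's softened distribution over break classes."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))
```

The loss is minimized when the student reproduces the teacher's distribution; the temperature softens the targets so the student also learns the teacher's relative preferences among non-top classes.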


Get started


Azure Neural TTS has introduced enhanced prosody breaks for 21 voices in Chinese (zh-CN). To explore these voices, simply sign up for the Speech service on Azure and access the Speech Studio Voice Gallery.
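As a minimal sketch of trying one of these voices: the `build_ssml` helper below is hypothetical, while `zh-CN-XiaoxiaoNeural` and the commented Speech SDK calls follow the service's documented usage (the subscription key and region are placeholders). An explicit SSML `<break>` tag can still override the model's predicted prosody where you need it.

```python
def build_ssml(text, voice="zh-CN-XiaoxiaoNeural"):
    """Wrap text in a minimal SSML document for a zh-CN neural voice."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="zh-CN">'
        f'<voice name="{voice}">{text}</voice>'
        '</speak>'
    )

ssml = build_ssml('只有孙四海<break strength="medium"/>顶着不肯喝')

# With the Azure Speech SDK (pip install azure-cognitiveservices-speech):
# import azure.cognitiveservices.speech as speechsdk
# config = speechsdk.SpeechConfig(subscription="<your-key>", region="<your-region>")
# speechsdk.SpeechSynthesizer(speech_config=config).speak_ssml_async(ssml).get()
```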


Microsoft provides a wide range of neural voices, offering over 400 options in more than 140 languages and locales. These text-to-speech voices enable you to quickly integrate read-aloud functionality into your applications for enhanced accessibility. They also empower chatbots to deliver more engaging and natural conversations to your users. Additionally, with the Custom Neural Voice capability, you can easily create a unique brand voice tailored specifically for your business.


For more information
