Fine-tune neural text-to-speech output with advanced customization features
Published Apr 30 2020 06:00 AM

This post was co-authored by @Qinying Liao, Yueying Liu, Sheng Zhao, @Anny Dow, Bohan Li and Jun-wei Gan


Neural Text to Speech (TTS) converts text to lifelike speech for more natural interfaces. With natural-sounding speech that matches the stress patterns and intonation of human voices, neural TTS significantly reduces listening fatigue when users are interacting with AI systems.


Common use cases for neural TTS include, but are not limited to, voice assistants, connected cars, smart-home devices, and various e-learning systems as well as reading apps. While neural TTS provides you a set of voices that already sound natural and human-like, you may still want to modify the speech properties to make voices better fit your scenario and context.


A wide range of fine-tuning features are available through Speech Synthesis Markup Language (SSML) and a code-free Audio Content Creation tool for you to adapt TTS output, such as adding or removing a pause/break, changing the pronunciation, adjusting the speaking rate, volume, pitch and more.


In this article, we’ll take a deep dive into the latest advanced features that help you adapt the intonation and stress patterns of neural TTS output and define a custom lexicon for your applications.


Control the prosody of your neural TTS output

Prosody, as one of the SSML elements, can be used to specify changes to pitch, contour, range, rate, duration, and volume for the TTS output, making your audio result easier to follow.


We are glad to share that the adjustments around contour, breaks/pauses and speaking rates of neural TTS are smoothly supported today. Now you can easily tailor the prosody of your TTS output using SSML or the Audio Content Creation tool.


Pitch contour

Pitch contour represents changes in pitch at specified times in speech output. By tuning the pitch contour, you can change the intonation of your synthesized output. For example, you can use it to emphasize different parts of your sentence or change the tone to make it sound more natural.


Here are some examples of adjusting pitch contour with SSML.





I never said he stole your money


<prosody contour="(11%, +65%) (60%, -43%) (80%, -34%)">I never said he stole your money.</prosody>



That's how you pronounce it?


<prosody contour="(60%, -11%) (85%, +85%)">That's how you pronounce it?</prosody>
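The contour strings in the examples above are lists of (position, pitch change) pairs, so they can be generated rather than typed by hand. The following is a minimal Python sketch under that reading; the helper name `contour_prosody` is hypothetical and not part of any SDK, and it only builds the SSML fragment without calling a speech service.

```python
from xml.sax.saxutils import escape, quoteattr


def contour_prosody(text, points):
    """Wrap text in <prosody contour="..."> (hypothetical helper).

    points is a list of (position_percent, pitch_change_percent) integer
    pairs, e.g. (11, 65) becomes "(11%, +65%)".
    """
    # The pitch change always carries an explicit sign, as in the samples.
    contour = " ".join(f"({pos}%, {change:+d}%)" for pos, change in points)
    # quoteattr adds the surrounding quotes; escape protects &, < and >.
    return f"<prosody contour={quoteattr(contour)}>{escape(text)}</prosody>"


fragment = contour_prosody(
    "I never said he stole your money.",
    [(11, 65), (60, -43), (80, -34)],
)
print(fragment)
```

The fragment still needs to be placed inside the usual `<speak>` and `<voice>` elements before it is sent for synthesis.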




You can insert pauses (or breaks) between words or adjust pauses automatically added by the neural voices.





Now 50 years after the event, he may finally have an answer.



Now <break time="100ms" />50 years after the event, he may finally have an answer.







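通过语音合成技术,我们可以<mstts:ttsbreak strength="none" />创造出不同风格的智能语音。
(English: "With speech synthesis technology, we can create intelligent voices in different styles." Here the tag with strength="none" adjusts a pause that the neural voice would otherwise add at that position.)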
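If your source text is produced by an editor or a script, one way to place breaks is to mark them in plain text and convert the marks to SSML. The sketch below assumes a made-up `[pause:N]` marker (N in milliseconds); the marker syntax and the function name `pauses_to_ssml` are illustrative assumptions, and only the `<break time="..."/>` element itself comes from the sample above.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical plain-text marker: "[pause:100]" means a 100 ms break.
# Whitespace after the marker is dropped so the output matches the sample.
PAUSE_MARK = re.compile(r"\[pause:(\d+)\]\s*")


def pauses_to_ssml(text):
    """Escape the text for XML, then turn [pause:N] marks into <break> tags."""
    escaped = escape(text)  # leaves the marker characters untouched
    return PAUSE_MARK.sub(
        lambda m: f'<break time="{m.group(1)}ms" />', escaped
    )


print(pauses_to_ssml(
    "Now [pause:100] 50 years after the event, he may finally have an answer."
))
```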




Adjust rate

Rate indicates the speed at which text is read aloud. You can adjust the speed of a whole sentence or a part of a sentence read by neural voices.



Tune in SSML

Sometimes somebody will bring something that you really like.

Sometimes somebody will bring something that you <prosody rate="-51.00%">really </prosody>like.
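The same rate adjustment can be applied programmatically when you know which phrase should be slowed down or sped up. This is a small Python sketch, not an official API: `adjust_rate` is a hypothetical helper that wraps the first occurrence of a phrase in a `<prosody rate="...">` element using the relative percentage format shown above.

```python
from xml.sax.saxutils import escape


def adjust_rate(sentence, phrase, percent):
    """Wrap the first occurrence of phrase in <prosody rate="..."> (sketch)."""
    before, found, after = sentence.partition(phrase)
    if not found:
        raise ValueError(f"phrase {phrase!r} not found in sentence")
    # Signed value with two decimals, e.g. -51 becomes "-51.00%".
    rate = f"{percent:+.2f}%"
    return (
        f"{escape(before)}"
        f'<prosody rate="{rate}">{escape(found)}</prosody>'
        f"{escape(after)}"
    )


slowed = adjust_rate(
    "Sometimes somebody will bring something that you really like.",
    "really ",
    -51,
)
print(slowed)
```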



Adjust neural voice prosody through the Audio Content Creation tool

Besides SSML, we also offer an easy-to-use Audio Content Creation tool to help you fine-tune TTS output. Paste or upload your text in the tool, specify the voice you want to use, and then adjust the voice parameters in the tuning panel. You can switch views to check the SSML generated from your adjustments and use that SSML in your code, or generate audio directly from the tool.


See below for a demo showing how prosody is adjusted using the code-free tool.




Define lexicon for your neural TTS output

Sometimes TTS does not pronounce words accurately in the way you want, such as a company or person’s name. To improve pronunciation, you can define the reading of these entities in SSML using the <phoneme> and <sub> tags. However, defining multiple entities one by one during speech synthesis can be time-consuming. The new custom lexicon capability makes this process easier.


With custom lexicon, simply specify the reading of entities in a list stored as an .xml or .pls file, provide a web link for your list and refer to this list in SSML. The right pronunciation will be applied to all specified custom words at once.


Here is a sample:

For your scenario, you may want to adjust the pronunciations of “BTW,” “Alki Beach” and “Jean” from the default TTS. Hear the differences in the samples below.



Default reading

Applied custom lexicon

BTW, we will arrive Alki Beach probably 8:00 tomorrow morning. 

Could you help leave a message to Jean Pierre for me? 


This is how the custom lexicon list is defined for the above sample: 









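<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>BTW</grapheme>
    <alias>By the way</alias>
  </lexeme>
  <lexeme>
    <grapheme>Alki</grapheme>
    <phoneme>æl.kaɪˈ</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Jean</grapheme>
    <phoneme>ʒɑˈn</phoneme>
  </lexeme>
</lexicon>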
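When the list of custom words is long, a lexicon file like the one above could be generated from data instead of being written by hand. The sketch below uses Python's standard `xml.etree.ElementTree` and assumes the W3C Pronunciation Lexicon Specification layout (a `lexicon` root with one `lexeme` per entry); `build_lexicon` is a hypothetical helper, and whether the service expects further attributes on the root element is not confirmed here.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(entries, lang="en-US"):
    """Build a PLS lexicon string (sketch).

    entries is a list of (grapheme, kind, value) tuples, where kind is
    "alias" for a spelled-out reading or "phoneme" for an IPA reading.
    """
    # Serialize PLS elements without a prefix (default namespace).
    ET.register_namespace("", PLS_NS)
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    for grapheme, kind, value in entries:
        if kind not in ("alias", "phoneme"):
            raise ValueError(f"unsupported entry kind: {kind!r}")
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}{kind}").text = value
    return ET.tostring(root, encoding="unicode")


lexicon_xml = build_lexicon([
    ("BTW", "alias", "By the way"),
    ("Alki", "phoneme", "æl.kaɪˈ"),
    ("Jean", "phoneme", "ʒɑˈn"),
])
print(lexicon_xml)
```

The resulting string can be saved as an .xml or .pls file and uploaded to the location your SSML refers to.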









You can upload the list online and put it in a data store like Azure Blob Storage.

During speech synthesis, use the SSML below to refer to the list and apply the custom lexicon to the input text. Speech synthesis will then reflect your defined pronunciations in the output all at once.









<lexicon uri=""/> 
BTW, we will arrive Alki Beach probably 8:00 tomorrow morning.
Could you help leave a message to Jean Pierre for me?
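To show where the lexicon reference sits in a complete request, here is a hedged Python sketch that assembles a full SSML document around it. The function name `ssml_with_lexicon`, the voice name, and the example.com URI are placeholders chosen for illustration; substitute the voice you use and the real web link to your uploaded list. The code only builds and validates the XML and does not call the speech service.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def ssml_with_lexicon(text, lexicon_uri, voice, lang="en-US"):
    """Return an SSML document that applies a custom lexicon (sketch)."""
    ssml = (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<lexicon uri={quoteattr(lexicon_uri)}/>"
        f"{escape(text)}"
        "</voice></speak>"
    )
    # Fail early if the result is not well-formed XML.
    ET.fromstring(ssml)
    return ssml


ssml = ssml_with_lexicon(
    "BTW, we will arrive Alki Beach probably 8:00 tomorrow morning. "
    "Could you help leave a message to Jean Pierre for me?",
    lexicon_uri="https://example.com/customlexicon.xml",  # placeholder URI
    voice="en-US-AriaNeural",  # assumed voice name, replace with your own
)
print(ssml)
```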









For more information about custom lexicon, please see our documentation.


Get started

Since the release of our Neural TTS less than two years ago, this field has advanced rapidly. New research models, including Transformer TTS and FastSpeech, have been proposed and have advanced the state of the art. With these research innovations, we’ve not only improved the controllability of the neural voice output, but also made the synthesized speech more robust and substantially improved the performance of neural TTS.


Get started with Text to Speech on Azure today.

Last update: Feb 06 2023 01:16 AM