This post is co-authored by Dongxu Han, Junwei Gan and Sheng Zhao
Neural Text-to-Speech (Neural TTS), part of Speech in Azure Cognitive Services, enables you to convert text to lifelike speech for more natural user interactions. Neural TTS has powered a wide range of scenarios, from audio content creation to natural-sounding voice assistants, for customers all over the world. For example, BBC, Progressive and Motorola Solutions are using Azure Neural TTS to develop conversational interfaces for their voice assistants in English-speaking locales. Swisscom and Poste Italiane are adopting neural voices in French, German and Italian to interact with their customers in the European market. Hongdandan, a non-profit organization, is adopting neural voices in Chinese to make its online library audible for blind people in China.
In this blog, we introduce our latest innovation in the Neural TTS technology that helps to improve the pronunciation accuracy significantly: Unified Neural Text Analyzer.
Neural TTS converts plain text into waveforms via three modules: a neural text analyzer, a neural acoustic model and a neural vocoder. The text analyzer converts plain text to pronunciations, the acoustic model converts pronunciations to acoustic features and, finally, the vocoder generates waveforms. The text analyzer is the first link in the TTS system, so its results directly affect the acoustic model and the vocoder. Pronouncing a word or phrase correctly is the basic expectation of TTS, because it delivers the right information to users, but it’s not always easy. For example, “live” should be read differently in “We live in a mobile world” and “TV Apps and live streaming offerings from The Weather Network”, depending on context. If TTS reads them incorrectly, the intelligibility and naturalness of the content suffer significantly. The text analyzer is therefore critical to TTS.
Recent updates to Neural TTS include a major innovation in the text analyzer, called “UniTA” (Unified Neural Text Analyzer). UniTA is a unified text analyzer model that simplifies the text analyzer workflow and reduces latency in the runtime server. It adopts a multitask learning approach, jointly training all ambiguity models to resolve context ambiguity and generate correct pronunciations, and as a result reduces pronunciation errors by over 50%.
Different natural languages generally have different grammars. In TTS, the text analyzer needs to follow the grammar of each language in order to generate correct pronunciations. This includes, but isn’t limited to, the following categories:
Category | Example |
Word Segmentation | [English] [Chinese] 在圣诞节纽约大都会有演出 --> 在 / 圣诞节 / 纽约 / 大 / 都会(du1 hui4) / 有 / 演出 [Chinese] 在圣诞节纽约大都会有演出 --> 在/ 圣诞节 / 纽约 / 大都(da4 dou1) / 会 / 有 / 演出 |
Part-of-Speech Tagging | [Noun, l ai v s] [Verb, l I v s] I also discovered the very angry raccoon that lives near my porch. |
Morphology | [Singular] 1km --> one kilometer [Plural] 5km --> five kilometers |
Text Normalization | [Fraction, nine out of ten] The O.S. Speed T1202 ups the ante for race-winning performance, resulting in a power plant that will dominate 9/10 scale competition. [Date, September tenth] 1st episode will air 9/10 with never before seen video of her birth! |
Abbreviation Expansion | [Street] Oh man, biking from 24th St BART to the 29th St bikeshare station, that will be sweet. [Saint] We continue to ask anyone who was in the wider area near St Heliers School between 7.30am and 9am and witnessed any suspicious activity to contact police |
Polyphone Disambiguation | [p r ih - z eh 1 n t] The prices will present the estimated discount utilizing the drug discount card. [p r eh 1 - z ax n t] But our present situation is not a natural one. |
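To make the morphology and text normalization rows concrete, here is a minimal Python sketch of how a category label could select the spoken form of an ambiguous token. The function name `verbalize`, the label strings and the tiny lookup tables are assumptions made for illustration only. This is not the UniTA implementation, where the label would be predicted from context rather than passed in by hand.

```python
# Hypothetical sketch: the label decides how an ambiguous token is spoken.
# In a real system the label is predicted from context; here it is passed in.

CARDINALS = {1: "one", 5: "five", 9: "nine", 10: "ten"}
ORDINALS = {10: "tenth"}
MONTHS = {9: "September"}


def verbalize(token: str, label: str) -> str:
    """Return a spoken form of `token` for the given (assumed) label."""
    if label in ("fraction", "date"):
        left, right = (int(part) for part in token.split("/"))
        if label == "fraction":
            return f"{CARDINALS[left]} out of {CARDINALS[right]}"
        return f"{MONTHS[left]} {ORDINALS[right]}"
    if label == "distance_km":
        count = int(token[:-2])  # strip the trailing "km"
        unit = "kilometer" if count == 1 else "kilometers"
        return f"{CARDINALS[count]} {unit}"
    raise ValueError(f"unsupported label: {label}")


print(verbalize("9/10", "fraction"))     # nine out of ten
print(verbalize("9/10", "date"))         # September tenth
print(verbalize("1km", "distance_km"))   # one kilometer
print(verbalize("5km", "distance_km"))   # five kilometers
```

The same written token "9/10" yields two different spoken forms, so the hard part is choosing the label from the surrounding sentence, which is the disambiguation problem the text analyzer has to solve.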
Most pronunciations are affected by these categories based on syntactic or semantic context, and these categories are all challenging disambiguation problems. The traditional TTS approach is a pipeline-based module called “text analyzer” with a series of models aimed at solving grammar disambiguation problems, which causes some of the following issues:
In contrast to traditional pipeline-based text analyzers, Neural TTS uses a Unified Neural Text Analyzer model (UniTA) to improve TTS pronunciation.
First, UniTA converts the input text to word embedding vectors through a pre-trained model. Word embedding is a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension. Pre-trained models like XYZ-Code have proven highly effective at learning universal language representations from unlabeled corpora, and the approach has achieved great success in many tasks such as language understanding and language generation.
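The mapping from a sparse, vocabulary-sized space to a small dense one can be illustrated with a toy lookup table. The sketch below is an assumption-laden illustration: the vocabulary, the dimension and the random values are made up, and it uses a static table, whereas a pre-trained model would supply learned vectors that typically also depend on the surrounding context.

```python
import random
from typing import List

random.seed(0)  # fixed seed so the sketch is reproducible

VOCAB = ["we", "live", "in", "a", "mobile", "world"]  # toy vocabulary (assumed)
EMBED_DIM = 3  # far smaller than a realistic vocabulary size

# One row of EMBED_DIM real numbers per word; a pre-trained model would
# supply learned values instead of random ones.
EMBEDDINGS = [[random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in VOCAB]
WORD_TO_ID = {word: idx for idx, word in enumerate(VOCAB)}


def one_hot(word: str) -> List[float]:
    """Sparse representation: one dimension per vocabulary entry."""
    vec = [0.0] * len(VOCAB)
    vec[WORD_TO_ID[word]] = 1.0
    return vec


def embed(word: str) -> List[float]:
    """Dense representation: multiply the one-hot vector by the table."""
    hot = one_hot(word)
    return [sum(h * row[d] for h, row in zip(hot, EMBEDDINGS)) for d in range(EMBED_DIM)]


sentence = "we live in a mobile world".split()
vectors = [embed(word) for word in sentence]
print(len(one_hot("live")), "->", len(vectors[1]))  # 6 -> 3
```

Multiplying a one-hot vector by the table simply selects one row, which is why embedding layers are usually implemented as a direct index lookup.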
Secondly, a sequence tagging fine-tune strategy is adopted in the UniTA model. UniTA is designed as a typical word classification task, in which
Unlike traditional text analyzer training, UniTA adopts a multitask learning approach that jointly trains all categories together, including word segmentation, part-of-speech tagging, morphology, abbreviation expansion, text normalization and polyphone disambiguation. Multitask learning shares hidden-layer information and trains jointly across different tasks, and it has achieved state-of-the-art results on many NLP tasks. In UniTA, hidden information is likewise shared across the category models during training.
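The idea of sharing hidden layers across tasks can be sketched as one shared layer feeding several small task-specific output heads. The code below is a minimal forward-pass illustration in plain Python. The layer sizes, task names and label sets are assumptions, the weights are random and untrained, and nothing here reflects the actual UniTA architecture.

```python
import math
import random
from typing import Dict, List

random.seed(0)

INPUT_DIM, HIDDEN_DIM = 4, 5  # toy sizes (assumed)

# Label sets per task; names and labels are illustrative only.
TASK_LABELS: Dict[str, List[str]] = {
    "pos": ["Noun", "Verb", "Other"],
    "polyphone": ["--", "l i: d", "l e d"],
}


def random_matrix(rows: int, cols: int) -> List[List[float]]:
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix: List[List[float]], vector: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# One shared layer used by every task, plus one small output head per task.
SHARED = random_matrix(HIDDEN_DIM, INPUT_DIM)
HEADS = {task: random_matrix(len(labels), HIDDEN_DIM) for task, labels in TASK_LABELS.items()}


def tag_tokens(embeddings: List[List[float]]) -> Dict[str, List[str]]:
    """Run the shared layer once per token, then every task head on its output."""
    predictions: Dict[str, List[str]] = {task: [] for task in TASK_LABELS}
    for vector in embeddings:
        hidden = [math.tanh(v) for v in matvec(SHARED, vector)]  # shared features
        for task, labels in TASK_LABELS.items():
            scores = matvec(HEADS[task], hidden)
            predictions[task].append(labels[scores.index(max(scores))])
    return predictions


# Three toy token embeddings; the weights are untrained, so the labels
# printed here are arbitrary and only the structure is meaningful.
tokens = random_matrix(3, INPUT_DIM)
print(tag_tokens(tokens))
```

In a trained version of such a model, the per-task losses would typically be combined so that every task updates the shared layer, which is how evidence learned for one category can help another.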
For example, the sentence “St. John had a 10-3 run to build its lead to 78-64 with 4:44 left.” in the training corpus is annotated as shown in the table below, where “--” means there is no related tag in the category. In the word segmentation column, the phrase “10-3” is segmented as “10”, “-” and “3”. In the morphology column, the word “had” is annotated as “past tense”. In the text normalization column, the “-” in “10-3” is read as the word “to”, while “4:44” follows the time format. In the abbreviation column, “St.” is expanded as “Saint” rather than “Street”. In the polyphone disambiguation column, the word “lead” is pronounced [l i: d]. The word “lead” has two pronunciations, [l i: d] (the verb “to lead”, or a noun such as a lead in a game) and [l e d] (the noun for the metal), so cues such as the POS tag and the surrounding context help decide between them. This means the POS and polyphone tasks can share internal information, which is how the multitask model improves UniTA accuracy.
Word | Word Segmentation | Part-of-Speech | Morphology | Text Normalization | Abbreviation | Polyphone disambiguation |
St. | -- | Noun | -- | -- | Saint | -- |
John | -- | Noun | -- | -- | -- | -- |
had | -- | Verb | Past tense | -- | -- | -- |
a | -- | Det | -- | -- | -- | -- |
10-3 | 10 / - / 3 | Num | -- | numbers are predicted as “ten to three” | -- | -- |
run | -- | Noun | Singular | -- | -- | -- |
to | -- | Particle | -- | -- | -- | -- |
build | -- | Verb | -- | -- | -- | -- |
its | -- | Det | -- | -- | -- | -- |
lead | -- | Noun | Singular | -- | -- | l i: d |
to | -- | Particle | -- | -- | -- | -- |
78-64 | 78 / - / 64 | Num | -- | numbers are predicted as “seventy-eight to sixty-four” | -- | -- |
with | -- | Prep | -- | -- | -- | -- |
4:44 | 4 / : / 44 | Num | -- | numbers are predicted as time format | -- | -- |
left | -- | Verb | Past participle | -- | -- | -- |
. | -- | Symbol | -- | -- | -- | -- |
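The annotation table above can be read as one record per word with one field per category. As a rough illustration, the sketch below stores a few of those rows in a Python dataclass and extracts the label sequence for a single task. The class name `TokenAnnotation`, its field names and the helper `labels_for` are hypothetical and not part of any published UniTA data format.

```python
from dataclasses import dataclass
from typing import List

NO_TAG = "--"  # same placeholder the table uses for "no tag in this category"


@dataclass
class TokenAnnotation:
    word: str
    segmentation: str = NO_TAG
    pos: str = NO_TAG
    morphology: str = NO_TAG
    normalization: str = NO_TAG
    abbreviation: str = NO_TAG
    polyphone: str = NO_TAG


# A few rows copied from the table above.
ROWS: List[TokenAnnotation] = [
    TokenAnnotation("St.", pos="Noun", abbreviation="Saint"),
    TokenAnnotation("had", pos="Verb", morphology="Past tense"),
    TokenAnnotation("10-3", segmentation="10 / - / 3", pos="Num", normalization="ten to three"),
    TokenAnnotation("lead", pos="Noun", morphology="Singular", polyphone="l i: d"),
]


def labels_for(task: str, rows: List[TokenAnnotation]) -> List[str]:
    """Extract one label sequence, i.e. the target for a single task."""
    return [getattr(row, task) for row in rows]


print(labels_for("pos", ROWS))        # ['Noun', 'Verb', 'Num', 'Noun']
print(labels_for("polyphone", ROWS))  # ['--', '--', '--', 'l i: d']
```

Each column then becomes the target sequence for one tagging task, while all tasks see the same input words.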
At runtime in the Neural TTS service, the UniTA model predicts the results for all categories together. As in training, UniTA converts the plain text to word embeddings, and the multitask sequence tagging model then predicts the results for every category. Some auxiliary modules follow the fine-tuned category models to further improve pronunciations. Finally, UniTA generates the pronunciation results.
Here is the figure of the UniTA model structure in Neural TTS:
Compared with the traditional TTS text analyzer, UniTA reduces pronunciation errors by over 50%. It is already used in many neural voice languages, such as English (United States), English (United Kingdom), Chinese (Mandarin, simplified), Russian (Russia), German (Germany), Japanese (Japan), Korean (Korea), Polish (Poland) and Finnish (Finland). Because grammar varies across languages, not all categories are suitable for every language. For example, Chinese and Japanese depend heavily on word segmentation and polyphone disambiguation, but don’t need morphology or abbreviation expansion.
Here are some samples of the pronunciation improvement using UniTA.
Category | Language | Input text (target word bolded) | Previous pronunciation | Current pronunciation |
Word Segmentation | Chinese (Mandarin, simplified) | 太子与三殿下行过礼后坐了片刻就离开了。 | “三殿 / 下行 / 过礼” | “三殿下 / 行过礼” |
Word Segmentation | Chinese (Mandarin, simplified) | 叶奎最终还是在剧痛下泄了气 | “剧痛 / 下泄了气” | “剧痛下 / 泄了气” |
Word Segmentation | German (Germany) | kulturform | kult+urform | kultur+form |
Word Segmentation | Korean (Korea) | 해외감염병 | h̬ɛwɛg̥mjʌmbjʌŋ | h̬ɛwɛg̥mjʌmpjʌŋ |
Morphology - case ambiguity | Russian (Russia) | Количество ударов по воротам (15 против 7) также говорит о преимуществе чемпионов мира | Семь | Семи |
Abbreviation Expansion | English (United States) | Joined TX Army National Guard in 1979. | T.X. | Texas |
Text Normalization | English (United States) | The Downtown Cabaret Theatre’s Main Stage Theatre division concludes its 2010/11 season with the Tony Award winning musical, in the heights by Lin-Manuel Miranda. | November 2010 | 2010 to 2011 |
Polyphone disambiguation | Chinese (Mandarin, simplified) | 卓文君听琴后,理解了琴曲的含意,不由脸红耳热,心驰神往。 | qu1 | qu3 |
Polyphone disambiguation | English (United States) | I received a copy early in November, and read and contemplated it's provisions with great satisfaction. | | |
Polyphone disambiguation | Japanese (Japan) | パッケージには、富士屋ホテルが発刊した「We Japanese」内の説明用の挿絵を採用。 | うち (w u - ch i) | ない (n a - y i) |
Hear how the Cortana voice pronounces each word accurately with UniTA.
Get started
With these updates, we’re excited to continue to power accurate, natural and intuitive voice experiences for customers world-wide. Azure Text-to-Speech service provides more than 200 voices in over 50 languages for developers all over the world.
Let us know how you are using or plan to use Neural TTS voices in this form. If you prefer, you can also contact us at mstts [at] microsoft.com. We look forward to hearing about your experience and to developing more compelling services together with you for developers around the world.
For more information: