Microsoft Foundry Blog

Azure Neural TTS improves English word reading for mixed-lingual text

Former Employee

Apr 04, 2023

Mixed-lingual, or code-mixed, text occurs when speakers or writers mix languages within a single sentence or phrase. As advances in technology, communication, and transportation connect the world more closely than ever, code-mixed text has become prevalent in modern news, web articles, tweets, and messages. For instance, “Spread Your Wings, It's Late en My Melancholy Blues sluiten deze laatste BBC sessie af.” is a typical English-Dutch code-mixed sentence. Reading such mixed-lingual text accurately poses a significant challenge for text-to-speech (TTS) systems.

The challenge varies across languages. For languages written in non-Latin scripts, such as Chinese, Japanese, Korean, and Arabic, identifying embedded English words is usually easy because the character sets are distinct. However, in languages that share the Latin alphabet with English, such as German, French, Spanish, and Italian, distinguishing English from target-language components in code-mixed text is much harder. The situation becomes even more complicated with informal text, which may contain a mixture of punctuation marks and emojis.

Today we are glad to announce that Azure Text-to-Speech, part of Microsoft Azure Cognitive Services, has recently enhanced its capabilities to read text in code-mixed scenarios where English words are used within sentences of another language. This new functionality has been integrated into six languages (da-DK, de-DE, es-MX, fr-CA, it-IT and nl-NL) and is now available for major voices in these locales.

Code-mixed text reading improvement

With this update, English word pronunciation in these six languages has been greatly improved. On average, errors have been reduced by around 40%. The error-reduction rate for each language is shown in the figure below.

Figure 1: Improved English pronunciation accuracy across languages (ordered alphabetically)

In the next section, you can hear the improvement with several samples. With these samples, we compare the previous version and the updated version of Azure Neural TTS, highlighting significant improvements in English word pronunciations for each language.

English-German code-mixing

	Before	After
Auch die "Grenzgänger" feierten ihre Kinderkarneval-Premiere mit einem farbenfrohen Wagen zum Thema "*Back to the Eighties*".
Dieses Lied ist derzeit sehr beliebt, lass uns die TFboys mit "*Because I met You*" anhören.
Wenn Ihr sein Hupen hier auf der Seite wiedererkennt, dann winken Euch mit ein bisschen Glück 2 VIP Tickets für das Open Source Festival.

English-Italian code-mixing

	Before	After
Lo riferisce lo *United States Geological Survey*, precisando che il sisma registrato aveva magnitudo 7.2.
Oltre i confini del mare, ma la sua celebrità è legata alla saga di *Hunger Games*.
Un gran numero di storie scritte da *Henry James* è ambientato a Roma.

English-Spanish (Mexico) code-mixing

	Before	After
Cantó temas como "*Father Stretch My Hands*", "*Famous*", "*All of the Lights*", "*Black Skinhead*" y "*Touch the Sky*".
Mientras tanto, puedes visitar este enlace para leer todas las noticias relacionadas con *Terra Battle* 2.
(Foto: *Getty Images*) En un artículo publicado por *Daily Mail* aparece la cantidad aproximada que Jen ha invertido para lograrse ver de 31 años -cuando en realidad tiene 48-.

English-French (Canada) code-mixing

	Before	After
Il agira maintenant à titre de président des opérations hockey des *Blue Jackets* de *Columbus*.
Mardi, la *Food and Drug Administration* (FDA) a conséquemment recommandé de suspendre son administration aux États-Unis.
Bientôt, le film sera disponible sur la plateforme *Youtube*.

English-Dutch code-mixing

	Before	After
*Spread Your Wings, It's Late* en *My Melancholy Blues* sluiten deze laatste *BBC* sessie af.
*Oklahoma City Thunder* won met 111-107 van *Houston Rockets*.
Dat betekent dat de *Galaxy* A40 tot ergens halverwege 2021 nog ondersteund wordt.

English-Danish code-mixing

	Before	After
Det vil Tuyen Phan have bugt med i sit projekt *Flash Your Trash*, men er det muligt at få folk med på vognen?
For få år siden var begreber som *Valentines Day* og *Black Friday* helt ukendte for danske forbrugere.
Snart kan Super Mario også spilles på *iPhone*.

As you can hear from these samples, English words are now read more naturally and fluently in these six languages than before. The updated version is now available for production use and comes with easy-to-follow instructions for synthesizing additional samples. You can further test these languages using the Audio Content Creation tool on Speech Studio.

Technology behind: accurate token-level language identification

Our approach to processing English-TargetLang code-mixed sentences emulates the way humans read such sentences. Specifically, we first identify the English portion of the sentence and then process the English and native parts separately, using distinct lexicons and grammatical rules. This initial step is commonly referred to as language identification (LID). Many of the LID models reported in the literature operate at the sentence level. Token-level LID is a well-known challenge because of its fine granularity and ambiguity: individual tokens are typically short and offer little context.
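The shape of the token-level problem can be sketched with a toy tagger. This is not the Z-code based model: the word lists, the `tag_tokens` function, and the neighbour rule are all hypothetical stand-ins chosen to show what per-token labels look like and why a short token such as "die" (a German article and an English verb) needs context to resolve.

```python
import re
from typing import Optional

# Tiny hypothetical word lists; a real system uses a trained model, not lookups.
ENGLISH = {"back", "to", "the", "eighties", "die"}
TARGET = {"auch", "die", "wagen", "zum", "thema"}


def tag_tokens(sentence: str, target: str = "de-DE") -> list[tuple[str, str]]:
    """Label each token with a locale: "en-US" or the target locale."""
    tokens = re.findall(r"\w+", sentence)

    # First pass: look each token up; None marks an ambiguous token.
    first_pass: list[Optional[str]] = []
    for tok in tokens:
        low = tok.lower()
        if low in ENGLISH and low in TARGET:
            first_pass.append(None)
        elif low in ENGLISH:
            first_pass.append("en-US")
        else:
            first_pass.append(target)  # unknown words default to the target locale

    # Second pass: an ambiguous token is English only if all of its
    # unambiguous direct neighbours are English.
    tagged = []
    for i, (tok, tag) in enumerate(zip(tokens, first_pass)):
        if tag is None:
            neighbours = first_pass[max(i - 1, 0):i] + first_pass[i + 1:i + 2]
            known = [t for t in neighbours if t is not None]
            tag = "en-US" if known and all(t == "en-US" for t in known) else target
        tagged.append((tok, tag))
    return tagged


for token, locale in tag_tokens("Auch die Wagen zum Thema Back to the Eighties"):
    print(f"{token}\t{locale}")
```

A lookup like this breaks down quickly on unseen words and longer ambiguous stretches, which is the gap a learned token-level model is meant to close.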

Figure 2: Text-to-speech workflow for code-mixed text

In our previous work, we unveiled the unified neural text analyzer (UniTA), which comprises a comprehensive pipeline of neural network-based text analysis components. Although UniTA already achieves good pronunciation accuracy for monolingual text, accurately processing code-mixed text remains difficult. We have therefore been working to enhance our capabilities in this area by harnessing the power of Z-code. Figure 2 above shows where the Z-code based LID model fits into our TTS workflow. This model acts as a coordinator: it identifies the locale of each token and directs the token to the appropriate UniTA components, which generate a mixed-lingual phone sequence that is then passed to the TTS backend components for waveform synthesis.
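The coordinator role described above can be sketched as a small dispatch loop. Everything here is an assumption made for illustration: the input is a list of already-tagged tokens, `fake_analyzer` is a placeholder for a per-locale text analysis component, and the "phones" it returns are just labelled strings rather than real phonetic symbols.

```python
from itertools import groupby
from typing import Callable

Tagged = tuple[str, str]  # (token, locale), as produced by a token-level LID step


def group_by_locale(tagged: list[Tagged]) -> list[tuple[str, list[str]]]:
    """Merge consecutive tokens that share a locale into one span."""
    return [
        (locale, [tok for tok, _ in group])
        for locale, group in groupby(tagged, key=lambda pair: pair[1])
    ]


def fake_analyzer(locale: str) -> Callable[[list[str]], list[str]]:
    # Placeholder for a per-locale analyzer; it only labels each token.
    def analyze(tokens: list[str]) -> list[str]:
        return [f"<{locale}>{tok.lower()}" for tok in tokens]
    return analyze


# Hypothetical registry of per-locale analyzers.
ANALYZERS = {loc: fake_analyzer(loc) for loc in ("de-DE", "en-US")}


def to_phone_sequence(tagged: list[Tagged]) -> list[str]:
    """Route each span to its locale's analyzer and concatenate the results."""
    phones: list[str] = []
    for locale, tokens in group_by_locale(tagged):
        phones.extend(ANALYZERS[locale](tokens))
    return phones


tagged = [("das", "de-DE"), ("Open", "en-US"), ("Source", "en-US"), ("Festival", "en-US")]
print(group_by_locale(tagged))
print(to_phone_sequence(tagged))
```

Grouping consecutive tokens into spans before dispatch lets each analyzer see a multi-word English phrase as a unit instead of as isolated words.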

Z-code is a component of Microsoft’s XYZ-code initiative, which combines AI models for text, vision, audio and language. By leveraging transfer learning to eliminate language barriers and share linguistic elements, Z-code has considerably advanced multilingual pretraining for natural language processing and significantly improved the quality of translation and general natural language understanding tasks.

We have leveraged Z-code's powerful cross-lingual and cross-domain representation capabilities and its ability to improve language modeling. With these features, we have created a single, comprehensive token-level LID model that spans tens of languages and is specifically designed for English-TargetLang code-mixed scenarios. Thanks to Z-code's cross-lingual knowledge transfer capabilities, we were able to limit human annotation to just a few major languages. Despite the limited data, Z-code was able to learn the task and extend it to most of the supported languages.

Microsoft has been making consistent efforts over the last decade to advance its TTS engine and establish new industry standards. We strive to create computer-generated speech that is as close to human speech as possible, in terms of both pronunciation accuracy and voice quality. In addition to the Z-code model, two other TTS backbones have also contributed to improving the reading of code-mixed text, resulting in the best multilingual voice experience in the industry. One is the UniTTSv4 acoustic model, which aligns closely with human speech and enables TTS to speak the phones of more than 110 languages. The other is the HiFiNet2 vocoder, which generates 48 kHz waveforms with exceptional high-fidelity sound quality while maintaining high efficiency and scalability.

Get Started

Azure Neural TTS now offers improved English word pronunciation across six languages: de-DE (Katja, Conrad), it-IT (Elsa), es-MX (Dalia, Jorge), fr-CA (Sylvie), nl-NL (Fenna) and da-DK (Christel). To test these voices, you can sign up for the Speech service on Azure and start using the Speech Studio.
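If you prefer to script your tests, one option is to wrap a code-mixed sentence in SSML. The sketch below only builds the SSML string with the Python standard library. The `build_ssml` helper and the `VOICES` table are illustrative, and the voice names assume the usual `<locale>-<Name>Neural` naming pattern for the voices listed above; check the voice list in Speech Studio before relying on them.

```python
from xml.sax.saxutils import escape

# Assumed voice names, derived from the voices listed in this post.
VOICES = {
    "da-DK": "da-DK-ChristelNeural",
    "de-DE": "de-DE-KatjaNeural",
    "es-MX": "es-MX-DaliaNeural",
    "fr-CA": "fr-CA-SylvieNeural",
    "it-IT": "it-IT-ElsaNeural",
    "nl-NL": "nl-NL-FennaNeural",
}


def build_ssml(text: str, locale: str) -> str:
    """Wrap plain text in a minimal SSML document for the given locale."""
    voice = VOICES[locale]
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{locale}">'
        f'<voice name="{voice}">{escape(text)}</voice>'
        "</speak>"
    )


ssml = build_ssml("Bientôt, le film sera disponible sur la plateforme Youtube.", "fr-CA")
print(ssml)
```

The resulting string can be pasted into the Audio Content Creation tool or, assuming you have the Speech SDK for Python installed and configured with your own key and region, passed to `SpeechSynthesizer.speak_ssml_async`. No language tags are needed around the English words, since the service identifies them itself.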

Microsoft offers over 400 neural voices covering more than 140 languages and locales. With these Text-to-Speech voices, you can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users. In addition, with the Custom Neural Voice capability, you can easily create a brand voice for your business.