AI - Azure AI services Blog

Azure Neural TTS now available on devices for disconnected and hybrid scenarios

Microsoft

Jan 16, 2023

Azure Neural Text-to-Speech (Neural TTS) is a powerful AIGC (AI Generated Content) service that allows users to turn text into lifelike speech. It has been applied to a wide range of scenarios, including voice assistants, content read-aloud capabilities, and accessibility uses. Over the past months, Azure Neural TTS has achieved parity with natural human recordings (see details) and has been extended to support more than 140 languages and variants (see details). These highly natural voices are available in the cloud, and on premises through containers.

At the same time, we have received many customer requests to support Neural TTS on devices, especially for scenarios where devices have no network access or the network is unstable, and for scenarios that require extremely low latency or have privacy constraints. For example, users of screen readers for accessibility (such as the speech feature on Windows) are asking for a better voice experience through higher-quality on-device TTS. Automobile manufacturers are requesting features that keep in-car voice assistants working when disconnected.

To address the need for high-quality embedded TTS, we developed a new-generation device neural TTS technology that significantly improves embedded TTS quality compared to traditional on-device TTS voices, e.g., those based on the legacy SPS (Statistical Parametric Speech Synthesis) technology. Thanks to this new technology, natural on-device voices have been released in Microsoft’s flagship products such as Windows 11 Narrator, and are now available in Speech services for Azure customers.

Neural voices on-device

A set of natural on-device voices recently became available with Narrator on Windows 11. Check the video below to hear how natural the new voices sound and how much better they are than the old-generation embedded voices.

Windows 11’s Narrator Natural Voices, from 30:27 to 32:44

With Azure Speech service, you can easily embed the same natural on-device voices into your own business scenarios. Check the demo below to see how seamlessly a mobile speech experience switches from a connected environment to a disconnected one, with neural TTS voices available both in the cloud and embedded.

Seamless switch between cloud TTS and device TTS with Azure device neural TTS technology
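
The switch between cloud and device voices can be pictured as a simple fallback rule. The sketch below is only an illustration of that idea, not the Speech SDK API: the `Synthesizer` type and the `fake_*` functions are hypothetical stand-ins that do not call any real service.

```python
from typing import Callable, Tuple

# Hypothetical type: a synthesizer takes text and returns audio bytes.
Synthesizer = Callable[[str], bytes]


def synthesize_hybrid(text: str, cloud: Synthesizer, device: Synthesizer) -> Tuple[str, bytes]:
    """Try the cloud voice first; fall back to the on-device voice on network errors."""
    try:
        return "cloud", cloud(text)
    except (ConnectionError, TimeoutError):
        # Disconnected or unstable network: use the embedded voice instead.
        return "device", device(text)


# Stand-in synthesizers for illustration only (assumptions, not real SDK calls).
def fake_cloud_online(text: str) -> bytes:
    return b"cloud:" + text.encode("utf-8")


def fake_cloud_offline(text: str) -> bytes:
    raise ConnectionError("network unavailable")


def fake_device(text: str) -> bytes:
    return b"device:" + text.encode("utf-8")


print(synthesize_hybrid("hello", fake_cloud_online, fake_device)[0])   # cloud
print(synthesize_hybrid("hello", fake_cloud_offline, fake_device)[0])  # device
```

Because the same voice is available in both places, the listener should hear a similar voice whichever branch is taken.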

This new generation on-device neural TTS has three key advances: high quality, high efficiency, and high responsiveness.

High quality

Traditional on-device TTS voices are built with the legacy SPS technology, and their voice quality is significantly lower than that of cloud-based TTS, typically with a MOS (Mean Opinion Score) gap greater than 0.5. Now, with the new device neural TTS technology, we have closed the gap between device TTS and cloud TTS. Our MOS and CMOS (Comparative Mean Opinion Score) tests have shown that the device neural TTS voice quality is very close to that of cloud TTS.
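
For readers unfamiliar with these scores, here is a minimal sketch of how a MOS and a MOS gap are computed. It assumes the common setup in which listeners rate each sample from 1 (bad) to 5 (excellent), and in which CMOS averages side-by-side preference scores (commonly on a -3 to +3 scale). The ratings in the example are invented purely for illustration and are not Microsoft's test data.

```python
from statistics import mean


def mos(ratings):
    """Mean Opinion Score: the average of listener ratings on a 1-5 scale."""
    return mean(ratings)


def cmos(preferences):
    """Comparative MOS: the average of side-by-side preference scores."""
    return mean(preferences)


# Invented ratings, for illustration only.
system_a = [4, 5, 4, 5]
system_b = [4, 4, 4, 4]

mos_gap = mos(system_a) - mos(system_b)
print(mos_gap)  # 0.5
```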

The table below compares voice naturalness, output support, and available features across traditional device TTS (SPS), embedded neural TTS, and cloud neural TTS. Here, ‘traditional device TTS’ refers to the device SPS technology we shipped on Windows 10 and earlier Windows versions, which is also the main technology used for embedded TTS in the industry today.

Feature	Traditional device TTS	Device neural TTS	Cloud neural TTS
MOS gap (on-device neural TTS as the base)	~-0.5	0	~+0.05
16kHz fidelity	Yes	Yes	Yes
24kHz fidelity	No	Yes	Yes
48kHz fidelity	No	No	Yes
Styles/emotions	No	No	Yes

As the comparison above shows, with the new technology, the naturalness of device neural TTS voices has reached near parity with the cloud version. Hear how close they sound in the samples below.

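Voice	Device neural TTS	Cloud neural TTS
Jenny, en-US	(audio sample)	(audio sample)
Guy, en-US	(audio sample)	(audio sample)
Xiaoxiao, zh-CN	(audio sample)	(audio sample)
Yunxi, zh-CN	(audio sample)	(audio sample)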

High efficiency

Deploying neural network models to IoT devices is a big challenge today, both for AI researchers and for many industries. For device TTS scenarios, the challenge is even bigger: according to our customers’ experience, the devices are often lower-end and the system reserves less CPU for TTS. So we had to create a highly efficient solution for our device neural TTS.

Below are the metrics and scorecard for our device neural TTS system. Overall, its efficiency is close to that of some traditional device TTS systems and can meet almost all customers’ efficiency requirements.

Metrics	Values
CPU usage (DMIPS)	~1200
RTF¹ (820A², 1 thread)	~0.1
Output sample rate	24 kHz
Model size	~5 MB (acoustic model + vocoder)
Memory usage	<120 MB
NPU³ support	Yes

Notes:

¹ RTF, or Real-Time Factor, is the time in seconds needed to generate 1 second of audio.
² 820A is a type of CPU that is widely used in car systems today. It is a typical platform for device TTS that most customers can adopt, so we use it as our measurement platform.
³ NPU, or Neural Processing Unit, is a critical hardware component, especially for AI-related processing. It can accelerate neural network inference efficiently without increasing general CPU usage. More and more IoT device makers, such as car manufacturers, are using NPUs to accelerate their systems.
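
As a quick illustration of the RTF definition, the sketch below computes RTF from a measured synthesis time and the number of output samples. The function names are ours, and the timing in the example is a made-up value chosen to match the ~0.1 figure in the table; only the 24 kHz sample rate comes from the table itself.

```python
def audio_seconds(num_samples: int, sample_rate_hz: int = 24000) -> float:
    """Duration of the generated audio, given its sample count and sample rate."""
    return num_samples / sample_rate_hz


def real_time_factor(synthesis_seconds: float, audio_duration_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_duration_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_duration_seconds


# Example: 240,000 samples at 24 kHz is 10 s of audio.
duration = audio_seconds(240_000)
# If synthesis took 1 s (a hypothetical measurement), RTF is 0.1.
print(real_time_factor(1.0, duration))  # 0.1
```

An RTF below 1 means audio is generated faster than it plays back; at 0.1, one thread produces speech about ten times faster than real time.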

High responsiveness

High synthesis speed and low latency are critical to the user experience of a text-to-speech system. To ensure a highly responsive system, we designed the device neural TTS to synthesize in streaming mode, which means that the latency is independent of the length of the input sentence. This allows for consistently low latency and a highly responsive experience when synthesizing. To achieve streaming synthesis, both the acoustic model and the vocoder must support streaming inference.

With the streaming inference design, we achieved 100 ms latency on 820A with 1 thread.
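
To see why streaming makes latency independent of sentence length, consider a deliberately simplified cost model in which synthesis time is RTF times the duration of the audio produced. This is a sketch under stated assumptions, not a description of the actual system: the 0.5 s chunk size is hypothetical, fixed overheads such as text analysis are ignored, and the numbers are not meant to reproduce the 100 ms figure.

```python
from typing import Optional


def first_audio_latency(audio_seconds: float, rtf: float,
                        chunk_seconds: Optional[float] = None) -> float:
    """Seconds until the first audio can be played, under a constant-RTF cost model.

    Without streaming (chunk_seconds=None), the whole utterance is synthesized first.
    With streaming, only the first chunk has to be ready before playback starts.
    """
    if chunk_seconds is None:
        first_piece = audio_seconds
    else:
        first_piece = min(chunk_seconds, audio_seconds)
    return rtf * first_piece


for length in (2.0, 10.0, 30.0):
    batch = first_audio_latency(length, rtf=0.1)
    streamed = first_audio_latency(length, rtf=0.1, chunk_seconds=0.5)
    print(f"{length:>5.1f} s of audio: batch {batch:.2f} s, streaming {streamed:.2f} s")
```

In this model the non-streaming latency grows with the utterance, while the streaming latency stays the same for every length.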

How did we do that?

To improve the device TTS technology with neural networks, we adopted an overall pipeline architecture similar to that of cloud TTS. The pipeline contains three major components: text analyzer, acoustic model, and vocoder.

For the acoustic model, we designed a brand-new model architecture named “LeanSpeech”, a very lightweight and efficient model with high learning capability. We use LeanSpeech as a student model that learns from the service model, which acts as the teacher. With this design, we achieved an acoustic model of 2.9 MB with quality close to that of the service acoustic model in the cloud.
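
The teacher-student idea can be shown with a toy example. This is not LeanSpeech or its training recipe, which the post does not detail: the `teacher` function and the two-parameter student below are hypothetical stand-ins. What the sketch shows is only the core mechanism, in which the student is fit to the teacher's outputs rather than to ground-truth labels.

```python
def teacher(x: float) -> float:
    """Stand-in for a large, already-trained model (here just a fixed function)."""
    return 2.0 * x + 0.5


def train_student(inputs, steps: int = 500, lr: float = 0.1):
    """Fit a two-parameter student y = w*x + b to the teacher by gradient descent."""
    w, b = 0.0, 0.0
    targets = [teacher(x) for x in inputs]  # teacher outputs act as training targets
    n = len(inputs)
    for _ in range(steps):
        errors = [(w * x + b) - t for x, t in zip(inputs, targets)]
        # Gradients of the mean squared error between student and teacher outputs.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, inputs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


inputs = [i / 10 for i in range(-10, 11)]  # 21 points in [-1, 1]
w, b = train_student(inputs)
print(round(w, 3), round(b, 3))
```

In a real system the teacher is far larger than the student and the targets are acoustic features rather than scalars, but the training signal flows the same way.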

In addition, we developed the device vocoder based on our latest HiFiNet service vocoder in the cloud. The biggest challenge we faced was the computation cost. If we simply combined LeanSpeech with HiFiNet, HiFiNet would account for more than 90% of the computation cost, and the total CPU usage would block adoption on low-end devices or systems with a limited CPU budget, as in many in-car assistant scenarios.

To solve these challenges, we redesigned HiFiNet with highly efficient model units and applied model compression methods such as model distillation. As a result, the on-device vocoder is 7x smaller and its computation cost is 4x lower than that of the service vocoder in the cloud.
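
A back-of-the-envelope calculation shows why the vocoder was the right place to optimize. The sketch below applies an Amdahl-style formula under two assumptions of ours: that the vocoder accounts for exactly 90% of the pipeline cost (the text only says more than 90%, so this is a lower bound), and that the acoustic model cost stays unchanged.

```python
def total_speedup(fraction_accelerated: float, component_speedup: float) -> float:
    """Overall speedup when only one component of a pipeline gets faster."""
    remaining = (1.0 - fraction_accelerated) + fraction_accelerated / component_speedup
    return 1.0 / remaining


# Vocoder at 90% of the cost, made 4x cheaper: the whole pipeline costs
# 0.1 + 0.9 / 4 = 0.325 of the original, roughly a 3.1x overall reduction.
print(round(total_speedup(0.9, 4.0), 2))  # 3.08
```

The same 4x gain applied to a component that was only 10% of the cost would barely change the total, which is why the dominant component matters most.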

Get started

Embedded Speech with device neural TTS is in public preview with limited access. You can check how to use it with the Speech SDK here. Apply for access through the Azure Cognitive Services embedded speech limited access review. For more information, see Limited access for embedded speech.

The languages and voices below are released through the Azure Embedded Speech public preview. More languages will be supported based on business needs.

Locale Name	Voice Name	Gender
en-US	Jenny	Female
en-US	Aria	Female
zh-CN	Xiaoxiao	Female
de-DE	Katja	Female
en-GB	Libby	Female
ja-JP	Nanami	Female
ko-KR	SunHi	Female
en-AU	Annette	Female
en-CA	Clara	Female
es-ES	Elvira	Female
es-MX	Dalia	Female
fr-CA	Sylvie	Female
fr-FR	Denise	Female
it-IT	Isabella	Female
pt-BR	Francisca	Female
en-US	Guy	Male
zh-CN	Yunxi	Male
de-DE	Conrad	Male
en-GB	Ryan	Male
ja-JP	Keita	Male
ko-KR	InJoon	Male
en-AU	William	Male
en-CA	Liam	Male
es-ES	Alvaro	Male
es-MX	Jorge	Male
fr-CA	Jean	Male
fr-FR	Henri	Male
it-IT	Diego	Male
pt-BR	Antonio	Male

Microsoft offers the best-in-class AI voice generator with Azure Cognitive Services. Quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users with over 400 highly natural voices available across more than 140 languages and locales. Or easily create a brand voice for your business with the Custom Neural Voice capability.

For more information: