Microsoft Foundry Blog

Introducing AI-generated voices for Azure neural text to speech service

Microsoft

Sep 27, 2022

This blog is co-authored with Lei He, Melinda Ma, Qinying Liao, Binggong Ding and Sheng Zhao

Neural text to speech (Neural TTS) is a powerful speech synthesis capability of Azure Cognitive Services. It enables users to convert text to lifelike speech and can be used in scenarios such as voice assistants, content read-aloud, and accessibility tools. Neural TTS has been incorporated into Microsoft’s flagship products and adopted by many customers, including AT&T, Duolingo, and Progressive. It supports over 400 neural voices across 140 languages and variants.

In response to customer feedback asking for more diverse neural TTS voices, we are introducing a technology that can quickly generate a variety of new voices on demand. This technology is called controllable new voice generation, and a TTS voice generated with it is called an AI-generated voice. With this innovation, we can create a richer range of TTS voices as needed, such as sweet, dark, or husky voices, much faster than before. In this blog, we introduce two new voices created by this approach: a masculine voice named AIGenerate1 and a feminine voice named AIGenerate2. Below is a deeper look at the technology behind AI-generated voices.

A traditional TTS system can only synthesize the voices that appear in its training data, whereas the controllable new voice generation technology can create voices that do not sound like any speaker used in training. The voice generation model is a single end-to-end (E2E) model trained with data from various speakers of different genders, ages, and voice timbres. To achieve controllable voice generation, the voice attributes of each speaker, such as gender, age, and pitch, are also used in model training, so the model can explicitly learn how these attributes make up a voice. At inference (generation) time, supplying a new combination of attributes generates a new voice on demand.

The definition of attributes

Voice attributes are characteristics of a voice, and combinations of these attributes can represent different voice timbres. The attributes are not necessarily independent of each other. For example, the attributes can be defined as voice gender, age group, speaking rate, pitch level, and other voice characteristics. Human judges can label these attributes to characterize different voices.

For example, the table below shows how a voice can be labelled using attributes such as gender, age group, speaking rate, and more.

Table 1. Voice samples and the labelled attributes

Voice gender	Age group	Speaking rate	Pitch level	Sweet
Masculine	Adult	Medium	Low	No
Feminine	Child	Medium	Medium	Yes

These labelled attributes are converted to numbers and normalized, and the normalized scores are then used in model training.
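To make the idea of turning labels into normalized scores concrete, here is a minimal sketch in Python. The post does not describe the actual encoding, so the label vocabularies, the 0-to-1 scaling, and the names `ORDINAL_LEVELS`, `BINARY_LABELS`, and `encode_attributes` are all assumptions made for illustration.

```python
# Hypothetical label vocabularies; the post does not list the real ones.
ORDINAL_LEVELS = {
    "age_group": ["Child", "Young", "Adult", "Senior"],
    "speaking_rate": ["Slow", "Medium", "Fast"],
    "pitch_level": ["Low", "Medium", "High"],
}
BINARY_LABELS = {
    "voice_gender": {"Masculine": 0.0, "Feminine": 1.0},
    "sweet": {"No": 0.0, "Yes": 1.0},
}
FEATURE_ORDER = ["voice_gender", "age_group", "speaking_rate", "pitch_level", "sweet"]


def encode_attributes(labels):
    """Map human-assigned labels to scores in [0, 1], in a fixed order."""
    vector = []
    for name in FEATURE_ORDER:
        value = labels[name]
        if name in BINARY_LABELS:
            vector.append(BINARY_LABELS[name][value])
        else:
            levels = ORDINAL_LEVELS[name]
            # Ordinal index scaled so the first level is 0.0 and the last is 1.0.
            vector.append(levels.index(value) / (len(levels) - 1))
    return vector


# The two rows of Table 1, encoded with the assumed scheme above.
row_1 = encode_attributes({
    "voice_gender": "Masculine", "age_group": "Adult",
    "speaking_rate": "Medium", "pitch_level": "Low", "sweet": "No",
})
row_2 = encode_attributes({
    "voice_gender": "Feminine", "age_group": "Child",
    "speaking_rate": "Medium", "pitch_level": "Medium", "sweet": "Yes",
})
```

Keeping the feature order fixed means every voice maps to a vector of the same length, which is what a model would need in order to treat the scores as conditioning inputs.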

The model framework

In this section, the general framework of the controllable new voice generation model is introduced. The model is based on the encoder-decoder framework, and can be illustrated as follows:

Figure 1. The model framework

In Figure 1, the dotted arrows and blocks indicate the modules used only in model training. In contrast to a conventional neural TTS system, an additional attribute encoder-decoder is introduced to model the voice attributes. Because the defined attributes might not be adequate to represent the voice characteristics of a speaker, a random variable (which represents the undefined voice attributes) can be concatenated with the defined attributes as input to the attribute decoder. To increase the controllability of the attributes, the attributes can also be used as extra conditions for the variance adaptor and the conformer decoder.

During inference (voice generation), the attribute encoder is not used, and the attributes are fed directly to the attribute decoder. Using a combination of attributes that does not appear in model training generates a new voice.
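The generation-time input described above can be sketched roughly as follows. This is not the actual model code: the latent size, the standard normal sampling, the example attribute values, and the names `build_decoder_input` and `training_combinations` are all hypothetical, chosen only to show the concatenation of defined attributes with a random variable.

```python
import random

LATENT_DIM = 4  # hypothetical size of the "undefined attributes" variable


def build_decoder_input(defined_attributes, rng, latent_dim=LATENT_DIM):
    """Concatenate defined attribute scores with a random latent vector.

    The latent vector stands in for voice characteristics that the
    defined attributes do not capture. Sampling it from a standard
    normal distribution is an assumption of this sketch.
    """
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return list(defined_attributes) + latent


rng = random.Random(0)  # seeded so the sketch is repeatable

# Hypothetical normalized attribute combinations seen during training.
training_combinations = {
    (0.0, 1.0, 0.5, 0.0, 0.0),
    (1.0, 0.0, 0.5, 0.5, 1.0),
}

# A combination chosen at generation time; no attribute encoder is involved.
new_voice_attributes = [0.0, 0.5, 0.5, 1.0, 1.0]
is_new_combination = tuple(new_voice_attributes) not in training_combinations

decoder_input = build_decoder_input(new_voice_attributes, rng)
```

Drawing a different latent vector for the same defined attributes would, under these assumptions, give another voice that shares the requested attributes but differs in the characteristics the attributes do not cover.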

AI-generated voices created with this technology

Once the new voice generation model is trained, it can create various voices from different combinations of attributes. For example, we can tune the age attribute to generate voices of different age groups, or tune the gender attribute to generate voices of different genders. The following table gives audio samples generated by tuning specific voice attributes.

Table 2. Audio samples by tuning attributes

	Masculine	Feminine
Young
Sweeter
Adult
More upbeat

We used this technology to generate two new voices: a masculine voice named AIGenerate1 and a feminine voice named AIGenerate2. These two voices are currently in public preview, available in selected regions (check details here). You can try these voices in the Azure TTS demo by selecting the corresponding voice names under the language English (United States). Some audio samples are given in the following table:

Table 3. Audio samples of AIGenerate1 and AIGenerate2.

	AIGenerate1	AIGenerate2
It seems clear that SpaceX has a significant lead over its competitors in the commercial space industry.
A new era of commercialized space travel begins.
Cooking is not about fast or slow, it is about truth.
Both of the proposals had clear goals to solve problems.

We evaluated these two voices with crowd-sourced judges. The mean opinion scores (MOS) of both voices are higher than 4.1 and comparable to those of human recordings.
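For readers unfamiliar with MOS, the sketch below shows how such a score is typically computed: judges rate each sample, commonly on a 1 to 5 scale, and the ratings are averaged. The post does not describe its evaluation protocol, so the scale, the confidence interval, and the ratings in this example are assumptions. The ratings are made up for illustration and are not the results reported above.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    mos = statistics.mean(ratings)
    # Normal approximation; assumes the ratings are independent.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up ratings on a 1-5 scale, purely illustrative.
example_ratings = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]
mos, ci_half_width = mean_opinion_score(example_ratings)
```

Reporting the interval alongside the mean helps when judging whether two scores, such as a synthetic voice and a human recording, are actually comparable.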

In addition to generating platform voices for the Azure Cognitive Services voice portfolio, this technology could potentially enable companies to create their own custom voices without providing additional training data, and help avoid generating voices that resemble the real human speakers in the training corpus.

If you have any comments or suggestions on this technology or the generated voices (AIGenerate1 and AIGenerate2), please feel free to give us feedback.

Get started

Today, Azure TTS supports over 140 languages and variants. Users can choose from more than 400 pre-set voices or use our Custom Neural Voice service to create their own synthetic voice. To explore the capabilities of Neural TTS and its different voice offerings, try our interactive demo.
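If you prefer to request a voice programmatically rather than through the demo, a request is usually expressed in SSML. The sketch below only builds the SSML string with the Python standard library and does not call the service. The helper name `build_ssml` is hypothetical, and the full voice identifier `en-US-AIGenerate1Neural` is an assumption based on the usual Azure voice naming pattern, so please confirm the exact name against the voice list in the documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice_name, lang="en-US"):
    """Build a minimal SSML document that selects a single voice."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        # escape() protects characters such as & and < in the input text.
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )


# The voice identifier below is an assumption; check the documented voice list.
ssml = build_ssml(
    "A new era of commercialized space travel begins.",
    "en-US-AIGenerate1Neural",
)
```

The resulting string can then be passed to whichever client you use to reach the speech service, as described in the quickstarts.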

For more information:

Read our documentation
Check out our quickstarts
Check out the code of conduct for integrating Neural TTS into your apps.

Updated Nov 28, 2024

Version 4.0

azure ai services

neural tts

JingzhouYang

Microsoft

9 Comments

brumby0890
Copper Contributor
Feb 06, 2024
Are there any gender neutral (non-binary) voices available?
dond-53
Former Employee
Jan 06, 2023
Hey all - the play buttons aren't working. It'd be great to hear these.
QinyingLiao
Microsoft
Oct 25, 2022
alanerickson The pricing is $16/m characters. Cognitive Speech Services Pricing | Microsoft Azure

The $100/m characters pricing only applies to the long audio API.
GarfieldHe
Microsoft
Oct 23, 2022
Fang627426 this will be checked in shortly, please stay tuned!
Fang627426
Brass Contributor
Oct 22, 2022
Hello! On the TTS demo page, when I clicked on Armenian, the sample text was in English and not Armenian. Could you please fix this issue? I've attached an image above to help you better understand the problem.
alanerickson
Copper Contributor
Oct 21, 2022
I do like the quality, but are there any plans to improve the pricing for long audio text-to-speech conversions? $100 per 1M characters is crazy expensive. Thanks.
GarfieldHe
Microsoft
Oct 17, 2022
Fang627426 thanks for the feedback! The product is still under deployment; it will be synced shortly!
Fang627426
Brass Contributor
Oct 17, 2022
GarfieldHe Melinda Ma QinyingLiao Hi there. I noticed an issue with Armenian, Basque, and some of the Chinese voices on the TTS demo page. When I click on them, the demo text in the text box is not displayed in the voice's language. Instead, it is displayed in English. Could you please fix that issue where possible? Thank you so much!
Fang627426
Brass Contributor
Oct 05, 2022
Dear GarfieldHe
I noticed that you added Armenian and Basque to the TTS demo page. Are you planning to do new Portuguese and French accents next, I wonder?
