AI - Azure AI services Blog

Introducing AI-generated voices for Azure neural text to speech service

JingzhouYang
Microsoft
Sep 28, 2022

This blog is co-authored with Lei He, Melinda Ma, Qinying Liao, Binggong Ding and Sheng Zhao

 

Neural text to speech (Neural TTS) is a powerful speech synthesis capability of Azure Cognitive Services. It enables users to convert text to lifelike speech and can be used in scenarios such as voice assistants, content read-aloud, and accessibility tools. Neural TTS has been incorporated into Microsoft’s flagship products and adopted by many customers, such as AT&T, Duolingo, Progressive, and more. It supports over 400 neural voices across 140 languages and variants.

 

To respond to customer feedback asking for more diverse neural TTS voices, we introduce a technology that can quickly generate a variety of new voices on demand. This technology is called controllable new voice generation, and a TTS voice generated with it is called an AI-generated voice. With this innovation, we can create the richer range of TTS voices that customers need, such as sweet, dark, or husky voices, much faster than before. In this blog, we introduce two new voices created with this approach: a masculine voice named AIGenerate1 and a feminine voice named AIGenerate2. Here is a deeper look at the technology behind AI-generated voices.

 

A traditional TTS system can only synthesize the voices that appear in model training, whereas the controllable new voice generation technology can create voices that do not sound like any of the speakers used in training. The voice generation model is a single end-to-end (E2E) model trained with data from various speakers of different genders, ages, and voice timbres. To achieve controllable voice generation, the voice attributes of each speaker, such as gender, age, and pitch, are also used in model training, so the model can explicitly learn how these attributes make up a voice. At inference (generation) time, a new voice can be generated on demand by supplying a new combination of attributes.

 

The definition of attributes

The voice attributes are voice characteristics, and different combinations of them can represent different voice timbres. Note that these attributes might not be independent of each other. For example, the attributes can be defined as voice gender, age group, speaking rate, pitch level, and other voice characteristics. These attributes can be labelled by human judges to characterize different voices.

 

For example, the table below shows how a voice can be labelled using attributes such as gender, age group, speaking rate, and more.

 

Table 1. Voice samples and the labelled attributes

Voice gender   Age group   Speaking rate   Pitch level   Sweet
Masculine      Adult       Medium          Low           No
Feminine       Child       Medium          Medium        Yes

These labelled attributes are converted to numbers and normalized, and the normalized scores are then used in model training.
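As an illustration only, the labelling-to-score step could look like the sketch below. The post does not describe the actual label sets or the numeric mapping, so the ordinal scales, the attribute keys, and the min-max normalization to [0, 1] are all assumptions made for this example.

```python
# Hypothetical ordinal scales: the post does not list the real label sets.
SCALES = {
    "gender": ["Masculine", "Feminine"],
    "age_group": ["Child", "Young", "Adult", "Senior"],
    "speaking_rate": ["Slow", "Medium", "Fast"],
    "pitch_level": ["Low", "Medium", "High"],
    "sweet": ["No", "Yes"],
}


def encode(labels):
    """Map each label to its position on the scale, then min-max normalize to [0, 1]."""
    vector = []
    for name, scale in SCALES.items():
        index = scale.index(labels[name])
        vector.append(index / (len(scale) - 1))
    return vector


# The two rows of Table 1, expressed with the hypothetical keys above.
voice_a = encode({"gender": "Masculine", "age_group": "Adult",
                  "speaking_rate": "Medium", "pitch_level": "Low", "sweet": "No"})
voice_b = encode({"gender": "Feminine", "age_group": "Child",
                  "speaking_rate": "Medium", "pitch_level": "Medium", "sweet": "Yes"})
```

Each voice ends up as a fixed-length vector of scores, which is the kind of input a model can be conditioned on. A real system might use a different encoding (for example, one-hot vectors for attributes that have no natural order).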

 

The model framework   

In this section, the general framework of the controllable new voice generation model is introduced. The model is based on the encoder-decoder framework, and can be illustrated as follows:

Figure 1. The model framework

 

In Figure 1, the dotted arrows and blocks indicate the modules only used in model training. In contrast to the conventional neural TTS system, an additional attribute encoder-decoder is introduced to model the voice attributes. As the defined attributes might not be adequate to represent the voice characteristics of a speaker, a random variable (which represents the undefined voice attributes) can be concatenated with the defined attributes as inputs to the attribute decoder. In order to increase the controllability of the attributes, the attributes can also be used as extra conditions to the variance adaptor and the conformer decoder.

 

In inference or voice generation, the attribute encoder is not used, and the attributes are directly used as inputs to the attribute decoder. By using a combination of attributes that does not appear in model training, a new voice can be generated.
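The inference-time input described above can be sketched roughly as follows. This is not the actual model code: the function names, the latent dimension, and the choice of a standard normal distribution for the random variable are assumptions, and the attribute decoder itself is left out.

```python
import random


def attribute_decoder_input(attributes, latent_dim=4, seed=None):
    """Concatenate the defined attribute scores with a random latent vector."""
    rng = random.Random(seed)
    # Assumption: a standard normal sample stands in for the random variable
    # that represents the undefined voice attributes.
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return list(attributes) + latent


def is_unseen(attributes, training_combinations):
    """Return True if this attribute combination did not appear in training."""
    return tuple(attributes) not in training_combinations


# Hypothetical normalized attribute combinations seen during training.
training_combinations = {
    (0.0, 1.0, 0.5, 0.0, 0.0),
    (1.0, 0.0, 0.5, 0.5, 1.0),
}

# A combination that mixes attributes from both training voices.
new_voice = (0.0, 1.0, 0.5, 0.5, 1.0)
if is_unseen(new_voice, training_combinations):
    decoder_input = attribute_decoder_input(new_voice, seed=0)
```

Fixing the seed makes the generated voice reproducible, while changing it would presumably vary the undefined characteristics with the defined attributes held constant.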

 

AI-generated voices created with this technology

Once the new voice generation model is trained, it can be used to create various voices by giving it different combinations of attributes. For example, we can tune the age attribute to generate voices in different age groups, or tune the gender attribute to generate voices of different genders. The following table gives audio samples generated by tuning specific voice attributes.

 

Table 2. Audio samples by tuning attributes

 

Masculine

Feminine

Young

    Sweeter

Adult

    More upbeat

 

We used this technology to generate two new voices: a masculine voice named AIGenerate1 and a feminine voice named AIGenerate2. These two voices are currently in public preview and available in selected regions (check details here). You can try these voices in the Azure TTS demo by selecting the corresponding voice names under the language English (United States). Some audio samples are given in the following table:

 

Table 3. Audio samples of AIGenerate1 and AIGenerate2.

 

AIGenerate1

AIGenerate2

It seems clear that SpaceX has a significant lead over its competitors in the commercial space industry.

A new era of commercialized space travel begins.

Cooking is not about fast or slow, it is about truth.

Both of the proposals had clear goals to solve problems.
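Outside the demo, these voices would presumably be selected the same way as any other neural voice, by naming them in the SSML sent to the service. The sketch below only builds the SSML string and makes no service call. The full identifier `en-US-AIGenerate1Neural` is an assumption based on the usual Azure voice naming pattern (the post gives only the short name AIGenerate1), and `build_ssml` is a hypothetical helper.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice_name="en-US-AIGenerate1Neural", lang="en-US"):
    """Build a minimal SSML document that selects a voice by name."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        # escape() guards against &, < and > in the input text.
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )


ssml = build_ssml("A new era of commercialized space travel begins.")
```

The resulting string can then be passed to whichever Speech SDK or REST call you already use for SSML synthesis. Swap in the AIGenerate2 identifier to hear the feminine voice.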

 

We evaluated these two voices with crowd-sourced judges. The mean opinion scores (MOS) of both voices are higher than 4.1 and comparable to those of human recordings.
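For readers unfamiliar with the metric, a MOS is the arithmetic mean of listener ratings, typically given on a 1 to 5 scale. The sketch below shows the calculation with a normal-approximation confidence interval. The ratings are made up for illustration and are not data from this evaluation.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return the mean opinion score and the half-width of an approximate 95% CI."""
    score = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return score, half_width


# Hypothetical 1-5 ratings from eight judges for one synthesized utterance.
ratings = [5, 4, 4, 5, 4, 3, 5, 4]
mos, half_width = mos_with_ci(ratings)
```

With a small number of ratings the interval is wide, which is why such evaluations usually collect many judgments per voice before comparing against human recordings.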

 

In addition to generating platform voices for the Azure Cognitive Services voice portfolio, this technology can potentially enable companies to create their own custom voices without providing additional training data, and to avoid generating voices that resemble the real human voices in the training corpus.

 

If you have any comments or suggestions on this technology or the generated voices (AIGenerate1 and AIGenerate2), please feel free to give us feedback.

 

Get started

Today, over 140 languages and variants are supported in Azure TTS. Users can choose from more than 400 pre-set voices or use our Custom Neural Voice service to create their own synthetic voice instead. To explore the capabilities of Neural TTS with its different voice offerings, we offer an interactive demo.

 

For more information:

Updated Nov 28, 2024
Version 4.0
  • Hey all - the play buttons aren't working. It'd be great to hear these.

  • Fang627426
    Brass Contributor

    Dear GarfieldHe 

    I noticed that you added Armenian and Basque to the TTS demo page. Are you planning to do new Portuguese and French accents next, I wonder?

  • Fang627426
    Brass Contributor

    GarfieldHe Melinda Ma QinyingLiao Hi there. I noticed an issue with Armenian, Basque, and some of the Chinese voices on the TTS demo page: when I click on them, the demo text in the text box is displayed in English rather than in the voice's language. Could you please fix that issue where possible? Thank you so much!

  • alanerickson
    Copper Contributor

    I do like the quality, but are there any plans to improve the pricing for long audio text-to-speech conversions?  $100 per 1M characters is crazy expensive.  Thanks.

  • Fang627426
    Brass Contributor

    Hello! On the TTS demo page, when I clicked on Armenian, the sample text was in English and not Armenian. Could you please fix this issue? I've attached an image above to help you better understand the problem.