This blog is co-authored with Lei He, Melinda Ma, Qinying Liao, Binggong Ding and Sheng Zhao
Neural text to speech (Neural TTS) is a powerful speech synthesis capability of Azure Cognitive Services. It enables users to convert text to lifelike speech and can be used in scenarios such as voice assistants, content read-aloud, and accessibility tools. Neural TTS has been incorporated into Microsoft’s flagship products and adopted by many customers, such as AT&T, Duolingo, and Progressive. It supports over 400 neural voices across 140 languages and variants.
To respond to customer feedback asking for more diverse neural TTS voices, we introduce a technology that can quickly generate a variety of new voices on demand. This technology is called controllable new voice generation, and a TTS voice generated with it is called an AI-generated voice. With this innovation, we can create a richer set of TTS voices as needed, such as sweet, dark, or husky voices, much faster than before. In this blog, we introduce two new voices created by this approach: a masculine voice named AIGenerate1 and a feminine voice named AIGenerate2. Here is a deeper view of the technology behind AI-generated voices.
A traditional TTS system can only synthesize the voices that appear in model training, whereas the controllable new voice generation technology can create voices that do not sound like any speaker used in training. The voice generation model is a single end-to-end (E2E) model trained with data from various speakers of different genders, ages, and voice timbres. To achieve controllable voice generation, the voice attributes of each speaker, such as gender, age, and pitch, are also used in model training, so the model can explicitly learn how these attributes make up a voice. In inference or generation, a new voice can be generated on demand by using a new combination of attributes.
The definition of attributes
Voice attributes are characteristics of a voice, and different combinations of these attributes represent different voice timbres. Note that the attributes are not necessarily independent of each other. For example, the attributes can be defined as voice gender, age group, speaking rate, pitch level, and other voice characteristics. Human judges can label these attributes to characterize different voices.
For example, the table below shows how a voice can be labelled using attributes such as gender, age group, speaking rate, and more.
Table 1. Voice samples and the labelled attributes
Voice gender | Age group | Speaking rate | Pitch level | Sweet |
Masculine | Adult | Medium | Low | No |
Feminine | Child | Medium | Medium | Yes |
These labelled attributes are converted to numbers and normalized; the normalized scores are then used in model training.
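As a rough illustration of this step, the sketch below maps the labels from Table 1 to scores in the range 0 to 1. The level lists (including the `Young`, `Senior`, `Slow`, `Fast`, and `High` levels), the ordinal encoding, and the function name `encode_voice` are assumptions made for this example; the actual encoding used in training is not described here.

```python
# Hypothetical ordered levels for each attribute (an assumption for illustration).
ATTRIBUTE_LEVELS = {
    "gender": ["Masculine", "Feminine"],
    "age_group": ["Child", "Young", "Adult", "Senior"],
    "speaking_rate": ["Slow", "Medium", "Fast"],
    "pitch_level": ["Low", "Medium", "High"],
    "sweet": ["No", "Yes"],
}


def encode_voice(labels):
    """Turn human-assigned labels into scores normalized to [0, 1]."""
    scores = []
    for name, levels in ATTRIBUTE_LEVELS.items():
        index = levels.index(labels[name])        # label -> integer rank
        scores.append(index / (len(levels) - 1))  # rank -> [0, 1]
    return scores


# The two rows of Table 1.
voice_a = {"gender": "Masculine", "age_group": "Adult",
           "speaking_rate": "Medium", "pitch_level": "Low", "sweet": "No"}
voice_b = {"gender": "Feminine", "age_group": "Child",
           "speaking_rate": "Medium", "pitch_level": "Medium", "sweet": "Yes"}

print(encode_voice(voice_a))
print(encode_voice(voice_b))
```

An ordinal encoding like this keeps each attribute on a continuous scale, which is one way to make it possible to request values between or beyond the labelled levels at generation time.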
The model framework
In this section, the general framework of the controllable new voice generation model is introduced. The model is based on the encoder-decoder framework, and can be illustrated as follows:
Figure 1. The model framework
In Figure 1, the dotted arrows and blocks indicate the modules only used in model training. In contrast to the conventional neural TTS system, an additional attribute encoder-decoder is introduced to model the voice attributes. As the defined attributes might not be adequate to represent the voice characteristics of a speaker, a random variable (which represents the undefined voice attributes) can be concatenated with the defined attributes as inputs to the attribute decoder. In order to increase the controllability of the attributes, the attributes can also be used as extra conditions to the variance adaptor and the conformer decoder.
In inference or voice generation, the attribute encoder is not used, and the attributes are directly used as inputs to the attribute decoder. By using a combination of attributes that does not appear in model training, a new voice can be generated.
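The inference-time input described above can be sketched as follows: the chosen attribute scores are concatenated with a random variable that stands in for the undefined voice attributes. This is only a minimal sketch. The function names, the latent dimension of 4, and the standard normal distribution are assumptions for illustration, and the attribute decoder itself is not shown.

```python
import random


def build_decoder_input(attribute_scores, latent_dim=4, seed=None):
    """Concatenate defined attribute scores with a random latent vector.

    latent_dim and the Gaussian sampling are hypothetical choices; the
    article only states that a random variable is concatenated.
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return list(attribute_scores) + latent


def is_new_combination(attribute_scores, training_combinations):
    """Check that the requested combination was not seen in training."""
    return tuple(attribute_scores) not in training_combinations


# Hypothetical attribute combinations present in the training data.
seen_in_training = {(0.0, 0.67, 0.5, 0.0, 0.0), (1.0, 0.0, 0.5, 0.5, 1.0)}

requested = [0.0, 0.33, 0.5, 0.5, 1.0]  # a combination not listed above
if is_new_combination(requested, seen_in_training):
    decoder_input = build_decoder_input(requested, seed=0)
    print(len(decoder_input))  # attribute scores followed by latent values
```

Fixing the seed makes the sampled latent part reproducible, so the same attribute combination yields the same decoder input each time.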
AI-generated voices created with this technology
Once the new voice generation model is trained, it can be used to create various voices by giving it different combinations of attributes. For example, we can tune the age attribute to generate voices in different age groups, or tune the gender attribute to generate voices with different genders. The following table gives audio samples generated by tuning specific voice attributes.
Table 2. Audio samples by tuning attributes
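Tuned attribute | Masculine | Feminine |
Young | [audio sample] | [audio sample] |
Sweeter | [audio sample] | [audio sample] |
Adult | [audio sample] | [audio sample] |
More upbeat | [audio sample] | [audio sample] |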
We used this technology to generate two new voices: a masculine voice named AIGenerate1 and a feminine voice named AIGenerate2. These two voices are currently in public preview and available in selected regions (check details here). You can try them in the Azure TTS demo by selecting the corresponding voice names under the English (United States) language. Some audio samples are given in the following table:
Table 3. Audio samples of AIGenerate1 and AIGenerate2.
Text | AIGenerate1 | AIGenerate2 |
It seems clear that SpaceX has a significant lead over its competitors in the commercial space industry. | [audio sample] | [audio sample] |
A new era of commercialized space travel begins. | [audio sample] | [audio sample] |
Cooking is not about fast or slow, it is about truth. | [audio sample] | [audio sample] |
Both of the proposals had clear goals to solve problems. | [audio sample] | [audio sample] |
We evaluated these two voices with crowd-sourced judges. The mean opinion scores (MOS) of both voices are higher than 4.1 and comparable to those of human recordings.
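For readers unfamiliar with the metric, a mean opinion score is the average of the naturalness ratings that judges give to a set of audio samples. The sketch below assumes the conventional 1-to-5 rating scale, and the ratings in it are made-up illustrative values, not data from our evaluation.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of judge ratings, assuming a 1-to-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


# Hypothetical ratings from eight judges for one voice (illustration only).
example_ratings = [5, 4, 4, 5, 4, 3, 5, 4]
print(mean_opinion_score(example_ratings))
```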
In addition to generating platform voices for the Azure Cognitive Services voice portfolio, this technology can potentially enable companies to create their own custom voices without providing additional training data, and to avoid generating voices that resemble the real speakers in the training corpus.
If you have any comments or suggestions on this technology or the generated voices (AIGenerate1 and AIGenerate2), please feel free to give us feedback.
Get started
Today, over 140 languages and variants are supported in Azure TTS. Users can choose from more than 400 pre-set voices or use our Custom Neural Voice service to create their own synthetic voice instead. To explore the capabilities of Neural TTS with its different voice offerings, we offer an interactive demo.
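If you prefer to try one of the AI-generated voices from code rather than the demo, a request is typically expressed as SSML with a `voice` element. The sketch below only builds the SSML string with the Python standard library and does not call the service. The full voice identifier `en-US-AIGenerate1Neural` is an assumption based on the usual locale-name-Neural naming pattern, so please check the voice list in the documentation for the exact name and region availability. The helper name `build_ssml` is also hypothetical.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, voice_name, lang="en-US"):
    """Build a minimal SSML document that selects a voice by name."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": lang,
    })
    voice = ET.SubElement(speak, "voice", {"name": voice_name})
    voice.text = text  # special characters are escaped on serialization
    return ET.tostring(speak, encoding="unicode")


# Assumed voice identifier; verify it against the published voice list.
ssml = build_ssml("A new era of commercialized space travel begins.",
                  "en-US-AIGenerate1Neural")
print(ssml)
```

The resulting string can then be passed to whichever Speech SDK or REST client you already use for synthesis.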
For more information:
- Read our documentation
- Check out our quickstarts
- Check out the code of conduct for integrating Neural TTS into your apps.