Microsoft Foundry Blog

Build a natural custom voice for your brand

Microsoft

Feb 04, 2021

Custom Neural Voice is a Text-to-Speech (TTS) feature of Speech in Azure Cognitive Services that allows you to create a one-of-a-kind customized synthetic voice for your brand. Since its preview in September 2019, Custom Neural Voice has empowered organizations such as AT&T, Duolingo, Progressive, and Swisscom to develop branded speech solutions that delight users. (For more details, read the Innovation Stories blog).

Today, we are excited to announce that Custom Neural Voice is now generally available (GA). It is important to note that although Custom Neural Voice is GA from a technological standpoint, interested customers must apply and be approved to use it. Alternatively, developers can add TTS capabilities to their apps quickly by creating an Azure Speech instance and selecting from over 200 pre-built TTS and Neural TTS voices across 54 languages/locales.

In this blog, we’ll introduce how Custom Neural Voice works and share best practices for responsibly creating a highly natural brand voice for your apps. If you have questions, join us at our ‘Ask-Microsoft-Anything’ on Wednesday, 2/10 at 9 AM PT.

Your voice, your brand

In a world where voice-based interactions are increasingly becoming the norm, your voice is your brand. A recognizable digital voice helps your customers connect with your brand in new ways.

In recent years we have seen increased interest from a broad range of companies across Media and Entertainment, Telecom, Automobile, Education, and Hospitality, who consider voice-based interactions from a range of devices like phones, speakers, TV/cable boxes, and cars as a key interaction point with their customers. These organizations are looking to have a consistent, branded experience delivered directly to their customers. To highlight one such example, below is an audio sample of the 'Flo' virtual chatbot from Progressive.

Voice Sample: 'Flo' from Progressive

Custom Neural Voice empowers people and organizations in many ways. The following scenarios are examples of use cases where customers find Custom Neural Voice particularly useful and valuable:

Customer Service Chatbots – Companies can automate their call center operations with conversational AI that answers customer calls in a natural-sounding voice conveying friendliness, empathy, professionalism, and other values that are important to the company. For example, Progressive is using Custom Neural Voice to enable their virtual version of Flo to help their customers with ‘everything from getting a free car insurance to general insurance questions’. Read the full story.

Voice Assistants – Companies developing smart assistants for appliances, cars, and homes can use Custom Neural Voice to create a unique synthetic voice that conveys the brand of the company, the persona of the assistant, and a speaking style that enables the best experience for their target users. With Custom Neural Voice, Swisscom was able to create a multilingual voice assistant that sounds human, is unique to Swisscom, and resonates with its audience. Read the full story.

Online Learning – Education providers can add speech to their learning material with a voice that is suitable for the subjects and the students, thereby improving the engagement of the students and the effectiveness of the learning. Duolingo is using the Custom Neural Voice capability to develop stylized voices for their virtual characters for their online learning experience. Learn more.

Audio Books – Content publishers can turn written content into audio spoken with a synthetic voice to make it more accessible to a global audience. With Custom Neural Voice, content publishers can create one or more unique voices with natural reading styles that match the subject and context of the content as well as the preferences of the listeners. The Beijing Hongdandan Visually Impaired Service Center is using the Custom Neural Voice capability to produce audiobooks based on the voice of Lina, a trainer at the organization who is familiar to people who are blind in China.

Assistive Technology and Real-time Translations – Custom Neural Voice can be used to assist people in need or to improve accessibility. As an assistive technology, it could let people with speech impairments communicate with others in a voice that sounds like their own. Custom Neural Voice can also be used in other situations such as real-time translation, allowing people to communicate with others in a foreign language in a familiar voice.

Public Service Announcement – Public service organizations can use Custom Neural Voice to create a voice that is suitable for public announcements, whether it is in an airport, a train terminal, or other venues. The use of synthetic voice provides the ability to generate announcements with dynamic content that cannot be recorded ahead of time.

Benefit of Custom Neural Voice

Traditionally, TTS requires a large volume of voice data—in the range of 10,000 lines or more—to produce a fluent voice model. Consequently, TTS models with fewer recorded lines tend to sound noticeably robotic.

With the innovation of deep neural networks and a powerful base model built with speech data from many different speakers, Neural TTS can 'learn' the way phonetics are combined in natural human speech rather than using classical programming or statistical methods.

Empowered with this technology, Custom Neural Voice enables users to build highly realistic voices with just a small number of training audio recordings. This new technology allows companies to spend a tenth of the effort traditionally needed to prepare training data while at the same time significantly increasing the naturalness of the synthetic speech output when compared to traditional training methods.

Listen to the samples created with Custom Neural Voice below, or try more demos in Speech Studio.

Language	Voice	Human	TTS (Custom Neural Voice)
Chinese (Mandarin, simplified)	Lina (Hongdandan)
English (Australia)	Thomas
English (United States)	Angela
French (France)	Zoe (Swisscom)
German (Germany)	Lara (Swisscom)

How it works

Custom Neural Voice is based on Neural TTS technology that creates a natural-sounding voice. The realistic, natural-sounding voice of Custom Neural Voice can represent brands, personify machines, and allow users to interact with applications conversationally in a natural way.

The underlying Neural TTS technology used for Custom Neural Voice consists of three major components: Text Analyzer, Neural Acoustic Model, and Neural Vocoder. To generate natural synthetic speech from text, the text is first input into the Text Analyzer, which outputs a phoneme sequence. A phoneme is a basic unit of sound that distinguishes one word from another in a particular language. A sequence of phonemes defines the pronunciations of the words provided in the text. Then the phoneme sequence goes into the Neural Acoustic Model to predict acoustic features that define speech signals, such as timbre, speaking style, speed, intonation, and stress patterns. Finally, the Neural Vocoder converts the acoustic features into audible waves so that synthetic speech is generated.
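To make the data flow between the three components concrete, here is a minimal Python sketch. It is purely illustrative: the function names, the tiny pronunciation lexicon, and the placeholder pitch and duration values are hypothetical stand-ins. The real Text Analyzer, Neural Acoustic Model, and Neural Vocoder are trained models, not the toy rules shown here.

```python
import math
from dataclasses import dataclass
from typing import List

# Hypothetical toy lexicon; a real text analyzer handles far more than lookup.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


@dataclass
class AcousticFrame:
    pitch_hz: float     # placeholder for intonation-related features
    duration_s: float   # placeholder for speaking-speed-related features


def text_analyzer(text: str) -> List[str]:
    """Stage 1: text -> phoneme sequence (toy dictionary lookup)."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word.strip(".,!?"), ["<unk>"]))
    return phonemes


def acoustic_model(phonemes: List[str]) -> List[AcousticFrame]:
    """Stage 2: phonemes -> acoustic features (one made-up frame per phoneme)."""
    return [
        AcousticFrame(pitch_hz=120.0 + 5.0 * i, duration_s=0.08)
        for i, _ in enumerate(phonemes)
    ]


def vocoder(frames: List[AcousticFrame], sample_rate: int = 16000) -> List[float]:
    """Stage 3: acoustic features -> waveform samples (a sine tone per frame)."""
    samples: List[float] = []
    for frame in frames:
        count = round(frame.duration_s * sample_rate)
        samples.extend(
            math.sin(2.0 * math.pi * frame.pitch_hz * t / sample_rate)
            for t in range(count)
        )
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the three stages in the order described in the text."""
    return vocoder(acoustic_model(text_analyzer(text)))
```

The point of the sketch is only the ordering and the intermediate data types: text becomes phonemes, phonemes become acoustic features, and features become audio samples.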

Neural TTS voice models are trained using deep neural networks based on real voice recording samples. With the customization capability of Custom Neural Voice, you can adapt the Neural TTS engine to better fit your user scenarios. To create a custom neural voice, visit the Speech Studio to upload the recorded audio and corresponding scripts, train the model, and deploy the voice to a custom endpoint. Depending on the use case, Custom Neural Voice can be used to convert text into speech in real time (e.g., in a smart virtual assistant) or to generate audio content offline (e.g., for an audiobook or for instructions in e-learning applications) with the text input provided by the user. This is made available through REST APIs, the Speech SDK, or a no-code Audio Content Creation tool.
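As a rough illustration of the REST route, the Python sketch below builds (but does not send) a synthesis request for a deployed custom voice. The URL pattern, header names, and output format string follow the Text-to-Speech REST API as we understand it and should be checked against the current documentation before use; the region, deployment ID, key, and voice name are hypothetical placeholders.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_tts_request(
    region: str,
    deployment_id: str,
    subscription_key: str,
    voice_name: str,
    text: str,
) -> urllib.request.Request:
    """Build a POST request for a custom voice endpoint without sending it."""
    # Assumed URL pattern for custom voice deployments; verify against the docs.
    url = (
        f"https://{region}.voice.speech.microsoft.com"
        f"/cognitiveservices/v1?deploymentId={deployment_id}"
    )
    # The request body is SSML; user text is escaped so it stays valid XML.
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "custom-voice-sketch",
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request and return the audio bytes; that step is left out here so the sketch can be inspected without credentials.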

Building a Custom Neural Voice

As part of Microsoft’s commitment to responsible AI, we are designing and releasing Custom Neural Voice with the intention of protecting the rights of individuals and society, fostering transparent human-computer interaction, and counteracting the proliferation of harmful deepfakes and misleading content. For this reason, we have limited the access and use of Custom Neural Voice. Submit an intake form here.

Microsoft requires every customer to obtain explicit written permission from the voice talent before creating a voice model (see Disclosure for Voice Talent). In addition, you must not use custom neural voice for certain prohibited use cases (see Code of Conduct) and must disclose the synthetic nature of the service to your users upon deployment of the custom voice model (see Disclosure Guidelines).

When preparing your recording script, make sure you include the following sentence to acquire the voice talent’s acknowledgement of using their voice data to create a TTS voice model and generate synthetic speech.

“I [state your first and last name] am aware that recordings of my voice will be used by [state the name of the company] to create and use a synthetic version of my voice.”

As a technical safeguard intended to prevent misuse of Custom Neural Voice services, Microsoft will use this recording to verify that the voice talent’s voice in the script matches the voice provided in the training data through the Speaker Verification technology. Read more about this process in the Data and Privacy document.

In the video below, we introduce how to use the Speech Studio to create a highly natural voice with your own data.

Creating a great custom voice requires careful quality control in each step, from voice design and data preparation to the deployment of the voice model to your system. This docs page outlines in more detail the characteristics, limitations, and best practices in designing and building a custom neural voice.

Below are some key steps to take when creating a custom neural voice for your organization. (Note: this presumes you have applied and have been approved for use of Custom Neural Voice.)

Step 1: Persona design

First, design a persona of the voice that represents your brand using a persona brief document that defines elements such as the features of the voice, and the character behind the voice. This will help to guide the process of creating a custom voice model, including defining the scripts, selecting your voice talent, training and voice tuning.

Step 2: Script selection

Carefully select the recording script to represent the user scenarios for your voice. For example, you can use the phrases from bot conversations as your recording script if you are creating a customer service bot. Include different sentence types in your scripts, including statements, questions, exclamations, etc.
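A quick way to check that a draft script mixes sentence types is to count them. The Python sketch below is a simple heuristic that classifies each line by its final punctuation mark; the function names are hypothetical, and a real script review would also look at content and phonetic coverage.

```python
from collections import Counter
from typing import Iterable, List

SENTENCE_TYPES = ("statement", "question", "exclamation")


def sentence_type(line: str) -> str:
    """Classify a script line by its closing punctuation (rough heuristic)."""
    stripped = line.strip()
    if stripped.endswith("?"):
        return "question"
    if stripped.endswith("!"):
        return "exclamation"
    return "statement"


def script_coverage(lines: Iterable[str]) -> Counter:
    """Count how many non-empty lines fall into each sentence type."""
    return Counter(sentence_type(line) for line in lines if line.strip())


def missing_types(lines: Iterable[str]) -> List[str]:
    """List the sentence types that do not appear in the script at all."""
    coverage = script_coverage(lines)
    return [kind for kind in SENTENCE_TYPES if coverage[kind] == 0]
```

For a customer service bot, running `missing_types` over the draft script would flag, for example, a script made up entirely of statements.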

Step 3: Preparing training data

We recommend that the audio recordings be captured in a professional-quality recording studio to achieve a high signal-to-noise ratio. The quality of the voice model heavily depends on your training data. Consistent volume, speaking rate, pitch, and expressive mannerisms of speech are required.
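Signal-to-noise ratio and volume consistency can be estimated numerically before recordings are uploaded. The Python sketch below is a minimal illustration, assuming the audio has already been decoded into floating-point samples between -1.0 and 1.0 and that a silent stretch of the recording is available as a noise reference. The function names are hypothetical, and no particular pass/fail threshold is implied.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude; assumes a non-empty sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels: 20 * log10(rms(speech) / rms(noise)).

    Assumes neither segment is pure digital silence (rms > 0).
    """
    return 20.0 * math.log10(rms(speech) / rms(noise))


def level_spread_db(clips: Sequence[Sequence[float]]) -> float:
    """Difference in dB between the loudest and quietest clip.

    A small spread suggests consistent recording volume across clips.
    """
    levels = [20.0 * math.log10(rms(clip)) for clip in clips]
    return max(levels) - min(levels)
```

Comparing these numbers across recording sessions is one way to spot a session whose volume or noise floor drifted from the rest.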

Common issues with recordings include speaking style mismatch (e.g., lines not delivered in the ‘excited’ manner that you want the voice to have), unnatural speed, unstable breaks, and mispronounced words. It is recommended that you work with a voice director to control the recording quality. Follow the recording guidance here.

Once the recordings are ready, follow the instructions here to prepare the training data in the right format.

Step 4: Testing

Prepare test scripts for your voice model that cover the different use cases for your apps. It’s recommended that you use scripts within and outside the training dataset so you can test the quality more broadly for different content.
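One simple way to organize such a test set is to separate the test sentences that also appear in the training script from those that do not. The Python sketch below does this with an exact match after normalizing case and whitespace; the function names are hypothetical, and the matching rule is an assumption that you may want to loosen or tighten.

```python
from typing import Iterable, List, Tuple


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(sentence.lower().split())


def split_test_script(
    test_lines: Iterable[str], training_lines: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (lines seen in training, lines not seen in training)."""
    seen = {normalize(line) for line in training_lines}
    inside: List[str] = []
    outside: List[str] = []
    for line in test_lines:
        (inside if normalize(line) in seen else outside).append(line)
    return inside, outside
```

Listening to both groups separately makes it easier to tell whether the voice model only sounds good on content it was trained on.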

Step 5: Tuning and adjustment

The style and the characteristics of the trained voice model depend on the style and the quality of the recordings from the voice talent used for training. However, several adjustments can be made using SSML (Speech Synthesis Markup Language) when you make the API calls to your voice model to generate synthetic speech. SSML is the markup language used to communicate with the TTS service to convert text into audio. The adjustments include change of pitch, rate, intonation, and pronunciation correction. If the voice model is built with multiple styles, SSML can also be used to switch the styles.

All of the SSML markups mentioned above can be passed directly to the API. We also provide an online tool, Audio Content Creation, that allows customers to fine-tune their audio output using a friendly UI.
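As an illustration of the adjustments described above, the Python sketch below assembles an SSML document that sets rate and pitch and, optionally, a speaking style. The `prosody` element comes from the SSML standard; the `mstts:express-as` element and its namespace are Microsoft extensions as we understand them, and the voice name and style name are hypothetical placeholders that only work if your voice model supports them.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    voice_name: str,
    rate: str = "0%",
    pitch: str = "0%",
    style: Optional[str] = None,
    lang: str = "en-US",
) -> str:
    """Wrap text in SSML with prosody adjustments and an optional style."""
    # Escape the text so characters like & and < do not break the XML.
    body = (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody>"
    )
    if style is not None:
        # Style switching applies only to voice models built with that style.
        body = (
            f"<mstts:express-as style={quoteattr(style)}>{body}"
            "</mstts:express-as>"
        )
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>{body}</voice>"
        "</speak>"
    )
```

For example, `build_ssml("Welcome back!", "MyBrandVoice", rate="+10%", pitch="-5%")` produces a document that asks for slightly faster, slightly lower-pitched speech from the named voice.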

Get started

Interested in building a custom neural voice? Check the languages supported. Sign up for the Speech service on Azure and get started in Speech Studio.

Besides the capability to customize TTS voice models, Microsoft offers over 200 neural and standard voices covering 54 languages and locales. With these Text-to-Speech voices, you can quickly add read-aloud functionality for a more accessible app design, or give a voice to chatbots to provide a richer conversational experience to your users.