Custom neural voice, a feature of Azure Cognitive Services for Speech, is a great way to create a highly natural synthetic voice that sounds almost identical to your voice actor. This synthetic voice can then be used in a variety of scenarios, including audiobooks, language learning, news reading, and many more.
Microsoft has made it simple to train a custom neural voice using a small set of recordings from the target voice (300 to 2,000 sentences or short phrases, which is about 30 minutes to 3 hours of speech data). So you're thinking: great! How do I get started?
Steps in the Journey
Before we begin, here is a step-by-step overview of creating a custom neural voice. We will explain each of these steps in detail below.
Prepare
1. Meet Responsible AI requirements: read the Responsible AI requirements and fill out the custom neural voice application.
2. Cast a voice actor: define the voice persona and find a voice actor.
3. Create a script: download the prepared general scripts and write your domain script.
4. Record the actor: record the permission statement and the prepared scripts.

Create
5. Start a new project in Speech Studio: log into Speech Studio using your Azure account and create a new custom neural voice project.
6. Upload voice data: upload the permission statement recording and the voice recordings plus script.
7. Train the voice model: select the appropriate training data and voice talent profile to train, then listen to the test samples to check the quality.
8. Deploy your voice: deploy the trained model and test your endpoint to make sure it works.

Integrate
9. Integrate your voice for audio content creation or your own apps: use your voice via audio content creation, or build your app with your custom neural voice using the Speech SDK.
Responsible AI
The first step to creating your own custom neural voice is filling out an application to gain access to the technology. As part of Microsoft's commitment to responsible AI, we have designed and released custom neural voice with the intention of protecting the rights of individuals and society, fostering transparent human-computer interaction, and counteracting the proliferation of harmful deepfakes and misleading content. For this reason, we have limited the access and use of custom neural voice. Once your application has been reviewed and you have committed to using the technology in alignment with our responsible AI principles, you will be granted access.
The application contains several questions, such as your company or organization's name, the Azure subscription ID where you would like to deploy the voice, your intended use case, and whether you have permission from the voice actor to create a synthetic version of their voice. These are just some of the questions, so please read through the application and fill it out completely and in detail. To gain a better understanding of the application process, please check out this video: Applying for Custom Neural Voice | Azure Videos | Channel 9 (msdn.com).
So now you've applied to create a custom neural voice and been granted access. What's next? We suggest a series of steps that will help you create a great voice, and we will discuss each of them in detail below: persona and domain design, script selection, voice talent selection, recording, and quality checking. We'll also discuss how to train and deploy your model, as well as how you can fine-tune your voice with our audio content creation tool.
Persona and Domain Design
What is a voice persona? Persona refers to the attributes that make your imaginary character come to life in a way that will resonate with your customers. For example, you may want a female voice in her 40s who speaks with authority and confidence, and is directly engaging, thoughtful, and unbiased. Think carefully about your persona, because it will represent your company when speaking to customers. In addition to conveying your brand intent, the persona also keeps all aspects of voice production consistent, including casting, script development, voice directing during recording, and evaluation of the final output.
Script Selection
The key to writing scripts is to make sure the actor can easily capture the persona in his or her delivery and maintain that persona over the course of several recording sessions. Write your scripts with this in mind.
Your sentences and short phrases don't need to come from the same source, or the same kind of source. They don't even need to have anything to do with each other. However, if you will use set phrases (for example, "You have successfully logged in") in your speech application, make sure to include them in your script. This will give your custom neural voice a better chance of delivering those phrases well. And if you decide to use a recording in place of synthesized speech, you'll already have it in the same voice.
We recommend that the recording script include both general sentences and sentences from your domain. For example, if you plan to record 2,000 sentences, 1,000 of them could be general sentences and 1,000 could be sentences from your target domain or application.
For the general script, to help you get to 2,000 sentences, you can use the Microsoft shared scripts provided here. We offer some sentences selected from the news domain for the languages that are supported by custom neural voice.
For the domain script, here are some tips for selecting or creating it:
- Aim for balanced coverage in your domain script, including general sentences, question sentences, exclamation sentences, long sentences, and short sentences.
- Question sentences should take up about 10% of the domain script. They should be evenly divided between questions that can be answered with a yes or no, and any other types of questions.
- One sentence per line. Don’t put multiple sentences into one line.
- Make sure the sentences are natural and easy to read in a way that is consistent with your persona. Generally, do not overuse numbers or abbreviations unless your use case includes many of them. It's also a good idea to avoid difficult proper nouns for places and names unless you expect a lot of these in your application.
- Some applications use numbers or acronyms. You can include these, but it’s best to do some normalization of these numbers or acronyms into spoken form. For example:
For lines with abbreviations, instead of "001 BTW", you have "001 by the way".
For lines with digits, instead of "002 911", you have "002 nine one one".
With that, the voice talent will pronounce these words in an expected way that will allow matching of the recordings and script during the training process. Make sure to check the script carefully for errors. If possible, have someone else check too. It is crucial that the text in the script matches exactly what the actor says. Note that you can change the script after the recording and before training if there is a mismatch.
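To illustrate this kind of normalization, here is a minimal sketch in Python; the abbreviation map and example inputs are hypothetical, so adapt them to your own script.

# normalize_script.py - a minimal sketch of script normalization (hypothetical helper).
# It expands a few example abbreviations and spells out digits so that the written
# script matches what the voice talent will actually say.

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# Example abbreviation map; extend it for your own domain.
ABBREVIATIONS = {
    "BTW": "by the way",
    "ASAP": "as soon as possible",
}

def normalize(sentence):
    words = []
    for word in sentence.split():
        if word.upper() in ABBREVIATIONS:
            words.append(ABBREVIATIONS[word.upper()])
        elif word.isdigit():
            # Spell out each digit, for example "911" becomes "nine one one".
            words.append(" ".join(DIGIT_WORDS[d] for d in word))
        else:
            words.append(word)
    return " ".join(words)

print(normalize("BTW"))  # by the way
print(normalize("911"))  # nine one one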
Voice Talent Selection
Voice talent selection, or casting, can start as soon as the voice persona is defined. Finding the right voice talent is just as important as designing your voice persona and selecting your scripts. It can seem daunting to wade through the many voice talents available, but here are some tips on selecting one:
- Experience: Make sure your voice talent has experience in the persona and content you would like to capture. For example, if you are recording a voice talent for audiobooks, make sure that the voice talent has experience reading long narration.
- Sound: Listen for qualities in the voice actor's voice such as pitch and tone, and for clarity in pronunciation. Make sure the actor's natural voice fits the persona you are looking for.
- Most importantly, once you have chosen a voice actor, make sure they are willing to sign a contract with you stating that they agree to their voice being used to create a synthetic voice.
Recording
When choosing a voice actor, it's a good idea to consider where your recording will take place. If you are using your own studio or a local studio, choose local talent or you may incur travel fees; ask whether the studio supports remote (call-in) sessions to avoid those costs. When looking for a recording studio, ask whether it has experience recording voice talent for synthetic voice. If it doesn't, you can still use that studio, but make sure the recording engineer knows exactly what you are looking for and, if possible, hire a voice director who understands what is unique about recording a voice for text-to-speech synthesis.
When scheduling the recording sessions, book them in 2 to 3-hour blocks, with at least one day off between sessions. Make sure the studio delivers several recordings immediately after the first session so they can be checked against the required audio specifications. Assume about 100 lines per hour as a starting point; at that rate, a 2,000-line script takes roughly 20 hours of recording, or about seven to ten sessions. The remaining schedule can be adjusted based on the number of lines per hour completed in the first two sessions.
When recording, the script should be delivered as individual sentences or phrases: the voice actor says one sentence or phrase, pauses, and then says the next. The sentences or phrases then need to be split into separate .wav files and numbered to match the transcript line by line.
Below is an example of how the transcripts are organized in a .txt file:
0000000001[tab] This is the waistline, and it's falling.
0000000002[tab] We have trouble scoring.
0000000003[tab] It was Janet Maslin.
For more information on organizing transcripts, check out this article: How to prepare data for Custom Voice - Speech service - Azure Cognitive Services | Microsoft Docs.
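As a rough illustration, here is a small Python sketch (the transcript and folder names are hypothetical) that checks that every transcript line has a numeric, unique ID and a matching .wav file:

# check_transcript.py - a minimal sketch (hypothetical paths) that verifies each
# transcript line has a numeric, unique ID, sentence text, and a matching .wav file.
import os

TRANSCRIPT = "transcript.txt"   # tab-separated: <numeric id><tab><sentence>
AUDIO_DIR = "wavs"              # folder containing 0000000001.wav, 0000000002.wav, ...

seen_ids = set()
with open(TRANSCRIPT, encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        file_id, _, sentence = line.rstrip("\n").partition("\t")
        if not file_id.isdigit():
            print(f"Line {line_number}: ID '{file_id}' is not numeric")
        if file_id in seen_ids:
            print(f"Line {line_number}: duplicate ID '{file_id}'")
        seen_ids.add(file_id)
        if not sentence.strip():
            print(f"Line {line_number}: missing sentence text")
        if not os.path.exists(os.path.join(AUDIO_DIR, file_id + ".wav")):
            print(f"Line {line_number}: no matching .wav file for ID '{file_id}'")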
Don't forget to record the voice actor saying the permission statement that is asked for in the Speech Studio custom neural voice portal as well:
“I [state your first and last name] am aware that recordings of my voice will be used by [state the name of the company] to create and use a synthetic version of my voice.”
You can find a version of this consent statement in multiple languages as well as sample scripts for recording here.
Quality Checking
It's a good idea to check your audio for quality before closing out the contract with the recording studio. The audio files must be in RIFF (.wav) format, and the sampling rate must be at least 24,000 Hz. The sample format must be 16-bit PCM. File names must be numeric with a .wav extension, and no duplicate file names are allowed. All audio files must be shorter than 15 seconds. You can read more about the required audio properties here.
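Here is a minimal sketch, using Python's standard wave module, that spot-checks these requirements on a folder of recordings; the folder name is a placeholder:

# check_audio.py - a minimal sketch that spot-checks the audio requirements above
# (sampling rate of at least 24,000 Hz, 16-bit samples, under 15 seconds, numeric names).
import os
import wave

AUDIO_DIR = "wavs"  # hypothetical folder of recordings

for name in sorted(os.listdir(AUDIO_DIR)):
    if not name.endswith(".wav"):
        continue
    if not name[:-4].isdigit():
        print(f"{name}: file name is not numeric")
    try:
        with wave.open(os.path.join(AUDIO_DIR, name), "rb") as wav:
            rate = wav.getframerate()
            bits = wav.getsampwidth() * 8           # bits per sample
            duration = wav.getnframes() / rate      # length in seconds
            if rate < 24000:
                print(f"{name}: sampling rate {rate} Hz is below 24,000 Hz")
            if bits != 16:
                print(f"{name}: {bits}-bit samples instead of 16-bit PCM")
            if duration >= 15:
                print(f"{name}: {duration:.1f} s is not shorter than 15 seconds")
    except wave.Error as err:
        # The wave module only reads uncompressed PCM RIFF files, so this flags other formats.
        print(f"{name}: not an uncompressed PCM .wav file ({err})")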
Create and use your voice model
Great! For a detailed understanding of how to create a custom neural voice using the Speech Studio portal, please check out this video.
Uploading Voice Data
Go to the Speech Studio portal and log in with your Azure account. Make sure you specify the language for the voice you want to build. From the project page, set up a profile for your voice talent: select the use scenario for the voice, upload the voice talent's verbal statement audio, and then upload your training data. A minimum of 300 sentences or phrases is required as training data for custom neural voice; we recommend 2,000 sentences or phrases if your goal is to create a voice for production use. If you do not have transcripts of the audio recordings, or if your recordings are not segmented into individual sentences and phrases, you can use the long audio and audio-only feature.
Review the report provided after uploading your training data and make sure the pronunciation accuracy of your speech data is good and the signal-to-noise ratio is acceptable. The quality of your voice model heavily depends on the quality of the recorded voice used for training: it must have consistent volume, speaking rate, speaking pitch, and expressive mannerisms. We recommend capturing the audio recordings in a professional-quality recording studio so that a higher signal-to-noise ratio is achieved. For more information on how to upload your voice data, please check out this document: Create a Custom Voice - Speech service - Azure Cognitive Services | Microsoft Docs.
Training your voice model
Once you are satisfied with the training data, you can submit it for voice model creation. Select the data sets you want to use for training and associate them with the right voice talent profile you have created. If the voice talent verbal statement doesn't match the voice in the training data, your training request will not be processed. Once the model is successfully trained, review the quality of the test samples. You can also provide your own test script at this step; if you provide a custom test script (up to 100 lines), we'll return 100 default test audios plus 100 custom test audios based on your test script.
Deploying your voice model
Deploy your voice model to get a unique ID for your speech synthesis API endpoint. You can then integrate this voice model into your apps with the text-to-speech SDK, or use it in the audio content creation tool without writing a single line of code. Once a voice model is deployed, you will be charged for endpoint hosting; however, endpoints can be suspended and resumed, so you only pay for hosting when you need your voice. You will also be charged per 1 million characters synthesized. You can find pricing here.
Audio Content Creation
Once your voice has been created and deployed, you can take advantage of our audio content creation tool. With audio content creation, you can fine-tune text-to-speech voices and design customized audio experiences in an efficient and low-cost way.
The tool is based on Speech Synthesis Markup Language (SSML). It allows you to adjust text-to-speech output attributes in real time or batch synthesis, such as voice characters, voice styles, speaking speed, pronunciation, and prosody.
To get started with our Audio Content Creation tool please check out this video.
Integrating your voice into your chosen platform
Your voice can now be accessed through our audio content creation tool or by using the Speech SDK. You can read more about how to integrate your voice using the Speech SDK here.
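As a minimal sketch of what that integration can look like, here is a Python example using the Speech SDK (the azure-cognitiveservices-speech package); the key, region, endpoint ID, voice name, and output file name are placeholders you would replace with your own values.

# synthesize.py - a minimal sketch of using a deployed custom neural voice with the
# Azure Speech SDK for Python (pip install azure-cognitiveservices-speech).
# The key, region, endpoint ID, and voice name below are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_RESOURCE_KEY",
    region="YOUR_SPEECH_RESOURCE_REGION",
)
# The endpoint ID shown on the deployment page in Speech Studio.
speech_config.endpoint_id = "YOUR_CUSTOM_VOICE_ENDPOINT_ID"
# The name you gave your custom neural voice model.
speech_config.speech_synthesis_voice_name = "YourCustomVoiceName"

# Write the output to a .wav file; omit audio_config to play through the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(filename="output.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

result = synthesizer.speak_text_async("Hello! This is my custom neural voice.").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis finished, audio written to output.wav")
else:
    print("Synthesis failed:", result.reason)

Setting the endpoint ID and voice name on the speech configuration is what routes synthesis requests to your deployed custom neural voice rather than a prebuilt voice.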
So that’s it! Now you are ready to get started creating a highly natural branded voice for your apps. For further questions please check out the frequently asked questions below.
Frequently Asked Questions
General
Where can I find the application to apply for access to use Azure Custom Neural Voice?
Where can I find a general overview of Azure Custom Neural Voice?
Where can I find information on getting started with Azure Custom Neural Voice?
Where can I find information on creating and using an Azure Custom Neural Voice?
What are the characteristics and limitations of using an Azure Custom Neural Voice?
Do you have a getting started video on Azure Custom Neural Voice?
Preparing and Recording Data
How do I prepare my training data?
What are some best practices for recording voice samples?
Responsible Use of Custom Neural Voice/Guidelines/Terms
Where can I find the transparency note and use cases for Azure Custom Neural Voice?
Where can I find Microsoft’s general design guidelines for using synthetic voice technology?
Where can I find information about disclosure for voice talent?
Where can I find disclosure information on design guidelines?
Where can I find disclosure information on design patterns?
Where can I find Microsoft’s code of conduct for text-to-speech integrations?
Where can I find information on data, privacy and security for Azure Custom Neural Voice?
Where can I find information on limited access to Azure Custom Neural Voice?
Where can I find licensing resources on Azure Custom Neural Voice?
Language and Region Support
Where do I find what languages are supported by Azure Custom Neural Voice?
Where do I find what regions are supported by Azure Custom Neural Voice?
Audio Content Creation Tool
Where do I learn more about the Audio Content Creation tool in Speech Studio?
Do you have a getting started video on the Audio Content Creation tool?