Try out Custom Neural Voice in 5 minutes with a Lite project

Microsoft

Mar 29, 2022

Custom Neural Voice, a feature of Azure Cognitive Services for Speech, lets you create a highly natural synthetic voice that sounds almost identical to your voice actor. This synthetic voice can then be used in a variety of scenarios, including audiobooks, language learning, news reading, and many more. Since its launch, Custom Neural Voice has empowered organizations such as AT&T, Duolingo, Progressive, and Swisscom to develop branded speech solutions that delight users. (For more details, read the Innovation Stories blog.)

Microsoft has made it simple to train a professional custom neural voice by using a small set of recordings from the target voice (from 300 to 2,000 sentences or short phrases, which is about 30 minutes to 3 hours of speech data). However, the studio recording process takes time, and many customers are looking for an easier way to try the voice customization capability.

Today we are glad to introduce Custom Neural Voice Lite, a new feature in public preview, which enables users to clone their voice by recording just 5 minutes of speech data. This new feature makes it extremely easy for customers to create a synthetic voice that sounds natural.

Custom Neural Voice Lite

Custom Neural Voice (CNV) now supports two project types, Pro and Lite. The Pro version is best for professional scenarios like brand and character voices for chat bots, or audio content reading. Detailed instructions on how to create a professional custom neural voice are available in this blog. The new Lite version is best for producing quick demos or creating personal voice clones.

Due to the sensitivity of the technology, we have limited access to and use of Custom Neural Voice. However, every customer with a valid Azure Speech resource can create CNV Lite voices by recording their own voice for evaluation purposes. After creating a CNV Lite voice, the customer must submit an application describing their use case and gain full access to the Custom Neural Voice capability before they can use the voice for business scenarios.

The following table summarizes the key differences between the CNV Pro and CNV Lite project types.

*Items*	*Lite (Preview)*	*Pro*
Target scenarios	Demonstration or evaluation	Professional scenarios like brand and character voices for chat bots, or audio content reading
Training data	Record online from your own computer using Speech Studio	Bring your own data. Recording in a professional studio is recommended.
Scripts for recording	Provided in Speech Studio	Use your own scripts that match the use case scenario. Microsoft provides example scripts for reference.
Required data size	20-50 utterances	300-2,000 utterances
Training time	Less than 1 compute hour	Approximately 20-40 compute hours
Voice quality	Moderate quality	High quality
Availability	Anyone can record samples online and train a model for demo and evaluation purposes. Full access to Custom Neural Voice is required if you want to deploy the CNV Lite model for business use.	Data upload is not restricted, but you can only train and deploy a CNV Pro model after access is approved. CNV Pro access is limited based on eligibility and usage criteria. Request access on the intake form.
Pricing	Per-unit prices apply equally for both the CNV Lite and CNV Pro projects. Check the pricing details here.	Per-unit prices apply equally for both the CNV Lite and CNV Pro projects. Check the pricing details here.
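
The data sizes quoted in this post imply a rough per-utterance duration, which in turn gives a feel for how much speech 20 to 50 Lite utterances amount to. The following is a small arithmetic sketch, not an official sizing rule; it assumes Lite script sentences are about as long as the Pro ones, which the post does not state.

```python
# Rough arithmetic on the data sizes quoted in this post.

def avg_seconds_per_utterance(total_minutes: float, utterances: int) -> float:
    """Average utterance length implied by a total duration and a count."""
    return total_minutes * 60 / utterances

def estimated_minutes(utterances: int, seconds_each: float) -> float:
    """Total speech time for a number of utterances of a given length."""
    return utterances * seconds_each / 60

# Pro figures from the post: 300 utterances ~ 30 min, 2,000 utterances ~ 3 h.
pro_small = avg_seconds_per_utterance(30, 300)
pro_large = avg_seconds_per_utterance(180, 2000)

# Assumption (not from the post): Lite sentences are about as long as Pro ones.
lite_low = estimated_minutes(20, pro_small)
lite_high = estimated_minutes(50, pro_small)

print(f"Pro: {pro_large:.1f}-{pro_small:.1f} s per utterance")
print(f"Lite: about {lite_low:.0f}-{lite_high:.0f} minutes of speech")
```

Under that assumption, 50 utterances come to roughly 5 minutes of speech, which lines up with the "5 minutes" figure in the introduction.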

To get an idea of what a Lite voice sounds like, check the samples below.

*Language*	*Human recording*	*TTS (CNV Lite)*
English
Chinese

How it works

A Speech service resource is required before you can create a Custom Neural Voice project. If you do not have a Speech resource in Azure, follow these instructions to create one. Make sure you select a region where Custom Neural Voice training is supported: East US, Southeast Asia, or UK South. Select S0 for the pricing tier. Free tiers are not available for Custom Neural Voice.
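
If you create resources from scripts, a small pre-flight check can catch an unsupported region or tier early. The sketch below is illustrative only: `check_speech_resource` is a hypothetical helper, the region identifiers are the Azure short names for the three regions listed above, and the supported list reflects this post rather than a live query against Azure.

```python
# Regions and tier named in this post; the supported list may change over time.
CNV_TRAINING_REGIONS = {"eastus", "southeastasia", "uksouth"}
REQUIRED_TIER = "S0"

def check_speech_resource(region: str, tier: str) -> list[str]:
    """Return a list of problems; an empty list means the settings look fine."""
    problems = []
    # Accept both display names ("East US") and short names ("eastus").
    normalized = region.replace(" ", "").lower()
    if normalized not in CNV_TRAINING_REGIONS:
        problems.append(f"region '{region}' is not listed for CNV training")
    if tier.upper() != REQUIRED_TIER:
        problems.append(f"tier '{tier}' is not {REQUIRED_TIER}")
    return problems

print(check_speech_resource("East US", "S0"))
print(check_speech_resource("westeurope", "F0"))
```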

Creating a Speech resource

To build your CNV Lite voice, go to Speech Studio. Sign in and select the Speech resource you want to use. Then click the ‘Custom Voice’ tile and choose to create a Custom Neural Voice Lite project. CNV Lite currently supports English and Chinese (Mandarin).

Creating a CNV Lite project

Once the project is successfully created, you can start to build your voice. Before you move forward, make sure you read and understand the Voice Talent Terms of Use, and give your consent for Microsoft Speech Studio to collect your voice data (at this step, for evaluation purposes). To protect each user’s voice identity, the Lite project will be removed within 90 days in either of two cases: your company has not had its business use case approved by Microsoft (check the limited access policy), or the voice talent whose data is used for training has not explicitly agreed to the use of their voice to generate synthetic speech beyond evaluation (check the voice talent disclosure requirement).

Once you have accepted the terms of use, you can start to record your voice samples. Read the recording instructions carefully. The quality of your recording data is critical to the training output. Check your environmental noise first, and do not record if noise is detected.

Noise check before voice recording

Tips for recording:

Increase the clarity of your samples by using a high-quality microphone. Speak about 8 inches away from the microphone to avoid mouth noises.
Relax and speak naturally. Allow yourself to express emotions as you read the sentences.
To keep a consistent energy level, record all sentences in one session.
Pronounce each word correctly and speak clearly. After recording each sample, check its quality metric before continuing to the next one.
Although you can create a model with just 20 samples, it's recommended that you record up to 50 to get better quality.

After each sample is recorded, double-check the audio quality before you record the next one. Several metrics, powered by the pronunciation assessment technology, are provided to help you review the quality.

As shown in the screenshot below, mispronunciations are automatically detected in each recording. Make sure each recording is marked green, which indicates accepted quality.

“Clearness” indicates the strength of the speech signal relative to the noise. You get a higher clearness score if the noise level is lower.
“Pronunciation” shows the accuracy of your pronunciation at the sentence level. You should make sure you pronounce each word correctly with no omission or insertion.
“Volume” reflects how loud your voice is in the recording and should be kept stable. Don't speak too far from or too close to your mic. A recording that is too loud or too quiet is not acceptable.

Recording voice samples with quality check
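
To build intuition for the "Clearness" and "Volume" metrics, here is a minimal sketch of two textbook measurements: a signal-to-noise ratio in decibels and an RMS level relative to full scale. This is not how Speech Studio computes its scores (the post does not describe that); the function names and the synthetic test tones are assumptions for illustration.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of samples scaled to the range [-1, 1]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_dbfs(samples):
    """RMS level in dB relative to full scale; silence maps to -inf."""
    level = rms(samples)
    return float("-inf") if level == 0 else 20 * math.log10(level)

def snr_db(speech, noise):
    """Speech level minus noise level, in dB (higher means cleaner audio)."""
    noise_level = rms(noise)
    if noise_level == 0:
        return float("inf")
    speech_level = rms(speech)
    if speech_level == 0:
        return float("-inf")
    return 20 * math.log10(speech_level / noise_level)

def tone(amplitude, freq_hz=200, rate_hz=16000, seconds=1.0):
    """Synthetic sine tone standing in for recorded audio."""
    n = int(rate_hz * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate_hz)
            for i in range(n)]

speech = tone(0.5)    # stand-in for a spoken sentence
noise = tone(0.005)   # stand-in for room noise, 100 times quieter
print(round(snr_db(speech, noise), 1), "dB SNR")
print(round(level_dbfs(speech), 1), "dBFS")
```

A 100:1 amplitude ratio corresponds to 40 dB, so a quieter room raises the ratio in the same way a lower noise level raises the clearness score.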

After you have recorded at least 20 samples and confirmed that their quality is acceptable, you can click the ‘Train model’ button at the bottom of the page to start your voice training. Each training run is estimated to take about 40 minutes. Check the pricing page to get an idea of the cost before you hit ‘Create’.
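
The readiness rule described here (at least 20 samples, all of accepted quality, with up to 50 recommended) can be written down as a simple check. The sketch below is a hypothetical illustration; `RecordedSample` and `training_readiness` are invented names, and Speech Studio applies its own checks in the UI.

```python
from dataclasses import dataclass

MIN_SAMPLES = 20          # minimum stated in this post
RECOMMENDED_SAMPLES = 50  # recommended upper figure stated in this post

@dataclass
class RecordedSample:
    script_id: str
    accepted: bool  # True if the recording passed the quality check

def training_readiness(samples):
    """Return (ready, message) for a list of recorded samples."""
    rejected = [s.script_id for s in samples if not s.accepted]
    if rejected:
        return False, f"re-record {len(rejected)} sample(s): {', '.join(rejected)}"
    if len(samples) < MIN_SAMPLES:
        return False, f"record {MIN_SAMPLES - len(samples)} more sample(s)"
    if len(samples) < RECOMMENDED_SAMPLES:
        return True, f"ready; up to {RECOMMENDED_SAMPLES} samples may improve quality"
    return True, "ready"

samples = [RecordedSample(f"sentence-{i}", True) for i in range(20)]
print(training_readiness(samples))
```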

Once the model is successfully created, you can listen to the sample output for demo and evaluation purposes.

Sample output of a CNV Lite voice model

To deploy your voice model and use it in your applications, you must get full access to Custom Neural Voice and explicit consent from your voice talent. You can submit a request form here. For guidance on applying for Custom Neural Voice, you can watch this short video. Once full access is approved, you can integrate your CNV Lite voice with your apps, or move on to create a CNV Pro project with professional studio recordings for an even more natural voice. Check this blog for the instructions to create a high-quality professional voice.
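
Once a voice is deployed, applications typically address it by name in an SSML request. The following is a minimal sketch of building such a request with Python's standard library. The voice name `MyLiteVoiceNeural` is a hypothetical placeholder, and the resource key and deployment endpoint ID, which the Speech SDK or REST API also need, are not shown.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"

def build_ssml(text: str, voice_name: str, lang: str = "en-US") -> str:
    """Wrap text in a minimal SSML document that selects a voice by name."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )

# Hypothetical voice name; use the name of your own deployed model.
ssml = build_ssml("Hello from my Lite voice & friends.", "MyLiteVoiceNeural")

# Parsing the result confirms that the document is well-formed XML.
ET.fromstring(ssml)
print(ssml)
```

Escaping the text and quoting the attributes keeps characters such as `&` and `<` from breaking the request.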

Learn more

We are excited about the future of Neural TTS with human-like, diverse and delightful quality under the high-level architecture of the XYZ-Code AI framework. Our technology advancements are also guided by Microsoft’s Responsible AI process, and our principles of fairness, inclusiveness, reliability & safety, transparency, privacy & security, and accountability. We put these ethical standards into practice through the Office of Responsible AI (ORA), which sets our rules and governance processes, the AI, Ethics, and Effects in Engineering and Research (Aether) Committee, which advises our leadership on the challenges and opportunities presented by AI innovations, and Responsible AI Strategy in Engineering (RAISE), a team that enables the implementation of Microsoft responsible AI rules across engineering groups.

Besides the Custom Neural Voice capability, you can also select a prebuilt voice from a rich portfolio that offers over 330 neural voice options across 129 languages and variants.

Get started with Azure Neural TTS: