Custom Neural Voice, a feature of Azure Cognitive Services for Speech, is a great way to create a highly natural synthetic voice that sounds almost identical to your voice actor. This synthetic voice can then be used in a variety of scenarios, including audiobooks, language learning, and news reading. Since its launch, Custom Neural Voice has empowered organizations such as AT&T, Duolingo, Progressive, and Swisscom to develop branded speech solutions that delight users. (For more details, read the Innovation Stories blog.)
Microsoft has made it simple to train a professional custom neural voice from a small set of recordings of the target voice (300 to 2,000 sentences or short phrases, which is about 30 minutes to 3 hours of speech data). However, the studio recording process takes time, and many customers are looking for an easier way to try the voice customization capability.
Today we are glad to introduce Custom Neural Voice Lite, a new feature in public preview, which enables users to clone their voice by recording just 5 minutes of speech data. This new feature makes it extremely easy for customers to create a synthetic voice that sounds natural.
Custom Neural Voice Lite
Custom Neural Voice (CNV) now supports two project types, Pro and Lite. The Pro version is best for professional scenarios like brand and character voices for chat bots, or audio content reading; see this blog for detailed instructions on how to create a professional custom neural voice. The new Lite version is best for producing quick demos or creating personal voice clones.
Due to the sensitivity of the technology, we have limited the access and use of Custom Neural Voice. However, every customer with a valid Azure Speech resource can create CNV Lite voices by recording their own voice for evaluation purposes. After creating a CNV Lite voice, the customer must submit an application with their use case to gain full access to the Custom Neural Voice capability before they can use the voice for business scenarios.
The following table summarizes the key differences between the CNV Pro and CNV Lite project types.
| Items | Lite (Preview) | Pro |
| --- | --- | --- |
| Target scenarios | Demonstration or evaluation | Professional scenarios like brand and character voices for chat bots, or audio content reading |
| Training data | Record online from your own computer using Speech Studio | Bring your own data. Recording in a professional studio is recommended. |
| Scripts for recording | Provided in Speech Studio | Use your own scripts that match the use case scenario. Microsoft provides example scripts for reference. |
| Required data size | 20-50 utterances | 300-2,000 utterances |
| Training time | Less than 1 compute hour | Approximately 20-40 compute hours |
| Voice quality | Moderate quality | High quality |
| Availability | Anyone can record samples online and train a model for demo and evaluation purposes. Full access to Custom Neural Voice is required if you want to deploy the CNV Lite model for business use. | Data upload is not restricted, but you can only train and deploy a CNV Pro model after access is approved. CNV Pro access is limited based on eligibility and usage criteria. Request access on the intake form. |
| Pricing | Per unit prices apply equally for both the CNV Lite and CNV Pro projects. Check the pricing details here. | Per unit prices apply equally for both the CNV Lite and CNV Pro projects. Check the pricing details here. |
To get an idea of what a Lite voice sounds like, check the samples below.
| Language | Human recording | TTS (CNV Lite) |
| --- | --- | --- |
| English | [audio sample] | [audio sample] |
| Chinese | [audio sample] | [audio sample] |
How it works
A Speech service resource is required before you can create a Custom Neural Voice project. If you do not have a Speech resource in Azure, follow these instructions to create one. Make sure you select one of the regions where Custom Neural Voice training is supported: East US, Southeast Asia, or UK South. Select S0 for the pricing tier. Free tiers are not available for Custom Neural Voice.
To build your CNV Lite voice, go to Speech Studio and sign in with the correct Speech resource selected. Then click the ‘Custom Voice’ tile and choose to create a Custom Neural Voice Lite project. CNV Lite currently supports English and Chinese (Mandarin).
Once the project is successfully created, you can start to build your voice. Before you move forward, make sure you read and understand the Voice Talent Terms of Use, and provide your agreement for Microsoft Speech Studio to collect your voice data (at this step, for evaluation purposes). To protect each user’s voice identity, the Lite project will be removed within 90 days in either of two cases: your company does not have its business use case approved by Microsoft (check the limited access policy), or the voice talent whose data is used for training does not provide explicit agreement for their voice to be used to generate synthetic speech beyond evaluation (check the voice talent disclosure requirement).
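The region and pricing tier requirements above can be captured in a small pre-flight check. This is only a sketch: the function name `check_speech_resource` is hypothetical, and the lowercase identifiers (`eastus`, `southeastasia`, `uksouth`) are assumed to be the Azure region names that correspond to the three regions listed in this post.

```python
# Regions where this post says Custom Neural Voice training is supported.
# Keys are assumed Azure region identifiers; values are the display names.
SUPPORTED_TRAINING_REGIONS = {
    "eastus": "East US",
    "southeastasia": "Southeast Asia",
    "uksouth": "UK South",
}
REQUIRED_TIER = "S0"  # free tiers are not available for Custom Neural Voice


def check_speech_resource(region, tier):
    """Return a list of problems; an empty list means the resource qualifies."""
    problems = []
    # Accept both "East US" and "eastus" style spellings.
    normalized = region.replace(" ", "").lower()
    if normalized not in SUPPORTED_TRAINING_REGIONS:
        supported = ", ".join(SUPPORTED_TRAINING_REGIONS.values())
        problems.append(f"Region '{region}' is not in the supported list: {supported}.")
    if tier.upper() != REQUIRED_TIER:
        problems.append(f"Pricing tier '{tier}' is not {REQUIRED_TIER}.")
    return problems


for issue in check_speech_resource("West Europe", "F0"):
    print(issue)
```

An empty result, for example from `check_speech_resource("UK South", "S0")`, means the resource meets both requirements named above.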
Once you have accepted the terms of use, you can start to record your voice samples. Read the recording instructions carefully. The quality of your recording data is critical to the training output. Check your environment for noise and do not record if any is detected.
Tips for recording:
- Increase the clarity of your samples by using a high-quality microphone. Speak about 8 inches away from the microphone to avoid mouth noises.
- Relax and speak naturally. Allow yourself to express emotions as you read the sentences.
- To keep a consistent energy level, record all sentences in one session.
- Pronounce each word correctly and speak clearly. After recording each sample, check its quality metric before continuing to the next one.
- Although you can create a model with just 20 samples, it's recommended that you record up to 50 to get better quality.
After each sample is recorded, double-check the audio quality before you record the next one. Several metrics, powered by pronunciation assessment technology, are provided to help you review the quality.
As shown in the screenshot below, mispronunciations are automatically detected in each recording. Make sure each recording is marked green, which indicates acceptable quality.
- “Clearness” indicates the level of the speech signal relative to the noise. You get a higher clearness score if the noise level is lower.
- “Pronunciation” shows the accuracy of your pronunciation at the sentence level. Make sure you pronounce each word correctly, with no omissions or insertions.
- “Volume” reflects how loud your voice is in the recording, which should stay stable. Don't speak too far from or too close to your microphone. A recording that is too loud or too quiet is not acceptable.
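To make the idea behind the clearness metric concrete, the sketch below computes a standard signal-to-noise ratio (SNR) in decibels. This post does not say how Speech Studio calculates its clearness score, so SNR is used here only as a familiar stand-in for "speech signal against the noise". The tone and hum are synthetic data, and the helper names `rms` and `snr_db` are illustrative.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of sample values."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 20 * log10(rms(speech) / rms(noise))."""
    return 20.0 * math.log10(rms(speech) / rms(noise))


# Synthetic one-second signals at 16 kHz: a 200 Hz tone stands in for speech,
# and a 50 Hz hum at two different levels stands in for background noise.
RATE = 16000
speech = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(RATE)]
quiet_room = [0.005 * math.sin(2 * math.pi * 50 * n / RATE) for n in range(RATE)]
noisy_room = [0.05 * math.sin(2 * math.pi * 50 * n / RATE) for n in range(RATE)]

print(round(snr_db(speech, quiet_room), 1))  # about 40 dB
print(round(snr_db(speech, noisy_room), 1))  # about 20 dB
```

The same speech level yields a higher ratio in the quieter room, which matches the behavior described for the clearness score: less noise, higher score.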
After you have recorded at least 20 samples and confirmed that their quality is acceptable, you can click the ‘Train model’ button at the bottom of the page to start your voice training. Each training run is estimated to take about 40 minutes. Check the pricing page to get an idea of the cost before you hit ‘Create’.
Once the model is successfully created, you can listen to the sample output for demo and evaluation purposes.
To deploy your voice model and use it in your applications, you must get full access to Custom Neural Voice and explicit consent from your voice talent. You can submit a request form here. For guidance on applying for Custom Neural Voice, you can watch this short video. Once full access is approved, you can integrate your CNV Lite voice with your apps, or move on to create a CNV Pro project with professional studio recordings for an even more natural voice. Check this blog for the instructions to create a high-quality professional voice.
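The conditions for starting training (at least 20 samples, all of acceptable quality, and up to 50 recommended) can be summarized as a simple checklist. The sketch below is illustrative only: `RecordedSample` and `training_readiness` are hypothetical names, and the `accepted` flag is assumed to mirror a sample being shown as green in Speech Studio.

```python
from dataclasses import dataclass

MIN_SAMPLES = 20          # minimum needed to create a CNV Lite model
RECOMMENDED_SAMPLES = 50  # recommended count for better quality


@dataclass
class RecordedSample:
    script_id: str
    accepted: bool  # assumed to mean "shown as green" in Speech Studio


def training_readiness(samples):
    """Describe whether a set of recorded samples is ready for training."""
    if len(samples) < MIN_SAMPLES:
        return f"Record {MIN_SAMPLES - len(samples)} more sample(s) before training."
    rejected = [s.script_id for s in samples if not s.accepted]
    if rejected:
        return "Re-record these samples first: " + ", ".join(rejected)
    if len(samples) < RECOMMENDED_SAMPLES:
        return "Ready to train; recording up to 50 samples is recommended for better quality."
    return "Ready to train."


samples = [RecordedSample(f"script-{i:02d}", accepted=True) for i in range(1, 21)]
print(training_readiness(samples))
```

With exactly 20 accepted samples, the function reports that training can start while pointing out the recommendation to record more.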
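Once a voice is deployed, one way an app might call it is through the Speech service text-to-speech REST API with an SSML body. The sketch below only builds the request and does not send it. The endpoint pattern and header names reflect our understanding of that REST API and should be checked against the current documentation. The voice name, deployment ID, key, and `User-Agent` value are placeholders, and the helper names are hypothetical.

```python
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice_name, lang="en-US"):
    """Wrap plain text in a minimal SSML document that selects one voice."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )


def build_synthesis_request(region, deployment_id, subscription_key, ssml):
    """Prepare (but do not send) a POST request to a custom voice endpoint."""
    query = urllib.parse.urlencode({"deploymentId": deployment_id})
    # Assumed endpoint pattern for deployed custom voices.
    url = f"https://{region}.voice.speech.microsoft.com/cognitiveservices/v1?{query}"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "cnv-lite-demo",
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


# Placeholder values; replace them with your own resource and deployment details.
ssml = build_ssml("Hello & welcome to my voice demo.", "MyLiteVoiceNeural")
request = build_synthesis_request(
    "eastus", "00000000-0000-0000-0000-000000000000", "YOUR_SPEECH_KEY", ssml
)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it, and the response body would be audio in the requested output format. That call is expected to succeed only after the voice has been deployed, which requires the full access described above.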
Learn more
We are excited about the future of Neural TTS with human-like, diverse, and delightful quality under the high-level architecture of the XYZ-Code AI framework. Our technology advancements are also guided by Microsoft’s Responsible AI process, and our principles of fairness, inclusiveness, reliability & safety, transparency, privacy & security, and accountability. We put these ethical standards into practice through the Office of Responsible AI (ORA), which sets our rules and governance processes; the AI, Ethics, and Effects in Engineering and Research (Aether) Committee, which advises our leadership on the challenges and opportunities presented by AI innovations; and Responsible AI Strategy in Engineering (RAISE), a team that enables the implementation of Microsoft responsible AI rules across engineering groups.
Besides the Custom Neural Voice capability, you can also select a prebuilt voice from a rich portfolio that offers over 330 neural voice options across 129 languages and variants.
Get started with Azure Neural TTS:
- Try the demo
- See our documentation
- Check out our sample code
Learn more about Responsible Use of Custom Neural Voice/Guidelines/Terms
Where can I find the transparency note and use cases for Azure Custom Neural Voice?
Where can I find Microsoft’s general design guidelines for using synthetic voice technology?
Where can I find information about disclosure for voice talent?
Where can I find disclosure information on design guidelines?
Where can I find disclosure information on design patterns?
Where can I find Microsoft’s code of conduct for text-to-speech integrations?
Where can I find information on data, privacy and security for Azure Custom Neural Voice?
Where can I find information on limited access to Azure Custom Neural Voice?
Where can I find licensing resources on Azure Custom Neural Voice?