Learnings from a Custom Neural Voice Proof of Concept

Microsoft

Aug 22, 2022

Custom Neural Voice (CNV), a Speech capability of Cognitive Services on Azure, allows for the creation of a highly realistic humanlike voice that can convert text input into speech. This can be used to personalize a customer experience or enhance a brand image with a custom persona, all while enabling localization and accessibility through multiple languages. It’s an incredible technical achievement and one whose potential use cases are only beginning to be revealed.

This blog documents some of the learnings that a Microsoft Technical Account team had in working through a proof of concept (PoC) with a large media content provider on CNV. It is not a technical deep dive on custom neural voice models and their discrete and growing capabilities. Instead, it focuses on sharing practitioner learnings from the process of creating a neural voice, to enable successful proofs of concept.

As additional context, the customer in question was intrigued by the idea of being able to use a highly realistic human sounding voice for a variety of use cases. One in particular was to mimic a human voice for sponsorship announcements during various events. Typically, voice actors would be recruited to read out sponsorships well in advance of the event being sponsored. This would include drafting up specific transcripts, recruiting a voice actor to record the transcript in a professional studio, and finally playing it for the intended event. Operationally, this could take weeks to orchestrate! The synthetic voice of a voice actor, however, can significantly shorten the production cycle and enable a faster turnaround. This greatly appealed to the customer, and by its conclusion the PoC had clearly demonstrated how a neural voice could deliver significant cost and efficiency gains while preserving brand identity.

Embrace your Custom Neural Voice (Pro)

CNV offers a Custom Neural Voice Lite feature which enables anyone to quickly test the capabilities with just 20-50 training samples in under 20 minutes of training time! It’s a fantastic offering to get comfortable with the process and the nuances of what’s required. But to get the most out of CNV, you’ll quickly want to create and deploy your own professional voice. Why? In short, there is a marked difference in the quality of the voice between the Lite and Pro versions because of the longer training time (20 minutes vs. nearly a day) and the larger number of training samples required for a Pro voice. More importantly, creating your own Pro voice pays off hugely from a learning standpoint and builds conviction in the technology. Deployment is also a more realistic next step with the Pro version, which allows one to take advantage of additional tooling and customization options.

Here are a few steps and key learnings:

Understand the approval process for your professional voice and the requirements of Responsible AI. The key first step in this process is to ensure you get approved for the intended use case by completing this form. For a proof of concept, this can seem burdensome. But it is important to recognize the power of the capability that is being unlocked. Imagine a proof of concept to create a neural voice of a prominent personality. Imagine then if this neural voice is misused to make false or misleading statements. You can begin to understand how important it is to define a use case and ensure a thoughtful and transparent process is in place, both on the Microsoft as well as the customer side. It is critical we be responsible with this technology, so start with yourself and ensure you are clear with why you are applying for this capability. It is key to also review the Limited access features for Cognitive Services at this link.
Lock down your persona, select your transcripts and schedule time.
- Personas. I was thrown off initially by the ask for a “persona definition”. However, later I realized how important this is for priming the kind of voice you are training. Personas are subjective descriptions and are relative classifications between any two voices. However, they are key to later interpreting the voice and evaluating it. While I may opt for a “generic conversational voice”, even that has subtleties that I may not fully capture or be aware of when I am in a conversation as opposed to when I’m reading a script and recording a voice.
- Transcripts. Microsoft offers a range of general, chat and customer service transcripts in multiple languages as a starting point. However, if you’re going for a persona tied to a certain scenario, it is ideal to model transcripts on that domain. This provides better training data to achieve a more realistic model for that particular outcome. Do note that while the product documentation gives guidelines on the number of training scripts (a minimum of 300, up to 2,000), you will want to record a little more than whatever number you’re targeting, since during upload the system may filter out poor quality inputs and reduce the overall total.
- Schedule time. I found that for a +300-script workload, I had to spend 2-3+ hours. This not only included recording time (most recordings were less than 10 seconds) but also some of the operational tasks around labeling, file movement and writing scripts to automate some of the formatting requirements before uploading into Speech Studio.
While challenging, try to stay in persona. This is a subtle point, but because of it, I have a whole new respect for voice actors! While it’s easy to maintain a persona in the first 20-30% of recordings, it’s a lot harder to preserve this for a sustained period. I found myself speaking a lot less naturally than I normally do, mostly because of "recording fatigue" and wanting to get through the recordings faster. This, of course, degraded the sound of the persona I was striving for. Why does that matter? You get what you train for. Sub-par inputs will result in a sub-par model. Unless you are intentional about sounding like the persona throughout, this can show up as a gap in the final evaluation. If using your own voice, don’t sweat it too much, but be mindful that this can impact final results. This is also why voice actors spend many hours in studios, with coaches and support: specifically to ensure higher quality recording inputs.
Leverage scripts and helper functions to automate preparatory tasks. While the product team has built a fantastic low-code experience to load, train and deploy models with the Custom Neural Voice Portal, there are still a number of preparatory tasks to get through in order to prep data for loading and training. This will vary depending upon one’s circumstance. Since I do not have a professional studio with a professional suite of tools, I found crafting several Python scripts helped the workflow considerably. These included scripts to rename files, convert from MP4 to WAV, zip files and create the consolidated transcript to pair the audio files with the right transcripts. A repository with some of these sample scripts is available here.
Leverage post-processing Studio feedback to improve training samples. With the heavy lifting of prepping files out of the way, the Studio experience was very intuitive to work through. Aside from no-brainer prompts, one major call-out of the Studio experience was the analytics overview presented after a successful post-processing data upload (see snapshot below). This provided feedback on pronunciation, signal to noise ratio (SNR) and analytics on the duration of the scripts. It also provided some great warnings and indications of how to improve the diversity of the scripts for a better model. For example, a common issue was that my scripts did not have enough “exclamation utterances”. While this will not block the process, these are subtle warnings that, when addressed, can improve the final model. Also note below that of 317 utterances imported, only 313 qualified and were loaded, for a total of around 22 minutes of audio. Hence, depending on your recording software and training approach, you may have to record more than your targeted number of utterances, so do factor this in.
Defaults and deployment. Once the model is processed, there are about 100 default statements that can be used to hear the voice and judge the authenticity. These offer a range of pre-selected statements that are different to the training samples. This is also when you have your ‘eureka’ moment at how amazing the model is, but also start to build more intuition for how you would improve it. From here, one can deploy the model in the CNV portal, which opens up a REST endpoint that can be programmatically accessed through your programming language of choice. One can also customize audio outputs through the Audio Content Creation tool which is a low-code environment to feed text, save or export outputs, while tweaking intonation, pronunciation etc.
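To make the preparatory scripting mentioned above a little more concrete, here is a minimal sketch of the renaming, transcript-pairing and zipping steps. It is not the code from the sample repository; the function name `prepare_dataset`, the 10-digit IDs, and the file names `transcript.txt` and `audio.zip` are my own illustrative choices, and it assumes a tab-separated transcript with one "ID, tab, text" line per audio file, so do check the current product documentation for the exact format before uploading. Conversion from MP4 to WAV is left out since it depends on an external tool.

```python
import shutil
import zipfile
from pathlib import Path


def prepare_dataset(pairs, out_dir):
    """Rename recordings to sequential IDs, write a transcript, zip the audio.

    pairs: list of (path_to_wav, transcript_text), in the desired order.
    Returns (transcript_path, zip_path). All names here are hypothetical.
    """
    out_dir = Path(out_dir)
    audio_dir = out_dir / "audio"
    audio_dir.mkdir(parents=True, exist_ok=True)

    lines = []
    for index, (wav_path, text) in enumerate(pairs, start=1):
        utterance_id = f"{index:010d}"  # e.g. 0000000001
        # Copy rather than move, so the original recordings stay untouched.
        shutil.copyfile(wav_path, audio_dir / f"{utterance_id}.wav")
        # Collapse tabs/newlines so each utterance stays on one line.
        clean_text = " ".join(text.split())
        lines.append(f"{utterance_id}\t{clean_text}")

    # Assumed layout: one "<id>\t<text>" line per audio file.
    transcript_path = out_dir / "transcript.txt"
    transcript_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Zip only the renamed WAV files, with flat names inside the archive.
    zip_path = out_dir / "audio.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for wav in sorted(audio_dir.glob("*.wav")):
            archive.write(wav, arcname=wav.name)

    return transcript_path, zip_path
```

Calling `prepare_dataset([("take_07.wav", "Hello there."), ...], "upload")` would leave a transcript and a zip file in the `upload` folder, ready for a manual check before loading into the portal.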

As an example of the created Pro voice, listen to the sample below summarize the above steps!

Help the customer embrace a Custom Neural Voice

With the workflow of a CNV Pro voice under your belt (and the first one is always the hardest!), it becomes easier to walk a customer through the process with conviction. For this customer, additional variables included getting the voice actor in a professional studio to perform the recordings, figuring out how much technical and coaching support they would need and deciding on how much to support from a data preparatory standpoint on both sides. In an ideal case, the customer can self-manage the transcript generation, recording process and related overhead of scheduling, formatting etc. with minimal oversight. However, given this is a relatively newer use case/process, more handholding was required.

Below are a few additional learnings:

As mentioned before, ensure the customer understands the approval process for working with a professional voice actor. Again, this is a critical step to work through and reaffirms Responsible AI practices to ensure use cases align with proper guidelines. What is equally important is that with a professional voice actor, it is critical to make the customer aware of the voice talent disclosure which covers learnings and approaches to brokering this use case.
Using a professional voice actor is HIGHLY recommended. Voice actors are specifically trained to leverage their voice to project, pronounce and intonate for specific personas and use cases. Their work can involve hours of recording time just to get a script and character perfect. For CNV training, this is key to train a specific persona. During the customer engagement, it was tempting to just have one of the project team members use their voice for the recordings as a way of bootstrapping the process faster. We ended up leveraging a professional voice actor, and in hindsight we recognized how important this step was. Most individuals, like me, are not trained professionally and will likely experience a steady degradation in the quality of the recordings as fatigue sets in and the desire to finish faster heightens. This unfortunately can result in lower quality training samples, and the resulting voice may not sound as natural as expected, since in normal conversation one tends to have more energy, infrequent pauses, specific emphases, etc. than what can be captured reading scripts back-to-back. Voice actors recognize this and treat each script with focus and energy. Moreover, with more voice actors working remotely and having professional studio setups at home, there was an unexpected benefit: the recording could be broken up over multiple sessions as opposed to one continuous recording session.
Getting coaching support for the initial set of recordings is key. In the initial recording session, the product team arranged to have an expert sound engineer provide feedback and coaching to the voice actor. In traditional recording sessions, similar feedback mechanisms exist with multiple individuals/coaches providing feedback to a voice actor in real-time. This provided immediate feedback to our voice actor on his persona, the script flow, and some of the operational pieces allowing him to get into a groove faster. One operational request we insisted upon was ensuring that the voice actor recorded discrete audio samples per script or utterance. As per the product documentation, an utterance should roughly equate to one sentence. This was important since recordings often get merged into a single audio file on completion and this needs to be separated for the CNV model training stage.
Train as specific to the use case as possible. Though deep learning advancements continue to evolve at a rapid pace, it is still sensible to heed the guidance to train specifically for the intended use case. While there is a level of generalizability to create a voice that can work for multiple use cases, CNV models still require good data to train from and cannot be expected to say or sound like “anything” off a general training. This can often be a customer expectation as they weigh the time and expense of recording the initial training run. However, for best results, while keeping a diversity of inputs, do train as specifically as possible to embody the characteristics of how the voice will be used in its natural environment. For the customer, since we had a specific use case with an identified persona, but were less stringent on the specific statements, we combined a mix of customer scripts with our general and chat transcripts available at the Microsoft repository. (In this context, it is worth noting that the product team is actively working on a feature which allows a CNV model to speak in multiple styles and express different emotions without any additional training data. To some extent, this will enhance the variety and utility of the created voice, without needing to train specifically for those emotions or styles.)
Data preparation tips pre-model loading. Once high-quality recordings were done (note chart below for the analytics on the upload), the next step was to ensure the recordings paired with the right transcript and were appropriately normalized. A couple of salient points include:
- Ensure the highest sample rate for the recordings. In this case, we had a choice between 16 kHz and 24 kHz recordings. We opted for the latter given 16 kHz is not recommended for CNV. The higher the better.
- Pair the recordings to the right transcript. In a formal recording scenario, the voice actor would focus on voicing the scripts, and others would manage the actual recording and background tasks to pair the right transcript with the right audio. With this customer, we asked the voice actor to continue to focus on just recording in his professional home studio and to offload the background tasks to us. This resulted in a final batch of audio files that were labelled differently, hard to pair with their original transcripts and with no guarantee that the original statement had been recorded exactly. Moreover, the batch included duplicate files, blanks, re-recordings with a few words changed, etc. For example, though we provided a +600-transcript list, we ended up with +750 recordings returned. To filter through this, several key steps helped:
  - Matching the audio file to the right transcription. Since we could not start from the original transcript order, and the audio files were labeled differently, we ran the audio files through the batch transcription service of Speech Service. For +750 recordings, this came back under 10 minutes with several textual forms represented. The closest to normalized guidance is the lexical format, where words are clearly spelt out for numbers, special characters, etc. For more detail on the code to trigger batch transcriptions, see here.
  - Use a scripting language to automate data cleaning and formatting. Two operational aspects required some level of automation:
    - Data Cleansing. This included removing empty audio files (for example, those with no lexical or display values from the above batch transcription), weeding out duplicate files/transcripts and identifying multi-sentence scripts based on counts of periods or question marks. For more details, refer to this code here.
    - Normalization. This was partially solved through some level of scripting. Many cases still required manual intervention to normalize the text as per the documentation guidelines. For example, normalizing “BTW” to “by the way”, acronyms like “ABC” to “A B C”, writing out numbers (“1” to “one”), special characters (“%” to “percent”), etc. This is also where choosing the lexical transcription provides an advantage.
Cross lingual models aid personalization to new markets. Cross lingual models (which are still in preview) allow you to trigger custom neural models in other languages from the primary language (in this case, English). The customer loved this feature since it made it possible to retain the voice persona but have it speak multiple languages. This required less training time than the primary English model and helped the customer imagine how a consistent brand could extend to non-English markets. A novelty was that for a cross-lingual model, one can feed both English and foreign language inputs, and it would recognize both! In fact, if speaking English, it would do it with a foreign accent! Amazing!
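As a rough illustration of the data cleansing and normalization steps described above, the sketch below works on records that each hold a file name plus the lexical and display text returned by transcription. It is not the code referenced in this post: the record layout, the function names and the tiny abbreviation table are all assumptions made for the example, and the rules only cover the simple cases mentioned (empty files, duplicates, multi-sentence scripts, “BTW”, spelled-out acronyms, single digits and “%”). Anything beyond that would still need the manual review noted earlier.

```python
import re

# Illustrative lookup tables only; a real project would need far larger ones.
ABBREVIATIONS = {"BTW": "by the way"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def normalize_text(text):
    """Apply a few simple normalization rules to one transcript line."""
    text = text.replace("%", " percent")

    def expand(match):
        word = match.group(0)
        # Known abbreviation, otherwise spell the acronym out: "ABC" -> "A B C".
        return ABBREVIATIONS.get(word, " ".join(word))

    text = re.sub(r"\b[A-Z]{2,}\b", expand, text)
    # Only standalone single digits; longer numbers are left for manual review.
    text = re.sub(r"\b[0-9]\b", lambda m: DIGITS[int(m.group(0))], text)
    return " ".join(text.split())


def clean_records(records):
    """Split records into (kept, flagged_multi_sentence), dropping the rest.

    Each record is assumed to be a dict with "file", "lexical" and "display".
    """
    kept, flagged, seen = [], [], set()
    for record in records:
        lexical = record.get("lexical", "").strip()
        display = record.get("display", "").strip()
        if not lexical and not display:
            continue  # empty audio: nothing was transcribed
        key = (lexical or display).lower()
        if key in seen:
            continue  # duplicate of a transcript already seen
        seen.add(key)
        # Crude heuristic: abbreviations or decimals would also be counted.
        sentence_marks = len(re.findall(r"[.?]", display))
        (flagged if sentence_marks > 1 else kept).append(record)
    return kept, flagged
```

Flagging multi-sentence scripts rather than deleting them is a deliberate choice here: it leaves the decision to split or discard a recording with a person.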

Here's another recap with a cross-lingual twist!

Moment of truth… evaluating the synthetic voice

It is somewhat ironic that for all the technical rigor of deep learning that goes into creating synthetic voices, the decision around whether a synthetic voice sounds “good enough” is still a "human call". Ultimately, the customer must have conviction that the created voice sounds legitimate for its intended purpose. However, in the process, several factors helped guide the evaluation:

Number of training samples. In the customer scenario, we leveraged +650 discrete recording samples with a variety of sentence structures to train the voice. This resulted in a high-quality voice, though given the persona (which was geared more to an “announcer style voice”) it did sound slightly unnatural for long-form sentences. As per the product documentation, training can go up to 2,000 discrete samples for a professional voice in a professional recording environment. More data is likely to improve the voice, though only up to a certain point.
Training for the specific use case. Specificity matters when creating a synthetic voice. In many instances, customers may just want to experience the technology and/or need a more generalized voice to support a wide range of scenarios. In the latter case, a pre-built Azure voice may be ideal since it just needs to sound humanlike without requiring any specific persona. However, the more the voice is trained for the intended use case, the higher the likelihood of a better result. The converse is also true: we cannot expect a voice to do more generic things well when it was trained for a specific purpose or use case.
Maintaining the persona is key through the recording process. As mentioned, feedback, coaching and ensuring consistency in recording matter. In this scenario, breaking up the recording session over multiple days and having feedback early in the process reduced recording fatigue and created higher quality inputs. This parallels the many hours teams spend in recording studios to get a script sounding “just right”. We cannot understate the importance of this step, though getting a synthetic voice right can likely reduce the repetitiveness of this process.
What is realistic is relative to its environment. Consider the environment in which the voice will be used. If it is accompanied by a lot of noise or extra music, or if a typical listener who has not been primed that “you are listening to a synthetic voice” would find it hard to pick out the synthetic voice, this could also be key in determining whether a voice sounds authentic. Given the stage of the PoC, we did not have the opportunity to test this in a real environment, but this also should be a factor in evaluation.
A mock “Turing test” could quantitatively guide evaluation. One of the internal exercises the team performed was to have three participants, in two consecutive rounds, grade 10 samples of recordings and label each one human or synthetic. We called this a mock “Turing test”. This produced interesting results. As the table below demonstrates, the ability to distinguish a human from a neural voice is not always consistent. Certain phrases, how certain words are articulated, the pauses in between words, background noise, etc. all add variance and don’t necessarily lend themselves to a consistent evaluation. The call-out here is that when you have multiple customer participants, this can be an exercise to crowdsource opinions and remove the biases of any individual in the evaluation process. Note that the test below was run on the CNV Pro voice created by the Microsoft account team (not the customer CNV voice). This voice was trained with a bare minimum of ~300 discrete inputs, in a sub-optimal recording environment (see SNR in first chart), with a non-professional voice actor (myself!) who started getting recording fatigue after just 50 samples! However, despite those factors, the resulting neural voice was believable enough to produce the variability seen in the participant results below.

To interpret the results below, there are three categories – “accurate”, “human, but actually neural” and “neural, but actually human”. Based on each recorded sample, each participant judged whether the audio recording was of a human or a neural voice. If guessed correctly, their guess would be marked "accurate". If not, they would fall into one of the two remaining categories. The total score for each participant in each round should equal 100%.

Participant	First round	Second round
Participant 1	Accuracy = 100%; Human, but actually neural = 0%; Neural, but actually human = 0%	Accuracy = 50%; Human, but actually neural = 10%; Neural, but actually human = 40%
Participant 2	Accuracy = 60%; Human, but actually neural = 40%; Neural, but actually human = 0%	Accuracy = 60%; Human, but actually neural = 0%; Neural, but actually human = 40%
Participant 3	Accuracy = 70%; Human, but actually neural = 30%; Neural, but actually human = 0%	Accuracy = 80%; Human, but actually neural = 10%; Neural, but actually human = 10%
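For anyone wanting to run a similar exercise, the three percentages per participant and round can be tallied with a few lines of code. The sketch below is only an illustration of the scoring rule described above; the function name `score_round`, the label strings and the sample data are made up, and it assumes each participant's guesses are listed in the same order as the true labels.

```python
from collections import Counter

LABELS = {"human", "neural"}
CATEGORIES = ("accurate", "human_but_actually_neural", "neural_but_actually_human")


def score_round(truth, guesses):
    """Return the percentage of guesses falling into each category.

    truth, guesses: equal-length lists of "human" / "neural" labels.
    """
    if not truth or len(truth) != len(guesses):
        raise ValueError("truth and guesses must be non-empty and equal in length")
    if not set(truth) <= LABELS or not set(guesses) <= LABELS:
        raise ValueError("labels must be 'human' or 'neural'")

    counts = Counter()
    for actual, guess in zip(truth, guesses):
        if guess == actual:
            counts["accurate"] += 1
        elif actual == "neural":
            counts["human_but_actually_neural"] += 1  # guessed human, was neural
        else:
            counts["neural_but_actually_human"] += 1  # guessed neural, was human

    total = len(truth)
    # The three percentages always sum to 100 for a round.
    return {name: 100 * counts[name] / total for name in CATEGORIES}


# Made-up example data for one participant in one round of 10 samples.
truth = ["human"] * 5 + ["neural"] * 5
guesses = ["neural"] * 4 + ["human"] * 2 + ["neural"] * 4
print(score_round(truth, guesses))
```

Keeping the two kinds of mistakes separate, rather than reporting accuracy alone, is what makes it possible to see whether participants lean toward hearing everything as human or everything as synthetic.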