Today at the Ignite 2023 conference, Microsoft is taking customization one step further with its new 'Personal Voice' feature. It is designed to let customers build apps in which users can easily create their own AI voice, resulting in a fully personalized voice experience.
Custom neural voice, an existing capability of Azure AI Speech, enables businesses to build a unique and more natural voice experience for their brands. It has helped companies like Vodafone, Swisscom, Progressive and Duolingo deliver more engaging digital interactions to their customers and create a stronger emotional connection to their brand.
Now, with custom neural voice’s newest feature, personal voice, users can have AI replicate their voice in a few seconds by providing a 1-minute speech sample as the audio prompt, and then use it to generate speech in any of the 100 languages supported.
The new feature offers three main benefits:
- Extremely small samples: Preparing training samples for an AI voice can be difficult or costly. With personal voice, users can create a voice that sounds just like them from a voice sample as short as 60 seconds.
- Express: Personal voice greatly reduces waiting time, allowing users to create a voice in seconds.
- Multilingual and global reach: With an audio prompt in one language, the voice created can be used in 100 languages and variants, reaching a global audience.
This feature opens doors for personalized and intelligent voice experiences.
- Voice assistant: Create a personalized voice assistant experience. Instead of relying on a generic voice, users can now use their own voice for a truly unique experience.
- Gaming: Enable an immersive experience for gamers, allowing them to fully embody their characters with their own unique voices.
- Language dubbing: Expand your global reach and dub your content into 100 languages with the speakers' native voice for a seamless and enjoyable experience for your viewers, no matter where they are located.
- Media & entertainment: Create easy-to-use personal voices for stories, audio books, podcasts, videos, and more, making the content more relatable and immersive than ever before.
- Speech translation: Break down language barriers and improve communication. Allow conversation participants to be heard in various languages with their true voice.
Haier, a leading IoT smart living brand in China, has been working closely with Microsoft to bring branded and personalized voice experiences to its smart speakers and other intelligent home appliances, setting an example of how personal voice can differentiate a brand in the market through an innovative user experience.
“Over the past two years, we have closely collaborated with Microsoft in speech synthesis, language understanding, and other aspects. In 2022, Haier Smart Home, jointly with Microsoft China, released China's first "Home Brain Whitepaper" and launched intelligent speakers and home appliances products with Haier's brand voice empowered by Microsoft's Azure AI text to speech and custom neural voice capabilities. Recently, Microsoft text to speech's personal voice feature allows our users to replicate their own voice in real-time with only a minimal amount of data. This allows users to control and use home appliances by listening to their family's voice, making the intelligent speakers more family-like and life-enriching.”
----Wei Qiu Deng, the Vice President of Haier Smart Home, General Manager of Whole House Intelligence
Try the samples
Check out the samples below to hear how the voices created sound, using just a short sample as the audio prompt.
| Speaker | Human voice sample | Output sample |
|---|---|---|
| Female 1 | English | English, Chinese, French |
| Female 2 | English | English, Chinese, Spanish |
| Male 1 | English | English, German, Chinese |
| Male 2 | Chinese | Chinese, English, German |
Try with your own data
In Speech Studio, click on the ‘Personal Voice’ card.
All users can try the prebuilt samples, while the “try your own voice” feature is available upon request. If you do not already have permission to access the demo, click the ‘Request demo access’ button to submit the request.
After gaining permission, you can record your own voice and try the voice output samples in different languages.
Check the video below for a demonstration.
Responsibly integrate personal voice into your apps
With the personal voice feature, every voice must be created with the explicit consent of the user. The user must provide a recorded statement acknowledging that the customer will create and use their voice. Based on the speaker's verbal statement file and the audio prompt (a clean human voice sample longer than 60 seconds), personal voice creates a voice profile that encodes the user's voice characteristics, which is then used to generate synthesized audio from the text input provided. The voice created can generate speech in any of the 100 supported languages. No locale tag is required in the request, because the language is detected automatically at the sentence level.
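The length requirement on the audio prompt can be checked on the client before anything is uploaded. The following is a minimal sketch, assuming the prompt is recorded as an uncompressed PCM WAV file; the function names and the `MIN_PROMPT_SECONDS` constant are hypothetical and not part of the Azure AI Speech SDK. It checks duration only, not whether the recording is clean.

```python
import io
import wave

# From the article: the audio prompt should be longer than 60 seconds.
MIN_PROMPT_SECONDS = 60.0


def wav_duration_seconds(wav_file) -> float:
    """Return the duration of a PCM WAV file (path or file-like object) in seconds."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(wav_file, minimum: float = MIN_PROMPT_SECONDS) -> bool:
    """True when the recording is strictly longer than the minimum duration."""
    return wav_duration_seconds(wav_file) > minimum


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, used only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(is_long_enough(make_silent_wav(61)))  # a 61-second prompt passes
    print(is_long_enough(make_silent_wav(30)))  # a 30-second prompt does not
```

In an app, `is_long_enough` would typically run right after recording, so the user can be asked to re-record before the consent statement and prompt are submitted.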
Below is a sample request:
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
  <voice xml:lang='en-US' xml:gender='Male' name='PhoenixV2Neural'>
    <mstts:ttsembedding speakerProfileId='your speaker profile ID here'>
      I'm happy to hear that you find me amazing and that I have made your trip planning easier and more fun.
      我很高兴听到你觉得我很了不起,我让你的旅行计划更轻松、更有趣。
      Je suis heureux d’apprendre que vous me trouvez incroyable et que j’ai rendu la planification de votre voyage plus facile et plus amusante.
    </mstts:ttsembedding>
  </voice>
</speak>
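When the text comes from application data rather than a fixed string, it helps to build the SSML programmatically so that characters such as `&` and `<` are escaped. The sketch below mirrors the element and attribute names of the sample request above (`voice`, `mstts:ttsembedding`, `speakerProfileId`, and the `PhoenixV2Neural` voice name); the helper function `build_personal_voice_ssml` itself is hypothetical and only illustrates one way to assemble the document.

```python
from xml.sax.saxutils import escape, quoteattr


def build_personal_voice_ssml(
    speaker_profile_id: str,
    text: str,
    voice_name: str = "PhoenixV2Neural",  # voice name taken from the sample request
    lang: str = "en-US",
    gender: str = "Male",
) -> str:
    """Return an SSML document shaped like the article's sample request."""
    return (
        "<speak version='1.0' "
        "xmlns='http://www.w3.org/2001/10/synthesis' "
        "xmlns:mstts='http://www.w3.org/2001/mstts' "
        f"xml:lang={quoteattr(lang)}>"
        f"<voice xml:lang={quoteattr(lang)} xml:gender={quoteattr(gender)} "
        f"name={quoteattr(voice_name)}>"
        # quoteattr adds the surrounding quotes and escapes the attribute value.
        f"<mstts:ttsembedding speakerProfileId={quoteattr(speaker_profile_id)}>"
        # escape protects &, < and > in the text to be spoken.
        f"{escape(text)}"
        "</mstts:ttsembedding></voice></speak>"
    )


if __name__ == "__main__":
    print(build_personal_voice_ssml("your speaker profile ID here", "Hello & welcome."))
```

Because the language is detected per sentence, the `text` argument can mix languages in a single request, as the sample does.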
As part of Microsoft's commitment to responsible AI, personal voice is designed to protect the rights of individuals and society, foster transparent human-computer interaction, and counteract the proliferation of harmful deepfakes and misleading content. For this reason, personal voice is a Limited Access feature available by registration only, and only for certain use cases. To access the API and use the feature in your business applications, register your use case here and apply for access.
Eligible customers can integrate the personal voice API with their applications supporting personal voices for the following use cases only:
- In applications where voice output is constrained and defined by customers who meet Limited Access eligibility criteria, and where the voice does not read user-generated or open-ended content. Voice model usage must remain within the application and output must not be publishable or shareable from the application. Some examples of applications that fit this description are voice assistants in smart devices and customizing a character voice in gaming.
- Dubbing for films, TV, video, and audio for entertainment scenarios only, where customers who meet Limited Access eligibility criteria maintain sole control over the creation of, access to, and use of the voice models and their output.
Contact mstts[at]microsoft.com if your use case is not covered in the above categories.
Customers must comply with the Guidelines for responsible deployment of synthetic voice technology and the code of conduct when using the service.
Watermarks are added to voices created with the personal voice feature. Watermarks allow customers and users to identify whether speech was synthesized using Azure AI Speech and, specifically, which voice was used. Eligible customers can use Azure AI Speech watermark detection capabilities. To request watermark detection for your applications, please contact mstts[at]microsoft.com.
Get started
Starting December 1, 2023, you can try the personal voice feature in Speech Studio and through the API with your own data. The feature is currently available only in West Europe, East US, and Southeast Asia. Create a Speech or Cognitive Services standard resource in one of these three regions to access the feature.
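Since the feature is limited to three regions, an application can fail fast when it is configured with a resource elsewhere. The sketch below is illustrative only: it assumes the standard Azure AI Speech text to speech REST endpoint and headers also serve personal voice requests, and it assumes the region identifiers `westeurope`, `eastus`, and `southeastasia`. The function names are hypothetical. It only prepares the request and does not send it.

```python
import urllib.request

# Regions named in the article, written as Azure region identifiers (assumed spelling).
PERSONAL_VOICE_REGIONS = {"westeurope", "eastus", "southeastasia"}


def tts_endpoint(region: str) -> str:
    """Return the text to speech REST endpoint for a supported region."""
    normalized = region.strip().lower().replace(" ", "")
    if normalized not in PERSONAL_VOICE_REGIONS:
        raise ValueError(
            f"Personal voice is not available in {region!r}; "
            f"use one of {sorted(PERSONAL_VOICE_REGIONS)}"
        )
    return f"https://{normalized}.tts.speech.microsoft.com/cognitiveservices/v1"


def build_tts_request(region: str, subscription_key: str, ssml: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying an SSML body."""
    return urllib.request.Request(
        tts_endpoint(region),
        data=ssml.encode("utf-8"),
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
            "User-Agent": "personal-voice-sketch",  # hypothetical client name
        },
    )


if __name__ == "__main__":
    request = build_tts_request("West Europe", "YOUR_KEY", "<speak/>")
    print(request.get_method(), request.full_url)
```

Sending the prepared request with `urllib.request.urlopen(request)` would return the audio bytes on success, provided the resource has been granted Limited Access to the feature.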
In addition to creating a personal voice, you can create a brand voice for your business with Custom Neural Voice’s professional voice feature. Azure AI Speech also offers over 400 neural voices covering more than 140 languages and locales. With these pre-built Text-to-Speech voices, you can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users.