This post was co-authored by Sheng Zhao, Anny Dow, Edward Un, Yueying Liu, Garfield He and Yang Zheng.
(voiced by Neural TTS)
Neural Text to Speech (Neural TTS) converts text to lifelike speech for more natural interfaces. With natural-sounding speech that matches the stress patterns and intonation of human voices, neural TTS significantly reduces listening fatigue when users are interacting with AI systems, enabling scenarios from audiobooks to voice assistants.
We’re excited to share that we are expanding our available neural TTS voices with Francisca, our new Brazilian Portuguese (pt-BR) voice. Francisca features the same human-like, natural prosody as the other neural TTS voices on Azure: Guy (American English Male), Jessa (American English Female), Katja (German Female), Elsa (Italian Female), and Xiaoxiao (Mandarin Chinese Female).
With a powerful base model created using a large volume of speech samples, we were able to build Francisca’s voice from much less training data than it would require otherwise. The neural TTS base model learns different speaking styles from multiple speakers, and through transfer learning, can easily adapt its style to a target speaker. Like other neural voices, Francisca can generate realistic speech waveforms for a given text input, matching the patterns of stress and intonation transitions in spoken language seamlessly.
Beyond synthesizing speech, developers can also tailor the voice to different scenarios by using the voice styles available in neural TTS. For example, the new pt-BR voice can also speak in a “cheerful” tone, which expresses a positive, happy emotion and is particularly useful in chatbot scenarios. You can adjust the speaking style easily with the <mstts:express-as> element in SSML.
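As a rough illustration of the <mstts:express-as> element, here is a minimal Python sketch that builds an SSML document requesting the cheerful style. The voice name pt-BR-FranciscaNeural, the mstts namespace URI, and the helper name build_express_as_ssml are assumptions made for this example; check the current SSML documentation for the exact values your service version expects.

```python
from xml.sax.saxutils import escape, quoteattr


def build_express_as_ssml(text, voice="pt-BR-FranciscaNeural",
                          style="cheerful", lang="pt-BR"):
    """Wrap text in SSML that asks for a speaking style.

    The default voice name and the mstts namespace URI are assumptions
    for illustration, not values confirmed by this post.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<mstts:express-as style={quoteattr(style)}>'
        f'{escape(text)}'  # escape &, <, > so the SSML stays well-formed
        '</mstts:express-as>'
        '</voice>'
        '</speak>'
    )


ssml = build_express_as_ssml("Olá! Que bom ver você & sua equipe.")
print(ssml)
```

The resulting string can be sent as the request body to the synthesis service. Escaping the input text matters in chatbot scenarios, where user-facing replies often contain characters such as "&".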
We conducted MOS (Mean Opinion Score) studies to evaluate the naturalness of Francisca. In a crowdsourced test with more than 60 native speakers, we examined 30 audio samples produced by Francisca in the neutral style and another 30 in the cheerful style. Listeners rated their overall impression on a 1-5 Likert scale, considering naturalness in rhythm variations, pitch variations, stresses, and pauses, as well as intelligibility. Human speech and a pt-BR voice from another cloud service provider (company X) were used as benchmarks. Results showed very positive feedback on Francisca in both the neutral (4.44) and cheerful (4.38) styles.
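For readers unfamiliar with MOS, the following Python sketch shows how such a score is commonly computed: it is the arithmetic mean of the listeners' 1-5 ratings. The ratings below are invented for illustration and are not data from our study, and the 95% interval uses a normal approximation that we assume here for simplicity, not necessarily the method used in the study.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = mean(ratings)
    # Normal-approximation interval; an assumption made for this sketch.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return mos, half_width


# Hypothetical ratings for one voice and style (not study data).
example_ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5]
mos, half_width = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```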
Hear what Francisca sounds like.
Example 1: Francisca (neutral)
Example 2: Francisca (cheerful)
Like other neural voices, Francisca is created at a 24 kHz sampling rate. You can maximize the fidelity of neural voice output by choosing one of the 24 kHz output formats.
For scenarios that require a lower sampling rate, such as playback over phone calls, Francisca and other neural voices can also be easily downsampled by choosing a lower-rate output format. Learn more about the supported output formats.
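As a sketch of how the output format might be selected per scenario, the Python snippet below prepares (but does not send) a REST synthesis request. The endpoint pattern, the header names, and the two format strings are assumptions based on the Speech service REST API as we understand it; the scenario labels, the region, and the subscription key are placeholders. Verify all of them against the supported output format list before use.

```python
# Scenario labels are hypothetical; the format strings are assumed to match
# the service's published output format names.
OUTPUT_FORMATS = {
    "high_fidelity": "riff-24khz-16bit-mono-pcm",  # keeps the native 24 kHz rate
    "telephony": "riff-8khz-16bit-mono-pcm",       # downsampled for phone playback
}


def build_tts_request(ssml, scenario, region, subscription_key):
    """Assemble the URL, headers, and body of a synthesis request (not sent)."""
    if scenario not in OUTPUT_FORMATS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    return {
        "url": f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        "headers": {
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": OUTPUT_FORMATS[scenario],
            "User-Agent": "francisca-demo",  # placeholder client name
        },
        "body": ssml.encode("utf-8"),
    }


request = build_tts_request(
    "<speak>...</speak>",          # placeholder; supply a full SSML document here
    scenario="telephony",
    region="westus",               # placeholder region
    subscription_key="YOUR_SUBSCRIPTION_KEY",
)
print(request["headers"]["X-Microsoft-OutputFormat"])
```

Keeping the scenario-to-format mapping in one place makes it easy to switch a phone-call integration to a lower rate without touching the rest of the request logic.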
Neural TTS is powering Microsoft services at scale. The Francisca voice is now supported in the new Microsoft Edge, enabling you to have web content read aloud anytime, anywhere with natural voices.
Edge Read Aloud also makes it easy to follow along with text: it supports word boundary output, so each word is highlighted in the UI as it is read out. This is an essential feature for immersive reading scenarios. To build your own Read Aloud apps, check out the SynthesisWordBoundaryEventAsync function in our sample code.
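To give a feel for how word boundaries can drive highlighting, here is a self-contained Python sketch that does not call the Speech SDK. It assumes each boundary event carries an audio offset in 100-nanosecond ticks, a text offset, and a word length, which mirrors our understanding of the SDK's word boundary events; the WordBoundary class, the word_at helper, and the sample timings are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass

TICKS_PER_MS = 10_000  # assumes offsets are reported in 100 ns ticks


@dataclass
class WordBoundary:
    audio_offset_ticks: int  # when the word starts in the audio
    text_offset: int         # where the word starts in the input text
    word_length: int         # number of characters in the word


def word_at(boundaries, playback_ms):
    """Return the (start, end) text span to highlight at a playback time.

    boundaries must be sorted by audio offset. Returns None before the
    first word has started.
    """
    starts_ms = [b.audio_offset_ticks / TICKS_PER_MS for b in boundaries]
    index = bisect_right(starts_ms, playback_ms) - 1
    if index < 0:
        return None
    b = boundaries[index]
    return (b.text_offset, b.text_offset + b.word_length)


text = "Olá mundo lindo"
# Hypothetical events, as if collected from a word boundary callback.
events = [
    WordBoundary(500_000, 0, 3),      # "Olá" starts at 50 ms
    WordBoundary(4_500_000, 4, 5),    # "mundo" starts at 450 ms
    WordBoundary(9_000_000, 10, 5),   # "lindo" starts at 900 ms
]

span = word_at(events, playback_ms=600)
highlighted = text[span[0]:span[1]]
print(highlighted)
```

In a real app, the events would be collected while synthesis runs, and the UI would call a lookup like word_at on each playback timer tick.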
The same transfer learning technology is now shipped in the Custom Neural Voice capability, enabling organizations to create their one-of-a-kind digital voices with 5X less data while still delivering high-fidelity audio outputs.
With Brazilian Portuguese (pt-BR) added to the family, seven locales are now supported in the custom neural voice online training portal: American English (en-US), British English (en-GB), Indian English (en-IN), German, French, Chinese (zh-CN), and Brazilian Portuguese (pt-BR). More locales are available through customer engagement. Submit a request to create your custom voice using the neural TTS technology.
With these updates, we’re excited to be powering natural and intuitive voice experiences. Text to Speech has more than 75 standard voices in over 45 languages and locales in addition to our growing list of neural voices. Learn more about how you can get started.