Now more than ever, developers are expected to design and build apps that interact naturally with end-users. At Build, we’ve made several improvements to the Speech service that will make it easier to build rich, voice-enabled experiences that address a variety of needs with speech-to-text and text-to-speech.
Improving the Speech Studio experience
Speech Studio is a UI-based portal that lets developers explore the Speech service with no-code tools and customize various aspects of the service in a guided experience.
Improvements to Speech Studio include:
- Modern UX with the latest unified Azure design template.
- Convenient no-code tools for quickly onboarding to the Speech service. Try out Real-time Speech-to-text to transcribe your audio into text, the Voice Gallery to explore our natural-sounding Text-to-speech voices, and Pronunciation Assessment to evaluate a user’s fluency and pronunciation.
- Responsive layouts, faster page loading, and an improved login experience with the latest Azure AD authentication.
Text-to-speech adds more languages, continued voice quality improvements and more
At the Microsoft Build conference, Microsoft announced the extension of Neural TTS to support 10 more languages and 32 new voices. With this update, Azure neural TTS now provides developers with a rich choice of more than 250 voices across 70+ languages and variants, available in 21+ Azure regions. The 10 newly released languages (locales) are: English (Hong Kong), English (New Zealand), English (Singapore), English (South Africa), Spanish (Argentina), Spanish (Colombia), Spanish (US), Gujarati (India), Marathi (India), and Swahili (Kenya).
In addition, 11 new voices have been added to the US English portfolio, enabling developers to create even more appealing read-aloud and conversational experiences in different voices. These new voices span different age groups, including a kid voice, and different voice timbres, to meet customers’ requirements for voice variety. Together with Aria, Jenny, and Guy, we now offer 14 neural TTS voices in US English.
We have also improved the question tones for the following voices: Mia and Ryan in English (United Kingdom), Denise and Henri in French (France), Isabella in Italian (Italy), Conrad in German (Germany), Alvaro in Spanish (Spain), and Dalia and Jorge in Spanish (Mexico).
See more details in the TTS blog.
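To give a feel for how a specific voice is selected, here is a minimal C# sketch using the Speech SDK; the subscription key, region, and spoken text are placeholders:

```csharp
using System;
using Microsoft.CognitiveServices.Speech;

// Key and region are placeholders for your own Speech resource.
var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
// Pick one of the US English neural voices by name.
config.SpeechSynthesisVoiceName = "en-US-JennyNeural";

using var synthesizer = new SpeechSynthesizer(config);
var result = await synthesizer.SpeakTextAsync("Hello! This is a neural voice from the Azure Speech service.");
Console.WriteLine(result.Reason); // SynthesizingAudioCompleted on success
```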
Cross-lingual adaptation enables the same voice to speak in multiple languages
To support the growing need for a single persona to speak multiple languages in scenarios such as localization and translation, a neural voice that speaks multiple languages is now available in public preview. This new Jenny multilingual voice, with US English as the primary/default language, can fluently speak 13 secondary languages: German (Germany), English (Australia), English (Canada), English (United Kingdom), Spanish (Spain), Spanish (Mexico), French (Canada), French (France), Italian (Italy), Japanese (Japan), Korean (Korea), Portuguese (Brazil), and Chinese (Mandarin, Simplified).
With this new voice, developers can easily enable their applications to speak multiple languages without changing the persona. Learn how to use the multilingual capability of the voice with SSML.
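As a rough illustration, the sketch below assumes the Jenny multilingual voice (en-US-JennyMultilingualNeural) and uses the SSML `<lang>` element to switch into two of the secondary languages; the greetings are invented for the example:

```csharp
using Microsoft.CognitiveServices.Speech;

var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
using var synthesizer = new SpeechSynthesizer(config);

// The <lang> element switches the multilingual voice to a secondary language mid-utterance.
string ssml = @"
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
  <voice name='en-US-JennyMultilingualNeural'>
    Hello, welcome back!
    <lang xml:lang='es-MX'>¡Hola! ¿Cómo estás?</lang>
    <lang xml:lang='fr-FR'>Bonjour tout le monde.</lang>
  </voice>
</speak>";

await synthesizer.SpeakSsmlAsync(ssml);
```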
What’s more, we have also brought this powerful feature to Custom Neural Voice, allowing customers to build a natural-sounding, one-of-a-kind voice that speaks different languages. Custom Neural Voice has enabled a number of global companies such as the BBC, Swisscom, AT&T, and Duolingo to build realistic voices that resonate with their brands.
This cross-lingual adaptation feature (preview) brings new opportunities to light up more compelling scenarios. For example, developers can enable an English virtual assistant’s voice to speak German fluently so the bot can read movie titles in German; or, create a game with the same non-player characters speaking different languages to users from different geographies.
Speech-to-text adds new languages, continuous language detection and more
The Speech-to-text capability now supports 95 languages and variants. At Build, we announced 9 new languages (locales): English (Ghana), English (Kenya), English (Tanzania), Filipino (Philippines), French (Switzerland), German (Austria), Indonesian (Indonesia), Malay (Malaysia), Vietnamese (Vietnam).
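Targeting one of the new locales only requires setting the recognition language. A minimal C# sketch, with a placeholder key and region and the new French (Switzerland) locale:

```csharp
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
// French (Switzerland), one of the newly announced locales.
config.SpeechRecognitionLanguage = "fr-CH";

using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
using var recognizer = new SpeechRecognizer(config, audioConfig);

var result = await recognizer.RecognizeOnceAsync();
Console.WriteLine(result.Text);
```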
This enables developers to provide solutions to a global audience. A great example is Twitter, which uses the Speech service to generate captions for live audio conversations on Twitter Spaces, making its platform more accessible to all its users.
Continuous language detection
Speech transcription is incredibly accurate and useful for scenarios like call center transcription and live audio captioning. However, in working with some of our customers, we noticed that they often have multilingual employees and customers who switch between different languages, sometimes mid-sentence. In our increasingly globalized world, the ability to support multilingual scenarios becomes more essential by the day, whether in conferences, on social media, or in call center transcripts.
With continuous language detection, now in preview, Speech-to-text can recognize multiple languages within the same audio. This removes the manual effort of tagging and splitting audio so that each segment is transcribed in the correct language; it all becomes automatic. Customers can send audio files or streams, each containing a different language, or possibly more than one, and the service processes the audio in a single pass and returns the resulting transcription. To learn how to get started, visit our documentation page.
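As a sketch of what this looks like with the Speech SDK in C#: the candidate languages, file name, key, and region below are placeholders, and the LanguageIdMode property follows current SDK versions, so check the documentation for the exact form in your SDK release:

```csharp
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
// Ask the service to keep re-detecting the language throughout the audio
// (the default mode only detects at the start); property name per current SDKs.
config.SetProperty(PropertyId.SpeechServiceConnection_LanguageIdMode, "Continuous");

// Candidate languages the service may switch between.
var autoDetect = AutoDetectSourceLanguageConfig.FromLanguages(new[] { "en-US", "es-ES", "de-DE" });
using var audioConfig = AudioConfig.FromWavFileInput("multilingual.wav"); // placeholder file
using var recognizer = new SpeechRecognizer(config, autoDetect, audioConfig);

recognizer.Recognized += (s, e) =>
{
    if (e.Result.Reason == ResultReason.RecognizedSpeech)
    {
        var detected = AutoDetectSourceLanguageResult.FromResult(e.Result);
        Console.WriteLine($"[{detected.Language}] {e.Result.Text}");
    }
};

await recognizer.StartContinuousRecognitionAsync();
Console.ReadLine(); // keep recognizing until Enter is pressed
await recognizer.StopContinuousRecognitionAsync();
```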
Pronunciation assessment
An important element of language learning is being able to pronounce words accurately. The Speech service now supports pronunciation assessment to further empower language learners and educators. Pronunciation assessment is generally available in US English; other Speech-to-text languages are available in preview.
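A minimal C# sketch of scoring a learner’s recording against a reference sentence; the audio file, reference text, key, and region are placeholders:

```csharp
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
config.SpeechRecognitionLanguage = "en-US"; // generally available in US English

using var audioConfig = AudioConfig.FromWavFileInput("learner.wav"); // placeholder recording
using var recognizer = new SpeechRecognizer(config, audioConfig);

// Score the recording against the reference text, down to phoneme level.
var pronConfig = new PronunciationAssessmentConfig(
    referenceText: "Good morning, how are you today?",
    gradingSystem: GradingSystem.HundredMark,
    granularity: Granularity.Phoneme,
    enableMiscue: true);
pronConfig.ApplyTo(recognizer);

var result = await recognizer.RecognizeOnceAsync();
var scores = PronunciationAssessmentResult.FromResult(result);
Console.WriteLine($"Accuracy: {scores.AccuracyScore}, Fluency: {scores.FluencyScore}, Completeness: {scores.CompletenessScore}");
```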
Pronunciation assessment is used in PowerPoint Presenter Coach to advise presenters on the correct pronunciation of spoken words throughout their rehearsal. Reading Progress in Microsoft Teams also uses pronunciation assessment to help students improve reading fluency, after the pandemic negatively affected students’ reading ability. It can be used inside and outside the classroom to save teachers time and improve learning outcomes for students.
BYJU’S also uses pronunciation assessment in its English Language App (ELA), which targets geographies where English is used as a secondary language and is considered an essential skill to acquire. The app combines comprehensive lessons with state-of-the-art speech technology to help children learn English with a personalized lesson path.
Pearson’s Longman English Plus uses pronunciation assessment to empower both students and teachers to improve productivity in language learning, with a personalized placement test feature and learning material recommendations for different levels of students. As the world’s leading learning company, Pearson enables tens of millions of learners per year to maximize their success. Key technologies from Microsoft used in Longman English Plus are pronunciation assessment, neural text-to-speech and natural language processing.
Custom Keyword
Custom Keyword, now generally available, allows you to generate keyword recognition models for any word or short phrase you specify; the models execute at the edge. They can be used to add voice activation to your product, enabling your end-users to interact completely hands-free. What’s new is the ability to create Advanced models: models with increased accuracy, without you having to provide any training data. Custom Keyword fully handles data generation and training. To learn how to get started, read this walkthrough.
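Once you have generated and downloaded a model from Speech Studio, using it looks roughly like the following C# sketch; the .table file name is a placeholder:

```csharp
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Load the keyword model exported from Speech Studio (file name is a placeholder).
var model = KeywordRecognitionModel.FromFile("HeyContoso.table");

using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
using var keywordRecognizer = new KeywordRecognizer(audioConfig);

Console.WriteLine("Say the keyword...");
// Runs on-device and returns once the keyword is detected.
var result = await keywordRecognizer.RecognizeOnceAsync(model);
Console.WriteLine($"Keyword recognized: {result.Text}");
```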
Speech SDK updates
Here are the highlights of the May release of the Speech SDK 1.17.0:
- Smaller footprint: we continue to decrease the memory and disk footprint of the Speech SDK and its components, and have reduced it by over 30% across the last few releases.
- The SDK now supports the language detection feature mentioned above in C++ and C#. You can easily recognize what language is being spoken, either at the beginning of a conversation or throughout it.
- We are always broadening the scope of platforms on which you can develop speech-enabled applications. We just added the ability to develop mixed reality and gaming applications using Unity on macOS.
- We always strive to meet developers where they are, both on their platforms and in their preferred programming language. We just added text-to-speech support to our Go programming language API. This is in addition to the speech recognition support we have offered for Go since 2020.
See the Speech SDK 1.17.0 release notes for more details. If you’d like us to support additional features for your use case, we are always listening! Find us on GitHub and drop a question; we will get back to you quickly and do what we can to support you.
Have fun developing awesome speech-enabled solutions on the Azure Speech service!
Next Steps:
- Visit the Speech product page to learn about key scenarios and docs to get started
- Try out Speech Studio for a UI-based building experience
- Get started on your 30-day learning journey for AI development