Improve remote learning with speech-enabled apps powered by Azure Cognitive Services
Published Aug 26 2020 12:40 AM

This post was co-authored by Melissa Ma, Yueying Liu, Anny Dow and Sheng Zhao  


Online learning has grown rapidly over the last couple of months as schools and organizations adapt to new ways of connecting and methods of education. Speech technology can play a significant role in making distance learning more engaging and accessible to students of all backgrounds. With Azure Cognitive Services, developers can quickly add speech capabilities to applications, bringing online learning to life.


Enhancing language fluency with pronunciation assessment


One key element in language learning is improving pronunciation skills. For new language learners, practicing pronunciation and getting timely feedback are essential to becoming a more fluent speaker. In the current environment, online language learning and the ability to practice anytime, anywhere, have become even more important.


At the Build conference in May, we announced the preview of the pronunciation assessment capability, powered by Speech to Text. 


The pronunciation assessment capability evaluates speech pronunciation and gives speakers feedback on the accuracy and fluency of spoken audio, allowing users to benefit from:

  • Highly accurate evaluations - Provides consistent and accurate evaluation results using a machine learning-based approach that correlates highly with speech assessments conducted by native experts. The pronunciation assessment model was trained with 100,000+ hours of speech data from native English speakers and is highly robust. It assesses three dimensions of pronunciation: accuracy, fluency, and completeness. Pronunciation assessment can provide evaluations at multiple levels of granularity, returning accuracy scores for specific phonemes, words, sentences, or even whole articles.
  • Ability to account for inserted and omitted words - Offers rich configuration parameters for flexible use of the API. Using NLP techniques and the EnableMiscue setting, pronunciation assessment can detect errors such as extra, missing, or repeated words by comparing the speech against the reference text, which supports more accurate scoring. This is particularly useful for longer paragraphs of text.
  • Real-time streaming - Supports streaming upload of audio files for immediate feedback.
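
As a rough illustration of the configuration parameters mentioned above, the sketch below packs a set of assessment options into a base64-encoded JSON string, the shape we understand the Speech to Text REST API for short audio to expect in a Pronunciation-Assessment request header. Only EnableMiscue is named in this post; the other parameter names and values (ReferenceText, GradingSystem, Granularity, Dimension), the header name, and the helper function itself are assumptions for illustration, so check them against the current Speech service documentation before relying on them.

```python
import base64
import json


def build_assessment_header(reference_text, enable_miscue=True, granularity="Phoneme"):
    """Return assessment options as a base64-encoded JSON string.

    Hypothetical helper: parameter names other than EnableMiscue are
    assumptions, not taken from this post.
    """
    params = {
        "ReferenceText": reference_text,   # text the learner is expected to read
        "GradingSystem": "HundredMark",    # assumed: scores on a 0-100 scale
        "Granularity": granularity,        # assumed: "Phoneme", "Word", or "FullText"
        "Dimension": "Comprehensive",      # assumed: accuracy, fluency, completeness
        "EnableMiscue": enable_miscue,     # flag extra, missing, or repeated words
    }
    payload = json.dumps(params).encode("utf-8")
    return base64.b64encode(payload).decode("ascii")


header_value = build_assessment_header("Good morning.")
# Assumed header name; it would be sent alongside the audio in a recognition request.
headers = {"Pronunciation-Assessment": header_value}
print(headers)
```

Keeping the options in one small function makes it easy to switch granularity (for example, word-level scores for beginners and phoneme-level scores for advanced learners) without touching the rest of the request code.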


With pronunciation assessment, language learners can practice, get instant feedback, and improve their pronunciation. Online learning solution providers or educators can use the capability to evaluate the pronunciation of multiple speakers in real time. Pronunciation assessment currently supports the English language.



Educational organizations, like the Tomorrow Advancing Life (TAL) Education Group, are already building applications using pronunciation assessment to help students practice language learning remotely.


“Effectively and efficiently teaching accurate pronunciation to students of different levels is a big challenge, both in class and outside of class. The Speech service’s pronunciation assessment capability provides a powerful solution to address this challenge. We’ve been highly impressed by the robustness of pronunciation assessment and its ability to deal with noisy environments, and how well it correlates with pronunciation evaluations conducted by our teachers.”

- Xiangyu Hu, AI Scientist of Tomorrow Advancing Life (TAL) Education Group  


Learn how to get started with pronunciation assessment using our tutorial video, and download the source code from GitHub to try it out.



Developing interactive courses with Text to Speech


Another way that speech technology can support better online learning experiences is through Text to Speech, a Speech service feature that converts text to lifelike speech. Educators can create interactive materials with highly expressive and humanlike voices using Neural Text to Speech (Neural TTS), now available in 36 voices across 31 languages. (Learn about our most recent languages here.)


With Neural TTS, developers can add natural-sounding voice to learning materials, for scenarios like slide narration. Neural TTS can also be used for reading aloud any content, facilitating new ways for students to interact with material as well as increasing accessibility for students with learning differences. Educational organizations can also use Neural TTS to create AI-powered virtual “teachers” that interact with students to make online courses more engaging.


Experience the Neural Voices with the new Edge browser


With the Custom Neural Voice capability, online learning solution providers can further create interactive learning experiences for their students in a voice that represents their brand, or develop unique voices for different characters. For example, Duolingo, one of the world’s most popular language learning apps, is creating unique voices for different characters used in the lessons.    


Using SSML or the Audio Content Creation tool, users can further fine-tune audio characteristics like speaking rate, pitch, and pronunciation to fit their scenarios, with no code required. Neural TTS also supports different speaking styles, like cheerfulness and empathy, making it easier to bring audiobooks to life. We recently added 10 new voice styles, currently available in Chinese (the Xiaoxiao voice), and will expand them to other languages. With these new styles, online education solution providers can create more engaging interactive courses that express rich emotions.
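
For readers who prefer to generate SSML programmatically, here is a minimal sketch of what a document that sets speaking rate, pitch, and speaking style might look like. The helper function, the voice name en-US-AriaNeural, the style value cheerful, and the rate and pitch values are all assumptions chosen for illustration and are not taken from this post; the element names follow our understanding of the Speech service's SSML dialect and should be checked against the current documentation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-AriaNeural", lang="en-US",
               style="cheerful", rate="+10%", pitch="+5%"):
    """Build an SSML string with a speaking style plus rate and pitch tweaks.

    Hypothetical helper: default voice, style, rate, and pitch are assumed values.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<mstts:express-as style={quoteattr(style)}>'
        f'<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>'
        f'{escape(text)}'  # escape &, <, > so lesson text cannot break the markup
        '</prosody></mstts:express-as></voice></speak>'
    )


ssml = build_ssml("Welcome back to today's lesson!")
ET.fromstring(ssml)  # raises ParseError if the SSML is not well-formed XML
print(ssml)
```

The resulting string could be pasted into the Audio Content Creation tool or passed to a Text to Speech request; generating it in code is mainly useful when narrating many slides or lessons with the same voice settings.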


To learn more about Audio Content Creation, watch the video tutorial.



To learn more and get started adding speech to your educational applications, check out our resources below:


Pronunciation Assessment

Text to Speech


Last update: Jul 02 2021 06:45 AM