Speech Service Update – Hierarchical Transformer for Pronunciation Assessment
Published Feb 13 2023 09:55 PM




This post was co-authored by Ke Wang, Yinhe Wei, Lei He, Sheng Zhao, Qinying Liao, Andy Beatman and Deb Adeogba


Pronunciation Assessment plays a significant role in Computer-Assisted Language Learning (CALL) for language learners and educators. It evaluates spoken pronunciation and gives speakers feedback on several aspects of their speech, including accuracy, fluency and prosody. Pronunciation Assessment is a feature of Speech Service in the Azure Cognitive Services family, publicly available in 10+ languages and variants including American English, British English, Australian English, French, Spanish and Chinese, with additional languages in preview.


In February 2023, the Microsoft Reimagine Education event announced several new features to support student success. Pronunciation assessment is used in Reading Coach in Immersive Reader and in Speaker Progress in Teams. It can be used inside and outside the classroom to save teachers time and improve students' reading fluency. Pronunciation assessment is also used in PowerPoint coach to advise presenters on the correct pronunciation of spoken words throughout their rehearsal.


BYJU uses pronunciation assessment in its English Language App (ELA), which targets geographies where English is a second language and is considered an essential skill to acquire. The app combines comprehensive lessons with state-of-the-art speech technology to help children learn English with a personalized lesson path.


Pearson’s Longman English Plus uses pronunciation assessment to empower both students and teachers to improve productivity in language learning, with a personalized placement test feature and learning material recommendations for different levels of students. As the world’s leading learning company, Pearson enables tens of millions of learners per year to maximize their success.


For language learners, practicing pronunciation and receiving timely feedback are essential to improving language skills. To provide accurate assessment results, we combined Azure Neural Text-to-Speech (TTS), a Transformer model, ordinal regression and a hierarchical structure to improve accuracy assessment. In this blog, we take a deeper look at the technology behind the Pronunciation Assessment capability and demonstrate the performance gain on word-level accuracy.


Neural network-based goodness of pronunciation (GOP) and its variants are the dominant methods for assessing pronunciation and have been shown to correlate well with human assessment. The quality of the GOP feature depends on the quality of the acoustic model used. Azure Speech-to-Text (STT) has a powerful, carefully designed model structure and is trained on large-scale real data, so we can use its high-quality GOP features to train our mispronunciation detection model.
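To make the GOP idea concrete, here is a minimal sketch that assumes one common neural formulation: the average log posterior of the expected phoneme over the frames aligned to it. The function name, the input layout and the toy posteriors are hypothetical, and this is not the Azure STT implementation.

```python
import math


def gop_score(frame_posteriors, target_phoneme):
    """Average log posterior of the target phoneme over its aligned frames.

    frame_posteriors: one dict per acoustic frame aligned to the target
    phoneme, mapping phoneme label -> posterior probability (assumed input).
    """
    if not frame_posteriors:
        raise ValueError("need at least one aligned frame")
    eps = 1e-10  # floor to avoid log(0) when the phoneme gets no probability
    log_probs = [
        math.log(max(frame.get(target_phoneme, 0.0), eps))
        for frame in frame_posteriors
    ]
    return sum(log_probs) / len(log_probs)


# Toy example: three frames aligned to the expected phoneme "ae".
well_pronounced = [
    {"ae": 0.9, "eh": 0.1},
    {"ae": 0.8, "eh": 0.2},
    {"ae": 0.95, "eh": 0.05},
]
mispronounced = [
    {"ae": 0.2, "eh": 0.8},
    {"ae": 0.1, "eh": 0.9},
    {"ae": 0.3, "eh": 0.7},
]

good = gop_score(well_pronounced, "ae")
bad = gop_score(mispronounced, "ae")
```

Under this formulation the score is always at most 0, and a value closer to 0 means the acoustic model is more confident that the expected phoneme was actually spoken.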


Overall architecture

The high-level architecture of our mispronunciation detection system is shown in Figure 1. One of the main challenges for mispronunciation detection is the lack of high-quality labeled data. To overcome this data scarcity, including the imbalance between positive and negative samples, we used Azure Neural TTS to generate training data that mimics human mispronunciation behavior. The augmented data was used to pretrain the source model, and the labeled data was then used to fine-tune it. In the data labeling stage, we asked 3-5 language experts (LEs) to label the data independently using the same metrics. To guarantee label quality, the Pearson Correlation Coefficient (PCC) between any two LEs must be larger than a given threshold. This two-stage modeling also lets us support some low-resource locales by leveraging the TTS-generated data. On the SpeechOcean762 dataset, combining these innovations improved the PCC from 0.5661 to 0.6562.
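The label-quality check described above can be sketched as a pairwise agreement gate. This is an illustration only: the threshold of 0.6, the expert names and the toy scores are assumptions, since the post does not state the actual threshold or labeling scale.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def low_agreement_pairs(labels_by_expert, threshold):
    """Return (expert_a, expert_b, pcc) for every pair below the threshold."""
    flagged = []
    experts = sorted(labels_by_expert.items())
    for (name_a, scores_a), (name_b, scores_b) in combinations(experts, 2):
        r = pearson(scores_a, scores_b)
        if r < threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Toy labels from three hypothetical experts on the same six utterances.
labels = {
    "LE1": [2, 1, 0, 2, 1, 2],
    "LE2": [2, 1, 0, 2, 2, 2],
    "LE3": [0, 2, 1, 0, 2, 1],
}
flagged = low_agreement_pairs(labels, threshold=0.6)
```

Here LE1 and LE2 agree closely, while every pair involving LE3 falls below the assumed threshold, so those labels would be sent back for review.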



Figure 1. The overall architecture


Hierarchical Transformer

In this section, we introduce the details of the Hierarchical Transformer model. The framework is illustrated as follows:



Figure 2. The overall framework of the hierarchical mispronunciation detection Transformer. (a) The structure of the Hierarchical Transformer model. (b) The details of the Transformer block. (c) The Aligner block that connects the senone and phoneme information.


We use senone, phoneme and word features from the acoustic model of an STT system as the input to the hierarchical Transformer model. Senone information captures fine-grained pronunciation patterns, self-attention captures the points of focus at the senone and phoneme levels, and word-level features provide a bird’s-eye view for scoring the current word. The Aligner block explicitly connects the senone and phoneme information, which lets the Transformer learn the hidden relationship between them.


For word-level features, we adopt the word posterior score, sentence-level signal-to-noise ratio (SNR), duration, and statistical information, including consonant and vowel attributes. The phoneme feature is much simpler than the word feature: it consists solely of the phoneme score and duration. For the senone feature, only the senone score and status are used. With this carefully designed feature set and model structure, both coarse-grained and fine-grained features help the Transformer model the pronunciation score.
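One way to picture the three feature levels is as nested records, sketched below. The field names, units and the integer encoding of the senone status are assumptions, and the mean pooling in `pool_senones` is only a hypothetical stand-in for the Aligner block, whose internals are not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SenoneFeature:
    score: float  # senone score from the acoustic model
    status: int   # senone status (integer encoding is an assumption)


@dataclass
class PhonemeFeature:
    score: float     # phoneme score
    duration: float  # duration in seconds (unit is an assumption)
    senones: List[SenoneFeature] = field(default_factory=list)


@dataclass
class WordFeature:
    posterior: float     # word posterior score
    snr_db: float        # sentence-level signal-to-noise ratio
    duration: float      # word duration in seconds
    num_consonants: int  # statistical info: consonant count
    num_vowels: int      # statistical info: vowel count
    phonemes: List[PhonemeFeature] = field(default_factory=list)


def pool_senones(phoneme):
    """Mean-pool senone scores into one value per phoneme (illustrative only)."""
    if not phoneme.senones:
        return 0.0
    return sum(s.score for s in phoneme.senones) / len(phoneme.senones)


def phoneme_rows(word):
    """One row per phoneme: [phoneme score, duration, pooled senone score]."""
    return [[p.score, p.duration, pool_senones(p)] for p in word.phonemes]


# Toy word with three phonemes and made-up scores.
word = WordFeature(
    posterior=0.9, snr_db=20.0, duration=0.4, num_consonants=2, num_vowels=1,
    phonemes=[
        PhonemeFeature(0.8, 0.1, [SenoneFeature(0.5, 1), SenoneFeature(1.0, 1)]),
        PhonemeFeature(0.6, 0.2, [SenoneFeature(0.25, 0)]),
        PhonemeFeature(0.9, 0.1),
    ],
)
rows = phoneme_rows(word)
```

In a real model the senone-to-phoneme connection would be learned rather than averaged; the sketch only shows how fine-grained senone information can be tied to the phoneme it belongs to.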

Ordinal regression

Ordinal regression (OR) has been adopted for sentence-level fluency and accuracy assessment. It performs better than traditional approaches because it does not treat speech assessment as a classification or regression task. Instead, it aims to predict the relative ranking of samples, i.e., it compares two samples and judges which one is better. This binary preference test is easier, quicker, and more accurate than traditional methods. In addition, ranking matches human behavior, since assessment scores exhibit a natural ordering.
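The pairwise comparison idea can be illustrated with a small sketch. It assumes a pairwise logistic (RankNet-style) loss, which is one common way to train on "which sample is better" preferences and is not necessarily the loss used in the service; the utterance IDs and scores are made up.

```python
import math
from itertools import combinations


def make_preference_pairs(samples):
    """Turn (sample_id, human_score) items into (better_id, worse_id) pairs.

    Ties carry no preference, so they are skipped.
    """
    pairs = []
    for (id_a, score_a), (id_b, score_b) in combinations(samples, 2):
        if score_a > score_b:
            pairs.append((id_a, id_b))
        elif score_b > score_a:
            pairs.append((id_b, id_a))
    return pairs


def pairwise_logistic_loss(predictions, pairs):
    """Mean of log(1 + exp(-(f(better) - f(worse)))) over all pairs."""
    if not pairs:
        return 0.0
    losses = [
        math.log1p(math.exp(-(predictions[better] - predictions[worse])))
        for better, worse in pairs
    ]
    return sum(losses) / len(losses)


# Toy human scores for four utterances; utt3 and utt4 are tied.
samples = [("utt1", 9), ("utt2", 4), ("utt3", 7), ("utt4", 7)]
pairs = make_preference_pairs(samples)

# Model outputs that respect the human ordering, and the same outputs negated.
consistent = {"utt1": 2.0, "utt2": -1.0, "utt3": 0.5, "utt4": 0.5}
inverted = {utt: -value for utt, value in consistent.items()}

loss_consistent = pairwise_logistic_loss(consistent, pairs)
loss_inverted = pairwise_logistic_loss(inverted, pairs)
```

The loss depends only on score differences within each pair, so the model is rewarded for getting the ordering right rather than for matching absolute score values.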


We further adopt OR for pronunciation accuracy assessment and demonstrate it with experimental results on the SpeechOcean762 dataset.


For a fair comparison with other systems, we also trained models on the SpeechOcean762 dataset, which contains 2,500 well-labeled samples each for training and evaluation. PCC is used to measure the correlation between the generated assessment scores and the scores labeled by human judges. It takes a value between -1 and 1, where 0 means no correlation. A negative value means the prediction is opposite to the target, a positive value means the prediction is aligned with the target, and values close to 1 indicate strong correlation. In the SpeechOcean762 dataset, each sample is labeled by 5 individual LEs. The PCC is computed between each pair of LEs on all 2,500 evaluation samples and then manually averaged as one measure of human parity. The experimental results are shown in Table 1.
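For reference, the standard definition of the PCC and one reading of the human-parity average can be written as follows. The symbols are notation introduced here, not taken from the original: x and y are two score vectors over N samples, and s^(a) is the score vector from the a-th of K language experts. With K = 5 the average runs over 10 expert pairs.

```latex
r(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
               {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,
                \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}

\text{Human parity} = \frac{1}{\binom{K}{2}}
    \sum_{1 \le a < b \le K} r\!\left(s^{(a)}, s^{(b)}\right)
```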




[Table 1 labels recovered: Competitor 1, Competitor 2, Human Parity; the other entries and all values are not preserved.]








Table 1. Experimental results on SpeechOcean762


Goodness Of Pronunciation feature-based Transformer (GOPT), which uses a Transformer and multi-task learning to model multi-aspect and multi-granularity scores, is, to the best of our knowledge, a state-of-the-art method. We also evaluated some top commercial speech assessment services; their results are shown as Competitor 1 and Competitor 2 in Table 1. Our hierarchical Transformer achieved the best performance. Specifically, by leveraging OR, we further narrowed the gap to human parity on PCC.


Get started

To learn more and get started, first try our no-code tool in Speech Studio, which lets you explore the Speech service through an intuitive user interface. You will need an Azure account and a Speech service resource to create your own personal account in Speech Studio. If you do not have an account and subscription, try the Speech service for free.


Here are more resources to help you add speech to your language learning applications: 

Last update: Nov 12 2023 06:23 PM