Speech Service Ignite Update – New Enhancement features for Pronunciation Assessment

Microsoft

Nov 15, 2023

We are thrilled to unveil an array of new capabilities for Pronunciation Assessment that aim to advance computer-assisted language learning (CALL) for learners and educators alike. Pronunciation Assessment is now Generally Available in over 14 languages and variants, including American English, British English, Australian English, French, German, Japanese, Korean, Portuguese, Spanish and Chinese, with additional languages in preview. This makes it a more global solution for evaluating speech pronunciation and providing feedback on accuracy and fluency.

At Ignite in November 2023, we are taking Pronunciation Assessment to the next level with new public preview features in English that cover Prosody, Grammar, Vocabulary and Topic. These features provide a more comprehensive experience for learners and educators, including speaking and conversation-based evaluation without a script. For the Reading scenario, we are releasing new features in Reading Progress in Teams, as shown in Figure 1. Pronunciation Assessment of students' reading accuracy and prosody can be used inside and outside the classroom to save teachers time and improve learning outcomes. For the Speaking scenario, Pronunciation Assessment is also used in PowerPoint coach to advise presenters on the correct pronunciation of spoken words throughout their rehearsal. With these capabilities, Pronunciation Assessment is well suited to anyone looking to take their language learning to the next level.

Figure 1. Reading Progress in Teams

Overall architecture

The high-level architecture of our pronunciation assessment system is shown in Figure 2. Depending on the scenario, the enhanced system can provide up to seven scores: accuracy, fluency and completeness, plus four new scores for prosody, grammar, vocabulary and topic.

Figure 2. The overall architecture

Prosody

Prosody is an important dimension in Pronunciation Assessment, as it measures whether the prosodic patterns in a given utterance are correct. In our enhanced system, we provide both prosody feedback and prosody scores for all scenarios, including reading and speaking.

With prosody feedback, we offer a detailed error analysis of speech from target speakers. This involves three different types of errors: unexpected break, missing break at punctuation, and monotone (where the whole utterance is monotonically rising, falling, or flat).
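The three error types can be illustrated with a simple rule-based sketch. This is not how the service detects them. The `Word` record, the two thresholds, and the use of a plain list of pitch samples are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import List


@dataclass
class Word:
    text: str     # recognized word, with trailing punctuation if any
    start: float  # start time in seconds
    end: float    # end time in seconds


# Hypothetical thresholds, chosen only for illustration.
BREAK_GAP_S = 0.30        # a silence at least this long counts as a break
MONOTONE_STDEV_HZ = 10.0  # pitch spread below this is treated as flat
PUNCTUATION = tuple(",.;:?!")


def prosody_errors(words: List[Word], pitch_hz: List[float]) -> List[str]:
    """Return a list of prosody error descriptions for one utterance."""
    errors = []
    for current, following in zip(words, words[1:]):
        gap = following.start - current.end
        at_punctuation = current.text.endswith(PUNCTUATION)
        if gap >= BREAK_GAP_S and not at_punctuation:
            errors.append(f"unexpected break after '{current.text}'")
        elif gap < BREAK_GAP_S and at_punctuation:
            errors.append(f"missing break after '{current.text}'")
    if len(pitch_hz) >= 2:
        steps = [b - a for a, b in zip(pitch_hz, pitch_hz[1:])]
        rising = all(step > 0 for step in steps)
        falling = all(step < 0 for step in steps)
        flat = pstdev(pitch_hz) < MONOTONE_STDEV_HZ
        if rising or falling or flat:
            errors.append("monotone")
    return errors


words = [
    Word("Hello,", 0.0, 0.4),
    Word("my", 0.45, 0.6),
    Word("name", 1.1, 1.4),
    Word("is", 1.45, 1.6),
    Word("Ana.", 1.65, 2.0),
]
pitch = [180.0, 182.0, 181.0, 183.0, 180.0]
found = prosody_errors(words, pitch)
```

In this toy utterance the speaker runs through the comma after "Hello," and pauses after "my" with an almost flat pitch, so all three error types are reported.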

The prosody score is an utterance-level score that indicates overall naturalness, including stress, intonation, speaking speed and rhythm. We score prosody with a Transformer model combined with a pre-trained ASR model, as illustrated in Figure 3. Phone-level features from the ASR model serve as input to the Transformer, and LSTM pooling produces the final utterance-level prosody score. The phone-level features consist of both goodness of pronunciation (GOP) features and phonetic features. With this feature and model design, both acoustic and linguistic information help the Transformer model the utterance-level prosody score.

Figure 3. Transformer based prosody model.

Grammar

Grammar score is one of the new features in this release. It runs a full-text grammar check and gives a grammar score based on the results, which can help users express themselves more accurately and produce fewer incorrect expressions.

Figure 4. Fluency boost model for grammar error correction.

The grammar scoring feature adopts the approach from the paper Fluency Boost Learning and Inference for Neural Grammatical Error Correction (ACL Anthology), proposed earlier by Microsoft Research Asia. By generating a large amount of error correction data through Fluency Boost Learning and combining it with seq2seq pre-training technology, the system scores candidate sentences sequentially to obtain the most grammatically correct sentence. It then evaluates and scores the error level of the original sentence. The grammar checking model has surpassed human reference levels on two authoritative grammatical error correction datasets, CoNLL-2014 and JFLEG, and remains at an industry-leading level.
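The multi-round inference idea from the cited paper can be sketched as follows: keep feeding the corrected sentence back into the corrector while its fluency keeps improving. The fluency definition below, 1 / (1 + H(x)) with H(x) the average negative log-probability per token, follows the paper as we understand it. The corrector and the language model are replaced by toy stand-ins (`toy_correct`, `toy_score`, `TOY_FIXES`, `BAD_BIGRAMS`), all hypothetical, and the sketch does not show how the service turns the result into a grammar score.

```python
import math
from typing import Callable, List


def fluency(token_logprobs: List[float]) -> float:
    """f(x) = 1 / (1 + H(x)), where H(x) is the mean negative log-probability."""
    if not token_logprobs:
        return 0.0
    cross_entropy = -sum(token_logprobs) / len(token_logprobs)
    return 1.0 / (1.0 + cross_entropy)


def multi_round_correct(
    sentence: str,
    correct: Callable[[str], str],
    score: Callable[[str], float],
    max_rounds: int = 5,
) -> str:
    """Re-run the corrector until the fluency score stops improving."""
    best, best_fluency = sentence, score(sentence)
    for _ in range(max_rounds):
        candidate = correct(best)
        candidate_fluency = score(candidate)
        if candidate_fluency <= best_fluency:
            break
        best, best_fluency = candidate, candidate_fluency
    return best


# Toy stand-ins for a seq2seq corrector and a language model (assumptions).
TOY_FIXES = [("he go ", "he goes "), (" a apple", " an apple")]
BAD_BIGRAMS = {("he", "go"), ("a", "apple")}


def toy_correct(sentence: str) -> str:
    """Fix at most one known error per call, like a single inference round."""
    for wrong, right in TOY_FIXES:
        if wrong in sentence:
            return sentence.replace(wrong, right, 1)
    return sentence


def toy_score(sentence: str) -> float:
    """Give known bad bigrams a low probability and everything else 0.5."""
    tokens = sentence.split()
    logprobs = [
        math.log(0.05) if pair in BAD_BIGRAMS else math.log(0.5)
        for pair in zip(tokens, tokens[1:])
    ]
    return fluency(logprobs)


corrected = multi_round_correct("he go to buy a apple", toy_correct, toy_score)
```

Each round here fixes one error and raises the fluency score, and the loop stops on the third round when the corrector has nothing left to change.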

Vocabulary

Vocabulary score examines various aspects of word usage within a text, incorporating accuracy, appropriateness, and richness, by utilizing statistical models to evaluate each dimension.

Accuracy pertains to the precise employment of words in a specific context, encompassing aspects such as spelling, meaning, and collocation, to ensure that the vocabulary within the text accurately conveys the author's intentions. Appropriateness primarily assesses whether the vocabulary aligns with contextual requirements and stylistic preferences, guaranteeing proper word choice and enhancing the text's fluency and readability. Richness, conversely, emphasizes the diversity and innovation of vocabulary, promoting the use of an extensive lexicon to articulate ideas while avoiding repetition and clichés.

By consolidating the scores from these dimensions, a more comprehensive evaluation of the text's vocabulary quality can be attained, ultimately assisting authors in elevating their writing proficiency and rendering the text more engaging and impactful.
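As a rough illustration of consolidating per-dimension scores, the sketch below uses a type-token ratio as a stand-in for richness and a weighted mean to combine the three dimensions. The service uses statistical models for each dimension. The function names, the 0 to 100 scale and the equal weights are assumptions made for this example.

```python
import re
from typing import Dict


def richness(text: str) -> float:
    """Type-token ratio on a 0-100 scale: distinct words over total words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


def vocabulary_score(sub_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of the per-dimension scores."""
    total_weight = sum(weights[name] for name in sub_scores)
    weighted_sum = sum(score * weights[name] for name, score in sub_scores.items())
    return weighted_sum / total_weight


# Hypothetical sub-scores; accuracy and appropriateness are made-up inputs.
sub_scores = {
    "accuracy": 90.0,
    "appropriateness": 70.0,
    "richness": richness("the cat saw the dog"),
}
equal_weights = {"accuracy": 1.0, "appropriateness": 1.0, "richness": 1.0}
overall = vocabulary_score(sub_scores, equal_weights)
```

A repeated word lowers the richness component, which in turn pulls down the consolidated score.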

Topic

Another important feature in this update is the Topic Score, which measures how relevant a text is to its topic. Relevance refers to the degree of connection between the text content and the topic, that is, whether the points, arguments, and examples discussed in the text are closely related to the given topic. A highly relevant text means that the author stays focused on the topic, concentrates on exploring the core ideas, and avoids lengthy narration that deviates from the subject. Measuring relevance helps assess the structure and logic of the article, thereby improving the quality and readability of the text.

To implement this feature, the Topic Score employs a word vector model to calculate the word vectors for both the topic and the text. This model is a technique that translates vocabulary into numerical vectors, capturing the semantic relationships among words. By determining the degree of correlation between the word vectors of the topic and the text, the performance of the text in terms of relevance can be more accurately assessed. This method is useful for evaluating individual texts as well as comparing the relevance between different texts, offering authors more intuitive evaluation indicators.
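A minimal sketch of this idea, assuming averaged word vectors and cosine similarity as the measure of correlation: the post does not say which similarity measure or embedding model the service uses, and the tiny 3-dimensional `TOY_VECTORS` table below is invented purely for illustration.

```python
import math
from typing import Dict, List

# Toy 3-dimensional embeddings, invented for this example.
TOY_VECTORS: Dict[str, List[float]] = {
    "travel": [0.9, 0.1, 0.0],
    "flight": [0.8, 0.2, 0.1],
    "hotel": [0.7, 0.3, 0.0],
    "recipe": [0.0, 0.2, 0.9],
    "oven": [0.1, 0.1, 0.8],
}


def mean_vector(words: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of all known words; empty list if none are known."""
    known = [vectors[word] for word in words if word in vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, or 0.0 when either vector is missing or zero."""
    if not a or not b:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def topic_relevance(topic: str, text: str,
                    vectors: Dict[str, List[float]] = TOY_VECTORS) -> float:
    """Similarity between the averaged topic vector and the averaged text vector."""
    topic_vec = mean_vector(topic.lower().split(), vectors)
    text_vec = mean_vector(text.lower().split(), vectors)
    return cosine(topic_vec, text_vec)


on_topic = topic_relevance("travel", "flight hotel")
off_topic = topic_relevance("travel", "recipe oven")
```

With these toy vectors, a response about flights and hotels scores much closer to the topic "travel" than a response about recipes and ovens.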

The introduction of the Topic Score allows for a more comprehensive assessment of a text's quality, aiding authors in enhancing their writing skills and fostering the creation and dissemination of high-quality content.

Benchmark

We evaluate our speech assessment services using both internal and third-party tests. The tests cover both the new features (prosody and content) and the existing ones (accuracy and fluency). Several top commercial speech assessment services are evaluated as well, and their results are reported as Competitor. We use the Pearson Correlation Coefficient (PCC) to measure the correlation between predicted scores and ground-truth human labels. The PCC takes a value between -1 and 1: a negative value means the prediction moves opposite to the target, and a positive value means the prediction is aligned with the target. Values close to 1 indicate strong positive correlation, and 0 means no linear correlation.
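For reference, the PCC between a list of predicted scores and a list of human labels can be computed as below. This is the standard formula written out in plain Python, and the sample scores are made up for illustration. They are not taken from the evaluation.

```python
import math
from typing import Sequence


def pearson(predicted: Sequence[float], labels: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(labels) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_l = sum(labels) / n
    covariance = sum((p - mean_p) * (l - mean_l) for p, l in zip(predicted, labels))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    if var_p == 0 or var_l == 0:
        raise ValueError("PCC is undefined when one sequence is constant")
    return covariance / math.sqrt(var_p * var_l)


# Made-up example scores: system predictions versus averaged expert labels.
system_scores = [62.0, 75.0, 88.0, 93.0]
expert_labels = [60.0, 70.0, 85.0, 95.0]
pcc = pearson(system_scores, expert_labels)
```

Because the PCC only measures linear agreement, a system whose scores are consistently offset from the human labels can still reach a high value.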

The internal evaluation is first run on the open-source dataset SpeechOcean762, which contains 5,000 utterances labeled by 5 individual language experts. Following the official protocol, we evaluate all systems on the evaluation set (2,500 utterances). The results in Table 1 show that our prosody model achieves the best PCC.

PCC	Competitor 1	Competitor 2	Competitor 3	MSFT
Accuracy	0.67	0.68	0.69	0.70
Fluency	0.67	0.63	0.74	0.72
Prosody	0.59	N/A	0.79	0.84

Table 1. Results of internal evaluation

We then evaluate vocabulary and grammar scores on an internal dataset of 200 utterances, each labeled by at least 3 experts. Our system reaches a PCC of 0.65 on the vocabulary score and 0.68 on the grammar score. Because content scoring is a new feature on the market, top competitors do not provide comparable features.

We also commissioned a third-party company to run external evaluations using publicly available APIs, on a test set unknown to all 6 service providers and spanning diverse domains in English. The results are shown in Table 2. We achieved the best PCC in all available dimensions: word-level accuracy, utterance-level accuracy, fluency and prosody.

PCC	Competitor 1	Competitor 2	Competitor 3	Competitor 4	Competitor 5	MSFT
Accuracy (Word)	0.46	0.19	0.39	0.49	0.40	0.52
Accuracy (Utterance)	0.57	0.57	0.49	0.59	0.55	0.60
Fluency	0.44	0.45	0.42	N/A	0.42	0.49
Prosody	N/A	0.55	N/A	N/A	N/A	0.60

Table 2. Results of third-party evaluation

Get started

To learn more and get started, first try our no-code tool for Reading and Speaking evaluation in AI Studio, which lets you explore the Speech service through an intuitive user interface. You will need an Azure account and a Speech service resource to create your own personal account in Speech Studio. If you do not have an account and subscription, try the Speech service for free.