Azure Speech To Text Phoneme Detection

Question

Hello, I am working on a very niche speech detection app. Azure has been very helpful for it, but I still have some large hurdles to cross.

I would like to be able to detect a user sounding out individual phonemes. Right now, Azure's STT can split words into phonemes for you, but it refuses to provide transcriptions of phonemes by themselves. For instance, Azure will happily transcribe audio of you saying "la" as the phonemes /l/ and /a/, but if you make only the "L" sound with no vowel, Azure will not respond with any phoneme data and will keep waiting for more audio. Is there any way to force Azure STT responses to be as granular as possible? I would like to be able to detect isolated phonemes even when they do not combine into a word. I am interfacing with Azure through Unity, FYI.

Thanks

kidd_ip · Answer

How about this:

Use Pronunciation Assessment: Azure offers a pronunciation assessment feature that evaluates speech pronunciation and provides feedback on the accuracy and fluency of spoken audio. This might help in detecting individual phonemes more accurately.
Adjust Configuration Parameters: Ensure that your SpeechRecognizer configuration is set up correctly. You can specify the language and enable prosody and content assessment to improve phoneme detection.
Continuous Recognition Mode: If your audio files exceed 30 seconds, consider using continuous recognition mode for processing. This mode allows for uninterrupted streaming and might help in detecting isolated phonemes.
Customize Speech Recognition Model: You might need to customize the speech recognition model to better suit your needs. This could involve training the model with specific phoneme data to improve its ability to detect isolated sounds.
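
To make the first two suggestions more concrete, here is a minimal sketch of the JSON involved in a phoneme-level pronunciation assessment. It is written in Python using only the standard library, so it does not call the service; the same JSON shapes should carry over to the C# SDK used from Unity. The field names follow the documented pronunciation assessment configuration and result formats as I understand them. The helper names (build_assessment_config, extract_phonemes) and the SAMPLE_RESULT payload are hypothetical and exist only to illustrate the structure. Note that pronunciation assessment is normally run against a reference text, and nothing here guarantees the service will return a result for a lone consonant.

```python
import json
from typing import List, Tuple


def build_assessment_config(reference_text: str) -> str:
    """Build a JSON config string requesting phoneme-level granularity.

    Hypothetical helper: the keys mirror the documented pronunciation
    assessment config, but check them against the SDK version in use.
    """
    return json.dumps({
        "referenceText": reference_text,
        "gradingSystem": "HundredMark",
        "granularity": "Phoneme",      # finest level the feature exposes
        "phonemeAlphabet": "IPA",
        "nBestPhonemeCount": 3,        # ask for alternative phoneme guesses
    })


def extract_phonemes(result_json: str) -> List[Tuple[str, float]]:
    """Return (phoneme, accuracy score) pairs from the top recognition candidate."""
    result = json.loads(result_json)
    pairs = []
    for candidate in result.get("NBest", [])[:1]:   # top candidate only
        for word in candidate.get("Words", []):
            for phoneme in word.get("Phonemes", []):
                assessment = phoneme.get("PronunciationAssessment", {})
                pairs.append((phoneme["Phoneme"], assessment.get("AccuracyScore", 0.0)))
    return pairs


# Hypothetical payload, shaped like a detailed result for the word "la".
# The scores are made up for illustration, not real service output.
SAMPLE_RESULT = json.dumps({
    "NBest": [{
        "Words": [{
            "Word": "la",
            "Phonemes": [
                {"Phoneme": "l", "PronunciationAssessment": {"AccuracyScore": 90.0}},
                {"Phoneme": "a", "PronunciationAssessment": {"AccuracyScore": 85.0}},
            ],
        }],
    }],
})

if __name__ == "__main__":
    print(build_assessment_config("la"))
    print(extract_phonemes(SAMPLE_RESULT))
```

With the SDK, the config string would be passed to PronunciationAssessmentConfig (for example PronunciationAssessmentConfig.FromJson in C#) and applied to the recognizer, and the result JSON would be read from the recognition result's SpeechServiceResponse_JsonResult property. Treat those call sites as something to verify against the SDK reference rather than as tested code.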
