Speech Recognition for Singlish

Mithun Prasad, PhD, Senior Data Scientist at Microsoft

Manprit Singh, MS, Data & AI Strategist at Microsoft

 

Much of human communication happens through speech. Today, digital systems that act as a medium for human speech constantly capture and store it, generating a data stream of gigantic proportions. As more and more systems provide speech interfaces, it becomes critical to be able to analyze these interactions. Market trends indicate that voice is the future of HCI (Human-Computer Interaction).

However, in the absence of systems that can convert speech to text (S2T), this humongous source of insights remains untapped and is usually referred to as a form of dark data. Recent advances in artificial intelligence have made S2T more and more accessible. The Microsoft S2T API covers 80+ locales and languages and supports multiple accents and dialects. It also offers mixed-language support for languages such as Hinglish (Hindi + English). However, even with this broad out-of-the-box linguistic support, it is impossible to cover every possible language and scenario. The ability to transcribe speech in local dialects, slang, and domain-, industry-, or organization-specific vocabulary is a critical success factor in the adoption of S2T technology.

 

Objective:

To enable this last-mile adaptation of S2T models, we want to bring together the best the current speech transcription landscape has to offer and present it in a coherent platform that businesses can leverage to get a head start on S2T adaptation use cases.

 

Putting transfer learning to work:

"Transfer learning will be the next driver of ML success." - Andrew Ng, in his Neural Information Processing Systems (NIPS) 2016 tutorial

The Microsoft S2T API supports customization of its base models to further increase accuracy on customer-specific use cases involving local dialects, slang, or domain-, industry-, or organization-specific language. Training a custom speech-to-text model can improve recognition accuracy over Microsoft's baseline model. A model is trained using human-labeled transcriptions and related text. These datasets, along with previously uploaded audio data, are used to refine and train the speech-to-text model. When using speech-to-text for recognition and transcription in a unique environment, you can create and train custom acoustic and language models. Customization is helpful for addressing ambient noise or industry-specific vocabulary.

You can apply language model adaptation with related text, sentences, or utterances to improve recognition accuracy on industry-specific vocabulary and grammar, such as medical terminology or IT jargon.

Acoustic model customization improves recognition accuracy for particular speaking styles, accents, or specific background noises.

Microsoft provides a tool called Custom Speech to perform these customizations.
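As a concrete illustration, here is a minimal sketch of how an application might transcribe an audio file against a deployed Custom Speech model using the Azure Speech SDK for Python. The subscription key, region, endpoint ID, and file name are placeholders, not values from this project.

```python
# Minimal sketch: transcribing a WAV file with a deployed Custom Speech model.
# The key, region, endpoint ID, and file name below are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<your-speech-key>",
                                       region="southeastasia")
# Point the recognizer at the custom (locally adapted) model instead of the base model.
speech_config.endpoint_id = "<your-custom-model-endpoint-id>"

audio_config = speechsdk.audio.AudioConfig(filename="sample_singlish.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
else:
    print("Recognition did not succeed:", result.reason)
```

The same custom endpoint ID can be reused with batch transcription if large volumes of recordings need to be processed offline.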

 

Singapore and its unique creole Singlish

Singlish is a local form of English spoken in Singapore that blends words borrowed from its cultural mix of communities. Singapore's unique melting-pot demographics created prolonged language contact between speakers of many different languages, including Hokkien, Malay, Teochew, Cantonese, and Tamil. Singlish arose out of this unique situation.

An example of what Singlish looks like:

[Image: an example Singlish sentence]

Linguists call such a language fusion a creole. Creoles emerge through a process called creolization: a creole language, or simply creole, is a stable natural language that develops from the simplifying and mixing of different languages into a new one within a fairly brief period of time.

 

S2T in Singapore:

There is tremendous interest in Singapore to understand Singlish. A speech recognition system that could interpret and process the unique vocabulary used by Singaporeans (including Singlish and dialects) in daily conversations is very valuable. For example, an automatic speech transcribing system could be deployed at various government agencies and companies to assist frontline officers in acquiring relevant and actionable information while they focus on interacting with customers or service users to address their queries and concerns.

Efforts are underway to transcribe emergency calls at Singapore's Civil Defence Force (SCDF), and AI Singapore has launched Speech Lab to channel efforts in this direction.

 

The Infocomm Media Development Authority (IMDA) of Singapore has released the IMDA National Speech Corpus, which gives local AI developers the ability to customize AI solutions with locally accented speech data.

The IMDA National Speech Corpus consists of:

• A three-part speech corpus, each part containing 1,000 hours of recordings of phonetically balanced scripts from around 1,000 local English speakers.

• Audio recordings featuring words describing people, daily life, food, locations, and brands commonly found in Singapore. These are recorded in quiet rooms using a combination of microphones and mobile phones to add acoustic variety.

• Text files containing the transcripts. Of note are Singlish terms such as ‘ar’, ‘lor’, etc.

 

This is a bounty for the open AI community, accelerating efforts towards speech adaptation. With such efforts, the local AI community and businesses are poised for major breakthroughs in Singlish speech recognition in the coming years.

 

We leveraged the IMDA National Speech Corpus as a starting point to see how adding audio snippets from locally accented speakers drives up transcription accuracy. An overview of the uplift is shown in the chart below. Without any customization, the model achieved 73% accuracy on the holdout set. As more snippets were added, accuracy rose, confirming that with the right human-annotated speech datasets we can drive accuracy up.

 

[Chart: transcription accuracy (left) and Word Error Rate (right) as more audio snippets are added]

 

The left shows the uplift in accuracy; the right correspondingly shows the Word Error Rate dropping as more audio snippets are added.

 

Keeping the human in the loop

 

The speech recognition models learn from humans, based on “human-in-the-loop learning”. Human-in-the-Loop Machine Learning is when humans and Machine Learning processes interact to solve one or more of the following:

  • Making Machine Learning more accurate
  • Getting Machine Learning to the desired accuracy faster
  • Making humans more accurate
  • Making humans more efficient

 

An illustration of what a human in the loop looks like is as follows. 

[Diagram: human-in-the-loop learning]

 

In a nutshell, human-in-the-loop learning gives the AI the right calibration at appropriate junctures. An AI model starts learning a task, and its progress can eventually plateau over time. Timely interventions by a human in the loop can give the model the right nudge.

 

Not everybody has access to large volumes of call center logs and conversation recordings from local speakers, which are key sources of data for training localized speech transcription AI. In the absence of significant amounts of locally accented data with ground-truth annotations, and given our belief that transfer learning is a powerful driver in accelerating AI development, we leverage existing models and maximize their ability to understand local accents.

 

[Diagram: overview of the speech adaptation framework]

 

 

The framework allows extensive room for human-in-the-loop learning and can connect with AI models from both cloud providers and open-source projects. A detailed treatment of the components in the framework includes:

  1. The speech-to-text model can be any kind of Automatic Speech Recognition (ASR) engine or Custom Speech API, running in the cloud or on premises. The platform is designed to be agnostic to the ASR technology being used.
  2. Search over ground-truth snippets. In many cases, once a result is available, a quick search of the training records can show how many records a given term was trained on, and so on.
  3. Breakdown of Word Error Rate (WER): the industry standard for measuring Automatic Speech Recognition (ASR) systems is the Word Error Rate, defined as

WER = (S + D + I) / N

 

where S is the number of words substituted, D the number of words deleted, and I the number of words inserted by the ASR engine, and N is the total number of words in the human-labeled (reference) transcript.

 

A simple example illustrating this is shown below, where there is 1 deletion, 1 insertion, and 1 substitution in a total of 5 words in the human-labelled transcript.

 

[Image: word-level alignment of the ground truth and the ASR transcript, showing 1 deletion, 1 insertion, and 1 substitution]

 

Word Error Rate comparison between ground truth and transcript (Source: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech-evalua...)

 

So the WER of this result is 3/5, or 0.6. Most ASR engines return the overall WER number, and some also return the split between insertions, deletions, and substitutions.
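To make that split concrete, below is a minimal sketch (not the platform's actual code) of computing the overall WER together with the counts of substitutions, deletions, and insertions via edit-distance alignment. The example sentences are invented to mirror the 1-substitution, 1-deletion, 1-insertion case above.

```python
# Minimal sketch: word-level WER with a split of substitutions, deletions,
# and insertions, computed by dynamic-programming edit-distance alignment.
def wer_breakdown(reference: str, hypothesis: str):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)              # ref words with no hypothesis: deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)              # hypothesis words with no ref: insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # exact match, no edit
            else:
                c, s, d, ins = dp[i - 1][j - 1]
                sub = (c + 1, s + 1, d, ins)
                c, s, d, ins = dp[i - 1][j]
                dele = (c + 1, s, d + 1, ins)
                c, s, d, ins = dp[i][j - 1]
                insr = (c + 1, s, d, ins + 1)
                dp[i][j] = min(sub, dele, insr)
    cost, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {"substitutions": subs, "deletions": dels, "insertions": ins,
            "wer": cost / len(ref) if ref else 0.0}

# Invented 5-word example mirroring the case above: 1 sub, 1 del, 1 ins -> WER 3/5
print(wer_breakdown("hello this is a test", "hello these is test again"))
# {'substitutions': 1, 'deletions': 1, 'insertions': 1, 'wer': 0.6}
```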

 

In our platform, however, we provide a detailed split between the insertions, substitutions, and deletions:

  1. The platform has ready interfaces that allow human annotators to plug in audio files with relevant labeled transcriptions to augment the data.
  2. It ships with dashboards that show detailed substitutions, such as how often the term ‘kaypoh’ was transcribed as ‘people’ (see the sketch after this list).
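As an illustration of the kind of bookkeeping that can sit behind such a dashboard, the sketch below (an assumption, not the platform's actual code) tallies which ground-truth words are replaced by which transcribed words across a set of evaluation pairs, using Python's difflib for an approximate word-level alignment. The example sentences are invented.

```python
# Hypothetical sketch of the bookkeeping behind a substitution dashboard:
# tally which ground-truth words get replaced by which transcribed words.
from collections import Counter
from difflib import SequenceMatcher

def substitution_counts(pairs):
    """pairs: iterable of (ground_truth, asr_transcript) string pairs."""
    counts = Counter()
    for truth, transcript in pairs:
        t, h = truth.lower().split(), transcript.lower().split()
        # get_opcodes() yields equal/replace/delete/insert spans over the word lists
        for op, i1, i2, j1, j2 in SequenceMatcher(None, t, h).get_opcodes():
            if op == "replace":
                # pair up the replaced spans word by word as substitutions
                for ref_word, hyp_word in zip(t[i1:i2], h[j1:j2]):
                    counts[(ref_word, hyp_word)] += 1
    return counts

# Invented evaluation pair for illustration
pairs = [("he is very kaypoh one", "he is very people one")]
print(substitution_counts(pairs).most_common(3))
# [(('kaypoh', 'people'), 1)]
```

Aggregated over a whole evaluation set, such counts can be surfaced directly as the substitution table a reviewer drills into.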

The crux of the platform is the ability to monitor and control transcription accuracy by getting a detailed overview of how often the engine has trouble transcribing certain vocabulary, and allowing a human to give the model the right nudges.

 

References and useful links

  1. https://govinsider.asia/security/singapore-scdf-use-ai-for-singlish-995-calls/
  2. https://yourstory.com/2019/03/why-voice-is-the-future-of-user-interfaces-1z2ue7nq80?utm_pageloadtype...
  3. https://www.mckinsey.com/business-functions/operations/our-insights/how-advanced-analytics-can-help-...
  4. https://www.straitstimes.com/singapore/automated-system-transcribing-995-calls-may-also-recognise-si...
  5. https://www.aisingapore.org/2018/07/ai-singapore-harnesses-advanced-speech-technology-to-help-organi...
  6. https://livebook.manning.com/book/human-in-the-loop-machine-learning/chapter-1/v-6/17
  7. https://www.youtube.com/watch?v=F1ka6a13S9I
  8. https://ruder.io/transfer-learning/
  9. https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus

 

*** This work was performed in collaboration with Avanade Data & AI and Microsoft.

 
