Build an application that transcribes speech

Microsoft

May 14, 2021

One of the most common ways to benefit from AI services in your apps is to use Speech to Text capabilities for scenarios ranging from captioning audio and video to transcribing phone conversations and meetings. Speech service, an Azure Cognitive Service, offers speech transcription via its Speech to Text API in more than 94 languages/locales, and that number is growing.

In this article, we show you how to integrate real-time speech transcription into a mobile app for a simple note-taking scenario. Users will be able to record notes and see the transcript appear as they speak. Our Speech SDK supports a variety of operating systems and programming languages. Here we write the application in Java to run on Android.

Common Speech To Text scenarios

The Azure Speech Service provides accurate Speech to Text capabilities that can be used for a wide range of scenarios. Here are some common examples:

Audio/Video captioning. Create captions for audio and video content using either batch transcription or real-time transcription.
Call Center Transcription and Analytics. Gain insights from the interactions call center agents have with your customers by transcribing these calls and applying sentiment analysis, keyword extraction and more.
Voice Assistants. Voice assistants built on the Speech service empower developers to create natural, human-like conversational interfaces for their applications and experiences. You can add voice in and voice out capabilities to your flexible and versatile bot built using Azure Bot Service with the Direct Line Speech channel, or leverage the simplicity of authoring a Custom Commands app for straightforward voice commanding scenarios.
Meeting Transcription. Microsoft Teams provides live meeting transcription with speaker attribution that makes meetings more accessible and easier to follow. This capability is powered by the Azure Speech Service.
Dictation. Microsoft Word provides the ability to dictate your documents powered by the Azure Speech Service. It's a quick and easy way to get your thoughts out, create drafts or outlines, and capture notes.

How to build real-time speech transcription into your mobile app

Prerequisites

As a basis for our sample app, we are going to use the “Recognize speech from a microphone in Java on Android” GitHub sample that can be found here. After cloning the cognitive-services-speech-sdk GitHub repo, we can use Android Studio version 3.1 or higher to open the project under samples/java/android/sdkdemo. This repo also contains similar samples for various other operating systems and programming languages.

In order to use the Azure Speech Service you will have to create a Speech service resource in Azure as described here. This will provide you with the subscription key for your resource in your chosen service region that you need to use in the sample app.

To try out speech recognition with the sample app, you only need to update the configuration for speech recognition by filling in your subscription key and service region at the top of the MainActivity.java source file:

//
// Configuration for speech recognition
//

// Replace below with your own subscription key
private static final String SpeechSubscriptionKey = "YourSubscriptionKey";
// Replace below with your own service region (e.g., "westus").
private static final String SpeechRegion = "YourServiceRegion";

You can leave the configuration for intent recognition as-is since we are just interested in the speech to text functionality here.
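Leaving the placeholder values in place is an easy mistake that only surfaces when recognition fails. A small guard along the following lines could catch it early. This is a minimal sketch: SpeechSettingsCheck and isConfigured are hypothetical names that are not part of the sample or the Speech SDK, and the check only compares against the placeholder strings shown above. It does not verify that a key is actually valid.

```java
/** Hypothetical helper (not part of the sample) that detects unreplaced placeholder settings. */
final class SpeechSettingsCheck {
    // Placeholder values as they appear in the sample's MainActivity.java.
    static final String KEY_PLACEHOLDER = "YourSubscriptionKey";
    static final String REGION_PLACEHOLDER = "YourServiceRegion";

    private SpeechSettingsCheck() {}

    /** Returns true when both values are non-blank and differ from the placeholders. */
    static boolean isConfigured(String subscriptionKey, String region) {
        return isSet(subscriptionKey, KEY_PLACEHOLDER) && isSet(region, REGION_PLACEHOLDER);
    }

    private static boolean isSet(String value, String placeholder) {
        return value != null && !value.trim().isEmpty() && !value.equals(placeholder);
    }
}
```

In the sample, such a check could run before the SpeechConfig is created, for example SpeechSettingsCheck.isConfigured(SpeechSubscriptionKey, SpeechRegion), with a message shown to the user when it returns false.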

After you have updated the configuration, you can build and run your sample. Ideally, run the application on a physical Android phone, since you will need microphone input.

Trying out the sample

On first use, the application will ask you for the needed application permissions. Then the sample application provides a few options for you to use. Since we want users to be able to capture a longer note we will use the Recognize continuously option.

With this option the recognized text will show up at the bottom of the screen as you speak, and you can speak for a while with some longer pauses in between. Recognition will stop when you hit the stop button. So, this will allow you to capture a longer note.

This is what you should see when you try out the application:

Code Walkthrough

Now that you have this sample working and you have tried it out, let’s look at the key portions of the code that are needed to get the transcript. These can all be found in the MainActivity.java source file.

First, in the onCreate function, we need to ask for permission to access the microphone, internet, and storage:

int permissionRequestId = 5;

// Request permissions needed for speech recognition
ActivityCompat.requestPermissions(MainActivity.this, new String[]{RECORD_AUDIO, INTERNET, READ_EXTERNAL_STORAGE}, permissionRequestId);
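The request above only asks for the permissions. Android reports the outcome later through the activity's onRequestPermissionsResult(int requestCode, String[] permissions, int[] grantResults) callback, using the same request ID. The walkthrough does not show how the sample handles that callback, so the following is only a plain-Java sketch of the check such a callback could perform. PermissionResults and missing are hypothetical names, and the two constants mirror the values of PackageManager.PERMISSION_GRANTED and PackageManager.PERMISSION_DENIED so that the snippet runs without the Android SDK.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java sketch of a permission check; not taken from the sample. */
final class PermissionResults {
    // Mirrors android.content.pm.PackageManager.PERMISSION_GRANTED (value 0).
    static final int PERMISSION_GRANTED = 0;
    // Mirrors android.content.pm.PackageManager.PERMISSION_DENIED (value -1).
    static final int PERMISSION_DENIED = -1;

    private PermissionResults() {}

    /**
     * Returns the required permissions that were not granted.
     * Empty result arrays (a cancelled request) leave every required permission missing.
     */
    static List<String> missing(String[] required, String[] permissions, int[] grantResults) {
        Set<String> granted = new HashSet<>();
        int n = Math.min(permissions.length, grantResults.length);
        for (int i = 0; i < n; i++) {
            if (grantResults[i] == PERMISSION_GRANTED) {
                granted.add(permissions[i]);
            }
        }
        List<String> missing = new ArrayList<>();
        for (String permission : required) {
            if (!granted.contains(permission)) {
                missing.add(permission);
            }
        }
        return missing;
    }
}
```

If RECORD_AUDIO ends up in the returned list, the app could disable the recognition buttons and explain why, instead of failing when the microphone stream is opened.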

Next, we need to create a SpeechConfig that provides the subscription key and region so we can access the speech service:

// create config
final SpeechConfig speechConfig;
try {
    speechConfig = SpeechConfig.fromSubscription(SpeechSubscriptionKey, SpeechRegion);
} catch (Exception ex) {
    System.out.println(ex.getMessage());
    displayException(ex);
    return;
}

The main work to recognize the spoken audio is done in the click listener registered on recognizeContinuousButton. Its onClick method is invoked when the Recognize continuously button is pressed:

///////////////////////////////////////////////////
// recognize continuously
///////////////////////////////////////////////////
recognizeContinuousButton.setOnClickListener(new View.OnClickListener() {

First, a new recognizer is created from the speechConfig we created earlier and the audio input from the microphone:

audioInput = AudioConfig.fromStreamInput(createMicrophoneStream());
reco = new SpeechRecognizer(speechConfig, audioInput);

Besides getting the audio stream from a microphone, you could also use audio from a file or another stream, for example.

Next, two event listeners are registered. The first one is for the Recognizing event, which signals intermediate recognition results. These are generated as words are being recognized and give a preliminary indication of the recognized text. The second one is for the Recognized event, which signals the completion of a recognition. These events are produced when a long enough pause in the speech is detected, and they carry the final recognition result for that part of the audio.

reco.recognizing.addEventListener((o, speechRecognitionResultEventArgs) -> {
    final String s = speechRecognitionResultEventArgs.getResult().getText();
    Log.i(logTag, "Intermediate result received: " + s);
    content.add(s);
    setRecognizedText(TextUtils.join(" ", content));
    content.remove(content.size() - 1);
});

reco.recognized.addEventListener((o, speechRecognitionResultEventArgs) -> {
    final String s = speechRecognitionResultEventArgs.getResult().getText();
    Log.i(logTag, "Final result received: " + s);
    content.add(s);
    setRecognizedText(TextUtils.join(" ", content));
});
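The two listeners above share one pattern: an intermediate result is appended only for display and then removed again, while a final result stays in the list. That bookkeeping can be isolated into a small class, sketched below in plain Java so it can be tried without the Speech SDK. TranscriptBuffer, preview, and commit are hypothetical names, not part of the sample. Skipping blank results and synchronizing the methods are assumptions of this sketch, not behavior the sample is known to have.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical transcript holder that separates preview text from finalized text. */
final class TranscriptBuffer {
    private final List<String> finalized = new ArrayList<>();

    /** Text to show while a phrase is still being recognized; stored state is unchanged. */
    synchronized String preview(String intermediate) {
        if (intermediate == null || intermediate.trim().isEmpty()) {
            return text();
        }
        List<String> shown = new ArrayList<>(finalized);
        shown.add(intermediate);
        return String.join(" ", shown);
    }

    /** Stores a final phrase (blank results are skipped) and returns the transcript so far. */
    synchronized String commit(String finalText) {
        if (finalText != null && !finalText.trim().isEmpty()) {
            finalized.add(finalText);
        }
        return text();
    }

    /** All finalized phrases joined by single spaces. */
    synchronized String text() {
        return String.join(" ", finalized);
    }
}
```

With such a class, the Recognizing listener would pass the result of preview(s) to setRecognizedText, and the Recognized listener would pass the result of commit(s).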

Lastly, recognition is started using startContinuousRecognitionAsync(). Once it has started, the label of the clicked button is changed to Stop:

final Future<Void> task = reco.startContinuousRecognitionAsync();
setOnTaskCompletedListener(task, result -> {
    continuousListeningStarted = true;
    MainActivity.this.runOnUiThread(() -> {
        buttonText = clickedButton.getText().toString();
        clickedButton.setText("Stop");
        clickedButton.setEnabled(true);
    });
});
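The setOnTaskCompletedListener helper is not shown in this walkthrough. Because startContinuousRecognitionAsync() returns a plain Future, one way to write such a helper is to wait for the Future on a background thread and then invoke a callback. The sketch below illustrates that idea with the standard library only. TaskListeners and onCompleted are hypothetical names, and the sample's actual helper may be implemented differently.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical helper that runs a callback once a Future has completed. */
final class TaskListeners {
    // Daemon threads so that pending waits never keep the JVM alive.
    private final ExecutorService executor = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable);
        thread.setDaemon(true);
        return thread;
    });

    /** Waits for the task on a worker thread, then calls onSuccess or onError there. */
    <T> void onCompleted(Future<T> task, Consumer<T> onSuccess, Consumer<Exception> onError) {
        executor.submit(() -> {
            try {
                T result = task.get(); // blocks the worker thread, not the caller
                onSuccess.accept(result);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                onError.accept(e);
            } catch (ExecutionException e) {
                onError.accept(e);
            }
        });
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

The callbacks run on the worker thread, which is why the sample wraps its UI changes in runOnUiThread.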

When the stop button is pressed, recognition is stopped by calling stopContinuousRecognitionAsync():

if (continuousListeningStarted) {
    if (reco != null) {
        final Future<Void> task = reco.stopContinuousRecognitionAsync();
        setOnTaskCompletedListener(task, result -> {
            Log.i(logTag, "Continuous recognition stopped.");
            MainActivity.this.runOnUiThread(() -> {
                clickedButton.setText(buttonText);
            });
            enableButtons();
            continuousListeningStarted = false;
        });
    } else {
        continuousListeningStarted = false;
    }

    return;
}

That is all that is needed to integrate Speech to Text into your application.

Next Steps:

Take a look at the Speech-to-text quickstart
Note that there are default limits to the number of concurrent recognitions. See Speech service Quotas and Limits - Azure Cognitive Services for information on these limits, how to increase them and best practices.
Take a look at supported spoken languages
Explore how to improve accuracy with Custom Speech by training a custom model that is adapted to the words and phrases used in your scenario.
Learn more about other things you can do with Speech service.

Updated May 13, 2021

HeikoRa
