We are pleased to announce the public preview of Real-time diarization in Azure AI Speech. This new feature offers real-time transcription while simultaneously identifying speakers, making it an invaluable tool for a variety of scenarios.
Diarization refers to the ability to tell who spoke and when. It differentiates speakers in mono-channel audio input based on their voice characteristics, which makes it possible to identify who said what during a conversation. This is useful in a variety of scenarios such as doctor-patient conversations, agent-customer interactions, and court proceedings.
What’s available in Public Preview
Developers can access real-time diarization through the Speech SDK, which offers two APIs: ConversationTranscriber and MeetingTranscriber. ConversationTranscriber labels speakers as GUEST1, GUEST2, and so on, while MeetingTranscriber identifies speakers by their real names.
The ConversationTranscriber API combines diarization with speech to text to produce transcription output that contains a speaker entry for each transcribed utterance. Output is tagged as GUEST1, GUEST2, GUEST3, and so on, based on the number of speakers in the audio. The ConversationTranscriber API is modeled on the SpeechRecognizer API, which makes it easy to move between the two. Because ConversationTranscriber uses the Speech to Text endpoint, it supports the audio formats and features that the endpoint supports, such as custom phrase lists, language identification, and word-level timings.
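For example, those endpoint features can be configured before transcription starts. The snippet below is a minimal sketch: the key, region, file name, and phrase are placeholders, and it assumes PhraseListGrammar.FromRecognizer accepts a ConversationTranscriber the same way it accepts a SpeechRecognizer.

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Placeholder key, region, file name, and phrase for illustration only.
var speechConfig = SpeechConfig.FromSubscription("YourSpeechKey", "YourSpeechRegion");
speechConfig.SpeechRecognitionLanguage = "en-US";

// Request word-level timings in the detailed recognition output.
speechConfig.RequestWordLevelTimestamps();
speechConfig.OutputFormat = OutputFormat.Detailed;

using var audioConfig = AudioConfig.FromWavFileInput("conversation.wav");
using var conversationTranscriber = new ConversationTranscriber(speechConfig, audioConfig);

// Bias recognition toward domain-specific terms with a phrase list
// (assumes FromRecognizer accepts a ConversationTranscriber).
var phraseList = PhraseListGrammar.FromRecognizer(conversationTranscriber);
phraseList.AddPhrase("Contoso");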
The MeetingTranscriber API identifies speakers by their real names instead of GUEST1, GUEST2, GUEST3, and so on. It also supports adding participants to and removing them from a meeting.
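A rough sketch of how that might look is below. The meeting ID, participant IDs, key, region, and file name are placeholders, and the types from the Microsoft.CognitiveServices.Speech.Transcription namespace are shown as an assumption about the API shape, so check the published quickstart for the exact surface.

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;

// Placeholder key, region, file, meeting ID, and participant IDs.
var speechConfig = SpeechConfig.FromSubscription("YourSpeechKey", "YourSpeechRegion");
speechConfig.SpeechRecognitionLanguage = "en-US";

using var audioConfig = AudioConfig.FromWavFileInput("meeting.wav");
var meeting = await Meeting.CreateMeetingAsync(speechConfig, "meeting-session-id");
using var meetingTranscriber = new MeetingTranscriber(audioConfig);

meetingTranscriber.Transcribed += (s, e) =>
    Console.WriteLine($"TRANSCRIBED: Text={e.Result.Text} Speaker={e.Result.UserId}");

// Participants can be added to (and removed from) the meeting while it runs.
await meeting.AddParticipantAsync(Participant.From("katie@contoso.com", "en-US", ""));
await meeting.AddParticipantAsync(Participant.From("steve@contoso.com", "en-US", ""));

await meetingTranscriber.JoinMeetingAsync(meeting);
await meetingTranscriber.StartTranscribingAsync();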
Real-time diarization will be released through the Speech SDK and therefore supports the programming languages the Speech SDK supports, such as C# (.NET), Python, Java, and JavaScript. It will be available in all regions and locales that Azure AI Speech supports.
Use cases and scenarios
Real-time diarization is a feature that many customers have requested to support an array of use cases. It not only enables users to run speech analytics and gain insights from transcriptions by identifying speakers, but it can also improve the accessibility of transcripts. We anticipate real-time diarization being used in scenarios such as:
- Live Conversation Transcription
When speakers are all in the same room using a single-microphone setup, transcribe the conversation live and show which speaker (e.g., GUEST1, GUEST2, or GUEST3) said what.
- Conversation Transcription on Prerecorded Live Events
Stream prerecorded live events acoustically to a microphone and get a transcription that shows which speaker (e.g., GUEST1, GUEST2, or GUEST3) said what.
- Live Captions and Subtitles
Show live captions or subtitles for meetings, videos, or audio.
Getting started
The public preview of real-time diarization will be available in Speech SDK version 1.31.0, which will be released in early August.
Follow the steps below to create a new console application, install the Speech SDK, and try out real-time diarization from a file with the ConversationTranscriber API. We will also release detailed documentation, including a quickstart and sample code, when the public preview is available.
- Open a command prompt where you want the new project and create a console application with the .NET CLI. The Program.cs file should be created in the project directory.
dotnet new console
- Install the Speech SDK in your new project with the .NET CLI.
dotnet add package Microsoft.CognitiveServices.Speech
- Replace the contents of Program.cs with the following code.
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program
{
    // This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
    static string speechKey = Environment.GetEnvironmentVariable("SPEECH_KEY");
    static string speechRegion = Environment.GetEnvironmentVariable("SPEECH_REGION");

    async static Task Main(string[] args)
    {
        var filepath = "katiesteve.wav";
        var speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);
        speechConfig.SpeechRecognitionLanguage = "en-US";

        var stopRecognition = new TaskCompletionSource<int>(TaskCreationOptions.RunContinuationsAsynchronously);

        // Create an audio stream from a wav file or from the default microphone
        using (var audioConfig = AudioConfig.FromWavFileInput(filepath))
        {
            // Create a conversation transcriber using audio stream input
            using (var conversationTranscriber = new ConversationTranscriber(speechConfig, audioConfig))
            {
                conversationTranscriber.Transcribing += (s, e) =>
                {
                    Console.WriteLine($"TRANSCRIBING: Text={e.Result.Text}");
                };

                conversationTranscriber.Transcribed += (s, e) =>
                {
                    if (e.Result.Reason == ResultReason.RecognizedSpeech)
                    {
                        Console.WriteLine($"TRANSCRIBED: Text={e.Result.Text} Speaker ID={e.Result.SpeakerId}");
                    }
                    else if (e.Result.Reason == ResultReason.NoMatch)
                    {
                        Console.WriteLine($"NOMATCH: Speech could not be TRANSCRIBED.");
                    }
                };

                conversationTranscriber.Canceled += (s, e) =>
                {
                    Console.WriteLine($"CANCELED: Reason={e.Reason}");

                    if (e.Reason == CancellationReason.Error)
                    {
                        Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
                        Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
                        Console.WriteLine($"CANCELED: Did you set the speech resource key and region values?");
                        stopRecognition.TrySetResult(0);
                    }

                    stopRecognition.TrySetResult(0);
                };

                conversationTranscriber.SessionStopped += (s, e) =>
                {
                    Console.WriteLine("\n Session stopped event.");
                    stopRecognition.TrySetResult(0);
                };

                await conversationTranscriber.StartTranscribingAsync();

                // Waits for completion. Use Task.WaitAny to keep the task rooted.
                Task.WaitAny(new[] { stopRecognition.Task });

                await conversationTranscriber.StopTranscribingAsync();
            }
        }
    }
}
- Replace katiesteve.wav with the filepath and filename of your .wav file. The intent of this sample is to recognize speech from multiple participants in the conversation. Your audio file should contain multiple speakers. For example, you can use the sample audio file provided in the Speech SDK samples repository on GitHub.
- To change the speech recognition language, replace en-US with another supported language. For example, es-ES for Spanish (Spain). The default language is en-US if you don't specify a language. For details about how to identify one of multiple languages that might be spoken, see language identification.
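If the conversation might be spoken in one of several languages, automatic language detection can be combined with the transcriber. The snippet below is a minimal sketch: it assumes ConversationTranscriber exposes the same AutoDetectSourceLanguageConfig constructor overload as SpeechRecognizer, so verify it against the quickstart documentation when it is published.

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

var speechConfig = SpeechConfig.FromSubscription(
    Environment.GetEnvironmentVariable("SPEECH_KEY"),
    Environment.GetEnvironmentVariable("SPEECH_REGION"));

// Candidate languages the service should choose between.
var autoDetectConfig = AutoDetectSourceLanguageConfig.FromLanguages(
    new[] { "en-US", "es-ES" });

using var audioConfig = AudioConfig.FromWavFileInput("katiesteve.wav");

// Assumes this constructor overload mirrors SpeechRecognizer's.
using var conversationTranscriber = new ConversationTranscriber(
    speechConfig, autoDetectConfig, audioConfig);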
Run your new console application to start speech recognition:
dotnet run
The transcribed conversation output should look something like this:
TRANSCRIBING: Text=good morning
TRANSCRIBING: Text=good morning steve
TRANSCRIBED: Text=Good morning, Steve. Speaker ID=GUEST-1
TRANSCRIBING: Text=good morning
TRANSCRIBING: Text=good morning katie
TRANSCRIBING: Text=good morning katie have you heard
TRANSCRIBING: Text=good morning katie have you heard about
TRANSCRIBING: Text=good morning katie have you heard about the new
TRANSCRIBING: Text=good morning katie have you heard about the new conversation
TRANSCRIBING: Text=good morning katie have you heard about the new conversation transcription
TRANSCRIBING: Text=good morning katie have you heard about the new conversation transcription capability
TRANSCRIBED: Text=Good morning. Katie, have you heard about the new conversation transcription capability? Speaker ID=GUEST-2
TRANSCRIBING: Text=no
TRANSCRIBING: Text=no tell me more
TRANSCRIBED: Text=No, tell me more. Speaker ID=GUEST-1
TRANSCRIBING: Text=it's the new
TRANSCRIBING: Text=it's the new feature
TRANSCRIBING: Text=it's the new feature that
TRANSCRIBING: Text=it's the new feature that transcribes our
TRANSCRIBING: Text=it's the new feature that transcribes our discussion
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who said what
TRANSCRIBED: Text=It's the new feature that transcribes our discussion and lets us know who said what. Speaker ID=GUEST-2
TRANSCRIBING: Text=that
TRANSCRIBING: Text=that sounds interesting
TRANSCRIBING: Text=that sounds interesting i'm
TRANSCRIBING: Text=that sounds interesting i'm going to give this a try
TRANSCRIBED: Text=That sounds interesting. I'm going to give this a try. Speaker ID=GUEST-1
CANCELED: Reason=EndOfStream