We are pleased to announce the public preview of Real-time diarization in Azure AI Speech. This new feature offers real-time transcription while simultaneously identifying speakers, making it an invaluable tool for a variety of scenarios.
Diarization refers to the ability to tell who spoke and when. It differentiates speakers in mono-channel audio input based on their voice characteristics, which makes it possible to identify who said what during a conversation. This is useful in a variety of scenarios such as doctor-patient conversations, agent-customer interactions, and court proceedings.
What’s available in Public Preview
Developers can access real-time diarization through the Speech SDK, which offers two APIs: ConversationTranscriber and MeetingTranscriber. ConversationTranscriber labels speakers as GUEST1, GUEST2, and so on, while MeetingTranscriber identifies speakers by their real names.
The ConversationTranscriber API combines diarization with speech to text to produce transcription output that contains a speaker entry for each transcribed utterance. Output is tagged as GUEST1, GUEST2, GUEST3, and so on, based on the number of speakers in the audio. The ConversationTranscriber API is modeled on the SpeechRecognizer API, which makes it easy to move between the two. Because ConversationTranscriber uses the Speech to Text endpoint, it supports the audio formats and features that the endpoint supports, such as custom phrase lists, language identification, and word-level timings.
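For example, those endpoint features can be configured before transcription starts. The snippet below is a minimal sketch: the key, region, file name, and phrase are placeholders, and it assumes PhraseListGrammar.FromRecognizer accepts a ConversationTranscriber the same way it accepts a SpeechRecognizer.

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Placeholder key, region, file name, and phrase for illustration only.
var speechConfig = SpeechConfig.FromSubscription("YourSpeechKey", "YourSpeechRegion");
speechConfig.SpeechRecognitionLanguage = "en-US";

// Request word-level timings in the detailed recognition output.
speechConfig.RequestWordLevelTimestamps();
speechConfig.OutputFormat = OutputFormat.Detailed;

using var audioConfig = AudioConfig.FromWavFileInput("conversation.wav");
using var conversationTranscriber = new ConversationTranscriber(speechConfig, audioConfig);

// Bias recognition toward domain-specific terms with a phrase list
// (assumes FromRecognizer accepts a ConversationTranscriber).
var phraseList = PhraseListGrammar.FromRecognizer(conversationTranscriber);
phraseList.AddPhrase("Contoso");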
The MeetingTranscriber API identifies speakers by their real names instead of GUEST1, GUEST2, GUEST3, and so on. It also supports adding participants to and removing them from a meeting.
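A rough sketch of how that might look is below. The meeting ID, participant IDs, key, region, and file name are placeholders, and the types from the Microsoft.CognitiveServices.Speech.Transcription namespace are shown as an assumption about the API shape, so check the published quickstart for the exact surface.

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;

// Placeholder key, region, file, meeting ID, and participant IDs.
var speechConfig = SpeechConfig.FromSubscription("YourSpeechKey", "YourSpeechRegion");
speechConfig.SpeechRecognitionLanguage = "en-US";

using var audioConfig = AudioConfig.FromWavFileInput("meeting.wav");
var meeting = await Meeting.CreateMeetingAsync(speechConfig, "meeting-session-id");
using var meetingTranscriber = new MeetingTranscriber(audioConfig);

meetingTranscriber.Transcribed += (s, e) =>
    Console.WriteLine($"TRANSCRIBED: Text={e.Result.Text} Speaker={e.Result.UserId}");

// Participants can be added to (and removed from) the meeting while it runs.
await meeting.AddParticipantAsync(Participant.From("katie@contoso.com", "en-US", ""));
await meeting.AddParticipantAsync(Participant.From("steve@contoso.com", "en-US", ""));

await meetingTranscriber.JoinMeetingAsync(meeting);
await meetingTranscriber.StartTranscribingAsync();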
Real-time diarization will be released through the Speech SDK and therefore supports the programming languages the Speech SDK supports, such as C# (.NET), Python, Java, and JavaScript. It will be available in all regions and locales that Azure AI Speech supports.
Use cases and scenarios
Real-time diarization is a feature that many customers have requested to support an array of use cases. It not only enables users to run speech analytics and gain insights from transcriptions by identifying speakers, but it can also improve the accessibility of transcripts. We anticipate real-time diarization being used in scenarios such as:
- Live Conversation Transcription
When speakers are all in the same room using a single-microphone setup, transcribe the conversation live and show which speaker (e.g., GUEST1, GUEST2, or GUEST3) said what.
- Conversation Transcription on Prerecorded Live Events
Stream prerecorded live events acoustically to a microphone and get a transcription that shows which speaker (e.g., GUEST1, GUEST2, or GUEST3) said what.
- Live Captions and Subtitles
Show live captions or subtitles for meetings, videos, or audio.
Getting started
The public preview of real-time diarization will be available in Speech SDK version 1.31.0, which will be released in early August.
Follow the steps below to create a new console application, install the Speech SDK, and try out real-time diarization from a file with the ConversationTranscriber API. We will also release detailed documentation, including a quickstart and sample code, when the public preview is available.
- Open a command prompt where you want the new project and create a console application with the .NET CLI. The Program.cs file should be created in the project directory.
dotnet new console
- Install the Speech SDK in your new project with the .NET CLI.
dotnet add package Microsoft.CognitiveServices.Speech
- Replace the contents of Program.cs with the following code.
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program
{
    // This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
    static string speechKey = Environment.GetEnvironmentVariable("SPEECH_KEY");
    static string speechRegion = Environment.GetEnvironmentVariable("SPEECH_REGION");

    async static Task Main(string[] args)
    {
        var filepath = "katiesteve.wav";
        var speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);
        speechConfig.SpeechRecognitionLanguage = "en-US";

        var stopRecognition = new TaskCompletionSource<int>(TaskCreationOptions.RunContinuationsAsynchronously);

        // Create an audio stream from a wav file or from the default microphone
        using (var audioConfig = AudioConfig.FromWavFileInput(filepath))
        {
            // Create a conversation transcriber using audio stream input
            using (var conversationTranscriber = new ConversationTranscriber(speechConfig, audioConfig))
            {
                conversationTranscriber.Transcribing += (s, e) =>
                {
                    Console.WriteLine($"TRANSCRIBING: Text={e.Result.Text}");
                };

                conversationTranscriber.Transcribed += (s, e) =>
                {
                    if (e.Result.Reason == ResultReason.RecognizedSpeech)
                    {
                        Console.WriteLine($"TRANSCRIBED: Text={e.Result.Text} Speaker ID={e.Result.SpeakerId}");
                    }
                    else if (e.Result.Reason == ResultReason.NoMatch)
                    {
                        Console.WriteLine($"NOMATCH: Speech could not be TRANSCRIBED.");
                    }
                };

                conversationTranscriber.Canceled += (s, e) =>
                {
                    Console.WriteLine($"CANCELED: Reason={e.Reason}");

                    if (e.Reason == CancellationReason.Error)
                    {
                        Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
                        Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
                        Console.WriteLine($"CANCELED: Did you set the speech resource key and region values?");
                        stopRecognition.TrySetResult(0);
                    }

                    stopRecognition.TrySetResult(0);
                };

                conversationTranscriber.SessionStopped += (s, e) =>
                {
                    Console.WriteLine("\n Session stopped event.");
                    stopRecognition.TrySetResult(0);
                };

                await conversationTranscriber.StartTranscribingAsync();

                // Waits for completion. Use Task.WaitAny to keep the task rooted.
                Task.WaitAny(new[] { stopRecognition.Task });

                await conversationTranscriber.StopTranscribingAsync();
            }
        }
    }
}
- Replace katiesteve.wav with the filepath and filename of your .wav file. The intent of this sample is to recognize speech from multiple participants in the conversation. Your audio file should contain multiple speakers. For example, you can use the sample audio file provided in the Speech SDK samples repository on GitHub.
- To change the speech recognition language, replace en-US with another supported language. For example, es-ES for Spanish (Spain). The default language is en-US if you don't specify a language. For details about how to identify one of multiple languages that might be spoken, see language identification.
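If the conversation might be spoken in one of several languages, automatic language detection can be combined with the transcriber. The snippet below is a minimal sketch: it assumes ConversationTranscriber exposes the same AutoDetectSourceLanguageConfig constructor overload as SpeechRecognizer, so verify it against the quickstart documentation when it is published.

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

var speechConfig = SpeechConfig.FromSubscription(
    Environment.GetEnvironmentVariable("SPEECH_KEY"),
    Environment.GetEnvironmentVariable("SPEECH_REGION"));

// Candidate languages the service should choose between.
var autoDetectConfig = AutoDetectSourceLanguageConfig.FromLanguages(
    new[] { "en-US", "es-ES" });

using var audioConfig = AudioConfig.FromWavFileInput("katiesteve.wav");

// Assumes this constructor overload mirrors SpeechRecognizer's.
using var conversationTranscriber = new ConversationTranscriber(
    speechConfig, autoDetectConfig, audioConfig);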
Run your new console application to start speech recognition:
dotnet run
The transcribed conversation output should look something like this:
TRANSCRIBING: Text=good morning
TRANSCRIBING: Text=good morning steve
TRANSCRIBED: Text=Good morning, Steve. Speaker ID=GUEST-1
TRANSCRIBING: Text=good morning
TRANSCRIBING: Text=good morning katie
TRANSCRIBING: Text=good morning katie have you heard
TRANSCRIBING: Text=good morning katie have you heard about
TRANSCRIBING: Text=good morning katie have you heard about the new
TRANSCRIBING: Text=good morning katie have you heard about the new conversation
TRANSCRIBING: Text=good morning katie have you heard about the new conversation transcription
TRANSCRIBING: Text=good morning katie have you heard about the new conversation transcription capability
TRANSCRIBED: Text=Good morning. Katie, have you heard about the new conversation transcription capability? Speaker ID=GUEST-2
TRANSCRIBING: Text=no
TRANSCRIBING: Text=no tell me more
TRANSCRIBED: Text=No, tell me more. Speaker ID=GUEST-1
TRANSCRIBING: Text=it's the new
TRANSCRIBING: Text=it's the new feature
TRANSCRIBING: Text=it's the new feature that
TRANSCRIBING: Text=it's the new feature that transcribes our
TRANSCRIBING: Text=it's the new feature that transcribes our discussion
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who said what
TRANSCRIBED: Text=It's the new feature that transcribes our discussion and lets us know who said what. Speaker ID=GUEST-2
TRANSCRIBING: Text=that
TRANSCRIBING: Text=that sounds interesting
TRANSCRIBING: Text=that sounds interesting i'm
TRANSCRIBING: Text=that sounds interesting i'm going to give this a try
TRANSCRIBED: Text=That sounds interesting. I'm going to give this a try. Speaker ID=GUEST-1
CANCELED: Reason=EndOfStream