We are pleased to announce the public preview of real-time diarization in Azure AI Speech. This new feature transcribes conversations in real time while simultaneously identifying speakers.
Diarization refers to the ability to tell who spoke and when. It differentiates speakers in mono-channel audio input based on their voice characteristics, making it possible to attribute each part of a conversation to a specific speaker. This is useful in a variety of scenarios, such as doctor-patient conversations, agent-customer interactions, and court proceedings.
What’s available in Public Preview
Developers can access the real-time diarization feature through the Speech SDK, which includes two APIs: ConversationTranscriber and MeetingTranscriber. ConversationTranscriber differentiates speakers as GUEST1, GUEST2, and so on, while MeetingTranscriber identifies speakers by their real names.
The ConversationTranscriber API combines diarization with speech to text to produce transcription output that contains a speaker entry for each transcribed utterance. The output is tagged GUEST1, GUEST2, GUEST3, and so on, based on the number of speakers in the audio conversation. The ConversationTranscriber API is deliberately similar to the SpeechRecognizer API, which makes it easy to transition between the two. Because ConversationTranscriber uses the Speech to Text endpoint, it supports the audio formats and features supported by that endpoint, such as custom phrase lists, language identification, and word-level timings.
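For illustration, the sketch below shows how standard Speech to Text options might be applied to a ConversationTranscriber. It assumes that SpeechConfig.RequestWordLevelTimestamps and PhraseListGrammar work with ConversationTranscriber the same way they do with SpeechRecognizer; the file name and phrase are placeholders, and the exact calls should be verified against the released SDK and quickstart.

// Illustrative sketch only: assumes PhraseListGrammar and word-level timestamp requests
// apply to ConversationTranscriber as they do to SpeechRecognizer.
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;

var speechConfig = SpeechConfig.FromSubscription("<your-speech-key>", "<your-speech-region>");
speechConfig.SpeechRecognitionLanguage = "en-US";
speechConfig.RequestWordLevelTimestamps();    // word-level timings from the Speech to Text endpoint

using var audioConfig = AudioConfig.FromWavFileInput("conversation.wav"); // placeholder file name
using var transcriber = new ConversationTranscriber(speechConfig, audioConfig);

// Custom phrase list: bias recognition toward domain-specific terms.
var phraseList = PhraseListGrammar.FromRecognizer(transcriber);
phraseList.AddPhrase("Contoso");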
The MeetingTranscriber API identifies speakers by their real names instead of GUEST1, GUEST2, GUEST3, and so on. This API supports adding and removing participants in a meeting.
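The blog does not include MeetingTranscriber sample code; the sketch below only suggests what participant management could look like, assuming the existing meeting transcription API shape carries over. Meeting.CreateMeetingAsync, JoinMeetingAsync, AddParticipantAsync, and RemoveParticipantAsync are assumptions, as are the meeting ID and participant IDs; check the released SDK documentation for the actual signatures.

// Rough sketch only: the Meeting/MeetingTranscriber names and methods used here are assumptions
// to confirm against the Speech SDK documentation once the preview is released.
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;

var speechConfig = SpeechConfig.FromSubscription("<your-speech-key>", "<your-speech-region>");
speechConfig.SpeechRecognitionLanguage = "en-US";

using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
using var meeting = await Meeting.CreateMeetingAsync(speechConfig, "meeting-id-123"); // placeholder meeting ID
using var meetingTranscriber = new MeetingTranscriber(audioConfig);

// Join the meeting and register participants so output is attributed to real names.
await meetingTranscriber.JoinMeetingAsync(meeting);
await meeting.AddParticipantAsync("katie@contoso.com"); // placeholder participant IDs
await meeting.AddParticipantAsync("steve@contoso.com");

await meetingTranscriber.StartTranscribingAsync();
// ... the Transcribed event reports each utterance along with the participant who said it ...
await meetingTranscriber.StopTranscribingAsync();

// Participants can also be removed while the meeting is in progress.
await meeting.RemoveParticipantAsync("steve@contoso.com");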
Real-time diarization will be released through the Speech SDK and therefore supports the programming languages that the Speech SDK supports (e.g., C#/.NET, Python, Java, JavaScript). It will be available in all regions and locales that Azure AI Speech supports.
Use cases and scenarios
Real-time diarization is a feature requested by many customers to help with an array of use cases. By identifying speakers, it not only enables deeper speech analytics and insights from transcriptions, but can also improve the accessibility of transcripts. We anticipate real-time diarization being used in scenarios such as:
When speakers are all in the same room using a single microphone, produce a live transcription that shows which speaker (e.g., GUEST1, GUEST2, or GUEST3) said what.
Stream prerecorded or live events acoustically to a microphone, and get a transcription that shows which speaker (e.g., GUEST1, GUEST2, or GUEST3) said what.
Show live captions or subtitles for meetings, videos, or audio.
Getting started
The public preview of real-time diarization will be available in Speech SDK version 1.31.0, which will be released in early August.
Follow the steps below to create a new console application, install the Speech SDK, and try out real-time diarization from a file with the ConversationTranscriber API. We will also publish detailed documentation, including a quickstart and sample code, when the public preview is released.

Create a new console application:

dotnet new console

Install the Speech SDK package:

dotnet add package Microsoft.CognitiveServices.Speech

Then replace the contents of Program.cs with the following code:
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;

class Program
{
    // This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
    static string speechKey = Environment.GetEnvironmentVariable("SPEECH_KEY");
    static string speechRegion = Environment.GetEnvironmentVariable("SPEECH_REGION");

    async static Task Main(string[] args)
    {
        var filepath = "katiesteve.wav";
        var speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);
        speechConfig.SpeechRecognitionLanguage = "en-US";

        var stopRecognition = new TaskCompletionSource<int>(TaskCreationOptions.RunContinuationsAsynchronously);

        // Create an audio stream from a wav file or from the default microphone
        using (var audioConfig = AudioConfig.FromWavFileInput(filepath))
        {
            // Create a conversation transcriber using audio stream input
            using (var conversationTranscriber = new ConversationTranscriber(speechConfig, audioConfig))
            {
                // Intermediate (partial) results
                conversationTranscriber.Transcribing += (s, e) =>
                {
                    Console.WriteLine($"TRANSCRIBING: Text={e.Result.Text}");
                };

                // Final results, including the speaker ID for each utterance
                conversationTranscriber.Transcribed += (s, e) =>
                {
                    if (e.Result.Reason == ResultReason.RecognizedSpeech)
                    {
                        Console.WriteLine($"TRANSCRIBED: Text={e.Result.Text} Speaker ID={e.Result.SpeakerId}");
                    }
                    else if (e.Result.Reason == ResultReason.NoMatch)
                    {
                        Console.WriteLine($"NOMATCH: Speech could not be TRANSCRIBED.");
                    }
                };

                conversationTranscriber.Canceled += (s, e) =>
                {
                    Console.WriteLine($"CANCELED: Reason={e.Reason}");

                    if (e.Reason == CancellationReason.Error)
                    {
                        Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
                        Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
                        Console.WriteLine($"CANCELED: Did you set the speech resource key and region values?");
                    }

                    stopRecognition.TrySetResult(0);
                };

                conversationTranscriber.SessionStopped += (s, e) =>
                {
                    Console.WriteLine("\n Session stopped event.");
                    stopRecognition.TrySetResult(0);
                };

                await conversationTranscriber.StartTranscribingAsync();

                // Waits for completion. Use Task.WaitAny to keep the task rooted.
                Task.WaitAny(new[] { stopRecognition.Task });

                await conversationTranscriber.StopTranscribingAsync();
            }
        }
    }
}
Set the SPEECH_KEY and SPEECH_REGION environment variables to your Speech resource key and region, then run your new console application to start conversation transcription:
dotnet run
The transcribed conversation should be output as text similar to the following:
TRANSCRIBING: Text=good morning
TRANSCRIBING: Text=good morning steve
TRANSCRIBED: Text=Good morning, Steve. Speaker ID=GUEST-1
TRANSCRIBING: Text=good morning
TRANSCRIBING: Text=good morning katie
TRANSCRIBING: Text=good morning katie have you heard
TRANSCRIBING: Text=good morning katie have you heard about
TRANSCRIBING: Text=good morning katie have you heard about the new
TRANSCRIBING: Text=good morning katie have you heard about the new conversation
TRANSCRIBING: Text=good morning katie have you heard about the new conversation transcription
TRANSCRIBING: Text=good morning katie have you heard about the new conversation transcription capability
TRANSCRIBED: Text=Good morning. Katie, have you heard about the new conversation transcription capability? Speaker ID=GUEST-2
TRANSCRIBING: Text=no
TRANSCRIBING: Text=no tell me more
TRANSCRIBED: Text=No, tell me more. Speaker ID=GUEST-1
TRANSCRIBING: Text=it's the new
TRANSCRIBING: Text=it's the new feature
TRANSCRIBING: Text=it's the new feature that
TRANSCRIBING: Text=it's the new feature that transcribes our
TRANSCRIBING: Text=it's the new feature that transcribes our discussion
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who said what
TRANSCRIBED: Text=It's the new feature that transcribes our discussion and lets us know who said what. Speaker ID=GUEST-2
TRANSCRIBING: Text=that
TRANSCRIBING: Text=that sounds interesting
TRANSCRIBING: Text=that sounds interesting i'm
TRANSCRIBING: Text=that sounds interesting i'm going to give this a try
TRANSCRIBED: Text=That sounds interesting. I'm going to give this a try. Speaker ID=GUEST-1
CANCELED: Reason=EndOfStream
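The sample above transcribes from a WAV file. As the comment in the code notes, the audio input can also come from the default microphone; a minimal variation (assuming your machine has a working default microphone) is to swap the AudioConfig line and keep the rest of the program unchanged:

// Capture live audio from the default microphone instead of reading a file.
// The rest of the sample (event handlers, start/stop calls) stays the same.
using (var audioConfig = AudioConfig.FromDefaultMicrophoneInput())
using (var conversationTranscriber = new ConversationTranscriber(speechConfig, audioConfig))
{
    // Wire up the same Transcribing, Transcribed, Canceled, and SessionStopped handlers as above,
    // then start transcription, speak, and stop when done.
    await conversationTranscriber.StartTranscribingAsync();
    // ... speak into the microphone ...
    await conversationTranscriber.StopTranscribingAsync();
}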