Announcing general availability of real-time diarization
Published May 21, 2024

We are excited to announce the general availability of real-time diarization, an enhanced add-on feature of the Azure Speech service. With this feature, you can get live (real-time) speech to text transcription attributed to individual speakers (Guest-1, Guest-2, Guest-3, etc.), so you know which speaker said each part of the transcribed conversation.

 

What’s Real-time Diarization

Diarization is a feature that differentiates speakers in an audio stream. Real-time diarization can distinguish speakers' voices in single-channel audio in streaming mode. Combined with speech to text, diarization produces transcription output that contains a speaker entry for each transcribed segment. Each segment is tagged as Guest-1, Guest-2, Guest-3, and so on, depending on the number of speakers in the conversation. The figure below shows the difference between transcription results with and without diarization.

[Figure: comparison of transcription results with and without diarization]
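For illustration, a diarized transcript of a short two-speaker exchange might look like the following (the dialogue is hypothetical); without diarization, the output would contain the same text but no speaker labels:

```
Guest-1: Good morning, thanks for joining the call.
Guest-2: Happy to be here. Shall we start with the agenda?
Guest-1: Sure, let's begin with the project update.
```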

 

Use Cases and Scenarios

Real-time diarization can be used in a wide range of scenarios, including accessibility scenarios. Some typical use cases are listed below.

  • Live Conversation/Meeting Transcription

When speakers are all in the same room with a single-microphone setup, you can produce a live transcript that shows which speaker (e.g., Guest-1, Guest-2, or Guest-3) said what. By combining GPT with the diarized transcription, you can also generate a meeting or conversation summary or recap, or ask questions about the conversation or meeting (a minimal sketch follows the Teams example below).

Microsoft Teams, for instance, leverages the diarization feature to show live meeting transcription in Teams. Based on that meeting transcription, Teams' Copilot provides a meeting summary, recap, and many other features that let people interact with Copilot about their meetings.
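As a rough illustration of combining diarized transcription with GPT, the Python sketch below sends a diarized transcript to an Azure OpenAI chat deployment and asks for a summary. The endpoint, API key, deployment name (`gpt-4o`), and transcript are placeholders and assumptions for this sketch, not the implementation that Teams uses.

```python
import os
from openai import AzureOpenAI  # openai Python package with Azure OpenAI support

# Hypothetical diarized transcript produced by real-time diarization.
diarized_transcript = """\
Guest-1: Let's review the release plan for next quarter.
Guest-2: The diarization feature is on track; documentation is still pending.
Guest-1: Ok, let's assign the documentation task this week.
"""

# Placeholder endpoint, key, and API version; replace with your own Azure OpenAI resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4o",  # name of your Azure OpenAI deployment (assumption)
    messages=[
        {"role": "system", "content": "Summarize the meeting and list action items per speaker."},
        {"role": "user", "content": diarized_transcript},
    ],
)

print(response.choices[0].message.content)
```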

  • Real-time Agent Assist

By using Speech Analytics (another new feature of the Azure Speech service announced at Build) together with real-time diarization, you can run analytics on the live transcription to support agent-assist scenarios and help agents optimally address customers' questions and concerns.

  • Live Captions and Subtitles (Translated Captions)

Show live captions or subtitles (translated captions) for meetings, videos, or audio.

 

What’s Improved Since Public Preview

Since the public preview, we have put a lot of effort into improving diarization quality, which was the main feedback we heard from preview users. We released a new diarization model that improves diarization quality by roughly 3% in word diarization error rate (WDER). In addition, we removed the requirement for 7 seconds of continuous audio from a single speaker: in the preview version, when a speaker first spoke, diarization only reached its full quality after 7 seconds of continuous audio from that speaker. The GA version no longer has this limitation.

 

Early Adopters from Diverse Areas

So far, over a thousand customers from diverse industries have tried real-time diarization in a variety of scenarios. Below are some examples.

  • Medical

Live transcription of doctor-patient conversations, plus transcription analytics

  • Banking

Live meeting transcription

  • Telecommunication

Conversation transcription, summarization, transcription analytics

  • Legal

Apps that assist trial and appellate attorneys preparing for oral arguments (e.g., capturing the attorneys’ and judges’ positions during mock oral arguments)

 

Try it Out

To try out real-time diarization, go to Speech Studio (Speech Studio - Real-time speech to text (microsoft.com)) and follow the steps below (shown in the screenshot) to experience the feature:

  1. Click on “Show advanced options”.
  2. Use the “Speaker diarization” toggle to turn real-time diarization on or off.

[Screenshot: “Speaker diarization” toggle under advanced options in Speech Studio]

Real-time diarization is available in all regions that the Azure Speech service supports. It is released through the Speech SDK (version 1.31.0 or higher) and is available in the following SDKs:

  • C#
  • C++
  • Java
  • JavaScript
  • Python

To start experiencing the feature, please follow the Quickstart: Real-time diarization.
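For reference, a minimal Python sketch along the lines of that quickstart is shown below. It transcribes a mono WAV file with diarization enabled and prints each recognized segment with its speaker label; the subscription key, region, and file name are placeholders you would replace with your own values.

```python
import time
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials and audio file; replace with your own values.
speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey", region="YourRegion")
speech_config.speech_recognition_language = "en-US"
audio_config = speechsdk.audio.AudioConfig(filename="conversation.wav")

# ConversationTranscriber performs speech to text with real-time diarization.
transcriber = speechsdk.transcription.ConversationTranscriber(
    speech_config=speech_config, audio_config=audio_config
)

done = False

def on_transcribed(evt):
    # Each final segment carries the recognized text and a speaker label (Guest-1, Guest-2, ...).
    if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print(f"{evt.result.speaker_id}: {evt.result.text}")

def on_stopped(evt):
    global done
    done = True

transcriber.transcribed.connect(on_transcribed)
transcriber.session_stopped.connect(on_stopped)
transcriber.canceled.connect(on_stopped)

transcriber.start_transcribing_async().get()
while not done:
    time.sleep(0.5)
transcriber.stop_transcribing_async().get()
```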
