In July we shared with this audience that OpenAI Whisper would be coming soon to Azure AI services, and we are very happy to announce that today is the day! Customers of Azure OpenAI Service and Azure AI Speech can now use Whisper.
The OpenAI Whisper model is an encoder-decoder Transformer that can transcribe audio into text in 57 languages. It can also translate speech in those languages into English text, and it produces readable transcripts with punctuation and capitalization.
OpenAI Whisper model in Azure OpenAI Service
Azure OpenAI Service lets developers run OpenAI’s Whisper model in Azure with the same features and functionality as the OpenAI Whisper API, including transcription and translation.
The Whisper model's REST APIs for transcription and translation are available from the Azure OpenAI Service portal.
See details on how to use the Whisper model with the Azure OpenAI Service here: Speech to text with Azure OpenAI Service - Azure OpenAI | Microsoft Learn
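The REST call itself can be sketched with the Python standard library. This is a minimal sketch that assumes a URL shape of `{endpoint}/openai/deployments/{deployment}/audio/{transcriptions|translations}?api-version=...`, an `api-key` header, and a multipart form whose audio part is named `file`, mirroring the OpenAI Whisper API. The `api-version` value and the helper names (`whisper_url`, `encode_multipart`, `build_request`, `send`) are illustrative assumptions, so check the Microsoft Learn page linked above for the current contract.

```python
import urllib.request
import uuid
from typing import Optional

# Assumed api-version; check Microsoft Learn for the version your resource supports.
API_VERSION = "2023-09-01-preview"
OPERATIONS = ("transcriptions", "translations")


def whisper_url(endpoint: str, deployment: str, operation: str,
                api_version: str = API_VERSION) -> str:
    """Build the REST URL for a Whisper deployment (assumed URL shape).

    "transcriptions" returns text in the spoken language;
    "translations" returns English text.
    """
    if operation not in OPERATIONS:
        raise ValueError(f"operation must be one of {OPERATIONS}, got {operation!r}")
    return (f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
            f"/audio/{operation}?api-version={api_version}")


def encode_multipart(filename: str, audio: bytes,
                     fields: dict) -> tuple:
    """Encode extra form fields plus the audio file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8"))
    file_header = (f"--{boundary}\r\n"
                   f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
                   "Content-Type: application/octet-stream\r\n\r\n")
    chunks.append(file_header.encode("utf-8") + audio + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def build_request(endpoint: str, api_key: str, deployment: str, operation: str,
                  filename: str, audio: bytes,
                  fields: Optional[dict] = None) -> urllib.request.Request:
    """Prepare (but do not send) a POST request to the Whisper deployment."""
    body, content_type = encode_multipart(filename, audio, fields or {})
    return urllib.request.Request(
        whisper_url(endpoint, deployment, operation),
        data=body,
        method="POST",
        headers={"api-key": api_key, "Content-Type": content_type},
    )


def send(request: urllib.request.Request) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, `send(build_request(endpoint, key, "whisper", "translations", "meeting.wav", audio_bytes, {"response_format": "text"}))` would ask a deployment hypothetically named `whisper` for English text from a non-English recording; the `response_format` field is assumed to behave as it does in the OpenAI Whisper API.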
OpenAI Whisper model in Azure AI Speech
Users of Azure AI Speech can use OpenAI’s Whisper model with the Azure AI Speech batch transcription API, which makes it easy to transcribe large volumes of audio. This is particularly useful for processing extensive collections of audio data stored in Azure.
Users of Whisper in Azure AI Speech benefit from existing Azure AI Speech features, including asynchronous processing, larger file sizes, word-level timestamps, speaker diarization, and customization (available soon):
- Large file sizes: Azure AI Speech extends Whisper transcription to files up to 1 GB in size, and you can process large numbers of files by batching up to 1,000 files in a single request.
- Timestamps: With Azure AI Speech, the recognition result includes word-level timestamps, so you can identify where in the audio each word is spoken.
- Speaker diarization: Azure AI Speech identifies the individual speakers in an audio file and labels their speech segments. This lets customers distinguish between speakers, attribute each transcribed word to the right speaker, and produce a more organized, structured transcript of the audio.
- Customization/fine-tuning (available soon): The Custom Speech capability in Azure AI Speech will allow customers to fine-tune Whisper on their own data to improve recognition accuracy and consistency.
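The 1,000-files-per-request limit shapes how batch transcription requests are built. The sketch below assembles request bodies for the Azure AI Speech batch transcription API and splits a long list of audio URLs into requests of at most 1,000 files each. The JSON property names (`displayName`, `locale`, `contentUrls`, `model.self`, `displayFormWordLevelTimestampsEnabled`, `diarizationEnabled`) reflect our understanding of the Speech to text REST API v3.x schema and should be treated as assumptions, as should the helper names `transcription_body` and `batched_bodies`.

```python
MAX_FILES_PER_REQUEST = 1000  # per-request limit stated in this announcement


def transcription_body(content_urls, model_self_url, locale="en-US",
                       display_name="Whisper batch", diarization=True) -> dict:
    """Build the JSON body for one batch transcription request.

    Property names are assumed from the Speech to text REST API v3.x
    batch transcription schema; verify them against the docs.
    """
    urls = list(content_urls)
    if not urls:
        raise ValueError("at least one content URL is required")
    if len(urls) > MAX_FILES_PER_REQUEST:
        raise ValueError(f"at most {MAX_FILES_PER_REQUEST} files per request")
    return {
        "displayName": display_name,
        "locale": locale,
        "contentUrls": urls,
        "model": {"self": model_self_url},  # points at the Whisper base model
        "properties": {
            # Assumed: Whisper exposes word timestamps on the display form.
            "displayFormWordLevelTimestampsEnabled": True,
            "diarizationEnabled": diarization,
        },
    }


def batched_bodies(content_urls, model_self_url, **kwargs) -> list:
    """Split any number of audio URLs into request bodies of <= 1,000 files."""
    urls = list(content_urls)
    return [
        transcription_body(urls[i:i + MAX_FILES_PER_REQUEST], model_self_url, **kwargs)
        for i in range(0, len(urls), MAX_FILES_PER_REQUEST)
    ]
```

Each body would then be sent as JSON to the `/speechtotext/{api-version}/transcriptions` endpoint of your Speech resource with an `Ocp-Apim-Subscription-Key` header; confirm the API version that exposes Whisper in the Microsoft Learn page on batch transcription.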
See details on how to use the Whisper model with Azure AI Speech here: Create a batch transcription - Speech service - Azure AI services | Microsoft Learn
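When a batch job finishes, each result file is JSON, and the timestamps and diarization labels can be combined into a speaker-attributed transcript. This minimal sketch assumes the result layout we understand the batch API to use: a `recognizedPhrases` array whose items carry `speaker`, `offsetInTicks`, `durationInTicks`, and an `nBest` list with a `display` string, where offsets are in 100-nanosecond ticks. The field names and the helpers `speaker_segments` and `format_transcript` are assumptions to verify against a real result file.

```python
TICKS_PER_SECOND = 10_000_000  # assumed: offsets and durations are 100-ns ticks


def speaker_segments(result: dict) -> list:
    """Turn one result file into (speaker, start_s, end_s, text) tuples."""
    segments = []
    for phrase in result.get("recognizedPhrases", []):
        if not phrase.get("nBest"):
            continue  # no recognition hypothesis for this phrase
        start = phrase["offsetInTicks"] / TICKS_PER_SECOND
        end = start + phrase["durationInTicks"] / TICKS_PER_SECOND
        text = phrase["nBest"][0]["display"]  # top hypothesis, display form
        # "speaker" is only expected when diarization was enabled.
        segments.append((phrase.get("speaker"), start, end, text))
    segments.sort(key=lambda seg: seg[1])  # order by start time
    return segments


def format_transcript(segments) -> str:
    """Render segments as one "[start-end] Speaker N: text" line each."""
    lines = []
    for speaker, start, end, text in segments:
        label = f"Speaker {speaker}" if speaker is not None else "Unknown speaker"
        lines.append(f"[{start:.2f}s-{end:.2f}s] {label}: {text}")
    return "\n".join(lines)
```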
Getting started
Azure AI Studio
Users can access the Whisper model in Azure OpenAI Service through Azure AI Studio.
- Users first need to apply for access to Azure OpenAI Service.
- Once approved, visit the Azure portal and create an Azure OpenAI Service resource.
- After creating the resource, users can begin using Whisper.
Azure AI Speech Studio
Users can experiment with the Whisper model in Azure AI Speech Studio.
In Speech Studio you can find both the Whisper model in Azure OpenAI Service try-out and the Batch speech to text try-out, which now includes the Whisper model.
[Screenshot: Speech to text try-outs and tools in Speech Studio]
The Batch speech to text try-out lets you compare the output of the Whisper model side by side with that of an Azure AI Speech model, as a quick initial evaluation of which model works better for your specific scenario.
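If you also have a human reference transcript for a sample file, word error rate (WER) is a common way to put a number on that comparison: the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. The sketch below is a generic WER implementation, not something Speech Studio is claimed to compute. Its light normalization (lowercasing and stripping ASCII punctuation, so Whisper's punctuated output is not penalized) is an assumption you may want to adjust for your data.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT = str.maketrans("", "", string.punctuation)


def _words(text: str) -> list:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = _words(reference), _words(hypothesis)
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j]: edit distance between the reference prefix processed so far
    # and the first j hypothesis words (one DP row at a time).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion: reference word missing
                curr[j - 1] + 1,         # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # substitution, or free match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Run it once with the Whisper output and once with the Azure AI Speech model output for the same audio; the lower WER suggests the better fit, although a handful of files gives only a rough signal.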
[Screenshot: Comparing Whisper to an Azure AI Speech model in the Batch speech to text try-out]
Conclusion
The Whisper model is a great addition to the broad portfolio of capabilities the Azure AI Speech Service offers. We are looking forward to seeing the innovative ways in which developers will take advantage of this new offering to improve business productivity and to delight users.