Whisper is an automatic speech recognition (ASR) system developed by OpenAI and trained on 680,000 hours of supervised multilingual and multitask data collected from the web. This large and diverse dataset improves its robustness to accents, background noise, and technical jargon. Whisper transcribes speech in multiple languages and can also translate that speech into English. OpenAI has open-sourced the models and inference code to provide a robust foundation for building practical applications and advancing research in speech processing.
Whisper models are available through Azure OpenAI Service. The model converts speech to text and is well suited to transcribing audio files. It performs best on English speech but can also handle other languages; when it is used for translation, the output is English text.
The Whisper model in Azure OpenAI covers a range of scenarios. It is suited to transcribing and analyzing prerecorded audio and video files and to quickly processing individual audio files. It can transcribe phone call recordings to support analytics such as call summaries, sentiment, key topics, and custom insights. Similarly, it can transcribe meeting recordings to support analytics such as meeting summaries, meeting chapters, and action item extraction. The Whisper model also supports contact center voice agent services such as call routing and interactive voice response, and it is suitable for application-specific voice assistants in set-top boxes, mobile apps, in-car systems, and similar settings.
However, it does not support real-time transcription, pronunciation assessment, or translation of live or prerecorded audio into languages other than English. For translation, it is recommended for converting prerecorded audio from other languages into English.
Developers using Whisper in Azure AI Speech benefit from additional capabilities such as support for large files of up to 1 GB, speaker diarization, and the ability to fine-tune the Whisper model with audio plus human-labeled transcripts.
To access Whisper, developers can use Azure OpenAI Studio. The Whisper REST API supports translation from a growing list of languages into English. The Whisper model is a significant addition to Azure AI's broad portfolio of capabilities, offering new ways to improve business productivity and user experience.
Here is a code snippet showing how to use the Azure OpenAI Whisper API in Python.
from openai import AzureOpenAI

# Transcription settings
deployment_id = "whisper"  # name of your Whisper deployment
audio_language = "en"  # language spoken in the audio (ISO 639-1 code)
audio_test_file = "./wikipediaOcelot.wav"

# Azure OpenAI configuration
client = AzureOpenAI(
    api_key="yourkey",
    api_version="2023-12-01-preview",
    azure_endpoint="https://instance.openai.azure.com/",
)

def transcribe_audio(file):
    # Open the audio file in binary mode and send it to the Whisper deployment
    with open(file, "rb") as audio:
        transcript = client.audio.transcriptions.create(
            file=audio,
            model=deployment_id,
            language=audio_language,
        )
    return transcript.text

print(transcribe_audio(audio_test_file))
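The translation capability mentioned earlier can be sketched in much the same way. The example below is a minimal sketch, assuming an AzureOpenAI client like the one configured above and a Whisper deployment named "whisper". The helper name translate_audio_to_english and the sample file name are illustrative and not part of the original snippet. It relies on the audio.translations endpoint of the OpenAI Python SDK, which returns English text for speech in other languages.

```python
def translate_audio_to_english(client, file_path, deployment="whisper"):
    """Translate the speech in file_path into English text.

    Assumptions: `client` is an AzureOpenAI client and `deployment`
    is the name of a Whisper deployment in your resource.
    """
    # Open the audio in binary mode and close it once the request returns
    with open(file_path, "rb") as audio:
        result = client.audio.translations.create(
            file=audio,
            model=deployment,
        )
    return result.text


# Example usage (hypothetical file name):
# print(translate_audio_to_english(client, "./interview_fr.wav"))
```

Passing the client in as an argument keeps the helper easy to reuse and to test without calling the service.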
Best practices for using the Whisper API in Azure
The Whisper API offers several optional parameters for tailoring a transcription. The prompt parameter lets you guide the transcription by supplying text that the model treats as preceding context. It is most useful for steering the spelling of names, acronyms, and domain terms, for keeping punctuation and style consistent, and for carrying context from one audio segment to the next. The prompt is not a strict instruction channel, so asking the API to ignore or exclude certain words is unlikely to work reliably; filtering specific content or handling sensitive information is better done on the transcript afterwards. Other parameters, such as temperature and response_format, control the sampling behavior and the shape of the output.
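def transcribe_audio(file):
    # Pass optional parameters to steer the transcription
    with open(file, "rb") as audio:
        transcript = client.audio.transcriptions.create(
            file=audio,
            model="whisper",  # name of your Whisper deployment
            temperature=0.5,  # sampling temperature, between 0 and 1
            prompt="your prompt text",
            # "verbose_json" adds segment-level details; "text" returns a plain string
            response_format="verbose_json",
        )
    return transcript.text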
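One common use of the prompt parameter is to supply domain vocabulary so that names and technical terms are spelled consistently. The sketch below builds such a prompt from a list of terms. The helper name build_vocabulary_prompt and the max_chars cutoff are assumptions made for illustration: OpenAI's speech-to-text guide notes that only roughly the last 224 tokens of the prompt are considered, and a character budget is used here as a rough stand-in for that limit.

```python
def build_vocabulary_prompt(terms, max_chars=800):
    """Build a short comma-separated prompt of terms to spell correctly.

    max_chars is an assumed, approximate stand-in for the prompt token
    limit; terms that would exceed it are dropped.
    """
    kept = []
    length = 0
    for term in terms:
        term = term.strip()
        if not term or term in kept:
            continue  # skip blanks and duplicates
        extra = len(term) + (2 if kept else 0)  # account for the ", " separator
        if length + extra > max_chars:
            break
        kept.append(term)
        length += extra
    return ", ".join(kept)


prompt = build_vocabulary_prompt(
    ["Azure OpenAI", "Whisper", "ocelot", "Leopardus pardalis"]
)
print(prompt)  # Azure OpenAI, Whisper, ocelot, Leopardus pardalis
```

The resulting string can be passed as the prompt argument of the transcription call.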
Preprocessing
Preprocessing in the context of audio transcription involves preparing the audio data to improve the quality and accuracy of the transcription. It's a crucial step that can significantly impact the results. Here are the main steps involved in audio preprocessing:
- Trimming: This involves removing unnecessary parts of the audio, such as silences at the beginning or end of the audio file. Trimming can help reduce the size of the audio file and also eliminate sections that might cause inaccuracies in the transcription.
- Segmentation: For long audio files, it can be beneficial to break them down into smaller, manageable segments. This can make the transcription process more efficient and also improve accuracy as it's easier to manage and process shorter audio clips.
- Audio Quality Enhancement: This may involve tasks like noise reduction, volume normalization, and echo cancellation. Improving the audio quality can significantly enhance the accuracy of the transcription.
- Audio Format Conversion: The audio files need to be in a format that is compatible with the transcription service. If they are not, they must be converted into a compatible format.
These preprocessing steps are primarily aimed at reducing potential sources of error in the transcription and making the audio data more manageable for the transcription service.
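Before uploading a file, it can help to check the format and size constraints described in the steps above. The following is a minimal sketch using only the standard library. The list of extensions and the 25 MB limit are assumptions based on commonly documented Whisper limits in Azure OpenAI and should be verified against the current documentation; the helper name check_audio_file is hypothetical.

```python
import os

# Assumed values: verify against the current Azure OpenAI documentation.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumed 25 MB upload limit


def check_audio_file(path):
    """Return a list of problems that would likely block transcription."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if not os.path.isfile(path):
        problems.append("file not found")
    elif os.path.getsize(path) > MAX_FILE_BYTES:
        problems.append("file exceeds the assumed size limit; segment it first")
    return problems
```

An empty list means the file passed these basic checks; any entries indicate that conversion or segmentation is needed first.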
PyDub is a simple, easy-to-use Python library that you can use for audio processing tasks such as slicing, concatenating, and exporting audio files.
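The trimming and segmentation steps can be sketched with PyDub as follows. This is an illustrative sketch, not a definitive implementation: the helper names segment_bounds and trim_and_split, the 60-second chunk length, and the 1-second overlap are all assumptions. It uses PyDub's default silence threshold, and PyDub needs ffmpeg installed to read formats other than WAV.

```python
def segment_bounds(total_ms, chunk_ms, overlap_ms=0):
    """Return (start, end) millisecond pairs covering [0, total_ms)."""
    if chunk_ms <= 0 or not 0 <= overlap_ms < chunk_ms:
        raise ValueError("need chunk_ms > 0 and 0 <= overlap_ms < chunk_ms")
    bounds = []
    start = 0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        bounds.append((start, end))
        if end == total_ms:
            break
        start = end - overlap_ms  # step back so adjacent chunks overlap
    return bounds


def trim_and_split(path, out_dir, chunk_ms=60_000, overlap_ms=1_000):
    """Trim leading and trailing silence, then export fixed-length WAV chunks."""
    import os
    # Imported here so segment_bounds can be used without PyDub installed.
    from pydub import AudioSegment
    from pydub.silence import detect_leading_silence

    audio = AudioSegment.from_file(path)
    lead = detect_leading_silence(audio)             # ms of silence at the start
    trail = detect_leading_silence(audio.reverse())  # ms of silence at the end
    audio = audio[lead:len(audio) - trail]

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, (start, end) in enumerate(segment_bounds(len(audio), chunk_ms, overlap_ms)):
        out_path = os.path.join(out_dir, f"chunk_{i:03d}.wav")
        audio[start:end].export(out_path, format="wav")  # PyDub slices by milliseconds
        paths.append(out_path)
    return paths
```

A small overlap between chunks reduces the chance of cutting a word in half at a boundary, at the cost of some duplicated text that may need to be merged afterwards.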
Post-processing
In the context of audio transcription, the output of the initial transcription can be further refined using language models such as GPT-3.5 or GPT-4. This step is known as post-processing.
In post-processing, the initial transcript, which could potentially contain errors or inconsistencies, is passed to the language model. The language model, guided by its training and potentially a system prompt, generates a corrected or refined version of the transcript.
This process allows for the correction of errors, better context understanding, and even the rephrasing or summarization of the content, depending on the specific system prompt provided. It is an effective way to leverage the capabilities of language models to improve the quality and usefulness of audio transcriptions.
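def generate_corrected_transcript(temperature, system_prompt, audio_file):
    # Transcribe the audio first, then ask the chat model to refine the text
    response = client.chat.completions.create(
        model="gpt4",  # name of your chat model deployment
        temperature=temperature,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": transcribe_audio(audio_file)
            }
        ]
    )
    # Return the corrected transcript produced by the model
    return response.choices[0].message.content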
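A variation on the function above separates transcription from correction, so that one transcript can be refined with several system prompts without transcribing the audio again. The sketch below is an assumption-laden illustration: the helper name correct_transcript, the example system prompt wording, and the "gpt4" deployment name are placeholders, and `client` is assumed to be an AzureOpenAI client.

```python
# Illustrative wording only; adapt it to your own domain and vocabulary.
EXAMPLE_SYSTEM_PROMPT = (
    "You are a helpful assistant that corrects transcription errors. "
    "Fix spelling and punctuation, keep the original meaning, "
    "and do not add new information."
)


def correct_transcript(client, transcript, system_prompt,
                       deployment="gpt4", temperature=0):
    """Send an existing transcript to a chat model and return the refined text."""
    response = client.chat.completions.create(
        model=deployment,          # assumed name of a chat model deployment
        temperature=temperature,   # low values keep the output close to the input
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content


# Example usage (requires a configured client and a transcript string):
# corrected = correct_transcript(client, raw_transcript, EXAMPLE_SYSTEM_PROMPT)
```

A low temperature is a reasonable default for correction tasks, where the goal is a faithful cleanup and not a creative rewrite.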
You can learn more about Whisper models in Azure OpenAI here:
Speech to text with Azure OpenAI Service - Azure OpenAI | Microsoft Learn