
Azure AI Foundry Blog

Real-time Speech Transcription with GPT-4o-transcribe and GPT-4o-mini-transcribe using WebSocket

mrajguru
Microsoft
May 02, 2025

Azure OpenAI has expanded its speech recognition capabilities with two powerful models: GPT-4o-transcribe and GPT-4o-mini-transcribe. These models leverage WebSocket connections to enable real-time transcription of audio streams, giving developers cutting-edge tools for speech-to-text applications. In this technical blog, we'll explore how these models work and demonstrate a practical implementation in Python.

Understanding the Azure OpenAI Realtime Transcription API

Unlike the regular REST API for audio transcription, Azure OpenAI's Realtime API enables continuous streaming of audio data through WebSockets or WebRTC connections. This approach is particularly valuable for applications requiring immediate transcription feedback, such as live captioning, meeting transcription, or voice assistants. 

The key difference between the standard transcription API and the Realtime API is that transcription sessions don't return model-generated responses; they focus exclusively on converting speech to text in real time.
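
For comparison, here is a minimal sketch of the standard, non-streaming flow using the openai Python SDK's AzureOpenAI client. It assumes the same environment variables used later in this post, a deployment named gpt-4o-mini-transcribe, and a placeholder API version and file name. The whole file is uploaded in one request and the transcript comes back in a single response, rather than arriving as a live stream.

import os
from openai import AzureOpenAI

# Standard (batch) transcription, shown for contrast with the Realtime API.
# The API version, deployment name, and file name below are illustrative assumptions.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_STT_TTS_KEY"],
    api_version="2025-03-01-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_STT_TTS_ENDPOINT"],
)

with open("meeting.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",     # the name of your Azure deployment
        file=audio_file,
    )
print(result.text)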

GPT-4o-transcribe and GPT-4o-mini-transcribe: Feature Overview

Azure OpenAI has introduced two specialized transcription models:

  1. GPT-4o-transcribe: The full-featured transcription model with high accuracy
  2. GPT-4o-mini-transcribe: A lighter, faster model with slightly reduced accuracy but lower latency

Both models connect through WebSockets, enabling developers to stream audio directly from microphones or other sources for immediate transcription. These models are designed specifically for the Realtime API infrastructure.

Setting Up the Environment

First, we need to set up our Python environment with the necessary libraries:

 

import os
import json
import base64
import threading
import pyaudio
import websocket
from dotenv import load_dotenv

load_dotenv('azure.env')  # Load environment variables from azure.env

OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_STT_TTS_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("❌ AZURE_OPENAI_STT_TTS_KEY is missing!")

# WebSocket endpoint for the Azure OpenAI Realtime API (transcription intent)
url = f"{os.environ.get('AZURE_OPENAI_STT_TTS_ENDPOINT').replace('https', 'wss')}/openai/realtime?api-version=2025-04-01-preview&intent=transcription"
headers = {"api-key": OPENAI_API_KEY}
# Audio stream parameters (16-bit PCM, 24 kHz mono)
RATE = 24000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024

audio_interface = pyaudio.PyAudio()
stream = audio_interface.open(format=FORMAT,
                              channels=CHANNELS,
                              rate=RATE,
                              input=True,
                              frames_per_buffer=CHUNK)

 

Establishing the WebSocket Connection

The following code establishes a connection to OpenAI's Realtime API and configures the transcription session:

 

def on_open(ws):
    print("Connected! Start speaking...")
    session_config = {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",
            "input_audio_transcription": {
                "model": "gpt-4o-mini-transcribe",
                "prompt": "Respond in English."
            },
            "turn_detection": {"type": "server_vad", "threshold": 0.5, "prefix_padding_ms": 300, "silence_duration_ms": 200}
        }
    }
    ws.send(json.dumps(session_config))

    def stream_microphone():
        try:
            while ws.keep_running:
                audio_data = stream.read(CHUNK, exception_on_overflow=False)
                audio_base64 = base64.b64encode(audio_data).decode('utf-8')
                ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": audio_base64
                }))
        except Exception as e:
            print("Audio streaming error:", e)
            ws.close()

    threading.Thread(target=stream_microphone, daemon=True).start()

 

Processing Transcription Results

This section handles the incoming WebSocket messages containing the transcription results:

 

def on_message(ws, message):
    try:
        data = json.loads(message)
        event_type = data.get("type", "")
        print("Event type:", event_type)
        #print(data)   
        # Stream live incremental transcripts
        if event_type == "conversation.item.input_audio_transcription.delta":
            transcript_piece = data.get("delta", "")
            if transcript_piece:
                print(transcript_piece, end=' ', flush=True)
        if event_type == "conversation.item.input_audio_transcription.completed":
            print(data["transcript"])
        if event_type == "item":
            transcript = data.get("item", "")
            if transcript:
                print("\nFinal transcript:", transcript)

    except Exception:
        pass  # Ignore unrelated events

 

Error Handling and Cleanup

To ensure proper resource management, we implement handlers for errors and connection closing:

 

def on_error(ws, error):
    print("WebSocket error:", error)

def on_close(ws, close_status_code, close_msg):
    print("Disconnected from server.")
    stream.stop_stream()
    stream.close()
    audio_interface.terminate()

 

Running the WebSocket Client

Finally, this code initiates the WebSocket connection and starts the transcription process:

 

print("Connecting to OpenAI Realtime API...")
ws_app = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)

ws_app.run_forever()

 

Analyzing the Implementation Details

Session Configuration

Let's break down the key components of the session configuration:

  1. input_audio_format: Specifies "pcm16" for 16-bit PCM audio
  2. input_audio_transcription:
    • model: Specifies "gpt-4o-mini-transcribe" (could be replaced with "gpt-4o-transcribe" for higher accuracy)
    • prompt: Provides instructions to the model ("Respond in English")
    • language: Optionally specify a language code such as "hi" (Hindi); set it to null to let the model detect the spoken language automatically.
  3. input_audio_noise_reduction: The type of noise reduction to apply. near_field is for close-talking microphones such as headsets; far_field is for far-field microphones such as laptop or conference-room microphones.
  4. turn_detection: Configures how speech turns are detected, using either Server VAD or Semantic VAD. It can be set to null to turn detection off, in which case the client must manually signal the end of each speech turn. Server VAD ("server_vad", used above) detects the start and end of speech based on audio volume and produces a transcript at the end of the user's speech. Semantic VAD is more advanced: it uses a turn-detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on that probability. For example, if the user's audio trails off with "uhhm", the model scores a low probability of turn end and waits longer for the user to continue speaking. This can make conversations feel more natural, but may add latency. A fuller session configuration combining these options is sketched below.
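
Putting these options together, here is a minimal sketch of a fuller session update. The language code, noise-reduction type, and the switch to gpt-4o-transcribe are illustrative assumptions; adjust them for your deployment.

import json

# A fuller transcription_session.update payload, assembled from the options described above.
session_config = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",   # swap back to "gpt-4o-mini-transcribe" for lower latency
            "prompt": "Respond in English.",
            "language": "hi"                # or None to let the model auto-detect the language
        },
        # Readers in the comments below report that including this block caused sessions to fail
        # around late May 2025, so treat it as optional and test it against your resource.
        "input_audio_noise_reduction": {"type": "far_field"},
        "turn_detection": {
            "type": "server_vad",           # "semantic_vad" is the alternative (its parameters differ)
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 200
        }
    }
}

payload = json.dumps(session_config)        # send this over the WebSocket, as in on_open() above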

Audio Streaming

The implementation uses a threaded approach to continuously stream audio data from the microphone to the WebSocket connection. Each chunk of audio is:

  1. Read from the microphone
  2. Encoded to base64
  3. Sent as a JSON message with the "input_audio_buffer.append" event type

Transcription Events

The system processes several types of events from the WebSocket connection (illustrative payload shapes are sketched after the list):

  1. conversation.item.input_audio_transcription.delta: Incremental updates to the transcription
  2. conversation.item.input_audio_transcription.completed: Complete transcripts for a segment
  3. item: Final transcription results
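
The payload shapes below are assumptions inferred from the fields the handler reads ("delta" and "transcript"); they are not a complete schema of the Realtime API messages.

# Illustrative shapes of the two transcription events handled above.
delta_event = {
    "type": "conversation.item.input_audio_transcription.delta",
    "delta": "Hello wor"            # an incremental fragment of the transcript
}
completed_event = {
    "type": "conversation.item.input_audio_transcription.completed",
    "transcript": "Hello world, this is a test."   # the final text for the finished segment
}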

Customization Options

The example code can be customized in several ways:

  1. Switch between models (gpt-4o-transcribe or gpt-4o-mini-transcribe)
  2. Adjust audio parameters (sample rate, channels, chunk size)
  3. Modify the prompt to provide context or language preferences
  4. Configure noise reduction for different environments
  5. Adjust turn detection for different speaking patterns

Deployment Considerations

When deploying this solution in production, consider:

  1. Authentication: Securely store and retrieve API keys
  2. Error handling: Implement robust reconnection logic (see the sketch after this list)
  3. Performance: Optimize audio parameters for your use case
  4. Rate limits: Be aware of Azure OpenAI's rate limits for the Realtime API
  5. Fallback strategies: Implement fallbacks for connection drops
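
As a starting point for the reconnection logic mentioned above, here is a minimal sketch that wraps run_forever() in a retry loop with exponential backoff. The delay values are arbitrary, and production code would also need to reopen the PyAudio stream (on_close above tears it down) and decide when to stop retrying. It reuses url, headers, and the handlers defined earlier in this post.

import time
import websocket

# Minimal reconnection sketch: retry the WebSocket connection with exponential backoff.
def run_with_reconnect(max_delay=60):
    delay = 1
    while True:
        ws_app = websocket.WebSocketApp(
            url,
            header=headers,
            on_open=on_open,
            on_message=on_message,
            on_error=on_error,
            on_close=on_close,
        )
        ws_app.run_forever()        # returns when the connection drops or is closed
        print(f"Connection lost; retrying in {delay} seconds...")
        time.sleep(delay)
        delay = min(delay * 2, max_delay)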

 

Conclusion

GPT-4o-transcribe and GPT-4o-mini-transcribe represent significant advances in real-time speech recognition technology. By leveraging WebSockets for continuous audio streaming, these models enable developers to build responsive speech-to-text applications with minimal latency.

The implementation showcased in this blog demonstrates how to quickly set up a real-time transcription system using Python. This foundation can be extended for various applications, from live captioning and meeting transcription to voice-controlled interfaces and accessibility tools.

As these models continue to evolve, we can expect even better accuracy and performance, opening up new possibilities for speech recognition applications across industries.

Remember that when implementing these APIs in production environments, you should follow Azure OpenAI's best practices for API security, including proper authentication and keeping your API keys secure.

 

Here is the link to the end-to-end code.

 

Thanks

Manoranjan Rajguru

https://www.linkedin.com/in/manoranjan-rajguru/

Updated Jun 04, 2025
Version 2.0

7 Comments

  • Thanks for the tutorial! I think OpenAI changed something recently. Previously, I could get the streaming response delta, but when I tried it a few days ago, I only received the full response transcript.
    Has the request format changed, or has the way the model responds been updated?

    csaiki

    mrajguru, thanks for your tutorial, but I think that something happened with Azure OpenAI Services and the transcription no longer works.
    I'm having exactly the same issue as reported here https://learn.microsoft.com/en-us/answers/questions/2279581/openai-realtime-service-not-running-properly?comment=question&translated=false#newest-question-comment

    Could you let me know what I can do to overcome this issue?
    Thank you in advance.

    • antti_saarela

      We are facing exactly the same issue and behavior. The sample code worked nicely until around 23rd May, but has not worked ever since.

      BR,
      Antti Saarela

      • csaiki

        antti_saarela, the issue is with this part:

        "input_audio_noise_reduction": {"type": "near_field"}

        If you remove it, it will work.
        Not sure why this wasn't documented, though, or why there aren't better error logs.
        Happy coding!