Blog Post

AI - Azure AI services Blog
3 MIN READ

Introducing the GPT-4o-Audio-Preview: A New Era of Audio-Enhanced AI Interaction

Allan_Carranza's avatar
Jan 22, 2025

We are thrilled to announce the release of audio support accessible via Chat Completions API featuring the new GPT-4o-Audio preview Model, now available in preview. Building on to our recent launch of GPT-4o-Realtime-Preview, this groundbreaking addition to the GPT-4o family introduces support for audio prompts and the ability to generate spoken audio responses. This expansion enhances the potential for AI applications in text and voice-based interactions and audio analysis. Starting today, developers can unlock immersive, voice-driven experiences by harnessing the advanced capabilities of GPT-4o-Audio-Preview, now in public preview.

Key Benefits of GPT-4o-Audio-Preview


Chat Completions API with GPT-4o-Audio Preview model is designed to transform the way users interact with AI by incorporating natural audio elements, adding depth to applications that require nuanced understanding and response generation.

  • Engaging Spoken Summaries: GPT-4o-Audio-Preview can generate spoken summaries from text content, offering a dynamic, engaging way to present information. This feature is ideal for applications that benefit from audio-based delivery, such as digital assistants, interactive training modules, and accessibility solutions.
  • Sentiment Analysis from Audio: With the ability to detect sentiment in audio recordings, this model can analyze vocal nuances and translate them into meaningful, text-based insights. This is particularly valuable for customer service and support applications, where understanding tone and mood can enhance user satisfaction and personalize responses.
  • Asynchronous Speech-In, Speech-Out Interactions: GPT-4o-Audio-Preview enables seamless asynchronous voice interactions, supporting applications where users can submit spoken queries or commands and receive spoken responses at a later time. This capability enhances user convenience and opens up possibilities for hands-free, voice-enabled applications in diverse environments.

 

Exploring Real-World Application of GPt-4o-Audio-Preview

1. Create Immersive Stories from Existing Text

With the GPT-4o-Audio-Preview model, businesses can revolutionize content delivery by converting text articles into engaging spoken summaries. This feature caters to users who prefer listening over reading, creating a more immersive storytelling experience. For example, news websites can offer audio summaries of their articles, allowing users to stay informed while driving, exercising, or multitasking.

2. Improve Customer Support via Audio Analysis

Understanding customer sentiment is crucial for enhancing service quality and user satisfaction. GPT-4o-Audio-Preview can analyze recorded customer conversations to detect sentiment and emotional nuances. This capability helps businesses identify areas of improvement, personalize responses, and develop more effective customer support strategies. For instance, a call center can use this technology to assess the mood of customers during interactions and adjust their approach accordingly.

3. Enhance Interactive Education and Training Modules

Educational institutions and corporations can leverage GPT-4o-Audio-Preview to create interactive and dynamic training modules. This model can generate spoken explanations, quizzes, and feedback, making learning more engaging and accessible. For example, an online course platform can offer audio-based lessons and assessments that cater to auditory learners, enhancing the overall educational experience.

Comparing Realtime API to Chat Completions API


The GPT 4o models associated with Realtime API and Chat Completions API both support audio and speech capabilities, each offering unique functionalities for AI-driven user experiences. However, they serve distinct purposes:

  • Realtime API with model GPT-4o-Realtime-Preview: Optimized for real-time, low-latency conversations, focusing on enabling natural back-and-forth interactions with minimal delay, ideal for chatbots and conversational AI systems.
  • Chat Completions API with model GPT-4o-Audio-Preview: Tailored for processing and generating audio content, supporting advanced features like speech recognition and audio synthesis, making it ideal for asynchronous speech-in, speech-out interactions and audio sentiment analysis.

 Ready to get started?

Updated Jan 23, 2025
Version 3.0
  • Marcus_aiaisir's avatar
    Marcus_aiaisir
    Copper Contributor

    Is there any information or forecasts as to when the GPT-4o Realtime Preview model will be available in Germany?

  • alexoli97's avatar
    alexoli97
    Copper Contributor

    When will it be available in Standard instead of Global Standard? 

  • Good one for the gpt 40 model to have different flavours. Considering there are still few things that are very brittle for the realtime model to not be stable in protection - Voice Activity detection (being silent at times), audio transcription being very sensitive, gathering additional conversational metrics using transcripts (which has tons of false positives with transcription not being right) - This audio model can be an additional async call to gather sentiment, analyze transcript. If there are any blogs from Microsoft connecting these two would be very helpful!