AMA: GPT-4o Audio model revolutionizes your Copilot and other AI applications
Wednesday, Oct 09, 2024, 09:00 AM PDT

Event details
Unlock the potential of your applications with the latest GPT-4o-realtime API with Audio, available on Azure as of October 1st, 2024. Join us to explore how this model, integrated as part of the new...
EricStarker
Updated Dec 27, 2024
ryansusman
Oct 09, 2024 · Copper Contributor
Thank you for scheduling this session. We have been experimenting with some of the sample code provided by Microsoft, and it appears to be functioning well. However, we have observed instances where the model generates music-like sounds, although it is not actual music but has a tune. Additionally, there are occasions when the model changes its voice. Could you provide guidance on how we should approach grounding the outputs?
Travis_Wilson_MSFT (Microsoft) · Oct 09, 2024
Oh, I know exactly what you mean; the model can get pretty "creative" sometimes. It was even more entertaining a few weeks ago; one of its favorite pastimes was to start -- no joke -- giggling in the middle of a response. Much of this is being rapidly improved within the model itself, driven by continual new deployments.

From a consumption perspective, you can use system messages ("instructions" inside of "session.update" with the /realtime API) and few-shot examples (conversation items with example input/output) to help prime the model for better output, just like you would with, e.g., chat completions. This applies even to mundane things like retaining the same tone or voice -- responses should (and will) do a better job of not "getting distracted" all on their own, but gentle reminders surprisingly do assist, too. A rough sketch of both techniques is below.
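For anyone who wants to try this, here is a minimal Python sketch of both techniques over the /realtime WebSocket, assuming the `websockets` package. The endpoint URL, deployment name, API version, `api-key` header, voice name, and example text are all placeholders for your own Azure OpenAI resource; only the "session.update" instructions field and the few-shot conversation items come from Travis's reply, and exact payload shapes may differ by API version.

```python
# Hypothetical sketch -- endpoint, deployment, api-version, and key are
# placeholders; consult the current Azure OpenAI realtime docs for real values.
import asyncio
import json

import websockets  # pip install websockets

URL = (
    "wss://YOUR-RESOURCE.openai.azure.com/openai/realtime"
    "?deployment=gpt-4o-realtime-preview&api-version=2024-10-01-preview"
)

async def prime_session() -> None:
    async with websockets.connect(
        URL,
        additional_headers={"api-key": "YOUR-API-KEY"},  # extra_headers on websockets < 13
    ) as ws:
        # Technique 1: system-style instructions via session.update.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": (
                    "You are a concise voice assistant. Keep one calm, even "
                    "tone throughout. Never sing, hum, or make music-like "
                    "sounds, and never switch voices."
                ),
                "voice": "alloy",  # pin a single voice for consistency
            },
        }))
        # Technique 2: few-shot priming -- seed the conversation with an
        # example user turn plus the kind of assistant reply you want.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "What's the weather like?"}],
            },
        }))
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "assistant",
                "content": [{"type": "text", "text": "I don't have live weather data, but I can help you find it."}],
            },
        }))
        # From here you would request responses and stream audio as usual.

asyncio.run(prime_session())
```

The important parts are the two event types Travis named; everything else in the sketch (connection details, voice choice, example turns) is illustrative and should be adapted to your deployment.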