AMA: GPT-4o Audio model revolutionizes your Copilot and other AI applications
I followed the links in the email to install the Android app. It does not match your claims (it refuses to state which AI model version it's using, but does say that it's NOT GPT-4). The app also isn't really ready for release, IMHO: it truncates the end of spoken sentences (missing the last word), and locks up if you switch from voice to text and then back to voice. The login process has a weird loop that's super-confusing, and voice input has been disabled for login for no sensible reason.
As a heavy ChatGPT user, I find the AI basically useless in comparison. It refuses to answer basic questions with excuses like "I can't help you with business advice", and it doesn't understand how to follow instructions at all (e.g., tell it to stop asking you questions at the end of every response, and it keeps doing it).
Your email says "be part of the transformation!", so I asked it how I can make my value-added services available to users ... and it basically said I cannot.
What exactly does "be part of the transformation" mean? I don't want to add Copilot to my service; I want my service to be available to Copilot users. Is that going to be possible?
- Travis_Wilson_MSFT (Microsoft), Oct 09, 2024
Allan and I are part of the AI Platform team working on the Azure OpenAI Service capabilities (Copilot is a same-company internal customer of the same capabilities, now available to everyone), so we're not the best people to comment on Copilot app specifics. I will beg a bit of patience with any rough edges around the technology, though -- we released this gpt-4o-realtime-preview feature set (the beta /realtime API endpoint) simultaneously with OpenAI just last week, and I can vouch for things continuing to change *very* quickly. I'm still astounded that so many cool experiences were made possible so quickly with the underpinnings changing so rapidly!

As far as a "transformation" goes: flashy wording aside (hey, it got some attention), there really *is* some amazing potential in this kind of voice-in, voice-out interaction paradigm. When voice assistants first became popularized, many people were understandably disappointed with how "on rails" and ultimately limited some of the capabilities necessarily ended up being, given the constraints of the technology: handling truly natural speech (including interruptions, so-called disfluencies like "ums" and "ahs", speaker variations, etc.) was hard, interactions still felt very "walkie-talkie-like" in how transactional and turn-based they were, you still felt like you were choosing from a short menu of things the assistant was good at, and so on.

This new /realtime capability set, built around gpt-4o-realtime-preview, breaks through a lot of those barriers. Several people I've demoed it to have remarked that they couldn't believe it wasn't actually a pre-recorded or live person replying to them, even when they were trying it themselves, given how natural the experience felt. Aside from white-lie flattery, nobody ever *really* said that about voice assistants before.

Now, that isn't to say that everything's absolutely perfect yet -- this is a beta/preview feature area, after all! -- but even trying it out in the playground or demo apps (or seeing it in action inside Copilot, OpenAI's Advanced Voice Mode, etc.) really gives a sense that it isn't an unreasonable exaggeration to call this all "transformative."
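If you want to poke at the beta /realtime endpoint yourself, here's a minimal Python sketch of a text-only round trip over WebSocket. Treat the details as assumptions rather than gospel: the wss:// path, api-version value, api-key header, and JSON event names are taken from the public preview docs as I remember them, and (as noted above) things are changing quickly, so check the current reference before relying on any of it.

```python
# Minimal sketch: a text-only round trip against the beta /realtime endpoint.
# ASSUMPTIONS (not confirmed in this thread): the wss:// path, api-version,
# "api-key" header, and JSON event names follow the public preview docs;
# substitute your own resource and deployment names.
import asyncio
import json
import os

import websockets  # pip install websockets

RESOURCE = os.environ["AZURE_OPENAI_RESOURCE"]    # e.g. "my-aoai-resource"
API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
DEPLOYMENT = "gpt-4o-realtime-preview"            # your deployment name

URL = (
    f"wss://{RESOURCE}.openai.azure.com/openai/realtime"
    f"?api-version=2024-10-01-preview&deployment={DEPLOYMENT}"
)

async def main() -> None:
    # On websockets >= 14 the keyword is additional_headers instead.
    async with websockets.connect(URL, extra_headers={"api-key": API_KEY}) as ws:
        # Ask for a text-only response so the sketch stays audio-free; a real
        # voice app would stream input_audio_buffer.append events instead.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"], "instructions": "Say hello."},
        }))
        # The server streams JSON events; print text deltas until the turn ends.
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.text.delta":
                print(event["delta"], end="", flush=True)
            elif event.get("type") == "response.done":
                print()
                break

asyncio.run(main())
```

The same event stream carries audio when you request the audio modality (base64 chunks in response.audio.delta events, if memory serves); the text-only path just keeps the example short.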