The Future of AI: Computer Use Agents Have Arrived

Microsoft

Apr 07, 2025

The Future of AI blog series is an evolving collection of posts from the AI Futures team in collaboration with subject matter experts across Microsoft. In this series, we explore tools and technologies that will drive the next generation of AI. Explore more at: https://aka.ms/the-future-of-ai


On March 11 we announced the availability of Azure OpenAI Service's Responses API, which includes a new type of agent: a Computer Use Agent, or CUA. CUAs can literally use a computer - launching apps, navigating websites, and reasoning their way through tasks. You have to see it to believe it. In today's blog we'll explore some common questions about Computer Use Agents.

[https://www.youtube.com/watch?v=3jXgyvCFkk0]

The CUA in the Responses API is not the only one out there; there are now a few on the market. Two others I've used heavily are open source: browser-use and UI-focused (aka UFO) agents.

How Do CUAs Work?

All CUAs use the vision capabilities of multimodal models to interpret what's happening on the screen, and they combine that with an AI agent framework that can plan tasks and reason out what to do next. Some CUAs, like the Responses API, can control any type of computer or virtual machine and rely solely on computer vision to understand the screen. Others, like browser-use and UFO agents, take other cues from the systems they're controlling. These CUAs that use "hints" from the system they're controlling can be more accurate, but they tend to be constrained as to the types of systems they support.
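The screenshot-plan-act cycle described above can be sketched in a few lines of Python. This is only an illustration of the general loop, not how the Responses API or any other CUA is actually implemented. The names `Action`, `run_cua_loop`, and `scripted_model` are hypothetical, and the "model" here is a stub that replays a fixed script instead of calling a real multimodal model.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """One step the agent wants to take (hypothetical shape)."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


def run_cua_loop(goal, take_screenshot, decide, execute, max_steps=10):
    """Observe the screen, ask the model for the next action, act, repeat."""
    history = []
    for _ in range(max_steps):
        screenshot = take_screenshot()               # observe
        action = decide(goal, screenshot, history)   # plan / reason
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                              # act
    return history


# Stand-in for a multimodal model: replays a fixed script (assumption).
def scripted_model(goal, screenshot, history):
    script = [
        Action("click", x=120, y=40),
        Action("type", text="hello"),
        Action("done"),
    ]
    return script[len(history)]


executed = []
history = run_cua_loop(
    goal="Search for hello",
    take_screenshot=lambda: b"",   # a real agent would capture pixels here
    decide=scripted_model,
    execute=executed.append,       # a real agent would move the mouse or type
)
print([a.kind for a in history])
```

The `max_steps` cap is a design choice worth keeping in any real loop, since an agent that reasons its way into a dead end should stop rather than click forever.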

UFO agents can control a Windows computer or virtual machine, because they use the Windows API to help them understand what's going on. Browser-use agents can only control a browser, not a whole computer or VM - but they use the structure of the web page, called the DOM, in addition to computer vision to help them determine where they can click. You can see that in action in the video below - when a browser-use agent controls a page, it renders boxes around the areas of the page that are clickable.

[https://youtu.be/lRv31JF4emY?si=FmT8btQ7TuYYVqIa]
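To give a feel for how the DOM can act as a "hint", here is a minimal sketch that scans HTML for elements a user could click and gives each one an index, much like the numbered boxes in the video. It is not browser-use's actual implementation. The `ClickableFinder` class and the rules for what counts as clickable are simplifying assumptions, and it uses only Python's standard `html.parser` rather than a live browser.

```python
from html.parser import HTMLParser

# Tags treated as clickable in this sketch (an assumption, not a standard).
CLICKABLE_TAGS = {"a", "button", "select", "textarea"}


class ClickableFinder(HTMLParser):
    """Collects elements that look interactive, in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_clickable = (
            tag in CLICKABLE_TAGS
            or (tag == "input" and attrs.get("type") != "hidden")
            or "onclick" in attrs
            or attrs.get("role") == "button"
        )
        if is_clickable:
            index = len(self.elements)
            self.elements.append({"index": index, "tag": tag, "attrs": attrs})


def find_clickable(html):
    finder = ClickableFinder()
    finder.feed(html)
    return finder.elements


page = """
<div>
  <a href="/cart">Cart</a>
  <input type="hidden" name="csrf" value="x">
  <input type="text" name="q">
  <button id="go">Search</button>
  <span role="button">More</span>
  <p>Plain text</p>
</div>
"""

elements = find_clickable(page)
for element in elements:
    print(element["index"], element["tag"])
```

An agent can then tell the model "click element 2" instead of guessing pixel coordinates, which is one reason DOM-aware agents can be more accurate inside a browser.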

How to Use CUAs

Today, CUAs are in their infancy, and you use them by downloading and implementing one of the above-mentioned tools, or by using prebuilt capabilities like Operator in OpenAI’s ChatGPT. I anticipate they will soon be everywhere - in every OS, every browser, every phone OS. In the future, you may use these agents to order from eCommerce sites, arrange travel, and make restaurant reservations, and soon, you won't remember how you ever lived without them.

Do CUAs Replace Robotic Process Automation (RPA)?

For many years, people have used Robotic Process Automation systems like UiPath and Microsoft's own Power Automate Desktop to automate applications. These tools can control computers in similar ways to CUAs, clicking through apps and websites, but they lack a reasoning capability - so if they reach a screen that looks different from what they were programmed on, they often fail.

But CUAs do not mean the end of RPA. In fact, the two will likely be complementary. As a general rule, one should only use agents to perform tasks that require reasoning. If you're building an automation that is deterministic, where the tasks are predefined and the screens are not expected to change dramatically, RPA is the right choice. Agents like CUAs can make mistakes while reasoning out tasks, whereas traditional RPA executes rote steps one by one, never deviating from its instructions.

Many RPA vendors, including UiPath, are already building CUA capabilities into their automation systems. This means you can use them to design processes that are semi-deterministic, where a large portion of the process can be accomplished just following programmed instructions, but where some of the process requires reasoning. This will tend to deliver the best of both worlds.
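A semi-deterministic process like this can be sketched as a simple dispatcher: scripted steps run as-is, and the agent is called only for steps that have no script or whose script fails because the screen changed. This is a hypothetical illustration, not the design of UiPath or Power Automate Desktop. The `ScreenChanged` exception, `run_step`, and the step names are all made up for the example.

```python
class ScreenChanged(Exception):
    """Raised by a scripted step when the UI no longer matches its recording."""


def run_step(step, scripted_handlers, agent_fallback):
    """Prefer the deterministic script; fall back to the reasoning agent."""
    handler = scripted_handlers.get(step)
    if handler is not None:
        try:
            return ("rpa", handler())
        except ScreenChanged:
            pass  # the script could not cope, so let the agent try
    return ("agent", agent_fallback(step))


# Hypothetical scripted steps.
def open_invoice_app():
    return "app opened"


def click_export():
    raise ScreenChanged("Export button not found")


scripted = {"open_app": open_invoice_app, "export": click_export}


def fake_agent(step):
    # Stand-in for a CUA call (assumption).
    return f"agent handled {step}"


steps = ["open_app", "classify_invoice", "export"]
results = [run_step(step, scripted, fake_agent) for step in steps]
for result in results:
    print(result)
```

Keeping the agent as the fallback rather than the default means the predictable parts of the process stay cheap and repeatable, and reasoning is spent only where it is needed.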

Isn't It Inefficient for Agents to Click on User Interfaces Intended for Humans?

Yes, sometimes it is inefficient. When it's possible to use an agent framework like Semantic Kernel and connect it directly to a service via an API, you should do that - it's far more efficient and robust. CUAs often struggle with things like date pickers, for example, whereas agents can usually call APIs with date parameters quite competently. But some sites and apps just do not have publicly accessible APIs, and in those cases, a CUA is a good choice.
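The date picker case makes the difference concrete. The sketch below compares a single API payload with the sequence of clicks a CUA might need in a calendar widget. Both functions are hypothetical: `api_request` assumes a service that accepts an ISO date field named `reservation_date`, and `date_picker_actions` assumes a simple picker with month arrows and day cells, which real widgets may not match.

```python
from datetime import date


def api_request(target: date) -> dict:
    """One structured parameter; nothing to click (field name is assumed)."""
    return {"reservation_date": target.isoformat()}


def date_picker_actions(shown: date, target: date) -> list:
    """Clicks needed in a hypothetical month-by-month calendar widget."""
    months_ahead = (target.year - shown.year) * 12 + (target.month - shown.month)
    arrow = "next_month" if months_ahead >= 0 else "previous_month"
    actions = ["click:open_calendar"]
    actions += [f"click:{arrow}"] * abs(months_ahead)
    actions.append(f"click:day_{target.day}")
    return actions


shown = date(2025, 4, 7)
target = date(2025, 7, 15)

payload = api_request(target)
clicks = date_picker_actions(shown, target)
print(payload)
print(len(clicks), clicks)
```

Each of those clicks is a separate screenshot, model call, and chance to miss, which is why the API route tends to be more efficient and robust when it exists.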

A middle ground is emerging with the /llms.txt proposal. An llms.txt file is a text file that a site or app publishes with information an agent can use to understand it without having to visually parse it. Over time, we expect this - or something like it - to emerge as a standard so that a site or app can be accessible both to humans and to agents. Nothing is set in stone yet, though.
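As a rough idea of how an agent might consume such a file, here is a small parser. It assumes the Markdown-style layout the proposal describes at the time of writing: a `#` title, a `>` summary, and `##` sections containing link lists. Since nothing is final, treat the layout, the `parse_llms_txt` function, and the sample content (including the example.com URLs) as illustrative assumptions.

```python
import re

# Matches "- [Title](url): optional notes" (assumed link-list format).
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<notes>.*))?$"
)


def parse_llms_txt(text):
    """Extract the title, summary, and per-section links from the file."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc["sections"][current].append(match.groupdict())
    return doc


sample = """# Example Store

> An online shop with a public catalog and order tracking.

## Docs

- [Catalog API](https://example.com/docs/catalog.md): How to search products
- [Order status](https://example.com/docs/orders.md)
"""

doc = parse_llms_txt(sample)
print(doc["title"])
for link in doc["sections"]["Docs"]:
    print(link["title"], "->", link["url"])
```

With the links in hand, an agent can fetch the pages it needs as text instead of taking screenshots and clicking through the site.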