The Future of AI: Autonomous Agents for Identifying the Root Cause of Cloud Service Incidents

Microsoft

May 12, 2025

The Future of AI blog series is an evolving collection of posts from the AI Futures team in collaboration with subject matter experts across Microsoft. In this series, we explore tools and technologies that will drive the next generation of AI. Explore more at: https://aka.ms/the-future-of-ai

Introduction: Incident Handling at Microsoft

Microsoft is committed to maximizing uptime for all its services, including Azure, Xbox, and Bing. This requires extensive monitoring to detect and resolve issues before they affect customers. We handle over a hundred thousand service alerts monthly, most of which are early warnings managed internally. These incidents can be caused by configuration bugs, infrastructure issues, or dependency failures, each of which needs specific mitigation steps. When a new incident is created, on-call engineers must act quickly to prevent it from escalating into customer impact or an outage.

For diagnosis and mitigation of incidents, engineers primarily rely on manually authored troubleshooting guides (TSGs). In theory, these guides are quick and effective tools for documenting how to triage an incident (understand what is going on) and then how to handle it. In practice, because they are manually crafted, these guides are often outdated, incomplete, hard to understand without sufficient background in the problem, or difficult to locate, as they tend to be scattered across various platforms. This results in unnecessary toil and increases the time from incident creation to mitigation.

Lately, with the rise of generative AI apps and agents, we have started to develop new solutions that improve the quality of these TSGs and help on-call engineers find and execute the required steps faster. In principle, the entire flow can be automated with agents, from improving existing TSGs so that on-call engineers can act quickly and decisively, all the way to having AI agents execute actions such as running queries against logs and analyzing the outputs.

Our solution involves the following:

Use AI to assess and improve existing TSGs
Create enhanced TSGs and executable workflows that AI agents can use to triage incidents on behalf of the on-call engineer
Create AI agent and computer-use systems to highlight relevant information and knowledge to the on-call engineer
Automatically save all new knowledge created from mitigating new incidents in a way that AI systems can use with future incidents

Figure 1: Replacing the incident mitigation flow with a new agentic AI-based solution. The new flow augments the existing incident mitigation flow by integrating agentic AI systems into every step of the process, maximizing on-call engineer efficiency by keeping TSGs up to date and making them quick to discover.

Improving Existing TSGs

Before it is possible to enable advanced agentic AI automation, we must ensure that the knowledge base for the AI is up to the task. To do this, we created an AI-based system that evaluates existing TSGs against a set of predefined categories and heuristics. For example, the system judges whether a TSG includes all the information needed to be actionable (such as inline query templates or references to other required documentation) and whether it is clear and concise.

For each category, we send a prompt built from its heuristics, along with the existing TSG, to an AI model (GPT-4o from Azure OpenAI Service) hosted in Azure AI Foundry to obtain a score and recommendations for improving the TSG in that category. We can then either manually or automatically update the TSG to implement the recommendations with the help of another prompt (discussed in the “Capturing New Knowledge” section below). For example, while authoring TSGs, developers can use a new Azure AI Foundry extension for Visual Studio Code that we created to assess TSG content quality in real time. This extension analyzes the content and suggests improvements based on the same categories and heuristics. Additionally, we developed a similar system that prevents TSG quality regression during the “Pull Request” process in Azure DevOps. This integration ensures that only TSGs rated above a quality threshold can be saved and distributed to on-call engineers.
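The per-category assessment and the pull request gate described above can be sketched roughly as follows. This is a minimal illustration, not the actual system: the category names, the 1 to 5 score scale, the JSON reply format, and the `min_score` threshold are all assumptions, and the model call itself is left out so the sketch stays self-contained.

```python
import json
from dataclasses import dataclass

# Hypothetical categories and heuristics; the real set is not listed in this post.
CATEGORIES = {
    "actionability": "Does the TSG include inline query templates and "
                     "references to other required documentation?",
    "clarity": "Is the TSG clear and concise?",
}


@dataclass
class Assessment:
    category: str
    score: int             # assumed scale: 1 (poor) to 5 (excellent)
    recommendations: list  # free-text suggestions from the model


def build_prompt(category: str, tsg_text: str) -> str:
    """Compose the assessment prompt for one category."""
    return (
        f"Rate the troubleshooting guide below for '{category}'.\n"
        f"Heuristic: {CATEGORIES[category]}\n"
        'Reply as JSON: {"score": <1-5>, "recommendations": [<strings>]}\n\n'
        f"--- TSG ---\n{tsg_text}"
    )


def parse_assessment(category: str, reply: str) -> Assessment:
    """Parse the model's JSON reply and validate the score range."""
    data = json.loads(reply)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return Assessment(category, score, list(data.get("recommendations", [])))


def passes_gate(assessments, min_score: int = 3) -> bool:
    """Pull request gate: every category must reach the minimum score."""
    return all(a.score >= min_score for a in assessments)
```

In use, the string returned by `build_prompt` would be sent to the model, the reply passed to `parse_assessment`, and the pull request blocked whenever `passes_gate` returns `False`.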

With the introduction of these new systems, we have found the average TSG quality goes up and, with higher quality TSGs, the time to resolve service incidents decreases. However, we can take it a step further. Beyond just generating better TSGs, we can also generate executable workflows that allow AI agents to take automated actions described in the TSG content.

Generating Enhanced TSGs and Executable Workflows

With a sufficient knowledge base of high-quality TSGs, we can start generating actionable steps that AI agents can execute. We have multiple research and engineering teams across Microsoft that have been involved in this process. They have been trying and benchmarking several ways to maximize the effectiveness and reliability of using agents for this process.

In this research, we have found that TSGs serve as a good starting point for instructing AI on what to do. However, the inherent ambiguity of natural language can lead to unexpected results when agents execute the steps. Therefore, we have developed two approaches that distill the information contained in the TSGs into forms that improve reliability: natural language-based (text-based) reasoning with intermediate distillation steps, and graph-based reasoning.

This work is still in the research stage, but early results are promising.

Natural Language-based Reasoning

Text-based reasoning keeps the information from the TSGs as natural language and chunks, classifies, and vectorizes it to support accurate reasoning during root cause analysis. A classic example is a retrieval-augmented generation (RAG) database. We have created and experimented with various types of RAG databases as well as with other approaches. For example, one approach iteratively prompted a model (o3-mini) to summarize TSGs, folding one more TSG into the running summary with each iteration and refining it as it went. This resulted in the model combining information from several TSGs into a single aggregated result.
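The iterative summarization approach can be pictured as a simple fold over the TSGs. The sketch below is only an illustration under assumptions: `aggregate_tsgs` and `toy_refine` are hypothetical names, and `toy_refine` is a deterministic stand-in for the model call (it merely merges distinct lines) so that the example runs without any service.

```python
from typing import Callable, Iterable


def aggregate_tsgs(tsgs: Iterable[str],
                   refine: Callable[[str, str], str]) -> str:
    """Fold several TSGs into one summary, one TSG per iteration.

    `refine(summary, tsg)` returns an updated summary; in a real system
    it would wrap a prompt to a language model.
    """
    summary = ""
    for tsg in tsgs:
        summary = refine(summary, tsg)
    return summary


def toy_refine(summary: str, tsg: str) -> str:
    """Stand-in for a model call: keep each distinct line once, in order."""
    lines = summary.splitlines()
    for line in tsg.splitlines():
        line = line.strip()
        if line and line not in lines:
            lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    guides = [
        "Check the alert dashboard\nQuery the error logs",
        "Query the error logs\nRestart the affected service",
    ]
    print(aggregate_tsgs(guides, toy_refine))
```

Passing the refinement step in as a callable keeps the loop independent of any particular model or prompt, which makes it easy to benchmark alternatives.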

Graph-based Reasoning

In our ongoing efforts to advance incident management, we have incorporated graph-based reasoning techniques to help on-call engineers investigate and resolve operational incidents. Our methodology models the actions taken by engineers as a knowledge graph, which captures the relationships and dependencies among pieces of information spread across the heterogeneous, multi-modal sources used during debugging. This structure lets our system traverse complex workflows and automatically map and suggest relevant actions and solutions in real time. Ultimately, a graph data structure not only streamlines investigation and resolution but also paves the way toward a fully automated troubleshooting agent, making incident management faster and more scalable.

Figure 2: Learned troubleshooting workflows at different granularity. Each node corresponds to one action, and the links represent the sequence of steps taken during the investigation process.
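To make the graph idea concrete, here is a minimal sketch in which each node is an action and each weighted edge counts how often one action followed another in past investigations. The class name, the action names, and the frequency-based ranking are assumptions made for illustration; the actual knowledge graph is richer than this.

```python
from collections import defaultdict


class WorkflowGraph:
    """Directed graph of actions; edge weights count observed transitions."""

    def __init__(self):
        # _edges[a][b] = number of times action b directly followed action a
        self._edges = defaultdict(lambda: defaultdict(int))

    def add_investigation(self, actions):
        """Record one past investigation as an ordered list of actions."""
        for current, following in zip(actions, actions[1:]):
            self._edges[current][following] += 1

    def suggest_next(self, action, k=3):
        """Return up to k most frequent follow-up actions (ties by name)."""
        followers = self._edges.get(action, {})
        ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
        return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    graph = WorkflowGraph()
    graph.add_investigation(["check_alert", "query_logs", "restart_service"])
    graph.add_investigation(["check_alert", "query_logs", "rollback_config"])
    graph.add_investigation(["check_alert", "check_dependencies"])
    print(graph.suggest_next("check_alert"))
```

Ranking by transition count is the simplest possible policy; a production system would presumably also condition on incident details rather than on the previous action alone.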

Highlighting Relevant Information and Knowledge

Another key optimization opportunity that we identified in the incident mitigation process is finding the right TSG. Each incident that we receive is slightly different, and looking at earlier similar cases might not always be sufficient to help mitigate new ones, especially since code is always evolving. To address this, we have started to use generative AI here as well.

We created new AI tools, and integrated our internal Microsoft Copilot, to give the on-call engineer a ChatGPT-like experience that is grounded in past incidents and TSG knowledge. This knowledge is stored in a RAG database, similar to the one we use during TSG generation (as described earlier). However, this database is designed for on-call engineers to ask questions about what certain errors mean or how to resolve incidents. The engineer can also directly ask the AI for a link to the relevant TSGs, or to create a new one on the fly from the new incident’s details. This lets the engineer get answers faster without needing to skim through several TSGs. They can just ask in natural language and get what they are looking for directly!
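The retrieval step behind such a chat experience can be sketched as ranking TSGs by similarity to the engineer's question. The example below swaps in a plain bag-of-words cosine similarity so that it runs with the standard library alone; a real RAG setup would typically use learned embeddings and a vector index instead. The TSG titles and contents are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=2):
    """Return titles of the k TSGs most similar to the question.

    `documents` maps a TSG title to its text; TSGs with no overlap
    with the question are never returned.
    """
    q = tokenize(question)
    scored = [(cosine(q, tokenize(body)), title)
              for title, body in documents.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [title for score, title in scored[:k] if score > 0]


if __name__ == "__main__":
    tsgs = {
        "Disk full TSG": "disk usage high free space cleanup logs",
        "Cert expiry TSG": "certificate expired renew tls handshake failure",
    }
    print(retrieve("tls handshake failure after certificate rotation", tsgs, k=1))
```

The retrieved TSG text would then be placed into the model's prompt alongside the question, which is what grounds the answer in existing knowledge.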

Capturing New Knowledge

With all the tools discussed so far (TSG improvement, generating executable workflows from TSGs, and highlighting relevant TSGs), the on-call engineer has many ways to improve their on-call effectiveness and experience. However, we cannot guarantee that all the information required to solve every incident will always be available. This could be due to code changes (such as a new feature being rolled out), a new security initiative, or simply an issue that has never happened before. In such cases, we must make sure to continually update the TSGs. These systems are useless without a sufficient number of high-quality, up-to-date TSGs powering them.

To ensure that new knowledge is never lost, we created a fully automated system that monitors all new incidents and creates TSGs based on the discussion and handling that lead to each incident's mitigation. Moreover, we enable the on-call engineer to flag certain incidents as extra important to ensure that any new knowledge discovered in the incident is included in a new TSG. This system combines all the previously discussed systems to generate TSGs, automatically assess them, improve them, and submit them for human approval.
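The generate, assess, improve, and submit cycle can be outlined as a small loop. This is a sketch under assumptions: the `generate`, `assess`, and `improve` callables stand in for the model-backed systems described earlier, and the score scale, `min_score`, `max_rounds`, and status string are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    incident_id: str
    text: str
    score: int = 0
    status: str = "draft"


def capture_knowledge(incident_id: str,
                      discussion: str,
                      generate: Callable[[str], str],
                      assess: Callable[[str], int],
                      improve: Callable[[str], str],
                      min_score: int = 4,
                      max_rounds: int = 3) -> Draft:
    """Generate a TSG draft, improve it until it scores well, then queue it."""
    draft = Draft(incident_id, generate(discussion))
    draft.score = assess(draft.text)
    rounds = 0
    while draft.score < min_score and rounds < max_rounds:
        draft.text = improve(draft.text)
        draft.score = assess(draft.text)
        rounds += 1
    # Nothing is published automatically: a human always approves the result,
    # and the final score travels with the draft for the reviewer to see.
    draft.status = "pending_human_approval"
    return draft
```

Capping the number of improvement rounds keeps the loop from running indefinitely when a draft cannot reach the target score on its own.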

The TSGs that this system generates can then be paired with human-written TSGs to produce TSGs that update automatically as the product evolves. This completes the loop: the on-call engineer oversees the entire process, while the AI does the tedious work of creating, maintaining, and eventually using TSGs, giving time back to the engineer to develop great new features.

Next Steps

These projects are still very early in development. However, they already show significant promise. By using AI, the on-call engineers are empowered to spend time more effectively. Instead of scrambling through outdated documentation, they can now ask a natural language question and dive deeper, if necessary. Soon they might not need to do anything but just press a button to approve and then supervise the AI while it acts on its own. Overall, these systems can help to reduce the total time that it takes to resolve an incident as well as ensure that future similar incidents are handled consistently and prevented ahead of time if possible.

As with any new system, tuning and testing are critical. As we continue to develop these systems, we will continue to test and fine-tune the results. As we do this, we are updating the categories and heuristics, the models, and the prompts, and seeking feedback from our on-call engineers. We are also working to improve our integrations with existing internal tools (such as the incident management portal) to further streamline the on-call engineer experience. Overall, this is an extremely interesting problem to solve and yet another novel application of AI.