Microsoft Foundry Blog

Introducing TSGen: Automated TSG Generation @ Scale – Built by AI

Daniel-Genkin-MSFT
Apr 03, 2026

Transforming Cloud Incident Management Through Intelligent Automation

This post is a follow-up to the previous write-up at https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/the-future-of-ai-autonomous-agents-for-identifying-the-root-cause-of-cloud-servi/4412494?previewMessage=true. If you haven’t read it yet, it provides the background on why we started building TSGen, our Troubleshooting Guide Generator, and the core idea behind automated, scalable generation. This post outlines how we built TSGen on a cross-discipline team using AI for both research and engineering workflows, focusing on the core algorithm and some preliminary results.

The Challenge: Why Manual Troubleshooting Guides Fall Short

Operating cloud services at scale presents unique challenges for incident management. When issues arise, engineers rely on Troubleshooting Guides (TSGs) to diagnose and resolve problems quickly. However, manual TSG creation and maintenance can create bottlenecks. TSGs are often siloed across different platforms, making them difficult to locate during critical incidents. The content itself tends to be inconsistently structured between silos and is sometimes incomplete, requiring engineers to interpret ambiguous instructions under time pressure. An internal study of more than 4,000 TSGs mapped to thousands of incidents showed that well-maintained TSGs significantly reduce mitigation effort, but that TSG quality varies dramatically. Engineers surveyed about TSG effectiveness consistently report outdated information, missing steps, and a lack of clarity. These quality gaps lead to longer incident resolution times, increased engineer fatigue, and higher operational costs.

 

Figure 1: Categorizing Weight of TSG Aspects on Time to Mitigate and On-Call Experience

The Solution: Automating Troubleshooting Guide creation with AI

The core technical innovation is the use of an AI system that automatically synthesizes high‑quality, structured Troubleshooting Guides (TSGs) directly from historical incident data, rather than relying on manual authoring. TSGen ingests diverse operational signals—such as past IcM incidents identified via monitor IDs or custom Kusto queries—and produces end‑to‑end, action‑oriented troubleshooting workflows within minutes. This shifts TSG creation from a labor‑intensive, error‑prone documentation task into an automated knowledge synthesis problem, enabling consistent structure and coverage across services.

A second key innovation is operational scalability with continuous relevance. TSGen is designed not only to generate new TSGs, but to keep them up‑to‑date as new incidents occur, addressing the chronic issue of stale or incomplete troubleshooting documentation. The system has already demonstrated practical effectiveness in pilot deployments, with dozens of generated TSGs accepted and published for real on‑call usage, showing that AI‑generated artifacts can meet production engineering standards rather than serving as drafts or suggestions.

Finally, TSGen explicitly targets dual consumption by humans and AI agents, generating structured outputs that are useful both for on‑call engineers and for automated agents involved in incident diagnosis. This positions TSGs as a shared, machine‑readable knowledge layer rather than static documents, reducing “tribal knowledge” and enabling faster, more reliable incident response at scale across Microsoft services.
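As a rough illustration of what a guide consumable by both humans and agents might look like, here is a minimal Python sketch. The `Guide` and `Step` classes and their field names are assumptions made for this example; they are not TSGen's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """One troubleshooting step (hypothetical shape, not TSGen's schema)."""
    check: str   # what the responder or agent should verify
    action: str  # what to do based on the check


@dataclass
class Guide:
    """A guide that can be rendered for people or serialized for agents."""
    title: str
    steps: list[Step] = field(default_factory=list)

    def to_json(self) -> str:
        # Machine-readable form for automated agents.
        return json.dumps(asdict(self), indent=2)

    def to_text(self) -> str:
        # Human-readable form for on-call engineers.
        lines = [self.title]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"{number}. Check: {step.check}")
            lines.append(f"   Action: {step.action}")
        return "\n".join(lines)


# Illustrative data only.
guide = Guide(
    title="High request latency",
    steps=[Step("Is the queue backlog growing?", "Scale out the workers.")],
)
print(guide.to_text())
print(guide.to_json())
```

Keeping a single structured source and deriving both views from it is one way to avoid the human-facing and agent-facing versions drifting apart.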

 

 

Figure 2: Zeroing in on the most common issues found in TSGs

TSGen's Five-Step Automated Workflow

TSGen addresses the manual TSG challenge through a sophisticated five-step automated workflow that transforms incident data into executable troubleshooting guides:

  • Step 1, Collection, gathers incident data from multiple sources including diagnostic logs, historical tickets, and troubleshooting documentation. This comprehensive data aggregation creates the foundation for intelligent TSG generation.
  • Step 2, Filtering, removes noise and irrelevant information from the collected data. Machine learning algorithms identify which incident attributes are most relevant for troubleshooting, eliminating false signals that could lead to incorrect guidance.
  • Step 3, Core Incident Selection, identifies representative incidents that exemplify common problem patterns. Rather than processing every incident individually, TSGen selects the most informative examples that capture the essential troubleshooting logic.
  • Step 4, Data Distillation, extracts key troubleshooting patterns and actionable steps from the selected incidents. This process analyzes successful resolution paths to identify the critical diagnostic checks and mitigation actions.
  • Step 5, TSG Generation, synthesizes the distilled information into structured, actionable troubleshooting guides. The output is a well-formatted TSG that engineers can follow systematically during incident response.

 

Figure 3: The TSGen Five-Step Workflow
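To make the shape of the five steps concrete, here is a toy Python sketch. It is not TSGen's algorithm: the `Incident` fields, the function names, and the simple rules standing in for the machine-learning components (dropping incidents without a resolution, ranking steps by frequency) are all assumptions chosen to keep the example small.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""
    incident_id: str
    symptom: str
    resolution_steps: list[str] = field(default_factory=list)


def collect(*sources: list[Incident]) -> list[Incident]:
    """Step 1, Collection: merge sources, keeping the first record per ID."""
    seen: dict[str, Incident] = {}
    for source in sources:
        for incident in source:
            seen.setdefault(incident.incident_id, incident)
    return list(seen.values())


def filter_incidents(incidents: list[Incident]) -> list[Incident]:
    """Step 2, Filtering: drop incidents with no recorded resolution."""
    return [i for i in incidents if i.resolution_steps]


def select_core(incidents: list[Incident], symptom: str, limit: int = 3) -> list[Incident]:
    """Step 3, Core Incident Selection: keep the most detailed matches."""
    matching = [i for i in incidents if i.symptom == symptom]
    matching.sort(key=lambda i: len(i.resolution_steps), reverse=True)
    return matching[:limit]


def distill(incidents: list[Incident]) -> list[str]:
    """Step 4, Data Distillation: rank steps by how many incidents used them."""
    counts = Counter(
        step
        for incident in incidents
        # dict.fromkeys removes repeats within one incident, preserving order
        for step in dict.fromkeys(incident.resolution_steps)
    )
    return [step for step, _ in counts.most_common()]


def generate(symptom: str, steps: list[str]) -> str:
    """Step 5, TSG Generation: render the ranked steps as a numbered guide."""
    lines = [f"TSG: {symptom}"]
    lines += [f"{n}. {step}" for n, step in enumerate(steps, start=1)]
    return "\n".join(lines)


# Illustrative data only.
tickets = [
    Incident("1", "high latency", ["Check dashboard", "Restart worker"]),
    Incident("2", "high latency", ["Check dashboard"]),
    Incident("3", "high latency", []),
]
logs = [
    Incident("2", "high latency", ["Check dashboard"]),
    Incident("4", "disk full", ["Clear temp files"]),
]

collected = collect(tickets, logs)
core = select_core(filter_incidents(collected), "high latency")
steps = distill(core)
tsg = generate("high latency", steps)
print(tsg)
```

Each stage takes the previous stage's output as plain data, so in a sketch like this any one stage could be swapped for a learned or LLM-based component without touching the others.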

From Manual to Automated: Real-World Impact

The shift from manual TSG creation to automated TSG maintenance delivers measurable benefits for incident management operations. Teams using automated TSG maintenance report significant reductions in time-to-mitigation for common incident types. Because automated maintenance gives all TSGs consistent formatting and reliable information, engineers spend less time searching for relevant documentation and interpreting ambiguous instructions, and can focus on complex problem-solving.

 

Figure 4: With the presence of a TSG, incident mitigation time decreases by ~40%

Industry-Wide Implications for Cloud Operations

TSGen represents a broader trend toward intelligent automation in cloud operations. The challenges of maintaining high service availability while managing complex distributed systems affect organizations across industries. As cloud infrastructure grows in scale and complexity, the volume of potential incidents grows with it, and traditional manual approaches cannot keep pace. Automated TSG generation offers a scalable solution that improves with the volume of data it processes. Each incident handled by the system contributes to its collection of incident knowledge, creating a positive feedback loop for ever-improving TSGs. This scalability benefit is particularly valuable for organizations operating multiple services or supporting global customer bases.

The technology also democratizes incident management expertise. In traditional models, effective troubleshooting requires deep institutional knowledge that takes years to develop. Automated systems capture and codify this expertise, making it accessible to engineers at all experience levels. This knowledge transfer capability reduces dependency on veteran engineers and accelerates onboarding for new team members.

Key Benefits of Automated TSG Generation

Automated TSG generation delivers multiple strategic advantages for organizations managing cloud infrastructure:

  • Faster incident resolution reduces service disruptions and improves customer experience
  • Improved TSG quality through continuous learning ensures troubleshooting guidance remains accurate and comprehensive
  • Reduced operational costs result from decreased manual documentation maintenance and shorter incident durations
  • Enhanced engineer productivity allows technical teams to focus on innovation rather than repetitive troubleshooting tasks
  • Knowledge preservation captures institutional expertise in executable form, protecting organizations from knowledge loss when engineers transition
  • Scalability enables consistent incident management across growing infrastructure without proportional headcount increases
  • Data-driven insights from automated systems reveal patterns in incident types and resolution effectiveness, informing preventive measures

How We Built This Iteration

This iteration was developed in VS Code using Copilot CLI, with Claude models (including Opus 4.6) for implementation support and rapid iteration. It focused on three things: improving output quality from the core algorithm; improving engineering efficiency and speed of iteration by migrating from Node.js to Python to simplify the codebase and speed up experimentation; and deploying a new agentic playground to make it easier for teams across Microsoft to experiment and help beta test.

Learnings and Recommendations from Building with AI

This iteration was both a research and an engineering project. Our cross-discipline team leveraged AI at every level of development. The majority of the code that we developed for this new iteration was created by AI, allowing us to iterate and develop faster. A few practical learnings helped us get better outcomes and avoid rework:

  • Create a solid plan up front for each major change.
    • We used the “Plan” mode in VS Code to have Claude models help define what we wanted to build in a form the AI could act on.
    • For example, when we converted the codebase from Node.js to Python, we made a new dedicated plan.
  • Be detailed in the initial description and write down explicit requirements as bullet points (including edge cases and non-goals).
    • Our initial prompt to generate the plan was quite long. However, it was not highly structured. We focused primarily on getting the information into the AI, rather than giving it an actual handmade plan.
    • For example, we included snippets such as what the new folder structure should be and that there should be no regressions in functionality.
  • If you already know how you want something to work, state it directly - specific instructions beat vague intent.
    • Often models can produce solutions that you weren’t expecting. This can be good at times and inconvenient at other times. So, if you know what your end goal looks like, give code pointers and specific details for what functions should be named, what they should do, etc.
  • During plan creation, answer follow-up questions with as much context as you can, so assumptions don’t creep in.
    • When designing a plan, the AI will often ask follow-up questions. We treated these as an opportunity to elaborate. When the model asks a follow-up question, don’t be shy about giving it a lot of information. This helps ensure that it delivers a result close to what you are expecting.
  • Read the plan critically and “negotiate” it as you go: treat the AI like a junior developer and make expectations explicit.
    • After you have a plan, read it in full so that small miscommunications don’t slip through. As with the previous point, you want everything to be sufficiently detailed, and you also want the details to align with what you are trying to create.
  • If a model isn’t producing good results, switch models and try again.
    • Sometimes bringing in a new model can have the same effect as bringing in a different engineer with a fresh set of eyes, especially given how quickly new models are released.
  • When the model is missing context, give hints about where to look (files, folders, examples, or the specific component to start from) so it can ground its plan.
    • If a model is asking questions or creating code that does not align with your interpretation of the plan, try to ground the plan in examples. This can go a long way toward clearing up miscommunications.
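One way to apply the advice about explicit requirements, non-goals, and pointers is to assemble the plan request from labeled sections instead of free-form text. The helper below is a hypothetical sketch, not a tool we used: the function name, the section headings, and the example non-goal and pointer are all assumptions made for illustration.

```python
def build_plan_request(goal, requirements, edge_cases=(), non_goals=(), pointers=()):
    """Assemble a plan request as plain text with explicit, bulleted sections."""
    sections = [
        ("Requirements", requirements),
        ("Edge cases", edge_cases),
        ("Non-goals", non_goals),
        ("Where to look", pointers),
    ]
    lines = [f"Goal: {goal}"]
    for heading, items in sections:
        if items:  # leave out sections that have nothing in them
            lines.append("")
            lines.append(f"{heading}:")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


# Illustrative values only.
request = build_plan_request(
    goal="Convert the codebase from Node.js to Python",
    requirements=[
        "No regressions in functionality",
        "Follow the new folder structure",
    ],
    non_goals=["Adding new features during the migration"],
    pointers=["Start from the core algorithm module"],
)
print(request)
```

The resulting text can be pasted into a planning prompt; empty sections are omitted so the model is not handed headings with nothing under them.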

Looking Forward: The Future of Intelligent Incident Management

The evolution of TSG automation points toward increasingly autonomous incident management systems. Current systems like TSGen focus on automating TSG generation and execution for known incident patterns. Future developments will likely expand into autonomous root cause analysis and predictive incident prevention. Advanced AI agents could execute complex diagnostic workflows without human intervention, escalating only when novel situations arise that require human judgment. Natural language processing capabilities will enable engineers to interact with troubleshooting systems conversationally, asking questions and receiving context-aware guidance. The integration of reinforcement learning could allow systems to optimize troubleshooting strategies in real time based on success rates. These systems might automatically adjust their approaches when initial steps prove ineffective, exploring alternative resolution paths intelligently.

Another promising direction involves cross-system learning, where troubleshooting knowledge from one service or organization informs incident management in others. This collective intelligence approach could accelerate the development of effective troubleshooting strategies industry-wide. The ultimate vision is incident management systems that continuously improve, require minimal human oversight, and prevent problems before they impact customers.

Further Reading

- How Microsoft is Using AI Agents to Transform Cloud Incident Management 

- AutoTSG: Learning and Synthesis for Incident Troubleshooting

 

Updated Apr 03, 2026
Version 1.0