How we build and use Azure SRE Agent with agentic workflows
The Challenge: Ops is critical but takes time from innovation

Microsoft operates always-on, mission-critical production systems at extraordinary scale. Thousands of services, millions of deployments, and constant change are the reality of modern cloud engineering. These are titan systems that power organizations around the globe, including our own, with extremely low risk tolerance for downtime.

While operations work like incident investigation, response and recovery, and remediation is essential, it is also disruptive to innovation. For engineers, operational toil often means being pulled away from feature work to diagnose alerts, sift through logs, correlate metrics across systems, or respond to incidents at any hour. On-call rotations and manual investigations slow teams down and contribute to burnout.

What's more, in the era of AI, demand for operational excellence has spiked to new heights. It became clear that traditional human-only processes couldn't meet the scale and complexity of system maintenance, especially now that AI has increased code-shipping velocity exponentially. At the same time, we needed to integrate with an AI landscape that continues to evolve at a breakneck pace. New models, new tooling, and new best practices are released constantly, fragmenting ecosystems across observability, DevOps, incident management, and security platforms. Beyond simply automating tasks, we needed an adaptable approach that could integrate with existing systems and improve over time.

Microsoft needed a fundamentally different way to perform operations: one that reduced toil, accelerated response, and gave engineers the time to focus on building great products.

The Solution: How we build Azure SRE Agent using agentic workflows

To address these challenges, Microsoft built Azure SRE Agent, an AI-powered operations agent that serves as an always-on SRE partner for engineers.
In practice, Azure SRE Agent continuously observes production environments to detect and investigate incidents. It reasons across signals like logs, metrics, code changes, and other deployment records to perform root cause analysis. It supports engineers from triage to resolution and operates at a range of autonomy levels, from assistive investigation to automated remediation proposals. Everything occurs within governance guardrails and human approval checks grounded in role-based access controls and clear escalation paths. What's more, Azure SRE Agent learns from past incidents, outcomes, and human feedback to improve over time.

But just as important as what was built is how it was built. Azure SRE Agent was created using the agentic workflow approach: building agents with agents. Rather than treating AI as a bolt-on tool, Microsoft embedded specialized agents across the entire software development lifecycle (SDLC) to collaborate with developers, from planning through operations. The diagram above outlines the agents used at each stage of development. They come together to form a full lifecycle:

- Plan & Code: Agents support spec-driven development to unlock faster inner-loop cycles for developers and even product managers. With AI, we can not only draft spec documentation that defines feature requirements for UX and software development agents, but also create prototypes and check code into staging, which enables PMs, UX, and engineering to rapidly iterate, generate, and improve code even for early-stage merges.
- Verify, Test & Deploy: Agents for code quality review, security, evaluation, and deployment work together to shift left on quality and security issues. They also continuously assess reliability, ensure performance, and enforce consistent release best practices.
- Operate & Optimize: Azure SRE Agent handles ongoing operational work, from investigating alerts to assisting with remediation, and even resolves some issues autonomously.
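The governance model described above, where autonomy is bounded by human approval checks, can be sketched as a small policy gate. This is an illustrative sketch only, not the agent's actual implementation; the `Remediation` type, the `risk` labels, and the `approve` callback are all hypothetical names:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Remediation:
    action: str
    risk: str  # hypothetical label: "low" or "high"

def apply_remediation(fix: Remediation, approve: Callable[[Remediation], bool]) -> str:
    # Low-risk actions may run autonomously; anything riskier waits
    # for explicit human approval before it executes, and is
    # escalated (never silently applied) when approval is withheld.
    if fix.risk == "low":
        return f"auto-applied: {fix.action}"
    if approve(fix):
        return f"applied with approval: {fix.action}"
    return f"escalated for review: {fix.action}"
```

In the agent itself, the boundary between "auto-apply" and "ask a human" is grounded in role-based access controls and escalation paths rather than a single field.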
Moreover, it learns continuously over time, and we provide Azure SRE Agent with its own specialized instance of Azure SRE Agent to maintain itself and catalyze feedback loops. While agents surface insights, propose actions, mitigate issues, and suggest long-term code or IaC fixes autonomously, humans remain in the loop for oversight, approval, and decision-making when required. This combination of autonomy and governance proved critical for safe operations at scale.

We also designed Azure SRE Agent to integrate across existing systems. Our team uses custom agents, Model Context Protocol (MCP) and Python tools, telemetry connections, incident management platforms, code repositories, knowledge sources, and business process and operational tools to add intelligence on top of established workflows rather than replacing them. Built this way, Azure SRE Agent was not just a new tool but a new operational system. And at Microsoft's scale, transformative systems lead to transformative outcomes.

The Impact: Reducing toil at enterprise scale

The impact of Azure SRE Agent is felt most clearly in day-to-day operations. By automating investigations and assisting with remediation, the agent reduces burden for on-call engineers and accelerates time to resolution. Internally at Microsoft in the last nine months, we've seen:

- 35,000+ incidents handled autonomously by Azure SRE Agent.
- 50,000+ developer hours saved by reducing manual investigation and response work.
- Reduced on-call burden and faster time-to-mitigation during incidents.

To share a couple of specific cases, the Azure Container Apps and Azure App Service product group teams have had tremendous success with Azure SRE Agent. Engineers for Azure Container Apps responded overwhelmingly positively (89%) to the root cause analysis (RCA) results from Azure SRE Agent, which covered over 90% of incidents.
Meanwhile, Azure App Service has brought its time-to-mitigation for live-site incidents (LSIs) down to 3 minutes, a drastic improvement over the 40.5-hour average with human-only activity. And this impact is felt within the developer experience. When we asked developers how the agent has changed ops work, one of our engineers had this to say:

"[It's] been a massive help in dealing with quota requests which were being done manually at first. I can also say with high confidence that there have been quite a few CRIs where the agent was spot on / gave the right RCA / provided useful clues that helped navigate my initial investigation in the right direction RATHER than me having to spend time exploring all different possibilities before arriving at the correct one. Since the Agent/AI has already explored all different combinations and narrowed it down to the right one, I can pick the investigation up from there and save countless hours of log checking." - Software Engineer II, Microsoft Engineering

Beyond the impact of the agent itself, the agentic workflow process has also completely redefined how we build.

Key learnings: Agentic workflow process and impact

It's easy to think of agents as another form of advanced automation, but it's important to understand that Azure SRE Agent is also a collaborative tool. Engineers can prompt the agent during their investigations to surface relevant context (logs, metrics, and related code changes) and propose actions far faster and more easily than with traditional troubleshooting. What's more, they can also extend it for data analysis and dashboarding. Engineers can now focus on the agent's findings, approving actions or intervening when necessary. The result is a human-AI partnership that scales operations expertise without sacrificing control.
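As a rough illustration of this kind of context surfacing, one correlation step might rank recent deployments against an alert's timestamp. This is a deliberately naive sketch with hypothetical field names, not the agent's real correlation logic:

```python
def surface_context(alert_time: float, deploys: list[dict], window: float = 3600.0) -> list[dict]:
    # Keep only deployments that landed inside the lookback window
    # before the alert, most recent first, as candidate context for
    # the investigation.
    candidates = [d for d in deploys if 0 <= alert_time - d["time"] <= window]
    return sorted(candidates, key=lambda d: alert_time - d["time"])
```

The real agent reasons over far richer signals (logs, metrics, code diffs), but the shape is the same: narrow a large search space to a short, ranked list a human can act on.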
While the process took time and experimentation to refine, the payoff has been extraordinary; our team has been building high-quality features faster than ever since we introduced specialized agents for each stage of the SDLC. While these results were achieved inside Microsoft, the underlying patterns are broadly applicable.

First, building agents with agents is essential to scaling: manual development quickly became a bottleneck, and agents dramatically accelerated inner-loop iteration through code generation, review, debugging, security fixes, and more. In practice, we found that a generic agent, guided by rich context and powered by memory and learning, can continuously adapt, becoming faster and more effective over time as it builds experience. This allows the agent to apply prior knowledge, avoid relearning, and reduce the effort required to resolve similar problems repeatedly. In parallel, specialized agents bring consistency and repeatability to well-defined categories of incidents, encoding proven patterns, workflows, and safeguards. Together, these approaches enable systems that both adapt to new situations and respond reliably at scale.

Microsoft also learned to integrate deeply with existing systems, embedding agents into established telemetry, workflows, and platforms rather than attempting to replace them. Throughout this process, maintaining tight human-in-the-loop governance proved critical. Autonomy had to be balanced with clear approval boundaries, role-based access, and safety checks to build trust. Finally, teams learned to invest in continuous feedback and evaluation, using ongoing measurement to improve agents over time and understand where automation added value versus where human judgment should remain central.

Want to learn more?

Azure SRE Agent is one example of how agentic workflows can transform both product development and operations at scale. Teams at Microsoft are on a mission to lead the industry by example, not just share results.
We invite you to take the practical learnings from this blog and apply the same principles in your own environments.

- Discover more about Azure SRE Agent
- Learn about agents in DevOps tools and processes
- Read best practices on agent management with Azure

How Microsoft 1ES uses agentic AI to take on security and compliance at scale
Microsoft's Customer Zero blog series gives an insider view of how Microsoft builds and operates Microsoft using our trusted, enterprise-grade AI platform. Learn best practices from our engineering teams with real-world lessons, architectural patterns, and operational strategies for pressure-tested solutions in building, operating, and scaling AI apps and agent fleets across the organization.

What we do

Within Microsoft's One Engineering System (1ES) organization, teams build and maintain the internal engineering systems that product groups across the company rely on to ship and secure their services. These shared tools and processes support teams responsible for mission-critical products, from modern cloud-native platforms to long-lived legacy applications. Security, compliance, and reliability work is non-negotiable at this scale. But it has to coexist with developer productivity and velocity across thousands of independently owned repositories.

The problem: the CVE and compliance treadmill

Here's the loop we kept living:

1. A security or compliance alert arrives, often via automation like Dependabot or a CVE finding.
2. The version gets bumped, or the config gets nudged.
3. CI is green. The PR merges.
4. Production fails or the finding reopens, because the fix required code changes beyond a version bump or a config flip.

This repeats across repositories, teams, and organizations. And the hard truth is that not all vulnerabilities are mechanical version bumps, and not all compliance findings are config tweaks. Many introduce behavioral or security-model changes. Automation handles the easy cases but silently fails on the hard ones.

A second pattern compounds it: when a service has 30+ open action items spanning OTel audit, identity, secret rotation, and CodeQL findings, just figuring out which ones are quick versus deep can take longer than the fixes themselves.
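That quick-versus-deep sorting is itself a natural first target for agent assistance. A minimal sketch, assuming each action item carries a (hypothetical) effort estimate the agent has already produced:

```python
def triage(items: list[dict], quick_minutes: int = 30) -> tuple[list[dict], list[dict]]:
    # Split open action items into quick wins and deep work using a
    # pre-estimated effort field, so deciding what to pick up no
    # longer costs more than the easy fixes themselves.
    quick = [i for i in items if i["est_minutes"] <= quick_minutes]
    deep = [i for i in items if i["est_minutes"] > quick_minutes]
    return quick, deep
```

A team could batch the quick list into a single sweep and schedule the deep list deliberately, instead of hunting through 30+ items one by one.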
Multiply this across Microsoft's repo footprint and the cost becomes months of engineering time spent on work that doesn't ship new customer value. But this is exactly the kind of challenge AI was made for: high-speed, high-scale evaluation and judgment calls, coached by human expertise.

Why this is solvable now

In the previous era of software development, an average CVE alert meant hours of developer toil. Three things changed at once. Frontier models like GPT-5.5 and Claude Opus 4.7 can now reason about context, intent, and tradeoffs, not just generate code. Agent runtimes like GitHub Copilot CLI can read repositories, run tools, execute tests, and open pull requests end-to-end. And we've started encoding hard-won domain expertise as portable skills, so an agent doesn't have to re-derive what an expert already knows.

None of these is enough alone. Frontier models without runtimes are just chat. Runtimes without skills hallucinate confidently. Skills without judgment automate the wrong thing. Together, bounded by human-AI partnership patterns that make escalation a first-class behavior, they enable a safer, more disciplined way to tackle judgment-heavy engineering work.

How we approach it: collaborate, don't automate

The co-creative model

Instead of treating AI as a script executor, we treat agents as collaborators operating within explicit guardrails:

- Agents propose changes based on skills and available context.
- Humans review, approve, and retain final ownership of every change.

Skills over prompts

Agents start cold; they don't have repo-specific context beyond the invoked skill. A skill captures the exact steps, decisions, and edge cases a human expert would apply to a specific class of problem. Skills are written once as Markdown and loaded only when needed: focused context, improved complexity handling, more predictable behavior.

We author skills with agents too, using the same operating model we use for remediation.
The human owns the decision, the agent does the work, and signals feed back: that is how the skills themselves get written and refined. One of those agents, Ember, is now open-sourced on awesome-copilot.

A real example: the XStream CVE

Some CVEs involve changes to things like default security models, which require code changes beyond just bumping the dependency version. Take the XStream dependency update. In the previous 1.4.17 version, any class deserializes under a default-allow policy. But in the latest update, the policy changed to default-deny, meaning we need to make permitted types explicit. Once we find the XStream call sites, we need to fix type permissions after each instantiation and make sure that change propagates from test, to PR, to run. This is the type of judgment-heavy work where naive automation creates risk and blocks developers from focusing on feature work.

How execution works

The agent loads the relevant skill for the task at hand. If it encounters ambiguity or risk, it stops and escalates rather than guessing. The agent goes through the required steps (compile, test, pull request) as explicitly agreed upon in the guidance we provide. After each run, the agent emits an Agent Signal: a structured self-assessment of what worked, what was hard, and where the skill fell short. These compound across sessions so the system improves continuously.

Autonomy is great, but trust is far better. Between the CVE context, the skills, and our working agreement with the agent, we're creating a dynamic where the agent feels empowered to execute until it reaches a point of uncertainty. This dramatically cuts down the risk of hallucinations and scales repeatable, trustworthy execution. The most important issues get surfaced for humans in the loop, where human judgment actually matters.

Closing the loop: dev-side and ops-side

Skills and agents handle the dev-side work: CVE remediation, compliance findings, and codebase changes that need judgment.
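The execution contract above (load a skill, run the agreed steps, escalate at the first sign of trouble, and always report back) can be sketched as follows. The function and field names here are hypothetical; the real Agent Signals format is richer than this:

```python
import json

def run_task(task: str, skill: str, steps: list) -> str:
    # Walk the agreed steps (e.g. compile, test, pull request) in
    # order; stop and escalate at the first failure instead of
    # guessing, then emit a structured self-assessment either way.
    completed, blocked = [], None
    for name, step in steps:
        if not step():
            blocked = name
            break
        completed.append(name)
    return json.dumps({
        "task": task,
        "skill": skill,
        "what_worked": completed,
        "escalated_at": blocked,  # None means the run finished cleanly
    })
```

Because every run ends with a signal, even a blocked run leaves behind data the next run (and the skill's author) can learn from.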
On the ops side, Azure SRE Agent handles at-scale data analysis and operational toil. The same philosophy applies on both sides: agents act within explicit guardrails, humans own the decisions that matter, and signals from every run feed back into the system.

Then the two sides connect. Every Agent Signal our dev-side skills emit flows into Azure SRE Agent, which analyzes them at scale, identifies where skills are degrading or falling short, opens PRs against the skills themselves to fix the gaps, and sends us a daily skill-health report. The ops-side agent maintains the dev-side agents: agents improving agents, while humans review and merge every change. The same human-in-the-loop discipline that governs a CVE fix governs a skill fix.

Impact

Across Microsoft, 1ES supports teams working on hundreds of repos of varying ages and sizes. Agents enable velocity while skills capture each team's uniqueness, which is what helps us scale across such a vast enterprise.

Impact of frontier models, GitHub Copilot, agent skills, and Agent Signals on compliance work.

Real engineering time saved

We're finding 15-18 hours of manual work compressed into ~9 hours of agent+skill-assisted work, a 50-60% reduction overall, with some compliance work moving from 3-4 hours manually to 30 minutes with the agent+skill.

What devs told us

"Considering I didn't know anything about any of this, including never having seen the IaC in question, I'd say at least a week's worth, done in less than 10 prompts." — Patrick, Senior Engineer

"Many times with [compliance], the actual changes are minimal, but reading the docs and knowing what applies to your app can be more time consuming… When you have 30+ action items, you need to go hunting for which one is quick versus time-consuming. This [agent+skills] saves a lot of time." — Greg, Engineering Manager

"The [agent+skills] eliminates most early-phase toil — up to ~90% — but 0% of the last-mile effort.
The bottleneck shifts entirely to validation and deployment." — CloudBuild team

That last quote is the one we keep coming back to. The agent+skills doesn't eliminate the work; it changes where the work lives. Discovery, scoping, and first-draft remediation collapse. Validation and deployment become the new ceiling. That's the right problem to have, and it tells us where to invest next. Security and compliance response with agents is evolving from reactive maintenance to a proactive, strategic defense capability.

What we've learned

On quality and trust

With agents, silent confidence is more dangerous than visible uncertainty. Testing agents cold exposes gaps early, before risk compounds. Build uncertainty into skills, and lean on Agent Signals to capture what worked, what was hard, and where the skill fell short. When agents report honestly, the next run starts smarter than the last one. Quality is measured, not assumed. We evaluate every PR on an A/B/C scale, and we run agents that evaluate other agents' output, closing the loop between execution and assessment.

On scaling

Not all work should be automated; some work requires human-AI collaboration. Encoding expertise will always be more valuable than scaling generic prompts. Start with a win in one repo, then slowly scale that skill out to other teams and repos.

Where teams can start

Teams don't adopt AI through mandates. They adopt it through trust, built on quality results in their code. Start with one team, one skill, and one real win:

1. Identify a CVE or dependency issue that appears repeatedly across repositories.
2. Write the fix as Markdown, as if you're onboarding a new engineer. That's your first skill file.
3. Test the skill with a cold agent on a real repo with a real problem.
4. Iterate until the agent knows both how to act and when to stop. Agents can assess their own work and flag gaps in skills.

Want to learn more?
- Watch the demo video of the dependency update scenario
- Learn more about the co-creative framework
- Discover how the GitHub Copilot CLI can help you run and orchestrate agents
- Learn more about Agent Signals
- Learn more about Agent Skills
- Read the companion ops-side story: How we build and use Azure SRE Agent with agentic workflows