Apps on Azure Blog

How Microsoft 1ES uses agentic AI to take on security and compliance at scale

JennyF
Microsoft
Apr 28, 2026

Microsoft’s Customer Zero blog series gives an insider view of how Microsoft builds and operates Microsoft using our trusted, enterprise-grade AI platform. Learn best practices from our engineering teams: real-world lessons, architectural patterns, and operational strategies from pressure-tested solutions for building, operating, and scaling AI apps and agent fleets across the organization.

What we do

Within Microsoft’s One Engineering System (1ES) organization, teams build and maintain the internal engineering systems that product groups across the company rely on to ship and secure their services. These shared tools and processes support teams responsible for mission-critical products, from modern cloud-native platforms to long-lived legacy applications.

Security, compliance, and reliability work is non-negotiable at this scale. But it has to coexist with developer productivity and velocity across thousands of independently owned repositories.

The problem: the CVE and compliance treadmill

Here’s the loop we kept living:

  1. A security or compliance alert arrives, often via automation like Dependabot or a CVE finding.
  2. The version gets bumped, or the config gets nudged. CI is green. The PR merges.
  3. Production fails or the finding reopens because the fix required code changes beyond a version bump or a config flip.

This repeats across repositories, teams, and organizations. And the hard truth is that not all vulnerabilities are mechanical version bumps, and not all compliance findings are config tweaks. Many introduce behavioral or security-model changes. Automation handles the easy cases but silently fails on the hard ones.

A second pattern compounds it: when a service has 30+ open action items spanning OTel audit, identity, secret rotation, and CodeQL findings, just figuring out which ones are quick versus deep can take longer than the fixes themselves.

Multiply this across Microsoft’s repo footprint and the cost becomes months of engineering time spent on work that doesn’t ship new customer value.

But this is exactly the kind of challenge AI was made for: high-speed, high-scale evaluation and judgment calls, coached by human expertise.

Why this is solvable now

In the previous era of software development, an average CVE alert meant hours of developer toil. Three things changed at once.

Frontier models like GPT-5.5 and Claude Opus 4.7 can now reason about context, intent, and tradeoffs, not just generate code. Agent runtimes like GitHub Copilot CLI can read repositories, run tools, execute tests, and open pull requests end to end. And we’ve started encoding hard-won domain expertise as portable skills, so an agent doesn’t have to re-derive what an expert already knows.

None of these is enough alone. Frontier models without runtimes are just chat. Runtimes without skills hallucinate confidently. Skills without judgment automate the wrong thing. Together, bounded by human–AI partnership patterns that make escalation a first-class behavior, they enable a safer, more disciplined way to tackle judgment-heavy engineering work.

How we approach it: collaborate, don’t automate

The co-creative model

Instead of treating AI as a script executor, we treat agents as collaborators operating within explicit guardrails:

  • Agents propose changes based on skills and available context.
  • Humans review, approve, and retain final ownership of every change.

Skills over prompts

Agents start cold. They don’t have repo-specific context beyond the invoked skill. A skill captures the exact steps, decisions, and edge cases a human expert would apply to a specific class of problem. Skills are written once as Markdown and loaded only when needed: focused context, improved complexity handling, more predictable behavior.

We author skills with agents too, using the same operating model we use for remediation: the human owns the decision, the agent does the work, and signals feed back. That is how the skills themselves get written and refined. One of those agents, Ember, is now open-sourced on awesome-copilot.

A real example: the XStream CVE

Some CVE fixes change behavior, such as a library’s default security model, and therefore require code changes beyond just bumping the dependency version.

Take the XStream dependency update. Through version 1.4.17, XStream deserialized any class by default (default-allow). From 1.4.18 on, the security framework is default-deny, which means permitted types must be declared explicitly. Once we find the XStream call sites, we need to add type permissions after each instantiation and make sure that change propagates from test, to PR, to run.
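To make that concrete, here is a minimal sketch of what a remediated call site might look like after the version bump. The class and package names (OrderDto, com.contoso.orders) are placeholders rather than code from our repos; allowTypes and allowTypesByWildcard are XStream’s standard security-framework calls.

```java
import com.thoughtworks.xstream.XStream;

public class OrderReader {

    // Placeholder DTO standing in for whatever type this call site deserializes.
    public static class OrderDto { }

    public static Object readOrder(String xml) {
        XStream xstream = new XStream();
        // From 1.4.18 on, XStream denies all types by default, so each call
        // site must declare exactly what it is allowed to deserialize.
        xstream.allowTypes(new Class[] { OrderDto.class });
        // Alternatively, permit a trusted package subtree by wildcard.
        xstream.allowTypesByWildcard(new String[] { "com.contoso.orders.**" });
        return xstream.fromXML(xml);
    }
}
```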

This is the type of judgment-heavy work where naïve automation creates risk and blocks developers from focusing on feature work.

How execution works

  • The agent loads the relevant skill for the task at hand.
  • If it encounters ambiguity or risk, it stops and escalates rather than guessing.
  • The agent works through the required steps: compile, test, open a pull request, exactly as agreed upon in the guidance we provide.
  • After each run, the agent emits an Agent Signal: a structured self-assessment of what worked, what was hard, and where the skill fell short (a sketch of the shape follows this list). These compound across sessions so the system improves continuously.
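As an illustration, a signal might be shaped something like the record below. The field names are assumptions made for this sketch, not the actual 1ES schema.

```java
import java.util.List;

// Sketch of what a structured Agent Signal could carry after a run.
// Field names are illustrative, not the production schema.
record AgentSignal(
        String skill,              // skill loaded for the run, e.g. "xstream-cve-fix"
        String outcome,            // e.g. "pr-opened", "escalated", "blocked"
        List<String> whatWorked,   // steps that completed cleanly
        List<String> whatWasHard,  // steps that needed retries or human input
        List<String> skillGaps) {} // places where the skill's guidance fell short
```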

Autonomy is great, but trust is far better. Between the CVE context, the skills, and our working agreement with the agent, we’re creating a dynamic where the agent feels empowered to execute until it reaches a point of uncertainty. This cuts down the risk of hallucinations dramatically and scales repeatable, trustworthy execution. The most important issues get surfaced for humans in the loop, where human judgment actually matters.

Closing the loop: dev-side and ops-side

Skills and agents handle the dev-side work: CVE remediation, compliance findings, codebase changes that need judgment. On the ops side, Azure SRE Agent handles at-scale data analysis and operational toil. Same philosophy on both sides: agents act within explicit guardrails, humans own the decisions that matter, and signals from every run feed back into the system.

Then the two sides connect. Every Agent Signal our dev-side skills emit flows into Azure SRE Agent, which analyzes them at scale, identifies where skills are degrading or falling short, opens PRs against the skills themselves to fix the gaps, and sends us a daily skill-health report. The ops-side agent maintains the dev-side agents: agents improving agents, while humans review and merge every change. The same human-in-the-loop discipline that governs a CVE fix governs a skill fix.
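We won’t show Azure SRE Agent internals here, but the kind of roll-up involved is straightforward to sketch. Assuming signals shaped like the record above, a daily skill-health view might start from an aggregation like this (hypothetical, and not the product’s API):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SkillHealth {

    // Trimmed version of the AgentSignal sketch above; illustrative only.
    record AgentSignal(String skill, List<String> skillGaps) { }

    public static void main(String[] args) {
        List<AgentSignal> signals = List.of(
                new AgentSignal("xstream-cve-fix", List.of("no guidance for custom converters")),
                new AgentSignal("xstream-cve-fix", List.of()),
                new AgentSignal("secret-rotation", List.of("rollback step unclear")));

        // Count runs per skill that reported at least one gap: the raw material
        // for flagging skills that are degrading or falling short.
        Map<String, Long> gapRunsPerSkill = signals.stream()
                .filter(s -> !s.skillGaps().isEmpty())
                .collect(Collectors.groupingBy(AgentSignal::skill, Collectors.counting()));

        System.out.println(gapRunsPerSkill); // e.g. {xstream-cve-fix=1, secret-rotation=1} (order may vary)
    }
}
```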

Impact

Across Microsoft, 1ES supports teams working on hundreds of repos of varying ages and sizes. Agents provide the velocity; skills capture what is unique to each repo and team. That combination is what helps us scale across such a vast enterprise.

[Figure: Impact of frontier models, GitHub Copilot, agent skills, and Agent Signals on compliance work.]

Real engineering time saved

We’re finding 15-18 hours of manual work compressed into ~9 hours of agent+skill-assisted work – a 50-60% reduction overall, with some compliance work moving from 3-4 hours manually to 30 minutes with the agent+skill.

What devs told us

“Considering I didn’t know anything about any of this, including never having seen the IaC in question, I’d say at least a week’s worth, done in less than 10 prompts.” — Patrick, Senior Engineer

“Many times with [compliance], the actual changes are minimal, but reading the docs and knowing what applies to your app can be more time consuming… When you have 30+ action items, you need to go hunting for which one is quick versus time-consuming. This [agent+skills] saves a lot of time.” — Greg, Engineering Manager

“The [agent+skills] eliminates most early-phase toil — up to ~90% — but 0% of the last-mile effort. The bottleneck shifts entirely to validation and deployment.” — CloudBuild team

That last quote is the one we keep coming back to. The agent+skills combination doesn’t eliminate the work; it changes where the work lives. Discovery, scoping, and first-draft remediation collapse. Validation and deployment become the new ceiling. That’s the right problem to have, and it tells us where to invest next.

Security and compliance response with agents is evolving from reactive maintenance to a proactive, strategic defense capability.

What we’ve learned

On quality and trust

  • With agents, silent confidence is more dangerous than visible uncertainty. Testing agents cold exposes gaps early, before risk compounds.
  • Build uncertainty into skills, and lean on Agent Signals to capture what worked, what was hard, and where the skill fell short. When agents report honestly, the next run starts smarter than the last one.
  • Quality is measured, not assumed. We evaluate every PR on an A/B/C scale, and we run agents that evaluate other agents’ output, closing the loop between execution and assessment.

On scaling

  • Not all work should be automated. Some work requires human-AI collaboration.
  • Encoding expertise will always be more valuable than scaling generic prompts. Start with a win in one repo, then slowly scale out that skill to other teams and repos.

Where teams can start

Teams don’t adopt AI through mandates. They adopt it through trust, built on quality results in their code. Start with one team, one skill, and one real win.

  1. Identify a CVE or dependency issue that appears repeatedly across repositories.
  2. Write the fix as Markdown, as if you’re onboarding a new engineer. That’s your first skill file (a minimal sketch follows this list).
  3. Test the skill with a cold agent on a real repo with a real problem.
  4. Iterate until the agent knows both how to act and when to stop. Agents can assess their own work and flag gaps in skills.
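For step 2, here is a minimal sketch of what that first skill file might look like. The structure and wording are illustrative, not a prescribed format; write yours the way you would brief a new engineer.

```markdown
# Skill: Upgrade XStream past 1.4.17 (CVE remediation)

## When to use
The repo depends on xstream <= 1.4.17 and a CVE or Dependabot alert is open.

## Steps
1. Bump the xstream dependency to the latest 1.4.x release.
2. Find every `new XStream()` call site.
3. After each instantiation, declare permitted types explicitly
   (the new default is deny-all).
4. Compile, run the test suite, and open a PR describing the
   security-model change.

## Stop and escalate if
- A call site deserializes types you cannot enumerate.
- Tests fail for reasons unrelated to type permissions.
```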

Want to learn more?

Updated Apr 27, 2026
Version 1.0