Apps on Azure Blog

Context Engineering Lessons from Building Azure SRE Agent

sanchitmehta
Microsoft
Dec 26, 2025

We started with 100+ tools and 50+ specialized agents. We ended with 5 core tools and a handful of generalists. The agent got more reliable, not less.

We spent a long time chasing model upgrades, polishing prompts, and debating orchestration strategies. The gains were visible in offline evals, but they didn’t translate into the reliability and outcomes we wanted in production. The real breakthrough came when we started caring much more about what we were adding to the context, when, and in what form. In other words: context engineering.

Every context decision involves tradeoffs: latency, autonomy (how far the agent goes without asking), user oversight, pre-work (retrieve/verify/compute before answering), how the agent decides it has sufficient evidence, and the cost of being wrong. Push on one dimension, and you usually pay for it elsewhere.

This blog is our journey building Azure SRE Agent – a cloud AI agent that takes care of your Azure resources and handles your production incidents autonomously. We'll talk about how we got here, what broke along the way, which context patterns survived contact with production, and what we are doing next to treat context engineering as the primary lever for reliable AI-driven SRE.

Tool Explosion, Under-Reasoned

We started where everyone starts: scoped tools and prescriptive prompts. We didn't trust the model in prod, so we constrained it. Every action got its own tool. Every tool got its own guardrails.

Azure is a sprawling ecosystem - hundreds of services, each with its own APIs, failure modes, and operational quirks. Within 2 weeks, we had 100+ tools and a prompt that read like a policy manual.

The cracks showed fast. User hits an edge case? Add a tool. Tool gets misused? Add guardrails. Guardrails too restrictive? Add exceptions. The backlog grew faster than we could close it.

Worse, the agent couldn’t generalize. It was competent at the scenarios we’d already encoded and brittle everywhere else. We hadn't built an agent - we'd built a workflow with an LLM stapled on.

Insight #1: If you don’t trust the model to reason, you’ll build brittle workflows instead of an agent.

Wide Tools Beat Many Tools

Our first real breakthrough came from asking a different question: what if, instead of 100 narrow tools, we gave the model two wide ones?

We introduced `az` and `kubectl` CLI commands as first-class tools. These aren’t “tools” in the traditional sense - they’re entire command-line ecosystems. But from the model’s perspective, they’re just two entries: “execute this Azure CLI command” and “execute this Kubernetes command”.

The impact was immediate:

  • Context compression: A handful of tools instead of hundreds. Massive headroom recovered.
  • Capability expansion: The model now had access to the entire az/kubectl surface area, not just the subset we had wrapped.
  • Better reasoning: LLMs already “know” these CLIs from training data. By hiding them behind custom abstractions, we were fighting their priors.
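
To make this concrete, here is a minimal sketch of what “wide” tools can look like. The tool names, schema shape, and timeout are illustrative assumptions, not our production definitions:

```python
# Illustrative sketch: two "wide" tool definitions exposing entire CLI
# surfaces instead of hundreds of wrapped endpoints. Names and schema
# shape are assumptions, not the agent's actual registry.
import subprocess

WIDE_TOOLS = [
    {
        "name": "run_az",
        "description": "Execute an Azure CLI command, e.g. 'az webapp list -o json'.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    {
        "name": "run_kubectl",
        "description": "Execute a kubectl command, e.g. 'kubectl get pods -n prod -o json'.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
]

def run_cli(command: str, timeout_seconds: int = 120) -> str:
    """Execute an az/kubectl command and return its output for the model."""
    parts = command.split()
    if not parts or parts[0] not in {"az", "kubectl"}:
        return "error: only az and kubectl commands are allowed"
    result = subprocess.run(parts, capture_output=True, text=True, timeout=timeout_seconds)
    return result.stdout if result.returncode == 0 else f"error: {result.stderr}"
```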

This was our first hint of a deeper principle:

Insight #2: Don’t fight the model’s existing knowledge - lean on it. 

Multi-Agent Architectures: Promise, Pain, and the Pivot

Encouraged by the success of generic tools, we went further and built a full multi-agent system with handoffs. A “handoff” meant one agent explicitly transferring control - along with the running context and intermediate results - to another agent.

Human teams are organized by specialty, so we mirrored that structure: specialized sub-agents with focused personas, each owning one Azure service and handing off when investigations crossed boundaries.

The theory was elegant: lazy tool loading.

  • The orchestrator knows about sub-agents, not individual tools.
  • User asks about Kubernetes? Hand off to the K8s agent.
  • Networking question? Route to the networking agent.
  • Each agent loads only its own tools. Context stays lean.
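
For illustration, a handoff looked roughly like this - the registry, prompts, and tool names below are simplified assumptions rather than our actual implementation:

```python
# Simplified sketch of the handoff pattern described above. The registry,
# prompts, and tool names are illustrative assumptions.
SUB_AGENTS = {
    "kubernetes": {"tools": ["run_kubectl"], "prompt": "You are the AKS specialist."},
    "networking": {"tools": ["check_nsg", "trace_route"], "prompt": "You are the networking specialist."},
}

def handoff(target: str, context_summary: str, findings: list[str]) -> dict:
    """Transfer control to a sub-agent: load only its tools and pass along
    the running context and intermediate results."""
    if target not in SUB_AGENTS:
        raise ValueError(f"no sub-agent registered for '{target}'")
    agent = SUB_AGENTS[target]
    return {
        "system_prompt": agent["prompt"],
        "tools": agent["tools"],  # lazy loading: only this agent's tools enter context
        "context": {"summary": context_summary, "findings": findings},
    }
```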

It worked beautifully at small scale. Then we grew to 50+ sub-agents and it fell apart.

The results showed a bimodal distribution: when handoffs worked, everything worked; when they didn't, the agent got lost. We saw a clear cliff – problems requiring more than four handoffs almost always failed.

The following patterns emerged:

  1. Discovery problems.
    Each sub-agent only knew about the sub-agents it could directly call. Users would ask reasonable questions and get “I don’t know how to help with that” - not because the capability didn’t exist, but because the orchestrator didn’t know that the right sub-agent was buried three hops away.
  2. System prompt fragility.
    Each sub-agent has its own system prompt. A poorly tuned sub-agent doesn’t just fail locally - it affects the entire reasoning chain with its conflicting instructions. The orchestrator’s context gets polluted with confused intermediate outputs, and suddenly nothing works. One bad agent drags down the whole interaction, and we had over 50 sub-agents at this point.
  3. Infinite Loops.
    In the worst cases, agents started bouncing work around without making progress. The orchestrator would call a sub-agent, which would defer back to the orchestrator or another sub-agent, and so on. From the user’s perspective, nothing moved forward; under the hood, we were burning tokens and latency on a “you handle it / no, you handle it” loop. Hop limits and loop detection helped, but they also undercut the clean architecture of the original design.
  4. Tunnel Vision.
    Human experts have overlapping domains - a Kubernetes engineer knows enough networking to suspect a route issue, enough about storage to rule it out. This overlap makes human handoffs intelligent. Our agents had hard boundaries. They either surrendered prematurely or developed tunnel vision, chasing symptoms in their domain while the root cause sat elsewhere.

Insight #3: Multi-agent systems are hard to scale - coordination is the real work.

The failures revealed a familiar pattern. With narrow tools, we'd constrained what the model could do – and paid in coverage gaps. With domain-scoped agents, we'd constrained what it could explore – and paid in coordination overhead. Same overcorrection, different layer.

The fix was to collapse dozens of specialists into a small set of generalists. This was only possible because we already had generic tools. We also moved the domain knowledge from system prompts into files the agents could read on demand (this later morphed into an agent skills capability, inspired by Anthropic).
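
A rough sketch of what “knowledge as files” means in practice (the directory layout and helper names are hypothetical):

```python
# Hypothetical sketch: domain knowledge lives in files, and the agent pulls
# it into context only when an investigation actually needs it.
import pathlib

KNOWLEDGE_DIR = pathlib.Path("knowledge")  # e.g. knowledge/aks-networking.md

def list_skills() -> list[str]:
    """Cheap index the agent sees up front: names only, never the contents."""
    return sorted(p.stem for p in KNOWLEDGE_DIR.glob("*.md"))

def read_skill(name: str, max_chars: int = 8000) -> str:
    """Load a specific guide on demand, truncated to keep the working set small."""
    path = KNOWLEDGE_DIR / f"{name}.md"
    if not path.exists():
        return f"error: no skill named '{name}'; available: {list_skills()}"
    return path.read_text()[:max_chars]
```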

Our system evolved: fewer agents, broader tools, and on-demand knowledge replaced brittle routing and rigid boundaries. Reliability improved as we stopped depending on the handoff roulette.

Insight #4: Invest context budget in capabilities, not constraints.

A Real Example: The Agent Debugging Itself

Case in point: Our own Azure OpenAI infrastructure deployment started failing. We asked the SRE agent to debug it. 

Without any predefined workflow, it checked deployment logs, spotted a quota error, queried our subscription limits, found the correct support request category, and filed a ticket with the support team. The next morning, we had an email confirming our quota increase.

Our old architecture couldn't have done this - we had no Cognitive Services sub-agent, no support request tool. But with az as a wide tool and cross-domain knowledge, the model could navigate Azure's surface area the same way a human would.

This is what we mean by capability expansion. We never anticipated this scenario. With generalist agents and wide tools, we didn't need to.

Context Management Techniques for Deep Agents

After consolidating tools and agents, we focused on context management for long-running conversations.

1. The Code Interpreter Revelation

Consider metrics analysis. We started with the naive approach: dump all metrics into the context window and ask the model to find anomalies.

This was backwards. We were taking deterministic, structured data and pushing it through a probabilistic system. We were asking an LLM to do what a single Pandas one-liner could do. We ended up paying in tokens, latency, and accuracy (models don’t like zero-valued metrics).

Worse, it kind of worked. For short windows. For simple queries. Just enough success to hide how fundamentally wrong the approach was. Classic “works in demo, fails in prod.”

The fix was obvious in hindsight: let the model write code.

  • Don’t send 50K tokens of metrics into the context.
  • Send the metrics to a code interpreter.
  • Let the model write the pandas/numpy analysis.
  • Execute it. Return only the results and their analysis.
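
The analysis code the model writes can be as small as this sketch (the column names and 3-sigma rule are illustrative; the real queries vary by metric):

```python
# Sketch of model-written analysis: the interpreter runs this against the
# raw metrics, and only the small result string re-enters the context.
# Column names ("timestamp", "cpu_percent") are illustrative.
import pandas as pd

def find_anomalies(metrics_csv_path: str) -> str:
    df = pd.read_csv(metrics_csv_path, parse_dates=["timestamp"])
    # Flag points more than 3 standard deviations from the mean.
    z = (df["cpu_percent"] - df["cpu_percent"].mean()) / df["cpu_percent"].std()
    anomalies = df.loc[z.abs() > 3, ["timestamp", "cpu_percent"]]
    if anomalies.empty:
        return "no anomalies beyond 3 sigma"
    return anomalies.to_string(index=False)
```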

Metrics analysis had been our biggest source of tool failures. After this change: zero failures. And because we weren’t paying the token tax anymore, we could extend time ranges by an order of magnitude.

Insight #5: LLMs are orchestrators, not calculators.
Use them to decide what computation to run, then let actual code perform the computation.

2. Planning and Compaction

We also added two other patterns: a todo-style planner and more aggressive compaction.

  • Todo planner: Represent the plan as an explicit checklist outside the model’s context, and let the model update it instead of re-deriving the workflow on every turn.
  • Compaction: Continuously shrink history into summaries and structured state (e.g., key incident facts), so the context stays a small working set rather than an ever-growing log.
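
A minimal sketch of that externalized state, with illustrative field names rather than our production schema:

```python
# Illustrative sketch: the plan and compacted history live outside the
# model's context and are re-injected as a small rendered summary each turn.
from dataclasses import dataclass, field

@dataclass
class InvestigationState:
    todos: list = field(default_factory=list)   # [{"step": str, "done": bool}]
    facts: dict = field(default_factory=dict)   # key incident facts
    history_summary: str = ""                   # rolling compaction of old turns

    def render_for_context(self, max_chars: int = 2000) -> str:
        remaining = [t["step"] for t in self.todos if not t["done"]]
        summary = (
            f"Summary so far: {self.history_summary}\n"
            f"Known facts: {self.facts}\n"
            f"Remaining steps: {remaining}"
        )
        return summary[:max_chars]
```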

Insight #6: Externalizing plans and compacting history effectively “stretch” the usable context window.

3. Progressive Disclosure with Files

With code interpretation working, we hit the next wall: tool calls returning absurd amounts of data.

Real example: an internal App Service Control Plane log table against which a user fires off a SELECT *-style query. The table has ~3,000 columns, so a single-digit number of log entries expands to 200K+ tokens. The context window is gone. The model chokes. The user gets an error.

Our solution was session-based interception.

Tool calls that can return large payloads never go straight into context. Instead, they are written as a “file” into a sandboxed environment where the data can be:

  • Inspected ("what columns exist?")
  • Filtered ("show only the error-related columns")
  • Analyzed via code ("find rows where latency > p99")
  • Summarized before anything enters the model’s context
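
A sketch of the interception layer - the size threshold, paths, and field names are assumptions for illustration:

```python
# Illustrative sketch: large tool outputs are written to a session file and
# the model receives only a handle plus a small schema preview.
import json
import pathlib
import uuid

SESSION_DIR = pathlib.Path("/tmp/sre-agent-sessions")

def intercept_tool_output(rows: list[dict], threshold_bytes: int = 50_000) -> dict:
    raw = json.dumps(rows)
    if len(raw) < threshold_bytes:
        return {"inline": rows}  # small payloads still go straight into context
    SESSION_DIR.mkdir(parents=True, exist_ok=True)
    handle = f"session-{uuid.uuid4().hex[:8]}.json"
    (SESSION_DIR / handle).write_text(raw)
    # Only this compact reference enters the model's context.
    return {
        "session_file": handle,
        "row_count": len(rows),
        "column_preview": sorted(rows[0].keys())[:20] if rows else [],
    }
```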

The model never sees the raw 200K tokens. It sees a reference to a session and a set of tools to interact with that session. We turned an unbounded context explosion into a bounded, interactive exploration. You have seen this pattern with coding agents, and the idea here is similar: can the model find its way through a large amount of data on its own?

Insight #7: Treat large tool outputs as data sources, not context.

4. What's Next: Tool Call Chaining

The next update we’re working on is tool call chaining. The idea started with our effort to express Troubleshooting Guides (TSGs) as code.

A lot of agent workflows are predictable: “run this query, fetch these logs, slice this data, summarize the result.” Today, we often force the model to walk that path one tool call at a time:

Model → Tool A → Model → Tool B → Model → Tool C → Model → … → Response

The alternative:

Model → [Script: Tool A → Tool B → Tool C → … → Final Output] → Model → Response

The model writes a small script that chains the tools together. The platform executes the script and returns consolidated results. Three roundtrips become one. Context overhead drops by 60–70%.
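
Conceptually, the script the model emits might look like this sketch, where the tool functions are hypothetical stand-ins for the platform’s real interfaces:

```python
# Speculative sketch of tool-call chaining: the model writes one script,
# the platform executes it, and only the consolidated result returns.
def chained_investigation(run_az, run_kubectl, summarize):
    """One roundtrip instead of several: Tool A -> Tool B -> Tool C in a script."""
    deployment_logs = run_az(
        "az webapp log deployment show --name myapp --resource-group myrg"
    )
    pod_status = run_kubectl("kubectl get pods -n prod -o json")
    quota = run_az("az vm list-usage --location eastus -o json")
    # Only this summary goes back into the model's context.
    return summarize(deployment_logs, pod_status, quota)
```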

This also unlocks something subtle: deterministic workflows inside probabilistic systems. Long-running operations that must happen in a specific order can be encoded as scripts. The model decides what should happen; the script guarantees how it happens. Anthropic recently published a similar capability, programmatic tool calling.

The Meta Lesson

Six months ago, we thought we were building an SRE agent. In reality, we were building a context engineering system that happens to do Site Reliability Engineering.

Better models are table stakes, but what moved the needle was what we controlled: generalist capabilities and disciplined context management.

Karpathy’s analogy holds: if the context window is the agent’s “RAM,” then context engineering is memory management: what to load, what to compress, what to page out, and what to compute externally. As the window fills up, model quality often drops non-linearly - “lost in the middle,” “not adhering to my instructions,” and plain old long-context degradation show up well before you hit the advertised limits. More tokens don’t just cost latency; they quietly erode accuracy.

We’re not done. Most of what we have done is “try it, observe, watch it break, tighten the loop”. But the patterns that keep working - wide tools, code execution, context compaction, tool chaining - are the same ones we see rediscovered across other agent stacks. In the end, the throughline is simple: give the model fewer, cleaner choices and spend your effort making the context it sees small, structured, and easy to operate on.
