As we were building out the partner ecosystem for Azure SRE Agent, we ran into a hard question: how can two agents working on the same problem share context, and persist that context once the incident is resolved? This post covers the architecture we built: direct real-time communication for speed, and a shared context layer built on systems you already have set up, like PagerDuty, GitHub Issues, or ServiceNow.
Your Azure SRE agent detects a spike in error rates. It triages with cloud-native telemetry, but the root cause trail leads into a third-party observability platform your team also runs. The agent can't see that data. A second agent can, one that speaks Datadog or Dynatrace or whatever your team chose. The two agents talk to each other using protocols like MCP or directly via an API endpoint and come up with a remediation. The harder question is what happens to the conversation afterward.
TL;DR
Two AI agents collaborate on incidents using two communication paths: a direct real-time channel (MCP) for fast investigation, and a shared memory layer that writes to systems your team already uses, like PagerDuty, GitHub Issues, or ServiceNow. No new tools to adopt. No ephemeral conversations that vanish when the incident closes.
The problem
Most operational AI agents work in isolation. Your cloud monitoring agent doesn't have access to your third-party observability stack. Your Datadog specialist doesn't know what your Azure resource topology looks like. When an incident spans both, a human has to bridge the gap manually. At 2 AM. With half the context missing.
And even when two agents do exchange information directly, the conversation is ephemeral. The investigation ends, the findings disappear. The next on-call engineer sees a resolved alert with no record of what was tried, what was found, or why the remediation worked. The next agent that hits the same pattern starts over from scratch.
What we needed was somewhere for both agents to persist their findings, somewhere humans could see it too. And we really didn't want to force teams onto a new system just to get there.
Two communication paths
Direct agent-to-agent (real-time)
During an active investigation, the primary agent calls the partner agent directly. The partner runs whatever domain-specific analysis it's good at (log searches, span analysis, custom metric queries) and returns findings in real time. This is the fast path.
The direct channel uses MCP, so any partner agent can plug in without custom integration work. The primary agent doesn't need to understand the internals of Datadog or Dynatrace. It asks questions, gets answers.
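Under the hood, an MCP tool invocation is a JSON-RPC 2.0 `tools/call` request. A minimal sketch of what the primary agent might send; the `search_logs` tool name and its arguments are hypothetical stand-ins for whatever tools the partner agent actually exposes:

```python
import json

def build_partner_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize an MCP-style JSON-RPC 2.0 tool call for a partner agent."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# The primary agent asks a question; it never touches Datadog internals.
request = build_partner_call(
    "search_logs",
    {"query": "status:error service:checkout", "window": "15m"},
)
```

The primary agent only needs to know which tools the partner advertises, not how they're implemented.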
Shared memory (durable)
After the direct exchange, both agents write their actions and findings to external systems that humans already use. This is the durable path, the one that creates audit trails and makes handoffs work.
The shared memory backends are systems your team already has open during an incident:
| Backend | What gets written | Good fit for |
|---|---|---|
| Incident platform (e.g., PagerDuty) | Timeline notes, on-call handoff context | Teams with alerting-centric workflows |
| Issue tracker (e.g., GitHub Issues) | Code-level findings, root cause analysis, action comments | Teams with dev workflow integration |
| ITSM system (e.g., ServiceNow) | Work notes, ITSM-compliant audit trail | Enterprise IT, regulated industries |
The important thing: this doesn't require a new system. Agents write to whatever your team already uses.
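One way to sketch that backend-agnosticism, assuming a minimal connector interface; the class and method names here are illustrative, not a real SDK, and real connectors would call the PagerDuty or GitHub APIs instead of appending to an in-memory list:

```python
from dataclasses import dataclass, field
from typing import Protocol

class Backend(Protocol):
    """Anything that can append a note to an incident record."""
    def append_note(self, incident_id: str, note: str) -> None: ...

@dataclass
class GitHubIssueBackend:
    # Stub: a real connector would POST a comment via the GitHub Issues API.
    comments: list = field(default_factory=list)
    def append_note(self, incident_id: str, note: str) -> None:
        self.comments.append((incident_id, note))

@dataclass
class PagerDutyBackend:
    # Stub: a real connector would add a timeline note via the PagerDuty API.
    notes: list = field(default_factory=list)
    def append_note(self, incident_id: str, note: str) -> None:
        self.notes.append((incident_id, note))

class SharedMemory:
    """Fan-out writer: every finding lands in all configured backends."""
    def __init__(self, backends: list[Backend]):
        self.backends = list(backends)

    def write(self, incident_id: str, note: str) -> None:
        for backend in self.backends:
            backend.append_note(incident_id, note)
```

Swapping ServiceNow in for PagerDuty, or running both side by side, is then just a change to the list handed to `SharedMemory`.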
How it works
| Step | Actor | What happens | Path |
|---|---|---|---|
| 1 | Alert source | Monitoring fires an alert | — |
| 2 | Primary agent | Receives alert, triages, starts investigating with native tools | Internal |
| 3 | Primary agent | Calls partner agent for domain-specific analysis (third-party logs, spans) | Direct via MCP or API |
| 4 | Partner agent | Runs analysis, returns findings in real time | Direct via MCP or API |
| 5 | Primary agent | Correlates partner findings with native data, runs remediation | Internal |
| 6 | Both agents | Write findings, actions, and resolution to external systems | Shared memory via existing sources |
| 7 | Agent or human | Verifies resolution, closes incident | Shared memory via existing sources |
Steps 3 through 5 happen in real time over the direct channel. Nothing gets written to shared memory until the investigation has actual results. The investigation is fast; the record-keeping is thorough.
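The steps above can be sketched as a single handler. The `triage`/`analyze`/`remediate` methods are hypothetical agent interfaces, assumed here only to show the ordering: all investigation happens over the direct channel first, and shared memory is written once, at the end:

```python
def handle_incident(incident_id, primary, partner, shared_memory):
    """Investigate over the direct channel first; persist only at the end."""
    findings = []
    findings.append(primary.triage(incident_id))                  # step 2
    enrichment = partner.analyze(incident_id)                     # steps 3-4
    findings.append(enrichment)
    findings.append(primary.remediate(incident_id, enrichment))   # step 5
    # Step 6: nothing touched the external systems until this point.
    for note in findings:
        shared_memory.write(incident_id, note)
    return findings
```

No external write sits on the critical path of the investigation, so a slow ticketing API can't stall the fast path.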
Who does what
In this system the primary agent owns the full incident lifecycle: detection, triage, investigation, remediation, closure. The partner agent gets called when the primary agent needs to see into a part of the stack it can't access natively. It does the specialized deep-dive, returns what it found, and the primary agent takes it from there.
Both agents write to shared memory, and the primary agent acts on the proposed next steps.
| | Primary agent | Partner agent |
|---|---|---|
| Communication | Calls partner directly; writes to shared memory after | Responds to calls; writes enrichment to shared memory |
| Scope | Full lifecycle | Domain-specific deep-dive |
| Tools | Cloud-native monitoring, CLI, runbooks, issue trackers | Third-party observability APIs |
| Typical share | ~80% of investigation + all remediation | ~20%, specialized enrichment |
Why shared context should live where humans already work
If your agent writes its findings to a system nobody checks, you've built a very expensive diary. Write them to a GitHub Issue, a ServiceNow ticket, a Jira epic, or whatever your team actually monitors, and the dynamics change: humans can participate without changing their workflow.
Your team already watches these systems. When an agent posts its reasoning and pending decisions to a place engineers already check, anyone can review or correct it using the tools they know. Comments, reactions, status updates. No custom approval UI. The collaboration features built into your workflow tool become the oversight mechanism for free.
That persistence pays off in a second way. Every entry the agent writes is a record that future runs can search. Instead of context that disappears when a conversation ends, you accumulate operational history. How was this incident type handled last time? What did the agent try? What did the human override? That history is retrievable by both people and agents through the same interface, without spinning up a separate vector database.
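If GitHub Issues is the backend, that lookup needs no extra infrastructure; it's the same issue-search syntax engineers already use. A sketch, assuming incident records carry an `incident` label and a searchable error "signature" — both conventions you'd choose, not built-ins:

```python
def build_history_query(repo: str, signature: str) -> str:
    """Compose a GitHub issue-search query for past incidents that
    mention the same error signature (alert name, error code, etc.)."""
    return f'repo:{repo} is:issue label:incident "{signature}"'

query = build_history_query("contoso/payments", "ConnectionPoolTimeout")
```

The same query works in the GitHub UI for humans and through the search API for agents, which is the whole point of sharing the interface.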
You could build a dedicated agent database for all this. But nobody will look at it. Teams already have notifications, permissions, and audit trails configured in their existing tools. A purpose-built system means a new UI to learn, new permissions to manage, and one more thing competing for attention. Store context where people already look and you skip all of that.
The best agent memory is the one your team is already reading.
Design principles
A few opinions that came out of watching real incidents:
Investigate first, persist second. The primary agent calls the partner directly for real-time analysis. Both agents write to shared memory only after findings are collected. Investigation speed should never be bottlenecked by writes to external systems.
Humans see everything through shared context. The direct path is agent-to-agent only, but the shared context layer is where humans can see the full picture and step in. Agents don't bypass human visibility.
Append-only. Both agents' writes are additive. No overwrites, no deletions. You can always reconstruct the full history of an investigation.
Backend-agnostic. Swapping PagerDuty for ServiceNow, or adding GitHub Issues alongside either one, is a connector config change.
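The append-only principle is easy to enforce at the interface level: expose an `append` and a read-only view, and simply never offer update or delete. A minimal sketch, illustrative rather than a real connector:

```python
class AppendOnlyLog:
    """Additive record of agent actions; history is always reconstructable."""

    def __init__(self):
        self._entries = []

    def append(self, actor: str, message: str) -> int:
        """Add an entry and return its sequence number.
        There is intentionally no update or delete method."""
        self._entries.append({"seq": len(self._entries),
                              "actor": actor,
                              "message": message})
        return self._entries[-1]["seq"]

    def history(self) -> tuple:
        # Immutable snapshot, so callers can't rewrite the past.
        return tuple(dict(entry) for entry in self._entries)
```

Corrections get appended as new entries rather than edits, which is exactly how comments on an issue or work notes on a ticket already behave.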
What this actually gets you
The practical upside is pretty simple: investigations aren't waiting on writes to external systems, nothing is lost when the conversation ends, and the next on-call engineer picks up where the last one left off instead of starting over. Every action from both agents shows up in the systems humans already look at.
Adding a new partner agent or a new shared memory backend is a connector change. The architecture doesn't care which specific tools your team chose.
The fast path is for investigation. The durable path is for everything else.