Apps on Azure Blog

Shared Agent Context: How We Are Tackling Partner Agent Collaboration

dbandaru
Microsoft
Mar 25, 2026

As we were building out the partner ecosystem for Azure SRE Agent, we ran into a hard question: how can two agents working on the same problem share context, and how does that context persist once the incident is resolved? This post covers the architecture we built: direct real-time communication for speed, and a shared context layer built on systems you already have set up, such as PagerDuty, GitHub Issues, or ServiceNow.

Your Azure SRE agent detects a spike in error rates. It triages with cloud-native telemetry, but the root cause trail leads into a third-party observability platform your team also runs. The agent can't see that data. A second agent can, one that speaks Datadog or Dynatrace or whatever your team chose. The two agents talk to each other using protocols like MCP or directly via an API endpoint and come up with a remediation. The harder question is what happens to the conversation afterward.

TL;DR

Two AI agents collaborate on incidents using two communication paths: a direct real-time channel (MCP) for fast investigation, and a shared memory layer that writes to systems your team already uses, like PagerDuty, GitHub Issues, or ServiceNow. No new tools to adopt. No ephemeral conversations that vanish when the incident closes.


The problem

Most operational AI agents work in isolation. Your cloud monitoring agent doesn't have access to your third-party observability stack. Your Datadog specialist doesn't know what your Azure resource topology looks like. When an incident spans both, a human has to bridge the gap manually. At 2 AM. With half the context missing.

And even when two agents do exchange information directly, the conversation is ephemeral. The investigation ends, the findings disappear. The next on-call engineer sees a resolved alert with no record of what was tried, what was found, or why the remediation worked. The next agent that hits the same pattern starts over from scratch.

What we needed was somewhere for both agents to persist their findings, somewhere humans could see them too. And we really didn't want to force teams onto a new system just to get there.


Two communication paths

Direct agent-to-agent (real-time)

During an active investigation, the primary agent calls the partner agent directly. The partner runs whatever domain-specific analysis it's good at (log searches, span analysis, custom metric queries) and returns findings in real time. This is the fast path.

The direct channel uses MCP, so any partner agent can plug in without custom integration work. The primary agent doesn't need to understand the internals of Datadog or Dynatrace. It asks questions, gets answers.
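
To make the direct call concrete, here is a minimal sketch of the request the primary agent might send. MCP messages are JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The tool name `search_logs` and its arguments are hypothetical; a real partner agent would advertise its own tools.

```python
import json
from itertools import count

# Monotonic request ids so each response can be matched to its request.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(
    "search_logs",
    {"service": "checkout-api", "query": "status:error", "window_minutes": 30},
)
print(json.dumps(request, indent=2))
```

The primary agent only needs to know the tool's name and input shape. How the partner actually runs the search stays on the partner's side.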

Shared memory (durable)

After the direct exchange, both agents write their actions and findings to external systems that humans already use. This is the durable path, the one that creates audit trails and makes handoffs work.

The shared memory backends are systems your team already has open during an incident:

| Backend | What gets written | Good fit for |
| --- | --- | --- |
| Incident platform (e.g., PagerDuty) | Timeline notes, on-call handoff context | Teams with alerting-centric workflows |
| Issue tracker (e.g., GitHub Issues) | Code-level findings, root cause analysis, action comments | Teams with dev workflow integration |
| ITSM system (e.g., ServiceNow) | Work notes, ITSM-compliant audit trail | Enterprise IT, regulated industries |

The important thing: this doesn't require a new system. Agents write to whatever your team already uses.
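
One way to picture this is a small interface that every backend satisfies, with agents fanning a finding out to whichever backends are configured. This is a sketch under our own assumptions: `Finding`, `SharedMemoryBackend`, `InMemoryBackend`, and `publish` are hypothetical names, and the in-memory class stands in for a real connector that would call the tool's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Protocol


@dataclass(frozen=True)
class Finding:
    incident_id: str
    author: str  # which agent wrote this entry
    summary: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class SharedMemoryBackend(Protocol):
    """Anything that can record a finding counts as a backend."""

    def append(self, finding: Finding) -> None: ...


class InMemoryBackend:
    """Stand-in for a real connector (incident platform, issue tracker, ITSM)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: list[Finding] = []

    def append(self, finding: Finding) -> None:
        self.entries.append(finding)


def publish(finding: Finding, backends: list[SharedMemoryBackend]) -> None:
    """Write the same finding to every configured backend."""
    for backend in backends:
        backend.append(finding)


incident_platform = InMemoryBackend("incident-platform")
issue_tracker = InMemoryBackend("issue-tracker")
publish(
    Finding("INC-1", "partner-agent", "Error spans concentrated in one service"),
    [incident_platform, issue_tracker],
)
```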


How it works

| Step | Actor | What happens | Path |
| --- | --- | --- | --- |
| 1 | Alert source | Monitoring fires an alert | |
| 2 | Primary agent | Receives alert, triages, starts investigating with native tools | Internal |
| 3 | Primary agent | Calls partner agent for domain-specific analysis (third-party logs, spans) | Direct via MCP or API |
| 4 | Partner agent | Runs analysis, returns findings in real time | Direct via MCP or API |
| 5 | Primary agent | Correlates partner findings with native data, runs remediation | Internal |
| 6 | Both agents | Write findings, actions, and resolution to external systems | Shared memory via existing sources |
| 7 | Agent or human | Verifies resolution, closes incident | Shared memory via existing sources |

Steps 3 through 5 happen in real time over the direct channel. Nothing gets written to shared memory until the investigation has actual results. The investigation is fast; the record-keeping is thorough.
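
The ordering can be sketched as two separate phases. Everything here is a simplified stand-in: `ask_partner`, `investigate`, and `persist` are hypothetical names, the partner call is faked, and shared memory is just a list.

```python
def ask_partner(question: str) -> str:
    """Hypothetical stand-in for the direct call to the partner agent."""
    return f"partner finding for: {question}"


def investigate(alert: str) -> list[str]:
    """Steps 2-5: runs in real time and writes nothing to shared memory."""
    findings = [f"native triage of alert: {alert}"]
    findings.append(ask_partner("error spans for the affected service"))
    findings.append("remediation applied")
    return findings


def persist(incident_id: str, findings: list[str], shared_memory: list[str]) -> None:
    """Step 6: record results only once the investigation has produced them."""
    for finding in findings:
        shared_memory.append(f"[{incident_id}] {finding}")


shared_memory: list[str] = []
findings = investigate("error rate spike")
entries_before_persist = len(shared_memory)  # still empty at this point
persist("INC-1", findings, shared_memory)
```

Keeping `investigate` free of external writes is what lets the fast path stay fast. A slow or unavailable ticketing system can delay the record, but not the diagnosis.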


Who does what

In this system the primary agent owns the full incident lifecycle: detection, triage, investigation, remediation, closure. The partner agent gets called when the primary agent needs to see into a part of the stack it can't access natively. It does the specialized deep-dive, returns what it found, and the primary agent takes it from there.

Both agents write to shared memory and the primary agent acts on the proposed next steps.

| | Primary agent | Partner agent |
| --- | --- | --- |
| Communication | Calls partner directly; writes to shared memory after | Responds to calls; writes enrichment to shared memory |
| Scope | Full lifecycle | Domain-specific deep-dive |
| Tools | Cloud-native monitoring, CLI, runbooks, issue trackers | Third-party observability APIs |
| Typical share | ~80% of investigation + all remediation | ~20%, specialized enrichment |


Why shared context should live where humans already work

If your agent writes its findings to a system nobody checks, you've built a very expensive diary. Write them to a GitHub Issue, a ServiceNow ticket, a Jira epic, or whatever your team actually monitors, and the dynamics change: humans can participate without changing their workflow.

Your team already watches these systems. When an agent posts its reasoning and pending decisions to a place engineers already check, anyone can review or correct it using the tools they know. Comments, reactions, status updates. No custom approval UI. The collaboration features built into your workflow tool become the oversight mechanism for free.

That persistence pays off in a second way. Every entry the agent writes is a record that future runs can search. Instead of context that disappears when a conversation ends, you accumulate operational history. How was this incident type handled last time? What did the agent try? What did the human override? That history is retrievable by both people and agents through the same interface, without spinning up a separate vector database.
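
As a rough illustration of that retrieval, here is a keyword search over past entries. In practice the lookup would go through the search your issue tracker or ITSM tool already provides. The `Record` type, the `search_history` function, and the sample entries below are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    incident_id: str
    author: str  # "primary-agent", "partner-agent", or "human"
    text: str


def search_history(records: list[Record], keyword: str) -> list[Record]:
    """Return past entries whose text mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [record for record in records if needle in record.text.lower()]


# Made-up sample history, for illustration only.
history = [
    Record("INC-101", "primary-agent", "Connection pool exhaustion; restarted workers"),
    Record("INC-101", "human", "Override: raise pool size instead of restarting"),
    Record("INC-207", "partner-agent", "Slow spans traced to cache misses"),
]

matches = search_history(history, "pool")
for record in matches:
    print(f"{record.incident_id} ({record.author}): {record.text}")
```

The same query answers both "what did the agent try?" and "what did the human override?", because both kinds of entry live in the same place.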

You could build a dedicated agent database for all this. But nobody will look at it. Teams already have notifications, permissions, and audit trails configured in their existing tools. A purpose-built system means a new UI to learn, new permissions to manage, and one more thing competing for attention. Store context where people already look and you skip all of that.

The best agent memory is the one your team is already reading.


Design principles

A few opinions that came out of watching real incidents:

Investigate first, persist second. The primary agent calls the partner directly for real-time analysis. Both agents write to shared memory only after findings are collected. Investigation speed should never be bottlenecked by writes to external systems.

Humans see everything through shared context. The direct path is agent-to-agent only, but the shared context layer is where humans can see the full picture and step in. Agents don't bypass human visibility.

Append-only. Both agents' writes are additive. No overwrites, no deletions. You can always reconstruct the full history of an investigation.
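
A minimal sketch of what append-only means in code, using hypothetical names (`Entry`, `AppendOnlyLog`): the log exposes an append method and a read-only view, and nothing else. A correction is a new entry, not an edit to an old one.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Entry:
    seq: int
    author: str
    text: str


class AppendOnlyLog:
    """Offers append and read; there is no update or delete method."""

    def __init__(self) -> None:
        self._entries: list[Entry] = []

    def append(self, author: str, text: str) -> Entry:
        entry = Entry(seq=len(self._entries) + 1, author=author, text=text)
        self._entries.append(entry)
        return entry

    def history(self) -> tuple[Entry, ...]:
        # Return a tuple so callers cannot add to or reorder the log.
        return tuple(self._entries)


log = AppendOnlyLog()
first = log.append("primary-agent", "Suspected bad deploy")
log.append("partner-agent", "Correction to entry 1: traces point to a dependency")

try:
    first.text = "rewritten"  # frozen dataclass: assignment raises
except FrozenInstanceError:
    print("entries cannot be edited after they are written")
```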

Backend-agnostic. Swapping PagerDuty for ServiceNow, or adding GitHub Issues alongside either one, is a connector config change.
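
Here is one hedged sketch of what a connector config change could look like. The registry, the config shape, and the connector functions are all hypothetical, and the connectors only format a string where real ones would call the tool's API.

```python
from typing import Callable

# A connector takes a note and records it; here it just returns what it would write.
Connector = Callable[[str], str]


def pagerduty_connector(note: str) -> str:
    return f"pagerduty timeline note: {note}"


def github_issues_connector(note: str) -> str:
    return f"github issue comment: {note}"


def servicenow_connector(note: str) -> str:
    return f"servicenow work note: {note}"


REGISTRY: dict[str, Connector] = {
    "pagerduty": pagerduty_connector,
    "github_issues": github_issues_connector,
    "servicenow": servicenow_connector,
}


def load_connectors(config: dict) -> list[Connector]:
    """Resolve backend names from config; fail early on unknown names."""
    names = config["shared_memory"]
    unknown = [name for name in names if name not in REGISTRY]
    if unknown:
        raise ValueError(f"unknown shared memory backends: {unknown}")
    return [REGISTRY[name] for name in names]


# Swapping or adding a backend means editing this list, not the agent code.
config = {"shared_memory": ["pagerduty", "github_issues"]}
written = [connector("root cause identified") for connector in load_connectors(config)]
```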


What this actually gets you

The practical upside is pretty simple: investigations aren't waiting on writes to external systems, nothing is lost when the conversation ends, and the next on-call engineer picks up where the last one left off instead of starting over. Every action from both agents shows up in the systems humans already look at.

Adding a new partner agent or a new shared memory backend is a connector change. The architecture doesn't care which specific tools your team chose.

The fast path is for investigation. The durable path is for everything else.

Published Mar 25, 2026
Version 1.0