<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Apps on Azure Blog articles</title>
    <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/bg-p/AppsonAzureBlog</link>
    <description>Apps on Azure Blog articles</description>
    <pubDate>Tue, 21 Apr 2026 03:56:59 GMT</pubDate>
    <dc:creator>AppsonAzureBlog</dc:creator>
    <dc:date>2026-04-21T03:56:59Z</dc:date>
    <item>
      <title>AKS App Routing's Next Chapter: Gateway API with Istio</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/aks-app-routing-s-next-chapter-gateway-api-with-istio/ba-p/4512729</link>
      <description>&lt;P&gt;If you've been following my previous posts on the Ingress NGINX retirement, you'll know the story so far. The community Ingress NGINX project was retired in March 2026, and Microsoft's extended support for the NGINX-based App Routing add-on runs until November 2026. I've covered &lt;A href="https://techcommunity.microsoft.com/blog/appsonazureblog/seamless-migrations-from-self-hosted-nginx-ingress-to-the-aks-app-routing-add-on/4495630" aria-label="https://techcommunity.microsoft.com/blog/appsonazureblog/seamless-migrations-from-self-hosted-nginx-ingress-to-the-aks-app-routing-add-on/4495630" data-tooltip-position="top" target="_blank"&gt;migrating from standalone NGINX to the App Routing add-on&lt;/A&gt; to buy time, and &lt;A href="https://techcommunity.microsoft.com/blog/appsonazureblog/after-ingress-nginx-migrating-to-application-gateway-for-containers/4503110" aria-label="https://techcommunity.microsoft.com/blog/appsonazureblog/after-ingress-nginx-migrating-to-application-gateway-for-containers/4503110" data-tooltip-position="top" target="_blank"&gt;migrating to Application Gateway for Containers&lt;/A&gt; as a long-term option. In both of those posts I mentioned that Microsoft was working on a new version of the App Routing add-on based on Istio and the Gateway API. Well, it's here, in preview at least.&lt;/P&gt;
&lt;P&gt;The &lt;A href="https://blog.aks.azure.com/2026/03/18/app-routing-gateway-api" aria-label="https://blog.aks.azure.com/2026/03/18/app-routing-gateway-api" data-tooltip-position="top" target="_blank"&gt;App Routing Gateway API implementation&lt;/A&gt; is Microsoft's recommended migration path for anyone currently using the NGINX-based App Routing add-on. It moves you off NGINX entirely and onto the Kubernetes Gateway API, with a lightweight Istio control plane handling the gateway infrastructure under the hood. Let's look at what this actually is, how it differs from other options, and how to migrate from both standalone NGINX and the existing App Routing add-on.&lt;/P&gt;
&lt;H2 data-heading="What Is It?"&gt;What Is It?&lt;/H2&gt;
&lt;P&gt;The new App Routing mode uses the Kubernetes Gateway API instead of the Ingress API. When you enable the add-on, AKS deploys an Istio control plane (istiod) to manage Envoy-based gateway proxies. The important thing to understand here is that this is &lt;EM&gt;not&lt;/EM&gt; the full Istio service mesh. There's no sidecar injection, no Istio CRDs installed for your workloads. It's Istio doing one specific job: managing gateway proxies for ingress traffic.&lt;/P&gt;
&lt;P&gt;When you create a Gateway resource, AKS provisions an Envoy Deployment, a LoadBalancer Service, a HorizontalPodAutoscaler (defaulting to 2-5 replicas at 80% CPU), and a PodDisruptionBudget. All managed. You write Gateway and HTTPRoute resources, and AKS handles everything else.&lt;/P&gt;
&lt;P&gt;This is a fundamentally different API from what you're used to with Ingress. Instead of a single Ingress resource that combines the entry point and routing rules, Gateway API splits things into layers:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;GatewayClass&lt;/STRONG&gt; defines the type of gateway infrastructure (provided by AKS in this case)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Gateway&lt;/STRONG&gt; creates the actual gateway with its listeners&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;HTTPRoute&lt;/STRONG&gt; defines the routing rules and attaches to a Gateway&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This separation is one of Gateway API's main selling points. Platform teams can own the Gateway resources while application teams manage their own HTTPRoutes independently, without needing to modify shared infrastructure. If you've ever had a team accidentally break routing for everyone by editing a shared Ingress, you'll appreciate why this matters.&lt;/P&gt;
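&lt;P&gt;To make the split concrete, here's a minimal sketch of the two halves. The resource names, namespaces, and hostname are illustrative; the GatewayClass name is the one this add-on provides. A platform team owns the Gateway, and an application team attaches its own HTTPRoute to it:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# Owned by the platform team
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: infra
spec:
  gatewayClassName: approuting-istio
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All   # let routes in other namespaces attach
---
# Owned by the application team, in its own namespace
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp-route
  namespace: myapp
spec:
  parentRefs:
    - name: shared-gateway
      namespace: infra
  hostnames:
    - myapp.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: myapp
          port: 80&lt;/LI-CODE&gt;
&lt;P&gt;The application team can change its HTTPRoute as often as it likes without ever touching the shared Gateway.&lt;/P&gt;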
&lt;H2 data-heading="How It Differs From the Istio Service Mesh Add-On"&gt;How It Differs From the Istio Service Mesh Add-On&lt;/H2&gt;
&lt;P&gt;If you're already running or considering the Istio service mesh add-on for AKS, this is a different thing.&lt;/P&gt;
&lt;P&gt;The App Routing Gateway API mode uses the approuting-istio GatewayClass, doesn't install Istio CRDs, doesn't enable sidecar injection, and handles upgrades in-place. The full Istio service mesh add-on uses the istio GatewayClass, installs Istio CRDs cluster-wide, enables sidecar injection, and uses canary upgrades for minor versions.&lt;/P&gt;
&lt;P&gt;The two cannot run at the same time. If you have the Istio service mesh add-on enabled, you need to disable it before enabling App Routing Gateway API (and vice versa). If you need full mesh capabilities like mTLS between services, traffic policies, and telemetry, stick with the Istio service mesh add-on. If you just need managed ingress via Gateway API without the mesh overhead, this is the right choice.&lt;/P&gt;
&lt;H2 data-heading="Current Limitations"&gt;Current Limitations&lt;/H2&gt;
&lt;P&gt;The new App Routing solution is in preview, so it shouldn't be run in production yet. There are also some gaps compared to the existing add-on that you need to be aware of before planning a production migration.&lt;/P&gt;
&lt;P&gt;The biggest one: DNS and TLS certificate management via the add-on isn't supported yet for Gateway API. If you're currently using az aks approuting update and az aks approuting zone add to automate Key Vault and Azure DNS integration with the NGINX-based add-on, that workflow doesn't carry over. TLS termination is still possible, but you'll need to set it up manually. The &lt;A href="https://learn.microsoft.com/azure/aks/app-routing-gateway-api-tls" aria-label="https://learn.microsoft.com/azure/aks/app-routing-gateway-api-tls" data-tooltip-position="top" target="_blank"&gt;AKS docs cover the steps&lt;/A&gt;, but it's more hands-on than what the NGINX add-on gives you today. This is expected to be addressed when the feature reaches GA.&lt;/P&gt;
&lt;P&gt;SNI passthrough (TLSRoute) and egress traffic management aren't supported either. And as mentioned, it's mutually exclusive with the Istio service mesh add-on.&lt;/P&gt;
&lt;P&gt;For production workloads that depend heavily on automated DNS and TLS management, you may want to wait until GA, or look at &lt;A href="https://techcommunity.microsoft.com/blog/appsonazureblog/after-ingress-nginx-migrating-to-application-gateway-for-containers/4503110" aria-label="https://techcommunity.microsoft.com/blog/appsonazureblog/after-ingress-nginx-migrating-to-application-gateway-for-containers/4503110" data-tooltip-position="top" target="_blank"&gt;Application Gateway for Containers&lt;/A&gt; as an alternative. But for teams that can handle TLS setup manually, or for non-production environments, there's no reason not to start testing this now.&lt;/P&gt;
&lt;H2 data-heading="Getting Started"&gt;Getting Started&lt;/H2&gt;
&lt;P&gt;Before you can enable the feature, you need the aks-preview CLI extension (version 19.0.0b24 or later), the Managed Gateway API CRDs enabled, and the App Routing Gateway API preview feature flag registered:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az extension add --name aks-preview
az extension update --name aks-preview

# Managed Gateway API CRDs (required dependency)
az feature register --namespace "Microsoft.ContainerService" --name "ManagedGatewayAPIPreview"

# App Routing Gateway API implementation
az feature register --namespace "Microsoft.ContainerService" --name "AppRoutingIstioGatewayAPIPreview"&lt;/LI-CODE&gt;
&lt;P&gt;Feature flag registration can take a few minutes. Once both flags show as registered, enable the add-on on a new or existing cluster. You need both --enable-gateway-api (for the managed Gateway API CRD installation) and --enable-app-routing-istio (for the Istio-based implementation):&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# New cluster
az aks create \
  --resource-group ${RESOURCE_GROUP} \
  --name ${CLUSTER} \
  --location swedencentral \
  --enable-gateway-api \
  --enable-app-routing-istio

# Existing cluster
az aks update \
  --resource-group ${RESOURCE_GROUP} \
  --name ${CLUSTER} \
  --enable-gateway-api \
  --enable-app-routing-istio&lt;/LI-CODE&gt;
&lt;P&gt;Verify istiod is running:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get pods -n aks-istio-system&lt;/LI-CODE&gt;
&lt;P&gt;You should see two istiod pods in a Running state.&lt;/P&gt;
&lt;P&gt;From here, you can create a Gateway and HTTPRoute to test traffic flow. The&amp;nbsp;&lt;A href="https://learn.microsoft.com/azure/aks/app-routing-gateway-api" aria-label="https://learn.microsoft.com/azure/aks/app-routing-gateway-api" data-tooltip-position="top" target="_blank"&gt;AKS quickstart&lt;/A&gt; walks through this with the httpbin sample app if you want a quick validation.&lt;/P&gt;
&lt;H2 data-heading="Migrating From NGINX Ingress"&gt;Migrating From NGINX Ingress&lt;/H2&gt;
&lt;P&gt;Whether you're running standalone NGINX (self-installed via Helm) or the NGINX-based App Routing add-on, the migration process is essentially the same. You're moving from Ingress API resources to Gateway API resources, and the new controller runs alongside your existing one during the transition. The only real differences are what you're cleaning up at the end and, if you're on the App Routing add-on, whether you were relying on its built-in DNS and TLS automation.&lt;/P&gt;
&lt;H3 data-heading="Inventory Your Ingress Resources"&gt;Inventory Your Ingress Resources&lt;/H3&gt;
&lt;P&gt;Before anything else, understand what you have:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get ingress --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName'&lt;/LI-CODE&gt;
&lt;P&gt;Look specifically for custom snippets, lua configurations, or anything that relies heavily on NGINX-specific behaviour. These won't have direct equivalents in Gateway API and will need manual attention.&lt;/P&gt;
&lt;H3 data-heading="Convert Ingress Resources to Gateway API"&gt;Convert Ingress Resources to Gateway API&lt;/H3&gt;
&lt;P&gt;The &lt;A href="https://github.com/kubernetes-sigs/ingress2gateway" aria-label="https://github.com/kubernetes-sigs/ingress2gateway" data-tooltip-position="top" target="_blank"&gt;ingress2gateway&lt;/A&gt; tool (v1.0.0) handles conversion of Ingress resources to Gateway API equivalents. It supports over 30 common NGINX annotations and generates Gateway and HTTPRoute YAML. It works regardless of whether your Ingress resources use the nginx or webapprouting.kubernetes.azure.com IngressClass:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Install
go install github.com/kubernetes-sigs/ingress2gateway@v1.0.0

# Convert from live cluster
ingress2gateway print --providers=ingress-nginx -A &amp;gt; gateway-resources.yaml

# Or convert from a local file
ingress2gateway print --providers=ingress-nginx --input-file=./manifests/ingress.yaml &amp;gt; gateway-resources.yaml&lt;/LI-CODE&gt;
&lt;P&gt;Review the output carefully. The tool flags annotations it can't convert as comments in the generated YAML, so you'll know exactly what needs manual work. Common gaps include custom configuration snippets and regex-based rewrites that don't map cleanly to Gateway API's routing model.&lt;/P&gt;
&lt;P&gt;Make sure you update the gatewayClassName in the generated Gateway resources to approuting-istio. The tool may generate a generic GatewayClass name that you'll need to change.&lt;/P&gt;
&lt;H3 data-heading="Handle DNS and TLS"&gt;Handle DNS and TLS&lt;/H3&gt;
&lt;P&gt;If you're coming from standalone NGINX, you're likely managing DNS and TLS yourself already, so nothing changes here: just make sure your certificate Secrets and DNS records are ready for the new Gateway IP.&lt;/P&gt;
&lt;P&gt;If you're coming from the App Routing add-on and relying on its built-in DNS and TLS management (via az aks approuting zone add and Key Vault integration), this is the part that needs extra thought. That automation doesn't carry over to the Gateway API implementation yet, so you'll need to handle it differently until GA.&lt;/P&gt;
&lt;P&gt;For TLS, you can either create Kubernetes Secrets with your certificates manually or set up a workflow to sync them from Key Vault. The &lt;A href="https://learn.microsoft.com/azure/aks/app-routing-gateway-api-tls" aria-label="https://learn.microsoft.com/azure/aks/app-routing-gateway-api-tls" data-tooltip-position="top" target="_blank"&gt;AKS docs on securing Gateway API traffic&lt;/A&gt; cover the manual approach. For DNS, you'll need to manage records yourself or use &lt;A href="https://github.com/kubernetes-sigs/external-dns" aria-label="https://github.com/kubernetes-sigs/external-dns" data-tooltip-position="top" target="_blank"&gt;ExternalDNS&lt;/A&gt; to automate it. ExternalDNS supports Gateway API resources, so this is a viable path if you want automation.&lt;/P&gt;
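&lt;P&gt;As a rough illustration of the manual approach, a Gateway listener can terminate TLS by referencing a standard kubernetes.io/tls Secret. Names and hostname here are placeholders; see the AKS docs linked above for the full workflow:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: secure-gateway
spec:
  gatewayClassName: approuting-istio
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: myapp.example.com
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: myapp-tls   # created manually, or synced from Key Vault&lt;/LI-CODE&gt;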
&lt;H3 data-heading="Deploy and Validate"&gt;Deploy and Validate&lt;/H3&gt;
&lt;P&gt;With the add-on enabled, apply your converted resources:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl apply -f gateway-resources.yaml&lt;/LI-CODE&gt;
&lt;P&gt;Wait for the Gateway to be programmed and get the external IP:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl wait --for=condition=programmed gateways.gateway.networking.k8s.io &amp;lt;gateway-name&amp;gt;
export GATEWAY_IP=$(kubectl get gateways.gateway.networking.k8s.io &amp;lt;gateway-name&amp;gt; -ojsonpath='{.status.addresses[0].value}')&lt;/LI-CODE&gt;
&lt;P&gt;The key thing here is that your existing NGINX controller (whether standalone or add-on managed) is still running and serving production traffic. The Gateway API resources are handled separately by the Istio-based controller in aks-istio-system. This parallel running is what makes the migration safe.&lt;/P&gt;
&lt;P&gt;Test your routes against the new Gateway IP. You'll need to pass the appropriate hostname as a Host header, since your DNS will still be pointing at the existing NGINX controller at this point.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;curl -H "Host: myapp.example.com" http://$GATEWAY_IP&lt;/LI-CODE&gt;
&lt;P&gt;Run your full validation suite. Check TLS, path routing, headers, authentication, anything your applications depend on. Take your time here; nothing changes for production until you update DNS.&lt;/P&gt;
&lt;H3 data-heading="Cut Over DNS and Clean Up"&gt;Cut Over DNS and Clean Up&lt;/H3&gt;
&lt;P&gt;Once you're confident, lower your DNS TTL to 60 seconds (do this well in advance), then update your DNS records to point to the new Gateway IP. Keep the old NGINX controller running for 24-48 hours as a rollback option.&lt;/P&gt;
&lt;P&gt;After traffic has been flowing cleanly through the Gateway API path, clean up the old setup. What this looks like depends on where you started:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;If you were on standalone NGINX:&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;helm uninstall ingress-nginx -n ingress-nginx
kubectl delete namespace ingress-nginx&lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;If you were on the App Routing add-on with NGINX:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Verify nothing is still using the old IngressClass:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get ingress --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' \
  | grep "webapprouting"&lt;/LI-CODE&gt;
&lt;P&gt;Delete any remaining Ingress resources that reference the old class, then disable the NGINX-based App Routing add-on:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az aks approuting disable --resource-group ${RESOURCE_GROUP} --name ${CLUSTER}&lt;/LI-CODE&gt;
&lt;P&gt;Some resources (ConfigMaps, Secrets, and the controller deployment) will remain in the app-routing-system namespace after disabling. You can clean these up by deleting the namespace once you're satisfied everything is running through the Gateway API path:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl delete ns app-routing-system&lt;/LI-CODE&gt;
&lt;P&gt;In both cases, clean up any old Ingress resources that are no longer being used.&lt;/P&gt;
&lt;H2 data-heading="Upgrades and Lifecycle"&gt;Upgrades and Lifecycle&lt;/H2&gt;
&lt;P&gt;The Istio control plane version is tied to your AKS cluster's Kubernetes version. AKS automatically handles patch upgrades as part of its release cycle, and minor version upgrades happen in-place when you upgrade your cluster's Kubernetes version or when a new Istio minor version is released for your AKS version.&lt;/P&gt;
&lt;P&gt;One thing to be aware of: unlike the Istio service mesh add-on, upgrades here are in-place, not canary-based. The HPA and PDB on each Gateway help minimise disruption, but plan accordingly for production. If you have &lt;A href="https://learn.microsoft.com/azure/aks/planned-maintenance" aria-label="https://learn.microsoft.com/azure/aks/planned-maintenance" data-tooltip-position="top" target="_blank"&gt;maintenance windows&lt;/A&gt; configured, the istiod upgrades will respect them.&lt;/P&gt;
&lt;H2 data-heading="What Should You Do Now?"&gt;What Should You Do Now?&lt;/H2&gt;
&lt;P&gt;The timeline hasn't changed. The standalone NGINX Ingress project was retired in March 2026, so if you're still running that, you're already on unsupported software. The NGINX App Routing add-on is supported until November 2026, which gives you a window, but it's not a long one.&lt;/P&gt;
&lt;P&gt;If you're on standalone NGINX you could get onto the App Routing add-on now to buy time (I covered this in my &lt;A href="https://techcommunity.microsoft.com/blog/appsonazureblog/seamless-migrations-from-self-hosted-nginx-ingress-to-the-aks-app-routing-add-on/4495630" aria-label="https://techcommunity.microsoft.com/blog/appsonazureblog/seamless-migrations-from-self-hosted-nginx-ingress-to-the-aks-app-routing-add-on/4495630" data-tooltip-position="top" target="_blank"&gt;earlier post&lt;/A&gt;), then plan your migration to either the Gateway API mode or AGC.&lt;/P&gt;
&lt;P&gt;If you're on the NGINX App Routing add-on: start testing the Gateway API mode in non-production now. Get familiar with the Gateway API resource model, understand the TLS and DNS gaps in the preview, and be ready to migrate when the feature reaches GA or when November gets close, whichever comes first.&lt;/P&gt;
&lt;P&gt;If you need production-ready TLS and DNS automation today and can't wait for GA, Application Gateway for Containers is your best option right now.&lt;/P&gt;
&lt;P&gt;Whatever path you choose, make sure you have a plan in place before November. Running unsupported ingress software on production infrastructure isn't where you want to be.&lt;/P&gt;</description>
      <pubDate>Sun, 19 Apr 2026 18:25:14 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/aks-app-routing-s-next-chapter-gateway-api-with-istio/ba-p/4512729</guid>
      <dc:creator>samcogan</dc:creator>
      <dc:date>2026-04-19T18:25:14Z</dc:date>
    </item>
    <item>
      <title>Autonomous AKS Incident Response with Azure SRE Agent: From Alert to Verified Recovery in Minutes</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/autonomous-aks-incident-response-with-azure-sre-agent-from-alert/ba-p/4511343</link>
      <description>&lt;P&gt;When a Sev1 alert fires on an AKS cluster, detection is rarely the hard part. The hard part is what comes next: proving what broke, why it broke, and fixing it without widening the blast radius, all under time pressure, often at 2 a.m.&lt;/P&gt;
&lt;P&gt;Azure SRE Agent is designed to close that gap. It connects Azure-native observability, AKS diagnostics, and engineering workflows into a single incident-response loop that can investigate, remediate, verify, and follow up, without waiting for a human to page through dashboards and run ad-hoc&amp;nbsp;&lt;EM&gt;kubectl&lt;/EM&gt;&amp;nbsp;commands.&lt;/P&gt;
&lt;P&gt;This post walks through that loop in two real AKS failure scenarios. In both cases, the agent received an incident, investigated Azure Monitor and AKS signals, applied targeted remediation, verified recovery, and created follow-up in GitHub, all while keeping the team informed in Microsoft Teams.&lt;/P&gt;
&lt;H2&gt;Core concepts&lt;/H2&gt;
&lt;P&gt;Azure SRE Agent is a governed incident-response system, not a conversational assistant with infrastructure access. Five concepts matter most in an AKS incident workflow:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Incident platform.&lt;/STRONG&gt;&amp;nbsp;Where incidents originate. In this demo, that is Azure Monitor.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Built-in Azure capabilities.&lt;/STRONG&gt;&amp;nbsp;The agent uses Azure Monitor, Log Analytics, Azure Resource Graph, Azure CLI/ARM, and AKS diagnostics without requiring external connectors.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Connectors.&lt;/STRONG&gt;&amp;nbsp;Extend the workflow to systems such as GitHub, Teams, Kusto, and MCP servers.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Permission levels.&lt;/STRONG&gt; &lt;EM&gt;Reader&lt;/EM&gt; for investigation and read-oriented access, &lt;EM&gt;Privileged&lt;/EM&gt; for operational changes when allowed.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Run modes.&lt;/STRONG&gt;&amp;nbsp;&lt;EM&gt;Review&amp;nbsp;&lt;/EM&gt;for approval-gated execution and&amp;nbsp;&lt;EM&gt;Autonomous&amp;nbsp;&lt;/EM&gt;for direct execution.&lt;/LI&gt;
&lt;/OL&gt;
&lt;BLOCKQUOTE&gt;
&lt;P class=""&gt;&lt;STRONG&gt;The most important production controls are permission level and run mode, not prompt quality.&lt;/STRONG&gt;&amp;nbsp;Custom instructions can shape workflow behavior, but they do not replace RBAC, telemetry quality, or tool availability.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The safest production rollout path:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;Start: Reader + Review 
Then: Privileged + Review 
Finally: Privileged + Autonomous. Only for narrow, trusted incident paths.&lt;/LI-CODE&gt;
&lt;H2&gt;Demo environment&lt;/H2&gt;
&lt;P&gt;The full scripts and manifests are available if you want to reproduce this:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P class=""&gt;&lt;STRONG&gt;Demo repository:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://github.com/hailugebru/azure-sre-agents-aks" target="_blank" rel="noopener"&gt;github.com/hailugebru/azure-sre-agents-aks&lt;/A&gt;. The README includes setup and configuration details.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The environment uses an AKS cluster with &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/node-auto-provisioning" target="_blank" rel="noopener"&gt;node auto-provisioning&lt;/A&gt; (NAP), Azure CNI Overlay powered by Cilium, managed Prometheus metrics, the &lt;A class="lia-external-url" href="https://github.com/Azure-Samples/aks-store-demo" target="_blank" rel="noopener"&gt;AKS Store sample&lt;/A&gt; microservices application, and &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/sre-agent/overview?tabs=task" target="_blank" rel="noopener"&gt;Azure SRE Agent&lt;/A&gt; configured for incident-triggered investigation and remediation. This setup is intentionally realistic but minimal. It provides enough surface area to exercise real AKS failure modes without distracting from the incident workflow itself.&lt;/P&gt;
&lt;LI-CODE lang=""&gt;Azure Monitor  →  Action Group  →  Azure SRE Agent  →  AKS Cluster
(Alert)           (Webhook)      (Investigate / Fix)    (Recover)

                                      ↓
               Teams notification + GitHub issue → GitHub Agent → PR for review&lt;/LI-CODE&gt;
&lt;H3&gt;How the agent was configured&lt;/H3&gt;
&lt;P&gt;Configuration came down to four things: scope, permissions, incident intake, and response mode. I scoped the agent to the demo resource group and used its user-assigned managed identity (UAMI) for Azure access. That scope defined what the agent could investigate, while RBAC determined what actions it could take.&lt;/P&gt;
&lt;P&gt;I used broader AKS permissions than I would recommend as a default production baseline so the agent could complete remediation end to end in the lab. That is an important distinction:&amp;nbsp;&lt;STRONG&gt;permissions control what the agent can access, while run mode controls whether it asks for approval or acts directly.&lt;/STRONG&gt;&amp;nbsp;For this scenario, Azure Monitor served as the incident platform, and I set the response plan to Autonomous for a narrow, trusted path so the workflow could run without manual approval gates.&lt;/P&gt;
&lt;P&gt;I also added Teams and GitHub integrations so the workflow could extend beyond Azure. Teams provided milestone updates during the incident, and GitHub provided durable follow-up after remediation. For the complete setup, see the &lt;A class="lia-external-url" href="https://github.com/hailugebru/azure-sre-agents-aks/tree/main" target="_blank" rel="noopener"&gt;README&lt;/A&gt;.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;A note on context.&lt;/STRONG&gt; The more context you can provide the agent about your environment, resources, runbooks, and conventions, the better it performs. Scope boundaries, known workloads, common failure patterns, and links to relevant documentation all sharpen its investigations and reduce the time it spends exploring. Treat custom instructions and connector content as first-class inputs, not afterthoughts.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Two incidents, two response modes&lt;/H2&gt;
&lt;P&gt;These incidents occurred on the same cluster in one session and illustrate two realistic operating modes:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Alert triggered automation.&lt;/STRONG&gt;&amp;nbsp;The agent acts when Azure Monitor fires.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Ad hoc chat investigation.&lt;/STRONG&gt;&amp;nbsp;An engineer sees a symptom first and asks the agent to investigate.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Both matter in real environments. The first is your scale path. The second is your operator assist path.&lt;/P&gt;
&lt;H3&gt;Incident 1. CPU starvation (alert driven, ~8 min MTTR)&lt;/H3&gt;
&lt;P&gt;The&amp;nbsp;&lt;EM&gt;makeline-service&lt;/EM&gt;&amp;nbsp;deployment manifest contained a CPU and memory configuration that was not viable for startup:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;resources:
  requests:
    cpu: 1m
    memory: 6Mi
  limits:
    cpu: 5m
    memory: 20Mi&lt;/LI-CODE&gt;
&lt;P&gt;Within five minutes, Azure Monitor fired the&amp;nbsp;&lt;EM&gt;pod-not-healthy&lt;/EM&gt;&amp;nbsp;Sev1 alert. The agent picked it up immediately.&lt;/P&gt;
&lt;P&gt;Here is the key diagnostic conclusion the agent reached from the pod state, probe behavior, and exit code:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;"Exit code 1 (not 137) rules out OOMKill. The pod failed at startup, not at runtime memory pressure. CPU limit of 5m is insufficient for the process to bind its port before the startup probe times out. This is a configuration error, not a resource exhaustion scenario."&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;That is the kind of distinction that often takes an on-call engineer several minutes to prove under pressure: startup failure from CPU starvation vs. runtime termination from memory pressure.&lt;/P&gt;
&lt;P&gt;The agent then:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Identified three additional CPU-throttled pods at 112 to 200% of configured limit using&amp;nbsp;&lt;EM&gt;kubectl top&lt;/EM&gt;.&lt;/LI&gt;
&lt;LI&gt;Patched four workloads:&amp;nbsp;&lt;EM&gt;makeline-service&lt;/EM&gt;,&amp;nbsp;&lt;EM&gt;virtual-customer&lt;/EM&gt;,&amp;nbsp;&lt;EM&gt;virtual-worker&lt;/EM&gt;, and&amp;nbsp;&lt;EM&gt;mongodb&lt;/EM&gt;.&lt;/LI&gt;
&lt;LI&gt;Verified that all affected pods returned to a healthy Running state with 0 restarts cluster-wide.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;Azure SRE Agent's Incident History blade confirming full cluster recovery: 4 patches applied, 0 unhealthy pods — no human intervention required.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Outcome.&lt;/STRONG&gt;&amp;nbsp;Full cluster recovery in ~8 minutes, 0 human interventions.&lt;/P&gt;
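&lt;P&gt;For comparison with the broken manifest above, a viable configuration looks something like this. The agent's exact patched values aren't reproduced here, so treat these numbers as illustrative headroom for startup rather than the actual fix:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;resources:
  requests:
    cpu: 100m      # was 1m: enough to bind the port before the startup probe times out
    memory: 64Mi   # was 6Mi
  limits:
    cpu: 500m      # was 5m
    memory: 128Mi  # was 20Mi&lt;/LI-CODE&gt;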
&lt;H3&gt;Incident 2. OOMKilled (chat driven, ~4 min MTTR)&lt;/H3&gt;
&lt;P&gt;For the second case, I deployed a deliberately undersized version of &lt;EM&gt;order-service&lt;/EM&gt;:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;kubectl apply -f .\manifests\aks-store\order-service-changed.yaml -n pets&lt;/LI-CODE&gt;
&lt;P&gt;I started this case from chat before the pod-phase alert fired to demonstrate the interactive troubleshooting flow. That was a demo choice, not an alerting gap. &lt;EM&gt;CrashLoopBackOff&lt;/EM&gt; is a container waiting reason, not a pod phase, so production coverage should come from Prometheus-based crash-loop signals rather than pod phase alone. Here is the PromQL query I use in Azure Monitor to catch this class of failure:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;sum by (namespace, pod) (
  (
    max_over_time(
      kube_pod_container_status_waiting_reason{
        namespace="pets",
        reason="CrashLoopBackOff"
      }[5m]
    ) == 1
  )
  and on (namespace, pod, container)
  (
    increase(
      kube_pod_container_status_restarts_total{
        namespace="pets"
      }[15m]
    ) &amp;gt; 0
  )
) &amp;gt; 0&lt;/LI-CODE&gt;
&lt;P&gt;This query fires when a container has been in&amp;nbsp;&lt;EM&gt;CrashLoopBackOff&lt;/EM&gt;&amp;nbsp;within the last 5 minutes&amp;nbsp;&lt;STRONG&gt;and&lt;/STRONG&gt; its restart count has increased in the last 15 minutes. In production, replace the hardcoded namespace with a regex matcher or remove it entirely to cover all namespaces.&lt;/P&gt;
&lt;P&gt;For this incident, I kicked off the investigation with a simple chat prompt:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;The order-service pod in the pets namespace is not healthy. Please investigate, identify the root cause, and fix it.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The agent's reasoning:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;"Container logs are empty. The process was killed before it could write its first log line. Exit code 137 confirms OOMKill. No NODE_OPTIONS in the ConfigMap rules out a V8 heap misconfiguration. The 20Mi limit is 12.8x below the pod's observed 50Mi runtime baseline. This limit was never viable for this workload."&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The agent increased the memory limit (&lt;EM&gt;20Mi &lt;/EM&gt;to &lt;EM&gt;128Mi&lt;/EM&gt;) and request (&lt;EM&gt;10Mi &lt;/EM&gt;to &lt;EM&gt;50Mi&lt;/EM&gt;), then verified the new pod stabilized at 74Mi/128Mi (58% utilization) with 0 restarts.&lt;/P&gt;
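&lt;P&gt;Expressed as the deployment's resources block, that patch amounts to the following (memory values as reported above; CPU settings are omitted since they weren't changed):&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;resources:
  requests:
    memory: 50Mi    # was 10Mi: matches the observed runtime baseline
  limits:
    memory: 128Mi   # was 20Mi: pod stabilized at 74Mi (58% utilization)&lt;/LI-CODE&gt;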
&lt;P&gt;&lt;STRONG&gt;Outcome.&lt;/STRONG&gt;&amp;nbsp;Service recovered in ~4 minutes without any manual cluster interaction.&lt;/P&gt;
&lt;H3&gt;Side by side comparison&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Dimension&lt;/th&gt;&lt;th&gt;Incident 1: CPU starvation&lt;/th&gt;&lt;th&gt;Incident 2: OOMKilled&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Trigger&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Azure Monitor alert (automated)&lt;/td&gt;&lt;td&gt;Engineer chat prompt (ad hoc)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Failure mode&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;CPU too low for startup probe to pass&lt;/td&gt;&lt;td&gt;Memory limit too low for process to start&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Key signal&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Exit code 1, probe timeout&lt;/td&gt;&lt;td&gt;Exit code 137, empty container logs&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Blast radius&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;4 workloads affected cluster wide&lt;/td&gt;&lt;td&gt;1 workload in target namespace&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Remediation&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;CPU request/limit patches across 4 deployments&lt;/td&gt;&lt;td&gt;Memory request/limit patch on 1 deployment&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;MTTR&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;~8 min&lt;/td&gt;&lt;td&gt;~4 min&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Human interventions&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;Why this matters&lt;/H2&gt;
&lt;P&gt;Most AKS environments already emit rich telemetry through Azure Monitor and managed Prometheus. What is still manual is the response: engineers paging through dashboards, running ad-hoc&amp;nbsp;&lt;EM&gt;kubectl&lt;/EM&gt;&amp;nbsp;commands, and applying hotfixes under time pressure. Azure SRE Agent changes that by turning repeatable investigation and remediation paths into an automated workflow.&lt;/P&gt;
&lt;P&gt;The value isn't just that the agent patched a CPU limit. It's that the investigation, remediation, and verification loop is the same regardless of failure mode, and it runs while your team sleeps.&lt;/P&gt;
&lt;P&gt;In this lab, the impact was measurable:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 57.8704%; height: 242px; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr style="height: 34.5714px;"&gt;&lt;th style="height: 34.5714px;"&gt;Metric&lt;/th&gt;&lt;th style="height: 34.5714px;"&gt;This demo with Azure SRE Agent&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr style="height: 34.5714px;"&gt;&lt;td style="height: 34.5714px;"&gt;Alert to recovery&lt;/td&gt;&lt;td style="height: 34.5714px;"&gt;~4 to 8 min&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.5714px;"&gt;&lt;td style="height: 34.5714px;"&gt;Human interventions&lt;/td&gt;&lt;td style="height: 34.5714px;"&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.5714px;"&gt;&lt;td style="height: 34.5714px;"&gt;Scope of investigation&lt;/td&gt;&lt;td style="height: 34.5714px;"&gt;Cluster wide, automated&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.5714px;"&gt;&lt;td style="height: 34.5714px;"&gt;Correlate evidence and diagnose&lt;/td&gt;&lt;td style="height: 34.5714px;"&gt;~2 min&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.5714px;"&gt;&lt;td style="height: 34.5714px;"&gt;Apply fix and verify&lt;/td&gt;&lt;td style="height: 34.5714px;"&gt;~4 min&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.5714px;"&gt;&lt;td style="height: 34.5714px;"&gt;Post incident follow-up&lt;/td&gt;&lt;td style="height: 34.5714px;"&gt;GitHub issue + draft PR&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 39.3602%" /&gt;&lt;col style="width: 60.5671%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;These results came from a controlled run on April 10, 2026. Real world outcomes depend on alert quality, cluster size, and how much automation you enable. For reference, industry reports from PagerDuty and Datadog typically place manual Sev1 MTTR in the 30 to 120 minute range for Kubernetes environments.&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Teams + GitHub follow-up&lt;/H2&gt;
&lt;P&gt;Runtime remediation is only half the story. If the workflow ends when the pod becomes healthy again, the same issue returns on the next deployment. That is why the post incident path matters.&lt;/P&gt;
&lt;P&gt;After Incident 1 resolved, Azure SRE Agent used the&amp;nbsp;&lt;STRONG&gt;GitHub connector&lt;/STRONG&gt;&amp;nbsp;to file an issue with the incident summary, root cause, and runtime changes. In the demo, I assigned that issue to the GitHub Copilot agent, which opened a draft pull request to align the source manifests with the hotfix. The agent can also be configured to submit the PR directly in the same workflow, not just open the issue, so the fix is in your review queue by the time anyone sees the notification. Human review remains the final control point before merge. Setup details for the GitHub connector are in the&amp;nbsp;demo repo &lt;A class="lia-external-url" href="https://github.com/hailugebru/azure-sre-agents-aks/tree/main" target="_blank" rel="noopener"&gt;README&lt;/A&gt;, and the official reference is in the&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/sre-agent/github-connector" target="_blank" rel="noopener"&gt;Azure SRE Agent docs&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Azure SRE Agent fixes the live issue, and the GitHub follow-up prepares the durable source change so future deployments do not reintroduce the same configuration problem.&lt;/STRONG&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;The operations to engineering handoff: Azure SRE Agent fixed the live cluster; GitHub Copilot agent prepares the durable source change so the same misconfiguration can't ship again.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;In parallel, the Teams connector posted milestone updates during the incident:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Investigation started.&lt;/LI&gt;
&lt;LI&gt;Root cause and remediation identified.&lt;/LI&gt;
&lt;LI&gt;Incident resolved.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Teams handled real time situational awareness. GitHub handled durable engineering follow-up. Together, they closed the gap between operations and software delivery.&lt;/P&gt;
&lt;H2&gt;Key takeaways&lt;/H2&gt;
&lt;H3&gt;Three things to carry forward&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Treat Azure SRE Agent as a governed incident response system, not a chatbot with infrastructure access.&lt;/STRONG&gt;&amp;nbsp;The most important controls are permission levels and run modes, not prompt quality.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Anchor detection in your existing incident platforms.&lt;/STRONG&gt; For this demo, we used Prometheus and Azure Monitor, but the pattern applies regardless of where your signals live. Use connectors to extend the workflow outward. Teams for real time coordination, GitHub for durable engineering follow-up.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Start where you're comfortable.&lt;/STRONG&gt;&amp;nbsp;If you are just getting your feet wet, begin with one resource group, one incident type, and&amp;nbsp;&lt;EM&gt;Review&amp;nbsp;&lt;/EM&gt;mode. Validate that telemetry flows, RBAC is scoped correctly, and your alert rules cover the failure modes you actually care about before enabling&amp;nbsp;&lt;EM&gt;Autonomous&lt;/EM&gt;. Expand only once each layer is trusted.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;Next steps&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Add Prometheus based alert coverage for &lt;EM&gt;ImagePullBackOff&lt;/EM&gt; and node resource pressure to complement the pod phase rule.&lt;/LI&gt;
&lt;LI&gt;Expand to multi cluster managed scopes once the single cluster path is trusted and validated.&lt;/LI&gt;
&lt;LI&gt;Explore how NAP (node auto provisioning) and Azure SRE Agent complement each other — NAP manages infrastructure capacity, while the agent investigates and remediates incidents.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;I'd like to thank&amp;nbsp;&lt;STRONG&gt;Cary Chai&lt;/STRONG&gt;, Senior Product Manager for Azure SRE Agent, for his early technical guidance and thorough review — his feedback sharpened both the accuracy and quality of this post.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Apr 2026 18:07:34 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/autonomous-aks-incident-response-with-azure-sre-agent-from-alert/ba-p/4511343</guid>
      <dc:creator>hailukassa</dc:creator>
      <dc:date>2026-04-20T18:07:34Z</dc:date>
    </item>
    <item>
      <title>New in Azure SRE Agent: Log Analytics and Application Insights Connectors</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/new-in-azure-sre-agent-log-analytics-and-application-insights/ba-p/4509649</link>
      <description>&lt;P data-line="2"&gt;Azure SRE Agent now supports&amp;nbsp;&lt;STRONG&gt;Log Analytics&lt;/STRONG&gt;&amp;nbsp;and&amp;nbsp;&lt;STRONG&gt;Application Insights&lt;/STRONG&gt;&amp;nbsp;as log providers, backed by the&amp;nbsp;&lt;A href="https://github.com/Azure/azure-mcp" target="_blank" rel="noopener" data-href="https://github.com/Azure/azure-mcp"&gt;Azure MCP Server&lt;/A&gt;. Connect your workspaces and App Insights resources, and the agent can query them directly during investigations.&lt;/P&gt;
&lt;H2 data-line="6"&gt;Why This Matters&lt;/H2&gt;
&lt;P data-line="8"&gt;Log Analytics and Application Insights are common destinations for Azure operational data - container logs, application traces, dependency failures, security events. The agent could already access this data through az monitor CLI commands if you granted RBAC roles to its managed identity, and that approach still works. But it required manual RBAC setup and the agent had to shell out to CLI for every query.&lt;/P&gt;
&lt;P data-line="10"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-line="10"&gt;With these connectors, setup is simpler and querying is faster. You pick a workspace, we handle the RBAC grants, and the agent gets native MCP-backed query tools instead of going through CLI.&lt;/P&gt;
&lt;H2 data-line="12"&gt;What You Get&lt;/H2&gt;
&lt;P data-line="14"&gt;&lt;STRONG&gt;Two new connector types&lt;/STRONG&gt;&amp;nbsp;in Builder &amp;gt; Connectors (or through the onboarding flow under Logs):&lt;/P&gt;
&lt;UL data-line="16"&gt;
&lt;LI data-line="16"&gt;&lt;STRONG&gt;Log Analytics&lt;/STRONG&gt;&amp;nbsp;- connect a workspace. The agent can query &lt;STRONG&gt;ContainerLog&lt;/STRONG&gt;, &lt;STRONG&gt;Syslog&lt;/STRONG&gt;, &lt;STRONG&gt;AzureDiagnostics&lt;/STRONG&gt;, &lt;STRONG&gt;KubeEvents&lt;/STRONG&gt;, &lt;STRONG&gt;SecurityEvent&lt;/STRONG&gt;, custom tables, anything in that workspace.&lt;/LI&gt;
&lt;LI data-line="17"&gt;&lt;STRONG&gt;Application Insights&lt;/STRONG&gt;&amp;nbsp;- connect an App Insights resource. The agent gets access to requests, dependencies, exceptions, traces, and custom telemetry.&lt;/LI&gt;
&lt;/UL&gt;
&lt;img&gt;New Connectors during onboarding.&lt;/img&gt;
&lt;P data-line="21"&gt;You can connect multiple workspaces and App Insights resources. The agent knows which ones are available and targets the right one based on the investigation.&lt;/P&gt;
&lt;H2 data-line="23"&gt;Setup&lt;/H2&gt;
&lt;P&gt;These connectors are currently in early access. To try them, enable &lt;STRONG&gt;Early access to features&lt;/STRONG&gt; under Settings &amp;gt; Basics.&lt;/P&gt;
&lt;img&gt;Early access to features&lt;/img&gt;
&lt;P data-line="27"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-line="27"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-line="27"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-line="27"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-line="27"&gt;From there you can add connectors in two ways:&lt;/P&gt;
&lt;P data-line="29"&gt;&lt;STRONG&gt;Through onboarding:&lt;/STRONG&gt;&amp;nbsp;Click&amp;nbsp;&lt;STRONG&gt;Logs&lt;/STRONG&gt;&amp;nbsp;in the onboarding flow, then select&amp;nbsp;&lt;STRONG&gt;Log Analytics Workspace&lt;/STRONG&gt;&amp;nbsp;or&amp;nbsp;&lt;STRONG&gt;Application Insights&lt;/STRONG&gt;&amp;nbsp;under Additional connectors.&lt;/P&gt;
&lt;P data-line="33"&gt;&lt;STRONG&gt;Through Builder:&lt;/STRONG&gt;&amp;nbsp;Go to&amp;nbsp;&lt;STRONG&gt;Builder &amp;gt; Connectors&lt;/STRONG&gt;&amp;nbsp;in the sidebar and add a&amp;nbsp;&lt;STRONG&gt;Log Analytics&lt;/STRONG&gt;&amp;nbsp;or&amp;nbsp;&lt;STRONG&gt;Application Insights&lt;/STRONG&gt;&amp;nbsp;connector.&lt;/P&gt;
&lt;P data-line="37"&gt;Pick your resource from the dropdown and save. If discovery doesn't find your resource, both connector types have a manual entry fallback.&lt;/P&gt;
&lt;P data-line="41"&gt;On save, we grant the agent's managed identity&amp;nbsp;&lt;STRONG&gt;Log Analytics Reader&lt;/STRONG&gt;&amp;nbsp;and&amp;nbsp;&lt;STRONG&gt;Monitoring Reader&lt;/STRONG&gt;&amp;nbsp;on the target resource group. If your account can't assign roles, you can grant them separately.&lt;/P&gt;
&lt;H2 data-line="43"&gt;Backed by Azure MCP&lt;/H2&gt;
&lt;P data-line="45"&gt;Under the hood, this uses the&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/developer/azure-mcp-server/" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/en-us/azure/developer/azure-mcp-server/"&gt;Azure MCP Server&lt;/A&gt;&amp;nbsp;with the&amp;nbsp;monitor&amp;nbsp;namespace. When you save your first connector, we spin up an MCP server instance automatically. The agent gets access to tools like:&lt;/P&gt;
&lt;UL data-line="47"&gt;
&lt;LI data-line="47"&gt;&lt;STRONG&gt;monitor_workspace_log_query&amp;nbsp;&lt;/STRONG&gt;- KQL against a workspace&lt;/LI&gt;
&lt;LI data-line="48"&gt;&lt;STRONG&gt;monitor_resource_log_query&lt;/STRONG&gt;&amp;nbsp;- KQL against a specific resource&lt;/LI&gt;
&lt;LI data-line="49"&gt;&lt;STRONG&gt;monitor_workspace_list&lt;/STRONG&gt;&amp;nbsp;- discover workspaces&lt;/LI&gt;
&lt;LI data-line="50"&gt;&lt;STRONG&gt;monitor_table_list&lt;/STRONG&gt;&amp;nbsp;- list tables in a workspace&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-line="52"&gt;Everything is read-only. The agent can query but never modify your monitoring configuration.&lt;/P&gt;
&lt;P data-line="54"&gt;If different connectors use different managed identities, the system handles per-call identity routing automatically.&lt;/P&gt;
&lt;H2 data-line="56"&gt;What It Looks Like&lt;/H2&gt;
&lt;P data-line="58"&gt;An alert fires on your AKS cluster. The agent starts investigating and queries your connected workspace:&lt;/P&gt;
&lt;LI-CODE lang="kusto"&gt;ContainerLog
| where TimeGenerated &amp;gt; ago(30m)
| where LogEntry contains "error" or LogEntry contains "exception"
| summarize count() by ContainerID, LogEntry | top 10 by count_

KubeEvents
| where TimeGenerated &amp;gt; ago(1h)
| where Reason in ("BackOff", "Failed", "Unhealthy") | summarize count() by Reason, Name, Namespace
| order by count_ desc&lt;/LI-CODE&gt;
&lt;P data-line="78"&gt;The agent also ships with built-in skills for common Log Analytics and App Insights query patterns, so it knows which tables to look at and how to structure queries for typical failure scenarios.&lt;/P&gt;
&lt;H2 data-line="80"&gt;Things to Know&lt;/H2&gt;
&lt;UL data-line="82"&gt;
&lt;LI data-line="83"&gt;&lt;STRONG&gt;Read-only&lt;/STRONG&gt;&amp;nbsp;- the agent can query data but cannot modify alerts, retention, or workspace config&lt;/LI&gt;
&lt;LI data-line="84"&gt;&lt;STRONG&gt;Resource discovery needs Reader&lt;/STRONG&gt;&amp;nbsp;- the dropdown uses Azure Resource Graph. If your resources don't show up, use the manual entry fallback&lt;/LI&gt;
&lt;LI data-line="85"&gt;&lt;STRONG&gt;One identity per connector&lt;/STRONG&gt;&amp;nbsp;- if workspaces need different managed identities, create separate connectors&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 data-line="87"&gt;Learn More&lt;/H2&gt;
&lt;UL data-line="89"&gt;
&lt;LI data-line="89"&gt;&lt;A href="https://sre.azure.com/docs" target="_blank" rel="noopener" data-href="https://sre.azure.com/docs"&gt;Azure SRE Agent documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="90"&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/developer/azure-mcp-server/" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/en-us/azure/developer/azure-mcp-server/"&gt;Azure MCP Server&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-line="92"&gt;We'd love feedback. Try it out and let us know what works and what doesn't.&lt;/P&gt;
&lt;P data-line="96"&gt;&lt;EM&gt;Azure SRE Agent is generally available. Learn more at&amp;nbsp;&lt;A href="https://sre.azure.com/docs" target="_blank" rel="noopener" data-href="https://sre.azure.com/docs"&gt;sre.azure.com/docs&lt;/A&gt;.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 17 Apr 2026 02:14:28 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/new-in-azure-sre-agent-log-analytics-and-application-insights/ba-p/4509649</guid>
      <dc:creator>Dalibor_Kovacevic</dc:creator>
      <dc:date>2026-04-17T02:14:28Z</dc:date>
    </item>
    <item>
      <title>Event-Driven IaC Operations with Azure SRE Agent: Terraform Drift Detection via HTTP Triggers</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/event-driven-iac-operations-with-azure-sre-agent-terraform-drift/ba-p/4512233</link>
      <description>&lt;H2 data-line="6"&gt;What Happens After&amp;nbsp;terraform plan&amp;nbsp;Finds Drift?&lt;/H2&gt;
&lt;P data-line="8"&gt;If your team is like most, the answer looks something like this:&lt;/P&gt;
&lt;OL data-line="10"&gt;
&lt;LI data-line="10"&gt;A nightly&amp;nbsp;terraform plan&amp;nbsp;runs and finds 3 drifted resources&lt;/LI&gt;
&lt;LI data-line="11"&gt;A notification lands in Slack or Teams&lt;/LI&gt;
&lt;LI data-line="12"&gt;Someone files a ticket&lt;/LI&gt;
&lt;LI data-line="13"&gt;During the next sprint, an engineer opens 4 browser tabs — Terraform state, Azure Portal, Activity Log, Application Insights — and spends 30 minutes piecing together&amp;nbsp;&lt;EM&gt;what happened&lt;/EM&gt;&lt;/LI&gt;
&lt;LI data-line="14"&gt;They discover the drift was caused by an on-call engineer who scaled up the App Service during a latency incident at 2 AM&lt;/LI&gt;
&lt;LI data-line="15"&gt;They revert the drift with&amp;nbsp;terraform apply&lt;/LI&gt;
&lt;LI data-line="16"&gt;The app goes down because they just scaled it&amp;nbsp;&lt;EM&gt;back down&lt;/EM&gt;&amp;nbsp;while the bug that caused the incident is still deployed&lt;/LI&gt;
&lt;/OL&gt;
&lt;P data-line="18"&gt;Step 7 is the one nobody talks about. Drift detection tooling has gotten remarkably good — scheduled plans, speculative runs, drift alerts — but the output is always the same:&amp;nbsp;&lt;STRONG&gt;a list of differences&lt;/STRONG&gt;. What changed. Not&amp;nbsp;&lt;EM&gt;why&lt;/EM&gt;. Not&amp;nbsp;&lt;EM&gt;whether it's safe to fix&lt;/EM&gt;.&lt;/P&gt;
&lt;P data-line="20"&gt;The gap isn't detection. It's everything that happens&amp;nbsp;&lt;EM&gt;after&lt;/EM&gt;&amp;nbsp;detection.&lt;/P&gt;
&lt;P data-line="22"&gt;&lt;STRONG&gt;HTTP Triggers in Azure SRE Agent close that gap.&lt;/STRONG&gt;&amp;nbsp;They turn the structured output that drift detection already produces — webhook payloads, plan summaries, run notifications — into the starting point of an autonomous investigation. Detection feeds the agent. The agent does the rest: correlates with incidents, reads source code, classifies severity, recommends context-aware remediation, notifies the team, and even ships a fix.&lt;/P&gt;
&lt;P data-line="24"&gt;Here's what that looks like end to end.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-line="26"&gt;&lt;STRONG&gt;What you'll see in this blog:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;An agent that classifies drift as&amp;nbsp;&lt;STRONG&gt;Benign&lt;/STRONG&gt;,&amp;nbsp;&lt;STRONG&gt;Risky&lt;/STRONG&gt;, or&amp;nbsp;&lt;STRONG&gt;Critical&lt;/STRONG&gt;&amp;nbsp;— not just "changed"&lt;/LI&gt;
&lt;LI&gt;Incident correlation that links a SKU change to a latency spike in Application Insights&lt;/LI&gt;
&lt;LI&gt;A remediation recommendation that says&amp;nbsp;&lt;STRONG&gt;"Do NOT revert"&lt;/STRONG&gt;&amp;nbsp;— and why reverting would cause an outage&lt;/LI&gt;
&lt;LI&gt;A Teams notification with the full investigation summary&lt;/LI&gt;
&lt;LI&gt;An agent that reviews its own performance, finds gaps, and&amp;nbsp;&lt;STRONG&gt;improves its own skill file&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;A pull request the agent created on its own to fix the root cause&lt;/LI&gt;
&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2 data-line="35"&gt;The Pipeline: Detection to Resolution in One Webhook&lt;/H2&gt;
&lt;P data-line="37"&gt;The architecture is straightforward. Terraform Cloud (or any drift detection tool) sends a webhook when it finds drift. An Azure Logic App adds authentication. The SRE Agent's HTTP Trigger receives it and starts an autonomous investigation.&lt;/P&gt;
&lt;img&gt;&lt;EM&gt;The end-to-end pipeline: Terraform Cloud detects drift and sends a webhook. The Logic App adds Azure AD authentication via Managed Identity. The SRE Agent's HTTP Trigger fires and the agent autonomously investigates across 7 dimensions.&lt;/EM&gt;&lt;/img&gt;
&lt;H2 data-line="46"&gt;Setting Up the Pipeline&lt;/H2&gt;
&lt;H3 data-line="48"&gt;Step 1: Deploy the Infrastructure with Terraform&lt;/H3&gt;
&lt;P data-line="50"&gt;We start with a simple Azure App Service running a Node.js application, deployed via Terraform. The Terraform configuration defines the desired state:&lt;/P&gt;
&lt;UL data-line="52"&gt;
&lt;LI data-line="52"&gt;&lt;STRONG&gt;App Service Plan&lt;/STRONG&gt;: B1 (Basic) — single vCPU, ~$13/mo&lt;/LI&gt;
&lt;LI data-line="53"&gt;&lt;STRONG&gt;App Service&lt;/STRONG&gt;: Node 20-lts with TLS 1.2&lt;/LI&gt;
&lt;LI data-line="54"&gt;&lt;STRONG&gt;Tags&lt;/STRONG&gt;:&amp;nbsp;environment: demo,&amp;nbsp;managed_by: terraform,&amp;nbsp;project: sre-agent-iac-blog&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI-CODE lang=""&gt;resource "azurerm_service_plan" "demo" {
  name                = "iacdemo-plan"
  resource_group_name = azurerm_resource_group.demo.name
  location            = azurerm_resource_group.demo.location
  os_type             = "Linux"
  sku_name            = "B1"
}&lt;/LI-CODE&gt;
&lt;P data-line="66"&gt;A Logic App is also deployed to act as the authentication bridge between Terraform Cloud webhooks and the SRE Agent's HTTP Trigger endpoint, using Managed Identity to acquire Azure AD tokens. Learn more about HTTP Triggers &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/http-triggers-in-azure-sre-agent-from-jira-ticket-to-automated-investigation/4504960?previewMessage=true" target="_blank" rel="noopener" data-lia-auto-title="here" data-lia-auto-title-active="0"&gt;here&lt;/A&gt;.&lt;/P&gt;
&lt;H3 data-line="68"&gt;Step 2: Create the Drift Analysis Skill&lt;/H3&gt;
&lt;P data-line="70"&gt;Skills are domain knowledge files that teach the agent&amp;nbsp;&lt;EM&gt;how&lt;/EM&gt;&amp;nbsp;to approach a problem. We create a&amp;nbsp;terraform-drift-analysis&amp;nbsp;skill with an 8-step workflow:&lt;/P&gt;
&lt;OL data-line="72"&gt;
&lt;LI data-line="72"&gt;&lt;STRONG&gt;Identify Scope&lt;/STRONG&gt;&amp;nbsp;— Which resource group and resources to check&lt;/LI&gt;
&lt;LI data-line="73"&gt;&lt;STRONG&gt;Detect Drift&lt;/STRONG&gt;&amp;nbsp;— Compare Terraform config against Azure reality&lt;/LI&gt;
&lt;LI data-line="74"&gt;&lt;STRONG&gt;Correlate with Incidents&lt;/STRONG&gt;&amp;nbsp;— Check Activity Log and App Insights&lt;/LI&gt;
&lt;LI data-line="75"&gt;&lt;STRONG&gt;Classify Severity&lt;/STRONG&gt;&amp;nbsp;— Benign, Risky, or Critical&lt;/LI&gt;
&lt;LI data-line="76"&gt;&lt;STRONG&gt;Investigate Root Cause&lt;/STRONG&gt;&amp;nbsp;— Read source code from the connected repository&lt;/LI&gt;
&lt;LI data-line="77"&gt;&lt;STRONG&gt;Generate Drift Report&lt;/STRONG&gt;&amp;nbsp;— Structured summary with severity-coded table&lt;/LI&gt;
&lt;LI data-line="78"&gt;&lt;STRONG&gt;Recommend Smart Remediation&lt;/STRONG&gt;&amp;nbsp;— Context-aware: don't blindly revert&lt;/LI&gt;
&lt;LI data-line="79"&gt;&lt;STRONG&gt;Notify Team&lt;/STRONG&gt;&amp;nbsp;— Post findings to Microsoft Teams&lt;/LI&gt;
&lt;/OL&gt;
&lt;P data-line="81"&gt;The key insight in the skill:&amp;nbsp;&lt;STRONG&gt;"NEVER revert critical drift that is actively mitigating an incident."&lt;/STRONG&gt;&amp;nbsp;This teaches the agent to think like an experienced SRE, not just a diff tool.&lt;/P&gt;
&lt;H3 data-line="83"&gt;Step 3: Create the HTTP Trigger&lt;/H3&gt;
&lt;P data-line="85"&gt;In the SRE Agent UI, we create an HTTP Trigger named tfc-drift-handler with a 7-step agent prompt:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;A Terraform Cloud run has completed and detected infrastructure drift.

Workspace: {payload.workspace_name}
Organization: {payload.organization_name}
Run ID: {payload.run_id}
Run Message: {payload.run_message}

STEP 1 — DETECT DRIFT: Compare Terraform configuration against actual Azure state...
STEP 2 — CORRELATE WITH INCIDENTS: Check Azure Activity Log and App Insights...
STEP 3 — CLASSIFY SEVERITY: Rate each drift item as Benign, Risky, or Critical...
STEP 4 — INVESTIGATE ROOT CAUSE: Read the application source code...
STEP 5 — GENERATE DRIFT REPORT: Produce a structured summary...
STEP 6 — RECOMMEND SMART REMEDIATION: Context-aware recommendations...
STEP 7 — NOTIFY TEAM: Post a summary to Microsoft Teams...&lt;/LI-CODE&gt;&lt;img&gt;&lt;EM&gt;The HTTP Trigger dashboard showing tfc-drift-handler active with 3 completed runs.&lt;/EM&gt;&lt;/img&gt;
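&lt;P&gt;The {payload.*} placeholders in the prompt above are filled from the incoming webhook body. A small sketch of that extraction (the field names are taken from the placeholders above; the payload shape here is illustrative, and a real Terraform Cloud notification payload carries additional fields):&lt;/P&gt;

```javascript
// Sketch: pull out the fields that the trigger prompt interpolates
// from the incoming webhook body. Field names mirror the {payload.*}
// placeholders above; the shape is illustrative, not the full
// Terraform Cloud notification schema.
function extractPromptFields(payload) {
  const required = ['workspace_name', 'organization_name', 'run_id', 'run_message'];
  const missing = required.filter(function (k) { return payload[k] === undefined; });
  if (missing.length !== 0) {
    throw new Error('webhook payload missing fields: ' + missing.join(', '));
  }
  return {
    workspace: payload.workspace_name,
    organization: payload.organization_name,
    runId: payload.run_id,
    runMessage: payload.run_message,
  };
}
```

Validating the body up front like this means a malformed or unexpected webhook fails loudly at the bridge instead of producing a half-formed investigation prompt.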
&lt;H3 data-line="107"&gt;Step 4: Connect GitHub and Teams&lt;/H3&gt;
&lt;P data-line="109"&gt;We connect two integrations in the SRE Agent Connectors settings:&lt;/P&gt;
&lt;UL data-line="111"&gt;
&lt;LI data-line="111"&gt;&lt;STRONG&gt;Code Repository&lt;/STRONG&gt;: GitHub — so the agent can read application source code during investigations&lt;/LI&gt;
&lt;LI data-line="112"&gt;&lt;STRONG&gt;Notification&lt;/STRONG&gt;: Microsoft Teams — so the agent can post drift reports to the team channel&lt;/LI&gt;
&lt;/UL&gt;
&lt;img&gt;&lt;EM&gt;Both connectors show "Connected" status — the agent can read source code and notify the team.&lt;/EM&gt;&lt;/img&gt;
&lt;H2 data-line="119"&gt;The Incident Story&lt;/H2&gt;
&lt;H3 data-line="121"&gt;Act 1: The Latency Bug&lt;/H3&gt;
&lt;P data-line="123"&gt;Our demo app has a subtle but devastating bug. The /api/data endpoint calls processLargeDatasetSync() — a function that sorts an array on every iteration, creating an O(n² log n) blocking operation.&lt;/P&gt;
&lt;P data-line="137"&gt;On a B1 App Service Plan (single vCPU), this blocks the Node.js event loop entirely. Under load, response times spike from milliseconds to&amp;nbsp;&lt;STRONG&gt;25-58 seconds&lt;/STRONG&gt;, with 502 Bad Gateway errors from the Azure load balancer.&lt;/P&gt;
&lt;H3 data-line="139"&gt;Act 2: The On-Call Response&lt;/H3&gt;
&lt;P data-line="141"&gt;An on-call engineer sees the latency alerts and responds — not through Terraform, but directly through the Azure Portal and CLI. They:&lt;/P&gt;
&lt;OL data-line="143"&gt;
&lt;LI data-line="143"&gt;&lt;STRONG&gt;Add diagnostic tags&lt;/STRONG&gt;&amp;nbsp;—&amp;nbsp;manual_update=True,&amp;nbsp;changed_by=portal_user&amp;nbsp;(benign)&lt;/LI&gt;
&lt;LI data-line="144"&gt;&lt;STRONG&gt;Downgrade TLS&lt;/STRONG&gt;&amp;nbsp;from 1.2 to 1.0 while troubleshooting (risky — security regression)&lt;/LI&gt;
&lt;LI data-line="145"&gt;&lt;STRONG&gt;Scale the App Service Plan&lt;/STRONG&gt; from B1 to S1 to throw more compute at the problem (critical — cost increase from ~$13/mo to ~$73/mo)&lt;/LI&gt;
&lt;/OL&gt;
&lt;img&gt;&lt;EM&gt;The Azure Portal tells the story: a TLS security warning banner across the top, unauthorized manual_update and changed_by tags, and the App Service Plan upgraded to S1. Three types of drift, all from a single incident response.&lt;/EM&gt;&lt;/img&gt;
&lt;P data-line="150"&gt;The incident is partially mitigated — S1 has more compute, so latency drops from catastrophic to merely bad. Everyone goes back to sleep. Nobody updates Terraform.&lt;/P&gt;
&lt;H3 data-line="152"&gt;Act 3: The Drift Check Fires&lt;/H3&gt;
&lt;P data-line="154"&gt;The next morning, a nightly speculative Terraform plan runs and detects 3 drifted attributes. The notification webhook fires, flowing through the Logic App auth bridge to the SRE Agent HTTP Trigger.&lt;/P&gt;
&lt;P data-line="156"&gt;The agent wakes up and begins its investigation.&lt;/P&gt;
&lt;H2 data-line="160"&gt;What the Agent Found&lt;/H2&gt;
&lt;H3 data-line="162"&gt;Layer 1: Drift Detection&lt;/H3&gt;
&lt;P data-line="164"&gt;The agent compares Terraform configuration against Azure reality and produces a severity-classified drift report:&lt;/P&gt;
&lt;img&gt;&lt;EM&gt;The agent's drift report — organized by severity. The "Incident Correlation" column (partially visible) is what makes this more than a terraform plan wrapper.&lt;/EM&gt;&lt;/img&gt;
&lt;P data-line="169"&gt;Three drift items detected:&lt;/P&gt;
&lt;UL data-line="170"&gt;
&lt;LI data-line="170"&gt;&lt;STRONG&gt;Critical&lt;/STRONG&gt;: App Service Plan SKU changed from B1 (~$13/mo) to S1 (~$73/mo) — a +462% cost increase&lt;/LI&gt;
&lt;LI data-line="171"&gt;&lt;STRONG&gt;Risky&lt;/STRONG&gt;: Minimum TLS version downgraded from 1.2 to 1.0 — a security regression vulnerable to BEAST and POODLE attacks&lt;/LI&gt;
&lt;LI data-line="172"&gt;&lt;STRONG&gt;Benign&lt;/STRONG&gt;: Additional tags (changed_by: portal_user,&amp;nbsp;manual_update: True) — cosmetic, no functional impact&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-line="174"&gt;Layer 2: Incident Correlation&lt;/H3&gt;
&lt;P data-line="176"&gt;Here's where the agent goes beyond simple drift detection. It queries Application Insights and discovers a&amp;nbsp;&lt;STRONG&gt;performance incident&lt;/STRONG&gt; correlated with the SKU change:&lt;/P&gt;
&lt;img&gt;&lt;EM&gt;The agent found that GET /api/data is averaging 25,919ms with a P95 of 57,697ms — affecting 97.6% of all requests. It also discovered that the /api/data endpoint exists in production but not in the repository source code.&lt;/EM&gt;&lt;/img&gt;
&lt;P data-line="181"&gt;Key findings from the incident correlation:&lt;/P&gt;
&lt;UL data-line="182"&gt;
&lt;LI data-line="182"&gt;&lt;STRONG&gt;97.6% of requests&lt;/STRONG&gt;&amp;nbsp;(40 of 41) were impacted by high latency&lt;/LI&gt;
&lt;LI data-line="183"&gt;The&amp;nbsp;/api/data&amp;nbsp;endpoint&amp;nbsp;&lt;STRONG&gt;does not exist in the repository source code&lt;/STRONG&gt;&amp;nbsp;— the deployed application has diverged from the codebase&lt;/LI&gt;
&lt;LI data-line="184"&gt;The endpoint likely contains a&amp;nbsp;&lt;STRONG&gt;blocking synchronous pattern&lt;/STRONG&gt;&amp;nbsp;— Node.js runs on a single event loop, and any synchronous blocking call would explain 26-58s response times&lt;/LI&gt;
&lt;LI data-line="185"&gt;The SKU scale-up from B1→S1 was an&amp;nbsp;&lt;STRONG&gt;attempt to mitigate latency&lt;/STRONG&gt;&amp;nbsp;by adding more compute, but scaling cannot fix application-level blocking code on a single-threaded Node.js server&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-line="187"&gt;Layer 3: Smart Remediation&lt;/H3&gt;
&lt;P data-line="189"&gt;This is the insight that separates an autonomous agent from a reporting tool. Instead of blindly recommending "revert all drift," the agent produces&amp;nbsp;&lt;STRONG&gt;context-aware remediation recommendations&lt;/STRONG&gt;:&lt;/P&gt;
&lt;img&gt;&lt;EM&gt;Three different recommendations based on context: safe to revert (tags), revert immediately for security (TLS), and critically — do NOT revert the SKU until the code is fixed.&lt;/EM&gt;&lt;/img&gt;
&lt;P data-line="194"&gt;The agent's remediation logic:&lt;/P&gt;
&lt;OL data-line="196"&gt;
&lt;LI data-line="196"&gt;&lt;STRONG&gt;Tags (Benign)&lt;/STRONG&gt;&amp;nbsp;→ Safe to revert anytime via&amp;nbsp;terraform apply -target&lt;/LI&gt;
&lt;LI data-line="197"&gt;&lt;STRONG&gt;TLS 1.0 (Risky)&lt;/STRONG&gt;&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;Revert immediately&lt;/STRONG&gt;&amp;nbsp;— the TLS downgrade is a security risk unrelated to the incident&lt;/LI&gt;
&lt;LI data-line="198"&gt;&lt;STRONG&gt;SKU S1 (Critical)&lt;/STRONG&gt;&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;DO NOT revert&lt;/STRONG&gt; until the&amp;nbsp;/api/data&amp;nbsp;performance root cause is fixed&lt;/LI&gt;
&lt;/OL&gt;
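&lt;P&gt;The ordering above can be sketched as a simple decision rule. The field names below are assumptions for illustration — the agent's actual internals aren't published — but the logic matches the three recommendations:&lt;/P&gt;

```javascript
// Sketch of context-aware remediation: severity alone does not decide the
// action -- drift that mitigates an active incident must not be reverted
// until the root cause is fixed.
function remediationFor(drift) {
  if (drift.securityRisk) return 'revert-immediately';
  if (drift.mitigatesActiveIncident) return 'hold-until-root-cause-fixed';
  if (drift.severity === 'benign') return 'revert-anytime';
  return 'review-manually';
}

console.log(remediationFor({ severity: 'benign' }));
// 'revert-anytime' (tags)
console.log(remediationFor({ severity: 'risky', securityRisk: true }));
// 'revert-immediately' (TLS 1.0)
console.log(remediationFor({ severity: 'critical', mitigatesActiveIncident: true }));
// 'hold-until-root-cause-fixed' (SKU S1)
```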
&lt;img&gt;&lt;EM&gt;&lt;STRONG&gt;Agent explaining "Do NOT revert the SKU from S1 back to B1 yet" with recommended action sequence and code fix. The agent explains why the SKU shouldn't be reverted: "Reverting to B1 while the /api/data blocking code is still deployed would worsen the performance incident." It then provides a 5-step action sequence and a suggested async code pattern.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/img&gt;
&lt;P data-line="203"&gt;This is the logic an experienced SRE would apply. Blindly running&amp;nbsp;terraform apply&amp;nbsp;to revert all drift would scale the app back down to B1 while the blocking code is still deployed — turning a mitigated incident into an active outage.&lt;/P&gt;
&lt;H3 data-line="205"&gt;Layer 4: Investigation Summary&lt;/H3&gt;
&lt;P data-line="207"&gt;The agent produces a complete summary tying everything together:&lt;/P&gt;
&lt;img&gt;&lt;EM&gt;&lt;STRONG&gt;Investigation summary showing drift table, key findings including actor and performance incident, remediation recommendations, and actions taken. The final summary includes: who made the changes (identified via the Activity Log), the performance incident details, the code-infrastructure mismatch finding, and three actions taken — Teams notification, skill improvement, and PR creation.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/img&gt;
&lt;P data-line="212"&gt;Key findings in the summary:&lt;/P&gt;
&lt;UL data-line="213"&gt;
&lt;LI data-line="213"&gt;&lt;STRONG&gt;Actor&lt;/STRONG&gt;:&amp;nbsp;surivineela@microsoft.com&amp;nbsp;made all changes via Azure Portal at ~23:19 UTC&lt;/LI&gt;
&lt;LI data-line="214"&gt;&lt;STRONG&gt;Performance incident&lt;/STRONG&gt;:&amp;nbsp;/api/data&amp;nbsp;averaging 25-57s latency, affecting 97.6% of requests&lt;/LI&gt;
&lt;LI data-line="215"&gt;&lt;STRONG&gt;Code-infrastructure mismatch&lt;/STRONG&gt;:&amp;nbsp;/api/data&amp;nbsp;exists in production but&amp;nbsp;&lt;STRONG&gt;not in the repository&lt;/STRONG&gt;&amp;nbsp;source code&lt;/LI&gt;
&lt;LI data-line="216"&gt;&lt;STRONG&gt;Root cause&lt;/STRONG&gt;: SKU scale-up was emergency incident response, not unauthorized drift&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-line="218"&gt;Layer 5: Teams Notification&lt;/H3&gt;
&lt;P data-line="220"&gt;The agent posts a structured drift report to the team's Microsoft Teams channel:&lt;/P&gt;
&lt;img&gt;&lt;EM&gt;&lt;STRONG&gt;Teams channel showing "Terraform Drift Detected" notification with drift table, performance incident, root cause, and recommended actions. The Teams notification includes the severity-coded drift table, the performance incident context, the root cause explanation, and the 5-step recommended action sequence — all posted automatically, with a link back to the full SRE Agent investigation thread.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/img&gt;
&lt;P data-line="225"&gt;The on-call engineer opens Teams in the morning and sees everything they need: what drifted, why it drifted, and exactly what to do about it — without logging into any dashboard.&lt;/P&gt;
&lt;H2 data-line="229"&gt;The Payoff: A Self-Improving Agent&lt;/H2&gt;
&lt;P data-line="231"&gt;Here's where the demo surprised us. After completing the investigation, the agent did two things we didn't explicitly ask for.&lt;/P&gt;
&lt;H3 data-line="233"&gt;The Agent Improved Its Own Skill&lt;/H3&gt;
&lt;P data-line="235"&gt;The agent performed an&amp;nbsp;&lt;STRONG&gt;Execution Review&lt;/STRONG&gt; — analyzing what worked and what didn't during its investigation — and found 5 gaps in its own&amp;nbsp;terraform-drift-analysis.md&amp;nbsp;skill file:&lt;/P&gt;
&lt;img&gt;&lt;EM&gt;&lt;STRONG&gt;Execution Review showing what worked well, 5 gaps found in the skill, and the agent editing terraform-drift-analysis.md. The agent identified gaps including "No incident correlation guidance," "No smart remediation logic," and "No Activity Log integration" — then updated its own skill file with these learnings for next time.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/img&gt;
&lt;P data-line="240"&gt;What worked well:&lt;/P&gt;
&lt;UL data-line="241"&gt;
&lt;LI data-line="241"&gt;Drift detection via az CLI comparison against Terraform HCL was straightforward&lt;/LI&gt;
&lt;LI data-line="242"&gt;Activity Log correlation identified the actor and timing&lt;/LI&gt;
&lt;LI data-line="243"&gt;Application Insights telemetry revealed the performance incident driving the SKU change&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-line="245"&gt;Gaps it found and fixed:&lt;/P&gt;
&lt;OL data-line="246"&gt;
&lt;LI data-line="246"&gt;&lt;STRONG&gt;No incident correlation guidance&lt;/STRONG&gt;&amp;nbsp;— the skill didn't instruct checking App Insights&lt;/LI&gt;
&lt;LI data-line="247"&gt;&lt;STRONG&gt;No code-infrastructure mismatch detection&lt;/STRONG&gt;&amp;nbsp;— no guidance to verify deployed code matches the repository&lt;/LI&gt;
&lt;LI data-line="248"&gt;&lt;STRONG&gt;No smart remediation logic&lt;/STRONG&gt;&amp;nbsp;— didn't warn against reverting critical drift during active incidents&lt;/LI&gt;
&lt;LI data-line="249"&gt;&lt;STRONG&gt;Report template missing incident correlation column&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-line="250"&gt;&lt;STRONG&gt;No Activity Log integration guidance&lt;/STRONG&gt;&amp;nbsp;— didn't instruct checking who made changes and when&lt;/LI&gt;
&lt;/OL&gt;
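&lt;P&gt;For a sense of what such a skill update might look like, here is a hypothetical excerpt — the actual contents of &lt;EM&gt;terraform-drift-analysis.md&lt;/EM&gt; are in the linked repository; this is purely illustrative of the gaps listed above:&lt;/P&gt;

```markdown
## Incident correlation (added after execution review)

Before recommending remediation for any drifted resource:

1. Query Application Insights for latency/error anomalies in the drift window.
2. Check the Azure Activity Log for who made each change, and when.
3. Verify that deployed endpoints exist in the repository source code.
4. If a drifted value mitigates an active incident, mark it
   "DO NOT revert until the root cause is fixed."
```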
&lt;P data-line="252"&gt;The agent then&amp;nbsp;&lt;STRONG&gt;edited its own skill file&lt;/STRONG&gt;&amp;nbsp;to incorporate these learnings. Next time it runs a drift analysis, it will include incident correlation, code-infra mismatch checks, and smart remediation logic&amp;nbsp;&lt;EM&gt;by default&lt;/EM&gt;.&lt;/P&gt;
&lt;P data-line="254"&gt;This is a&amp;nbsp;&lt;STRONG&gt;learning loop&lt;/STRONG&gt;&amp;nbsp;— every investigation makes the agent better at future investigations.&lt;/P&gt;
&lt;H3 data-line="256"&gt;The Agent Created a PR&lt;/H3&gt;
&lt;P data-line="258"&gt;Without being asked, the agent identified the root cause code issue and&amp;nbsp;&lt;STRONG&gt;proactively created a pull request&lt;/STRONG&gt; to fix it:&lt;/P&gt;
&lt;img&gt;&lt;EM&gt;&lt;STRONG&gt;GitHub PR #1 "Improve terraform-drift-analysis skill with incident correlation and smart remediation" showing code changes to server.js. PR #1 — the agent modified both server.js (adding safety constants and capping delay values) and terraform-drift-analysis.md (incorporating the learnings from the investigation). Two commits, two files changed, +103/-10 lines.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/img&gt;
&lt;P data-line="263"&gt;The PR includes:&lt;/P&gt;
&lt;UL data-line="264"&gt;
&lt;LI data-line="264"&gt;&lt;STRONG&gt;App safety fixes&lt;/STRONG&gt;: Adding&amp;nbsp;MAX_DELAY_MS&amp;nbsp;and&amp;nbsp;SERVER_TIMEOUT_MS&amp;nbsp;constants to prevent unbounded latency&lt;/LI&gt;
&lt;LI data-line="265"&gt;&lt;STRONG&gt;Skill improvements&lt;/STRONG&gt;: Incorporating incident correlation, code-infra mismatch detection, and smart remediation logic&lt;/LI&gt;
&lt;/UL&gt;
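&lt;P&gt;The safety-fix idea is straightforward to sketch. The constant name comes from the PR description; the surrounding code and the ceiling value are assumptions:&lt;/P&gt;

```javascript
// Sketch of the latency cap described in the PR: clamp any requested
// delay so a single parameter can no longer produce 26-58s responses.
const MAX_DELAY_MS = 5000; // assumed ceiling for illustration

function cappedDelay(requestedMs) {
  // Math.min enforces the upper bound regardless of the caller's input
  return Math.min(Number(requestedMs) || 0, MAX_DELAY_MS);
}

console.log(cappedDelay(58000)); // 5000 -- the pathological case is bounded
console.log(cappedDelay(120));   // 120  -- normal values pass through
```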
&lt;BLOCKQUOTE&gt;
&lt;P data-line="267"&gt;From a single webhook: drift detected → incident correlated → root cause found → team notified → skill improved → fix shipped.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2 data-line="310"&gt;Key Takeaways&lt;/H2&gt;
&lt;OL data-line="312"&gt;
&lt;LI data-line="312"&gt;&lt;STRONG&gt;Drift detection is not enough.&lt;/STRONG&gt;&amp;nbsp;Knowing that B1 changed to S1 is table stakes. Knowing it changed&amp;nbsp;&lt;EM&gt;because&lt;/EM&gt;&amp;nbsp;of a latency incident, and that reverting it would&amp;nbsp;&lt;EM&gt;cause an outage&lt;/EM&gt;&amp;nbsp;— that's the insight that matters.&lt;/LI&gt;
&lt;LI data-line="314"&gt;&lt;STRONG&gt;Context-aware remediation prevents outages.&lt;/STRONG&gt;&amp;nbsp;Blindly running&amp;nbsp;terraform apply&amp;nbsp;after drift would have scaled the app back to B1 while blocking code was still deployed. The agent's "DO NOT revert SKU" recommendation is the difference between fixing drift and causing a P1.&lt;/LI&gt;
&lt;LI data-line="316"&gt;&lt;STRONG&gt;Skills create a learning loop.&lt;/STRONG&gt;&amp;nbsp;The agent's self-review and skill improvement means every investigation makes the next one better — without human intervention.&lt;/LI&gt;
&lt;LI data-line="318"&gt;&lt;STRONG&gt;HTTP Triggers connect any platform.&lt;/STRONG&gt;&amp;nbsp;The auth bridge pattern (Logic App + Managed Identity) works for Terraform Cloud, but the same architecture applies to any webhook source: GitHub Actions, Jenkins, Datadog, PagerDuty, custom internal tools.&lt;/LI&gt;
&lt;LI data-line="320"&gt;&lt;STRONG&gt;The agent acts, not just reports.&lt;/STRONG&gt; From a single webhook: drift detected, incident correlated, root cause identified, team notified via Teams, skill improved, and PR created. End-to-end in one autonomous session.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2 data-line="324"&gt;Getting Started&lt;/H2&gt;
&lt;P data-line="326"&gt;HTTP Triggers are available now in Azure SRE Agent:&lt;/P&gt;
&lt;OL data-line="328"&gt;
&lt;LI data-line="328"&gt;&lt;STRONG&gt;Create a Skill&lt;/STRONG&gt;&amp;nbsp;— Teach the agent your operational runbook (in this case, drift analysis with severity classification and smart remediation)&lt;/LI&gt;
&lt;LI data-line="329"&gt;&lt;STRONG&gt;Create an HTTP Trigger&lt;/STRONG&gt;&amp;nbsp;— Define your agent prompt with&amp;nbsp;{payload.X}&amp;nbsp;placeholders and connect it to a skill&lt;/LI&gt;
&lt;LI data-line="330"&gt;&lt;STRONG&gt;Set Up an Auth Bridge&lt;/STRONG&gt;&amp;nbsp;— Deploy a Logic App with Managed Identity to handle Azure AD token acquisition&lt;/LI&gt;
&lt;LI data-line="331"&gt;&lt;STRONG&gt;Connect Your Source&lt;/STRONG&gt;&amp;nbsp;— Point Terraform Cloud (or any webhook-capable platform) at the Logic App URL&lt;/LI&gt;
&lt;LI data-line="332"&gt;&lt;STRONG&gt;Connect GitHub + Teams&lt;/STRONG&gt;&amp;nbsp;— Give the agent access to source code and team notifications&lt;/LI&gt;
&lt;/OL&gt;
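&lt;P&gt;To illustrate step 2, here is one way the &lt;EM&gt;{payload.X}&lt;/EM&gt; substitution could work. This is a sketch — the agent performs the substitution server-side, and the payload field names shown are assumptions:&lt;/P&gt;

```javascript
// Sketch: fill {payload.X} placeholders in a prompt template from a
// webhook body. Unknown placeholders are left untouched.
function fillPrompt(template, payload) {
  return template.replace(/\{payload\.([\w.]+)\}/g, (m, key) =>
    key.split('.').reduce((o, k) => (o == null ? undefined : o[k]), payload) ?? m
  );
}

const prompt = fillPrompt(
  'Drift detected in workspace {payload.workspace}: run {payload.run_id}',
  { workspace: 'prod-app', run_id: 'run-abc123' } // hypothetical webhook fields
);
console.log(prompt);
// "Drift detected in workspace prod-app: run run-abc123"
```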
&lt;P data-line="334"&gt;Within minutes, you'll have an autonomous pipeline that turns infrastructure drift events into fully contextualized investigations — with incident correlation, root cause analysis, and smart remediation recommendations.&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;The full implementation guide, Terraform files, skill definitions, and demo scripts are available in &lt;A class="lia-external-url" href="https://github.com/microsoft/sre-agent/tree/main/samples/terraform-drift-detection" target="_blank"&gt;this&amp;nbsp;&lt;/A&gt;repository.&lt;/EM&gt;&lt;/P&gt;
&lt;P data-line="267"&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 17 Apr 2026 02:13:18 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/event-driven-iac-operations-with-azure-sre-agent-terraform-drift/ba-p/4512233</guid>
      <dc:creator>Vineela-Suri</dc:creator>
      <dc:date>2026-04-17T02:13:18Z</dc:date>
    </item>
    <item>
      <title>Explaining what GitHub Copilot Modernization can (and cannot do)</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/explaining-what-github-copilot-modernization-can-and-cannot-do/ba-p/4511739</link>
      <description>&lt;P&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/what-ai-agents-for-modernization-look-like-in-practice/4506366" target="_blank" rel="noopener" data-lia-auto-title="In the last post" data-lia-auto-title-active="0"&gt;In the last post&lt;/A&gt;, we looked at the workflow: assess, plan, execute. You get reports you can review and the agent makes changes you can inspect.&lt;/P&gt;
&lt;P&gt;If you don’t know, &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/dotnet/core/porting/github-copilot-app-modernization/overview" target="_blank" rel="noopener"&gt;GitHub Copilot Modernization&lt;/A&gt; is the new agentic tool that supports you in modernizing older applications. Could it help with that old .NET Framework 4.8 app, or even that forgotten VB.NET script?&lt;/P&gt;
&lt;P&gt;You're probably not modernizing one small app. More likely it's a handful of projects, each with its own stack of blockers: different frameworks, different databases, different dependencies frozen in time because nobody wants to touch them.&lt;/P&gt;
&lt;P&gt;GitHub Copilot modernization handles two big categories: upgrading .NET projects to newer versions and migrating .NET apps to Azure. &lt;SPAN data-teams="true"&gt;But what does that look like&lt;/SPAN&gt;?&lt;/P&gt;
&lt;img /&gt;
&lt;H3&gt;Upgrading .NET Projects&lt;/H3&gt;
&lt;P&gt;Let’s say you've got an ASP.NET app running on .NET Framework 4.8, or a web API stuck on .NET Core 3.1. Unfortunately, getting it to .NET 9 or 10 isn't just a matter of updating a target framework property.&lt;/P&gt;
&lt;P&gt;Here's what the upgrade workflow handles in &lt;STRONG&gt;Visual Studio&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Assessment first.&lt;/STRONG&gt; The agent examines your project structure, dependencies, and code patterns. It generates an Assessment Report UI, which shows both the application information used to create the plan and the cloud readiness for Azure deployment.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;STRONG&gt;Then planning.&lt;/STRONG&gt;&amp;nbsp;Once you approve the assessment, it moves to planning. Here you get upgrade strategies, refactoring approaches, dependency upgrade paths, and risk mitigations documented in a &lt;EM&gt;plan.md&lt;/EM&gt; file at &lt;EM&gt;.appmod/.migration&lt;/EM&gt;. You can review and edit that Markdown before moving forward, or ask in the Copilot Chat window to change it.&lt;/P&gt;
&lt;LI-CODE lang="markdown"&gt;# .NET 10.0 Upgrade Plan

## Execution Steps

Execute steps below sequentially one by one in the order they are listed.

1. Validate that a .NET 10.0 SDK required for this upgrade is installed on the machine and if not, help to get it installed.
2. Ensure that the SDK version specified in global.json files is compatible with the .NET 10.0 upgrade.
3. Upgrade src\eShopLite.StoreFx\eShopLite.StoreFx.csproj

## Settings

This section contains settings and data used by execution steps.

### Excluded projects

No projects are excluded from this upgrade.

### Aggregate NuGet packages modifications across all projects

NuGet packages used across all selected projects or their dependencies that need version update in projects that reference them&lt;/LI-CODE&gt;
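&lt;P&gt;Step 2 of the plan checks &lt;EM&gt;global.json&lt;/EM&gt; compatibility. For reference, a &lt;EM&gt;global.json&lt;/EM&gt; that pins the SDK looks like this (the version shown is illustrative):&lt;/P&gt;

```json
{
  "sdk": {
    "version": "10.0.100",
    "rollForward": "latestFeature"
  }
}
```

&lt;P&gt;If the pinned version can't resolve to the target SDK and &lt;EM&gt;rollForward&lt;/EM&gt; doesn't allow a newer one, the build fails, which is why the plan validates this before upgrading any projects.&lt;/P&gt;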
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;STRONG&gt;Then execution.&lt;/STRONG&gt;&amp;nbsp;After you approve the plan, the agent breaks it into discrete tasks in a &lt;EM&gt;tasks.md&lt;/EM&gt; file. Each task gets validation criteria. As it works, it updates the file with checkboxes and completion percentages so you can track progress. It makes code changes, verifies builds, and runs tests. If it hits a problem, it tries to identify the cause and apply a fix.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Go to the GitHub Copilot Chat window and type:&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The plan and progress tracker look good to me. Go ahead with the migration.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;It usually creates Git commits for each portion so you can review what changed or roll back if you need to. If you don’t want those commits, you can ask the agent at the start not to commit anything.&lt;/P&gt;
&lt;P&gt;The agent primarily focuses on ASP.NET, ASP.NET Core, Blazor, Razor Pages, MVC, and Web API. It can also handle Azure Functions, WPF, Windows Forms, console apps, class libraries, and test projects.&lt;/P&gt;
&lt;H3&gt;What It Handles Well (and What It Doesn't)&lt;/H3&gt;
&lt;img /&gt;
&lt;P&gt;The agent is good at code-level transformations: updating&amp;nbsp;&lt;EM&gt;TargetFramework&lt;/EM&gt;&amp;nbsp;in&amp;nbsp;&lt;STRONG&gt;.csproj &lt;/STRONG&gt;files, upgrading NuGet packages, replacing deprecated APIs with their modern equivalents, fixing breaking changes like removed &lt;EM&gt;BinaryFormatter &lt;/EM&gt;methods, running builds, and validating test suites. It can handle repetitive work across multiple projects in a solution without you needing to track every dependency manually.&lt;/P&gt;
&lt;P&gt;It's also solid at applying predefined Azure migration patterns, swapping plaintext credentials for managed identity, replacing file I/O with Azure Blob Storage calls, moving authentication from on-prem Active Directory to Microsoft Entra ID. These are structured transformations with clear before-and-after code patterns.&lt;/P&gt;
&lt;P&gt;But here's where you may need to pay closer attention:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Language and framework coverage&lt;/STRONG&gt;: It works mainly with C# projects. If your codebase includes complex Entity Framework migrations that rely on hand-tuned database scripts, the agent won't rewrite those for you. It also won't handle third-party UI framework patterns that don't map cleanly to ASP.NET Core conventions, or patterns affected by breaking changes between .NET Framework and later .NET versions. Web Forms migration is underway.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Configuration and infrastructure: &lt;/STRONG&gt;The agent doesn't migrate IIS-specific&amp;nbsp;&lt;EM&gt;web.config&lt;/EM&gt; settings that don't have direct equivalents in Kestrel or ASP.NET Core. It won't automatically set up a CI/CD pipeline or other modernization extras; you'll need to implement those yourself, with Copilot’s help. If you've got frontend frameworks bundled with ASP.NET (like an older Angular app served through MVC), you'll need to separate and upgrade that layer yourself.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Learning and memory: &lt;/STRONG&gt;The agent uses your code as context during the session, and if you correct a fix or update the plan, it tries to apply that learning within the same session. But those corrections don't persist across future upgrades. You can encode internal standards using custom skills, but that requires deliberate setup.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Offline and deployment:&lt;/STRONG&gt; There's no offline mode. The agent needs connectivity to run. And while it can help prepare your app for Azure deployment, it doesn't manage the actual infrastructure provisioning or ongoing operations; that's still on you.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Guarantees&lt;/STRONG&gt;: The suggestions aren't guaranteed to follow best practices. The agent won't always pick the best migration path. It won't catch every edge case. You're reviewing the work; pay attention to the results before putting it into production.&lt;/P&gt;
&lt;P&gt;What it does handle: the tedious parts. Reading dependency graphs. Finding all the places a deprecated API is used. Updating project files. Writing boilerplate for managed identity. Fixing compilation errors that follow a predictable pattern.&lt;/P&gt;
&lt;H3&gt;Where to Start&lt;/H3&gt;
&lt;P&gt;If you've been staring at a modernization backlog, pick one project. See what it comes up with! You don't have to commit to upgrading your entire portfolio. Try it on one project and see if it saves you time. Modernization at scale still happens application by application, repo by repo, and decision by decision. &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/dotnet/core/porting/github-copilot-app-modernization/overview" target="_blank" rel="noopener"&gt;GitHub Copilot modernization&lt;/A&gt; just makes each one a little less painful. Experiment with it!&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 16 Apr 2026 21:45:41 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/explaining-what-github-copilot-modernization-can-and-cannot-do/ba-p/4511739</guid>
      <dc:creator>PabloLopes</dc:creator>
      <dc:date>2026-04-16T21:45:41Z</dc:date>
    </item>
    <item>
      <title>Managing Multi‑Tenant Azure Resource with SRE Agent and Lighthouse</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/managing-multi-tenant-azure-resource-with-sre-agent-and/ba-p/4511789</link>
      <description>&lt;P&gt;&lt;A href="https://azure.microsoft.com/en-us/products/sre-agent" target="_blank"&gt;Azure SRE Agent&lt;/A&gt; is an AI‑powered reliability assistant that helps teams diagnose and resolve production issues faster while reducing operational toil. It analyzes logs, metrics, &lt;SPAN class="lia-text-color-21"&gt;alerts,&lt;/SPAN&gt; and deployment data to perform root cause analysis and recommend or execute mitigations with human approval. It’s capable of integrating with azure services across subscriptions and resource groups that you need to monitor and manage. Today’s enterprise customers live in a multi-tenant world, and there are multiple reasons to that due to acquisitions, complex corporate structures, managed service providers, or IT partners. Azure &lt;A href="https://learn.microsoft.com/en-us/azure/lighthouse/overview" target="_blank"&gt;Lighthouse&lt;/A&gt; enables enterprise IT teams and managed service providers to manage resources across multiple azure tenants from a single control plane.&lt;/P&gt;
&lt;P&gt;In this demo, I will walk you through how to set up the Azure SRE Agent to manage and monitor multi-tenant resources delegated through Azure Lighthouse.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Navigate to the Azure SRE agent and select&amp;nbsp;&lt;STRONG&gt;Create agent&lt;/STRONG&gt;. Fill in the required details along with the deployment region and deploy the SRE agent.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Once the deployment is complete, hit&amp;nbsp;&lt;STRONG&gt;Set up your agent&lt;/STRONG&gt;. Select the&amp;nbsp;&lt;STRONG&gt;Azure resources&lt;/STRONG&gt;&amp;nbsp;you would like your agent to analyze, such as resource groups or subscriptions.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;This opens a popup window where you can select the subscriptions and resource groups, under the same tenant, that you would like the SRE agent to monitor and manage. So far so good 👍&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As a Managed Service Provider (MSP), you may manage multiple tenants via Azure Lighthouse, and you need the SRE agent to have access to those.&lt;/P&gt;
&lt;P&gt;To demo this, we need to set up Azure Lighthouse with the correct set of roles and configuration to delegate access to the management subscription where the centralized SRE agent is running.&lt;/P&gt;
&lt;P&gt;From the Azure portal, search for Lighthouse. Navigate to the Lighthouse home page and select&amp;nbsp;&lt;STRONG&gt;Manage your customers&lt;/STRONG&gt;. On the My customers overview, select&amp;nbsp;&lt;STRONG&gt;Create ARM Template&lt;/STRONG&gt;.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Provide a name and description, and select the subscriptions to delegate. Select&amp;nbsp;&lt;STRONG&gt;+ Add authorization&lt;/STRONG&gt;, which takes you to the Add authorization window. Select the principal type; I am selecting User for demo purposes. The pop-up window lets you &lt;STRONG&gt;Select users&lt;/STRONG&gt; from the list.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Select the checkbox next to the user to whom you want to delegate the subscription and hit &lt;STRONG&gt;Select&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;Then select the&amp;nbsp;&lt;STRONG&gt;Role&lt;/STRONG&gt;&amp;nbsp;you would like to assign the user from the managing tenant in the delegated tenant, and select&amp;nbsp;&lt;STRONG&gt;Add&lt;/STRONG&gt;. You can add multiple roles by adding additional authorizations for the selected user. This step is important: the delegation must carry the right roles for the SRE agent to add the subscription as an Azure source.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Azure SRE Agent requires an Owner or User Access Administrator RBAC role to add the subscription to its list of managed resources. If an appropriate role is not assigned, you will see an error when selecting the delegated subscriptions under the SRE agent's managed resources.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Per Lighthouse role support, the Owner role isn’t supported, and the User Access Administrator role is supported only for limited purposes. Refer to the Azure Lighthouse &lt;A href="https://docs.azure.cn/en-us/lighthouse/concepts/tenants-users-roles#role-support-for-azure-lighthouse" target="_blank"&gt;documentation&lt;/A&gt;&amp;nbsp;for additional information. If the role is not defined correctly, you might see an error stating: 🛑&lt;STRONG&gt;Failed to add Role assignment&lt;/STRONG&gt;&amp;nbsp;“The 'delegatedRoleDefinitionIds' property is required when using certain roleDefinitionIds for authorization.”&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;To allow a &lt;STRONG&gt;principalId&lt;/STRONG&gt;&amp;nbsp;to assign roles to a managed identity in the customer tenant, set its&amp;nbsp;&lt;STRONG&gt;roleDefinitionId&lt;/STRONG&gt;&amp;nbsp;to&amp;nbsp;&lt;STRONG&gt;User Access Administrator&lt;/STRONG&gt;. Download the ARM template and add the specific &lt;A href="https://docs.azure.cn/en-us/role-based-access-control/built-in-roles" target="_blank"&gt;Azure built-in roles&lt;/A&gt; that you want to grant in the&amp;nbsp;&lt;STRONG&gt;delegatedRoleDefinitionIds&lt;/STRONG&gt;&amp;nbsp;property. You can include any supported Azure built-in role except User Access Administrator or Owner. This example shows a principalId with the User Access Administrator role that can assign two built-in roles to managed identities in the customer tenant: Contributor and Log Analytics Contributor.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang=""&gt;{
    "principalId": "00000000-0000-0000-0000-000000000000",
    "principalIdDisplayName": "Policy Automation Account",
    "roleDefinitionId": "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9",
    "delegatedRoleDefinitionIds": [
         "b24988ac-6180-42a0-ab88-20f7382dd24c",
         "92aaf0da-9dab-42b6-94a3-d43ce8d16293"
    ]
}&lt;/LI-CODE&gt;
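&lt;P&gt;As a quick sanity check before deploying (illustrative only, not part of the template): Lighthouse rejects Owner and User Access Administrator inside &lt;STRONG&gt;delegatedRoleDefinitionIds&lt;/STRONG&gt;, so a small validator over your authorizations can catch the error above early. The GUIDs below are the well-known built-in role IDs.&lt;/P&gt;

```javascript
// Sketch: reject delegatedRoleDefinitionIds values that Lighthouse
// disallows (Owner and User Access Administrator).
const FORBIDDEN_DELEGATED = [
  '8e3af657-a8ff-443c-a75c-2fe8c4bcb635', // Owner
  '18d7d88d-d35e-4fb5-a5c3-7773c20a72d9', // User Access Administrator
];

function delegatedIdsValid(authorization) {
  const ids = authorization.delegatedRoleDefinitionIds || [];
  return ids.every((id) => !FORBIDDEN_DELEGATED.includes(id));
}

console.log(delegatedIdsValid({
  delegatedRoleDefinitionIds: ['b24988ac-6180-42a0-ab88-20f7382dd24c'], // Contributor
})); // true
console.log(delegatedIdsValid({
  delegatedRoleDefinitionIds: ['8e3af657-a8ff-443c-a75c-2fe8c4bcb635'], // Owner
})); // false
```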
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In addition, the SRE agent requires certain roles at the managed identity level in order to access and operate on those services. Locate the SRE agent's user-assigned managed identity and add roles to the service principal. For demo purposes, I am assigning the Reader, Monitoring Reader, and Log Analytics Reader roles.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Here is the sample ARM template used for this demo.&lt;/P&gt;
&lt;LI-CODE lang=""&gt;{
  "$schema": "https://schema.management.azure.com/schemas/2019-08-01/subscriptionDeploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "mspOfferName": {
      "type": "string",
      "metadata": {
        "description": "Specify a unique name for your offer"
      },
      "defaultValue": "lighthouse-sre-demo"
    },
    "mspOfferDescription": {
      "type": "string",
      "metadata": {
        "description": "Name of the Managed Service Provider offering"
      },
      "defaultValue": "lighthouse-sre-demo"
    }
  },
  "variables": {
    "mspRegistrationName": "[guid(parameters('mspOfferName'))]",
    "mspAssignmentName": "[guid(parameters('mspOfferName'))]",
    "managedByTenantId": "6e03bca1-4300-400d-9e80-000000000000",
    "authorizations": [
      {
        "principalId": "504adfc5-da83-47d4-8709-000000000000",
        "roleDefinitionId": "e40ec5ca-96e0-45a2-b4ff-59039f2c2b59",
        "principalIdDisplayName": "Pranab Mandal"
      },
      {
        "principalId": "504adfc5-da83-47d4-8709-000000000000",
        "roleDefinitionId": "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9",
        "delegatedRoleDefinitionIds": [
          "b24988ac-6180-42a0-ab88-20f7382dd24c",
          "92aaf0da-9dab-42b6-94a3-d43ce8d16293"
        ],
        "principalIdDisplayName": "Pranab Mandal"
      },
      {
        "principalId": "504adfc5-da83-47d4-8709-000000000000",
        "roleDefinitionId": "b24988ac-6180-42a0-ab88-20f7382dd24c",
        "principalIdDisplayName": "Pranab Mandal"
      },
      {
        "principalId": "0374ff5c-5272-49fa-878a-000000000000",
        "roleDefinitionId": "acdd72a7-3385-48ef-bd42-f606fba81ae7",
        "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu"
      },
      {
        "principalId": "0374ff5c-5272-49fa-878a-000000000000",
        "roleDefinitionId": "43d0d8ad-25c7-4714-9337-8ba259a9fe05",
        "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu"
      },
      {
        "principalId": "0374ff5c-5272-49fa-878a-000000000000",
        "roleDefinitionId": "73c42c96-874c-492b-b04d-ab87d138a893",
        "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu"
      }
    ]
  },
  "resources": [
    {
      "type": "Microsoft.ManagedServices/registrationDefinitions",
      "apiVersion": "2022-10-01",
      "name": "[variables('mspRegistrationName')]",
      "properties": {
        "registrationDefinitionName": "[parameters('mspOfferName')]",
        "description": "[parameters('mspOfferDescription')]",
        "managedByTenantId": "[variables('managedByTenantId')]",
        "authorizations": "[variables('authorizations')]"
      }
    },
    {
      "type": "Microsoft.ManagedServices/registrationAssignments",
      "apiVersion": "2022-10-01",
      "name": "[variables('mspAssignmentName')]",
      "dependsOn": [
        "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]"
      ],
      "properties": {
        "registrationDefinitionId": "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]"
      }
    }
  ],
  "outputs": {
    "mspOfferName": {
      "type": "string",
      "value": "[concat('Managed by', ' ', parameters('mspOfferName'))]"
    },
    "authorizations": {
      "type": "array",
      "value": "[variables('authorizations')]"
    }
  }
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Log in to the customer’s tenant and navigate to &lt;STRONG&gt;Service providers&lt;/STRONG&gt; in the Azure Portal. From the Service providers overview screen, select &lt;STRONG&gt;Service provider offers&lt;/STRONG&gt;&amp;nbsp;from the left navigation pane. From the top menu, select the&amp;nbsp;&lt;STRONG&gt;Add offer&lt;/STRONG&gt;&amp;nbsp;drop-down and select &lt;STRONG&gt;Add via template&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;In the&amp;nbsp;&lt;STRONG&gt;Upload Offer Template&lt;/STRONG&gt;&amp;nbsp;window, drag and drop or upload the template file created in the earlier step and hit &lt;STRONG&gt;Upload&lt;/STRONG&gt;. Once the file is uploaded, select&amp;nbsp;&lt;STRONG&gt;Review + Create&lt;/STRONG&gt;. The template takes a few minutes to deploy, after which a successful deployment page should be displayed.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Navigate to&amp;nbsp;&lt;STRONG&gt;Delegations&lt;/STRONG&gt; from the Lighthouse overview and validate that you see the delegated subscription and the assigned roles. Once the Lighthouse delegation is set up, sign in to the managing tenant and navigate to the deployed SRE agent. Open Azure resources from the top menu or via &lt;STRONG&gt;Settings&lt;/STRONG&gt; &lt;STRONG&gt;&amp;gt;&lt;/STRONG&gt;&amp;nbsp;&lt;STRONG&gt;Managed resources&lt;/STRONG&gt;. Then use&amp;nbsp;&lt;STRONG&gt;Add subscriptions&lt;/STRONG&gt; to select the customer subscriptions you need the SRE agent to manage.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Adding a subscription automatically adds the required permissions for the agent.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Once the appropriate roles are added, the subscriptions are ready for the agent to manage and monitor resources within them.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
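&lt;P&gt;As a quick sanity check, the delegation can also be listed from the command line. The sketch below is illustrative and assumes the Azure CLI is installed and signed in to the managing tenant:&lt;/P&gt;

```shell
# Quick check: list Lighthouse offers and active delegations from the managing tenant.
echo "Checking Lighthouse delegations..."
if command -v az >/dev/null 2>&1; then
  # "|| true" keeps the sketch from aborting when not signed in
  az managedservices definition list --output table || true   # published registration definitions
  az managedservices assignment list --output table || true   # active delegation assignments
fi
```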
&lt;H1&gt;Summary - Benefits&lt;/H1&gt;
&lt;P&gt;This blog post demonstrates how &lt;STRONG&gt;Azure SRE Agent&lt;/STRONG&gt; can be used to centrally monitor and manage Azure resources across multiple tenants by integrating it with &lt;STRONG&gt;Azure Lighthouse&lt;/STRONG&gt;, a common requirement for enterprises and managed service providers operating in complex, multi-tenant environments. Key benefits of this approach include:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Centralized SRE operations across multiple Azure tenants&lt;/LI&gt;
&lt;LI&gt;Secure, role-based access using delegated resource management&lt;/LI&gt;
&lt;LI&gt;Reduced operational overhead for MSPs and enterprise IT teams&lt;/LI&gt;
&lt;LI&gt;Unified visibility into resource health and reliability across customer environments&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 16 Apr 2026 04:58:53 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/managing-multi-tenant-azure-resource-with-sre-agent-and/ba-p/4511789</guid>
      <dc:creator>Pranab_Mandal</dc:creator>
      <dc:date>2026-04-16T04:58:53Z</dc:date>
    </item>
    <item>
      <title>Using an AI Agent to Troubleshoot and Fix Azure Function App Issues</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/using-an-ai-agent-to-troubleshoot-and-fix-azure-function-app/ba-p/4511781</link>
      <description>&lt;P data-start="107" data-end="116"&gt;&lt;STRONG data-start="107" data-end="114"&gt;TOC&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL data-start="117" data-end="179"&gt;
&lt;LI data-section-id="1909l2b" data-start="117" data-end="133"&gt;Preparation&lt;/LI&gt;
&lt;LI data-section-id="1nc0tbm" data-start="134" data-end="163"&gt;Troubleshooting Workflow&lt;/LI&gt;
&lt;LI data-section-id="ljknmj" data-start="164" data-end="179"&gt;Conclusion&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;Preparation&lt;/H3&gt;
&lt;P data-start="200" data-end="225"&gt;&lt;STRONG data-start="200" data-end="225"&gt;Topic: Required tools&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="226" data-end="497"&gt;
&lt;LI data-section-id="1lzpisp" data-start="226" data-end="336"&gt;AI agent: for example, Copilot CLI / OpenCode / Hermes / OpenClaw, etc. In this example, we use Copilot CLI.&lt;/LI&gt;
&lt;LI data-section-id="w8sxy7" data-start="337" data-end="388"&gt;Model access: for example, Anthropic Claude Opus.&lt;/LI&gt;
&lt;LI data-section-id="15ko278" data-start="389" data-end="497"&gt;Relevant skills: this example does not use skills, but using relevant skills can speed up troubleshooting.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="499" data-end="674"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="499" data-end="674"&gt;&lt;STRONG data-start="499" data-end="674"&gt;Topic: Compliant with your organization&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="675" data-end="867"&gt;
&lt;LI data-section-id="17c1u0u" data-start="675" data-end="790"&gt;Enterprise-level projects are sensitive, so you must confirm with the appropriate stakeholders before using them.&lt;/LI&gt;
&lt;LI data-section-id="1m0x6tj" data-start="791" data-end="867"&gt;Enterprise environments may also have strict standards for AI agent usage.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="869" data-end="899"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="869" data-end="899"&gt;&lt;STRONG data-start="869" data-end="899"&gt;Topic: Network limitations&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="900" data-end="1309"&gt;
&lt;LI data-section-id="o9qxws" data-start="900" data-end="1096"&gt;If the process involves restarting the Function App container or restarting related settings, communication between the user and the agent may be interrupted, and you will need to use /resume.&lt;/LI&gt;
&lt;LI data-section-id="uceb18" data-start="1097" data-end="1193"&gt;If the agent needs internet access for investigation, the app must have outbound connectivity.&lt;/LI&gt;
&lt;LI data-section-id="69kpej" data-start="1194" data-end="1309"&gt;If the Kudu container cannot be used because of network issues, this type of investigation cannot be carried out.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1311" data-end="1344"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="1311" data-end="1344"&gt;&lt;STRONG data-start="1311" data-end="1344"&gt;Topic: Permission limitations&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="1345" data-end="1717"&gt;
&lt;LI data-section-id="ztwgp" data-start="1345" data-end="1574"&gt;If you are using Azure blessed images, according to the official documentation, the containers use the fixed password Docker!. However, if you are using a custom container, you will need to provide an additional login method.&lt;/LI&gt;
&lt;LI data-section-id="8aguhn" data-start="1575" data-end="1717"&gt;For resources the agent does not already have permission to investigate, you will need to enable SAMI and assign the appropriate RBAC roles.&lt;/LI&gt;
&lt;/UL&gt;
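&lt;P&gt;As a sketch of that permission setup (the resource names and the &lt;CODE&gt;Reader&lt;/CODE&gt; role below are placeholders; pick the narrowest role your investigation actually needs), enabling SAMI and scoping an RBAC role with the Azure CLI looks roughly like this:&lt;/P&gt;

```shell
# Placeholder names for illustration
SUB="00000000-0000-0000-0000-000000000000"
RG="my-func-rg"
APP="my-func-app"

# RBAC scope limited to a single resource group (least privilege)
SCOPE="/subscriptions/$SUB/resourceGroups/$RG"
echo "Role scope: $SCOPE"

if command -v az >/dev/null 2>&1; then
  # Enable the system-assigned managed identity and capture its principal ID
  PRINCIPAL_ID=$(az functionapp identity assign -g "$RG" -n "$APP" --query principalId -o tsv)
  # Assign a narrowly scoped role instead of Owner
  az role assignment create --assignee "$PRINCIPAL_ID" --role "Reader" --scope "$SCOPE"
fi
echo "Review the assignment in the portal under Access control (IAM)."
```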
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;Troubleshooting Workflow&lt;/H3&gt;
&lt;P data-start="1751" data-end="1927"&gt;Let’s use a classic case where an HTTP trigger cannot be tested from the Azure Portal. As you can see, when clicking &lt;STRONG data-start="1868" data-end="1880"&gt;Test/Run&lt;/STRONG&gt; in the Azure Portal, an error message appears.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="1751" data-end="1927"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="1929" data-end="2004"&gt;At the same time, however, the home page does not show any abnormal status.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="1929" data-end="2004"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="2006" data-end="2330"&gt;At this point, we first obtain the Function App’s SAMI and assign it the &lt;STRONG data-start="2079" data-end="2088"&gt;Owner&lt;/STRONG&gt; role for the entire resource group. This is only for demonstration purposes. In practice, you should follow the principle of least privilege and scope permissions down to only the specific resources and operations that are actually required.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="2006" data-end="2330"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="2332" data-end="2430"&gt;Next, go to the Kudu container, which is the always-on maintenance container dedicated to the app.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="2332" data-end="2430"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="2432" data-end="2463"&gt;Install and enable Copilot CLI.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="2432" data-end="2463"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="2465" data-end="2518"&gt;Then we can describe the problem we are encountering.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="2465" data-end="2518"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="2520" data-end="2823"&gt;After the agent processes the issue and interacts with you further, it can generate a reasonable investigation report. In this example, it appears that the Function App’s Storage Account access key had been rotated previously, but the Function App had not updated the corresponding environment variable.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="2520" data-end="2823"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="2825" data-end="3080"&gt;Once we understand the issue, we could perform the follow-up actions ourselves. However, to demonstrate the agent’s capabilities, you can also allow it to fix the problem directly, provided that you have granted the corresponding permissions through SAMI.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="2825" data-end="3080"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="3082" data-end="3253"&gt;During the process, the container restart will disconnect the session, so you will need to return to the Kudu container and resume the previous session so it can continue.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="3082" data-end="3253"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="3255" data-end="3351"&gt;Finally, it will inform you that the issue has been fixed, and then you can validate the result.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="3255" data-end="3351"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="3353" data-end="3428"&gt;This is the validation result, and it looks like the repair was successful.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both" data-start="3353" data-end="3428"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;Conclusion&lt;/H3&gt;
&lt;P&gt;After each repair, we can even extract the experience from that case into a skill and store it in a Storage Account for future reuse. In this way, we can not only reduce the agent’s initial investigation time for similar issues, but also save tokens. This makes both time and cost management more efficient.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Apr 2026 03:11:32 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/using-an-ai-agent-to-troubleshoot-and-fix-azure-function-app/ba-p/4511781</guid>
      <dc:creator>theringe</dc:creator>
      <dc:date>2026-04-16T03:11:32Z</dc:date>
    </item>
    <item>
      <title>Gemma 4 on Azure Container Apps Serverless GPU</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/gemma-4-on-azure-container-apps-serverless-gpu/ba-p/4511671</link>
      <description>&lt;P&gt;Every prompt you send to a hosted AI service leaves your tenant. Your code, your architecture decisions, your proprietary logic — all of it crosses a network boundary you don't control. For teams building in regulated industries or handling sensitive IP, that's not a philosophical concern. It's a compliance blocker.&lt;/P&gt;
&lt;P&gt;What if you could spin up a fully private AI coding agent — running on your own GPU, in your own Azure subscription — with a single command?&lt;/P&gt;
&lt;P&gt;That's exactly what this template does. &lt;STRONG&gt;One &lt;CODE&gt;azd up&lt;/CODE&gt;, 15 minutes, and you have Google's Gemma 4 running on Azure Container Apps serverless GPU with an OpenAI-compatible API, protected by auth, and ready to power OpenCode as your terminal-based coding agent.&lt;/STRONG&gt; No data leaves your environment. No third-party model provider sees your code. Full control.&lt;/P&gt;
&lt;H2&gt;Why Self-Hosted AI on ACA?&lt;/H2&gt;
&lt;P&gt;Azure Container Apps serverless GPU gives you on-demand GPU compute without managing VMs, Kubernetes clusters, or GPU drivers. You get a container, a GPU, and an HTTPS endpoint — Azure handles the rest.&lt;/P&gt;
&lt;P&gt;Here's what makes this approach different from calling a hosted model API:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Complete data privacy&lt;/STRONG&gt; — your code and prompts never leave your Azure subscription. No PII exposure, no data leakage, no third-party processing. For teams navigating HIPAA, SOC 2, or internal IP policies, this is the simplest path to compliant AI-assisted development.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Predictable costs&lt;/STRONG&gt; — you pay for GPU compute time, not per-token. Run as many prompts as you want against your deployed model.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;No rate limits&lt;/STRONG&gt; — the GPU is yours. No throttling, no queue, no waiting for capacity.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Model flexibility&lt;/STRONG&gt; — swap models in minutes. Start with the 4B parameter Gemma 4 for fast iteration, scale up to 26B for complex reasoning tasks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;This isn't a tradeoff between convenience and privacy.&lt;/STRONG&gt; ACA serverless GPU makes self-hosted AI as easy to deploy as any SaaS endpoint — but the data stays yours.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;What You're Building&lt;/H2&gt;
&lt;img alt="What the configuration looks like to run Gemma 4 + Ollama securely on ACA serverless GPU" /&gt;
&lt;P&gt;The template deploys two containers into an Azure Container Apps environment:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Ollama + Gemma 4&lt;/STRONG&gt; — running on a serverless GPU (NVIDIA T4 or A100), serving an OpenAI-compatible API&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Nginx auth proxy&lt;/STRONG&gt; — a lightweight reverse proxy that adds basic authentication and exposes the endpoint over HTTPS&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The Ollama container pulls the Gemma 4 model on first start, so there's nothing to pre-build or upload. The nginx proxy runs on the free Consumption profile — only the Ollama container needs GPU.&lt;/P&gt;
&lt;P&gt;After deployment, you get a single HTTPS endpoint that works with &lt;CODE&gt;curl&lt;/CODE&gt;, any OpenAI-compatible SDK, or &lt;STRONG&gt;OpenCode&lt;/STRONG&gt; — a terminal-based AI coding agent that turns the whole thing into a private GitHub Copilot alternative.&lt;/P&gt;
&lt;H2&gt;Step 1: Deploy with &lt;CODE&gt;azd up&lt;/CODE&gt;&lt;/H2&gt;
&lt;P&gt;You need the &lt;A href="https://docs.microsoft.com/en-us/cli/azure/install-azure-cli" target="_blank" rel="noopener"&gt;Azure CLI&lt;/A&gt; and &lt;A href="https://learn.microsoft.com/en-us/azure/developer/azure-developer-cli/" target="_blank" rel="noopener"&gt;Azure Developer CLI (azd)&lt;/A&gt; installed.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;git clone https://github.com/simonjj/gemma4-on-aca.git
cd gemma4-on-aca
azd up&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The setup walks you through three choices:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;GPU selection&lt;/STRONG&gt; — T4 (16 GB VRAM) for smaller models, or A100 (80 GB VRAM) for the full Gemma 4 lineup.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Model selection&lt;/STRONG&gt; — depends on your GPU choice. The defaults are tuned for the best quality-to-speed ratio on each GPU tier.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Proxy password&lt;/STRONG&gt; — protects your endpoint with basic auth.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Region availability:&lt;/STRONG&gt; Serverless GPUs are available in &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/container-apps/gpu-serverless-overview#supported-regions" target="_blank" rel="noopener"&gt;various regions&lt;/A&gt; such as &lt;CODE&gt;australiaeast&lt;/CODE&gt;, &lt;CODE&gt;brazilsouth&lt;/CODE&gt;, &lt;CODE&gt;canadacentral&lt;/CODE&gt;, &lt;CODE&gt;eastus&lt;/CODE&gt;, &lt;CODE&gt;italynorth&lt;/CODE&gt;, &lt;CODE&gt;swedencentral&lt;/CODE&gt;, &lt;CODE&gt;uksouth&lt;/CODE&gt;, &lt;CODE&gt;westus&lt;/CODE&gt;, and &lt;CODE&gt;westus3&lt;/CODE&gt;. Pick one of these when prompted for location.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;That's it. Provisioning takes about 10 minutes — mostly waiting for the ACA environment to create and the model to download.&lt;/P&gt;
&lt;img alt="The deployment output" /&gt;
&lt;H2&gt;Choose Your Model&lt;/H2&gt;
&lt;P&gt;Gemma 4 ships in four sizes. The right choice depends on your GPU and workload:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;STRONG&gt;Model&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;Params&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;Architecture&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;Context&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;Modalities&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;Disk Size&lt;/STRONG&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;gemma4:e2b&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;~2B&lt;/td&gt;&lt;td&gt;Dense&lt;/td&gt;&lt;td&gt;128K&lt;/td&gt;&lt;td&gt;Text, Image, Audio&lt;/td&gt;&lt;td&gt;~7 GB&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;gemma4:e4b&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;~4B&lt;/td&gt;&lt;td&gt;Dense&lt;/td&gt;&lt;td&gt;128K&lt;/td&gt;&lt;td&gt;Text, Image, Audio&lt;/td&gt;&lt;td&gt;~10 GB&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;gemma4:26b&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;26B&lt;/td&gt;&lt;td&gt;MoE (4B active)&lt;/td&gt;&lt;td&gt;256K&lt;/td&gt;&lt;td&gt;Text, Image&lt;/td&gt;&lt;td&gt;~18 GB&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;gemma4:31b&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;31B&lt;/td&gt;&lt;td&gt;Dense&lt;/td&gt;&lt;td&gt;256K&lt;/td&gt;&lt;td&gt;Text, Image&lt;/td&gt;&lt;td&gt;~20 GB&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H3&gt;Real-World Performance on ACA&lt;/H3&gt;
&lt;P&gt;We benchmarked every model on both GPU tiers using Ollama v0.20 with Q4_K_M quantization and 32K context in Sweden Central:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;STRONG&gt;Model&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;GPU&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;Tokens/sec&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;TTFT&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;Notes&lt;/STRONG&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;gemma4:e2b&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;T4&lt;/td&gt;&lt;td&gt;~81&lt;/td&gt;&lt;td&gt;~15ms&lt;/td&gt;&lt;td&gt;Fastest on T4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;gemma4:e4b&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;T4&lt;/td&gt;&lt;td&gt;~51&lt;/td&gt;&lt;td&gt;~17ms&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Default T4 choice&lt;/STRONG&gt; — best quality/speed&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;gemma4:e2b&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;A100&lt;/td&gt;&lt;td&gt;~184&lt;/td&gt;&lt;td&gt;~9ms&lt;/td&gt;&lt;td&gt;Ultra-fast&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;gemma4:e4b&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;A100&lt;/td&gt;&lt;td&gt;~129&lt;/td&gt;&lt;td&gt;~12ms&lt;/td&gt;&lt;td&gt;Great for lighter workloads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;gemma4:26b&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;A100&lt;/td&gt;&lt;td&gt;~113&lt;/td&gt;&lt;td&gt;~14ms&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Default A100 choice&lt;/STRONG&gt; — strong reasoning&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;gemma4:31b&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;A100&lt;/td&gt;&lt;td&gt;~40&lt;/td&gt;&lt;td&gt;~30ms&lt;/td&gt;&lt;td&gt;Highest quality, slower&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;STRONG&gt;51 tokens/second on a T4 with the 4B model&lt;/STRONG&gt; is fast enough for interactive coding assistance. The 26B model on A100 delivers &lt;STRONG&gt;113 tokens/second&lt;/STRONG&gt; with noticeably better reasoning — ideal for complex refactoring, architecture questions, and multi-file changes.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;The 26B and 31B models require A100 — they don't fit in T4's 16 GB VRAM.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Step 2: Verify Your Endpoint&lt;/H2&gt;
&lt;P&gt;After &lt;CODE&gt;azd up&lt;/CODE&gt; completes, the post-provision hook prints your endpoint URL. Test it:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;curl -u admin:&amp;lt;YOUR_PASSWORD&amp;gt; \
  https://&amp;lt;YOUR_PROXY_ENDPOINT&amp;gt;/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:e4b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;You should get a JSON response with Gemma 4's reply. The endpoint is fully OpenAI-compatible — it works with any tool or SDK that speaks the OpenAI API format.&lt;/P&gt;
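&lt;P&gt;For scripting, it's handy to pull just the reply text out of that JSON. A small sketch (the response below is trimmed to the fields used; real responses carry more):&lt;/P&gt;

```shell
# Trimmed sample of an OpenAI-compatible chat completion response
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Hello! How can I help?"}}]}'

# Extract just the assistant's reply (python3 used here; jq works equally well)
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# prints: Hello! How can I help?
```

&lt;P&gt;In a real pipeline, replace the sample variable with the output of the &lt;CODE&gt;curl&lt;/CODE&gt; call.&lt;/P&gt;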
&lt;H2&gt;Step 3: Connect OpenCode&lt;/H2&gt;
&lt;P&gt;Here's where it gets powerful. &lt;A href="https://opencode.ai" target="_blank" rel="noopener"&gt;OpenCode&lt;/A&gt; is a terminal-based AI coding agent — think GitHub Copilot, but running in your terminal and pointing at whatever model backend you choose.&lt;/P&gt;
&lt;P&gt;The &lt;CODE&gt;azd up&lt;/CODE&gt; post-provision hook automatically generates an &lt;CODE&gt;opencode.json&lt;/CODE&gt; in your project directory with the correct endpoint and credentials. If you need to create it manually:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "gemma4-aca": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Gemma 4 on ACA",
      "options": {
        "baseURL": "https://&amp;lt;YOUR_PROXY_ENDPOINT&amp;gt;/v1",
        "headers": {
          "Authorization": "Basic &amp;lt;BASE64_OF_admin:YOUR_PASSWORD&amp;gt;"
        }
      },
      "models": {
        "gemma4:e4b": {
          "name": "Gemma 4 e4b (4B)"
        }
      }
    }
  }
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Generate the Base64 value: &lt;CODE&gt;echo -n "admin:YOUR_PASSWORD" | base64&lt;/CODE&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
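&lt;P&gt;To avoid shell-quoting mistakes, the header value can be built and sanity-checked in one short script (the password here is a placeholder):&lt;/P&gt;

```shell
PASSWORD='YOUR_PASSWORD'   # placeholder -- use the proxy password you chose during azd up

# printf avoids the trailing newline that echo would sneak into the encoding
AUTH_B64=$(printf 'admin:%s' "$PASSWORD" | base64)
echo "Authorization: Basic $AUTH_B64"
```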
&lt;P&gt;Now run it:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;opencode run -m "gemma4-aca/gemma4:e4b" "Write a binary search in Rust"&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;That command sends your prompt to Gemma 4 running on your ACA GPU, and streams the response back to your terminal. &lt;STRONG&gt;Every token is generated on your infrastructure. Nothing leaves your subscription.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;For interactive sessions, launch the TUI:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;opencode&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Select your model with &lt;CODE&gt;/models&lt;/CODE&gt;, pick Gemma 4, and start coding. OpenCode supports file editing, code generation, refactoring, and multi-turn conversations — all powered by your private Gemma 4 instance.&lt;/P&gt;
&lt;H2&gt;The Privacy Case&lt;/H2&gt;
&lt;P&gt;This matters most for teams that can't send code to external APIs:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;HIPAA-regulated healthcare apps&lt;/STRONG&gt; — patient data in code, schema definitions, and test fixtures stays in your Azure subscription&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Financial services&lt;/STRONG&gt; — proprietary trading algorithms and risk models never leave your network boundary&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Defense and government&lt;/STRONG&gt; — classified or CUI-adjacent codebases get AI assistance without external data processing agreements&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Startups with sensitive IP&lt;/STRONG&gt; — your secret sauce stays secret, even while you use AI to build faster&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;With ACA serverless GPU, you're not running a VM or managing a Kubernetes cluster to get this privacy. It's a managed container with a GPU attached. Azure handles the infrastructure, you own the data boundary.&lt;/P&gt;
&lt;H2&gt;Clean Up&lt;/H2&gt;
&lt;P&gt;When you're done:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;azd down&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This tears down all Azure resources. Since ACA serverless GPU bills only while your containers are running, you can also scale to zero replicas to pause costs without destroying the environment.&lt;/P&gt;
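&lt;P&gt;To pause without tearing down, scale the GPU container app to zero replicas. A sketch (resource names are placeholders; the actual app name comes from the &lt;CODE&gt;azd&lt;/CODE&gt; deployment output):&lt;/P&gt;

```shell
# Placeholder names for illustration
RG="my-rg"
APP="gemma-ollama"

echo "Scaling $APP to zero replicas"
if command -v az >/dev/null 2>&1; then
  # Min replicas 0 lets the app scale to zero between requests, pausing GPU billing
  az containerapp update -g "$RG" -n "$APP" --min-replicas 0 || true
fi
```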
&lt;H2&gt;Get Started&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;📖 &lt;A href="https://github.com/simonjj/gemma4-on-aca" target="_blank" rel="noopener"&gt;&lt;STRONG&gt;gemma4-on-aca on GitHub&lt;/STRONG&gt;&lt;/A&gt; — clone it, run &lt;CODE&gt;azd up&lt;/CODE&gt;, and you're live&lt;/LI&gt;
&lt;LI&gt;🤖 &lt;A href="https://opencode.ai" target="_blank" rel="noopener"&gt;&lt;STRONG&gt;OpenCode&lt;/STRONG&gt;&lt;/A&gt; — the terminal AI agent that connects to your Gemma 4 endpoint&lt;/LI&gt;
&lt;LI&gt;📌 &lt;A href="https://ai.google.dev/gemma/docs/core" target="_blank" rel="noopener"&gt;&lt;STRONG&gt;Gemma 4 docs&lt;/STRONG&gt;&lt;/A&gt; — model architecture and capabilities&lt;/LI&gt;
&lt;LI&gt;📌 &lt;A href="https://learn.microsoft.com/en-us/azure/container-apps/workload-profiles-overview#gpu-workload-profiles" target="_blank" rel="noopener"&gt;&lt;STRONG&gt;ACA serverless GPU&lt;/STRONG&gt;&lt;/A&gt; — GPU regions and workload profile details&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Wed, 15 Apr 2026 16:20:22 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/gemma-4-on-azure-container-apps-serverless-gpu/ba-p/4511671</guid>
      <dc:creator>simonjj</dc:creator>
      <dc:date>2026-04-15T16:20:22Z</dc:date>
    </item>
    <item>
      <title>Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/govern-ai-agents-on-app-service-with-the-microsoft-agent/ba-p/4510962</link>
      <description>&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Part 3 of 3 — Multi-Agent AI on Azure App Service&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;In &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/build-multi-agent-ai-apps-on-azure-app-service-with-microsoft-agent-framework-1-/4510017" data-lia-auto-title="Blog 1" data-lia-auto-title-active="0" target="_blank"&gt;Blog 1&lt;/A&gt;, we built a multi-agent travel planner with Microsoft Agent Framework 1.0 on App Service. In &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/monitor-ai-agents-on-app-service-with-opentelemetry-and-the-new-application-insi/4510023" data-lia-auto-title="Blog 2" data-lia-auto-title-active="0" target="_blank"&gt;Blog 2&lt;/A&gt;, we added observability with OpenTelemetry and the new Application Insights Agents view. Now in Part 3, we secure those agents for production with the &lt;STRONG&gt;Microsoft Agent Governance Toolkit&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;This post assumes you've followed the guidance in&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/build-multi-agent-ai-apps-on-azure-app-service-with-microsoft-agent-framework-1-/4510017" data-lia-auto-title="Blog 1" data-lia-auto-title-active="0" target="_blank"&gt;Blog 1&lt;/A&gt; to deploy the multi-agent travel planner to Azure App Service. If you haven't deployed the app yet, start there first.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;The governance gap&lt;/H2&gt;
&lt;P&gt;Our travel planner works. It's observable. But here's the question I'm hearing from customers: &lt;EM&gt;"How do I make sure my agents don't do something they shouldn't?"&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;It's a fair question. Our six agents — Coordinator, Currency Converter, Weather Advisor, Local Knowledge, Itinerary Planner, and Budget Optimizer — can call external APIs, process user data, and make autonomous decisions. In a demo, that's impressive. In production, that's a risk surface.&lt;/P&gt;
&lt;P&gt;Consider what can go wrong with ungoverned agents:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Unauthorized API calls&lt;/STRONG&gt; — An agent calls an external API it was never intended to use, leaking data or incurring costs&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Sensitive data exposure&lt;/STRONG&gt; — An agent passes PII to a third-party service without consent controls&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Runaway token spend&lt;/STRONG&gt; — A recursive agent loop burns through your OpenAI budget in minutes&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Tool misuse&lt;/STRONG&gt; — A prompt injection tricks an agent into executing a tool it shouldn't&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Cascading failures&lt;/STRONG&gt; — One agent's error propagates through the entire multi-agent workflow&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;These aren't theoretical. In December 2025, &lt;A class="lia-external-url" href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/" target="_blank"&gt;OWASP published the Top 10 for Agentic Applications&lt;/A&gt; — the first formal taxonomy of risks specific to autonomous AI agents, including goal hijacking, tool misuse, identity abuse, memory poisoning, and rogue agents. Regulators are paying attention too: the &lt;STRONG&gt;EU AI Act's&lt;/STRONG&gt; high-risk AI obligations take effect in &lt;STRONG&gt;August 2026&lt;/STRONG&gt;, and the &lt;STRONG&gt;Colorado AI Act&lt;/STRONG&gt; becomes enforceable in &lt;STRONG&gt;June 2026&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;The bottom line: if you're running agents in production, you need governance. Not eventually — now.&lt;/P&gt;
&lt;H2&gt;What the Agent Governance Toolkit does&lt;/H2&gt;
&lt;P&gt;The &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit" target="_blank"&gt;Agent Governance Toolkit&lt;/A&gt; is an open-source project (MIT license) from Microsoft that brings runtime security governance to autonomous AI agents. It's the first toolkit to address &lt;STRONG&gt;all 10 OWASP agentic AI risks&lt;/STRONG&gt; with deterministic, sub-millisecond policy enforcement.&lt;/P&gt;
&lt;P&gt;The toolkit is organized into &lt;STRONG&gt;7 packages&lt;/STRONG&gt;:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr class="lia-background-color-custom-f0f0f0"&gt;&lt;th&gt;Package&lt;/th&gt;&lt;th&gt;What it does&lt;/th&gt;&lt;th&gt;Think of it as...&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Agent OS&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Stateless policy engine, intercepts every action before execution (&amp;lt;0.1ms p99)&lt;/td&gt;&lt;td&gt;The kernel for AI agents&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Agent Mesh&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Cryptographic identity (DIDs), inter-agent trust protocol, dynamic trust scoring&lt;/td&gt;&lt;td&gt;mTLS for agents&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Agent Runtime&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Execution rings (like CPU privilege levels), saga orchestration, kill switch&lt;/td&gt;&lt;td&gt;Process isolation for agents&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Agent SRE&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;SLOs, error budgets, circuit breakers, chaos engineering&lt;/td&gt;&lt;td&gt;SRE practices for agents&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Agent Compliance&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Automated governance verification, regulatory mapping (EU AI Act, HIPAA, SOC2)&lt;/td&gt;&lt;td&gt;Compliance-as-code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Agent Marketplace&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Plugin lifecycle management, Ed25519 signing, supply-chain security&lt;/td&gt;&lt;td&gt;Package manager security&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Agent Lightning&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;RL training governance with policy-enforced runners&lt;/td&gt;&lt;td&gt;Safe training guardrails&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 
33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;!-- SCREENSHOT: Agent Governance Toolkit GitHub repo page showing the 7 packages --&gt;
&lt;P&gt;The toolkit is available in &lt;STRONG&gt;Python, TypeScript, Rust, Go, and .NET&lt;/STRONG&gt;. It's framework-agnostic — it works with Microsoft Agent Framework (MAF), LangChain, CrewAI, Google ADK, and more. For our ASP.NET Core travel planner, we'll use the &lt;STRONG&gt;.NET SDK&lt;/STRONG&gt; via NuGet (&lt;CODE&gt;Microsoft.AgentGovernance&lt;/CODE&gt;).&lt;/P&gt;
&lt;P&gt;For this blog, we're focusing on &lt;STRONG&gt;three packages&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Agent OS&lt;/STRONG&gt; — the policy engine that intercepts and evaluates every agent action&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Agent Compliance&lt;/STRONG&gt; — regulatory mapping and audit trail generation&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Agent SRE&lt;/STRONG&gt; — SLOs and circuit breakers for agent reliability&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;How easy it was to add governance&lt;/H2&gt;
&lt;P&gt;Here's the part that surprised me. I expected adding governance to a production agent system to be a multi-hour effort — new infrastructure, complex configuration, extensive refactoring. Instead, it took about &lt;STRONG&gt;30 minutes&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;Here's exactly what we changed:&lt;/P&gt;
&lt;H3&gt;Step 1: Add NuGet packages&lt;/H3&gt;
&lt;P&gt;One new package added to &lt;CODE&gt;TravelPlanner.Shared.csproj&lt;/CODE&gt;, alongside the two existing references:&lt;/P&gt;
&lt;LI-CODE lang="xml"&gt;&amp;lt;itemgroup&amp;gt; &amp;lt;!-- Existing packages --&amp;gt; &amp;lt;packagereference include="Azure.Monitor.OpenTelemetry.AspNetCore" version="1.3.0"&amp;gt; &amp;lt;packagereference include="Microsoft.Agents.AI" version="1.0.0"&amp;gt; &amp;lt;!-- NEW: Agent Governance Toolkit (single package, all features included) --&amp;gt; &amp;lt;packagereference include="Microsoft.AgentGovernance" version="3.0.2"&amp;gt; &amp;lt;/packagereference&amp;gt;&amp;lt;/packagereference&amp;gt;&amp;lt;/packagereference&amp;gt;&amp;lt;/itemgroup&amp;gt;&lt;/LI-CODE&gt;
&lt;H3&gt;Step 2: Create the policy file&lt;/H3&gt;
&lt;P&gt;One new file: &lt;CODE&gt;governance-policies.yaml&lt;/CODE&gt; in the project root. This is where all your governance rules live:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;apiVersion: governance.toolkit/v1 name: travel-planner-governance description: Policy enforcement for the multi-agent travel planner on App Service scope: global defaultAction: deny rules: - name: allow-currency-conversion condition: "tool == 'ConvertCurrency'" action: allow priority: 10 description: Allow Currency Converter agent to call Frankfurter exchange rate API - name: allow-weather-forecast condition: "tool == 'GetWeatherForecast'" action: allow priority: 10 description: Allow Weather Advisor agent to call NWS forecast API - name: allow-weather-alerts condition: "tool == 'GetWeatherAlerts'" action: allow priority: 10 description: Allow Weather Advisor agent to check NWS weather alerts&lt;/LI-CODE&gt;&lt;!-- SCREENSHOT: The complete governance-policies.yaml file --&gt;
&lt;H3&gt;Step 3: One line in BaseAgent.cs&lt;/H3&gt;
&lt;P&gt;This is the moment. Here's our &lt;CODE&gt;BaseAgent.cs&lt;/CODE&gt; &lt;STRONG&gt;before&lt;/STRONG&gt;:&lt;/P&gt;
&lt;PRE class="language-csharp" tabindex="0" contenteditable="false" data-lia-code-value="Agent = new ChatClientAgent(
    chatClient, instructions: Instructions, 
    name: AgentName, description: Description)
    .AsBuilder()
    .UseOpenTelemetry(sourceName: AgentName)
    .Build();
"&gt;&lt;CODE&gt;Agent = new ChatClientAgent(
    chatClient, instructions: Instructions, 
    name: AgentName, description: Description)
    .AsBuilder()
    .UseOpenTelemetry(sourceName: AgentName)
    .Build();
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;And &lt;STRONG&gt;after&lt;/STRONG&gt;:&lt;/P&gt;
&lt;PRE class="language-csharp" tabindex="0" contenteditable="false" data-lia-code-value="var kernel = serviceProvider.GetService&amp;lt;GovernanceKernel&amp;gt;();
if (kernel is not null)
    builder.UseGovernance(kernel, AgentName);

Agent = builder.Build();
"&gt;&lt;CODE&gt;var kernel = serviceProvider.GetService&amp;lt;GovernanceKernel&amp;gt;();
if (kernel is not null)
    builder.UseGovernance(kernel, AgentName);

Agent = builder.Build();
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;STRONG&gt;One line of intent, two lines of null-safety.&lt;/STRONG&gt; The &lt;CODE&gt;.UseGovernance(kernel, AgentName)&lt;/CODE&gt; call intercepts every tool/function invocation in the agent's pipeline, evaluating it against the loaded policies before execution. If the &lt;CODE&gt;GovernanceKernel&lt;/CODE&gt; isn't registered (governance disabled), agents work exactly as before — no crash, no code change needed.&lt;/P&gt;
&lt;!-- SCREENSHOT: Side-by-side diff of BaseAgent.cs before/after governance (highlight the one-line change) --&gt;
&lt;P&gt;Here's the full updated constructor using &lt;CODE&gt;IServiceProvider&lt;/CODE&gt; to optionally resolve governance:&lt;/P&gt;
&lt;PRE class="language-csharp" tabindex="0" contenteditable="false" data-lia-code-value="using AgentGovernance;
using Microsoft.Extensions.DependencyInjection;

public abstract class BaseAgent : IAgent
{
    protected readonly ILogger Logger;
    protected readonly AgentOptions Options;
    protected readonly AIAgent Agent;

    // Constructor for simple agents without tools
    protected BaseAgent(
        ILogger logger,
        IOptions&amp;lt;AgentOptions&amp;gt; options,
        IChatClient chatClient,
        IServiceProvider serviceProvider)
    {
        Logger = logger;
        Options = options.Value;

        var builder = new ChatClientAgent(
            chatClient, instructions: Instructions,
            name: AgentName, description: Description)
            .AsBuilder()
            .UseOpenTelemetry(sourceName: AgentName);

        var kernel = serviceProvider.GetService&amp;lt;GovernanceKernel&amp;gt;();
        if (kernel is not null)
            builder.UseGovernance(kernel, AgentName);

        Agent = builder.Build();
    }

    // Constructor for agents with tools
    protected BaseAgent(
        ILogger logger,
        IOptions&amp;lt;AgentOptions&amp;gt; options,
        IChatClient chatClient,
        ChatOptions chatOptions,
        IServiceProvider serviceProvider)
    {
        Logger = logger;
        Options = options.Value;

        var builder = new ChatClientAgent(
            chatClient, instructions: Instructions,
            name: AgentName, description: Description,
            tools: chatOptions.Tools?.ToList())
            .AsBuilder()
            .UseOpenTelemetry(sourceName: AgentName);

        var kernel = serviceProvider.GetService&amp;lt;GovernanceKernel&amp;gt;();
        if (kernel is not null)
            builder.UseGovernance(kernel, AgentName);

        Agent = builder.Build();
    }
    
    // ... rest unchanged
}
"&gt;&lt;CODE&gt;using AgentGovernance;
using Microsoft.Extensions.DependencyInjection;

public abstract class BaseAgent : IAgent
{
    protected readonly ILogger Logger;
    protected readonly AgentOptions Options;
    protected readonly AIAgent Agent;

    // Constructor for simple agents without tools
    protected BaseAgent(
        ILogger logger,
        IOptions&amp;lt;AgentOptions&amp;gt; options,
        IChatClient chatClient,
        IServiceProvider serviceProvider)
    {
        Logger = logger;
        Options = options.Value;

        var builder = new ChatClientAgent(
            chatClient, instructions: Instructions,
            name: AgentName, description: Description)
            .AsBuilder()
            .UseOpenTelemetry(sourceName: AgentName);

        var kernel = serviceProvider.GetService&amp;lt;GovernanceKernel&amp;gt;();
        if (kernel is not null)
            builder.UseGovernance(kernel, AgentName);

        Agent = builder.Build();
    }

    // Constructor for agents with tools
    protected BaseAgent(
        ILogger logger,
        IOptions&amp;lt;AgentOptions&amp;gt; options,
        IChatClient chatClient,
        ChatOptions chatOptions,
        IServiceProvider serviceProvider)
    {
        Logger = logger;
        Options = options.Value;

        var builder = new ChatClientAgent(
            chatClient, instructions: Instructions,
            name: AgentName, description: Description,
            tools: chatOptions.Tools?.ToList())
            .AsBuilder()
            .UseOpenTelemetry(sourceName: AgentName);

        var kernel = serviceProvider.GetService&amp;lt;GovernanceKernel&amp;gt;();
        if (kernel is not null)
            builder.UseGovernance(kernel, AgentName);

        Agent = builder.Build();
    }
    
    // ... rest unchanged
}
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;H3&gt;Step 4: DI registrations in Program.cs&lt;/H3&gt;
&lt;P&gt;A few lines to wire up governance in the dependency injection container:&lt;/P&gt;
&lt;PRE class="language-csharp" tabindex="0" contenteditable="false" data-lia-code-value="using AgentGovernance;

// ... existing builder setup ...

// Configure OpenTelemetry with Azure Monitor (existing — from Blog 2)
builder.Services.AddOpenTelemetry().UseAzureMonitor();

// NEW: Configure Agent Governance Toolkit
// Load policy from YAML, register as singleton. Agents resolve via IServiceProvider.
var policyPath = Path.Combine(builder.Environment.ContentRootPath, &amp;quot;governance-policies.yaml&amp;quot;);
if (File.Exists(policyPath))
{
    try
    {
        var yaml = File.ReadAllText(policyPath);
        var kernel = new GovernanceKernel(new GovernanceOptions 
        { 
            EnableAudit = true, 
            EnableMetrics = true 
        });
        kernel.LoadPolicyFromYaml(yaml);
        builder.Services.AddSingleton(kernel);
        Console.WriteLine($&amp;quot;[Governance] Loaded policies from {policyPath}&amp;quot;);
    }
    catch (Exception ex)
    {
        Console.WriteLine($&amp;quot;[Governance] Failed to load: {ex.Message}. Running without governance.&amp;quot;);
    }
}
"&gt;&lt;CODE&gt;using AgentGovernance;

// ... existing builder setup ...

// Configure OpenTelemetry with Azure Monitor (existing — from Blog 2)
builder.Services.AddOpenTelemetry().UseAzureMonitor();

// NEW: Configure Agent Governance Toolkit
// Load policy from YAML, register as singleton. Agents resolve via IServiceProvider.
var policyPath = Path.Combine(builder.Environment.ContentRootPath, "governance-policies.yaml");
if (File.Exists(policyPath))
{
    try
    {
        var yaml = File.ReadAllText(policyPath);
        var kernel = new GovernanceKernel(new GovernanceOptions 
        { 
            EnableAudit = true, 
            EnableMetrics = true 
        });
        kernel.LoadPolicyFromYaml(yaml);
        builder.Services.AddSingleton(kernel);
        Console.WriteLine($"[Governance] Loaded policies from {policyPath}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"[Governance] Failed to load: {ex.Message}. Running without governance.");
    }
}
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;STRONG&gt;That's it. Your agents are now governed.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Let me repeat that because it's the core message of this blog: we added production governance to a six-agent system by adding one NuGet package, creating one YAML policy file, adding a few lines to our base agent class, and registering the governance kernel in DI. No new infrastructure. No complex rewiring. No multi-sprint project. If you followed Blog 1 and Blog 2, you can do this in 30 minutes.&lt;/P&gt;
&lt;H2&gt;Policy flexibility deep-dive&lt;/H2&gt;
&lt;P&gt;The YAML policy language is intentionally simple to start with, but it supports real complexity when you need it. Let's walk through what each policy in our file does.&lt;/P&gt;
&lt;H3&gt;API allowlists and blocklists&lt;/H3&gt;
&lt;P&gt;Our travel planner calls two external APIs: Frankfurter (currency exchange) and the National Weather Service. The &lt;CODE&gt;defaultAction: deny&lt;/CODE&gt; combined with explicit &lt;CODE&gt;allow&lt;/CODE&gt; rules ensures agents can &lt;EM&gt;only&lt;/EM&gt; call these approved tools. If an agent attempts to call any other function — whether through a prompt injection or a bug — the call is blocked before it executes:&lt;/P&gt;
&lt;PRE class="language-yaml" tabindex="0" contenteditable="false" data-lia-code-value="defaultAction: deny
rules:
  - name: allow-currency-conversion
    condition: &amp;quot;tool == 'ConvertCurrency'&amp;quot;
    action: allow
    priority: 10
  - name: allow-weather-forecast
    condition: &amp;quot;tool == 'GetWeatherForecast'&amp;quot;
    action: allow
    priority: 10
"&gt;&lt;CODE&gt;defaultAction: deny
rules:
  - name: allow-currency-conversion
    condition: "tool == 'ConvertCurrency'"
    action: allow
    priority: 10
  - name: allow-weather-forecast
    condition: "tool == 'GetWeatherForecast'"
    action: allow
    priority: 10
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;When a blocked call happens, you'll see output like this in your logs:&lt;/P&gt;
&lt;PRE class="language-text" tabindex="0" contenteditable="false" data-lia-code-value="[Governance] Tool call 'DeleteDatabase' blocked for agent 'LocalKnowledgeAgent': 
  No matching rules; default action is deny.
"&gt;&lt;CODE&gt;[Governance] Tool call 'DeleteDatabase' blocked for agent 'LocalKnowledgeAgent': 
  No matching rules; default action is deny.
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;!-- SCREENSHOT: Terminal output showing a blocked tool call with policy violation message --&gt;
&lt;H3&gt;Condition language&lt;/H3&gt;
&lt;P&gt;The &lt;CODE&gt;condition&lt;/CODE&gt; field supports equality checks, pattern matching, and boolean logic. You can match on tool name, agent ID, or any key in the evaluation context:&lt;/P&gt;
&lt;PRE class="language-yaml" tabindex="0" contenteditable="false" data-lia-code-value="# Match a specific tool
condition: &amp;quot;tool == 'ConvertCurrency'&amp;quot;

# Match multiple tools with OR
condition: &amp;quot;tool == 'GetWeatherForecast' or tool == 'GetWeatherAlerts'&amp;quot;

# Match by agent
condition: &amp;quot;agent == 'CurrencyConverterAgent' and tool == 'ConvertCurrency'&amp;quot;
"&gt;&lt;CODE&gt;# Match a specific tool
condition: "tool == 'ConvertCurrency'"

# Match multiple tools with OR
condition: "tool == 'GetWeatherForecast' or tool == 'GetWeatherAlerts'"

# Match by agent
condition: "agent == 'CurrencyConverterAgent' and tool == 'ConvertCurrency'"
&lt;/CODE&gt;&lt;/PRE&gt;
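&lt;P&gt;To make the matching semantics concrete, here's a minimal Python sketch of how conditions like these resolve against an evaluation context. It's illustrative only (the toolkit's actual parser is richer), but it shows equality checks combined with &lt;CODE&gt;and&lt;/CODE&gt;/&lt;CODE&gt;or&lt;/CODE&gt; in action:&lt;/P&gt;

```python
# Illustrative only: a tiny evaluator for the flat condition grammar shown
# above (equality checks joined by 'and'/'or'). Not the toolkit's parser.

def eval_condition(condition: str, context: dict) -> bool:
    """Evaluate e.g. "agent == 'X' and tool == 'Y'" against a context dict."""
    def eval_clause(clause: str) -> bool:
        key, value = clause.split("==")
        return context.get(key.strip()) == value.strip().strip("'")

    # 'or' binds looser than 'and', mirroring typical expression grammars
    return any(
        all(eval_clause(c) for c in term.split(" and "))
        for term in condition.split(" or ")
    )

ctx = {"agent": "CurrencyConverterAgent", "tool": "ConvertCurrency"}
print(eval_condition("tool == 'ConvertCurrency'", ctx))                                   # True
print(eval_condition("tool == 'GetWeatherForecast' or tool == 'GetWeatherAlerts'", ctx))  # False
```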
&lt;H3&gt;Priority and conflict resolution&lt;/H3&gt;
&lt;P&gt;When multiple rules match, the toolkit evaluates by priority (higher number = higher priority). A deny rule at priority 100 will override an allow rule at priority 10. This lets you layer broad allows with specific denies:&lt;/P&gt;
&lt;PRE class="language-yaml" tabindex="0" contenteditable="false" data-lia-code-value="rules:
  - name: allow-all-weather-tools
    condition: &amp;quot;tool == 'GetWeatherForecast' or tool == 'GetWeatherAlerts'&amp;quot;
    action: allow
    priority: 10
  - name: block-during-maintenance
    condition: &amp;quot;tool == 'GetWeatherForecast'&amp;quot;
    action: deny
    priority: 100
    description: Temporarily block NWS calls during API maintenance
"&gt;&lt;CODE&gt;rules:
  - name: allow-all-weather-tools
    condition: "tool == 'GetWeatherForecast' or tool == 'GetWeatherAlerts'"
    action: allow
    priority: 10
  - name: block-during-maintenance
    condition: "tool == 'GetWeatherForecast'"
    action: deny
    priority: 100
    description: Temporarily block NWS calls during API maintenance
&lt;/CODE&gt;&lt;/PRE&gt;
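&lt;P&gt;The resolution logic itself is simple enough to sketch in a few lines of Python. This isn't the toolkit's code (the types here are hypothetical, and conditions are simplified to tool sets), just the semantics made concrete: among matching rules the highest priority wins, and with no match &lt;CODE&gt;defaultAction&lt;/CODE&gt; applies:&lt;/P&gt;

```python
# Illustrative sketch of priority-based conflict resolution. Hypothetical
# types: rule conditions are reduced to the set of tools they match.
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    tools: set      # tools this rule's condition matches
    action: str     # "allow" or "deny"
    priority: int   # higher number = higher priority

def decide(tool: str, rules: list, default_action: str = "deny") -> str:
    matching = [r for r in rules if tool in r.tools]
    if not matching:
        return default_action
    return max(matching, key=lambda r: r.priority).action

rules = [
    Rule("allow-all-weather-tools", {"GetWeatherForecast", "GetWeatherAlerts"}, "allow", 10),
    Rule("block-during-maintenance", {"GetWeatherForecast"}, "deny", 100),
]

print(decide("GetWeatherAlerts", rules))    # allow (only the priority-10 rule matches)
print(decide("GetWeatherForecast", rules))  # deny  (priority 100 overrides priority 10)
print(decide("DeleteDatabase", rules))      # deny  (no match, so defaultAction applies)
```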
&lt;H3&gt;Advanced: OPA Rego and Cedar&lt;/H3&gt;
&lt;P&gt;The YAML policy language handles most scenarios, but for teams with advanced needs, the toolkit also supports &lt;STRONG&gt;OPA Rego&lt;/STRONG&gt; and &lt;STRONG&gt;Cedar&lt;/STRONG&gt; policy languages. You can mix them — use YAML for simple rules and Rego for complex conditional logic:&lt;/P&gt;
&lt;PRE class="language-rego" tabindex="0" contenteditable="false" data-lia-code-value="# policies/advanced.rego — Example: time-based access control
package travel_planner.governance

default allow_tool_call = false

allow_tool_call {
    input.agent == &amp;quot;CurrencyConverterAgent&amp;quot;
    input.tool == &amp;quot;get_exchange_rate&amp;quot;
    time.weekday(time.now_ns()) != &amp;quot;Sunday&amp;quot;  # Markets closed
}
"&gt;&lt;CODE&gt;# policies/advanced.rego — Example: time-based access control
package travel_planner.governance

default allow_tool_call = false

allow_tool_call {
    input.agent == "CurrencyConverterAgent"
    input.tool == "get_exchange_rate"
    time.weekday(time.now_ns()) != "Sunday"  # Markets closed
}
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Start simple with YAML. Add complexity only when you need it.&lt;/P&gt;
&lt;H2&gt;Why App Service for governed agent workloads&lt;/H2&gt;
&lt;P&gt;You might be wondering: why does the hosting platform matter for governance? It matters a lot. The governance toolkit handles the &lt;EM&gt;application-level&lt;/EM&gt; policies, but a production agent system also needs &lt;EM&gt;platform-level&lt;/EM&gt; security, networking, identity, and deployment controls. App Service gives you these out of the box.&lt;/P&gt;
&lt;H3&gt;Managed Identity&lt;/H3&gt;
&lt;P&gt;Governance policies enforce &lt;EM&gt;what&lt;/EM&gt; agents can access. Managed Identity handles &lt;EM&gt;how&lt;/EM&gt; they authenticate — without secrets to manage, rotate, or leak. Our travel planner already uses &lt;CODE&gt;DefaultAzureCredential&lt;/CODE&gt; for Azure OpenAI, Cosmos DB, and Service Bus. Governance layers on top of this identity foundation.&lt;/P&gt;
&lt;H3&gt;VNet Integration + Private Endpoints&lt;/H3&gt;
&lt;P&gt;The governance toolkit enforces API allowlists at the application level. App Service's &lt;STRONG&gt;VNet integration and private endpoints&lt;/STRONG&gt; enforce network boundaries at the infrastructure level. This is defense in depth: even if a governance policy is misconfigured, the network layer prevents unauthorized egress. Your agents can only reach the networks you've explicitly allowed.&lt;/P&gt;
&lt;H3&gt;Easy Auth&lt;/H3&gt;
&lt;P&gt;App Service's built-in authentication (Easy Auth) protects your agent APIs without custom code. Before a request even reaches your governance engine, App Service has already validated the caller's identity. No custom auth middleware. No JWT parsing. Just toggle it on.&lt;/P&gt;
&lt;H3&gt;Deployment Slots&lt;/H3&gt;
&lt;P&gt;This is underrated for governance. With deployment slots, you can test new governance policies in a &lt;STRONG&gt;staging slot&lt;/STRONG&gt; before swapping to production. Deploy updated &lt;CODE&gt;governance-policies.yaml&lt;/CODE&gt; to staging, run your test suite, verify the policies work as expected, and &lt;EM&gt;then&lt;/EM&gt; swap. Zero-downtime policy updates with full rollback capability.&lt;/P&gt;
&lt;H3&gt;App Insights integration&lt;/H3&gt;
&lt;P&gt;Governance audit events flow into the &lt;STRONG&gt;same Application Insights&lt;/STRONG&gt; instance we configured in Blog 2. This means your governance decisions appear alongside your OTel traces in the Agents view. One pane of glass for agent behavior &lt;EM&gt;and&lt;/EM&gt; governance enforcement.&lt;/P&gt;
&lt;H3&gt;Always On + WebJobs&lt;/H3&gt;
&lt;P&gt;Our travel planner uses WebJobs for long-running agent workflows. With App Service's Always On feature, those workflows stay warm, and governance is continuous — no cold-start gaps where agents run unmonitored.&lt;/P&gt;
&lt;H3&gt;azd deployment&lt;/H3&gt;
&lt;P&gt;One command deploys the full governed stack — application code, governance policies, infrastructure, and monitoring:&lt;/P&gt;
&lt;PRE class="language-bash" tabindex="0" contenteditable="false" data-lia-code-value="azd up
"&gt;&lt;CODE&gt;azd up
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;App Service gives you the enterprise production features governance needs — identity, networking, observability, safe deployment — out of the box. The governance toolkit handles agent-level policy enforcement; App Service handles platform-level security. Together, they're a complete governed agent platform.&lt;/P&gt;
&lt;H2&gt;Governance audit events in App Insights&lt;/H2&gt;
&lt;P&gt;In Blog 2, we set up OpenTelemetry and the Application Insights Agents view to monitor agent behavior. With the governance toolkit, those same traces now include &lt;STRONG&gt;governance audit events&lt;/STRONG&gt; — every policy decision is recorded as a span attribute on the agent's trace.&lt;/P&gt;
&lt;!-- SCREENSHOT: App Insights Agents view showing governance audit events alongside OTel traces --&gt;
&lt;P&gt;When you open a trace in the Agents view, you'll see governance events inline:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Policy: api-allowlist → ALLOWED&lt;/STRONG&gt; — CurrencyConverterAgent called Frankfurter API, permitted&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Policy: token-budget → ALLOWED&lt;/STRONG&gt; — Request used 3,200 tokens, within per-request limit of 8,000&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Policy: rate-limit → THROTTLED&lt;/STRONG&gt; — WeatherAdvisorAgent exceeded 60 calls/min, request delayed&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;For deeper analysis, use KQL to query governance events directly. Here's a query that finds all policy violations in the last 24 hours:&lt;/P&gt;
&lt;PRE class="language-sql" tabindex="0" contenteditable="false" data-lia-code-value="// Find all governance policy violations in the last 24 hours
traces
| where timestamp &amp;gt; ago(24h)
| where customDimensions[&amp;quot;governance.decision&amp;quot;] != &amp;quot;ALLOWED&amp;quot;
| extend 
    agentName = tostring(customDimensions[&amp;quot;agent.name&amp;quot;]),
    policyName = tostring(customDimensions[&amp;quot;governance.policy&amp;quot;]),
    decision = tostring(customDimensions[&amp;quot;governance.decision&amp;quot;]),
    violationReason = tostring(customDimensions[&amp;quot;governance.reason&amp;quot;]),
    targetUrl = tostring(customDimensions[&amp;quot;tool.target_url&amp;quot;])
| project timestamp, agentName, policyName, decision, violationReason, targetUrl
| order by timestamp desc
"&gt;&lt;CODE&gt;// Find all governance policy violations in the last 24 hours
traces
| where timestamp &amp;gt; ago(24h)
| where customDimensions["governance.decision"] != "ALLOWED"
| extend 
    agentName = tostring(customDimensions["agent.name"]),
    policyName = tostring(customDimensions["governance.policy"]),
    decision = tostring(customDimensions["governance.decision"]),
    violationReason = tostring(customDimensions["governance.reason"]),
    targetUrl = tostring(customDimensions["tool.target_url"])
| project timestamp, agentName, policyName, decision, violationReason, targetUrl
| order by timestamp desc
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;!-- SCREENSHOT: KQL query for policy violations with results --&gt;
&lt;P&gt;And here's one for tracking token budget consumption across agents:&lt;/P&gt;
&lt;PRE class="language-sql" tabindex="0" contenteditable="false" data-lia-code-value="// Token budget consumption by agent over the last hour
customMetrics
| where timestamp &amp;gt; ago(1h)
| where name == &amp;quot;governance.tokens.consumed&amp;quot;
| extend agentName = tostring(customDimensions[&amp;quot;agent.name&amp;quot;])
| summarize 
    totalTokens = sum(value),
    avgTokensPerRequest = avg(value),
    maxTokensPerRequest = max(value)
    by agentName, bin(timestamp, 5m)
| order by totalTokens desc
"&gt;&lt;CODE&gt;// Token budget consumption by agent over the last hour
customMetrics
| where timestamp &amp;gt; ago(1h)
| where name == "governance.tokens.consumed"
| extend agentName = tostring(customDimensions["agent.name"])
| summarize 
    totalTokens = sum(value),
    avgTokensPerRequest = avg(value),
    maxTokensPerRequest = max(value)
    by agentName, bin(timestamp, 5m)
| order by totalTokens desc
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This is the power of integrating governance with your existing observability stack. You don't need a separate governance dashboard — everything lives in the same App Insights workspace you already know.&lt;/P&gt;
&lt;H2&gt;SRE for agents&lt;/H2&gt;
&lt;P&gt;The Agent SRE package brings Site Reliability Engineering practices to agent systems. This was the part that got me most excited, because it addresses a question I hear constantly: &lt;EM&gt;"How do I know my agents are actually reliable?"&lt;/EM&gt;&lt;/P&gt;
&lt;H3&gt;Service Level Objectives (SLOs)&lt;/H3&gt;
&lt;P&gt;We defined SLOs in our policy file:&lt;/P&gt;
&lt;PRE class="language-yaml" tabindex="0" contenteditable="false" data-lia-code-value="slos:
  - name: weather-agent-latency
    agent: &amp;quot;WeatherAdvisorAgent&amp;quot;
    metric: latency-p99
    target: 5000ms
    window: 5m
"&gt;&lt;CODE&gt;slos:
  - name: weather-agent-latency
    agent: "WeatherAdvisorAgent"
    metric: latency-p99
    target: 5000ms
    window: 5m
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This says: "The Weather Advisor Agent must respond within 5 seconds at the 99th percentile, measured over a 5-minute rolling window." When the SLO is breached, the toolkit emits an alert event and can trigger automated responses.&lt;/P&gt;
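&lt;P&gt;Under the hood, an SLO check like this amounts to a rolling window of latency samples plus a percentile comparison. Here's an illustrative Python sketch of those semantics (a nearest-rank p99 over a 5-minute window); it's not the toolkit's implementation, just the contract made explicit:&lt;/P&gt;

```python
# Illustrative sketch of a latency SLO: keep samples from a rolling window
# and compare the nearest-rank 99th percentile against the target.
import time

class LatencySlo:
    def __init__(self, target_ms: float, window_s: float = 300.0):
        self.target_ms = target_ms
        self.window_s = window_s
        self.samples = []  # list of (timestamp_s, latency_ms)

    def record(self, latency_ms: float, now: float = None):
        self.samples.append((now if now is not None else time.time(), latency_ms))

    def p99(self, now: float = None) -> float:
        now = now if now is not None else time.time()
        window = sorted(l for t, l in self.samples if now - t <= self.window_s)
        if not window:
            return 0.0
        # nearest-rank percentile: rank = ceil(0.99 * n), index = rank - 1
        idx = max(0, -(-99 * len(window) // 100) - 1)
        return window[idx]

    def breached(self, now: float = None) -> bool:
        return self.p99(now) > self.target_ms

slo = LatencySlo(target_ms=5000)
for _ in range(99):
    slo.record(1000, now=0.0)
slo.record(6000, now=0.0)
print(slo.p99(now=10.0), slo.breached(now=10.0))  # one slow call still fits in the 1% tail
```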
&lt;H3&gt;Circuit breakers&lt;/H3&gt;
&lt;P&gt;Circuit breakers prevent cascading failures. If an agent fails 5 times in a row, the circuit opens, and subsequent requests get a fast failure response instead of waiting for another timeout:&lt;/P&gt;
&lt;PRE class="language-yaml" tabindex="0" contenteditable="false" data-lia-code-value="circuit-breakers:
  - agent: &amp;quot;*&amp;quot;
    failure-threshold: 5
    recovery-timeout: 30s
    half-open-max-calls: 2
"&gt;&lt;CODE&gt;circuit-breakers:
  - agent: "*"
    failure-threshold: 5
    recovery-timeout: 30s
    half-open-max-calls: 2
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;After 30 seconds, the circuit enters a half-open state, allowing 2 test calls through. If those succeed, the circuit closes and normal operation resumes. If they fail, the circuit opens again. This pattern is battle-tested in microservices — now it protects your agents too.&lt;/P&gt;
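&lt;P&gt;The closed/open/half-open state machine described above is easy to sketch. This illustrative Python version mirrors the YAML settings (5 failures to open, 30-second recovery, 2 half-open probe calls); it's a single-threaded sketch of the pattern, not the toolkit's code:&lt;/P&gt;

```python
# Illustrative circuit-breaker state machine matching the YAML above.
# Simplified: one successful probe closes the circuit, and there is no
# locking, which a real concurrent implementation would need.

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, half_open_max_calls=2):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.half_open_calls = 0

    def allow_request(self, now: float) -> bool:
        if self.state == "open":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"   # recovery window elapsed: try probes
                self.half_open_calls = 0
            else:
                return False               # fast failure instead of another timeout
        if self.state == "half-open":
            if self.half_open_calls >= self.half_open_max_calls:
                return False               # probe quota used up
            self.half_open_calls += 1
        return True

    def record_success(self):
        if self.state == "half-open":
            self.state = "closed"          # probe succeeded: resume normal operation
        self.failures = 0

    def record_failure(self, now: float):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
            self.failures = 0
```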
&lt;H3&gt;Error budgets&lt;/H3&gt;
&lt;P&gt;Error budgets tie SLOs to business decisions. If your Coordinator Agent's success rate target is 99.5% over a 15-minute window, that means you have an error budget of 0.5%. When the budget is consumed, the toolkit can automatically reduce agent autonomy — for example, requiring human approval for high-risk actions until the error budget recovers.&lt;/P&gt;
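&lt;P&gt;The arithmetic is worth making explicit. With a 99.5% success target, 0.5% of requests are your budget; dividing the observed failure rate by that budget tells you how much is left. A quick illustrative sketch:&lt;/P&gt;

```python
# Illustrative error-budget arithmetic for the example above. A 99.5%
# success-rate target leaves a 0.5% failure budget per window.

def error_budget_remaining(successes: int, failures: int, slo_target: float = 0.995) -> float:
    """Fraction of the window's error budget still unspent (negative = overspent)."""
    total = successes + failures
    if total == 0:
        return 1.0                        # nothing observed yet: full budget available
    budget = 1.0 - slo_target             # e.g. 0.5% of requests may fail
    failure_rate = failures / total
    return 1.0 - failure_rate / budget

# 1000 requests, 2 failures: 0.2% failure rate against a 0.5% budget
print(error_budget_remaining(998, 2))     # ~0.6: 60% of the budget remains
# 1000 requests, 6 failures: budget overspent, time to require human approval
print(error_budget_remaining(994, 6))     # ~-0.2
```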
&lt;!-- SCREENSHOT: App Insights dashboard showing agent SLO compliance --&gt;
&lt;P&gt;SRE practices turn agent reliability from a hope into a measurable, enforceable contract.&lt;/P&gt;
&lt;H2&gt;Architecture&lt;/H2&gt;
&lt;P&gt;Here's how everything fits together after adding governance:&lt;/P&gt;
&lt;!-- SCREENSHOT: Architecture diagram showing User → Agent → Governance Policy Engine → Approved Actions → External APIs, with App Service features (Managed Identity, VNet, App Insights) called out --&gt;
&lt;PRE class="language-text" tabindex="0" contenteditable="false" data-lia-code-value="
┌─────────────────────────────────────────────────────────────────┐
│                     Azure App Service                           │
│  ┌──────────────┐    ┌─────────────────────────────────────┐    │
│  │   Frontend   │───▶│           ASP.NET Core API          │    │
│  │   (Static)   │    │                                     │    │
│  └──────────────┘    │  ┌─────────────────────────────┐    │    │
│                      │  │     Coordinator Agent       │    │    │
│                      │  │  ┌───────┐  ┌────────────┐  │    │    │
│                      │  │  │ OTel  │─▶│ Governance │  │    │    │
│                      │  │  └───────┘  │   Engine   │  │    │    │
│                      │  │             │ ┌────────┐ │  │    │    │
│                      │  │             │ │Policies│ │  │    │    │
│                      │  │             │ └────────┘ │  │    │    │
│                      │  │             └─────┬──────┘  │    │    │
│                      │  └───────────────────┼─────────┘    │    │
│                      │  ┌───────────────────┼──────────┐   │    │
│                      │  │  Specialist Agents │         │   │    │
│                      │  │  (Currency, Weather, etc.)   │   │    │
│                      │  │  Each with OTel + Governance │   │    │
│                      │  └───────────────────┼──────────┘   │    │
│                      └──────────────────────┼──────────────┘    │
│                                             │                   │
│  ┌────────────┐  ┌───────────┐  ┌───────────┼─────────┐         │
│  │  Managed   │  │   VNet    │  │ App Insights        │         │
│  │  Identity  │  │Integration│  │ (Traces +           │         │
│  │ (no keys)  │  │(network   │  │  Governance Audit)  │         │
│  │            │  │ boundary) │  │                     │         │
│  └────────────┘  └───────────┘  └─────────────────────┘         │
└──────────────────────────────┬──────────────────────────────────┘
                               │ Only allowed APIs
                               ▼
                    ┌──────────────────────┐
                    │   External APIs      │
                    │  ✅ Frankfurter API  │
                    │  ✅ NWS Weather API  │
                    │  ❌ Everything else  │
                    └──────────────────────┘
"&gt;&lt;CODE&gt;
┌─────────────────────────────────────────────────────────────────┐
│                     Azure App Service                           │
│  ┌──────────────┐    ┌─────────────────────────────────────┐    │
│  │   Frontend   │───▶│           ASP.NET Core API          │    │
│  │   (Static)   │    │                                     │    │
│  └──────────────┘    │  ┌─────────────────────────────┐    │    │
│                      │  │     Coordinator Agent       │    │    │
│                      │  │  ┌───────┐  ┌────────────┐  │    │    │
│                      │  │  │ OTel  │─▶│ Governance │  │    │    │
│                      │  │  └───────┘  │   Engine   │  │    │    │
│                      │  │             │ ┌────────┐ │  │    │    │
│                      │  │             │ │Policies│ │  │    │    │
│                      │  │             │ └────────┘ │  │    │    │
│                      │  │             └─────┬──────┘  │    │    │
│                      │  └───────────────────┼─────────┘    │    │
│                      │  ┌───────────────────┼──────────┐   │    │
│                      │  │  Specialist Agents │         │   │    │
│                      │  │  (Currency, Weather, etc.)   │   │    │
│                      │  │  Each with OTel + Governance │   │    │
│                      │  └───────────────────┼──────────┘   │    │
│                      └──────────────────────┼──────────────┘    │
│                                             │                   │
│  ┌────────────┐  ┌───────────┐  ┌───────────┼─────────┐         │
│  │  Managed   │  │   VNet    │  │ App Insights        │         │
│  │  Identity  │  │Integration│  │ (Traces +           │         │
│  │ (no keys)  │  │(network   │  │  Governance Audit)  │         │
│  │            │  │ boundary) │  │                     │         │
│  └────────────┘  └───────────┘  └─────────────────────┘         │
└──────────────────────────────┬──────────────────────────────────┘
                               │ Only allowed APIs
                               ▼
                    ┌──────────────────────┐
                    │   External APIs      │
                    │  ✅ Frankfurter API  │
                    │  ✅ NWS Weather API  │
                    │  ❌ Everything else  │
                    └──────────────────────┘
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The key insight: governance is a &lt;STRONG&gt;transparent layer&lt;/STRONG&gt; in the agent pipeline. It sits between the agent's decision and the action's execution. The agent code doesn't know or care about governance — it just builds the agent with &lt;CODE&gt;.UseGovernance()&lt;/CODE&gt; and the policy engine handles the rest.&lt;/P&gt;
&lt;H2&gt;Bring it to your own agents&lt;/H2&gt;
&lt;P&gt;We've shown governance with Microsoft Agent Framework on .NET, but the toolkit is &lt;STRONG&gt;framework-agnostic&lt;/STRONG&gt;. Here's how to add it to other popular frameworks:&lt;/P&gt;
&lt;H3&gt;LangChain (Python)&lt;/H3&gt;
&lt;PRE class="language-python" tabindex="0" contenteditable="false" data-lia-code-value="from agent_governance import PolicyEngine, GovernanceCallbackHandler

policy_engine = PolicyEngine.from_yaml(&amp;quot;governance-policies.yaml&amp;quot;)

# Add governance as a LangChain callback handler
agent = create_react_agent(
    llm=llm,
    tools=tools,
    callbacks=[GovernanceCallbackHandler(policy_engine)]
)
"&gt;&lt;CODE&gt;from agent_governance import PolicyEngine, GovernanceCallbackHandler

policy_engine = PolicyEngine.from_yaml("governance-policies.yaml")

# Add governance as a LangChain callback handler
agent = create_react_agent(
    llm=llm,
    tools=tools,
    callbacks=[GovernanceCallbackHandler(policy_engine)]
)
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;H3&gt;CrewAI (Python)&lt;/H3&gt;
&lt;PRE class="language-python" tabindex="0" contenteditable="false" data-lia-code-value="from agent_governance import PolicyEngine
from agent_governance.integrations.crewai import GovernanceTaskDecorator

policy_engine = PolicyEngine.from_yaml(&amp;quot;governance-policies.yaml&amp;quot;)

# Add governance as a CrewAI task decorator
@GovernanceTaskDecorator(policy_engine)
def research_task(agent, context):
    return agent.execute(context)
"&gt;&lt;CODE&gt;from agent_governance import PolicyEngine
from agent_governance.integrations.crewai import GovernanceTaskDecorator

policy_engine = PolicyEngine.from_yaml("governance-policies.yaml")

# Add governance as a CrewAI task decorator
@GovernanceTaskDecorator(policy_engine)
def research_task(agent, context):
    return agent.execute(context)
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;H3&gt;Google ADK (Python)&lt;/H3&gt;
&lt;PRE class="language-python" tabindex="0" contenteditable="false" data-lia-code-value="from agent_governance import PolicyEngine
from agent_governance.integrations.google_adk import GovernancePlugin

policy_engine = PolicyEngine.from_yaml(&amp;quot;governance-policies.yaml&amp;quot;)

# Add governance as a Google ADK plugin
agent = Agent(
    model=&amp;quot;gemini-2.0-flash&amp;quot;,
    tools=[...],
    plugins=[GovernancePlugin(policy_engine)]
)
"&gt;&lt;CODE&gt;from agent_governance import PolicyEngine
from agent_governance.integrations.google_adk import GovernancePlugin

policy_engine = PolicyEngine.from_yaml("governance-policies.yaml")

# Add governance as a Google ADK plugin
agent = Agent(
    model="gemini-2.0-flash",
    tools=[...],
    plugins=[GovernancePlugin(policy_engine)]
)
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;H3&gt;TypeScript / Node.js&lt;/H3&gt;
&lt;PRE class="language-typescript" tabindex="0" contenteditable="false" data-lia-code-value="import { PolicyEngine } from '@microsoft/agentmesh-sdk';

const policyEngine = PolicyEngine.fromYaml('governance-policies.yaml');

// Use as middleware in your agent pipeline
agent.use(policyEngine.middleware());
"&gt;&lt;CODE&gt;import { PolicyEngine } from '@microsoft/agentmesh-sdk';

const policyEngine = PolicyEngine.fromYaml('governance-policies.yaml');

// Use as middleware in your agent pipeline
agent.use(policyEngine.middleware());
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Every integration hooks into the framework's native extension points — callbacks, decorators, plugins, middleware — so adding governance doesn't require rewriting your agent code. Install the package, point it at your policy file, and you're governed.&lt;/P&gt;
&lt;H2&gt;What's next&lt;/H2&gt;
&lt;P&gt;This wraps up our three-part series on building production-ready multi-agent AI applications on Azure App Service:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/build-multi-agent-ai-apps-on-azure-app-service-with-microsoft-agent-framework-1-/4510017" data-lia-auto-title="Blog 1: Build" data-lia-auto-title-active="0" target="_blank"&gt;Blog 1: Build&lt;/A&gt;&lt;/STRONG&gt; — Deploy a multi-agent travel planner with Microsoft Agent Framework 1.0&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/monitor-ai-agents-on-app-service-with-opentelemetry-and-the-new-application-insi/4510023" data-lia-auto-title="Blog 2: Monitor" data-lia-auto-title-active="0" target="_blank"&gt;Blog 2: Monitor&lt;/A&gt;&lt;/STRONG&gt; — Add observability with OpenTelemetry and the Application Insights Agents view&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Blog 3: Govern&lt;/STRONG&gt; — Secure agents for production with the Agent Governance Toolkit (you are here)&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The progression is intentional: first make it work, then make it visible, then make it safe. And the consistent theme across all three parts is that &lt;STRONG&gt;App Service makes each step easier&lt;/STRONG&gt; — managed hosting for Blog 1, integrated monitoring for Blog 2, and platform-level security features for Blog 3.&lt;/P&gt;
&lt;H3&gt;Next steps for your agents&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit" target="_blank"&gt;Explore the Agent Governance Toolkit&lt;/A&gt;&lt;/STRONG&gt; — star the repo, browse the 20 tutorials, try the demo&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Customize policies for your compliance needs&lt;/STRONG&gt; — start with our YAML template and adapt it to your domain. Healthcare teams: enable HIPAA mappings. Finance teams: add SOC2 controls.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Explore Agent Mesh for multi-agent trust&lt;/STRONG&gt; — if you have agents communicating across services or trust boundaries, Agent Mesh's cryptographic identity and trust scoring add another layer of defense&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Deploy the sample&lt;/STRONG&gt; — clone our &lt;A class="lia-external-url" href="https://github.com/Azure-Samples/app-service-agent-otel" target="_blank"&gt;travel planner repo&lt;/A&gt;, run &lt;CODE&gt;azd up&lt;/CODE&gt;, and see governed agents in action&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;AI agents are becoming autonomous decision-makers in high-stakes domains. The question isn't &lt;EM&gt;whether&lt;/EM&gt; we need governance — it's whether we build it proactively, before incidents force our hand. With the Agent Governance Toolkit and Azure App Service, you can add production governance to your agents today. In about 30 minutes.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Apr 2026 21:15:44 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/govern-ai-agents-on-app-service-with-the-microsoft-agent/ba-p/4510962</guid>
      <dc:creator>jordanselig</dc:creator>
      <dc:date>2026-04-13T21:15:44Z</dc:date>
    </item>
    <item>
      <title>Introducing Wildcard Roles in Azure Web PubSub: simpler, smarter permissions for real-time apps</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/introducing-wildcard-roles-in-azure-web-pubsub-simpler-smarter/ba-p/4509524</link>
      <description>&lt;P&gt;Real-time interactivity is now a baseline expectation across industries, from collaborative dashboards to trading platforms, IoT monitoring, and live data visualizations. Developers need a way to broadcast data instantly to connected clients without worrying about connection management, scaling, or infrastructure.&lt;/P&gt;
&lt;P&gt;That’s where &lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/azure-web-pubsub/overview" target="_blank" rel="noopener"&gt;Azure Web PubSub&lt;/A&gt; comes in. It provides a fully managed service that enables real-time messaging over WebSocket. Your applications can send and receive live updates instantly, without managing servers or message fan-out manually.&lt;/P&gt;
&lt;P&gt;Now Azure Web PubSub is introducing a new capability that makes permission management simpler and more scalable: &lt;STRONG&gt;using wildcard patterns to define client permissions in groups&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H2&gt;Understanding Azure Web PubSub&lt;/H2&gt;
&lt;img alt="Azure Web PubSub architecture showing client connections and message flow between backend and clients through the service. The dashed line indicates stateless API calls, while the solid line indicates persistent, long-lived WebSocket connections." /&gt;
&lt;P&gt;Azure Web PubSub allows you to add real-time capabilities to your app. At a high level:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Your backend generates a client access token&amp;nbsp;and hands it to a connecting client.&lt;/LI&gt;
&lt;LI&gt;The client connects to Azure Web PubSub over WebSocket using that token.&lt;/LI&gt;
&lt;LI&gt;Once connected, both the backend and client can send and receive messages through the service.&lt;/LI&gt;
&lt;LI&gt;Clients can be organized into&amp;nbsp;groups (for example, all clients in the same trading room or dashboard).&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Groups allow you to target specific audiences efficiently: sending messages to all users in `dashboard.operations` or receiving updates from `market.NASDAQ.MSFT`.&lt;/P&gt;
&lt;P&gt;To maintain security, every token defines a set of &lt;STRONG&gt;roles&lt;/STRONG&gt; that specify what the client can do. The code below illustrates what your backend specifies when generating a client access token for the scenarios mentioned above.&lt;/P&gt;
&lt;LI-CODE lang="javascript"&gt;// Arguments omitted for simplicity
const WebPubSubServiceClient = new WebPubSubServiceClient();
WebPubSubServiceClient.getClientAccessToken({ 
roles: [ 
"webpubsub.joinLeaveGroup.dashboard.operations",
"webpubsub.sendToGroups.dashboard.operations",
"webpubsub.sendToGroups.market.NASDAQ.MSFT",
],
});&lt;/LI-CODE&gt;
&lt;H2&gt;The Current Permission Model: literal roles&lt;/H2&gt;
&lt;P&gt;Until now, Azure Web PubSub used &lt;STRONG&gt;literal group roles&lt;/STRONG&gt; to define client permissions precisely.&lt;/P&gt;
&lt;P&gt;For example:&lt;/P&gt;
&lt;LI-CODE lang="javascript"&gt;roles: ["webpubsub.joinLeaveGroup.room123", "webpubsub.sendToGroup.room123"];&lt;/LI-CODE&gt;
&lt;P&gt;These roles are clear and secure: the client can only join and send to a specific group, `room123`.&lt;/P&gt;
&lt;P&gt;However, as your application scales and dynamically creates many groups — for example, hundreds of trading accounts, projects, or classrooms — issuing one role per group becomes cumbersome. Your backend needs to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Track all the groups a user is authorized for&lt;/LI&gt;
&lt;LI&gt;Generate large tokens containing many role strings&lt;/LI&gt;
&lt;LI&gt;Refresh those tokens every time group access changes&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;The New Capability: wildcard patterns for group roles&lt;/H2&gt;
&lt;P&gt;Wildcard roles let you express permissions using patterns instead of individual hardcoded names. With a single role, you can authorize access to many related groups.&lt;/P&gt;
&lt;P&gt;For example:&lt;/P&gt;
&lt;LI-CODE lang="javascript"&gt;roles: ["webpubsub.joinLeaveGroups.room.*", "webpubsub.sendToGroups.room.*"];&lt;/LI-CODE&gt;
&lt;P&gt;This allows the client to join or send messages to any group whose name starts with `room.`.&lt;/P&gt;
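&lt;P&gt;To make the matching concrete, here is an illustrative sketch (our own simplification for intuition, not the service's implementation; see the documentation linked below for the authoritative pattern semantics) of how a trailing &lt;CODE&gt;.*&lt;/CODE&gt; behaves as a prefix match:&lt;/P&gt;

```javascript
// Illustrative only: a simplified model of wildcard-role matching.
// A trailing ".*" authorizes any group whose name starts with the prefix;
// a pattern without "*" must match the group name exactly.
function roleAuthorizes(pattern, groupName) {
  if (pattern.endsWith(".*")) {
    // Keep the trailing dot so "room.*" matches "room.123" but not "roomette"
    return groupName.startsWith(pattern.slice(0, -1));
  }
  return pattern === groupName;
}

console.log(roleAuthorizes("room.*", "room.123")); // true
console.log(roleAuthorizes("room.*", "roomette")); // false
console.log(roleAuthorizes("room123", "room123")); // true
```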
&lt;H2&gt;Real-World Examples&lt;/H2&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 100%; border-width: 1px;"&gt;&lt;colgroup&gt;&lt;col style="width: 33.3333%" /&gt;&lt;col style="width: 33.3333%" /&gt;&lt;col style="width: 33.3333%" /&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Industry&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Example&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Benefit of wildcard roles&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Finance&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Risk monitoring bots subscribing to all trading accounts&lt;/td&gt;&lt;td&gt;One role covers `account:*` groups&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Gaming&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Matchmaking service observing all `lobby:*` rooms&lt;/td&gt;&lt;td&gt;Simplifies admin tools&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Education&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Teacher dashboard viewing all `class:*` groups&lt;/td&gt;&lt;td&gt;Fewer roles, easier permission management&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Collaboration&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Logging all messages across `project:*` for auditing purpose&lt;/td&gt;&lt;td&gt;Centralized monitoring without large tokens&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/azure-web-pubsub/concept-wildcard-group-roles" target="_blank" rel="noopener"&gt;Read the documentation for all supported wildcard patterns&lt;/A&gt;.&lt;/P&gt;
&lt;H2&gt;Why This Matters for Developers&lt;/H2&gt;
&lt;P&gt;Wildcard roles simplify the permission model for dynamic or large-scale systems:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Simpler token management&lt;/STRONG&gt;&amp;nbsp;– You no longer need to issue or refresh tokens every time a client’s group list changes.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Smaller tokens&lt;/STRONG&gt; – One pattern replaces many literal roles, reducing token size.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Dynamic authorization&lt;/STRONG&gt;&amp;nbsp;– When permissions for a client change (for example, they’re assigned to new groups that match existing patterns), there’s no need to regenerate tokens.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Deep Dive: financial trading platform&lt;/H2&gt;
&lt;P&gt;Let’s look at how wildcard roles can simplify real-time event management in a trading platform where financial assets, like stocks, are traded. &lt;A class="lia-external-url" href="https://github.com/Azure/azure-webpubsub/tree/main/samples/javascript/wildcard-trading" target="_blank" rel="noopener"&gt;See the complete code here.&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;The Setup&lt;/H3&gt;
&lt;P&gt;The platform includes:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;A &lt;STRONG&gt;trading dashboard&lt;/STRONG&gt;&amp;nbsp;for managers of a trading team&lt;/LI&gt;
&lt;LI&gt;A &lt;STRONG&gt;trading dashboard&lt;/STRONG&gt;&amp;nbsp;for human traders on a trading team&lt;/LI&gt;
&lt;LI&gt;Two risk analysis bots:
&lt;UL&gt;
&lt;LI&gt;A &lt;STRONG&gt;hardcoded risk bot&lt;/STRONG&gt;&amp;nbsp;that applies strict predefined rules&lt;/LI&gt;
&lt;LI&gt;An&amp;nbsp;&lt;STRONG&gt;LLM-based risk bot&lt;/STRONG&gt; that uses AI to detect unusual behavior&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;On the platform, each trading account, managed by one or more human traders, is given its own Web PubSub groups, alongside shared market-data groups:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;`account.1234.trades` – Trade updates&lt;/LI&gt;
&lt;LI&gt;`account.1234.orders` – Order events&lt;/LI&gt;
&lt;LI&gt;`market.NYSE` – Market data&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The backend publishes events to these groups whenever a new order or trade occurs, and clients subscribed to these groups receive real-time data.&lt;/P&gt;
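&lt;P&gt;As a sketch, the backend's publish path might look like the following. The group-naming helper mirrors the convention above; &lt;CODE&gt;group().sendToAll()&lt;/CODE&gt; is from the &lt;CODE&gt;@azure/web-pubsub&lt;/CODE&gt; server SDK, and the hub name is an assumption for illustration:&lt;/P&gt;

```javascript
// Sketch of the backend publisher. The helper mirrors the per-account
// group-naming convention used throughout this post.
function tradeGroup(accountId) {
  return `account.${accountId}.trades`;
}

async function publishTrade(serviceClient, accountId, trade) {
  // serviceClient: a WebPubSubServiceClient (from @azure/web-pubsub) for a
  // hub such as "trading". sendToAll fans the event out to every client
  // that has joined the group.
  await serviceClient.group(tradeGroup(accountId)).sendToAll(trade);
}

console.log(tradeGroup("1234")); // "account.1234.trades"
```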
&lt;H3&gt;Before: literal roles&lt;/H3&gt;
&lt;P&gt;Previously, each risk bot would need literal roles for every account:&lt;/P&gt;
&lt;LI-CODE lang="javascript"&gt;roles: [
"webpubsub.joinLeaveGroup.account.1234.trades",
"webpubsub.joinLeaveGroup.account.5678.trades",
"webpubsub.joinLeaveGroup.account.9012.orders",
];&lt;/LI-CODE&gt;
&lt;P&gt;If new accounts were created, new tokens had to be issued to include their roles.&lt;/P&gt;
&lt;H3&gt;Now: wildcard roles&lt;/H3&gt;
&lt;P&gt;With the new feature, each risk bot can receive a single, compact token:&lt;/P&gt;
&lt;LI-CODE lang="javascript"&gt;roles: [
"webpubsub.joinLeaveGroups.account.*",
"webpubsub.joinLeaveGroups.market.*",
];&lt;/LI-CODE&gt;
&lt;P&gt;Now, the bot automatically gains access to all existing and &lt;STRONG&gt;future&lt;/STRONG&gt;&amp;nbsp;account and market groups matching those patterns, &lt;STRONG&gt;without any token regeneration&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H2&gt;How the Risk Bots Work&lt;/H2&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 100%; height: 140px; border-width: 1px;"&gt;&lt;colgroup&gt;&lt;col style="width: 50%" /&gt;&lt;col style="width: 50%" /&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr style="height: 35px;"&gt;&lt;td style="height: 35px;"&gt;&lt;STRONG&gt;Component&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 35px;"&gt;&lt;STRONG&gt;Behavior&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 35px;"&gt;&lt;td style="height: 35px;"&gt;&lt;STRONG&gt;Hardcoded risk bot&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 35px;"&gt;Implements deterministic rules: e.g., if position size &amp;gt; 100 of a company's stock, trigger alert.&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 35px;"&gt;&lt;td style="height: 35px;"&gt;&lt;STRONG&gt;LLM risk bot&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 35px;"&gt;Uses AI models to identify anomalies, fraudulent behavior, or market stress.&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 35px;"&gt;&lt;td style="height: 35px;"&gt;&lt;STRONG&gt;Backend publisher&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 35px;"&gt;Emits order and trade events to `account:*` and `market:*` groups.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;When a trade event is published:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Both bots receive it in real time through wildcard subscriptions.&lt;/LI&gt;
&lt;LI&gt;Each evaluates the event differently.&lt;/LI&gt;
&lt;LI&gt;If a risk is detected, they publish an alert to `alerts.risk.*`.&lt;/LI&gt;
&lt;/OL&gt;
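&lt;P&gt;Step 1 of this flow can be sketched as follows. The &lt;CODE&gt;joinGroup&lt;/CODE&gt; and &lt;CODE&gt;on("group-message", ...)&lt;/CODE&gt; calls are modeled on the &lt;CODE&gt;@azure/web-pubsub-client&lt;/CODE&gt; SDK; the in-memory fake client and the 100-share rule are purely illustrative:&lt;/P&gt;

```javascript
// Sketch of a risk bot's subscription loop. Its wildcard role
// ("webpubsub.joinLeaveGroups.account.*") authorizes the joins; the bot
// still joins each concrete group it wants to observe.
async function watchAccounts(client, accountIds, onRisk) {
  for (const id of accountIds) {
    await client.joinGroup(`account.${id}.trades`);
    await client.joinGroup(`account.${id}.orders`);
  }
  client.on("group-message", (e) => {
    // Evaluate each event; hardcoded rules or an LLM call would go here.
    if (e.message.data.quantity > 100) onRisk(e.message.data);
  });
}

// Minimal in-memory stand-in for a WebPubSubClient, for illustration:
const joined = [];
const handlers = {};
const fakeClient = {
  joinGroup: async (g) => joined.push(g),
  on: (event, cb) => (handlers[event] = cb),
};

watchAccounts(fakeClient, ["1234"], (t) => console.log("ALERT", t)).then(() => {
  handlers["group-message"]({ message: { data: { quantity: 250 } } });
  console.log(joined); // [ 'account.1234.trades', 'account.1234.orders' ]
});
```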
&lt;P&gt;&lt;STRONG&gt;Traders&lt;/STRONG&gt; still receive messages for only their specific account group — using literal roles to ensure isolation:&lt;/P&gt;
&lt;LI-CODE lang="javascript"&gt;roles: ["webpubsub.joinLeaveGroup.account.1234.trades"];&lt;/LI-CODE&gt;
&lt;P&gt;This demonstrates a clean separation:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Automation and monitoring&lt;/STRONG&gt; use wildcard roles for flexibility.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;End users&lt;/STRONG&gt; use literal roles for strict access control.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Developer Experience: cleaner and more scalable&lt;/H2&gt;
&lt;P&gt;With wildcard roles, developers can design real-time architectures that are both expressive and efficient:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Simplified token issuance&lt;/LI&gt;
&lt;LI&gt;Reduced backend logic for permission changes&lt;/LI&gt;
&lt;LI&gt;Better scalability for dynamic environments&lt;/LI&gt;
&lt;LI&gt;Flexible system-level actors (bots, dashboards, monitors)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Together, these improvements reduce operational complexity while keeping access control transparent and secure. Whether you’re building a trading platform, a game server, or a collaborative dashboard, this new capability helps you scale real-time systems with less friction and more control.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Apr 2026 07:11:33 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/introducing-wildcard-roles-in-azure-web-pubsub-simpler-smarter/ba-p/4509524</guid>
      <dc:creator>kevinguo</dc:creator>
      <dc:date>2026-04-13T07:11:33Z</dc:date>
    </item>
    <item>
      <title>PHP 8.5 is now available on Azure App Service for Linux</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/php-8-5-is-now-available-on-azure-app-service-for-linux/ba-p/4510254</link>
      <description>&lt;P&gt;PHP 8.5 is now available on Azure App Service for Linux across all public regions. You can create a new PHP 8.5 app through the Azure portal, automate it with the Azure CLI, or deploy using ARM/Bicep templates.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;PHP 8.5 brings several useful runtime improvements. It includes&amp;nbsp;&lt;STRONG&gt;better diagnostics&lt;/STRONG&gt;, with fatal errors now providing a backtrace, which can make troubleshooting easier. It also adds the&amp;nbsp;&lt;STRONG&gt;pipe operator (|&amp;gt;)&lt;/STRONG&gt;&amp;nbsp;for cleaner, more readable code, along with broader improvements in syntax, performance, and type safety. You can take advantage of these improvements while continuing to use the deployment and management experience you already know in App Service.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For the full list of features, deprecations, and migration notes, see the official PHP 8.5 release page:&amp;nbsp;&lt;A class="lia-external-url" href="https://www.php.net/releases/8.5/en.php" target="_blank"&gt;https://www.php.net/releases/8.5/en.php&lt;/A&gt;&lt;/P&gt;
&lt;H3&gt;Getting started&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/app-service/quickstart-php?tabs=cli&amp;amp;pivots=platform-linux" target="_blank"&gt;Create a PHP web app in Azure App Service&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/app-service/configure-language-php?pivots=platform-linux" target="_blank"&gt;Configure a PHP app for Azure App Service&lt;/A&gt;&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Fri, 10 Apr 2026 10:11:11 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/php-8-5-is-now-available-on-azure-app-service-for-linux/ba-p/4510254</guid>
      <dc:creator>TulikaC</dc:creator>
      <dc:date>2026-04-10T10:11:11Z</dc:date>
    </item>
    <item>
      <title>A simpler way to deploy your code to Azure App Service for Linux</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/a-simpler-way-to-deploy-your-code-to-azure-app-service-for-linux/ba-p/4510240</link>
      <description>&lt;P&gt;We’ve added a new deployment experience for Azure App Service for Linux that makes it easier to get your code running on your web app.&lt;/P&gt;
&lt;P&gt;To get started, go to the Kudu/SCM site for your app:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;&amp;lt;sitename&amp;gt;.scm.azurewebsites.net&lt;/LI-CODE&gt;
&lt;P&gt;From there, open the new&amp;nbsp;&lt;STRONG&gt;Deployments&lt;/STRONG&gt; experience.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;You can now deploy your app by simply dragging and dropping a zip file containing your code. Once your file is uploaded, App Service shows you the contents of the zip so you can quickly verify what you’re about to deploy.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;If your application is already built and ready to run, you also have the option to&amp;nbsp;&lt;STRONG&gt;skip server-side build&lt;/STRONG&gt;. Otherwise, App Service can handle the build step for you.&lt;/P&gt;
&lt;P&gt;When you’re ready, select&amp;nbsp;&lt;STRONG&gt;Deploy&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;From there, the deployment starts right away, and you can follow each phase of the process as it happens. The experience shows clear progress through upload, build, and deployment, along with deployment logs to help you understand what’s happening behind the scenes.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;After the deployment succeeds, you can also view&amp;nbsp;&lt;STRONG&gt;runtime logs&lt;/STRONG&gt;, which makes it easier to confirm that your app has started successfully.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;This experience is ideal if you’re getting started with Azure App Service and want the quickest path from code to a running app. For production workloads and teams with established release processes, you’ll typically continue using an automated CI/CD pipeline (for example, GitHub Actions or Azure DevOps) for repeatable deployments.&lt;/P&gt;
&lt;P&gt;We’re continuing to improve the developer experience on App Service for Linux. Give it a try and let us know what you think.&lt;/P&gt;
</description>
      <pubDate>Fri, 10 Apr 2026 09:48:40 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/a-simpler-way-to-deploy-your-code-to-azure-app-service-for-linux/ba-p/4510240</guid>
      <dc:creator>TulikaC</dc:creator>
      <dc:date>2026-04-10T09:48:40Z</dc:date>
    </item>
    <item>
      <title>Monitor AI Agents on App Service with OpenTelemetry and the New Application Insights Agents View</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/monitor-ai-agents-on-app-service-with-opentelemetry-and-the-new/ba-p/4510023</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;STRONG&gt;Part 2 of 3:&lt;/STRONG&gt; In &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/build-multi-agent-ai-apps-on-azure-app-service-with-microsoft-agent-framework-1-/4510017" target="_blank" rel="noopener" data-lia-auto-title="Blog 1" data-lia-auto-title-active="0"&gt;Blog 1&lt;/A&gt;, we deployed a multi-agent travel planner on Azure App Service using the Microsoft Agent Framework (MAF) 1.0 GA. This post dives deep into how we instrumented those agents with OpenTelemetry and lit up the brand-new &lt;STRONG&gt;Agents (Preview)&lt;/STRONG&gt; view in Application Insights.&lt;/BLOCKQUOTE&gt;
&lt;BLOCKQUOTE&gt;&lt;STRONG&gt;📋 Prerequisite:&lt;/STRONG&gt; This post assumes you've followed the guidance in &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/build-multi-agent-ai-apps-on-azure-app-service-with-microsoft-agent-framework-1-/4510017" target="_blank" rel="noopener" data-lia-auto-title="Blog 1" data-lia-auto-title-active="0"&gt;Blog 1&lt;/A&gt; to deploy the multi-agent travel planner to Azure App Service. If you haven't deployed the app yet, start there first — you'll need a running App Service with the agents, Service Bus, Cosmos DB, and Azure OpenAI provisioned before the monitoring steps in this post will work.&lt;/BLOCKQUOTE&gt;
&lt;!-- SCREENSHOT: Banner image of the Agents (Preview) view in Application Insights showing the travel planner agents --&gt;
&lt;H2&gt;Deploying Agents Is Only Half the Battle&lt;/H2&gt;
&lt;P&gt;In Blog 1, we walked through deploying a multi-agent travel planning application on Azure App Service. Six specialized agents — a Coordinator, Currency Converter, Weather Advisor, Local Knowledge Expert, Itinerary Planner, and Budget Optimizer — work together to generate comprehensive travel plans. The architecture uses an ASP.NET Core API backed by a WebJob for async processing, Azure Service Bus for messaging, and Azure OpenAI for the brains.&lt;/P&gt;
&lt;P&gt;But here's the thing: deploying agents to production is only half the battle. Once they're running, you need answers to questions like:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Which agent is consuming the most tokens?&lt;/LI&gt;
&lt;LI&gt;How long does the Itinerary Planner take compared to the Weather Advisor?&lt;/LI&gt;
&lt;LI&gt;Is the Coordinator making too many LLM calls per workflow?&lt;/LI&gt;
&lt;LI&gt;When something goes wrong, which agent in the pipeline failed?&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Traditional APM gives you HTTP latencies and exception rates. That's table stakes. For AI agents, you need to see &lt;EM&gt;inside the agent&lt;/EM&gt; — the model calls, the tool invocations, the token spend. And that's exactly what Application Insights' new &lt;STRONG&gt;Agents (Preview)&lt;/STRONG&gt; view delivers, powered by OpenTelemetry and the GenAI semantic conventions.&lt;/P&gt;
&lt;P&gt;Let's break down how it all works.&lt;/P&gt;
&lt;H2&gt;The Agents (Preview) View in Application Insights&lt;/H2&gt;
&lt;P&gt;Azure Application Insights now includes a dedicated&amp;nbsp;&lt;STRONG&gt;Agents (Preview)&lt;/STRONG&gt; blade that provides unified monitoring purpose-built for AI agents. It's not just a generic dashboard — it understands agent concepts natively. Whether your agents are built with Microsoft Agent Framework, Azure AI Foundry, Copilot Studio, or a third-party framework, this view lights up as long as your telemetry follows the &lt;A class="lia-external-url" href="https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/" target="_blank" rel="noopener"&gt;GenAI semantic conventions&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Here's what you get out of the box:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Agent dropdown filter&lt;/STRONG&gt; — A dropdown populated by &lt;CODE&gt;gen_ai.agent.name&lt;/CODE&gt; values from your telemetry. In our travel planner, this shows all six agents: "Travel Planning Coordinator", "Currency Conversion Specialist", "Weather &amp;amp; Packing Advisor", "Local Expert &amp;amp; Cultural Guide", "Itinerary Planning Expert", and "Budget Optimization Specialist". You can filter the entire dashboard to one agent or view them all.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Token usage metrics&lt;/STRONG&gt; — Visualizations of input and output token consumption, broken down by agent. Instantly see which agents are the most expensive to run.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Operational metrics&lt;/STRONG&gt; — Latency distributions, error rates, and throughput for each agent. Spot performance regressions before users notice.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;End-to-end transaction details&lt;/STRONG&gt; — Click into any trace to see the full workflow: which agents were invoked, what tools they called, how long each step took. The "simple view" renders agent steps in a story-like format that's remarkably easy to follow.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Grafana integration&lt;/STRONG&gt; — One-click export to Azure Managed Grafana for custom dashboards and alerting.&lt;/LI&gt;
&lt;/UL&gt;
&lt;!-- SCREENSHOT: The Agents (Preview) view main dashboard showing token usage, operational metrics, and the agent dropdown --&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;!-- SCREENSHOT: Agent dropdown showing all 6 agents: Travel Planning Coordinator, Currency Conversion Specialist, Weather &amp; Packing Advisor, Local Expert &amp; Cultural Guide, Itinerary Planning Expert, Budget Optimization Specialist --&gt;&lt;img /&gt;
&lt;P&gt;The key insight: this view isn't magic. It works because the telemetry is structured using well-defined semantic conventions. Let's look at those next.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;STRONG&gt;📖 Docs:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/app/agents-view" target="_blank" rel="noopener"&gt;Application Insights Agents (Preview) view documentation&lt;/A&gt;&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;GenAI Semantic Conventions — The Foundation&lt;/H2&gt;
&lt;P&gt;The entire Agents view is powered by the &lt;A class="lia-external-url" href="https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/" target="_blank" rel="noopener"&gt;OpenTelemetry GenAI semantic conventions&lt;/A&gt;. These are a standardized set of span attributes that describe AI agent behavior in a way that any observability backend can understand. Think of them as the "contract" between your instrumented code and Application Insights.&lt;/P&gt;
&lt;P&gt;Let's walk through the key attributes and why each one matters:&lt;/P&gt;
&lt;H3&gt;&lt;CODE&gt;gen_ai.agent.name&lt;/CODE&gt;&lt;/H3&gt;
&lt;P&gt;This is the human-readable name of the agent. In our travel planner, each agent sets this via the &lt;CODE&gt;name&lt;/CODE&gt; parameter when constructing the MAF &lt;CODE&gt;ChatClientAgent&lt;/CODE&gt; — for example, &lt;CODE&gt;"Weather &amp;amp; Packing Advisor"&lt;/CODE&gt; or &lt;CODE&gt;"Budget Optimization Specialist"&lt;/CODE&gt;. This is what populates the agent dropdown in the Agents view. Without this attribute, Application Insights would have no way to distinguish one agent from another in your telemetry. It's the single most important attribute for agent-level monitoring.&lt;/P&gt;
&lt;H3&gt;&lt;CODE&gt;gen_ai.agent.description&lt;/CODE&gt;&lt;/H3&gt;
&lt;P&gt;A brief description of what the agent does. Our Weather Advisor, for example, is described as &lt;EM&gt;"Provides weather forecasts, packing recommendations, and activity suggestions based on destination weather conditions."&lt;/EM&gt; This metadata helps operators and on-call engineers quickly understand an agent's role without diving into source code. It shows up in trace details and helps contextualize what you're looking at when debugging.&lt;/P&gt;
&lt;H3&gt;&lt;CODE&gt;gen_ai.agent.id&lt;/CODE&gt;&lt;/H3&gt;
&lt;P&gt;A unique identifier for the agent instance. In MAF, this is typically an auto-generated GUID. While &lt;CODE&gt;gen_ai.agent.name&lt;/CODE&gt; is the human-friendly label, &lt;CODE&gt;gen_ai.agent.id&lt;/CODE&gt; is the machine-stable identifier. If you rename an agent, the ID stays the same, which is important for tracking agent behavior across code deployments.&lt;/P&gt;
&lt;H3&gt;&lt;CODE&gt;gen_ai.operation.name&lt;/CODE&gt;&lt;/H3&gt;
&lt;P&gt;The type of operation being performed. Values include &lt;CODE&gt;"chat"&lt;/CODE&gt; for standard LLM calls and &lt;CODE&gt;"execute_tool"&lt;/CODE&gt; for tool/function invocations. In our travel planner, when the Weather Advisor calls the &lt;CODE&gt;GetWeatherForecast&lt;/CODE&gt; function via the National Weather Service (NWS) API, or when the Currency Converter calls &lt;CODE&gt;ConvertCurrency&lt;/CODE&gt; via the Frankfurter API, those tool calls get their own spans with &lt;CODE&gt;gen_ai.operation.name = "execute_tool"&lt;/CODE&gt;. This lets you measure LLM think-time separately from tool execution time — a critical distinction for performance optimization.&lt;/P&gt;
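&lt;P&gt;As a rough sketch of how a tool call ends up with its own span (the method body and names here are illustrative, not copied from the sample): a plain .NET method is registered as a tool, and the instrumentation wraps each invocation in an &lt;CODE&gt;execute_tool&lt;/CODE&gt; span.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;using System.ComponentModel;
using Microsoft.Extensions.AI;

[Description("Gets the weather forecast for a destination city.")]
static string GetWeatherForecast(string city) =&amp;gt;
    // Illustrative stub — the real tool calls the NWS API
    $"Forecast for {city}: sunny";

// Registering the method as a tool; each invocation becomes a span
// with gen_ai.operation.name = "execute_tool"
var chatOptions = new ChatOptions
{
    Tools = [AIFunctionFactory.Create(GetWeatherForecast)]
};&lt;/CODE&gt;&lt;/PRE&gt;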
&lt;H3&gt;&lt;CODE&gt;gen_ai.request.model&lt;/CODE&gt; / &lt;CODE&gt;gen_ai.response.model&lt;/CODE&gt;&lt;/H3&gt;
&lt;P&gt;The model used for the request and the model that actually served the response (these can differ when providers do model routing). In our case, both are &lt;CODE&gt;"gpt-4o"&lt;/CODE&gt; since that's what we deploy via Azure OpenAI. These attributes let you track model usage across agents, spot unexpected model assignments, and correlate performance changes with model updates.&lt;/P&gt;
&lt;H3&gt;&lt;CODE&gt;gen_ai.usage.input_tokens&lt;/CODE&gt; / &lt;CODE&gt;gen_ai.usage.output_tokens&lt;/CODE&gt;&lt;/H3&gt;
&lt;P&gt;Token consumption per LLM call. This is what powers the token usage visualizations in the Agents view. The Coordinator agent, which aggregates results from all five specialist agents, tends to have higher output token counts because it's synthesizing a full travel plan. The Currency Converter, which makes focused API calls, uses fewer tokens overall. These attributes let you answer the question "which agent is costing me the most?" — and more importantly, let you set alerts when token usage spikes unexpectedly.&lt;/P&gt;
&lt;H3&gt;&lt;CODE&gt;gen_ai.system&lt;/CODE&gt;&lt;/H3&gt;
&lt;P&gt;The AI system or provider. In our case, this is &lt;CODE&gt;"openai"&lt;/CODE&gt; (set by the Azure OpenAI client instrumentation). If you're using multiple AI providers — say, Azure OpenAI for planning and a local model for classification — this attribute lets you filter and compare.&lt;/P&gt;
&lt;P&gt;Together, these attributes create a rich, structured view of agent behavior that goes far beyond generic tracing. They're the reason Application Insights can render agent-specific dashboards with token breakdowns, latency distributions, and end-to-end workflow views. Without these conventions, all you'd see is opaque HTTP calls to an OpenAI endpoint.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;STRONG&gt;💡 Key takeaway:&lt;/STRONG&gt; The GenAI semantic conventions are what transform generic distributed traces into &lt;EM&gt;agent-aware&lt;/EM&gt; observability. They're the bridge between your code and the Agents view. Any framework that emits these attributes — MAF, Semantic Kernel, LangChain — can light up this dashboard.&lt;/BLOCKQUOTE&gt;
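&lt;P&gt;MAF and &lt;CODE&gt;Microsoft.Extensions.AI&lt;/CODE&gt; set these attributes for you, but if you ever need to emit them from a component that isn't wrapped by either — say, a custom retrieval step — you can tag a span by hand with &lt;CODE&gt;System.Diagnostics&lt;/CODE&gt;. A minimal sketch (the source name and values are illustrative):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;using System.Diagnostics;

static readonly ActivitySource Source = new("MyApp.CustomAgent");

using var activity = Source.StartActivity("invoke_agent MyCustomAgent");
// Same attribute names the Agents view looks for
activity?.SetTag("gen_ai.agent.name", "My Custom Agent");
activity?.SetTag("gen_ai.operation.name", "chat");
activity?.SetTag("gen_ai.request.model", "gpt-4o");
activity?.SetTag("gen_ai.usage.input_tokens", 450);
activity?.SetTag("gen_ai.usage.output_tokens", 120);&lt;/CODE&gt;&lt;/PRE&gt;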
&lt;H2&gt;Two Layers of OpenTelemetry Instrumentation&lt;/H2&gt;
&lt;P&gt;Our travel planner sample instruments at two distinct levels, each capturing different aspects of agent behavior. Let's look at both.&lt;/P&gt;
&lt;H3&gt;Layer 1: IChatClient-Level Instrumentation&lt;/H3&gt;
&lt;P&gt;The first layer instruments at the &lt;CODE&gt;IChatClient&lt;/CODE&gt; level using &lt;CODE&gt;Microsoft.Extensions.AI&lt;/CODE&gt;. This is where we wrap the Azure OpenAI chat client with OpenTelemetry:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;var client = new AzureOpenAIClient(azureOpenAIEndpoint, new DefaultAzureCredential());
// Wrap with OpenTelemetry to emit GenAI semantic convention spans
return client.GetChatClient(modelDeploymentName).AsIChatClient()
    .AsBuilder()
    .UseOpenTelemetry()
    .Build();&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This single &lt;CODE&gt;.UseOpenTelemetry()&lt;/CODE&gt; call intercepts every LLM call and emits spans with:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;CODE&gt;gen_ai.system&lt;/CODE&gt; — the AI provider (e.g., &lt;CODE&gt;"openai"&lt;/CODE&gt;)&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;gen_ai.request.model&lt;/CODE&gt; / &lt;CODE&gt;gen_ai.response.model&lt;/CODE&gt; — which model was used&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;gen_ai.usage.input_tokens&lt;/CODE&gt; / &lt;CODE&gt;gen_ai.usage.output_tokens&lt;/CODE&gt; — token consumption per call&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;gen_ai.operation.name&lt;/CODE&gt; — the operation type (&lt;CODE&gt;"chat"&lt;/CODE&gt;)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Think of this as the "LLM layer" — it captures &lt;EM&gt;what the model is doing&lt;/EM&gt; regardless of which agent called it. It's model-centric telemetry.&lt;/P&gt;
&lt;H3&gt;Layer 2: Agent-Level Instrumentation&lt;/H3&gt;
&lt;P&gt;The second layer instruments at the agent level using MAF 1.0 GA's built-in OpenTelemetry support. This happens in the &lt;CODE&gt;BaseAgent&lt;/CODE&gt; class that all our agents inherit from:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;Agent = new ChatClientAgent(
    chatClient,
    instructions: Instructions,
    name: AgentName,
    description: Description,
    tools: chatOptions.Tools?.ToList())
    .AsBuilder()
    .UseOpenTelemetry(sourceName: AgentName)
    .Build();&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The &lt;CODE&gt;.UseOpenTelemetry(sourceName: AgentName)&lt;/CODE&gt; call on the MAF agent builder emits a different set of spans:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;CODE&gt;gen_ai.agent.name&lt;/CODE&gt; — the human-readable agent name (e.g., &lt;CODE&gt;"Weather &amp;amp; Packing Advisor"&lt;/CODE&gt;)&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;gen_ai.agent.description&lt;/CODE&gt; — what the agent does&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;gen_ai.agent.id&lt;/CODE&gt; — the unique agent identifier&lt;/LI&gt;
&lt;LI&gt;Agent invocation traces — spans that represent the full lifecycle of an agent call&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This is the "agent layer" — it captures &lt;EM&gt;which agent is doing the work&lt;/EM&gt; and provides the identity information that powers the Agents view dropdown and per-agent filtering.&lt;/P&gt;
&lt;H3&gt;Why Both Layers?&lt;/H3&gt;
&lt;P&gt;When both layers are active, you get the richest possible telemetry. The agent-level spans nest around the LLM-level spans, creating a trace hierarchy that looks like:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;Agent: "Weather &amp;amp; Packing Advisor" (gen_ai.agent.name)
  └── chat (gen_ai.operation.name)
        ├── model: gpt-4o, input_tokens: 450, output_tokens: 120
        └── execute_tool: GetWeatherForecast
              └── chat (follow-up with tool results)
                    └── model: gpt-4o, input_tokens: 680, output_tokens: 350&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;There is a tradeoff: with both layers active, you may see some span duplication since both the &lt;CODE&gt;IChatClient&lt;/CODE&gt; wrapper and the MAF agent wrapper emit spans for the same underlying LLM call. If you find the telemetry too noisy, you can disable one layer:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Agent layer only&lt;/STRONG&gt; (remove &lt;CODE&gt;.UseOpenTelemetry()&lt;/CODE&gt; from the &lt;CODE&gt;IChatClient&lt;/CODE&gt;) — You get agent identity but lose per-call token breakdowns.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;IChatClient layer only&lt;/STRONG&gt; (remove &lt;CODE&gt;.UseOpenTelemetry()&lt;/CODE&gt; from the agent builder) — You get detailed LLM metrics but lose agent identity in the Agents view.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;For the fullest experience with the Agents (Preview) view, we recommend keeping both layers active. The official sample uses both, and the Agents view is designed to handle the overlapping spans gracefully.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;STRONG&gt;📖 Docs:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/agent-framework/agents/observability" target="_blank" rel="noopener"&gt;MAF Observability Guide&lt;/A&gt;&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Exporting Telemetry to Application Insights&lt;/H2&gt;
&lt;P&gt;Emitting OpenTelemetry spans is only useful if they land somewhere you can query them. The good news is that &lt;STRONG&gt;Azure App Service and Application Insights have deep native integration&lt;/STRONG&gt; — App Service can auto-instrument your app, forward platform logs, and surface health metrics out of the box. For a full overview of monitoring capabilities, see &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/app-service/monitor-app-service?tabs=aspnetcore" target="_blank" rel="noopener"&gt;Monitor Azure App Service&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;For our AI agent scenario, we go beyond the built-in platform telemetry. We need the GenAI semantic convention spans that we configured in the previous sections to flow into App Insights so the Agents (Preview) view can render them. Our travel planner has two host processes — the ASP.NET Core API and a WebJob — and each requires a slightly different exporter setup.&lt;/P&gt;
&lt;H3&gt;ASP.NET Core API — Azure Monitor OpenTelemetry Distro&lt;/H3&gt;
&lt;P&gt;For the API, it's a single line. The &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/app/opentelemetry-enable?tabs=aspnetcore" target="_blank" rel="noopener"&gt;Azure Monitor OpenTelemetry Distro&lt;/A&gt; handles everything:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// Configure OpenTelemetry with Azure Monitor for traces, metrics, and logs.
// The APPLICATIONINSIGHTS_CONNECTION_STRING env var is auto-discovered.
builder.Services.AddOpenTelemetry().UseAzureMonitor();&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;That's it. The distro automatically:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Discovers the &lt;CODE&gt;APPLICATIONINSIGHTS_CONNECTION_STRING&lt;/CODE&gt; environment variable&lt;/LI&gt;
&lt;LI&gt;Configures trace, metric, and log exporters to Application Insights&lt;/LI&gt;
&lt;LI&gt;Sets up appropriate sampling and batching&lt;/LI&gt;
&lt;LI&gt;Registers standard ASP.NET Core HTTP instrumentation&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This is the recommended approach for any ASP.NET Core application. One NuGet package (&lt;CODE&gt;Azure.Monitor.OpenTelemetry.AspNetCore&lt;/CODE&gt;), one line of code, zero configuration files.&lt;/P&gt;
&lt;H3&gt;WebJob — Manual Exporter Setup&lt;/H3&gt;
&lt;P&gt;The WebJob is a non-ASP.NET Core host (it uses &lt;CODE&gt;Host.CreateApplicationBuilder&lt;/CODE&gt;), so the distro's convenience method isn't available. Instead, we configure the exporters explicitly:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// Configure OpenTelemetry with Azure Monitor for the WebJob (non-ASP.NET Core host).
// The APPLICATIONINSIGHTS_CONNECTION_STRING env var is auto-discovered.
builder.Services.AddOpenTelemetry()
    .ConfigureResource(r =&amp;gt; r.AddService("TravelPlanner.WebJob"))
    .WithTracing(t =&amp;gt; t
        .AddSource("*")
        .AddAzureMonitorTraceExporter())
    .WithMetrics(m =&amp;gt; m
        .AddMeter("*")
        .AddAzureMonitorMetricExporter());

builder.Logging.AddOpenTelemetry(o =&amp;gt; o.AddAzureMonitorLogExporter());&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;A few things to note:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;CODE&gt;.AddSource("*")&lt;/CODE&gt; — Subscribes to &lt;EM&gt;all&lt;/EM&gt; trace sources, including the ones emitted by MAF's &lt;CODE&gt;.UseOpenTelemetry(sourceName: AgentName)&lt;/CODE&gt;. In production, you might narrow this to specific source names for performance.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;.AddMeter("*")&lt;/CODE&gt; — Similarly captures all metrics, including the GenAI metrics emitted by the instrumentation layers.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;.ConfigureResource(r =&amp;gt; r.AddService("TravelPlanner.WebJob"))&lt;/CODE&gt; — Tags all telemetry with the service name so you can distinguish API vs. WebJob telemetry in Application Insights.&lt;/LI&gt;
&lt;LI&gt;The connection string is still auto-discovered from the &lt;CODE&gt;APPLICATIONINSIGHTS_CONNECTION_STRING&lt;/CODE&gt; environment variable — no need to pass it explicitly.&lt;/LI&gt;
&lt;/UL&gt;
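&lt;P&gt;If you do narrow the wildcards for production, the swap is mechanical — the source names below are illustrative, so check the actual names your instrumentation registers (for MAF, the &lt;CODE&gt;sourceName&lt;/CODE&gt; you passed to &lt;CODE&gt;.UseOpenTelemetry()&lt;/CODE&gt;):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;.WithTracing(t =&amp;gt; t
    // Subscribe to specific sources instead of "*"
    .AddSource("Microsoft.Extensions.AI")      // LLM-layer spans (name may differ by version)
    .AddSource("Travel Planning Coordinator")  // one per agent, matching the
    .AddSource("Weather &amp;amp; Packing Advisor")     // .UseOpenTelemetry(sourceName: AgentName) calls
    .AddAzureMonitorTraceExporter())&lt;/CODE&gt;&lt;/PRE&gt;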
&lt;P&gt;The key difference between these two approaches is ceremony, not capability. Both send the same GenAI spans to Application Insights; the Agents view works identically regardless of which exporter setup you use.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;STRONG&gt;📖 Docs:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/app/opentelemetry-enable?tabs=aspnetcore" target="_blank" rel="noopener"&gt;Azure Monitor OpenTelemetry Distro&lt;/A&gt;&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Infrastructure as Code — Provisioning the Monitoring Stack&lt;/H2&gt;
&lt;P&gt;The monitoring infrastructure is provisioned via Bicep modules alongside the rest of the application's Azure resources. Here's how it fits together.&lt;/P&gt;
&lt;H3&gt;Log Analytics Workspace&lt;/H3&gt;
&lt;P&gt;&lt;CODE&gt;infra/core/monitor/loganalytics.bicep&lt;/CODE&gt; creates the Log Analytics workspace that backs Application Insights:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2023-09-01' = {
  name: name
  location: location
  tags: tags
  properties: {
    sku: {
      name: 'PerGB2018'
    }
    retentionInDays: 30
  }
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;H3&gt;Application Insights&lt;/H3&gt;
&lt;P&gt;&lt;CODE&gt;infra/core/monitor/appinsights.bicep&lt;/CODE&gt; creates a workspace-based Application Insights resource connected to Log Analytics:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: name
  location: location
  tags: tags
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: logAnalyticsWorkspaceId
  }
}

output connectionString string = appInsights.properties.ConnectionString&lt;/CODE&gt;&lt;/PRE&gt;
&lt;H3&gt;Wiring It All Together&lt;/H3&gt;
&lt;P&gt;In &lt;CODE&gt;infra/main.bicep&lt;/CODE&gt;, the Application Insights connection string is passed as an app setting to the App Service:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;appSettings: {
  APPLICATIONINSIGHTS_CONNECTION_STRING: appInsights.outputs.connectionString
  // ... other app settings
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This is the critical glue: when the app starts, the OpenTelemetry distro (or manual exporters) auto-discover this environment variable and start sending telemetry to your Application Insights resource. No connection strings in code, no configuration files — it's all infrastructure-driven.&lt;/P&gt;
&lt;P&gt;The same connection string is available to both the API and the WebJob since they run on the same App Service. All agent telemetry from both host processes flows into a single Application Insights resource, giving you a unified view across the entire application.&lt;/P&gt;
&lt;H2&gt;See It in Action&lt;/H2&gt;
&lt;P&gt;Once the application is deployed and processing travel plan requests, here's how to explore the agent telemetry in Application Insights.&lt;/P&gt;
&lt;H3&gt;Step 1: Open the Agents (Preview) View&lt;/H3&gt;
&lt;P&gt;In the Azure portal, navigate to your Application Insights resource. In the left nav, look for &lt;STRONG&gt;Agents (Preview)&lt;/STRONG&gt; under the Investigations section. This opens the unified agent monitoring dashboard.&lt;/P&gt;
&lt;!-- SCREENSHOT: Agents (Preview) view main dashboard with token usage tiles and operational metrics for the travel planner --&gt;
&lt;H3&gt;Step 2: Filter by Agent&lt;/H3&gt;
&lt;P&gt;The agent dropdown at the top of the page is populated by the &lt;CODE&gt;gen_ai.agent.name&lt;/CODE&gt; values in your telemetry. You'll see all six agents listed:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Travel Planning Coordinator&lt;/LI&gt;
&lt;LI&gt;Currency Conversion Specialist&lt;/LI&gt;
&lt;LI&gt;Weather &amp;amp; Packing Advisor&lt;/LI&gt;
&lt;LI&gt;Local Expert &amp;amp; Cultural Guide&lt;/LI&gt;
&lt;LI&gt;Itinerary Planning Expert&lt;/LI&gt;
&lt;LI&gt;Budget Optimization Specialist&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Select a specific agent to filter the entire dashboard — token usage, latency, error rate — down to that one agent.&lt;/P&gt;
&lt;!-- SCREENSHOT: Agent dropdown expanded showing all 6 agents listed by their gen_ai.agent.name values --&gt;
&lt;H3&gt;Step 3: Review Token Usage&lt;/H3&gt;
&lt;P&gt;The token usage tile shows total input and output token consumption over your selected time range. Compare agents to find your biggest spenders. In our testing, the Coordinator agent consistently uses the most output tokens because it aggregates and synthesizes results from all five specialists.&lt;/P&gt;
&lt;H3&gt;Step 4: Drill into Traces&lt;/H3&gt;
&lt;P&gt;Click &lt;STRONG&gt;"View Traces with Agent Runs"&lt;/STRONG&gt; to see all agent executions. Each row represents a workflow run. You can filter by time range, status (success/failure), and specific agent.&lt;/P&gt;
&lt;!-- SCREENSHOT: Search overlay showing agent traces filtered by a specific agent, with columns for timestamp, agent name, duration, and status --&gt;
&lt;H3&gt;Step 5: End-to-End Transaction Details&lt;/H3&gt;
&lt;P&gt;Click any trace to open the end-to-end transaction details. The &lt;STRONG&gt;"simple view"&lt;/STRONG&gt; renders the agent workflow as a story — showing each step, which agent handled it, how long it took, and what tools were called. For a full travel plan, you'll see the Coordinator dispatch work to each specialist, tool calls to the NWS weather API and Frankfurter currency API, and the final aggregation step.&lt;/P&gt;
&lt;!-- SCREENSHOT: End-to-end transaction details in "simple view" showing the complete agent workflow: Coordinator → Weather Advisor (with GetWeatherForecast tool call) → Currency Converter (with ConvertCurrency tool call) → Local Knowledge → Itinerary Planner → Budget Optimizer → Coordinator aggregation --&gt;
&lt;H2&gt;Grafana Dashboards&lt;/H2&gt;
&lt;P&gt;The Agents (Preview) view in Application Insights is great for ad-hoc investigation. For ongoing monitoring and alerting, Azure Managed Grafana provides prebuilt dashboards specifically designed for agent workloads.&lt;/P&gt;
&lt;P&gt;From the Agents view, click &lt;STRONG&gt;"Explore in Grafana"&lt;/STRONG&gt; to jump directly into these dashboards:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A class="lia-external-url" href="https://aka.ms/amg/dash/af-agent" target="_blank" rel="noopener"&gt;Agent Framework Dashboard&lt;/A&gt;&lt;/STRONG&gt; — Per-agent metrics including token usage trends, latency percentiles, error rates, and throughput over time. Pin this to your operations wall.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A class="lia-external-url" href="https://aka.ms/amg/dash/af-workflow" target="_blank" rel="noopener"&gt;Agent Framework Workflow Dashboard&lt;/A&gt;&lt;/STRONG&gt; — Workflow-level metrics showing how multi-agent orchestrations perform end-to-end. See how long complete travel plans take, identify bottleneck agents, and track success rates.&lt;/LI&gt;
&lt;/UL&gt;
&lt;!-- SCREENSHOT: Grafana Agent Framework dashboard showing token usage trends, latency distributions, and throughput charts for the travel planner agents --&gt;
&lt;P&gt;These dashboards query the same underlying data in Log Analytics, so there's zero additional instrumentation needed. If your telemetry lights up the Agents view, it lights up Grafana too.&lt;/P&gt;
&lt;H2&gt;Key Packages Summary&lt;/H2&gt;
&lt;P&gt;Here are the NuGet packages that make this work, pulled from the actual project files:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Package&lt;/th&gt;&lt;th&gt;Version&lt;/th&gt;&lt;th&gt;Purpose&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;Azure.Monitor.OpenTelemetry.AspNetCore&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;1.3.0&lt;/td&gt;&lt;td&gt;Azure Monitor OTEL Distro for ASP.NET Core (API). One-line setup for traces, metrics, and logs.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;Azure.Monitor.OpenTelemetry.Exporter&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;1.3.0&lt;/td&gt;&lt;td&gt;Azure Monitor OTEL exporter for non-ASP.NET Core hosts (WebJob). Trace, metric, and log exporters.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;Microsoft.Agents.AI&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;1.0.0&lt;/td&gt;&lt;td&gt;MAF 1.0 GA — &lt;CODE&gt;ChatClientAgent&lt;/CODE&gt;, &lt;CODE&gt;.UseOpenTelemetry()&lt;/CODE&gt; for agent-level instrumentation.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;Microsoft.Extensions.AI&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;10.4.1&lt;/td&gt;&lt;td&gt;&lt;CODE&gt;IChatClient&lt;/CODE&gt; abstraction with &lt;CODE&gt;.UseOpenTelemetry()&lt;/CODE&gt; for LLM-level instrumentation.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;OpenTelemetry.Extensions.Hosting&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;1.11.2&lt;/td&gt;&lt;td&gt;OTEL dependency injection integration for &lt;CODE&gt;Host.CreateApplicationBuilder&lt;/CODE&gt; (WebJob).&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;Microsoft.Extensions.AI.OpenAI&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;10.4.1&lt;/td&gt;&lt;td&gt;OpenAI/Azure OpenAI adapter for &lt;CODE&gt;IChatClient&lt;/CODE&gt;. 
Bridges the Azure OpenAI SDK to the M.E.AI abstraction.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;Wrapping Up&lt;/H2&gt;
&lt;P&gt;Let's zoom out. So far in this three-part series, we've gone from zero to a fully observable, production-grade multi-agent AI application on Azure App Service:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Blog 1&lt;/STRONG&gt; covered deploying the multi-agent travel planner with MAF 1.0 GA — the agents, the architecture, the infrastructure.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Blog 2&lt;/STRONG&gt; (this post) showed how to instrument those agents with OpenTelemetry, explained the GenAI semantic conventions that make agent-aware monitoring possible, and walked through the new Agents (Preview) view in Application Insights.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Blog 3&lt;/STRONG&gt;&amp;nbsp;will show you how to secure those agents for production with the Microsoft Agent Governance Toolkit.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The pattern is straightforward:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Add &lt;CODE&gt;.UseOpenTelemetry()&lt;/CODE&gt; at the &lt;CODE&gt;IChatClient&lt;/CODE&gt; level for LLM metrics.&lt;/LI&gt;
&lt;LI&gt;Add &lt;CODE&gt;.UseOpenTelemetry(sourceName: AgentName)&lt;/CODE&gt; at the MAF agent level for agent identity.&lt;/LI&gt;
&lt;LI&gt;Export to Application Insights via the Azure Monitor distro (one line) or manual exporters.&lt;/LI&gt;
&lt;LI&gt;Wire the connection string through Bicep and environment variables.&lt;/LI&gt;
&lt;LI&gt;Open the Agents (Preview) view and start monitoring.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;With MAF 1.0 GA's built-in OpenTelemetry support and Application Insights' new Agents view, you get production-grade observability for AI agents with minimal code. The GenAI semantic conventions ensure your telemetry is structured, portable, and understood by any compliant backend. And because it's all standard OpenTelemetry, you're not locked into any single vendor — swap the exporter and your telemetry goes to Jaeger, Grafana, Datadog, or wherever you need it.&lt;/P&gt;
&lt;P&gt;Now go see what your agents are up to and check out &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/govern-ai-agents-on-app-service-with-the-microsoft-agent-governance-toolkit/4510962" data-lia-auto-title="Blog 3" data-lia-auto-title-active="0" target="_blank"&gt;Blog 3&lt;/A&gt;.&lt;/P&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Sample repository:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://github.com/seligj95/app-service-multi-agent-maf-otel" target="_blank" rel="noopener"&gt;seligj95/app-service-multi-agent-maf-otel&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;App Insights Agents (Preview) view:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/app/agents-view" target="_blank" rel="noopener"&gt;Documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;GenAI Semantic Conventions:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/" target="_blank" rel="noopener"&gt;OpenTelemetry GenAI Registry&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;MAF Observability Guide:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/agent-framework/agents/observability" target="_blank" rel="noopener"&gt;Microsoft Agent Framework Observability&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Monitor OpenTelemetry Distro:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/app/opentelemetry-enable?tabs=aspnetcore" target="_blank" rel="noopener"&gt;Enable OpenTelemetry for .NET&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Grafana Agent Framework Dashboard:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://aka.ms/amg/dash/af-agent" target="_blank" rel="noopener"&gt;aka.ms/amg/dash/af-agent&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Grafana Workflow Dashboard:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://aka.ms/amg/dash/af-workflow" target="_blank" rel="noopener"&gt;aka.ms/amg/dash/af-workflow&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Blog 1:&lt;/STRONG&gt; &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/build-multi-agent-ai-apps-on-azure-app-service-with-microsoft-agent-framework-1-/4510017" target="_blank" rel="noopener" data-lia-auto-title="Deploy Multi-Agent AI Apps on Azure App Service with MAF 1.0 GA" data-lia-auto-title-active="0"&gt;Deploy Multi-Agent AI Apps on Azure App Service with MAF 1.0 GA&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Blog 3: &lt;/STRONG&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/govern-ai-agents-on-app-service-with-the-microsoft-agent-governance-toolkit/4510962" data-lia-auto-title="Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit | Microsoft Community Hub" data-lia-auto-title-active="0" target="_blank"&gt;Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 14 Apr 2026 16:28:22 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/monitor-ai-agents-on-app-service-with-opentelemetry-and-the-new/ba-p/4510023</guid>
      <dc:creator>jordanselig</dc:creator>
      <dc:date>2026-04-14T16:28:22Z</dc:date>
    </item>
    <item>
      <title>Build Multi-Agent AI Apps on Azure App Service with Microsoft Agent Framework 1.0</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/build-multi-agent-ai-apps-on-azure-app-service-with-microsoft/ba-p/4510017</link>
      <description>&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Part 1 of 3 — Multi-Agent AI on Azure App Service&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This is part 1 of a three-part series on deploying and working with multi-agent AI on Azure App Service. Follow along to learn how to deploy, manage, observe, and secure your agents on Azure App Service.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;A couple of months ago, we published a &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/part-3-client-side-multi-agent-orchestration-on-azure-app-service-with-microsoft/4466728" target="_blank" rel="noopener" data-lia-auto-title="three-part series" data-lia-auto-title-active="0"&gt;three-part series&lt;/A&gt; showing how to build multi-agent AI systems on Azure App Service using preview packages from the Microsoft Agent Framework (MAF) (formerly AutoGen / Semantic Kernel Agents). The series walked through async processing, the request-reply pattern, and client-side multi-agent orchestration — all running on App Service.&lt;/P&gt;
&lt;P&gt;Since then, &lt;STRONG&gt;Microsoft Agent Framework has reached 1.0 GA&lt;/STRONG&gt; — unifying AutoGen and Semantic Kernel into a single, production-ready agent platform. This post is a fresh start with the GA bits. We'll rebuild our travel-planner sample on the stable API surface, call out the breaking changes from preview, and get you up and running fast.&lt;/P&gt;
&lt;P&gt;All of the code is in the companion repo: &lt;A class="lia-external-url" href="https://github.com/seligj95/app-service-multi-agent-maf-otel" target="_blank" rel="noopener"&gt;seligj95/app-service-multi-agent-maf-otel&lt;/A&gt;.&lt;/P&gt;
&lt;H2&gt;What Changed in MAF 1.0 GA&lt;/H2&gt;
&lt;P&gt;The 1.0 release is more than a version bump. Here's what moved:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Unified platform.&lt;/STRONG&gt; AutoGen and Semantic Kernel agent capabilities have converged into &lt;CODE&gt;Microsoft.Agents.AI&lt;/CODE&gt;. One package, one API surface.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Stable APIs with long-term support.&lt;/STRONG&gt; The 1.0 contract is now locked for servicing. No more preview churn.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Breaking change — &lt;CODE&gt;Instructions&lt;/CODE&gt; on options removed.&lt;/STRONG&gt; In preview, you set instructions through &lt;CODE&gt;ChatClientAgentOptions.Instructions&lt;/CODE&gt;. In GA, pass them directly to the &lt;CODE&gt;ChatClientAgent&lt;/CODE&gt; constructor.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Breaking change — &lt;CODE&gt;RunAsync&lt;/CODE&gt; parameter rename.&lt;/STRONG&gt; The &lt;CODE&gt;thread&lt;/CODE&gt; parameter is now &lt;CODE&gt;session&lt;/CODE&gt; (type &lt;CODE&gt;AgentSession&lt;/CODE&gt;). If you were using named arguments, this is a compile error.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;CODE&gt;Microsoft.Extensions.AI&lt;/CODE&gt; upgraded.&lt;/STRONG&gt; The framework moved from the 9.x preview of &lt;CODE&gt;Microsoft.Extensions.AI&lt;/CODE&gt; to the stable &lt;STRONG&gt;10.4.1&lt;/STRONG&gt; release.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;OpenTelemetry integration built in.&lt;/STRONG&gt; The builder pipeline now includes &lt;CODE&gt;UseOpenTelemetry()&lt;/CODE&gt; out of the box — more on that in Blog 2.&lt;/LI&gt;
&lt;/UL&gt;
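&lt;P&gt;To make the two breaking changes concrete, here's a minimal before-and-after sketch (the preview shape is shown from memory of the preview packages, so treat it as illustrative rather than exact):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// Preview (no longer compiles in 1.0 GA): instructions lived on the options object
// var agent = new ChatClientAgent(chatClient,
//     new ChatClientAgentOptions { Instructions = "You plan travel itineraries." });

// GA: pass instructions directly to the constructor
var agent = new ChatClientAgent(
    chatClient,
    instructions: "You plan travel itineraries.",
    name: "ItineraryPlanner");

// GA: the RunAsync parameter formerly named 'thread' is now 'session'
var response = await agent.RunAsync(messages, session: null);&lt;/CODE&gt;&lt;/PRE&gt;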
&lt;P&gt;Our project references reflect the GA stack:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;&amp;lt;PackageReference Include="Microsoft.Agents.AI" Version="1.0.0" /&amp;gt;
&amp;lt;PackageReference Include="Microsoft.Extensions.AI" Version="10.4.1" /&amp;gt;
&amp;lt;PackageReference Include="Azure.AI.OpenAI" Version="2.1.0" /&amp;gt;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;H2&gt;Why Azure App Service for AI Agents?&lt;/H2&gt;
&lt;P&gt;If you're building with Microsoft Agent Framework, you need somewhere to run your agents. You could reach for Kubernetes, containers, or serverless — but for most agent workloads, &lt;STRONG&gt;Azure App Service is the sweet spot&lt;/STRONG&gt;. Here's why:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;No infrastructure management&lt;/STRONG&gt; — App Service is fully managed. No clusters to configure, no container orchestration to learn. Deploy your .NET or Python agent code and it just runs.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Always On&lt;/STRONG&gt; — Agent workflows can take minutes. App Service's Always On feature (on Premium tiers) ensures your background workers never go cold, so agents are ready to process requests instantly.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;WebJobs for background processing&lt;/STRONG&gt; — Long-running agent workflows don't belong in HTTP request handlers. App Service's built-in WebJob support gives you a dedicated background worker that shares the same deployment, configuration, and managed identity — no separate compute resource needed.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Managed Identity everywhere&lt;/STRONG&gt; — Zero secrets in your code. App Service's system-assigned managed identity authenticates to Azure OpenAI, Service Bus, Cosmos DB, and Application Insights automatically. No connection strings, no API keys, no rotation headaches.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Built-in observability&lt;/STRONG&gt; — Native integration with Application Insights and OpenTelemetry means you can see exactly what your agents are doing in production (more on this in Part 2).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Enterprise-ready&lt;/STRONG&gt; — VNet integration, deployment slots for safe rollouts, custom domains, auto-scaling rules, and built-in authentication. All the things you'll need when your agent POC becomes a production service.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Cost-effective&lt;/STRONG&gt; — A single P0v4 instance (~$75/month) hosts both your API and WebJob worker. Compare that to running separate container apps or a Kubernetes cluster for the same workload.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The bottom line: App Service lets you focus on building your agents, not managing infrastructure. And since MAF supports both .NET and Python — both first-class citizens on App Service — you're covered regardless of your language preference.&lt;/P&gt;
&lt;H2&gt;Architecture Overview&lt;/H2&gt;
&lt;P&gt;The sample is a &lt;STRONG&gt;travel planner&lt;/STRONG&gt; that coordinates six specialized agents to build a personalized trip itinerary. Users fill out a form (destination, dates, budget, interests), and the system returns a comprehensive travel plan complete with weather forecasts, currency advice, a day-by-day itinerary, and a budget breakdown.&lt;/P&gt;
&lt;img /&gt;&lt;!-- SCREENSHOT: Architecture diagram showing the full system — User → Web UI → App Service API → Service Bus → WebJob → Multi-Agent Workflow → Azure OpenAI, with Cosmos DB for state. Use the Mermaid diagram from architecture.md or a polished version of it. --&gt;
&lt;H3&gt;The Six Agents&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Currency Converter&lt;/STRONG&gt; — calls the &lt;A class="lia-external-url" href="https://www.frankfurter.dev/" target="_blank" rel="noopener"&gt;Frankfurter API&lt;/A&gt; for real-time exchange rates&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Weather Advisor&lt;/STRONG&gt; — calls the &lt;A class="lia-external-url" href="https://www.weather.gov/documentation/services-web-api" target="_blank" rel="noopener"&gt;National Weather Service API&lt;/A&gt; for forecasts and packing tips&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Local Knowledge Expert&lt;/STRONG&gt; — cultural insights, customs, and hidden gems&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Itinerary Planner&lt;/STRONG&gt; — day-by-day scheduling with timing and costs&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Budget Optimizer&lt;/STRONG&gt; — allocates spend across categories and suggests savings&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Coordinator&lt;/STRONG&gt; — assembles everything into a polished final plan&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;Four-Phase Workflow&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;Phase&lt;/th&gt;&lt;th&gt;Agents&lt;/th&gt;&lt;th&gt;Execution&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;1 — Parallel Gathering&lt;/td&gt;&lt;td&gt;Currency, Weather, Local Knowledge&lt;/td&gt;&lt;td&gt;&lt;CODE&gt;Task.WhenAll&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2 — Itinerary&lt;/td&gt;&lt;td&gt;Itinerary Planner&lt;/td&gt;&lt;td&gt;Sequential (uses Phase 1 context)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3 — Budget&lt;/td&gt;&lt;td&gt;Budget Optimizer&lt;/td&gt;&lt;td&gt;Sequential (uses Phase 2 output)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4 — Assembly&lt;/td&gt;&lt;td&gt;Coordinator&lt;/td&gt;&lt;td&gt;Final synthesis&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H3&gt;Infrastructure&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure App Service (P0v4)&lt;/STRONG&gt; — hosts the API and a continuous WebJob for background processing&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Service Bus&lt;/STRONG&gt; — decouples the API from heavy AI work (async request-reply)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Cosmos DB&lt;/STRONG&gt; — stores task state, results, and per-agent chat histories (24-hour TTL)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure OpenAI (GPT-4o)&lt;/STRONG&gt; — powers all agent LLM calls&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Application Insights + Log Analytics&lt;/STRONG&gt; — monitoring and diagnostics&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;ChatClientAgent Deep Dive&lt;/H2&gt;
&lt;P&gt;At the core of every agent is &lt;CODE&gt;ChatClientAgent&lt;/CODE&gt; from &lt;CODE&gt;Microsoft.Agents.AI&lt;/CODE&gt;. It wraps an &lt;CODE&gt;IChatClient&lt;/CODE&gt; (from &lt;CODE&gt;Microsoft.Extensions.AI&lt;/CODE&gt;) with instructions, a name, a description, and optionally a set of tools. This is &lt;STRONG&gt;client-side&lt;/STRONG&gt; orchestration — you control the chat history, lifecycle, and execution order. No server-side Foundry agent resources are created.&lt;/P&gt;
&lt;P&gt;Here's the &lt;CODE&gt;BaseAgent&lt;/CODE&gt; pattern used by all six agents in the sample:&lt;/P&gt;
&lt;!-- SCREENSHOT: The BaseAgent.cs file open in VS Code or Visual Studio, showing the full class with both constructors and the InvokeAsync method. --&gt;
&lt;PRE&gt;&lt;CODE&gt;// BaseAgent.cs — constructor for agents with tools
Agent = new ChatClientAgent(
    chatClient,
    instructions: Instructions,
    name: AgentName,
    description: Description,
    tools: chatOptions.Tools?.ToList())
    .AsBuilder()
    .UseOpenTelemetry(sourceName: AgentName)
    .Build();&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Notice the builder pipeline: &lt;CODE&gt;.AsBuilder().UseOpenTelemetry(...).Build()&lt;/CODE&gt;. This opts every agent into the framework's built-in OpenTelemetry instrumentation with a single line. We'll explore what that telemetry looks like in Blog 2.&lt;/P&gt;
&lt;P&gt;Invoking an agent is equally straightforward:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// BaseAgent.cs — InvokeAsync
public async Task&amp;lt;ChatMessage&amp;gt; InvokeAsync(
    IList&amp;lt;ChatMessage&amp;gt; chatHistory,
    CancellationToken cancellationToken = default)
{
    var response = await Agent.RunAsync(
        chatHistory, session: null, options: null, cancellationToken);

    return response.Messages.LastOrDefault()
        ?? new ChatMessage(ChatRole.Assistant, "No response generated.");
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Key things to note:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;CODE&gt;session: null&lt;/CODE&gt; — this is the renamed parameter (was &lt;CODE&gt;thread&lt;/CODE&gt; in preview). We pass &lt;CODE&gt;null&lt;/CODE&gt; because we manage chat history ourselves.&lt;/LI&gt;
&lt;LI&gt;The agent receives the full &lt;CODE&gt;chatHistory&lt;/CODE&gt; list, so context accumulates across turns.&lt;/LI&gt;
&lt;LI&gt;Simple agents (Local Knowledge, Itinerary Planner, Budget Optimizer, Coordinator) use the tool-less constructor; agents that call external APIs (Currency, Weather) use the constructor that accepts &lt;CODE&gt;ChatOptions&lt;/CODE&gt; with tools.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Tool Integration&lt;/H2&gt;
&lt;P&gt;Two of our agents — &lt;STRONG&gt;Weather Advisor&lt;/STRONG&gt; and &lt;STRONG&gt;Currency Converter&lt;/STRONG&gt; — call real external APIs through the MAF tool-calling pipeline. Tools are registered using &lt;CODE&gt;AIFunctionFactory.Create()&lt;/CODE&gt; from &lt;CODE&gt;Microsoft.Extensions.AI&lt;/CODE&gt;.&lt;/P&gt;
&lt;P&gt;Here's how the &lt;CODE&gt;WeatherAdvisorAgent&lt;/CODE&gt; wires up its tool:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// WeatherAdvisorAgent.cs
private static ChatOptions CreateChatOptions(
    IWeatherService weatherService, ILogger logger)
{
    var chatOptions = new ChatOptions
    {
        Tools = new List&amp;lt;AITool&amp;gt;
        {
            AIFunctionFactory.Create(
                GetWeatherForecastFunction(weatherService, logger))
        }
    };
    return chatOptions;
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;CODE&gt;GetWeatherForecastFunction&lt;/CODE&gt; returns a &lt;CODE&gt;Func&amp;lt;double, double, int, Task&amp;lt;string&amp;gt;&amp;gt;&lt;/CODE&gt; that the model can call with latitude, longitude, and number of days. Under the hood, it hits the National Weather Service API and returns a formatted forecast string. The Currency Converter follows the same pattern with the Frankfurter API.&lt;/P&gt;
&lt;P&gt;This is one of the nicest parts of the GA API: you write a plain C# method, wrap it with &lt;CODE&gt;AIFunctionFactory.Create()&lt;/CODE&gt;, and the framework handles the JSON schema generation, function-call parsing, and response routing automatically.&lt;/P&gt;
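&lt;P&gt;For illustration, a function of that shape can be written like this. This is a sketch, not the repo's exact code: the &lt;CODE&gt;IWeatherService.GetForecastAsync&lt;/CODE&gt; call is an assumed name. The &lt;CODE&gt;[Description]&lt;/CODE&gt; attributes (from &lt;CODE&gt;System.ComponentModel&lt;/CODE&gt;) feed the JSON schema the model sees:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// Sketch only: GetForecastAsync is an assumed service method for illustration
private static Func&amp;lt;double, double, int, Task&amp;lt;string&amp;gt;&amp;gt; GetWeatherForecastFunction(
    IWeatherService weatherService, ILogger logger) =&amp;gt;
    async ([Description("Latitude of the location")] double latitude,
           [Description("Longitude of the location")] double longitude,
           [Description("Number of forecast days")] int days) =&amp;gt;
    {
        logger.LogInformation("Fetching {Days}-day forecast for {Lat},{Lon}",
            days, latitude, longitude);
        return await weatherService.GetForecastAsync(latitude, longitude, days);
    };&lt;/CODE&gt;&lt;/PRE&gt;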
&lt;H2&gt;Multi-Phase Workflow Orchestration&lt;/H2&gt;
&lt;P&gt;The &lt;CODE&gt;TravelPlanningWorkflow&lt;/CODE&gt; class coordinates all six agents. The key insight is that the orchestration is &lt;EM&gt;just C# code&lt;/EM&gt; — no YAML, no graph DSL, no special runtime. You decide when agents run, what context they receive, and how results flow between phases.&lt;/P&gt;
&lt;!-- SCREENSHOT: The TravelPlanningWorkflow.cs file showing Phase 1 (Task.WhenAll) and the beginning of Phase 2, highlighting the parallel-then-sequential pattern. --&gt;
&lt;PRE&gt;&lt;CODE&gt;// Phase 1: Parallel Information Gathering
var gatheringTasks = new[]
{
    GatherCurrencyInfoAsync(request, state, progress, cancellationToken),
    GatherWeatherInfoAsync(request, state, progress, cancellationToken),
    GatherLocalKnowledgeAsync(request, state, progress, cancellationToken)
};
await Task.WhenAll(gatheringTasks);&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;After Phase 1 completes, results are stored in a &lt;CODE&gt;WorkflowState&lt;/CODE&gt; object — a simple dictionary-backed container that holds per-agent chat histories and contextual data:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// WorkflowState.cs
public Dictionary&amp;lt;string, object&amp;gt; Context { get; set; } = new();
public Dictionary&amp;lt;string, List&amp;lt;ChatMessage&amp;gt;&amp;gt; AgentChatHistories { get; set; } = new();&lt;/CODE&gt;&lt;/PRE&gt;
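&lt;P&gt;The &lt;CODE&gt;GetFromContext&lt;/CODE&gt; and &lt;CODE&gt;GetChatHistory&lt;/CODE&gt; accessors used throughout the workflow are thin wrappers over those two dictionaries. A minimal sketch of what they might look like (the repo's actual implementation may differ in details):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;public T? GetFromContext&amp;lt;T&amp;gt;(string key) =&amp;gt;
    Context.TryGetValue(key, out var value) &amp;amp;&amp;amp; value is T typed ? typed : default;

public List&amp;lt;ChatMessage&amp;gt; GetChatHistory(string agentName)
{
    // Lazily create a per-agent history so each agent accumulates its own context
    if (!AgentChatHistories.TryGetValue(agentName, out var history))
    {
        history = new List&amp;lt;ChatMessage&amp;gt;();
        AgentChatHistories[agentName] = history;
    }
    return history;
}&lt;/CODE&gt;&lt;/PRE&gt;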
&lt;P&gt;Phases 2–4 run sequentially, each pulling context from the previous phase. For example, the Itinerary Planner receives weather and local knowledge gathered in Phase 1:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;var localKnowledge = state.GetFromContext&amp;lt;string&amp;gt;("LocalKnowledge") ?? "";
var weatherAdvice = state.GetFromContext&amp;lt;string&amp;gt;("WeatherAdvice") ?? "";

var itineraryChatHistory = state.GetChatHistory("ItineraryPlanner");
itineraryChatHistory.Add(new ChatMessage(ChatRole.User,
    $"Create a detailed {days}-day itinerary for {request.Destination}..."
    + $"\n\nWEATHER INFORMATION:\n{weatherAdvice}"
    + $"\n\nLOCAL KNOWLEDGE &amp;amp; TIPS:\n{localKnowledge}"));

var itineraryResponse = await _itineraryAgent.InvokeAsync(
    itineraryChatHistory, cancellationToken);&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This pattern — parallel fan-out followed by sequential context enrichment — is simple, testable, and easy to extend. Need a seventh agent? Add it to the appropriate phase and wire it into &lt;CODE&gt;WorkflowState&lt;/CODE&gt;.&lt;/P&gt;
&lt;H2&gt;Async Request-Reply Pattern&lt;/H2&gt;
&lt;P&gt;A multi-agent workflow with six LLM calls (some with tool invocations) can easily run 30–60 seconds. That's well beyond typical HTTP timeout expectations and not a great user experience for a synchronous request. We use the &lt;STRONG&gt;Async Request-Reply pattern&lt;/STRONG&gt; to handle this:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;The API receives the travel plan request and immediately queues a message to &lt;STRONG&gt;Service Bus&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;It stores an initial task record in &lt;STRONG&gt;Cosmos DB&lt;/STRONG&gt; with status &lt;CODE&gt;queued&lt;/CODE&gt; and returns a &lt;CODE&gt;taskId&lt;/CODE&gt; to the client.&lt;/LI&gt;
&lt;LI&gt;A &lt;STRONG&gt;continuous WebJob&lt;/STRONG&gt; (running as a separate process on the same App Service plan) picks up the message, executes the full multi-agent workflow, and writes the result back to Cosmos DB.&lt;/LI&gt;
&lt;LI&gt;The client polls the API for status updates until the task reaches &lt;CODE&gt;completed&lt;/CODE&gt;.&lt;/LI&gt;
&lt;/OL&gt;
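&lt;P&gt;Steps 1 and 2 boil down to an API endpoint of roughly this shape. This is an illustrative sketch: the route, container names, and message shape are assumptions, not the repo's exact code:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;app.MapPost("/api/travel-plans", async (TravelPlanRequest request,
    ServiceBusSender sender, CosmosClient cosmos) =&amp;gt;
{
    var taskId = Guid.NewGuid().ToString();

    // Store the initial task record so the client can start polling immediately
    var tasks = cosmos.GetContainer("travel", "tasks");
    await tasks.CreateItemAsync(new { id = taskId, status = "queued", request });

    // Hand the heavy multi-agent work to the WebJob via Service Bus
    await sender.SendMessageAsync(new ServiceBusMessage(
        BinaryData.FromObjectAsJson(new { taskId, request })));

    // 202 Accepted plus a status URL the client can poll
    return Results.Accepted($"/api/travel-plans/{taskId}", new { taskId });
});&lt;/CODE&gt;&lt;/PRE&gt;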
&lt;P&gt;This pattern keeps the API responsive, makes the heavy work retriable (Service Bus handles retries and dead-lettering), and lets the WebJob run independently — you can restart it without affecting the API. We covered this pattern in detail in &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/part-3-client-side-multi-agent-orchestration-on-azure-app-service-with-microsoft/4466728" target="_blank" rel="noopener" data-lia-auto-title="the previous series" data-lia-auto-title-active="0"&gt;the previous series&lt;/A&gt;, so we won't repeat the plumbing here.&lt;/P&gt;
&lt;H2&gt;Deploy with &lt;CODE&gt;azd&lt;/CODE&gt;&lt;/H2&gt;
&lt;P&gt;The repo is wired up with the Azure Developer CLI for one-command provisioning and deployment:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;git clone https://github.com/seligj95/app-service-multi-agent-maf-otel.git
cd app-service-multi-agent-maf-otel
azd auth login
azd up&lt;/CODE&gt;&lt;/PRE&gt;
&lt;!-- SCREENSHOT: Terminal output of a successful `azd up` showing the provisioned resources and the deployed endpoint URL. --&gt;
&lt;P&gt;&lt;CODE&gt;azd up&lt;/CODE&gt; provisions the following resources via Bicep:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Azure App Service (P0v4 Windows) with a continuous WebJob&lt;/LI&gt;
&lt;LI&gt;Azure Service Bus namespace and queue&lt;/LI&gt;
&lt;LI&gt;Azure Cosmos DB account, database, and containers&lt;/LI&gt;
&lt;LI&gt;Azure AI Services (Azure OpenAI with GPT-4o deployment)&lt;/LI&gt;
&lt;LI&gt;Application Insights and Log Analytics workspace&lt;/LI&gt;
&lt;LI&gt;Managed Identity with all necessary role assignments&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;After deployment completes, &lt;CODE&gt;azd&lt;/CODE&gt; outputs the App Service URL. Open it in your browser, fill in the travel form, and watch six agents collaborate on your trip plan in real time.&lt;/P&gt;
&lt;img /&gt;&lt;!-- SCREENSHOT: The travel planner web UI showing a completed travel plan with the progress bar at 100% and the formatted itinerary displayed below. --&gt;
&lt;H2&gt;What's Next&lt;/H2&gt;
&lt;P&gt;We now have a production-ready multi-agent app running on App Service with the GA Microsoft Agent Framework. But how do you actually &lt;EM&gt;observe&lt;/EM&gt; what these agents are doing? When six agents are making LLM calls, invoking tools, and passing context between phases — you need visibility into every step.&lt;/P&gt;
&lt;P&gt;In the &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/monitor-ai-agents-on-app-service-with-opentelemetry-and-the-new-application-insi/4510023" data-lia-auto-title="next post" data-lia-auto-title-active="0" target="_blank"&gt;&lt;STRONG&gt;next post&lt;/STRONG&gt;&lt;/A&gt;, we'll dive deep into how we instrumented these agents with &lt;STRONG&gt;OpenTelemetry&lt;/STRONG&gt; and the new &lt;STRONG&gt;Agents (Preview)&lt;/STRONG&gt; view in &lt;STRONG&gt;Application Insights&lt;/STRONG&gt; — giving you full visibility into agent runs, token usage, tool calls, and model performance. You already saw the &lt;CODE&gt;.UseOpenTelemetry()&lt;/CODE&gt; call in the builder pipeline; Blog 2 shows what that telemetry looks like end to end and how to light up the new Agents experience in the Azure portal.&lt;/P&gt;
&lt;P&gt;Stay tuned!&lt;/P&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/seligj95/app-service-multi-agent-maf-otel" target="_blank" rel="noopener"&gt;Sample repo — app-service-multi-agent-maf-otel&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://devblogs.microsoft.com/semantic-kernel/microsoft-agent-framework-1-0-is-now-generally-available/" target="_blank" rel="noopener"&gt;Microsoft Agent Framework 1.0 GA Announcement&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/semantic-kernel/frameworks/agent/" target="_blank" rel="noopener"&gt;Microsoft Agent Framework Documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/part-3-client-side-multi-agent-orchestration-on-azure-app-service-with-microsoft/4466728" target="_blank" rel="noopener" data-lia-auto-title="Previous Series — Part 3: Client-Side Multi-Agent Orchestration on App Service" data-lia-auto-title-active="0"&gt;Previous Series — Part 3: Client-Side Multi-Agent Orchestration on App Service&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/dotnet/api/microsoft.extensions.ai" target="_blank" rel="noopener"&gt;Microsoft.Extensions.AI Documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/app-service/" target="_blank" rel="noopener"&gt;Azure App Service Documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Blog 2: &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/monitor-ai-agents-on-app-service-with-opentelemetry-and-the-new-application-insi/4510023" data-lia-auto-title="Monitor AI Agents on App Service with OpenTelemetry and the New Application Insights Agents View | Microsoft Community Hub" data-lia-auto-title-active="0" target="_blank"&gt;Monitor AI Agents on App Service with OpenTelemetry and the New Application Insights Agents View&lt;/A&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Blog 3: &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/govern-ai-agents-on-app-service-with-the-microsoft-agent-governance-toolkit/4510962" data-lia-auto-title="Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit | Microsoft Community Hub" data-lia-auto-title-active="0" target="_blank"&gt;Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit&lt;/A&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 14 Apr 2026 16:33:13 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/build-multi-agent-ai-apps-on-azure-app-service-with-microsoft/ba-p/4510017</guid>
      <dc:creator>jordanselig</dc:creator>
      <dc:date>2026-04-14T16:33:13Z</dc:date>
    </item>
    <item>
      <title>Deploying to Azure Web App from Azure DevOps Using UAMI</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/deploying-to-azure-web-app-from-azure-devops-using-uami/ba-p/4509800</link>
      <description>&lt;P&gt;TOC&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;UAMI Configuration&lt;/LI&gt;
&lt;LI&gt;App Configuration&lt;/LI&gt;
&lt;LI&gt;Azure DevOps Configuration&lt;/LI&gt;
&lt;LI&gt;Logs&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2&gt;UAMI Configuration&lt;/H2&gt;
&lt;P&gt;Create a&amp;nbsp;&lt;STRONG&gt;User Assigned Managed Identity&lt;/STRONG&gt; with no additional configuration.&lt;BR /&gt;This identity will be referenced in later steps, in particular its &lt;STRONG&gt;Object ID&lt;/STRONG&gt;.&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;App Configuration&lt;/H2&gt;
&lt;P&gt;On an existing&amp;nbsp;&lt;STRONG&gt;Azure Web App&lt;/STRONG&gt;, enable &lt;STRONG&gt;Diagnostic Settings&lt;/STRONG&gt; and configure it to retain certain types of logs, such as &lt;STRONG&gt;Access Audit Logs&lt;/STRONG&gt;.&lt;BR /&gt;These logs will be discussed in the final section of this article.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Next, navigate to &lt;STRONG&gt;Access Control (IAM)&lt;/STRONG&gt; and assign the previously created &lt;STRONG&gt;User Assigned Managed Identity&lt;/STRONG&gt; the &lt;STRONG&gt;Website Contributor&lt;/STRONG&gt; role.&lt;/P&gt;
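&lt;P&gt;If you prefer the CLI, the same two steps look roughly like this (resource names and IDs are placeholders):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;# Create the User Assigned Managed Identity
az identity create --name uami-devops-deploy --resource-group my-rg

# Grant it Website Contributor on the target Web App
az role assignment create \
  --assignee-object-id &amp;lt;uami-object-id&amp;gt; \
  --assignee-principal-type ServicePrincipal \
  --role "Website Contributor" \
  --scope /subscriptions/&amp;lt;subscription-id&amp;gt;/resourceGroups/my-rg/providers/Microsoft.Web/sites/&amp;lt;app-name&amp;gt;&lt;/CODE&gt;&lt;/PRE&gt;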
&lt;img /&gt;
&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Azure DevOps Configuration&lt;/H2&gt;
&lt;P&gt;Go to&amp;nbsp;&lt;STRONG&gt;Azure DevOps → Project Settings → Service Connections&lt;/STRONG&gt;, and create a new &lt;STRONG&gt;ARM (Azure Resource Manager) connection&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;While creating the connection:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Select the corresponding &lt;STRONG&gt;User Assigned Managed Identity&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;Grant it appropriate permissions at the &lt;STRONG&gt;Resource Group&lt;/STRONG&gt; level&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;During this process, you will be prompted to sign in again using your own account.&lt;BR /&gt;This authentication will later be reflected in the deployment logs discussed below.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Assuming the following deployment template is used in the pipeline, you will notice that &lt;STRONG&gt;additional steps appear in the deployment process&lt;/STRONG&gt; compared to traditional service principal–based authentication.&lt;/P&gt;
&lt;img /&gt;
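&lt;P&gt;A minimal template of the kind shown above looks roughly like this (illustrative; the service connection and app names are placeholders):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;steps:
- task: AzureWebApp@1
  inputs:
    # The ARM service connection backed by the UAMI created earlier
    azureSubscription: 'uami-arm-connection'
    appName: 'my-web-app'
    package: '$(System.DefaultWorkingDirectory)/**/*.zip'&lt;/CODE&gt;&lt;/PRE&gt;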
&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Logs&lt;/H2&gt;
&lt;P&gt;A few minutes after deployment, related log records will appear.&lt;/P&gt;
&lt;P&gt;In the &lt;STRONG&gt;AppServiceAuditLogs&lt;/STRONG&gt; table, you can observe that the &lt;STRONG&gt;deployment initiator&lt;/STRONG&gt; is shown as &lt;STRONG&gt;the Object ID from UAMI&lt;/STRONG&gt;, and the &lt;STRONG&gt;Source&lt;/STRONG&gt; is listed as &lt;STRONG&gt;Azure (DevOps)&lt;/STRONG&gt;.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;This indicates that the &lt;STRONG&gt;User Assigned Managed Identity is authorized under your user context&lt;/STRONG&gt;, while the deployment action itself is initiated by Azure DevOps.&lt;/P&gt;
</description>
      <pubDate>Thu, 09 Apr 2026 05:53:29 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/deploying-to-azure-web-app-from-azure-devops-using-uami/ba-p/4509800</guid>
      <dc:creator>theringe</dc:creator>
      <dc:date>2026-04-09T05:53:29Z</dc:date>
    </item>
    <item>
      <title>Build and Host MCP Apps on Azure App Service</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/build-and-host-mcp-apps-on-azure-app-service/ba-p/4509705</link>
      <description>&lt;P&gt;MCP Apps are here, and they're a game-changer for building AI tools with interactive UIs. If you've been following the Model Context Protocol (MCP) ecosystem, you've probably heard about the &lt;A class="lia-external-url" href="https://modelcontextprotocol.io/extensions/apps/overview" target="_blank"&gt;MCP Apps spec&lt;/A&gt; — the first official MCP extension that lets your tools return rich, interactive UIs that render directly inside AI chat clients like Claude Desktop, ChatGPT, VS Code Copilot, Goose, and Postman.&lt;/P&gt;
&lt;P&gt;And here's the best part: you can host them on Azure App Service. In this post, I'll walk you through building a weather widget MCP App and deploying it to App Service. You'll have a production-ready MCP server serving interactive UIs in under 10 minutes.&lt;/P&gt;
&lt;H3&gt;What Are MCP Apps?&lt;/H3&gt;
&lt;P&gt;MCP Apps extend the Model Context Protocol by combining &lt;STRONG&gt;tools&lt;/STRONG&gt; (the functions your AI client can call) with &lt;STRONG&gt;UI resources&lt;/STRONG&gt; (the interactive interfaces that display the results). The pattern is simple:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;A tool declares a &lt;CODE&gt;_meta.ui.resourceUri&lt;/CODE&gt; in its metadata&lt;/LI&gt;
&lt;LI&gt;When the tool is invoked, the MCP host fetches that UI resource&lt;/LI&gt;
&lt;LI&gt;The UI renders in a sandboxed iframe inside the chat client&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The key insight? &lt;STRONG&gt;MCP Apps are just web apps&lt;/STRONG&gt; — HTML, JavaScript, and CSS served through MCP. And that's exactly what App Service does best.&lt;/P&gt;
&lt;P&gt;The MCP Apps spec supports cross-client rendering, so the same UI works in Claude Desktop, VS Code Copilot, ChatGPT, and other MCP-enabled clients. Your weather widget, map viewer, or data dashboard becomes a universal component in the AI ecosystem.&lt;/P&gt;
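&lt;P&gt;In a &lt;CODE&gt;tools/list&lt;/CODE&gt; response, the tool-to-UI linkage looks roughly like this (an illustrative shape; check the MCP Apps spec for the exact schema and URI conventions):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;{
  "name": "get_weather",
  "description": "Get the forecast for a location",
  "_meta": {
    "ui": { "resourceUri": "ui://weather/widget.html" }
  }
}&lt;/CODE&gt;&lt;/PRE&gt;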
&lt;H3&gt;Why App Service for MCP Apps?&lt;/H3&gt;
&lt;P&gt;Azure App Service is a natural fit for hosting MCP Apps. Here's why:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Always On&lt;/STRONG&gt; — No cold starts. Your UI resources are served instantly, every time.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Easy Auth&lt;/STRONG&gt; — Secure your MCP endpoint with Entra ID authentication out of the box, no code required.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Custom domains + TLS&lt;/STRONG&gt; — Professional MCP server endpoints with your own domain and managed certificates.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Deployment slots&lt;/STRONG&gt; — Canary and staged rollouts for MCP App updates without downtime.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Sidecars&lt;/STRONG&gt; — Run backend services (Redis, message queues, monitoring agents) alongside your MCP server.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;App Insights&lt;/STRONG&gt; — Built-in telemetry to see which tools and UIs are being invoked, response times, and error rates.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Now, these are all capabilities you &lt;EM&gt;can&lt;/EM&gt; add to a production MCP App, but the sample we're building today keeps things simple. We're focusing on the core pattern: serving MCP tools with interactive UIs from App Service. The production features are there when you need them.&lt;/P&gt;
&lt;H3&gt;When to Use Functions vs App Service for MCP Apps&lt;/H3&gt;
&lt;P&gt;Before we dive into the code, let's talk about &lt;STRONG&gt;Azure Functions&lt;/STRONG&gt;. The Functions team has done great work with their &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-functions/scenario-mcp-apps" target="_blank"&gt;MCP Apps quickstart&lt;/A&gt;, and if serverless is your preferred model, that's a fantastic option. Functions and App Service both host MCP Apps beautifully — they just serve different needs.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&amp;nbsp;&lt;/th&gt;&lt;th&gt;Azure Functions&lt;/th&gt;&lt;th&gt;Azure App Service&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Best for&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;New, purpose-built MCP Apps that benefit from serverless scaling&lt;/td&gt;&lt;td&gt;MCP Apps that need always-on hosting, persistent state, or are part of larger web apps&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Scaling&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Scale to zero, pay per invocation&lt;/td&gt;&lt;td&gt;Dedicated plans, always running&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Cold start&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Possible (mitigated by premium plan)&lt;/td&gt;&lt;td&gt;None (Always On)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Deployment&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;CODE&gt;azd up&lt;/CODE&gt; with Functions template&lt;/td&gt;&lt;td&gt;&lt;CODE&gt;azd up&lt;/CODE&gt; with App Service template&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;MCP Apps quickstart&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-functions/scenario-mcp-apps" target="_blank"&gt;Available&lt;/A&gt;&lt;/td&gt;&lt;td&gt;This blog post!&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Additional capabilities&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Event-driven triggers, durable functions&lt;/td&gt;&lt;td&gt;Easy Auth, custom domains, deployment slots, sidecars&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Think of it this way: if you're building a new MCP App from scratch and want serverless economics, go with Functions. If you're adding MCP capabilities to an existing web app, need zero cold starts, or want production features like Easy Auth and deployment slots, App Service is your friend.&lt;/P&gt;
&lt;H3&gt;Build the Weather Widget MCP App&lt;/H3&gt;
&lt;P&gt;Let's build a simple MCP App that fetches weather data from the Open-Meteo API and displays it in an interactive widget. The sample uses ASP.NET Core for the MCP server and Vite for the frontend UI.&lt;/P&gt;
&lt;P&gt;Here's the structure:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;app-service-mcp-app-sample/
├── src/
│   ├── Program.cs              # MCP server setup
│   ├── WeatherTool.cs          # Weather tool with UI metadata
│   ├── WeatherUIResource.cs    # MCP resource serving the UI
│   ├── WeatherService.cs       # Open-Meteo API integration
│   └── app/                    # Vite frontend (weather widget)
│       └── src/
│           └── weather-app.ts  # MCP Apps SDK integration
├── .vscode/
│   └── mcp.json                # VS Code MCP server config
├── azure.yaml                  # Azure Developer CLI config
└── infra/                      # Bicep infrastructure
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;H4&gt;Program.cs — MCP Server Setup&lt;/H4&gt;
&lt;P&gt;The MCP server is an ASP.NET Core app that registers tools and UI resources:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;using ModelContextProtocol;

var builder = WebApplication.CreateBuilder(args);

// Register WeatherService
builder.Services.AddSingleton&amp;lt;WeatherService&amp;gt;(sp =&amp;gt;
    new WeatherService(WeatherService.CreateDefaultClient()));

// Add MCP Server with HTTP transport, tools, and resources
builder.Services.AddMcpServer()
    .WithHttpTransport(t =&amp;gt; t.Stateless = true)
    .WithTools&amp;lt;WeatherTool&amp;gt;()
    .WithResources&amp;lt;WeatherUIResource&amp;gt;();

var app = builder.Build();

// Map MCP endpoints (no auth required for this sample)
app.MapMcp("/mcp").AllowAnonymous();

app.Run();
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;CODE&gt;AddMcpServer()&lt;/CODE&gt; configures the MCP protocol handler. &lt;CODE&gt;WithHttpTransport()&lt;/CODE&gt; enables Streamable HTTP with stateless mode (no session management needed). &lt;CODE&gt;WithTools&amp;lt;WeatherTool&amp;gt;()&lt;/CODE&gt; registers our weather tool, and &lt;CODE&gt;WithResources&amp;lt;WeatherUIResource&amp;gt;()&lt;/CODE&gt; registers the UI resource that the MCP host will fetch and render. &lt;CODE&gt;MapMcp("/mcp")&lt;/CODE&gt; maps the MCP endpoint at &lt;CODE&gt;/mcp&lt;/CODE&gt;.&lt;/P&gt;
&lt;H4&gt;WeatherTool.cs — Tool with UI Metadata&lt;/H4&gt;
&lt;P&gt;The &lt;CODE&gt;WeatherTool&lt;/CODE&gt; class defines the tool and uses the &lt;CODE&gt;[McpMeta]&lt;/CODE&gt; attribute to declare a &lt;CODE&gt;ui&lt;/CODE&gt; metadata block containing the &lt;CODE&gt;resourceUri&lt;/CODE&gt;. This tells the MCP host where to fetch the interactive UI:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;using System.ComponentModel;
using ModelContextProtocol.Server;

[McpServerToolType]
public class WeatherTool
{
    private readonly WeatherService _weatherService;

    public WeatherTool(WeatherService weatherService)
    {
        _weatherService = weatherService;
    }

    [McpServerTool]
    [Description("Get current weather for a location via Open-Meteo. Returns weather data that displays in an interactive widget.")]
    [McpMeta("ui", JsonValue = """{"resourceUri": "ui://weather/index.html"}""")]
    public async Task&amp;lt;object&amp;gt; GetWeather(
        [Description("City name to check weather for (e.g., Seattle, New York, Miami)")]
        string location)
    {
        var result = await _weatherService.GetCurrentWeatherAsync(location);
        return result;
    }
}
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The key line is the &lt;CODE&gt;[McpMeta("ui", ...)]&lt;/CODE&gt; attribute. This adds &lt;CODE&gt;_meta.ui.resourceUri&lt;/CODE&gt; to the tool definition, pointing to the &lt;CODE&gt;ui://weather/index.html&lt;/CODE&gt; resource. When the AI client calls this tool, the host fetches that resource and renders it in a sandboxed iframe alongside the tool result.&lt;/P&gt;
&lt;H4&gt;WeatherUIResource.cs — UI Resource&lt;/H4&gt;
&lt;P&gt;The UI resource class serves the bundled HTML as an MCP resource with the &lt;CODE&gt;ui://&lt;/CODE&gt; scheme and &lt;CODE&gt;text/html;profile=mcp-app&lt;/CODE&gt; MIME type required by the MCP Apps spec:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;using ModelContextProtocol.Protocol;
using ModelContextProtocol.Server;

[McpServerResourceType]
public class WeatherUIResource
{
    [McpServerResource(
        UriTemplate = "ui://weather/index.html",
        Name = "weather_ui",
        MimeType = "text/html;profile=mcp-app")]
    public static ResourceContents GetWeatherUI()
    {
        var filePath = Path.Combine(
            AppContext.BaseDirectory, "app", "dist", "index.html");
        var html = File.ReadAllText(filePath);

        return new TextResourceContents
        {
            Uri = "ui://weather/index.html",
            MimeType = "text/html;profile=mcp-app",
            Text = html
        };
    }
}
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The &lt;CODE&gt;[McpServerResource]&lt;/CODE&gt; attribute registers this method as the handler for the &lt;CODE&gt;ui://weather/index.html&lt;/CODE&gt; resource. When the host fetches it, the bundled single-file HTML (built by Vite) is returned with the correct MIME type.&lt;/P&gt;
&lt;H4&gt;WeatherService.cs — Open-Meteo API Integration&lt;/H4&gt;
&lt;P&gt;The &lt;CODE&gt;WeatherService&lt;/CODE&gt; class handles geocoding and weather data from the &lt;A class="lia-external-url" href="https://open-meteo.com/" target="_blank"&gt;Open-Meteo API&lt;/A&gt;. Nothing MCP-specific here — it's just a standard HTTP client that geocodes a city name and fetches current weather observations.&lt;/P&gt;
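The post doesn't show that class, so here's a minimal TypeScript sketch of the same two Open-Meteo calls (geocode the city, then fetch current conditions). The helper names and the exact query parameters are illustrative assumptions; the sample's C# version in WeatherService.cs is the source of truth:

```typescript
// Hypothetical sketch of the flow WeatherService performs:
// 1) geocode a city name, 2) fetch current weather for the coordinates.
const GEOCODE_BASE = "https://geocoding-api.open-meteo.com/v1/search";
const FORECAST_BASE = "https://api.open-meteo.com/v1/forecast";

function geocodeUrl(city: string): string {
  // Resolve a city name to latitude/longitude (first match only).
  return `${GEOCODE_BASE}?name=${encodeURIComponent(city)}&count=1`;
}

function currentWeatherUrl(lat: number, lon: number): string {
  // current_weather=true asks for temperature, wind, and a weather code.
  return `${FORECAST_BASE}?latitude=${lat}&longitude=${lon}&current_weather=true`;
}

async function getCurrentWeather(city: string) {
  const geo = await (await fetch(geocodeUrl(city))).json();
  const { latitude, longitude, name } = geo.results[0];
  const wx = await (await fetch(currentWeatherUrl(latitude, longitude))).json();
  return { location: name, ...wx.current_weather };
}
```

Because the service is plain HTTP, you can swap in any weather backend without touching the MCP-specific code.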
&lt;H4&gt;The UI Resource (Vite Frontend)&lt;/H4&gt;
&lt;P&gt;The &lt;CODE&gt;app/&lt;/CODE&gt; directory contains a TypeScript app built with Vite that renders the weather widget. It uses the &lt;CODE&gt;@modelcontextprotocol/ext-apps&lt;/CODE&gt; SDK to communicate with the host:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import { App } from "@modelcontextprotocol/ext-apps";

const app = new App({ name: "Weather Widget", version: "1.0.0" });

// Handle tool results from the server
app.ontoolresult = (params) =&amp;gt; {
  const data = parseToolResultContent(params.content);
  if (data) render(data);
};

// Adapt to host theme (light/dark)
app.onhostcontextchanged = (ctx) =&amp;gt; {
  if (ctx.theme) applyTheme(ctx.theme);
};

await app.connect();
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The SDK's &lt;CODE&gt;App&lt;/CODE&gt; class handles the postMessage communication with the host. When the tool returns weather data, &lt;CODE&gt;ontoolresult&lt;/CODE&gt; fires and the widget renders the temperature, conditions, humidity, and wind. The app also adapts to the host's theme so it looks native in both light and dark mode.&lt;/P&gt;
&lt;P&gt;The frontend is bundled into a single &lt;CODE&gt;index.html&lt;/CODE&gt; file using Vite and the &lt;CODE&gt;vite-plugin-singlefile&lt;/CODE&gt; plugin, which inlines all JavaScript and CSS. This makes it easy to serve as a single MCP resource.&lt;/P&gt;
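For reference, the single-file build boils down to one plugin entry in the Vite config. This is a sketch under the assumption of a default setup; the sample repo's actual vite.config.ts may add more options:

```typescript
// vite.config.ts — minimal sketch, not the sample's exact config.
import { defineConfig } from "vite";
import { viteSingleFile } from "vite-plugin-singlefile";

export default defineConfig({
  // Inline all JavaScript and CSS into dist/index.html so the server
  // can return the whole widget as a single MCP resource.
  plugins: [viteSingleFile()],
});
```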
&lt;H3&gt;Run Locally&lt;/H3&gt;
&lt;P&gt;To run the sample locally, you'll need the &lt;A class="lia-external-url" href="https://dotnet.microsoft.com/download/dotnet/9.0" target="_blank"&gt;.NET 9 SDK&lt;/A&gt; and &lt;A class="lia-external-url" href="https://nodejs.org/" target="_blank"&gt;Node.js 18+&lt;/A&gt; installed. Clone the repo and run:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;# Clone the repo
git clone https://github.com/seligj95/app-service-mcp-app-sample.git
cd app-service-mcp-app-sample

# Build the frontend
cd src/app
npm install
npm run build

# Run the MCP server
cd ..
dotnet run
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The server starts on &lt;CODE&gt;http://localhost:5000&lt;/CODE&gt;. Now connect from VS Code Copilot:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Open your workspace in VS Code&lt;/LI&gt;
&lt;LI&gt;The sample includes a &lt;CODE&gt;.vscode/mcp.json&lt;/CODE&gt; that configures the local MCP server:
&lt;PRE&gt;&lt;CODE&gt;{
  "servers": {
    "local-mcp-appservice": {
      "type": "http",
      "url": "http://localhost:5000/mcp"
    }
  }
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/LI&gt;
&lt;LI&gt;Open the GitHub Copilot Chat panel&lt;/LI&gt;
&lt;LI&gt;Ask: &lt;STRONG&gt;"What's the weather in Seattle?"&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Copilot will invoke the &lt;CODE&gt;GetWeather&lt;/CODE&gt; tool, and the interactive weather widget will render inline in the chat:&lt;/P&gt;
&lt;img&gt;
&lt;P&gt;Weather widget MCP App rendering inline in VS Code Copilot Chat&lt;/P&gt;
&lt;/img&gt;
&lt;H3&gt;Deploy to Azure&lt;/H3&gt;
&lt;P&gt;Deploying to Azure is even easier. The sample includes an &lt;CODE&gt;azure.yaml&lt;/CODE&gt; file and Bicep templates for App Service, so you can deploy with a single command:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;cd app-service-mcp-app-sample
azd auth login
azd up
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;CODE&gt;azd up&lt;/CODE&gt; will:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Provision an App Service plan and web app in your subscription&lt;/LI&gt;
&lt;LI&gt;Build the .NET app and Vite frontend&lt;/LI&gt;
&lt;LI&gt;Deploy the app to App Service&lt;/LI&gt;
&lt;LI&gt;Output the public MCP endpoint URL&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;After deployment, &lt;CODE&gt;azd&lt;/CODE&gt; will output a URL like &lt;CODE&gt;https://app-abc123.azurewebsites.net&lt;/CODE&gt;. Update your &lt;CODE&gt;.vscode/mcp.json&lt;/CODE&gt; to point to the remote server:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;{
  "servers": {
    "remote-weather-app": {
      "type": "http",
      "url": "https://app-abc123.azurewebsites.net/mcp"
    }
  }
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;From that point forward, your MCP App is live. Any AI client that supports MCP Apps can invoke your weather tool and render the interactive widget — no local server required.&lt;/P&gt;
&lt;H3&gt;What's Next?&lt;/H3&gt;
&lt;P&gt;You've now built and deployed an MCP App to Azure App Service. Here's what you can explore next:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Read the &lt;A class="lia-external-url" href="https://modelcontextprotocol.io/extensions/apps/overview" target="_blank"&gt;MCP Apps spec&lt;/A&gt;&lt;/STRONG&gt; to understand the full capabilities of the extension, including input forms, persistent state, and multi-step workflows.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Check out the &lt;A class="lia-external-url" href="https://github.com/modelcontextprotocol/ext-apps/tree/main/examples" target="_blank"&gt;ext-apps examples&lt;/A&gt;&lt;/STRONG&gt; on GitHub — there are samples for map viewers, PDF renderers, system monitors, and more.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Try the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-functions/scenario-mcp-apps" target="_blank"&gt;Azure Functions MCP Apps quickstart&lt;/A&gt;&lt;/STRONG&gt; if you want to build a serverless MCP App.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Learn about &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/app-service/scenario-ai-model-context-protocol-server?tabs=dotnet" data-lia-auto-title="hosting remote MCP servers in App Service" data-lia-auto-title-active="0" target="_blank"&gt;hosting remote MCP servers in App Service&lt;/A&gt;&lt;/STRONG&gt; for more patterns and best practices.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Clone the &lt;A class="lia-external-url" href="https://github.com/seligj95/app-service-mcp-app-sample" target="_blank"&gt;sample repo&lt;/A&gt;&lt;/STRONG&gt; and customize it for your own use cases.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;And remember: App Service gives you a full production hosting platform for your MCP Apps. You can add Easy Auth to secure your endpoints with Entra ID, wire up App Insights for telemetry, configure custom domains and TLS certificates, and set up deployment slots for blue/green rollouts. These features make App Service a great choice when you're ready to take your MCP App to production.&lt;/P&gt;
&lt;P&gt;If you build something cool with MCP Apps and App Service, let me know — I'd love to see what you create!&lt;/P&gt;</description>
      <pubDate>Wed, 08 Apr 2026 18:10:03 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/build-and-host-mcp-apps-on-azure-app-service/ba-p/4509705</guid>
      <dc:creator>jordanselig</dc:creator>
      <dc:date>2026-04-08T18:10:03Z</dc:date>
    </item>
    <item>
      <title>3 Ways to Get More from Azure SRE Agent</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/3-ways-to-get-more-from-azure-sre-agent/ba-p/4508993</link>
      <description>&lt;P&gt;When you first set up Azure SRE Agent, it’s tempting to give it everything. Connect all your alert sources, route every severity, set up scheduled tasks to poll your channels every 30 seconds. The agent can handle all of it.&lt;/P&gt;
&lt;P&gt;But a few simple configuration choices can help you get more value from every token the agent uses. Each investigation creates a conversation thread, and each thread consumes tokens. With the right setup, you can make sure the agent is spending those tokens on the work that has the highest impact.&lt;/P&gt;
&lt;P&gt;The pattern that works best: start focused, see results, and expand from there. Here are three ways to do that.&lt;/P&gt;
&lt;H2&gt;1. Start with the incidents that matter most&lt;/H2&gt;
&lt;P&gt;It's natural to want full coverage from day one. But in practice, starting narrow and expanding works better. When you route only high-severity or high-impact incidents to the agent first, you get to see the quality of its investigations on the work that matters most. Once you trust the output, expanding to broader coverage is a confident decision, not a leap of faith.&lt;/P&gt;
&lt;P&gt;The mechanism for this is your &lt;STRONG&gt;incident response plan&lt;/STRONG&gt;. Instead of relying on a default handler that routes everything, create a targeted response plan with filters that match the incidents you want the agent to investigate.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Incident response plan filters: severity, title keywords, and exclusions.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Getting started:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Go to &lt;STRONG&gt;Response plan configuration&lt;/STRONG&gt; and create a new incident response plan.&lt;/LI&gt;
&lt;LI&gt;Set the &lt;STRONG&gt;Severity &lt;/STRONG&gt;filter. A good starting point is Sev0 through Sev2. These are the incidents where deep investigation has the highest impact.&lt;/LI&gt;
&lt;LI&gt;Use &lt;STRONG&gt;Title contains&lt;/STRONG&gt; to focus on specific incident patterns, or &lt;STRONG&gt;Title does not contain&lt;/STRONG&gt; to exclude known noisy alerts.&lt;/LI&gt;
&lt;LI&gt;Preview the filter results to see which past incidents would have matched.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;As you see results and get comfortable, widen the filters. Add Sev3. Remove title exclusions. Bring in more incident sources. The agent will handle the volume, and you'll know what the cost looks like because you've been watching it grow incrementally.&lt;/P&gt;
&lt;P&gt;If you already have an agent running with broad filters, it's worth reviewing your response plan. A quick check on your severity and title filters can make sure the agent is spending its time on the incidents you care about.&lt;/P&gt;
&lt;H2&gt;2. Replace high-frequency polling with smarter patterns&lt;/H2&gt;
&lt;P&gt;Scheduled tasks are one of the most powerful features of the agent, but they're also where cost can quietly balloon. The reason is simple: a scheduled task runs on a timer whether or not there's anything to find. An incident investigation fires once per incident. A task polling every 2 minutes fires 720 times a day, and most of those runs may find nothing new.&lt;/P&gt;
&lt;P&gt;High-frequency polling is generally a weak engineering pattern regardless of cost. It wastes compute, creates unnecessary load, and in the case of an AI agent, burns tokens checking for changes that haven't happened. Better patterns exist.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Prefer push over poll.&lt;/STRONG&gt; If the source system can send a signal (an alert, a webhook, a ticket), use that to trigger the agent. Push-based workflows fire only when something happens. This is cheaper and faster than polling.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;When polling is the right fit, batch it.&lt;/STRONG&gt; Instead of checking every 2 minutes, run a thorough check every hour. Twenty-four consolidated hourly reports are more useful than 720 micro-checks that mostly say "nothing changed." The hourly report shows trends. The 2-minute poll shows snapshots.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/appsonazureblog/http-triggers-in-azure-sre-agent-from-jira-ticket-to-automated-investigation/4504960" target="_blank" rel="noopener" data-lia-auto-title="Consider HTTP triggers" data-lia-auto-title-active="0"&gt;Consider HTTP triggers&lt;/A&gt;.&lt;/STRONG&gt; If you have an external system that knows when work is needed (a deployment pipeline, a CI/CD tool, a monitoring platform), use an HTTP trigger to invoke the agent on demand. The agent only runs when there's actually something to do.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Match frequency to the operational cadence.&lt;/STRONG&gt; A Teams channel monitor works fine at 5-minute intervals. Humans don't type that fast. A health summary runs once a day. A shift-handoff report runs once per shift. Ask: how quickly do I actually need to detect this change? The answer is almost always slower than the timer you first set.&lt;/P&gt;
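The arithmetic behind these cadences is worth making explicit. A tiny hypothetical helper (not part of the agent) shows what each interval actually costs per day:

```typescript
// Hypothetical helper: how many scheduled-task runs a polling interval
// triggers in a 24-hour day.
function runsPerDay(intervalMinutes: number): number {
  return Math.floor((24 * 60) / intervalMinutes);
}

// The cadences discussed in the post:
runsPerDay(2);   // 2-minute poll: 720 runs a day, most finding nothing new
runsPerDay(60);  // hourly batch: 24 consolidated checks
runsPerDay(5);   // Teams channel monitor at 5-minute intervals
```

Every one of those runs is a thread turn that consumes tokens, so dropping from a 2-minute poll to an hourly batch is a 30x reduction before you've changed anything else.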
&lt;H2&gt;3. Keep threads fresh&lt;/H2&gt;
&lt;P&gt;Here's a detail that's easy to miss: every time a scheduled task runs, it adds to the same conversation thread. The agent reads the full thread history before responding. So a task that runs hourly accumulates 24 conversations a day in the same thread. After a week, the agent is reading through hundreds of prior exchanges before it even starts on the new work.&lt;/P&gt;
&lt;P&gt;The work stays the same. The cost per run keeps climbing. It's the equivalent of reopening a document and reading the entire thing from page one every time you want to add a sentence at the end.&lt;/P&gt;
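To see why the cost climbs, model it with a simple (hypothetical) assumption: each run appends one exchange to the thread and re-reads everything already there before starting:

```typescript
// Hypothetical model of a single accumulating thread: run i re-reads the
// i-1 exchanges already in the thread, so the total read across N runs is
// 0 + 1 + ... + (N-1).
function totalExchangesRead(runs: number): number {
  return (runs * (runs - 1)) / 2;
}

totalExchangesRead(24);     // one day of hourly runs
totalExchangesRead(7 * 24); // one week: the total grows quadratically
```

The per-run work is constant, but the re-read cost grows with the square of the run count, which is exactly the "reading the document from page one" effect described above.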
&lt;P&gt;The fix is one setting. When creating or editing a scheduled task, set &lt;STRONG&gt;"Message grouping for updates"&lt;/STRONG&gt; to &lt;STRONG&gt;"New chat thread for each run."&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;That gives the agent a clean context on every execution. No accumulated history, no growing cost. One dropdown, predictable token usage on every run.&lt;/P&gt;
&lt;H2&gt;The pattern&lt;/H2&gt;
&lt;P&gt;Start small with incident routing, expand as you see results. Replace high-frequency polling with push signals, batching, and HTTP triggers. Keep scheduled task threads fresh with "New chat thread for each run."&lt;/P&gt;
&lt;P&gt;The agent is built to handle whatever you throw at it. These patterns just make sure you're getting the most value for what you spend.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Apr 2026 23:33:10 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/3-ways-to-get-more-from-azure-sre-agent/ba-p/4508993</guid>
      <dc:creator>dchelupati</dc:creator>
      <dc:date>2026-04-07T23:33:10Z</dc:date>
    </item>
    <item>
      <title>Azure Monitor in Azure SRE Agent: Autonomous Alert Investigation and Intelligent Merging</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/azure-monitor-in-azure-sre-agent-autonomous-alert-investigation/ba-p/4509069</link>
      <description>&lt;P data-line="2"&gt;Azure Monitor is great at telling you something is wrong. But once the alert fires, the real work begins — someone has to open the portal, triage it, dig into logs, and figure out what happened. That takes time. And while they're investigating, the same alert keeps firing every few minutes, stacking up duplicates of a problem that's already being looked at.&lt;/P&gt;
&lt;P data-line="4"&gt;This is exactly what Azure SRE Agent's Azure Monitor integration addresses. The agent picks up alerts as they fire, investigates autonomously, and remediates when it can — all without waiting for a human to get involved. And when that same alert fires again while the investigation is still underway, the agent merges it into the existing thread rather than creating a new one.&lt;/P&gt;
&lt;P data-line="6"&gt;In this blog, we'll walk through the full Azure Monitor experience in SRE Agent with a live AKS + Redis scenario — how alerts get picked up, what the agent does with them, how merging handles the noise, and why one often-overlooked setting (auto-resolve) makes a bigger difference than you'd expect.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;H3 data-line="8"&gt;Key Takeaways&lt;/H3&gt;
&lt;OL data-line="10"&gt;
&lt;LI data-line="10"&gt;&lt;STRONG&gt;Set up Incident Response Plans to scope which alerts the agent handles&lt;/STRONG&gt; — filter by severity, title patterns, and resource type. Start with review mode, then promote to autonomous once you trust the agent's behavior for that failure pattern.&lt;/LI&gt;
&lt;LI data-line="11"&gt;&lt;STRONG&gt;Recurring alerts merge into one thread automatically&lt;/STRONG&gt;&amp;nbsp;— when the same alert rule fires repeatedly, the agent merges subsequent firings into the existing investigation instead of creating duplicates.&lt;/LI&gt;
&lt;LI data-line="12"&gt;&lt;STRONG&gt;Turn auto-resolve OFF for persistent failures&lt;/STRONG&gt;&amp;nbsp;(bad credentials, misconfigurations, resource exhaustion) so all firings merge into one thread.&amp;nbsp;&lt;STRONG&gt;Turn it ON for transient issues&lt;/STRONG&gt;&amp;nbsp;(traffic spikes, brief timeouts) so each gets a fresh investigation.&lt;/LI&gt;
&lt;LI data-line="13"&gt;&lt;STRONG&gt;Design alert rules around failure categories, not components&lt;/STRONG&gt;&amp;nbsp;— one alert rule = one investigation thread. Structure rules by symptom (Redis errors, HTTP errors, pod health) to give the agent focused, non-overlapping threads.&lt;/LI&gt;
&lt;LI data-line="14"&gt;&lt;STRONG&gt;Attach Custom Response Plans for specialized handling&lt;/STRONG&gt; — route specific alert patterns to custom-agents with custom instructions, tools, and runbooks.&lt;/LI&gt;
&lt;/OL&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2 data-line="10"&gt;It Starts with Any Azure Monitor Alert&lt;/H2&gt;
&lt;P data-line="12"&gt;Before we get to the demo, a quick note on what SRE Agent actually watches. The agent queries the&amp;nbsp;&lt;SPAN class="lia-text-color-21"&gt;&lt;A href="https://learn.microsoft.com/rest/api/monitor/alertsmanagement/alerts/get-all" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/rest/api/monitor/alertsmanagement/alerts/get-all"&gt;Azure Alerts Management REST API&lt;/A&gt;&lt;/SPAN&gt;, which returns every fired alert regardless of signal type. Log search alerts, metric alerts, activity log alerts, smart detection, service health, Prometheus — all of them come through the same API, and the agent processes them all the same way. You don't need to configure connectors or webhooks per alert type. If it fires in Azure Monitor, the agent can see it.&lt;/P&gt;
&lt;P data-line="14"&gt;What you&amp;nbsp;&lt;EM&gt;do&lt;/EM&gt; need to configure is which alerts the agent should care about. That's where Incident Response Plans come in.&lt;/P&gt;
&lt;H2 data-line="16"&gt;Setting Up: Incident Response Plans and Alert Rules&lt;/H2&gt;
&lt;P data-line="18"&gt;We start by heading to &lt;STRONG&gt;Settings &amp;gt; Incident Platform &amp;gt; Azure Monitor&lt;/STRONG&gt; and creating an Incident Response Plan. Response Plans et you scope the agent's attention by severity, alert name patterns, target resource types, and — importantly — whether the agent should act autonomously or wait for human approval.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-line="18"&gt;&lt;STRONG&gt;Action: Match the agent mode to your confidence in the remediation, not just the severity.&lt;/STRONG&gt;&amp;nbsp;Use&amp;nbsp;&lt;STRONG&gt;autonomous&lt;/STRONG&gt;&amp;nbsp;mode for well-understood failure patterns where the fix is predictable and safe (e.g., rolling back a bad config, restarting a pod). Use&amp;nbsp;&lt;STRONG&gt;review&lt;/STRONG&gt; mode for anything where you want a human to validate before the agent acts — especially Sev0/Sev1 alerts that touch critical systems. You can always start in review mode and promote to autonomous once you've validated the agent's behavior.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;img&gt;&lt;STRONG&gt;Incident Response Plan configuration for Azure Monitor, showing severity, agent mode, and title pattern fields&lt;/STRONG&gt;&lt;/img&gt;
&lt;P&gt;For our demo, we created a Sev1 response plan in&amp;nbsp;&lt;STRONG&gt;autonomous mode&lt;/STRONG&gt; — meaning the agent would pick up any Sev1 alert and immediately start investigating and remediating, no approval needed.&lt;/P&gt;
&lt;P&gt;On the Azure Monitor side, we set up three log-based alert rules against our AKS cluster's Log Analytics workspace. The star of the show was a Redis connection error alert — a custom log search query looking for WRONGPASS, ECONNREFUSED, and other Redis failure signatures in ContainerLog:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Each rule evaluates every 5 minutes with a 15-minute aggregation window. If the query returns any results, the alert fires. Simple enough.&lt;/P&gt;
&lt;H2 data-line="28"&gt;Breaking Redis (On Purpose)&lt;/H2&gt;
&lt;P data-line="30"&gt;Our test app is a Node.js journal app on AKS, backed by Azure Cache for Redis. To create a realistic failure scenario, we updated the Redis password in the Kubernetes secret to a wrong value. The app pods picked up the bad credential, Redis connections started failing, and error logs started flowing.&lt;/P&gt;
&lt;P data-line="32"&gt;Within minutes, the Redis connection error alert fired.&lt;/P&gt;
&lt;H2 data-line="34"&gt;What Happened Next&lt;/H2&gt;
&lt;P data-line="36"&gt;Here's where it gets interesting. We didn't touch anything — we just watched.&lt;/P&gt;
&lt;P data-line="38"&gt;The agent's scanner polls the Azure Monitor Alerts API every 60 seconds. It spotted the new alert (state: "New", condition: "Fired"), matched it against our Sev1 Incident Response Plan, and immediately acknowledged it in Azure Monitor — flipping the state to "Acknowledged" so other systems and humans know someone's on it.&lt;/P&gt;
&lt;P data-line="40"&gt;Then it created a new investigation thread. The thread included everything the agent needed to get started: the alert ID, rule name, severity, description, affected resource, subscription, resource group, and a deep-link back to the Azure Portal alert.&lt;/P&gt;
&lt;P data-line="42"&gt;From there, the agent went to work autonomously. It queried container logs, identified the Redis&amp;nbsp;WRONGPASS&amp;nbsp;errors, traced them to the bad secret, retrieved the correct access key from Azure Cache for Redis, updated the Kubernetes secret, and triggered a pod rollout. By the time we checked the thread, it was already marked "Completed."&lt;/P&gt;
&lt;img /&gt;
&lt;P data-line="44"&gt;No pages. No human investigation. No context-switching.&lt;/P&gt;
&lt;H2 data-line="46"&gt;But the Alert Kept Firing...&lt;/H2&gt;
&lt;P data-line="48"&gt;Here's the thing — our alert rule evaluates every 5 minutes. Between the first firing and the agent completing the fix, the alert fired again. And again. Seven times total over 35 minutes.&lt;/P&gt;
&lt;P data-line="50"&gt;Without intelligent handling, that would mean seven separate investigation threads. Seven notifications. Seven disruptions.&lt;/P&gt;
&lt;P data-line="52"&gt;SRE Agent handles this with&amp;nbsp;&lt;STRONG&gt;alert merging&lt;/STRONG&gt;. When a subsequent firing comes in for the same alert rule, the agent checks: is there already an active thread for this rule, created within the last 7 days, that hasn't been resolved or closed? If yes, the new firing gets&amp;nbsp;&lt;STRONG&gt;silently merged&lt;/STRONG&gt;&amp;nbsp;into the existing thread — the total alert count goes up, the "Last fired" timestamp updates, and that's it. No new thread, no new notification, no interruption to the ongoing investigation.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-line="64"&gt;&lt;STRONG&gt;How merging decides: new thread or merge?&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-solid" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th class="lia-align-center"&gt;Condition&lt;/th&gt;&lt;th class="lia-align-center"&gt;Result&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Same alert rule, existing thread still active&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Merged&lt;/STRONG&gt;&amp;nbsp;— alert count increments, no new thread&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Same alert rule, existing thread resolved/closed&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;New thread&lt;/STRONG&gt;&amp;nbsp;— fresh investigation starts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Different alert rule&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;New thread&lt;/STRONG&gt; — always separate&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;/BLOCKQUOTE&gt;
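The decision table above can be sketched as a small function. This is an illustration of the documented behavior (same rule, active thread, created within the last 7 days), not the agent's actual implementation:

```typescript
// Sketch of the merge decision: merge a new firing into an existing
// thread only if the same alert rule has an active thread created
// within the last 7 days; otherwise start a fresh investigation.
interface Thread {
  alertRule: string;
  status: "active" | "resolved" | "closed";
  createdDaysAgo: number;
}

function decide(firedRule: string, threads: Thread[]): "merge" | "new-thread" {
  const match = threads.find(
    (t) =>
      t.alertRule === firedRule &&
      t.status === "active" &&
      t.createdDaysAgo <= 7
  );
  return match ? "merge" : "new-thread";
}
```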
&lt;P data-line="54"&gt;Five minutes after the first alert, the second firing came in and that continued. The agent finished the fix and closed the thread, and the final tally was&amp;nbsp;&lt;STRONG&gt;one thread, seven merged alerts&lt;/STRONG&gt; — spanning 35 minutes of continuous firings.&lt;/P&gt;
&lt;P data-line="54"&gt;On the Azure Portal side, you can see all seven individual alert instances. Each one was acknowledged by the agent.&lt;/P&gt;
&lt;img&gt;
&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;7 Redis Connection Error Alert entries, all Sev1, Fired condition, Closed by user, spanning 8:50 PM to 9:21 PM&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;/img&gt;
&lt;P&gt;Seven firings. One investigation. One fix. That's the merge in action.&lt;/P&gt;
&lt;H2 data-line="74"&gt;The Auto-Resolve Twist&lt;/H2&gt;
&lt;P data-line="76"&gt;Now here's the part we didn't expect to matter as much as it did.&lt;/P&gt;
&lt;P data-line="78"&gt;Azure Monitor has a setting called&amp;nbsp;&lt;STRONG&gt;"Automatically resolve alerts"&lt;/STRONG&gt;. When enabled, Azure Monitor automatically transitions an alert to "Resolved" once the underlying condition clears — for example, when the Redis errors stop because the pod restarted.&lt;/P&gt;
&lt;P data-line="80"&gt;For our first scenario above, we had auto-resolve&amp;nbsp;&lt;STRONG&gt;turned off&lt;/STRONG&gt;. That's why the alert stayed in "Fired" state across all seven evaluation cycles, and all seven firings merged cleanly into one thread.&lt;/P&gt;
&lt;P data-line="84"&gt;But what happens if auto-resolve is on? We turned it on and ran the same scenario again:&lt;/P&gt;
&lt;img /&gt;
&lt;P data-line="90"&gt;Here's what happened:&lt;/P&gt;
&lt;OL data-line="92"&gt;
&lt;LI data-line="92"&gt;Redis broke. Alert fired. Agent picked it up and created a thread.&lt;/LI&gt;
&lt;LI data-line="93"&gt;The agent investigated, found the bad Redis password, fixed it.&lt;/LI&gt;
&lt;LI data-line="94"&gt;With Redis working again, error logs stopped. We noticed that the condition cleared and &lt;STRONG&gt;closed all the 7 alerts manually&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI data-line="95"&gt;We broke Redis a second time (simulating a recurrence). The alert fired again — but the previous alert was already closed/resolved. The merge check found no active thread. &lt;STRONG&gt;A brand-new thread was created, reinvestigated and mitigated.&amp;nbsp;&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P data-line="97"&gt;Two threads for the same alert rule, right there on the Incidents page:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;And on the Azure Monitor side, the newest alert shows "Resolved" condition — that's the auto-resolve doing its thing:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;For a persistent failure like a Redis misconfiguration, this is clearly worse. You get a new investigation thread every break-fix cycle instead of one continuous investigation.&lt;/P&gt;
&lt;H2 data-line="107"&gt;So, Should You Just Turn Auto-Resolve Off?&lt;/H2&gt;
&lt;P data-line="109"&gt;No. It depends on what kind of failure the alert is watching for.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;H5 data-line="129"&gt;&lt;STRONG&gt;Quick Reference: Auto-Resolve Decision Guide&lt;/STRONG&gt;&lt;/H5&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&amp;nbsp;&lt;/th&gt;&lt;th class="lia-align-center"&gt;&lt;STRONG&gt;Auto-Resolve OFF&lt;/STRONG&gt;&lt;/th&gt;&lt;th class="lia-align-center"&gt;&lt;STRONG&gt;Auto-Resolve ON&lt;/STRONG&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Use when&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Problem persists until fixed&lt;/td&gt;&lt;td&gt;Problem is transient and self-correcting&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Examples&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Bad credentials, misconfigurations, CrashLoopBackOff, connection pool exhaustion, IOPS limits&lt;/td&gt;&lt;td&gt;OOM kills during traffic spikes, brief latency from neighboring deployments, one-off job timeouts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Merge behavior&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;All repeat firings merge into one thread&lt;/td&gt;&lt;td&gt;Each break-fix cycle creates a new thread&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Best for&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Agent is actively managing the alert lifecycle&lt;/td&gt;&lt;td&gt;Each occurrence may have a different root cause&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Tradeoff&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Alerts stay in "Fired/Acknowledged" state in Azure Monitor until the agent closes them&lt;/td&gt;&lt;td&gt;More threads, but each gets a clean investigation&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-line="111"&gt;&lt;STRONG&gt;Turn auto-resolve OFF&lt;/STRONG&gt; when you want repeated firings from the same alert rule to stay in a single investigation thread until the alert is explicitly resolved or closed in Azure Monitor. This works best for persistent issues such as a Kubernetes deployment stuck in CrashLoopBackOff because of a bad image tag, a database connection pool exhausted due to a leaked connection, or a storage account hitting its IOPS limit under sustained load.&lt;/P&gt;
&lt;P data-line="113"&gt;&lt;STRONG&gt;Turn auto-resolve ON&lt;/STRONG&gt; when you want a new investigation thread after the previous occurrence has been resolved or closed in Azure Monitor. This works best for episodic or self-clearing issues such as a pod getting OOM-killed during a temporary traffic spike, a brief latency increases during a neighboring service’s deployment, or a scheduled job that times out once due to short-lived resource contention.&lt;/P&gt;
&lt;P data-line="115"&gt;The key question is:&amp;nbsp;&lt;STRONG&gt;when this alert fires again, is it the same ongoing problem or a new one?&lt;/STRONG&gt;&amp;nbsp;If it's the same problem, turn auto-resolve off and let the merges do their job. If it's a new problem, leave auto-resolve on and let the agent investigate fresh.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-line="115"&gt;&lt;STRONG data-start="2443" data-end="2452"&gt;Note:&lt;/STRONG&gt; These behaviors describe how SRE Agent groups alert investigations and may differ from how Azure Monitor documents native alert state behavior.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2 data-line="117"&gt;A Few Things We Learned Along the Way&lt;/H2&gt;
&lt;P data-line="121"&gt;&lt;STRONG&gt;Design alert rules around symptoms, not components.&lt;/STRONG&gt;&amp;nbsp;Each alert rule maps to one investigation thread. We structured ours around failure categories — root cause signal (Redis errors, Sev1), blast radius signal (HTTP errors, Sev2), infrastructure signal (unhealthy pods, Sev2). This gave the agent focused threads without overlap.&lt;/P&gt;
&lt;P data-line="123"&gt;&lt;STRONG&gt;Incident Response Plans let you tier your response.&lt;/STRONG&gt;&amp;nbsp;Not every alert needs the agent to go fix things immediately. We used a Sev1 filter in autonomous mode for the Redis alert, but you could set up a Sev2 filter in&amp;nbsp;&lt;STRONG&gt;review mode&lt;/STRONG&gt; — the agent investigates and provides analysis but waits for human approval before taking action.&lt;/P&gt;
&lt;P data-line="125"&gt;&lt;STRONG&gt;Response Plans specialize the agent.&lt;/STRONG&gt; For specific alert patterns, you can give the agent custom instructions, specialized tools, and a tailored system prompt. A Redis alert can route to a custom-agent loaded with Redis-specific runbooks; a Kubernetes alert can route to one with deep kubectl expertise.&lt;/P&gt;
&lt;H2 data-line="143"&gt;Best Practices Checklist&lt;/H2&gt;
&lt;P data-line="145"&gt;Here's what we learned distilled into concrete actions:&lt;/P&gt;
&lt;H5 data-line="147"&gt;Alert Rule Design&lt;/H5&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Do&lt;/th&gt;&lt;th&gt;Don't&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Design rules around&amp;nbsp;&lt;STRONG&gt;failure categories&lt;/STRONG&gt;&amp;nbsp;(root cause, blast radius, infra health)&lt;/td&gt;&lt;td&gt;Create one alert per component — you'll get overlapping threads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Set&amp;nbsp;&lt;STRONG&gt;evaluation frequency&lt;/STRONG&gt;&amp;nbsp;and&amp;nbsp;&lt;STRONG&gt;aggregation window&lt;/STRONG&gt;&amp;nbsp;to match the failure pattern&lt;/td&gt;&lt;td&gt;Use the same frequency for everything — transient vs. persistent issues need different cadences&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P data-line="155"&gt;&lt;STRONG&gt;Example rule structure from our test:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-line="156"&gt;
&lt;LI data-line="156"&gt;&lt;EM&gt;Root cause signal&lt;/EM&gt;&amp;nbsp;— Redis&amp;nbsp;WRONGPASS/ECONNREFUSED&amp;nbsp;errors →&amp;nbsp;&lt;STRONG&gt;Sev1&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-line="157"&gt;&lt;EM&gt;Blast radius signal&lt;/EM&gt;&amp;nbsp;— HTTP 5xx response codes →&amp;nbsp;&lt;STRONG&gt;Sev2&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-line="158"&gt;&lt;EM&gt;Infrastructure signal&lt;/EM&gt;&amp;nbsp;— KubeEvents&amp;nbsp;Reason="Unhealthy"&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;Sev2&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H5 data-line="160"&gt;Incident Response Plan Setup&lt;/H5&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Do&lt;/th&gt;&lt;th&gt;Don't&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Create separate response plans per severity tier&lt;/td&gt;&lt;td&gt;Use one catch-all filter for everything&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Start with&amp;nbsp;&lt;STRONG&gt;review mode&lt;/STRONG&gt;&amp;nbsp;— especially for Sev0/Sev1 where wrong fixes are costly&lt;/td&gt;&lt;td&gt;Jump straight to autonomous mode on critical alerts without validating agent behavior first&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Promote to&amp;nbsp;&lt;STRONG&gt;autonomous mode&lt;/STRONG&gt;&amp;nbsp;once you've validated the agent handles a specific failure pattern correctly&lt;/td&gt;&lt;td&gt;Assume severity alone determines the right mode — it's about confidence in the remediation&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H5 data-line="168"&gt;Response Plans&lt;/H5&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Do&lt;/th&gt;&lt;th&gt;Don't&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Attach &lt;STRONG&gt;custom response plans&lt;/STRONG&gt;&amp;nbsp;to specific alert patterns for specialized handling&lt;/td&gt;&lt;td&gt;Leave every alert to the agent's general knowledge&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Include custom instructions, tools, and runbooks relevant to the failure type&lt;/td&gt;&lt;td&gt;Write generic instructions — the more specific, the better the investigation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Route Redis alerts to a Redis-specialized custom-agent; K8s alerts to one with kubectl expertise&lt;/td&gt;&lt;td&gt;Assume one agent configuration fits all failure types&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2 data-line="127"&gt;Getting Started&lt;/H2&gt;
&lt;OL data-line="129"&gt;
&lt;LI data-line="178"&gt;Head to&amp;nbsp;&lt;A href="https://sre.azure.com/" target="_blank" rel="noopener" data-href="https://sre.azure.com"&gt;sre.azure.com&lt;/A&gt;&amp;nbsp;and open your agent&lt;/LI&gt;
&lt;LI data-line="179"&gt;Make sure the agent's managed identity has&amp;nbsp;&lt;STRONG&gt;Monitoring Reader&lt;/STRONG&gt;&amp;nbsp;on your target subscriptions&lt;/LI&gt;
&lt;LI data-line="180"&gt;Go to&amp;nbsp;&lt;STRONG&gt;Settings &amp;gt; Incident Platform &amp;gt; Azure Monitor&lt;/STRONG&gt; and create your Incident Response Plans&lt;/LI&gt;
&lt;LI data-line="181"&gt;&lt;STRONG&gt;Review the auto-resolve setting on your alert rules&lt;/STRONG&gt;&amp;nbsp;— turn it off for persistent issues, leave it on for transient ones (see the&amp;nbsp;&lt;SPAN class="lia-text-color-21"&gt;&lt;A href="https://file+.vscode-resource.vscode-cdn.net/c%3A/Users/surivineela/sreagent-runtime-0406/src/Agent/Agent.Portal/Client/docs-website/blog/2026-04-07-azure-monitor-alert-merging-v2.md#so-should-you-just-turn-auto-resolve-off" target="_blank" rel="noopener" data-href="#so-should-you-just-turn-auto-resolve-off"&gt;decision guide above&lt;/A&gt;&lt;/SPAN&gt;)&lt;/LI&gt;
&lt;LI data-line="182"&gt;Start with a&amp;nbsp;&lt;STRONG&gt;test response plan &lt;/STRONG&gt;using&amp;nbsp;Title Contains&amp;nbsp;to target a specific alert rule — validate agent behavior before broadening&lt;/LI&gt;
&lt;LI data-line="183"&gt;Watch the&amp;nbsp;&lt;STRONG&gt;Incidents&lt;/STRONG&gt;&amp;nbsp;page and review the agent's investigation threads before expanding to more alert rules&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2 data-line="135"&gt;Learn More&lt;/H2&gt;
&lt;UL data-line="137"&gt;
&lt;LI data-line="137"&gt;&lt;A href="https://sre.azure.com/docs" target="_blank" rel="noopener" data-href="https://sre.azure.com/docs"&gt;Azure SRE Agent Documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="138"&gt;&lt;A href="https://sre.azure.com/docs/incident-response" target="_blank" rel="noopener" data-href="https://sre.azure.com/docs/incident-response"&gt;Incident Response Guide&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="139"&gt;&lt;A href="https://learn.microsoft.com/azure/azure-monitor/alerts/alerts-overview" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/azure/azure-monitor/alerts/alerts-overview"&gt;Azure Monitor Alert Rules&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 07 Apr 2026 22:20:53 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/azure-monitor-in-azure-sre-agent-autonomous-alert-investigation/ba-p/4509069</guid>
      <dc:creator>Vineela-Suri</dc:creator>
      <dc:date>2026-04-07T22:20:53Z</dc:date>
    </item>
    <item>
      <title>Agentic IIS Migration to Managed Instance on Azure App Service</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/agentic-iis-migration-to-managed-instance-on-azure-app-service/ba-p/4508969</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Introduction&lt;/H2&gt;
&lt;P&gt;Enterprises running ASP.NET Framework workloads on Windows Server with IIS face a familiar dilemma: modernize or stay put. The applications work, the infrastructure is stable, and nobody wants to be the person who breaks production during a cloud migration. But the cost of maintaining aging on-premises servers, patching Windows, and managing IIS keeps climbing.&lt;/P&gt;
&lt;P&gt;Azure App Service has long been the lift-and-shift destination for these workloads. But what about applications that depend on&amp;nbsp;&lt;STRONG&gt;Windows registry keys&lt;/STRONG&gt;,&amp;nbsp;&lt;STRONG&gt;COM components&lt;/STRONG&gt;,&amp;nbsp;&lt;STRONG&gt;SMTP relay&lt;/STRONG&gt;,&amp;nbsp;&lt;STRONG&gt;MSMQ queues&lt;/STRONG&gt;,&amp;nbsp;&lt;STRONG&gt;local file system access&lt;/STRONG&gt;, or&amp;nbsp;&lt;STRONG&gt;custom fonts&lt;/STRONG&gt;? These OS-level dependencies have historically been migration blockers — forcing teams into expensive re-architecture or keeping them anchored to VMs.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Managed Instance on Azure App Service&lt;/STRONG&gt;&amp;nbsp;changes this equation entirely. And the&amp;nbsp;&lt;STRONG&gt;IIS Migration MCP Server&lt;/STRONG&gt;&amp;nbsp;makes migration guided, intelligent, and safe — with AI agents that know what to ask, what to check, and what to generate at every step.&lt;/P&gt;
&lt;H2&gt;What Is Managed Instance on Azure App Service?&lt;/H2&gt;
&lt;P&gt;Managed Instance on App Service is Azure's answer to applications that need&amp;nbsp;&lt;STRONG&gt;OS-level customization&lt;/STRONG&gt;&amp;nbsp;beyond what standard App Service provides. It runs on the&amp;nbsp;&lt;STRONG&gt;PremiumV4 (PV4)&lt;/STRONG&gt;&amp;nbsp;SKU with&amp;nbsp;IsCustomMode=true, giving your app access to:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Capability&lt;/th&gt;&lt;th&gt;What It Enables&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Registry Adapters&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Redirect Windows Registry reads to Azure Key Vault secrets — no code changes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Storage Adapters&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Mount Azure Files, local SSD, or private VNET storage as drive letters (e.g.,&amp;nbsp;D:\,&amp;nbsp;E:\)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;install.ps1 Startup Script&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Run PowerShell at instance startup to install Windows features (SMTP, MSMQ), register COM components, install MSI packages, deploy custom fonts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Custom Mode&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Full access to the Windows instance for configuration beyond standard PaaS guardrails&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;STRONG&gt;The key constraint&lt;/STRONG&gt;: Managed Instance on App Service&amp;nbsp;&lt;STRONG&gt;requires PV4 SKU&lt;/STRONG&gt;&amp;nbsp;with&amp;nbsp;&lt;STRONG&gt;IsCustomMode=true&lt;/STRONG&gt;. No other SKU combination supports it.&lt;/P&gt;
&lt;H3&gt;Why Managed Instance Matters for Legacy Apps&lt;/H3&gt;
&lt;P&gt;Consider a classic enterprise ASP.NET application that:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Reads license keys from&amp;nbsp;HKLM\SOFTWARE\MyApp&amp;nbsp;in the Windows Registry&lt;/LI&gt;
&lt;LI&gt;Uses a COM component for PDF generation registered via&amp;nbsp;regsvr32&lt;/LI&gt;
&lt;LI&gt;Sends email through a local SMTP relay&lt;/LI&gt;
&lt;LI&gt;Writes reports to&amp;nbsp;D:\Reports\&amp;nbsp;on a local drive&lt;/LI&gt;
&lt;LI&gt;Uses a custom corporate font for PDF rendering&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;With standard App Service, you'd need to rewrite every one of these dependencies. With Managed Instance on App Service, you can:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Map registry reads to Key Vault secrets via&amp;nbsp;&lt;STRONG&gt;Registry Adapters&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;Mount Azure Files as&amp;nbsp;D:\&amp;nbsp;via&amp;nbsp;&lt;STRONG&gt;Storage Adapters&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;Enable SMTP Server via&amp;nbsp;&lt;STRONG&gt;install.ps1&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;Register the COM DLL via&amp;nbsp;&lt;STRONG&gt;install.ps1&lt;/STRONG&gt;&amp;nbsp;(regsvr32)&lt;/LI&gt;
&lt;LI&gt;Install the custom font via&amp;nbsp;&lt;STRONG&gt;install.ps1&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Please note that when you migrate web applications to Managed Instance on Azure App Service, in the majority of cases &lt;STRONG&gt;zero application code changes are required&lt;/STRONG&gt;; depending on your specific web app, however, some code changes may still be necessary.&lt;/P&gt;
&lt;H3&gt;Microsoft Learn Resources&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/app-service/overview-managed-instance" target="_blank" rel="noopener"&gt;Managed Instance on App Service Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/app-service/" target="_blank" rel="noopener"&gt;Azure App Service Documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/app-service/app-service-migration-assistant" target="_blank" rel="noopener"&gt;App Service Migration Assistant Tool&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/app-service/manage-move-across-regions" target="_blank" rel="noopener"&gt;Migrate to Azure App Service&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/app-service/overview-hosting-plans" target="_blank" rel="noopener"&gt;Azure App Service Plans Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/app-service/app-service-configure-premium-tier" target="_blank" rel="noopener"&gt;PremiumV4 Pricing Tier&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/key-vault/general/overview" target="_blank" rel="noopener"&gt;Azure Key Vault&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/storage/files/storage-files-introduction" target="_blank" rel="noopener"&gt;Azure Files&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/migrate/appcat/dotnet" target="_blank" rel="noopener"&gt;AppCat (.NET) — Azure Migrate Application and Code Assessment&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Why Agentic Migration? The Case for AI-Guided IIS Migration&lt;/H2&gt;
&lt;H3&gt;The Problem with Traditional Migration&lt;/H3&gt;
&lt;P&gt;Microsoft provides excellent PowerShell scripts for IIS migration —&amp;nbsp;Get-SiteReadiness.ps1,&amp;nbsp;Get-SitePackage.ps1,&amp;nbsp;Generate-MigrationSettings.ps1, and&amp;nbsp;Invoke-SiteMigration.ps1. They're free, well-tested, and reliable. So why wrap them in an AI-powered system?&lt;/P&gt;
&lt;P&gt;Because&amp;nbsp;&lt;STRONG&gt;the scripts are powerful but not intelligent.&lt;/STRONG&gt;&amp;nbsp;They execute what you tell them to. They don't tell you&amp;nbsp;&lt;EM&gt;what&lt;/EM&gt;&amp;nbsp;to do.&lt;/P&gt;
&lt;P&gt;Here's what a traditional migration looks like:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Run readiness checks — get a wall of JSON with cryptic check IDs like&amp;nbsp;ContentSizeCheck,&amp;nbsp;ConfigErrorCheck,&amp;nbsp;GACCheck&lt;/LI&gt;
&lt;LI&gt;Manually interpret 15+ readiness checks per site across dozens of sites&lt;/LI&gt;
&lt;LI&gt;Decide whether each site needs Managed Instance or standard App Service (how?)&lt;/LI&gt;
&lt;LI&gt;Figure out which dependencies need registry adapters vs. storage adapters vs. install.ps1 (the "Managed Instance provisioning split")&lt;/LI&gt;
&lt;LI&gt;Write the install.ps1 script by hand for each combination of OS features&lt;/LI&gt;
&lt;LI&gt;Author ARM templates for adapter configurations (Key Vault references, storage mount specs, RBAC assignments)&lt;/LI&gt;
&lt;LI&gt;Wire together&amp;nbsp;PackageResults.json&amp;nbsp;→&amp;nbsp;MigrationSettings.json&amp;nbsp;with correct Managed Instance fields (Tier=PremiumV4,&amp;nbsp;IsCustomMode=true)&lt;/LI&gt;
&lt;LI&gt;Hope you didn't misconfigure anything before deploying to Azure&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Even experienced Azure engineers find this time-consuming, error-prone, and tedious — especially across a fleet of 20, 50, or 100+ IIS sites.&lt;/P&gt;
&lt;H3&gt;What Agentic Migration Changes&lt;/H3&gt;
&lt;P&gt;The IIS Migration MCP Server introduces an&amp;nbsp;&lt;STRONG&gt;AI orchestration layer&lt;/STRONG&gt;&amp;nbsp;that transforms this manual grind into a guided conversation:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Traditional Approach&lt;/th&gt;&lt;th&gt;Agentic Approach&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Read raw JSON output from scripts&lt;/td&gt;&lt;td&gt;AI summarizes readiness as tables with plain-English descriptions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Memorize 15 check types and their severity&lt;/td&gt;&lt;td&gt;AI enriches each check with title, description, recommendation, and documentation links&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Manually decide Managed Instance vs App Service&lt;/td&gt;&lt;td&gt;recommend_target&amp;nbsp;analyzes all signals and recommends with confidence + reasoning&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Write install.ps1 from scratch&lt;/td&gt;&lt;td&gt;generate_install_script&amp;nbsp;builds it from detected features&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Author ARM templates manually&lt;/td&gt;&lt;td&gt;generate_adapter_arm_template&amp;nbsp;generates full templates with RBAC guidance&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Wire JSON artifacts between phases by hand&lt;/td&gt;&lt;td&gt;Agents pass&amp;nbsp;readiness_results_path&amp;nbsp;→&amp;nbsp;package_results_path&amp;nbsp;→&amp;nbsp;migration_settings_path&amp;nbsp;automatically&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Pray you set PV4 + IsCustomMode correctly&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Enforced automatically&lt;/STRONG&gt;&amp;nbsp;— every tool validates Managed Instance constraints&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Deploy and find out what broke&lt;/td&gt;&lt;td&gt;confirm_migration&amp;nbsp;presents a full cost/resource summary before touching Azure&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;STRONG&gt;The core value proposition: the AI knows the Managed Instance provisioning split.&lt;/STRONG&gt;&amp;nbsp;It knows that registry access needs an ARM template with Key Vault-backed adapters, while SMTP needs an&amp;nbsp;install.ps1&amp;nbsp;section enabling the Windows SMTP Server feature. You don't need to know this. The system detects it from your IIS configuration and AppCat analysis, then generates exactly the right artifacts.&lt;/P&gt;
&lt;H3&gt;Human-in-the-Loop Safety&lt;/H3&gt;
&lt;P&gt;Agentic doesn't mean autonomous. The system has explicit gates:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Phase 1 → Phase 2&lt;/STRONG&gt;: "Do you want to assess these sites, or skip to packaging?"&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Phase 3&lt;/STRONG&gt;: "Here's my recommendation — Managed Instance for Site A (COM + Registry), standard for Site B. Agree?"&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Phase 4&lt;/STRONG&gt;: "Review MigrationSettings.json before proceeding"&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Phase 5&lt;/STRONG&gt;: "This will create billable Azure resources. Type 'yes' to confirm"&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The AI accelerates the workflow; the human retains control over every decision.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;Quick Start&lt;/H3&gt;
&lt;P&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;STRONG&gt;Clone and set up the MCP server git clone &lt;A href="https://github.com/&amp;lt;your-org&amp;gt;/iis-migration-mcp.git" target="_blank" rel="noopener"&gt;https://github.com//iis-migration-mcp.git&lt;/A&gt; &lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;STRONG&gt;cd iis-migration-mcp python -m venv .venv .venv\Scripts\activate pip&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;STRONG&gt;&amp;nbsp;install -r requirements.txt &lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;STRONG&gt;# Download Microsoft's migration scripts (NOT included in this repo) # From: &lt;A href="https://appmigration.microsoft.com/api/download/psscripts/AppServiceMigrationScripts.zip" target="_blank" rel="noopener"&gt;https://appmigration.microsoft.com/api/download/psscripts/AppServiceMigrationScripts.zip&lt;/A&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;STRONG&gt;# Unzip to C:\MigrationScripts (or your preferred path) &lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;STRONG&gt;# Start using in VS Code with Copilot &lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;STRONG&gt;# 1. Copy .vscode/mcp.json.example → .vscode/mcp.json &lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;STRONG&gt;# 2. Open folder in VS Code &lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;STRONG&gt;# 3. In Copilot Chat: "Configure scripts path to C:\MigrationScripts" &lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;STRONG&gt;# 4. Then: @iis-migrate "Discover my IIS sites"&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;The server also works with&amp;nbsp;&lt;STRONG&gt;any MCP-compatible client&lt;/STRONG&gt; — Claude Desktop, Cursor, Copilot CLI, or custom integrations — via stdio transport.&lt;/P&gt;
&lt;H2&gt;Architecture: How the MCP Server Works&lt;/H2&gt;
&lt;P&gt;The system is built on the&amp;nbsp;&lt;STRONG&gt;Model Context Protocol (MCP)&lt;/STRONG&gt;, an open protocol that lets AI assistants like GitHub Copilot, Claude, or Cursor call external tools through a standardized interface.&lt;/P&gt;
&lt;PRE&gt;┌──────────────────────────────────────────────────────────────────┐
│ VS Code + Copilot Chat                                           │
│   @iis-migrate orchestrator agent                                │
│     ├── iis-discover (Phase 1)                                   │
│     ├── iis-assess (Phase 2)                                     │
│     ├── iis-recommend (Phase 3)                                  │
│     ├── iis-deploy-plan (Phase 4)                                │
│     └── iis-execute (Phase 5)                                    │
└─────────────┬────────────────────────────────────────────────────┘
              │ stdio JSON-RPC (MCP Transport)
              ▼
┌──────────────────────────────────────────────────────────────────┐
│ FastMCP Server (server.py)                                       │
│   13 Python Tool Modules (tools/*.py)                            │
│     └── ps_runner.py (Python → PowerShell bridge)                │
│           └── Downloaded PowerShell Scripts (user-configured)    │
│                 ├── Local IIS (discovery, packaging)             │
│                 └── Azure ARM API (deployment)                   │
└──────────────────────────────────────────────────────────────────┘&lt;/PRE&gt;
&lt;P&gt;The server exposes&amp;nbsp;&lt;STRONG&gt;13 MCP tools&lt;/STRONG&gt;&amp;nbsp;organized across&amp;nbsp;&lt;STRONG&gt;5 phases&lt;/STRONG&gt;, orchestrated by&amp;nbsp;&lt;STRONG&gt;6 Copilot agents&lt;/STRONG&gt;&amp;nbsp;(1 orchestrator + 5 specialist subagents).&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Important:&lt;/STRONG&gt;&amp;nbsp;The PowerShell migration scripts are&amp;nbsp;&lt;STRONG&gt;not included&lt;/STRONG&gt;&amp;nbsp;in this repository. Users must download them from&amp;nbsp;&lt;A href="https://appmigration.microsoft.com/api/download/psscripts/AppServiceMigrationScripts.zip" target="_blank" rel="noopener"&gt;appmigration.microsoft.com&lt;/A&gt;&amp;nbsp;and configure the path using the&amp;nbsp;configure_scripts_path&amp;nbsp;tool. This ensures you always use the latest version of Microsoft's scripts and avoids version mismatch issues.&lt;/P&gt;
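&lt;P&gt;Conceptually, the ps_runner.py bridge shown in the diagram above just builds a PowerShell invocation for one of the downloaded scripts and parses its JSON output. A minimal sketch, assuming a hypothetical &lt;STRONG&gt;-OutputPath&lt;/STRONG&gt; parameter (the real parameter names belong to Microsoft's scripts, and this is not the repository's actual code):&lt;/P&gt;

```python
import json
import subprocess
from pathlib import Path

def build_readiness_command(scripts_path: str, output_path: str) -> list:
    """Assemble the argument vector for Get-SiteReadiness.ps1.
    The -OutputPath parameter here is hypothetical, for illustration."""
    script = Path(scripts_path) / "Get-SiteReadiness.ps1"
    return [
        "powershell.exe", "-NoProfile", "-ExecutionPolicy", "Bypass",
        "-File", str(script), "-OutputPath", output_path,
    ]

def run_readiness(scripts_path: str, output_path: str) -> dict:
    """Invoke the script and parse its JSON results.
    Requires Windows with IIS installed and admin privileges."""
    subprocess.run(build_readiness_command(scripts_path, output_path), check=True)
    return json.loads(Path(output_path).read_text())
```

Keeping the command construction separate from execution is what lets the MCP tools validate paths (and surface clear errors) before ever shelling out to PowerShell.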
&lt;H2&gt;The 13 MCP Tools: Complete Reference&lt;/H2&gt;
&lt;H3&gt;Phase 0 — Setup&lt;/H3&gt;
&lt;H4&gt;configure_scripts_path&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Point the server to Microsoft's downloaded migration PowerShell scripts.&lt;/P&gt;
&lt;P&gt;Before any migration work, you need to download the scripts from&amp;nbsp;&lt;A href="https://appmigration.microsoft.com/api/download/psscripts/AppServiceMigrationScripts.zip" target="_blank" rel="noopener"&gt;appmigration.microsoft.com&lt;/A&gt;, unzip them, and tell the server where they are.&lt;/P&gt;
&lt;P&gt;"Configure scripts path to C:\MigrationScripts"&lt;/P&gt;
&lt;H3&gt;Phase 1 — Discovery&lt;/H3&gt;
&lt;H4&gt;1.&amp;nbsp;discover_iis_sites&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Scan the local IIS server and run readiness checks on every web site.&lt;/P&gt;
&lt;P&gt;This is the entry point for every migration. It calls&amp;nbsp;Get-SiteReadiness.ps1&amp;nbsp;under the hood, which:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Enumerates all IIS web sites, application pools, bindings, and virtual directories&lt;/LI&gt;
&lt;LI&gt;Runs&amp;nbsp;&lt;STRONG&gt;15 readiness checks&lt;/STRONG&gt;&amp;nbsp;per site (config errors, HTTPS bindings, non-HTTP protocols, TCP ports, location tags, app pool settings, app pool identity, virtual directories, content size, global modules, ISAPI filters, authentication, framework version, connection strings, and more)&lt;/LI&gt;
&lt;LI&gt;Detects source code artifacts (.sln,&amp;nbsp;.csproj,&amp;nbsp;.cs,&amp;nbsp;.vb) near site physical paths&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Output&lt;/STRONG&gt;:&amp;nbsp;ReadinessResults.json&amp;nbsp;with per-site status:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Status&lt;/th&gt;&lt;th&gt;Meaning&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;READY&lt;/td&gt;&lt;td&gt;No issues detected — clear for migration&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;READY_WITH_WARNINGS&lt;/td&gt;&lt;td&gt;Minor issues that won't block migration&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;READY_WITH_ISSUES&lt;/td&gt;&lt;td&gt;Non-fatal issues that need attention&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;BLOCKED&lt;/td&gt;&lt;td&gt;Fatal issues (e.g., content &amp;gt; 2GB) — cannot migrate as-is&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;STRONG&gt;Requires&lt;/STRONG&gt;: Administrator privileges, IIS installed.&lt;/P&gt;
&lt;H4&gt;2.&amp;nbsp;choose_assessment_mode&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Route each discovered site into the appropriate next step.&lt;/P&gt;
&lt;P&gt;After discovery, you decide the path for each site:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;assess_all&lt;/STRONG&gt;: Run detailed assessment on all non-blocked sites&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;package_and_migrate&lt;/STRONG&gt;: Skip assessment, proceed directly to packaging (for sites you already know well)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The tool classifies each site into one of five actions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;assess_config_only&amp;nbsp;— IIS/web.config analysis&lt;/LI&gt;
&lt;LI&gt;assess_config_and_source&amp;nbsp;— Config + AppCat source code analysis (when source is detected)&lt;/LI&gt;
&lt;LI&gt;package&amp;nbsp;— Skip to packaging&lt;/LI&gt;
&lt;LI&gt;blocked&amp;nbsp;— Fatal errors, cannot proceed&lt;/LI&gt;
&lt;LI&gt;skip&amp;nbsp;— User chose to exclude&lt;/LI&gt;
&lt;/UL&gt;
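&lt;P&gt;The routing above can be sketched as a small function. The status and action names follow the tables in this post; the decision rules themselves are an illustrative assumption, not the tool's actual implementation:&lt;/P&gt;

```python
# Illustrative sketch of the per-site routing logic. The statuses and
# action names come from this post's tables; the rule ordering is an
# assumption for illustration.

def route_site(status: str, has_source: bool, mode: str,
               excluded: bool = False) -> str:
    """Map a discovered site to one of the five actions."""
    if excluded:
        return "skip"            # user chose to exclude the site
    if status == "BLOCKED":
        return "blocked"         # fatal issues, cannot proceed
    if mode == "package_and_migrate":
        return "package"         # skip assessment entirely
    # mode == "assess_all": pick assessment depth based on source detection
    return "assess_config_and_source" if has_source else "assess_config_only"
```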
&lt;H3&gt;Phase 2 — Assessment&lt;/H3&gt;
&lt;H4&gt;3.&amp;nbsp;assess_site_readiness&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Get a detailed, human-readable readiness assessment for a specific site.&lt;/P&gt;
&lt;P&gt;Takes the raw readiness data from Phase 1 and enriches each check with:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Title&lt;/STRONG&gt;: Plain-English name (e.g., "Global Assembly Cache (GAC) Dependencies")&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Description&lt;/STRONG&gt;: What the check found and why it matters&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Recommendation&lt;/STRONG&gt;: Specific guidance on how to resolve the issue&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Category&lt;/STRONG&gt;: Grouping (Configuration, Security, Compatibility)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Documentation Link&lt;/STRONG&gt;: Microsoft Learn URL for further reading&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This enrichment comes from&amp;nbsp;WebAppCheckResources.resx, an XML resource file that maps check IDs to detailed metadata. Without this tool, you'd see&amp;nbsp;GACCheck: FAIL&amp;nbsp;— with it, you see the full context.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Output&lt;/STRONG&gt;: Overall status, enriched failed/warning checks, framework version, pipeline mode, binding details.&lt;/P&gt;
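&lt;P&gt;The enrichment lookup can be sketched in a few lines: a .resx file is plain XML of &lt;EM&gt;data/value&lt;/EM&gt; pairs, so mapping a check ID to its metadata is a simple parse. The &lt;EM&gt;CheckId.Field&lt;/EM&gt; key scheme and sample values below are assumptions for illustration; the real resource file may use a different naming convention:&lt;/P&gt;

```python
# Minimal sketch of the enrichment step: look up human-readable metadata
# for a check ID in a .resx resource file. The "<CheckId>.<Field>" key
# scheme and sample content are illustrative assumptions.
import xml.etree.ElementTree as ET

SAMPLE_RESX = """<root>
  <data name="GACCheck.Title"><value>Global Assembly Cache (GAC) Dependencies</value></data>
  <data name="GACCheck.Recommendation"><value>Install GAC assemblies via install.ps1</value></data>
</root>"""

def enrich(check_id: str, resx_xml: str) -> dict:
    """Return {field: text} for every resource keyed '<check_id>.<field>'."""
    root = ET.fromstring(resx_xml)
    prefix = check_id + "."
    return {
        d.get("name")[len(prefix):]: d.findtext("value")
        for d in root.iter("data")
        if d.get("name", "").startswith(prefix)
    }
```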
&lt;H4&gt;4.&amp;nbsp;assess_source_code&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Analyze an&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/migrate/appcat/dotnet" target="_blank" rel="noopener"&gt;Azure Migrate application and code assessment for .NET&lt;/A&gt;&amp;nbsp;JSON report to identify Managed Instance-relevant source code dependencies.&lt;/P&gt;
&lt;P&gt;If your application has source code and you've run the assessment tool against it, this tool parses the results and maps findings to migration actions:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Dependency Detected&lt;/th&gt;&lt;th&gt;Migration Action&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Windows Registry access&lt;/td&gt;&lt;td&gt;Registry Adapter (ARM template)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Local file system I/O / hardcoded paths&lt;/td&gt;&lt;td&gt;Storage Adapter (ARM template)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SMTP usage&lt;/td&gt;&lt;td&gt;install.ps1 (SMTP Server feature)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;COM Interop&lt;/td&gt;&lt;td&gt;install.ps1 (regsvr32/RegAsm)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Global Assembly Cache (GAC)&lt;/td&gt;&lt;td&gt;install.ps1 (GAC install)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Message Queuing (MSMQ)&lt;/td&gt;&lt;td&gt;install.ps1 (MSMQ feature)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Certificate access&lt;/td&gt;&lt;td&gt;Key Vault integration&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;The tool matches rules from the assessment output against known Managed Instance-relevant patterns. For a complete list of rules and categories, see&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/dotnet/azure/migration/appcat/interpret-results" target="_blank" rel="noopener"&gt;Interpret the analysis results&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Output&lt;/STRONG&gt;: Issues categorized as mandatory/optional/potential, plus install_script_features and adapter_features lists that feed directly into Phase 3 tools.&lt;/P&gt;
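&lt;P&gt;Conceptually, producing those two lists is a bucketing step over the detected dependencies. The short labels below stand in for the real AppCat rule IDs, which the tool matches internally; the split itself follows the table above:&lt;/P&gt;

```python
# Sketch of splitting detected dependencies into the two provisioning
# buckets described above. The labels are illustrative stand-ins for
# real AppCat rule IDs.

INSTALL_SCRIPT = {"SMTP", "COM", "GAC", "MSMQ"}   # handled by install.ps1
ADAPTER = {"Registry", "FileSystem"}              # handled by the ARM template

def split_features(detected: list) -> dict:
    """Return the install_script_features / adapter_features lists."""
    return {
        "install_script_features": sorted(d for d in detected if d in INSTALL_SCRIPT),
        "adapter_features": sorted(d for d in detected if d in ADAPTER),
    }
```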
&lt;H3&gt;Phase 3 — Recommendation &amp;amp; Provisioning&lt;/H3&gt;
&lt;H4&gt;5.&amp;nbsp;suggest_migration_approach&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Recommend the right migration tool/approach for the scenario.&lt;/P&gt;
&lt;P&gt;This is a routing tool that considers:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Source code available?&lt;/STRONG&gt;&amp;nbsp;→ Recommend the App Modernization MCP server for code-level changes&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;No source code?&lt;/STRONG&gt;&amp;nbsp;→ Recommend this IIS Migration MCP (lift-and-shift)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;OS customization needed?&lt;/STRONG&gt;&amp;nbsp;→ Highlight Managed Instance on App Service as the target&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;6.&amp;nbsp;recommend_target&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Recommend the Azure deployment target for each site based on all assessment data.&lt;/P&gt;
&lt;P&gt;This is the intelligence center of the system. It analyzes config assessments and source code findings to recommend:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Target&lt;/th&gt;&lt;th&gt;When Recommended&lt;/th&gt;&lt;th&gt;SKU&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;MI_AppService&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Registry, COM, MSMQ, SMTP, local file I/O, GAC, or Windows Service dependencies detected&lt;/td&gt;&lt;td&gt;PremiumV4 (PV4)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;AppService&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Standard web app, no OS-level dependencies&lt;/td&gt;&lt;td&gt;PremiumV2 (PV2)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;ContainerApps&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Microservices architecture or container-first preference&lt;/td&gt;&lt;td&gt;N/A&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Each recommendation comes with:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Confidence&lt;/STRONG&gt;:&amp;nbsp;high&amp;nbsp;or&amp;nbsp;medium&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Reasoning&lt;/STRONG&gt;: Full explanation of why this target was chosen&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Managed Instance reasons&lt;/STRONG&gt;: Specific dependencies that require Managed Instance&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Blockers&lt;/STRONG&gt;: Issues that prevent migration entirely&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;install_script_features&lt;/STRONG&gt;: What the install.ps1 needs to enable&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;adapter_features&lt;/STRONG&gt;: What the ARM template needs to configure&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Provisioning guidance&lt;/STRONG&gt;: Step-by-step instructions for what to do next&lt;/LI&gt;
&lt;/UL&gt;
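&lt;P&gt;The core of the target decision can be sketched as follows. This is a simplified model of the table above — the real tool weighs more signals (confidence scoring, blockers, provisioning guidance) than this illustration shows:&lt;/P&gt;

```python
# Illustrative decision sketch for recommend_target, following the table
# above. The trigger set and structure are assumptions, not the tool's
# actual rule engine.

MI_TRIGGERS = {"Registry", "COM", "MSMQ", "SMTP",
               "LocalFileIO", "GAC", "WindowsService"}

def recommend_target(dependencies: set, prefers_containers: bool = False) -> dict:
    if prefers_containers:
        return {"target": "ContainerApps", "sku": None, "mi_reasons": []}
    mi_reasons = sorted(dependencies & MI_TRIGGERS)
    if mi_reasons:
        # OS-level dependencies force Managed Instance on PremiumV4
        return {"target": "MI_AppService", "sku": "PremiumV4",
                "mi_reasons": mi_reasons}
    return {"target": "AppService", "sku": "PremiumV2", "mi_reasons": []}
```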
&lt;H4&gt;7.&amp;nbsp;generate_install_script&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Generate an&amp;nbsp;install.ps1&amp;nbsp;PowerShell script for OS-level feature enablement on Managed Instance.&lt;/P&gt;
&lt;P&gt;This handles the&amp;nbsp;&lt;STRONG&gt;OS-level&lt;/STRONG&gt;&amp;nbsp;side of the Managed Instance provisioning split. It generates a startup script that includes sections for:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Feature&lt;/th&gt;&lt;th&gt;What the Script Does&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;SMTP&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Install-WindowsFeature SMTP-Server, configure smart host relay&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;MSMQ&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Install MSMQ, create application queues&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;COM/MSI&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Run&amp;nbsp;msiexec&amp;nbsp;for MSI installers,&amp;nbsp;regsvr32/RegAsm&amp;nbsp;for COM registration&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Crystal Reports&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Install SAP Crystal Reports runtime MSI&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Custom Fonts&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Copy&amp;nbsp;.ttf/.otf&amp;nbsp;to&amp;nbsp;C:\Windows\Fonts, register in registry&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;The script can auto-detect needed features from config and source assessments, or you can specify them manually.&lt;/P&gt;
&lt;H4&gt;8.&amp;nbsp;generate_adapter_arm_template&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Generate an ARM template for Managed Instance registry and storage adapters.&lt;/P&gt;
&lt;P&gt;This handles the&amp;nbsp;&lt;STRONG&gt;platform-level&lt;/STRONG&gt;&amp;nbsp;side of the Managed Instance provisioning split. It generates a deployable ARM template that configures:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Registry Adapters&lt;/STRONG&gt;&amp;nbsp;(Key Vault-backed):&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Map Windows Registry paths (e.g.,&amp;nbsp;HKLM\SOFTWARE\MyApp\LicenseKey) to Key Vault secrets&lt;/LI&gt;
&lt;LI&gt;Your application reads the registry as before; Managed Instance redirects the read to Key Vault transparently&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Storage Adapters&lt;/STRONG&gt;&amp;nbsp;(three types):&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Type&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;th&gt;Credentials&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;AzureFiles&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Mount Azure Files SMB share as a drive letter&lt;/td&gt;&lt;td&gt;Storage account key in Key Vault&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Custom&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Mount storage over private endpoint via VNET&lt;/td&gt;&lt;td&gt;Requires VNET integration&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;LocalStorage&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Allocate local SSD on the Managed Instance as a drive letter&lt;/td&gt;&lt;td&gt;None needed&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;The template also includes:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Managed Identity configuration&lt;/LI&gt;
&lt;LI&gt;RBAC role assignments guidance (Key Vault Secrets User, Storage File Data SMB Share Contributor, etc.)&lt;/LI&gt;
&lt;LI&gt;Deployment CLI commands ready to copy-paste&lt;/LI&gt;
&lt;/UL&gt;
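&lt;P&gt;As a rough mental model, the registry-adapter portion of the template is a list of path-to-secret mappings. The property names in this sketch are hypothetical — consult the generated template for the actual schema:&lt;/P&gt;

```python
# Sketch of turning registry-path mappings into the adapter entries the
# ARM template carries. Property names here are hypothetical, chosen only
# to illustrate the shape of the mapping.

def registry_adapter_entries(mappings: dict) -> list:
    """mappings: Windows registry path -> Key Vault secret URI."""
    return [
        {"registryPath": path, "keyVaultSecretUri": uri}
        for path, uri in sorted(mappings.items())
    ]
```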
&lt;H3&gt;Phase 4 — Deployment Planning &amp;amp; Packaging&lt;/H3&gt;
&lt;H4&gt;9.&amp;nbsp;plan_deployment&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Plan the Azure App Service deployment — plans, SKUs, site assignments.&lt;/P&gt;
&lt;P&gt;Collects your Azure details (subscription, resource group, region) and creates a validated deployment plan:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Assigns sites to App Service Plans&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Enforces PV4 + IsCustomMode=true for Managed Instance&lt;/STRONG&gt;&amp;nbsp;— won't let you accidentally use the wrong SKU&lt;/LI&gt;
&lt;LI&gt;Supports&amp;nbsp;single_plan&amp;nbsp;(all sites on one plan) or&amp;nbsp;multi_plan&amp;nbsp;(separate plans)&lt;/LI&gt;
&lt;LI&gt;Optionally queries Azure for existing Managed Instance plans you can reuse&lt;/LI&gt;
&lt;/UL&gt;
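&lt;P&gt;The SKU guardrail amounts to a simple validation over the plan definition. Field names here mirror the MigrationSettings.json example later in this post; the validation logic itself is an illustrative sketch:&lt;/P&gt;

```python
# Sketch of the guardrail plan_deployment applies: any site targeted at
# Managed Instance must land on a PremiumV4 plan with IsCustomMode=true.
# Field names mirror the MigrationSettings example; logic is illustrative.

def validate_plan(plan: dict, has_mi_sites: bool) -> list:
    """Return a list of validation errors (empty if the plan is valid)."""
    errors = []
    if has_mi_sites:
        if plan.get("Tier") != "PremiumV4":
            errors.append("Managed Instance sites require the PremiumV4 tier")
        if plan.get("IsCustomMode") is not True:
            errors.append("Managed Instance sites require IsCustomMode=true")
    return errors
```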
&lt;H4&gt;10.&amp;nbsp;package_site&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Package IIS site content into ZIP files for deployment.&lt;/P&gt;
&lt;P&gt;Calls&amp;nbsp;Get-SitePackage.ps1&amp;nbsp;to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Compress site binaries +&amp;nbsp;web.config&amp;nbsp;into deployment-ready ZIPs&lt;/LI&gt;
&lt;LI&gt;Optionally inject&amp;nbsp;install.ps1&amp;nbsp;into the package (so it deploys alongside the app)&lt;/LI&gt;
&lt;LI&gt;Handle sites with non-fatal issues (configurable)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Size limit&lt;/STRONG&gt;: 2 GB per site (enforced by System.IO.Compression).&lt;/P&gt;
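&lt;P&gt;A pre-flight size gate for that limit is easy to sketch: walk the site's physical path, sum the file sizes, and refuse anything over 2 GB before attempting to build the ZIP:&lt;/P&gt;

```python
# Sketch of a pre-packaging size gate for the 2 GB ZIP limit mentioned
# above. Illustrative only; the real tooling performs its own checks.
import os

LIMIT_BYTES = 2 * 1024 ** 3  # 2 GB

def site_content_size(root: str) -> int:
    """Total size in bytes of all files under the site's physical path."""
    return sum(
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
    )

def can_package(root: str) -> bool:
    return site_content_size(root) <= LIMIT_BYTES
```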
&lt;H4&gt;11.&amp;nbsp;generate_migration_settings&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Create the&amp;nbsp;MigrationSettings.json&amp;nbsp;deployment configuration.&lt;/P&gt;
&lt;P&gt;This is the final configuration artifact. It calls&amp;nbsp;Generate-MigrationSettings.ps1&amp;nbsp;and then post-processes the output to inject Managed Instance-specific fields:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Important&lt;/STRONG&gt;: The Managed Instance on App Service Plan is&amp;nbsp;&lt;STRONG&gt;not automatically created&lt;/STRONG&gt;&amp;nbsp;by the migration tools. You must&amp;nbsp;&lt;STRONG&gt;pre-create the Managed Instance on App Service Plan&lt;/STRONG&gt;&amp;nbsp;(PV4 SKU with&amp;nbsp;IsCustomMode=true) in the Azure portal or via CLI before generating migration settings. When running&amp;nbsp;generate_migration_settings, provide the&amp;nbsp;&lt;STRONG&gt;name of your existing Managed Instance plan&lt;/STRONG&gt; so the settings file references it correctly.&lt;/P&gt;
&lt;P&gt;{ "AppServicePlan": "mi-plan-eastus", "Tier": "PremiumV4", "IsCustomMode": true, "InstallScriptPath": "install.ps1", "Region": "eastus", "Sites": [ { "IISSiteName": "MyLegacyApp", "AzureSiteName": "mylegacyapp-azure", "SitePackagePath": "packagedsites/MyLegacyApp_Content.zip" } ] }&lt;/P&gt;
&lt;H3&gt;Phase 5 — Execution&lt;/H3&gt;
&lt;H4&gt;12.&amp;nbsp;confirm_migration&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Present a full migration summary and require explicit human confirmation.&lt;/P&gt;
&lt;P&gt;Before touching Azure, this tool displays:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Total plans and sites to be created&lt;/LI&gt;
&lt;LI&gt;SKU and pricing tier per plan&lt;/LI&gt;
&lt;LI&gt;Whether Managed Instance is configured&lt;/LI&gt;
&lt;LI&gt;Cost warning for PV4 pricing&lt;/LI&gt;
&lt;LI&gt;Resource group, region, and subscription details&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Nothing proceeds until the user explicitly confirms.&lt;/STRONG&gt;&lt;/P&gt;
&lt;H4&gt;13.&amp;nbsp;migrate_sites&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;: Deploy everything to Azure App Service.&amp;nbsp;&lt;STRONG&gt;This creates billable resources.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Calls&amp;nbsp;Invoke-SiteMigration.ps1, which:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Sets Azure subscription context&lt;/LI&gt;
&lt;LI&gt;Creates/validates resource groups&lt;/LI&gt;
&lt;LI&gt;Creates App Service Plans (PV4 with IsCustomMode for Managed Instance)&lt;/LI&gt;
&lt;LI&gt;Creates Web Apps&lt;/LI&gt;
&lt;LI&gt;Configures .NET version, 32-bit mode, pipeline mode from the original IIS settings&lt;/LI&gt;
&lt;LI&gt;Sets up virtual directories and applications&lt;/LI&gt;
&lt;LI&gt;Disables basic authentication (FTP + SCM) for security&lt;/LI&gt;
&lt;LI&gt;Deploys ZIP packages via Azure REST API&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;STRONG&gt;Output&lt;/STRONG&gt;:&amp;nbsp;MigrationResults.json&amp;nbsp;with per-site Azure URLs, Resource IDs, and deployment status.&lt;/P&gt;
&lt;H2&gt;The 6 Copilot Agents&lt;/H2&gt;
&lt;P&gt;The MCP tools are orchestrated by a team of specialized Copilot agents — each responsible for a specific phase of the migration lifecycle.&lt;/P&gt;
&lt;H3&gt;@iis-migrate&amp;nbsp;— The Orchestrator&lt;/H3&gt;
&lt;P&gt;The root agent that guides the entire migration. It:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Tracks progress across all 5 phases using a todo list&lt;/LI&gt;
&lt;LI&gt;Delegates work to specialist subagents&lt;/LI&gt;
&lt;LI&gt;Gates between phases — asks before transitioning&lt;/LI&gt;
&lt;LI&gt;Enforces the Managed Instance constraint (PV4 + IsCustomMode) at every decision point&lt;/LI&gt;
&lt;LI&gt;Never skips the Phase 5 confirmation gate&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Usage&lt;/STRONG&gt;: Open Copilot Chat and type&amp;nbsp;@iis-migrate I want to migrate my IIS applications to Azure&lt;/P&gt;
&lt;H3&gt;iis-discover&amp;nbsp;— Discovery Specialist&lt;/H3&gt;
&lt;P&gt;Handles Phase 1. Runs&amp;nbsp;discover_iis_sites, presents a summary table of all sites with their readiness status, and asks whether to assess or skip to packaging. Returns&amp;nbsp;readiness_results_path&amp;nbsp;and per-site routing plans.&lt;/P&gt;
&lt;H3&gt;iis-assess&amp;nbsp;— Assessment Specialist&lt;/H3&gt;
&lt;P&gt;Handles Phase 2. Runs&amp;nbsp;assess_site_readiness&amp;nbsp;for every site, and&amp;nbsp;assess_source_code&amp;nbsp;when AppCat results are available. Merges findings, highlights Managed Instance-relevant issues, and produces the adapter/install features lists that drive Phase 3.&lt;/P&gt;
&lt;H3&gt;iis-recommend&amp;nbsp;— Recommendation Specialist&lt;/H3&gt;
&lt;P&gt;Handles Phase 3. Runs&amp;nbsp;recommend_target&amp;nbsp;for each site, then conditionally generates&amp;nbsp;install.ps1&amp;nbsp;and ARM adapter templates. Presents all recommendations with confidence levels and reasoning, and allows you to edit generated artifacts.&lt;/P&gt;
&lt;H3&gt;iis-deploy-plan&amp;nbsp;— Deployment Planning Specialist&lt;/H3&gt;
&lt;P&gt;Handles Phase 4. Collects Azure details, runs&amp;nbsp;plan_deployment,&amp;nbsp;package_site, and&amp;nbsp;generate_migration_settings. Validates Managed Instance configuration, allows review and editing of MigrationSettings.json.&amp;nbsp;&lt;STRONG&gt;Does not execute migration.&lt;/STRONG&gt;&lt;/P&gt;
&lt;H3&gt;iis-execute&amp;nbsp;— Execution Specialist&lt;/H3&gt;
&lt;P&gt;Handles Phase 5 only. Runs&amp;nbsp;confirm_migration&amp;nbsp;to present the final summary, then&amp;nbsp;&lt;STRONG&gt;only proceeds with&amp;nbsp;migrate_sites&amp;nbsp;after receiving explicit "yes" confirmation.&lt;/STRONG&gt;&amp;nbsp;Reports results with Azure URLs and deployment status.&lt;/P&gt;
&lt;H2&gt;The Managed Instance Provisioning Split: A Critical Concept&lt;/H2&gt;
&lt;P&gt;One of the most important ideas Managed Instance introduces is the&amp;nbsp;&lt;STRONG&gt;provisioning split&lt;/STRONG&gt;&amp;nbsp;— the division of OS dependencies into two categories that are configured through different mechanisms:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;ARM Template (Platform-Level)&lt;/th&gt;&lt;th&gt;install.ps1 (OS-Level)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Registry Adapters → Key Vault secrets&lt;/td&gt;&lt;td&gt;COM/MSI Registration → regsvr32, RegAsm, msiexec&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Storage Mounts → Azure Files, Local SSD, VNET private storage&lt;/td&gt;&lt;td&gt;SMTP Server Feature → Install-WindowsFeature&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;MSMQ → Message queue setup&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;Crystal Reports Runtime → SAP MSI installer&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;Custom Fonts → Copy to C:\Windows\Fonts&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;The MCP server handles this split automatically:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;assess_source_code&amp;nbsp;detects which dependencies fall into which category&lt;/LI&gt;
&lt;LI&gt;recommend_target&amp;nbsp;reports both&amp;nbsp;adapter_features&amp;nbsp;and&amp;nbsp;install_script_features&lt;/LI&gt;
&lt;LI&gt;generate_adapter_arm_template&amp;nbsp;builds the ARM template for platform features&lt;/LI&gt;
&lt;LI&gt;generate_install_script&amp;nbsp;builds the PowerShell startup script for OS features&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;You don't need to remember which goes where — the system decides and generates the right artifacts.&lt;/P&gt;
&lt;H2&gt;End-to-End Walkthrough: From Discovery to Running on Managed Instance&lt;/H2&gt;
&lt;P&gt;Here's what a complete migration conversation looks like:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;You&lt;/STRONG&gt;:&amp;nbsp;&lt;EM&gt;"@iis-migrate I want to migrate my IIS applications to Azure"&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Phase 1 — Discovery&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;Agent runs&amp;nbsp;discover_iis_sites, presents a table:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Site&lt;/th&gt;&lt;th&gt;Status&lt;/th&gt;&lt;th&gt;Framework&lt;/th&gt;&lt;th&gt;Source Code?&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;HRPortal&lt;/td&gt;&lt;td&gt;READY_WITH_ISSUES&lt;/td&gt;&lt;td&gt;v4.8&lt;/td&gt;&lt;td&gt;Yes (.sln found)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PayrollAPI&lt;/td&gt;&lt;td&gt;READY&lt;/td&gt;&lt;td&gt;v4.8&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;IntranetCMS&lt;/td&gt;&lt;td&gt;BLOCKED (&amp;gt;2GB)&lt;/td&gt;&lt;td&gt;v4.7.2&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;STRONG&gt;Phase 2 — Assessment&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;Agent runs&amp;nbsp;assess_site_readiness&amp;nbsp;for HRPortal — finds GACCheck and RegistryCheck failures. Runs&amp;nbsp;assess_source_code&amp;nbsp;using AppCat report — confirms COM interop, registry access, and SMTP usage.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Phase 3 — Recommendation&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;Agent runs&amp;nbsp;recommend_target:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;HRPortal → MI_AppService&lt;/STRONG&gt;&amp;nbsp;(high confidence) — COM, registry, SMTP dependencies&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;PayrollAPI → AppService&lt;/STRONG&gt;&amp;nbsp;(high confidence) — no OS dependencies&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Generates&amp;nbsp;install.ps1&amp;nbsp;for HRPortal (SMTP + COM sections). Generates ARM template with registry adapter (Key Vault-backed) for HRPortal.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Phase 4 — Deployment Planning&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;Agent collects subscription/RG/region, validates PV4 availability. Packages both sites. Generates MigrationSettings.json with two plans:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;mi-plan-hrportal&amp;nbsp;(PremiumV4, IsCustomMode=true) — HRPortal&lt;/LI&gt;
&lt;LI&gt;std-plan-payrollapi&amp;nbsp;(PremiumV2) — PayrollAPI&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Phase 5 — Execution&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;Agent shows full summary with cost projection. You type "yes". Sites deploy. You get Azure URLs within minutes.&lt;/P&gt;
&lt;H2&gt;Prerequisites &amp;amp; Setup&lt;/H2&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Requirement&lt;/th&gt;&lt;th&gt;Purpose&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Windows Server with IIS&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Source server for discovery and packaging&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;PowerShell 5.1&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Runs migration scripts (ships with Windows)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Python 3.10+&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;MCP server runtime&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Administrator privileges&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Required for IIS discovery, packaging, and migration&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Azure subscription&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Target for deployment (execution phase only)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Azure PowerShell (Az&amp;nbsp;module)&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Deploy to Azure (execution phase only)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;&lt;A href="https://appmigration.microsoft.com/api/download/psscripts/AppServiceMigrationScripts.zip" target="_blank" rel="noopener"&gt;Migration Scripts ZIP&lt;/A&gt;&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Microsoft's PowerShell migration scripts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/migrate/appcat/dotnet" target="_blank" rel="noopener"&gt;AppCat CLI&lt;/A&gt;&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Source code analysis (optional)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;&lt;A href="https://pypi.org/project/mcp/" target="_blank" rel="noopener"&gt;FastMCP&lt;/A&gt;&lt;/STRONG&gt;&amp;nbsp;(mcp[cli]&amp;gt;=1.0.0)&lt;/td&gt;&lt;td&gt;MCP server framework&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;Data Flow &amp;amp; Artifacts&lt;/H2&gt;
&lt;P&gt;Every phase produces JSON artifacts that chain into the next phase:&lt;/P&gt;
&lt;PRE&gt;Phase 1: discover_iis_sites ──────────→ ReadinessResults.json
                                              │
Phase 2: assess_site_readiness ◄──────────────┘
         assess_source_code ──────────→ Assessment JSONs
                                              │
Phase 3: recommend_target ◄───────────────────┘
         generate_install_script ─────→ install.ps1
         generate_adapter_arm ────────→ mi-adapters-template.json
                                              │
Phase 4: package_site ◄───────────────────────┘
                      ────────────────→ PackageResults.json + site ZIPs
         generate_migration_settings ─→ MigrationSettings.json
                                              │
Phase 5: confirm_migration ◄──────────────────┘
         migrate_sites ───────────────→ MigrationResults.json
                                              │
                                              ▼
                                   Apps live on Azure
                                   (*.azurewebsites.net)&lt;/PRE&gt;
&lt;P&gt;Each artifact is inspectable, editable, and auditable — providing a complete record of what was assessed, recommended, and deployed.&lt;/P&gt;
&lt;H2&gt;Error Handling&lt;/H2&gt;
&lt;P&gt;The MCP server classifies errors into actionable categories:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Error&lt;/th&gt;&lt;th&gt;Cause&lt;/th&gt;&lt;th&gt;Resolution&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;ELEVATION_REQUIRED&lt;/td&gt;&lt;td&gt;Not running as Administrator&lt;/td&gt;&lt;td&gt;Restart VS Code / terminal as Admin&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;IIS_NOT_FOUND&lt;/td&gt;&lt;td&gt;IIS or WebAdministration module missing&lt;/td&gt;&lt;td&gt;Install IIS role + WebAdministration&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AZURE_NOT_AUTHENTICATED&lt;/td&gt;&lt;td&gt;Not logged into Azure PowerShell&lt;/td&gt;&lt;td&gt;Run&amp;nbsp;Connect-AzAccount&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SCRIPT_NOT_FOUND&lt;/td&gt;&lt;td&gt;Migration scripts path not configured&lt;/td&gt;&lt;td&gt;Run&amp;nbsp;configure_scripts_path&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SCRIPT_TIMEOUT&lt;/td&gt;&lt;td&gt;PowerShell script exceeded time limit&lt;/td&gt;&lt;td&gt;Check IIS server responsiveness&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;OUTPUT_NOT_FOUND&lt;/td&gt;&lt;td&gt;Expected JSON output wasn't created&lt;/td&gt;&lt;td&gt;Verify script execution succeeded&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;Conclusion&lt;/H2&gt;
&lt;P&gt;The IIS Migration MCP Server turns what used to be a multi-week, expert-driven project into a guided conversation. It combines Microsoft's battle-tested migration PowerShell scripts with AI orchestration that understands the nuances of Managed Instance on App Service — the provisioning split, the PV4 constraint, the adapter configurations, and the OS-level customizations.&lt;/P&gt;
&lt;P&gt;Whether you're migrating 1 site or 10, agentic migration reduces risk, eliminates guesswork, and produces auditable artifacts at every step. The human stays in control; the AI handles the complexity.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Get started&lt;/STRONG&gt;: Download the&amp;nbsp;&lt;A href="https://appmigration.microsoft.com/api/download/psscripts/AppServiceMigrationScripts.zip" target="_blank" rel="noopener"&gt;migration scripts&lt;/A&gt;, set up the MCP server, and ask&amp;nbsp;@iis-migrate&amp;nbsp;to discover your IIS sites. The agents will take it from there.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;This project is compatible with any MCP-enabled client: VS Code GitHub Copilot, Claude Desktop, Cursor, and more. The intelligence travels with the server, not the client.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 06 Apr 2026 23:58:35 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/agentic-iis-migration-to-managed-instance-on-azure-app-service/ba-p/4508969</guid>
      <dc:creator>Gaurav-Seth</dc:creator>
      <dc:date>2026-04-06T23:58:35Z</dc:date>
    </item>
    <item>
      <title>Azure Red Hat OpenShift: Managed Identity and Workload Identity now generally available</title>
      <link>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/azure-red-hat-openshift-managed-identity-and-workload-identity/ba-p/4504940</link>
      <description>&lt;P&gt;Azure Red Hat OpenShift now supports managed identities and workload identities as a generally available capability, so you can run OpenShift clusters and applications on Azure without long-lived service principal credentials.​&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;ARO, identity, and Azure governance&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;With GA support for managed identities and workload identities, Azure Red Hat OpenShift uses short‑lived credentials and least‑privilege access to help organizations strengthen their security posture. This approach reduces reliance on long‑lived credentials and overly broad permissions, supporting enterprise security requirements while improving how identity is managed for OpenShift workloads.&lt;/P&gt;
&lt;P&gt;As an Azure-native service, Azure Red Hat OpenShift also integrates directly with &lt;A href="https://learn.microsoft.com/en-us/entra/workload-id/workload-identities-overview" target="_blank" rel="noopener"&gt;Microsoft Entra workload identities&lt;/A&gt; and Azure RBAC, strengthening your overall security and identity management posture.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Platform identity: managed identities for ARO operators&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;At the platform layer, ARO now uses multiple user-assigned managed identities rather than a single service principal with broad rights. Each identity is mapped to a specific ARO component and associated with a dedicated built-in ARO role, so permissions are scoped according to least-privilege principles and aligned with Azure RBAC best practices.&lt;/P&gt;
&lt;P&gt;You can wire up this model in several ways: create the identities and role assignments up front and reference them during deployment, or use the Azure portal “all-in-one” experience to have the identities and assignments created for you as part of cluster creation. Clusters can be deployed using the Azure portal or ARM/Bicep templates, and the native&amp;nbsp;az aro&amp;nbsp;commands (available in Azure CLI version 2.84.0 or higher) provide a similar end-to-end experience for CLI-driven environments.&lt;/P&gt;
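&lt;P&gt;As a rough sketch, the bring-your-own-identities flow starts with pre-creating the identities. All resource and identity names below are illustrative, and the exact flags for attaching each identity to its ARO component vary by CLI version, so treat this as an outline and consult the az aro create reference (Azure CLI 2.84.0 or higher) for the authoritative syntax.&lt;/P&gt;

```shell
# Illustrative outline: pre-create user-assigned managed identities
# so they can be referenced at ARO cluster creation time.
# Resource and identity names are examples, not required values.
RG=my-aro-rg
LOCATION=eastus

az group create --name "$RG" --location "$LOCATION"

# One identity for the cluster itself, plus one per ARO platform component
# (cloud controller manager, ingress, disk/file CSI drivers, and so on).
for ID in aro-cluster aro-cloud-controller-manager aro-ingress \
          aro-disk-csi-driver aro-file-csi-driver; do
  az identity create --resource-group "$RG" --name "$ID"
done

# Each identity then gets a role assignment scoped to its component's needs,
# and the identities are referenced during 'az aro create' (see the Learn
# architecture guide for the full identity-to-role mapping).
```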
&lt;P&gt;For an architectural deep dive into operators, scopes, and role assignment patterns, see&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/openshift/howto-understand-managed-identities" target="_blank" rel="noopener"&gt;&lt;STRONG&gt;Understand managed identities in Azure Red Hat OpenShift&lt;/STRONG&gt;&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Application access: workload identity for Azure services&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Workload identity uses Kubernetes-native OIDC federation so pods can securely access an Azure managed identity, which remains the underlying identity governed by Microsoft Entra ID and Azure RBAC.&lt;/P&gt;
&lt;P&gt;For applications running on ARO, this capability provides&amp;nbsp;workload identity: a way for pods to obtain short-lived tokens for an Azure managed identity without storing secrets in the cluster. Using Microsoft Entra workload identities and OIDC federation, you bind a user-assigned managed identity to a Kubernetes service account; workloads using that service account automatically receive tokens for the associated identity at runtime.&lt;/P&gt;
&lt;P&gt;This enables fine-grained patterns: for example, granting a specific application read-only access to a single Key Vault, storage account, or Azure SQL database, without sharing credentials across namespaces or relying on a cluster-wide service principal. Enterprise teams can use this to connect AI and data workloads on ARO to services like Azure OpenAI, Azure SQL, or Azure Storage, giving each app just the access it needs for inference, data access, or logging while staying within standard Azure governance controls. The Learn guide,&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/openshift/howto-deploy-configure-application" target="_blank" rel="noopener"&gt;&lt;STRONG&gt;Deploy and configure an application using workload identity on an Azure Red Hat OpenShift managed identity cluster&lt;/STRONG&gt;&lt;/A&gt;, walks through the workflow end-to-end.&lt;/P&gt;
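&lt;P&gt;Concretely, the service-account binding described above can be sketched with the Azure CLI and kubectl. This follows the general flow of the Learn guide, but the resource names (my-aro-rg, kv-reader, my-app, my-namespace) and the Key Vault grant are illustrative assumptions, not values from this post.&lt;/P&gt;

```shell
# Sketch: bind a pod to an Azure managed identity via workload identity.
# All names are illustrative; OIDC_ISSUER_URL is your cluster's OIDC issuer.
RG=my-aro-rg
OIDC_ISSUER_URL=https://example-oidc-issuer   # replace with your cluster's issuer

# 1. A user-assigned managed identity for the application.
az identity create --resource-group "$RG" --name kv-reader

# 2. Federate it with the cluster's OIDC issuer and a service account subject.
az identity federated-credential create \
  --identity-name kv-reader \
  --resource-group "$RG" \
  --name my-app-federation \
  --issuer "$OIDC_ISSUER_URL" \
  --subject system:serviceaccount:my-namespace:my-app \
  --audiences api://AzureADTokenExchange

# 3. Annotate the Kubernetes service account with the identity's client ID.
CLIENT_ID=$(az identity show --resource-group "$RG" --name kv-reader \
  --query clientId -o tsv)
kubectl create serviceaccount my-app --namespace my-namespace
kubectl annotate serviceaccount my-app --namespace my-namespace \
  azure.workload.identity/client-id="$CLIENT_ID"

# 4. Least-privilege grant, e.g. read-only secrets on a single Key Vault.
az role assignment create \
  --assignee "$CLIENT_ID" \
  --role "Key Vault Secrets User" \
  --scope "/subscriptions/SUB_ID/resourceGroups/$RG/providers/Microsoft.KeyVault/vaults/my-vault"
```

&lt;P&gt;Pods running under the my-app service account (with the azure.workload.identity/use label, where the mutating webhook requires it) then receive short-lived tokens for kv-reader at runtime, with no secret stored in the cluster.&lt;/P&gt;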
&lt;P&gt;&lt;STRONG&gt;Existing preview clusters and how to start&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;If you deployed ARO clusters with managed identities during the preview, no changes are required: those clusters automatically transition to GA and are fully supported for production use, with no migration or redeployment needed. You can continue to upgrade them using the standard OpenShift mechanisms, following &lt;A href="https://learn.microsoft.com/en-us/azure/openshift/howto-upgrade-aro-openshift-cluster" target="_blank" rel="noopener"&gt;the managed identity upgrade guidance in the documentation&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Clusters built with the current service principal model continue to receive full support; however, there is not yet a migration path from service principals to managed identities. To adopt managed identities, deploy a new ARO cluster with managed identities enabled and migrate workloads to it.&lt;/P&gt;
&lt;P&gt;To get started with new clusters, begin with&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/openshift/howto-understand-managed-identities" target="_blank" rel="noopener"&gt;&lt;STRONG&gt;Understand managed identities in Azure Red Hat OpenShift&lt;/STRONG&gt;&lt;/A&gt; to review concepts and considerations, then create a cluster using the Azure portal, an ARM/Bicep template, or the ARO CLI. Joint Red Hat–Microsoft demos and videos provide an end-to-end view of the experience, from deploying a managed identity-enabled cluster through configuring workload identity for applications consuming Azure services.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Resources&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A href="https://interact.redhat.com/share/zMI66TmGGHDqWXMdo9Lk" target="_blank" rel="noopener"&gt;Interactive Demo - Create a managed identity Azure Red Hat OpenShift cluster&lt;/A&gt;&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Concepts / architecture&lt;/STRONG&gt;&lt;BR /&gt;Understand managed identities in Azure Red Hat OpenShift&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/openshift/howto-understand-managed-identities" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/openshift/howto-understand-managed-identities&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Cluster creation&lt;/STRONG&gt;&lt;BR /&gt;Create an Azure Red Hat OpenShift cluster with managed identities&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/openshift/howto-create-openshift-cluster" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/openshift/howto-create-openshift-cluster&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Applications / workload identity&lt;/STRONG&gt;&lt;BR /&gt;Deploy and configure an application using workload identity on an Azure Red Hat OpenShift managed identity cluster&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/openshift/howto-deploy-configure-application" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/openshift/howto-deploy-configure-application&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Automation (ARM/Bicep)&lt;/STRONG&gt;&lt;BR /&gt;Deploy an Azure Red Hat OpenShift cluster with an ARM template or Bicep&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/openshift/quickstart-openshift-arm-bicep-template" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/openshift/quickstart-openshift-arm-bicep-template&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Joint story from Red Hat&lt;/STRONG&gt;&lt;BR /&gt;Managed Identity and Workload Identity support in Azure Red Hat OpenShift&lt;BR /&gt;&lt;A href="https://www.redhat.com/en/blog/general-availability-managed-identity-and-workload-identity-microsoft-azure-red-hat-openshift" target="_blank" rel="noopener"&gt;https://www.redhat.com/en/blog/general-availability-managed-identity-and-workload-identity-microsoft-azure-red-hat-openshift&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Sat, 11 Apr 2026 23:03:17 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/apps-on-azure-blog/azure-red-hat-openshift-managed-identity-and-workload-identity/ba-p/4504940</guid>
      <dc:creator>MelanieKraintz007</dc:creator>
      <dc:date>2026-04-11T23:03:17Z</dc:date>
    </item>
  </channel>
</rss>

