azure container apps

239 Topics

Orchestrate Azure Container Apps Jobs with Apache Airflow
Azure Container Apps (ACA) Jobs are a great way to run work that starts, does something, and finishes: nightly batch, data processing, ETL, ML scoring, report generation. They scale to zero, bill per execution, and run any container you give them.

But the moment your "one job" becomes "a set of jobs that depend on each other," a gap appears:

How do I run twenty jobs in parallel, wait for all of them, then run one more job only if they all succeeded, and retry just the one that failed?

A single ACA Job can't express that on its own. What you're describing is an orchestrator, and the most widely adopted one in the data world is Apache Airflow.

This post introduces two open-source templates that connect the two, so Airflow becomes the brain and ACA Jobs become the muscle. Pick the one that matches what you already run:

• airflow-on-aca-jobs: you already have Airflow. Drop in an operator and point it at ACA Jobs. Host nothing new.
• airflow-hosted-on-aca: you don't have Airflow. Get a full one running on Azure Container Apps with one command.

Both use the same operator and the same DAGs, so you can start with one and move to the other later without rewriting your workflows.

See Airflow orchestrate real ACA Job executions with parallel fan-out, dependency ordering, and automatic retries.

Why ACA Jobs need an orchestrator

A plain ACA Job is great at one thing: run this container to completion, then stop. That covers a scheduled job or a one-off task perfectly. Real pipelines need more than that:

• Dependency ordering: step B runs only after step A succeeds.
• Parallel fan-out: launch one execution per file, per store, or per partition, all at once, then wait for the whole batch.
• Per-task retries: if one execution in a batch of fifty fails, retry just that one, not the other forty-nine.
• Backfills and scheduling: re-run yesterday's pipeline, or run every night with a full history of what happened.

These are the problems an orchestrator solves.
Instead of building that logic yourself, you let Airflow handle the graph, the scheduling, and the retries, while ACA Jobs run the compute. You get serverless, scale-to-zero workers, and you didn't have to stand up a scheduler to get them.

The operator that ties them together

Both templates ship the same small plugin: an Airflow operator called AzureContainerAppsJobOperator. In a DAG it looks like any other task:

    report_sales = AzureContainerAppsJobOperator(
        task_id="report_store_sales",
        subscription_id="{{ var.value.azure_subscription_id }}",
        resource_group="{{ var.value.aca_resource_group }}",
        job_name="{{ var.value.aca_job_name }}",
        image="python:3.12-slim",
        command=["python", "-c", MY_PROGRAM],
        env_vars={"STORE_NAME": "Seattle"},
        deferrable=True,
    )

A few things make this operator easy to work with:

• Per-execution overrides. It takes the ACA Job you point it at and overrides the image, command, args, and env_vars for that run. You can drive many different workloads from a single ACA Job definition, and you don't need to build or push a custom image just to try something. The example above runs the stock python:3.12-slim image with an inline program.
• Deferrable by default. With deferrable=True, Airflow frees its worker slot while the ACA Job runs and resumes when it finishes. That means your fan-out width is bounded by ACA, not by how many Airflow workers you have. You can launch dozens of parallel executions cheaply.
• No secrets required. Authentication resolves in a sensible order: an Airflow Connection if you set one, otherwise an AZURE_ACCESS_TOKEN environment variable, otherwise DefaultAzureCredential (managed identity). In Azure, the hosted template uses a managed identity so nothing sensitive is stored in Airflow at all.

Because both templates share this operator, a DAG written for one runs unchanged on the other.
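The authentication order described above can be sketched in a few lines of Python. This is only an illustration of the precedence, not the operator's actual code: the function name resolve_auth_source, its parameters, and the returned labels are hypothetical, and only the AZURE_ACCESS_TOKEN variable name and the three-step order come from the post.

```python
import os
from typing import Mapping, Optional


def resolve_auth_source(
    connection_id: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a label for the credential source, in the order the post describes.

    Hypothetical helper for illustration; the real operator may differ in detail.
    """
    env = os.environ if environ is None else environ

    # 1. An Airflow Connection wins if one was configured.
    if connection_id:
        return "airflow_connection"

    # 2. Otherwise fall back to a token supplied through the environment.
    if env.get("AZURE_ACCESS_TOKEN"):
        return "env_access_token"

    # 3. Otherwise DefaultAzureCredential (for example, a managed identity).
    return "default_azure_credential"


# Each call passes an explicit environment so the result does not depend on the host.
print(resolve_auth_source("my_azure_conn", {}))
print(resolve_auth_source(None, {"AZURE_ACCESS_TOKEN": "placeholder-token"}))
print(resolve_auth_source(None, {}))
```

Keeping the fallback to DefaultAzureCredential last is what would let the same DAG run with a Connection on an existing Airflow and with a managed identity on the hosted template.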
Option 1: Bring your own Airflow (host nothing)

Choose airflow-on-aca-jobs if you already run Airflow: Azure Managed Airflow, MWAA, Astronomer, or your own deployment. You keep that Airflow exactly as it is and simply teach it to talk to ACA Jobs.

    +------------------------------------------+
    |  Your Airflow (you host it, unchanged)   |
    |  runs AzureContainerAppsJobOperator      |
    +------------------------------------------+
                         |
                         |  ACA Jobs REST API
                         v
    +------------------------------------------+
    |  ACA Job (Azure Container Apps)          |
    |                                          |
    |  store 1 | store 2 | ... | store N       |
    |  parallel executions -> scale to zero    |
    +------------------------------------------+

Your existing Airflow runs the operator; ACA Jobs run the work. You host nothing new.

Adoption is three small steps:

1. Copy the operator into your Airflow's plugins/ folder.
2. Add a DAG that uses AzureContainerAppsJobOperator.
3. Set three Airflow Variables so the operator knows which job to drive:

| Airflow Variable | Value |
|---|---|
| azure_subscription_id | your subscription id |
| aca_resource_group | the resource group holding the ACA Job |
| aca_job_name | the ACA Job name |

That's the whole integration. Nothing new to host, no extra scheduler or database, no custom image. ACA Jobs just become another task type Airflow can call.

If you want a job to point at first, the template includes an Azure Developer CLI (azd) deployment that stands up a sample ACA Job for you:

    git clone https://github.com/hetvip2/airflow-on-aca-jobs
    cd airflow-on-aca-jobs
    azd up   # deploys a sample ACA Job, prints its resource group + name

Then copy airflow/plugins/ and airflow/dags/ into your Airflow, set the three Variables, and trigger the DAG.

Option 2: Airflow hosted on ACA (turnkey)

Choose airflow-hosted-on-aca if you don't already have an orchestrator and want one running next to your jobs.
One command provisions the whole thing on Azure Container Apps:

                      azd up
                         |
                         v
    +------------------------------------------+
    |  Airflow control plane on ACA            |
    |  web | scheduler | triggerer             |
    |  Postgres (metadata) + Azure Files (dags)|
    |  Managed Identity - no secrets stored    |
    +------------------------------------------+
                         |
                         |  ACA Jobs REST API
                         v
    +------------------------------------------+
    |  ACA Job (Azure Container Apps)          |
    |                                          |
    |  store 1 | store 2 | ... | store N       |
    |  parallel executions -> scale to zero    |
    +------------------------------------------+

One command deploys the whole Airflow control plane on ACA, right next to the jobs it drives.

    git clone https://github.com/hetvip2/airflow-hosted-on-aca
    cd airflow-hosted-on-aca
    azd env new my-airflow
    azd up   # prints your Airflow URL when it finishes

azd up deploys a complete, working Airflow control plane on ACA:

• airflow-web, airflow-scheduler, and airflow-triggerer running as Container Apps on LocalExecutor, so there's no Celery or Redis to operate.
• A Postgres metadata database.
• A user-assigned managed identity with permission to call the ACA Jobs API, so the operator authenticates with no secrets stored in Airflow.
• A sample ACA Job for Airflow to drive out of the box.

Your DAGs and plugins live on a mounted Azure Files share, so you ship new workflows by re-uploading files rather than rebuilding an image:

    cp my_dag.py airflow/dags/
    azd hooks run postprovision   # uploads dags + plugins to the share

Airflow picks up the change within a minute.

You now own a real orchestrator, hosted serverlessly on the same platform as your jobs.

Which one should you pick?
| | Option 1: airflow-on-aca-jobs | Option 2: airflow-hosted-on-aca |
|---|---|---|
| Best when | You already run Airflow | You don't have Airflow yet |
| Setup | Copy the operator + a DAG + 3 Variables | azd up (one command) |
| Who hosts Airflow | You do (unchanged) | Azure Container Apps |
| Authentication | Connection or short-lived token | Managed identity, nothing stored |
| Ownership | Lowest: nothing new to run | Turnkey: a full orchestrator you own |

The important part: the workload never changes. The same DAG and the same operator drive the same ACA Job executions in both. Start wherever you are today, and switch later with zero changes to your pipelines.

See it end to end

Picture a retailer that wants one number every night: total sales across all stores. Each store reports its own sales as a separate ACA Job execution, all running in parallel. When every store is in, a final job adds them into the company total.

That one workflow exercises exactly what a plain Job can't do alone:

• parallel fan-out: one ACA Job execution per store, all at once
• dependency ordering: the roll-up runs only after every store reports
• per-task retries: if a store's execution fails, Airflow retries just that store, and the nightly total still lands

In Airflow's Graph view you watch the store tasks light up together, then the roll-up run last. In the Azure portal you watch real executions appear under your ACA Job and scale back to zero when they finish. Same job, same DAG, whichever template you chose.

Call to action

If you run batch, ETL, or any multi-step work on Azure Container Apps Jobs, give one of these templates a try:

• Already have Airflow? Start with airflow-on-aca-jobs.
• Need an orchestrator? Start with airflow-hosted-on-aca.

Both are open source, deploy with azd up, and share the same operator so you can move between them freely. Try them out and let us know what you orchestrate.
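To make the three behaviours in the retailer walkthrough concrete, here is a plain-Python simulation. It is not Airflow code and does not call ACA: threads stand in for parallel job executions, and the store names, sales figures, and the "fails on its first attempt" rule are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical data: nightly sales per store, and a store whose first attempt fails.
STORE_SALES: Dict[str, int] = {"Seattle": 1200, "Portland": 800, "Denver": 950}
FLAKY_STORES = {"Portland"}


def report_store_sales(store: str, attempt: int) -> int:
    """Stand-in for one job execution that reports a single store's sales."""
    if store in FLAKY_STORES and attempt == 1:
        raise RuntimeError(f"{store}: transient failure on attempt {attempt}")
    return STORE_SALES[store]


def run_with_retries(store: str, max_attempts: int = 2) -> Tuple[str, int, int]:
    """Per-task retry: only the failing store is re-run, never the whole batch."""
    for attempt in range(1, max_attempts + 1):
        try:
            return store, report_store_sales(store, attempt), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")


def nightly_total(stores: List[str]) -> Tuple[int, Dict[str, int]]:
    # Fan-out: one worker per store, all started together.
    with ThreadPoolExecutor(max_workers=len(stores)) as pool:
        # Dependency ordering: list() blocks until every store has reported,
        # and re-raises if any store exhausted its retries, so the roll-up
        # below never runs on a partial batch.
        results = list(pool.map(run_with_retries, stores))
    total = sum(sales for _, sales, _ in results)
    attempts = {store: used for store, _, used in results}
    return total, attempts


total, attempts = nightly_total(list(STORE_SALES))
print(total, attempts)
```

In the real templates Airflow plays the role of the thread pool and retry loop, and each call to report_store_sales corresponds to a separate ACA Job execution.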
hetvip
Jul 17, 2026 | Apps on Azure Blog
Staying in the flow: SleekFlow and Azure turn customer conversations into conversions
A customer adds three items to their cart but never checks out. Another asks about shipping, gets stuck waiting eight minutes, only to drop the call. A lead responds to an offer but is never followed up with in time. Each of these moments represents lost revenue, and they happen to businesses every day.

SleekFlow was founded in 2019 to help companies turn those almost-lost-customer moments into connection, retention, and growth. Today we serve more than 2,000 mid-market and enterprise organizations across industries including retail and e-commerce, financial services, healthcare, travel and hospitality, telecommunications, real estate, and professional services. In total, those customers rely on SleekFlow to orchestrate more than 600,000 daily customer interactions across WhatsApp, Instagram, web chat, email, and more.

Our name reflects what makes us different. Sleek is about unified, polished experiences: consolidating conversations into one intelligent, enterprise-ready platform. Flow is about orchestration: AI and human agents working together to move each conversation forward, from first inquiry to purchase to renewal.

The drive for enterprise-ready agentic AI

Enterprises today expect always-on, intelligent conversations, but delivering that at scale proved daunting. When we set out to build AgentFlow, our agentic AI platform, we quickly ran into familiar roadblocks: downtime that disrupted peak-hour interactions, vector search delays that hurt accuracy, and costs that ballooned under multi-tenant workloads. Development slowed from limited compatibility with other technologies, while customer onboarding stalled without clear compliance assurances.

To move past these barriers, we needed a foundation that could deliver the performance, trust, and global scale enterprises demand.

The platform behind the flow: How Azure powers AgentFlow

We chose Azure because building AgentFlow required more than raw compute power.
Chatbots built on a single-agent model often stall out. They struggle to retrieve the right context, they miss critical handoffs, and they return answers too slowly to keep a customer engaged. To fix that, we needed an ecosystem capable of supporting a team of specialized AI agents working together at enterprise scale.

Azure Cosmos DB provides the backbone for memory and context, managing short-term interactions, long-term histories, and vector embeddings in containers that respond in 15–20 milliseconds. Our agents use Azure OpenAI models in Azure AI Foundry to understand and generate responses natively in multiple languages. Whether in English, Chinese, or Portuguese, the responses feel natural and aligned with the brand.

Semantic Kernel acts as the conductor, orchestrating multiple agents, each of which retrieves the necessary knowledge and context, including chat histories, transactional data, and vector embeddings, directly from Azure Cosmos DB. For example, one agent could be retrieving pricing data, another summarizing it, and a third preparing it for a human handoff.

The result is not just responsiveness but accuracy. A telecom provider can resolve a billing question while surfacing an upsell opportunity in the same dialogue. A financial advisor can walk into a call with a complete dossier prepared in seconds rather than hours. A retailer can save a purchase by offering an in-stock substitute before the shopper abandons the cart. Each of these conversations is different, yet the foundation is consistent on AgentFlow.

Fast, fluent, and focused: Azure keeps conversations moving

Speed is the heartbeat of a good conversation. A delayed answer feels like a dropped call, and an irrelevant one breaks trust. For AgentFlow to keep customers engaged, every operation behind the scenes has to happen in milliseconds. A single interaction can involve dozens of steps.
One agent pulls product information from embeddings, another checks it against structured policy data, and a third generates a concise, brand-aligned response. If any of these steps lag, the dialogue falters. On Azure, they don’t.

Azure Cosmos DB manages conversational memory and agent state across dedicated containers for short-term exchanges, long-term history, and vector search. Sharded DiskANN indexing powers semantic lookups that resolve in the 15–20 millisecond range, fast enough that the customer never feels a pause. Microsoft’s Phi-4 model, together with Azure OpenAI models in Foundry Models such as o3-mini and o4-mini, provides the reasoning. Azure Container Apps scales elastically, so performance holds steady during event-driven bursts, such as campaign broadcasts that can push the platform from a few to thousands of conversations per minute, and during daily peak-hour surges.

To support that level of responsiveness, we run Azure Container Apps on the Pay-As-You-Go consumption plan, using KEDA-based autoscaling to expand from five idle containers to more than 160 within seconds. Meanwhile, Microsoft Orleans coordinates lightweight in-memory clustering to keep conversations sleek and flowing.

The results are tangible. Retrieval-augmented generation recall improved from 50 to 70 percent. Execution speed is about 50 percent faster. For SleekFlow’s customers, that means carts are recovered before they’re abandoned, leads are qualified in real time, and support inquiries move forward instead of stalling out.

With Azure handling the complexity under the hood, conversations flow naturally on the surface, and that’s what keeps customers engaged.

Secure enough for enterprises, human enough for customers

AgentFlow was built with security-by-design as a first principle, giving businesses confidence that every interaction is private, compliant, and reliable. On Azure, every AI agent operates inside guardrails enterprises can depend on.
Azure Cosmos DB enforces strict per-tenant isolation through logical partitioning, encryption, and role-based access control, ensuring chat histories, knowledge bases, and embeddings remain auditable and contained. Models deployed through Azure AI Foundry, including Azure OpenAI and Microsoft Phi, process data entirely within SleekFlow’s Azure environment, and that data is never used to train public models, with activity logged for transparency. And Azure’s certifications, including ISO 27001, SOC 2, and GDPR, are backed by continuous monitoring and regional data residency options, proving compliance at a global scale.

But trust is more than a checklist of certifications. AgentFlow brings human-like fluency and empathy to every interaction, powered by Azure OpenAI running with high token-per-second throughput so responses feel natural in real time.

Quality control isn’t left to chance. Human override workflows are orchestrated through Azure Container Apps and Azure App Service, ensuring AI agents can carry conversations confidently until they’re ready for human agents.

Enterprises gain the confidence to let AI handle revenue-critical moments, knowing Azure provides the foundation and SleekFlow provides the human-centered design.

Shaping the next era of conversational AI on Azure

The benefits of Azure show up not only in customer conversations but also in the way our own teams work. Faster processing speeds and high token-per-second throughput reduce latency, so we spend less time debugging and more time building. Stable infrastructure minimizes downtime and troubleshooting, lowering operational costs.

That same reliability and scalability have transformed the way we engineer AgentFlow. AgentFlow started as part of our monolithic system. Shipping new features used to take about a month of development and another week of heavy testing to make sure everything held together.
After moving AgentFlow to a microservices architecture on Azure Container Apps, we can now deploy updates almost daily with no downtime or customer impact, thanks to native support for rolling updates and blue-green deployments.

This agility is what excites us most about what's ahead. With Azure as our foundation, SleekFlow is not simply keeping pace with the evolution of conversational AI; we are shaping what comes next. Every interaction we refine, every second we save, and every workflow we streamline brings us closer to our mission: keeping conversations sleek, flowing, and valuable for enterprises everywhere.
mtoiba
Jul 17, 2026 | Customer Innovation Blog
Azure Container Apps Express for Shipping Container Apps Fast
ACA Express Apps are a strong fit for teams that need to ship quickly and can't afford long platform setup cycles. This includes startups, internal platform teams, and product groups deploying APIs, web apps, or agent endpoints that scale with uneven demand. If the priority is fast path-to-production, predictable wake-up behavior, and minimal infrastructure overhead, this model is likely the right choice.

To put real numbers behind that, I built a live demo that races Express against a Consumption environment on the same app. The measurements below come from that demo, not from a spec sheet.

MicroVMs make cold starts practical

Cold start delays usually come from rebuilding runtime state whenever an app wakes up. ACA Express Apps reduce that overhead with MicroVM-based startup paths built for fast boot and isolation. The result is faster instance readiness without trading off security.

The gap shows up clearly when both apps have scaled all the way to zero. Waking from a genuine cold start, Express comes back in about 1.5 seconds. The same app in a Consumption environment takes about 20 seconds to answer the first request. Both were measured live in the browser, from request to first response.

Disk and memory state restore is the speed multiplier

State restoration skips the app's internal boot sequence entirely. Instead of replaying the same initialization work on every start, ACA Express Apps can restore disk and memory state so the app starts closer to ready. That reduces time-to-first-request and smooths scale events, especially for framework-heavy workloads. It's also what lets scale-to-zero stay practical: the app costs nothing while idle, but the wake-up penalty stays in the low single-digit seconds instead of the tens of seconds you'd otherwise pay.

Environmentless changes the deployment experience

Skipping the environment setup completely changes the deployment workflow.
Teams can ship the container app without first managing environment sprawl, while still getting the runtime foundations they need. For fast-moving teams, that means less setup overhead and a shorter path to production.

You can see how little there is to fill in. Creating an Express app is a single short form. There is no environment to stand up first. And once it's created, the manage view gives you the live URL, status, and the basics you need to operate it.

The numbers, side by side

Everything below was measured on the same container image, in the West Central US region.

| What's measured | Express | Consumption |
|---|---|---|
| Cold start from zero (request to first response) | ~1.5 s | ~20 s |
| Environment provisioning | ~14 s | ~120 s |
| First-time deploy (environment + app, zero to live URL) | ~52 s | ~166 s |
| App deploy only (environment already exists) | ~30 s | ~30 s |

Express is much faster on the two steps that build infrastructure from scratch: cold start and environment provisioning. Once an environment already exists, the two are about the same. Express isn't a different app runtime; it's the same platform with the first-time setup cost stripped down.

Get started

Express is in public preview. You can have a container on a live URL in the time it takes to read this post.

📖 Azure Container Apps Express overview: concepts, capabilities, and the current feature support matrix.
🚀 Create your first Express app: the CLI commands and portal steps to get an app running.
🛠️ New Container Apps portal: create and manage Express apps in the streamlined UI.
🧪 Test Express apps locally: validate your container before you deploy.
❓ Express FAQ: preview status, limits, regions, and how Express relates to standard Container Apps.

👉 Deploy an Express app · Read the docs · Browse the FAQ

When speed matters, ACA Express is the best tool for deploying containers. It skips the platform setup delays without sacrificing reliability under load.
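The side-by-side timings can be turned into speed-up ratios with a few lines of arithmetic. The sketch below only restates the approximate figures from the table; the dictionary layout and the speedup helper are illustrative, and because the inputs are rounded demo measurements the ratios are rough as well.

```python
# Timings in seconds as (Express, Consumption), copied from the table above.
MEASUREMENTS = {
    "Cold start from zero": (1.5, 20.0),
    "Environment provisioning": (14.0, 120.0),
    "First-time deploy": (52.0, 166.0),
    "App deploy only": (30.0, 30.0),
}


def speedup(express_s: float, consumption_s: float) -> float:
    """How many times faster Express was, rounded to one decimal place."""
    return round(consumption_s / express_s, 1)


SPEEDUPS = {name: speedup(*pair) for name, pair in MEASUREMENTS.items()}

for name, ratio in SPEEDUPS.items():
    print(f"{name}: {ratio}x")
```

On these numbers the cold start is roughly 13 times faster and environment provisioning roughly 9 times faster, while deploying into an existing environment shows no difference.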
hetvip
Jul 06, 2026 | Apps on Azure Blog
Auditing and Telemetry for the Agent Governance Toolkit - Getting Started with .NET Core
We've entered an era where AI agents autonomously invoke tools — reading and writing files, calling APIs, querying databases. Convenient as this is, without a mechanism to control who can call what, and under what conditions, you can't put it into production. The Agent Governance Toolkit (AGT), open-sourced by Microsoft, is exactly the toolkit for embedding that "gatekeeper" into AI agents. This article walks through getting started with AGT in .NET (C#), based on the following GitHub repository sample.
daisami
Jul 02, 2026 | Apps on Azure Blog
Advanced Container Apps Networking: VNet Integration and Centralized Firewall Traffic Logging
Azure community, I recently documented a networking scenario relevant to Azure Container Apps environments where you need to control and inspect application traffic using a third-party network virtual appliance.

The article walks through a practical deployment pattern:

• Integrate your Azure Container Apps environment with a Virtual Network.
• Configure user-defined routes (UDRs) so that traffic from your container workloads is directed toward a firewall appliance before reaching external networks or backend services.
• Verify actual traffic paths using firewall logs to confirm that routing policies are effective.

This pattern is helpful for organizations that must enforce advanced filtering, logging, or compliance checks on container egress/ingress traffic, going beyond what native Azure networking controls provide. It also complements Azure Firewall and NSG controls by introducing a dedicated next-generation firewall within your VNet.

If you’re working with network control, security perimeters, or hybrid network architectures involving containerized workloads on Azure, you might find it useful.

Read the full article on my blog
omidvahedv
Jun 29, 2026 | Azure
IPv6 Dual-Stack Endpoints for Azure Container Registry (Public Preview)
By Johnson Shi, Aviral Takkar, Bin Du

Introduction

Two of the most common networking questions we hear from teams running Azure Container Registry (ACR) are:

• "Can my registry serve clients on IPv6 networks?" Teams operating IPv6-only or dual-stack networks need their container registry reachable over IPv6.
• "How do we start moving registry traffic toward IPv6 without breaking anything?" Organizations guarding against IPv4 address exhaustion, or operating under IPv6 transition mandates, want a migration path that doesn't disrupt existing IPv4 clients.

Today, we're announcing the public preview of IPv6 dual-stack endpoints for Azure Container Registry for public endpoints and firewall rules, with IPv6 over private endpoints planned for GA. Set your registry's endpoint protocol to IPv4AndIPv6, and its endpoints become reachable over both IPv4 and IPv6, so IPv4-only, dual-stack, and IPv6-capable clients all connect to the same registry, each over whichever protocol their network stack selects.

Key Takeaways

• ACR registries now support an endpointProtocol setting with two values: IPv4 (default) and IPv4AndIPv6 (dual stack, preview).
• Dual stack is additive: your registry continues serving IPv4 clients exactly as before. There is no IPv6-only mode.
• Dual stack requires dedicated data endpoints to be enabled (--data-endpoint-enabled true), and dedicated data endpoints require the Premium SKU. The service enforces this requirement.
• You can enable it today with Azure CLI 2.87.0 via az acr update --endpoint-protocol IPv4AndIPv6.
• FQDN-based client firewall rules keep working unchanged; IP-based allowlists need to account for IPv6 traffic.
• Limitation: This public preview covers IPv6 for the registry's public endpoints and firewall rules only. IPv6 over private endpoints is planned for a future release.
• Limitation: ACR Tasks isn't supported on a registry that has IPv6 dual-stack enabled.
ACR Tasks does not work when the endpoint protocol is IPv6 dual-stack, including quick builds (with az acr build) and quick task runs (with az acr run). Support is planned for a future release.

How to enable it

On an existing registry (Azure CLI 2.87.0 or later)

Dual stack requires dedicated data endpoints, so enable both in a single update:

    az acr update --name <your-registry> --data-endpoint-enabled true --endpoint-protocol IPv4AndIPv6

If dedicated data endpoints are already enabled, set the endpoint protocol on its own:

    az acr update --name <your-registry> --endpoint-protocol IPv4AndIPv6

Verify the configuration:

    az acr show --name <your-registry> --query "{endpointProtocol:endpointProtocol, dataEndpointEnabled:dataEndpointEnabled}"

    {
      "dataEndpointEnabled": true,
      "endpointProtocol": "IPv4AndIPv6"
    }

Note: If your clients sit behind a firewall and you're enabling dedicated data endpoints for the first time, add firewall rules for <your-registry>.<region>.data.azurecr.io before enabling. Switching from *.blob.core.windows.net to dedicated data endpoints changes where layer blobs are downloaded from. See Dedicated data endpoints for details.

Reverting to IPv4

Dual stack is reversible at any time:

    az acr update --name <your-registry> --endpoint-protocol IPv4

Reverting the endpoint protocol leaves dedicated data endpoints enabled; disable them separately if desired.

Scope of this preview

This public preview enables IPv6 for the registry's public endpoints: the login server, dedicated data endpoints, and regional endpoints (if enabled).

IPv6 over private endpoints isn't part of this preview. Support is planned for a future release. Until then, registries reached through a private endpoint continue to use IPv4.

Additionally, ACR Tasks, including `az acr build` and `az acr run`, is not supported on dual-stack registries in the public preview. Support is planned for a future release.
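The verification step in the enablement walkthrough can also be scripted. The sketch below assumes you have captured the JSON printed by the az acr show query shown there; the function name dual_stack_problems and its messages are hypothetical, and only the two field names and their expected values come from the post.

```python
import json
from typing import List


def dual_stack_problems(show_output: str) -> List[str]:
    """Return a list of problems found in the `az acr show --query ...` JSON.

    An empty list means the registry reports dual stack with dedicated
    data endpoints enabled. Hypothetical helper for illustration.
    """
    config = json.loads(show_output)
    problems = []
    if config.get("endpointProtocol") != "IPv4AndIPv6":
        problems.append("endpointProtocol is not IPv4AndIPv6")
    if config.get("dataEndpointEnabled") is not True:
        problems.append("dedicated data endpoints are not enabled")
    return problems


# The sample output shown in the post.
SAMPLE = '{"dataEndpointEnabled": true, "endpointProtocol": "IPv4AndIPv6"}'
print(dual_stack_problems(SAMPLE))
```

A check like this could run after a deployment to confirm the setting took effect, or after a revert to confirm the registry is back on IPv4.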
Requirements and how features compose

| Requirement | Why |
|---|---|
| Premium SKU | Dedicated data endpoints are a Premium feature. |
| Dedicated data endpoints enabled | IPv4AndIPv6 requires dataEndpointEnabled: true; the service rejects the setting otherwise. |
| Azure CLI 2.87.0+ | Adds --endpoint-protocol to az acr update. |

For geo-replicated registries, the endpoint protocol is a registry-level setting, and dedicated data endpoints exist in every replica region.

Firewall guidance: rules based on registry FQDNs (the login server, dedicated data endpoints, and regional endpoints, if enabled) continue to work unchanged for dual-stack registries; only IP-address-based allowlists need updating for IPv6.

To learn more, see IPv6 dual-stack endpoints in Azure Container Registry (preview) and the ACR endpoint reference. If you have further questions about IPv6 dual-stack endpoints or dedicated data endpoints, reach out to us on the Azure Container Registry GitHub repository or file feedback through the Azure portal.
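To illustrate why IP-address-based allowlists need attention once a registry is dual stack, here is a minimal sketch using Python's standard ipaddress module. The networks are documentation-only placeholder ranges (203.0.113.0/24 and 2001:db8::/32 space), not real client or ACR ranges, and the is_allowed helper is hypothetical.

```python
import ipaddress
from typing import Sequence, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Placeholder ranges reserved for documentation; substitute your own.
IPV4_ONLY_ALLOWLIST: Sequence[Network] = [
    ipaddress.ip_network("203.0.113.0/24"),
]
DUAL_STACK_ALLOWLIST: Sequence[Network] = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8:1234::/48"),
]


def is_allowed(client_ip: str, networks: Sequence[Network]) -> bool:
    """True if the address falls inside any allowlisted network of the same IP version."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address.version == network.version and address in network
        for network in networks
    )


# An IPv6 client is rejected by an allowlist that only lists IPv4 ranges...
print(is_allowed("2001:db8:1234::10", IPV4_ONLY_ALLOWLIST))
# ...and accepted once the matching IPv6 range is added.
print(is_allowed("2001:db8:1234::10", DUAL_STACK_ALLOWLIST))
```

Rules keyed on FQDNs avoid this bookkeeping, which is consistent with the guidance above that they keep working unchanged.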
johnsonshi_msft
Jun 24, 2026 | Apps on Azure Blog
VNet integration for Azure SRE Agent (preview)
For many production systems, the logs, databases, private endpoints, repositories, and runbooks an SRE Agent needs to do its job are behind network boundaries your security team already governs. VNet integration for Azure SRE Agent, now in preview, puts the agent's outbound traffic under those same controls (your virtual network, your NSG rules, your private DNS) so it reaches only what your network allows.

The principle is one your security team already applies to every other workload: a component's network access shouldn't depend on the component behaving correctly. Identity governs what the agent can reach. Permissions and hooks shape what it does within reach. The network sits beneath both: it blocks any request to a destination you haven't allowed no matter what the agent decides.

Why egress control matters

Two reasons.

First, the agent reads sensitive things by design. Inspecting logs, code, configuration, and internal systems is the whole point during an incident, which means you have to decide where that data can go. Open egress gives that data a path out of your network, a risk you wouldn't accept for any other production-adjacent workload.

Second, it reasons over text it didn't write (logs, issue descriptions, tool output), which is how prompt injection gets in. Handling that is partly model safety, and Azure SRE Agent runs under Microsoft's Responsible AI standard with safety work from OpenAI and Anthropic. Network controls add another layer: an instruction that tries to reach a destination you haven't allowed can't run, because the network blocks it.

For example, an agent investigating an outage might query Log Analytics, read deployment configuration, and call an internal runbook, all private resources. With VNet integration, those calls follow the routes, DNS, and firewall rules your workloads already use. A request to an external endpoint you haven't allowed fails at the network boundary.
It doesn't depend on the model recognizing the risk and refusing; the network stops it either way.

Choose an egress mode

Azure SRE Agent has three egress modes, and you don't have to start at the strongest.

• Unrestricted: all outbound traffic is allowed.
• Limited: deny all outbound, allow an explicit list of hosts. Gives you host-level control without setting up a full VNet.
• Azure VNet: outbound traffic goes through a delegated subnet in your network, with your NSG rules and private DNS applied. The recommended mode for production and regulated workloads.

How Azure VNet mode works

Outbound traffic takes one of two paths, and every call takes exactly one.

Your VNet. Everything not placed on the managed path goes through a delegated subnet in your own network, where your NSG rules, private DNS, and firewall all apply. The agent is just another workload on that subnet, so it can reach what the subnet can reach: databases behind private endpoints, internal services, monitoring stores, and key vaults, the parts of production that aren't reachable from the public internet. The resources that matter most during an incident are usually the private ones. If your network connects to on-premises over ExpressRoute or VPN, the agent can reach those systems too, as long as your existing routes and rules allow it.

The managed infra path. Some destinations go through Azure SRE Agent's managed infrastructure network instead: platform services the agent needs, plus optional categories you turn on (package registries, code repositories, and remote MCP servers). This path skips your VNet, so your NSG rules and Firewall Policies don't apply to it. Treat it as a deliberate exception, used only where you need it.

Why public services start on the managed path

Public services are hard to allow by IP address. GitHub, PyPI, npm, NuGet, apt, and the container registries run on large, changing IP ranges, and they don't map to a single Azure service tag.
If your NSG filters by IP and port, keeping those lists up to date is constant work, and when a list falls behind, the agent can't pull a package or read a repository - and an investigation stalls on a networking problem that has nothing to do with the incident.

Each category has a toggle: package registries (PyPI, npm, NuGet, apt), code repositories (GitHub, GitHub Enterprise, Azure DevOps), remote MCP servers, and a list of additional hostnames. Starting with these on the managed path keeps the agent working reliably without maintaining an IP allowlist. For build-time dependencies, that's usually fine.

If you want this traffic inspected too, the next step is name-based (FQDN) egress filtering in your own network. Once your firewall can allow github.com and pypi.org by name, you can move these categories off the managed path and route them through your VNet instead.

Configure it

Two decisions: the subnet, and what (if anything) uses the bypass, meaning the managed path.

1. Navigate to Settings > Workspace Configuration > Network.
2. Choose Azure VNet as the egress mode.
3. Select a subnet that is /27 or larger and delegated to `Microsoft.App/environments`.
4. Decide which categories, if any, use the bypass.
5. Restrict who can change the egress mode and bypass toggles. These settings widen or narrow the agent's reach, so govern them like any production network control.
6. Test the outbound behavior before using the agent with production data.

A reasonable setup for most enterprises during preview: use Azure VNet mode, keep package registries and code repositories on the bypass if you need reliable access to them, and route everything else through your VNet. Stricter environments can turn those categories off and rely on their own name-based firewall rules.

What it doesn't cover yet

VNet integration is in preview, with two limitations to know. It covers outbound traffic only - reaching the agent privately from inside your network isn't part of this preview.
And connector traffic still routes over the public internet; the governance and credential isolation in Connectors V2 still apply. Use VNet integration for outbound control of the agent workspace, and combine it with identity, RBAC, tool permissions, hooks, and connector governance for a complete set of controls.

Where it fits

VNet integration doesn't replace identity, RBAC, tool permissions, or connector governance. It controls where traffic can go. The agent still needs the right identity and permissions to access a resource in the first place.

• Identity is the foundation: your RBAC assignments decide what the agent can reach.
• Permissions and hooks shape what it does within reach: allow/ask/deny rules control what runs, and hooks let you inspect or change a tool call before it runs.
• VNet integration sits underneath, controlling where traffic can go no matter what the agent tries to do.

You want the agent to be capable. You also want a boundary that holds whether or not it is.

Get started

• Create an SRE Agent - https://aka.ms/sreagent
• Documentation - https://aka.ms/sreagent/newdocs
• Recipes - https://aka.ms/sreagent/recipes
• Build 2026 Announcement - https://aka.ms/Build26/blog/SREAgent
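
As a rough illustration of the two outbound paths described under "How Azure VNet mode works", the sketch below models the decision as a pure function: a destination goes to the managed path only if it matches an enabled bypass category or an additional hostname, and to the VNet otherwise. Everything here is an assumption made for illustration: `EgressConfig`, `choose_path`, the category keys, and the host patterns are hypothetical and are not the product's configuration schema or its real host lists. The platform services that always use the managed path are not modeled.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

# Hypothetical host patterns per bypass category (illustrative only;
# the real lists behind each toggle are not given in the post).
CATEGORY_HOSTS = {
    "package_registries": ["pypi.org", "*.pypi.org", "registry.npmjs.org"],
    "code_repositories": ["github.com", "*.github.com", "dev.azure.com"],
}


@dataclass
class EgressConfig:
    """Hypothetical stand-in for the bypass toggles of Azure VNet mode."""
    enabled_categories: set = field(default_factory=set)
    additional_hostnames: list = field(default_factory=list)


def choose_path(host: str, config: EgressConfig) -> str:
    """Return 'managed' if the host matches an enabled bypass entry, else 'vnet'."""
    patterns = list(config.additional_hostnames)
    for category in config.enabled_categories:
        patterns.extend(CATEGORY_HOSTS.get(category, []))
    host = host.lower()
    # Every call takes exactly one path: bypass if listed, otherwise the VNet.
    if any(fnmatchcase(host, pattern) for pattern in patterns):
        return "managed"
    return "vnet"


if __name__ == "__main__":
    config = EgressConfig(enabled_categories={"code_repositories"})
    for destination in ("github.com", "pypi.org", "runbooks.internal.example"):
        print(destination, "->", choose_path(destination, config))
```

With only code repositories on the bypass, a package registry host falls through to the VNet, where your own NSG and firewall rules decide whether it is reachable.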
sanchitmehta
Jun 14, 2026 Place Apps on Azure Blog
Private Plugins with Azure SRE Agent
SREs and platform teams are building operational skills specific to their infrastructure: investigation runbooks, compliance checks, cost analysis playbooks, deployment verification procedures. The next step is making that work reusable across every agent in the organization without exposing it publicly.

Today, SRE Agent supports plugin marketplaces hosted in private GitHub repositories, including GitHub Enterprise. This is part of the Azure SRE Agent announcements at Build 2026. You can now point SRE Agent at a private repo when adding a marketplace or installing a plugin. Authentication is handled per marketplace and supports OAuth, GitHub PATs, and GitHub Apps for GHE tenants.

From one agent to an organization’s plugin catalog

Most teams start with a single SRE Agent connected to their services. The agent learns their infrastructure, runs their runbooks, and handles their incidents. It works well.

Then adoption grows. A second team stands up their own agent. Then a third. Platform engineering wants every agent to run the same compliance checks. Security needs approval hooks enforced consistently. FinOps has cost governance skills that should be standard across the organization. Suddenly the question isn’t “how do I set up my agent,” it’s “how do we share operational knowledge across all of them.”

Without a distribution model, teams end up copying skill files between agents manually. A platform team writes a runbook, shares it over email or a wiki link, and each service team pastes it into their agent individually. When the runbook improves, some agents get updated, some don’t. There’s no version tracking, no central catalog, and no way to know which agent is running which version of which skill. Private marketplace support solves this.

How private plugin marketplaces meet enterprise needs

A platform team publishes once, every agent installs. Codify best practices as plugins in a private GitHub repo.
Service teams add that repo as a marketplace in their agents and install what they need. Compliance checks, cost governance thresholds, incident playbooks, and deployment verification procedures are all distributed through versioned plugins.

Each team retains ownership. Security controls which plugins enforce approval hooks. FinOps locks cost thresholds into parameter values. Platform engineering governs infrastructure investigation patterns. The marketplace is the distribution layer for organizational standards.

Versions are pinned, updates are explicit. Each installation locks to the commit at install time. A merged PR upstream does not change any agent’s behavior. Teams promote new versions on their own schedule: validate in dev, promote to staging, then production. Different agents can run different versions simultaneously.

Reuse across environments and tools. The same plugin works across dev, staging, and production agents, and can be reused by local coding agents and other services that support plugins. One source of truth, not separate copies per environment.

Accessing private plugin marketplaces

Private repo support adds authentication to the SRE Agent's plugin workflow so your agent can clone and install from repos that require credentials. Authentication is configured once per marketplace. Every plugin within it inherits the credentials.

| Auth method | When to use | Setup |
| --- | --- | --- |
| OAuth | github.com repos your agent can already access | Uses your existing GitHub connection. One click. |
| Personal access token | Private repos in other orgs on github.com | Per-marketplace PAT. Scoped to just that marketplace. |
| GitHub App | GitHub Enterprise (*.ghe.com) | BYO App with private key in Azure Key Vault. Short-lived tokens minted at runtime. |

Getting started

In SRE Agent, navigate to Builder > Plugins, then click Add Marketplace and enter the URL of the private marketplace you want to connect to. Then click Connect to GitHub to complete the OAuth sign-in.
Click Add and you will see the plugins available from your connected marketplace. Click the plugin you want to install; in the detail view you can browse the skills packaged with it. Click Install to install the plugin. The skills imported from plugins then appear under Capabilities > Skills > Custom Skills.

The bottom line

Private repo support turns the Plugin Marketplace from a public skill catalog into your organization’s internal distribution platform for operational automation. Your team writes the plugins. Your agents install them. Your GitHub permissions control who has access.

Try it yourself: create a private repo with a marketplace.json and a few skills, add it as a marketplace in your agent, and install a plugin.

Resources

• SRE Agent documentation - https://aka.ms/sreagent/newdocs
• SRE Agent overview - https://aka.ms/sreagent/newdocsoverview
• Plugin Marketplace capability page - https://aka.ms/sreagent/newdocs/capabilities/plugin-marketplace
• Build 2026 SRE Agent announcements - https://aka.ms/Build26/blog/SREAgent
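
The pinning behavior described under "Versions are pinned, updates are explicit" can be pictured with a small model. This is a minimal sketch under stated assumptions, not SRE Agent's implementation or API: `Marketplace`, `Installation`, `install`, and `promote` are hypothetical names, and the commit strings are made up. It only shows the idea that an installation records the commit it was installed from and changes only when a team explicitly promotes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Installation:
    """A plugin as installed on one agent, locked to a single commit."""
    plugin: str
    pinned_commit: str


class Marketplace:
    """Hypothetical stand-in for a private repo: tracks the latest commit per plugin."""

    def __init__(self) -> None:
        self._head = {}

    def publish(self, plugin: str, commit: str) -> None:
        # A merged PR upstream moves the head; existing installations are untouched.
        self._head[plugin] = commit

    def head(self, plugin: str) -> str:
        return self._head[plugin]


def install(marketplace: Marketplace, plugin: str) -> Installation:
    """Lock the installation to whatever commit is current at install time."""
    return Installation(plugin, marketplace.head(plugin))


def promote(marketplace: Marketplace, current: Installation) -> Installation:
    """Explicit update: return a new installation pinned to the current head."""
    return Installation(current.plugin, marketplace.head(current.plugin))


if __name__ == "__main__":
    repo = Marketplace()
    repo.publish("compliance-checks", "a1b2c3")
    prod = install(repo, "compliance-checks")

    repo.publish("compliance-checks", "d4e5f6")   # upstream change
    dev = install(repo, "compliance-checks")      # dev validates the new version

    print("prod:", prod.pinned_commit)            # still the original commit
    print("dev: ", dev.pinned_commit)
    print("prod after promote:", promote(repo, prod).pinned_commit)
```

Because `Installation` is immutable, the only way to change what an agent runs in this model is to create a new installation, which mirrors the explicit promote step from dev to staging to production.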
ebencarek
Jun 11, 2026 Place Apps on Azure Blog
Introducing Azure Container Apps Sandboxes: Secure Infrastructure for Agentic Workloads
Today we are announcing the public preview of Azure Container Apps Sandboxes - a new first-class resource type that gives you fast, secure, ephemeral compute environments with built-in suspend and resume. This is the underlying infrastructure on which products like Cloud sandboxes in GitHub Copilot, Foundry Hosted Agents, and Azure Container Apps Express are built, and you can now build your own solutions on the same infrastructure.

Azure Container Apps Sandboxes unlocks two massive opportunities. For platform developers and ISVs, sandboxes give you the same isolated compute fabric that powers many Microsoft products. You get the building blocks to create your own multi-tenant platform on proven, enterprise-scale infrastructure. For AI agents, sandboxes become a self-configurable tool that lets agents extend their own capabilities on the fly. An agent can spin up a fresh sandbox in milliseconds and use it to execute untrusted code, compile source, test HTTP requests against a live app, launch a browser session, or tackle anything else that needs fast, scalable infrastructure.

On one side it empowers humans to build platforms; on the other it empowers agents to build their own capabilities. Both get enterprise-grade isolation, instant startup, and snapshot-based persistence out of the box.

We'll walk through the resource model, sandbox lifecycle, the features that set Sandboxes apart - like snapshots, lifecycle policies, network egress controls, volumes, and managed identities - and show you how to get started with the portal and CLI.

What Are Container Apps Sandboxes?

Container Apps Sandboxes are secure, isolated compute environments that start in sub-second time, scale to thousands, and cost nothing when idle. Each sandbox runs in its own hardware-isolated microVM boundary - fully separated from the host, the platform, and every other sandbox.
You bring your own Open Container Initiative (OCI) image, and Sandboxes handle the rest: provisioning from prewarmed pools, strong multi-tenant isolation, and snapshot-based suspend/resume that preserves full memory and disk state across sessions.

There are many ways Sandboxes can help you build your next project - here are a few:

• Your own build & test systems - wire a Sandbox into your CI/CD flow to run builds while your laptop stays cool.
• Agents that can run anything safely - an agent spawns a sandbox, executes work inside it, and returns the output with no agent host privileges required.
• Agent swarms - decompose a research question, spawn N sandbox workers in parallel (each pinned to its own image and egress policy), and synthesize the result.

Early access customers are already unlocking significant benefits by leveraging Azure Container Apps Sandboxes.

"With Azure Container Apps sandboxes, SitecoreAI can safely enable agents to take real action. The combination of multi-tenant isolation, rapid scale-out, and full automation allows Sitecore to run long-lived, autonomous agents that securely execute code, manage workflows, and interact with enterprise systems within secure, governed environments. With this foundation, we can build agents that do real work: assembling content, personalizing experiences, and optimizing campaigns in production. Agents that operate continuously, learn from results, and improve over time, so our customers get better outcomes without giving up control." - Mo Cherif, VP of AI and Innovation, Sitecore

"We got early access to Azure Container Apps Sandboxes, and got the first prototype integrated with Atlas AI in hours, and it's already shaping a new Atlas AI capability that we plan to launch in preview in Q3. It gives every Atlas AI agent a safe, sandboxed workspace (file system, terminal, code execution) on a customer's live data in Cognite Data Fusion.
The value: Industrial process, reliability, and production engineers spend days and weeks on questions like "which wells are underperforming and why?" These questions are tractable but expensive, so they are asked rarely and decisions are made on gut feel. With this, an agent pulls the data, runs the analysis, cross-references maintenance and inspection records, and returns a cited draft in minutes. Sandboxes make it practical: Aligned feature set, per-customer isolation, pause/resume across multi-day investigations, scale-to-zero economics." - Kelvin Sundli, Product manager, Atlas AI, Cognite

Resource Model: Sandbox Groups and Sandboxes

The top-level ARM resource is Microsoft.App/SandboxGroups. A Sandbox Group is the management boundary for a collection of sandboxes that share configuration - think of it like a Container Apps Environment, but purpose-built for sandboxes.

When you create a Sandbox Group, you specify:

• Subscription, Resource Group, and Region
• Sandbox defaults (optional): default CPU, memory, disk, max sandbox count, and default idle timeout
• Networking: optionally deploy into a custom VNet with a dedicated subnet for private networking
• Identity: system-assigned or user-assigned Entra identity

Individual sandboxes are created within a Sandbox Group. Each sandbox has its own source (disk image or snapshot), resource tier, lifecycle policy, network egress policy, environment variables, ports, volumes, and connections.
Sandbox Lifecycle

Sandboxes have a well-defined lifecycle with the following states:

| State | Description |
| --- | --- |
| Creating | Provisioning the sandbox from a disk image or snapshot |
| Running | Actively executing - backed by a live microVM |
| Idle | System-suspended after inactivity; can auto-resume on the next request |
| Suspended | Full state (memory + disk) preserved as a snapshot; no compute costs |
| Resuming | Restoring from a suspended or idle state - sub-second for most workloads |
| Stopped | User-initiated stop; can be resumed |
| Stopping | Graceful shutdown in progress |
| Deleting | Teardown in progress |

The key insight here is the distinction between Idle and Suspended. When a sandbox goes idle (e.g., no traffic for a configured timeout), the system can automatically suspend it and capture a snapshot. When a new request arrives, the sandbox resumes transparently. This gives you scale-to-zero economics with stateful compute - something that wasn't possible before without significant custom engineering.

Disk Images: Bring Your Own Container

Sandboxes boot from Disk Images - Open Container Initiative (OCI) images converted into an optimized root filesystem format. You point to any OCI image (public or private registry), and the platform builds a bootable disk image from it.

You can start with public, pre-built images maintained by the platform (for example, Ubuntu base images), or bring your own private images. For private registries, you can authenticate with username/token or use a user-assigned managed identity for Azure Container Registry (ACR) - integrated with Azure as you expect.

Snapshots: Full-State Persistence

Snapshots capture the complete state of a running sandbox - memory, disk, and all running processes. When you resume a sandbox from a snapshot, every process, open file handle, and in-memory data structure is restored exactly as it was.

A snapshot captures the full state of a running sandbox: memory pages, disk, processes.
Two ways to make one - automatically on suspend, or manually on demand. Three things they're great for:

• Checkpointing mid-task so a long-running agent can resume exactly where it left off
• Cloning an environment that's already warm - dependencies installed, caches populated, services running
• Shipping a "ready-to-go" state that resumes in sub-second time instead of cold-booting

Snapshots are free during the preview, after which they will be stored in Azure Blob Storage at standard rates. Each snapshot records the source sandbox, resource allocation (CPU, memory, disk), and container metadata - so what you get back is exactly what you snapshotted.

Resource Tiers

Every sandbox is assigned to a resource tier that determines its CPU, memory, and disk allocation:

| Tier | CPU | Memory | Disk |
| --- | --- | --- | --- |
| XS | 0.25 vCPU | 0.5 GB | 5 GB |
| S | 0.5 vCPU | 1 GB | 10 GB |
| M (default) | 1 vCPU | 2 GB | 20 GB |
| L | 2 vCPU | 4 GB | 40 GB |
| XL | 4 vCPU | 8 GB | 80 GB |

When creating a sandbox from a snapshot, the resource tier is inherited from the snapshot and cannot be changed - this ensures the restored environment has the exact resources it was running with when the snapshot was taken.

Lifecycle Policies: Auto-Suspend and Auto-Delete

Every sandbox can be configured with lifecycle policies that automate state transitions and cleanup:

Auto-Suspend

• Idle timeout: How long a sandbox can sit idle before being suspended (configurable: 1m, 2m, 5m, 10m, 30m, 60m)
• Suspend mode:
  • Disk + Memory (default): Full snapshot including memory state - resume picks up exactly where you left off, with all processes and in-memory data intact.
  • Disk: Only the disk is preserved; the VM restarts fresh on resume. Useful when you only need file persistence, not process continuity.

Auto-Delete

• Automatically delete sandboxes after a configurable number of days of inactivity
• Prevents accumulation of abandoned sandboxes that consume snapshot storage

These lifecycle policies are what make Sandboxes economically viable at scale.
A platform serving thousands of tenants can configure aggressive idle timeouts (say, 60 seconds) with the Disk + Memory suspend mode, and each tenant's sandbox disappears from the billing meter almost immediately - but resumes in sub-second time the moment they return.

Network Egress Policy

For scenarios involving untrusted code - AI agents executing LLM-generated scripts, multi-tenant SaaS with user-submitted workloads - controlling outbound network access is critical. Sandboxes provide a per-sandbox Network Egress Policy:

• Default action: Allow or Deny all outbound traffic
• Host rules: Domain-pattern rules (e.g., *.github.com → Allow) to permit specific destinations
• Custom CIDR rules: Network-level rules for IP ranges (e.g., 10.0.0.0/8 → Deny)
• Skip egress proxy: Option to bypass the egress proxy entirely when custom VNet routing handles policy enforcement

This means you can run a sandbox in a deny-by-default posture and allowlist only the specific endpoints it needs (your API server, a package registry, etc.) - without setting up NSGs or firewall appliances.

Managed Volumes: Persistent and Shared Storage

Sandboxes support two types of mountable volumes, both managed by Microsoft:

| Volume Type | Backed By | Best For |
| --- | --- | --- |
| Managed Azure Blob | Azure Blob Storage | Shared data across sandboxes, file uploads/downloads, persistent artifacts |
| Managed Data Disk | Azure Disk Storage | High-performance storage for databases, build caches, large working sets - only available to one sandbox at a time |

Blob volumes come with a built-in file explorer in the portal - you can browse, upload, download, create folders, and drag-and-drop files directly. Data Disk volumes provide dedicated block storage with configurable sizes.

Secrets and Identity

Secrets

Sandbox Groups support key-value secrets scoped to the group. Secrets can be created, edited, and referenced by sandboxes within the group.
These secrets can be used in egress policies to modify requests with transform or header-injection rules, without exposing the secrets to code running inside the sandbox.

Managed Identity

Sandbox Groups support both system-assigned and user-assigned managed identities, with full RBAC role assignment management. This means your sandboxes can authenticate to Azure services (Key Vault, Storage, Cosmos DB, etc.) without managing credentials - the same identity model you use everywhere else in Azure.

MCP Connectors and Triggers

ACA Sandboxes now supports managed connectors through the Model Context Protocol (MCP), giving sandboxes access to external APIs - including Microsoft 365, Salesforce, ServiceNow, GitHub, and 1,400+ other systems - without managing credentials directly. Attach a Connector Gateway to your sandbox group, and every sandbox in the group can call external APIs through a standardized MCP interface at runtime.

Pair connectors with triggers to build event-driven automation: route an Outlook email to a sandbox that triages it with an AI agent, or react to a SharePoint file upload by extracting and processing the document, all without writing glue code. Triggers can fire a shell command inside a sandbox or invoke an HTTP endpoint the sandbox exposes, so your automation shapes fit naturally around your workload.

The integration is built on the new Connector Namespace service (az connector-namespace), the same runtime behind Logic Apps and Power Platform connectors, now available as a programmable layer for sandboxes. See the end-to-end samples for runnable azd up-deployable examples covering email triage and document automation scenarios.

The Portal Experience

Azure Container Apps Sandboxes are only available in the new Azure Container Apps portal, which provides a rich, IDE-like experience for working with sandboxes.
Creating a Sandbox

The portal offers multiple creation paths:

• Standard Sandbox - full configuration control over source, resources, lifecycle, networking, and volumes
• GitHub Copilot Sandbox - preset, Copilot CLI ready to go, GitHub credentials can be wired through the Access Token before the sandbox is created
• Claude Sandbox - Claude CLI pre-installed, ready for agentic coding inside the sandbox

Using Coding Agents (Copilot CLI / Claude Code)

If you live inside Copilot CLI or Claude Code, you don't need to learn a new CLI. Install the azure-sandbox skill once and your agent picks up the right skills:

    # GitHub Copilot CLI
    # Add as a plugin marketplace
    /plugin marketplace add microsoft/azure-container-apps
    # Install all skills
    /plugin install sandboxes@Azure-Container-Apps

    # Claude Code
    claude plugin add microsoft/azure-container-apps

The skill runs prerequisite checks silently (az --version, az account show, node --version, aca --version), prompts only if something's missing, and maps natural-language asks to the right aca commands. Bundled runbooks cover Copilot CLI BYOK (bring your own Azure OpenAI key), the deploy-a-web-app walkthrough, and shell setup.

Sandbox Detail Page

Once your sandbox is running, the detail page gives you immediate access to the sandbox terminal and additional details, such as:

• Network Audit - real-time egress traffic log showing allowed and denied requests
• Monitor - live CPU, memory, disk, and network utilization charts
• Connectors - attached connections with an "Add" action
• Volumes - mounted volumes with an "Add" action
• Log Stream - streaming container logs
• Processes - running process list inside the sandbox
• Files - file explorer to browse the sandbox filesystem

The toolbar actions let you manage the state of the sandbox - Resume or Stop.
In the Ellipsis menu (⁝) you can find additional settings to manage network Egress Policy and ingress (Add port), take a Snapshot of the sandbox, Commit (save disk state as a new disk image), set Lifecycle Policy, or permanently Delete the sandbox. Finally, you can see additional Details in a side panel.

Getting Started with the CLI and Python SDK

All sandbox and sandbox-group operations go through the aca CLI. There are no az containerapp sandbox commands - az is only used for az login, az account show, and resource-group management.

Install (CLI)

    # Mac, Linux
    curl -fsSL https://aka.ms/aca-cli-install | sh

    # Windows
    irm https://aka.ms/aca-cli-install-ps | iex

Run aca --help to get started.

Install (Python SDK)

    pip install azure-containerapps-sandbox

For more details, a quick start, and examples for the ACA CLI and Python SDK, go to https://sandboxes.azure.com

Evolution from Dynamic Sessions

If you've used Azure Container Apps Dynamic Sessions, Sandboxes are the next evolution of that capability. Everything Sessions can do, Sandboxes can do - and significantly more:

| Capability | Dynamic Sessions | Sandboxes |
| --- | --- | --- |
| Sub-second startup | ✓ | ✓ |
| Strong isolation | ✓ | ✓ |
| Custom container images | ✓ | ✓ |
| Custom VNet integration | ✓ (Partial) | ✓ |
| Suspend/resume with Memory and Disk snapshots | - | ✓ |
| Lifecycle policies (auto-suspend, auto-delete) | - | ✓ |
| Network egress policy (per-sandbox) | - | ✓ |
| Persistent managed volumes (Blob, Data Disk) | - | ✓ |
| Managed identity (system + user-assigned) | - | ✓ |
| Secrets management | - | ✓ |
| Configurable resource tiers | - | ✓ |
| Direct access to sandbox in Portal experience | - | ✓ |

We will continue to support Dynamic Sessions, but all new investment goes into Sandboxes. If you're building new workloads on isolated ephemeral compute, start with Sandboxes.

How It All Fits Together

ACA Sandboxes is a platform primitive. It's the foundation on which multiple Microsoft products are already built - including ACA Express, Cloud sandboxes in GitHub Copilot, and Foundry Hosted Agents.
When you build on Sandboxes, you're building on the same infrastructure that powers Microsoft's own portfolio. This is the evolution of what we shared with Project Legion in 2024. Legion described the internal infrastructure; Sandboxes exposes it as a customer-facing primitive that you can use directly.

What's Next

• Deeper Azure integrations - first-class connectivity with Azure networking, identity, storage, and AI services
• Enhanced SDK and CLI - richer programmatic experiences for managing sandboxes at scale
• More Microsoft services built on Sandboxes - this is just the beginning

Get Started Today

• Portal: https://sandboxes.azure.com/
• Documentation: Azure Container Apps Sandboxes
• Pricing: Azure Container Apps Pricing (per-second vCPU/memory billing, scale-to-zero, snapshots at Blob Storage rates)

We'd love to hear your feedback. You can ask questions or file issues on the Azure Container Apps GitHub (prefix with [Sandbox] for Sandboxes-specific issues).
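
To make the Network Egress Policy section above more concrete, here is a minimal sketch of how a deny-by-default policy with host and CIDR rules might be evaluated. It is an illustration only: the `EgressPolicy` class and its `evaluate` method are hypothetical and are not part of the aca CLI or the Python SDK, and the rule precedence (host rules first, then CIDR rules, then the default action, first match wins) is an assumption, since the post does not specify how rules are ordered.

```python
import ipaddress
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class EgressPolicy:
    """Hypothetical model of a per-sandbox egress policy (not the real API)."""
    default_action: str = "Deny"
    host_rules: list = field(default_factory=list)   # [(domain pattern, action)]
    cidr_rules: list = field(default_factory=list)   # [(CIDR block, action)]

    def evaluate(self, host: str, ip: Optional[str] = None) -> str:
        """Return 'Allow' or 'Deny' for one outbound request."""
        # Assumed precedence: first matching host rule wins.
        for pattern, action in self.host_rules:
            if fnmatchcase(host.lower(), pattern.lower()):
                return action
        # Then the first CIDR rule containing the resolved address, if known.
        if ip is not None:
            address = ipaddress.ip_address(ip)
            for cidr, action in self.cidr_rules:
                if address in ipaddress.ip_network(cidr):
                    return action
        # Nothing matched: fall back to the default action.
        return self.default_action


if __name__ == "__main__":
    policy = EgressPolicy(
        default_action="Deny",
        host_rules=[("*.github.com", "Allow")],
        cidr_rules=[("10.0.0.0/8", "Deny")],
    )
    print(policy.evaluate("api.github.com"))          # matches the host rule
    print(policy.evaluate("example.com"))             # falls to the default
    print(policy.evaluate("db.internal", "10.1.2.3")) # matches the CIDR rule
```

Keeping the default action as the last step is what gives the deny-by-default posture: a destination is reachable only if some rule explicitly allows it.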
vyomnagrani
Jun 10, 2026 Place Apps on Azure Blog
Shaping what Azure SRE Agent does: Tool Permissions and Hooks
When an AI agent runs against production, the first question every security team asks is "What can it do, who decided it could, and what stops it from doing something it should not?"

Azure SRE Agent reached general availability in March. Since then, teams inside Microsoft and customers running it against real production workloads have asked for the same thing: finer-grained controls over what the agent can do on its own and a clear answer to who governs each call that reaches a tool.

Today at Build 2026, we are releasing global tool access policies as one of a set of new governance controls. This post covers how they work. Tool access policies give security and platform teams a single place to define which tools the agent can invoke, under what conditions, and what requires human approval before it runs. Underneath those policies sits the identity the agent runs as, the bedrock that every other control layer depends on.

It is defense in depth applied to agent behavior: layers of control, each one holding on its own, so that governing the agent is something you can read, audit, and reason about as you scale it across production.

Identity is the bedrock: managed identity today, agent identity next

Start here, because nothing else matters if you skip it. The identity the SRE Agent runs as, and the Azure RBAC role assignments on that identity, are the most powerful boundary the agent works inside of. If your role assignments do not grant the agent access to a resource, none of the controls below come into play, because the agent cannot reach the resource to begin with. Network rules, tool permissions, hooks, and connector contracts all sit on top of an RBAC story that you write. The features in this post add layers above that floor. They do not replace it.

Today the SRE Agent operates as a managed identity, and your RBAC role assignments on that identity govern what it can do. This is the bedrock, and it is the same model your other Azure workloads already use.
You assign roles, you scope them, and the agent inherits exactly what you granted and nothing more. Everything that follows assumes the bedrock is in place. With identity settled, the next question is the obvious one: where is the agent allowed to send its traffic?

Permissions: govern what the agent does with a tool

Identity decides what the agent can reach. Permissions decide what the agent does with the access it has, down to the individual tool. Two levels cover the range: a point-and-click grid for the common cases, and hooks when a decision needs your own code.

The grid is the easy mode. Every tool the agent can use, built-in tools along with MCP servers, services, and custom tools, shows up in one searchable list with two switches. On/Off sets whether the tool is available at all; turn it off and the agent cannot use it. Allow/Ask sets what happens when it is on: Allow lets the agent run the tool automatically, Ask requires a human to approve every time, except in Autonomous mode. Select tools in bulk to flip a whole category at once, filter by category or permission, and use the Advanced permissions tab when you want rules that apply at global, per-agent, or per-thread scope instead of tool by tool. Defaults stay put until you touch them, and the engine is fail-closed: if a rule cannot be evaluated, the call is blocked rather than allowed.

That covers most of what teams need. Underneath those switches are three rules, allow, ask, and deny, and the Advanced tab is where you set them by scope. Global rules apply to every agent and thread, Agent rules to one custom agent, Thread rules to a single conversation. Deny is the hard one: it blocks the tool outright no matter the run mode, and a deny at a higher scope always wins, so an Allow at thread scope cannot reopen something denied globally. That split is deliberate.
A platform team sets the Global guardrails that should never be crossed and the Asks that always need a human, and service teams add their own Allow rules at Agent scope for routine work, without being able to override the guardrails above them.

Platform team, Global scope:

    deny: bash(az * delete *) - never delete, on any agent or thread
    deny: bash(kubectl delete *)
    ask: bash(az webapp restart *) - always confirm, even in Autonomous
    allow: bash(az monitor *) - auto-approve monitoring queries

Service team, Agent scope:

    allow: bash(kubectl get *) - routine read-only work
    allow: bash(kubectl describe *)

Two details make this safe to lean on. Rules match the canonicalized tool invocation rather than the raw text, so enforcement holds no matter how the command was assembled. And fail-closed has a softer edge than a hard stop: a cached last-known-good policy covers transient failures, so a blip in the policy store blocks the call rather than silently widening access. You can find these under Capabilities > Tools missions.

The layer worth spending time on is hooks. Allow and Ask answer "should this tool run." Hooks answer "should this specific call run, given exactly what it is about to do." A hook fires before the agent runs a tool and receives the actual call, parameters and all. Your code then decides the outcome and can reshape it: rewrite parameters before they are sent, inject extra context into the pipeline as a user message so the agent reconsiders before its next step, block the call outright, or redirect the agent toward a safer path. Because your code sees the real parameters, the decision can depend on anything you can express in code: which resource the call targets, whether a value falls outside an allowed range, the time of day, the result of an external policy lookup. This is where you write the rule the grid cannot.

Two kinds of hook, mixable on the same agent. Command hooks are a script you write; reach for these when code is enough.
Prompt hooks put a separate LLM in the loop as a judge that evaluates the call in context; reach for these when the decision needs reasoning rather than a fixed rule.

A real example from our own internal test agent: when the agent tries to list files through the shell with ls or dir, a hook blocks the call. The agent absorbs the signal, reconsiders, and reaches for the ListDir tool instead. The hook did not argue with a human. It shaped what happened next. As with the grid, configure nothing and the agent behaves exactly as it does today. Both are additive.

Authoring one is a short form. You name the hook, pick the event (Pre Tool Use, so it runs before the call), and set a tool matcher, either picked from the tool menu or written as a regex like (FetchWebpage|SearchMemory) with anchors and lookaheads when you need them, so the hook fires only on the calls you care about. You set a timeout and a fail mode (Block, so a hook that errors or hangs stops the call rather than waving it through), and you write the body in Bash or Python.

A command hook reads the call as JSON on stdin (the event name, the tool name, its parameters, and the call id) and answers on stdout. Print nothing and exit zero to allow. Return a block decision with a reason to stop the call, and that reason is what the agent reads back. You can also substitute: run a cheaper or safer version yourself, block the real call, and hand your own output back as the result, so the agent never runs the expensive or risky original.

    #!/bin/bash
    input=$(cat)
    tool=$(echo "$input" | jq -r '.tool_name')

    # Block one tool, with a reason the agent will read
    if [ "$tool" = "ExampleToolName" ]; then
      echo '{"decision":"block","reason":"Blocked ExampleToolName by hook policy."}'
      exit 0
    fi

    # Otherwise allow: print nothing and exit 0
    exit 0

You can find these under Builder > Hooks.

Each layer holds on its own

The layers stack. Identity is the floor: your RBAC assignments decide what the agent can reach at all.
Permissions, the grid and hooks together, decide what it does with a tool. You author each layer, each one holds whether or not the layer above it behaves as expected, and all of it configures through the same ARM and Bicep surface your platform team already uses, reproducible the way the rest of your Azure estate is.

The upgrade path is additive and non-breaking. Existing agents keep working. Turn on each control when you are ready, in the order your governance requires.

There is more coming. We run Azure SRE Agent inside Microsoft on our own production workloads, so we feel the same gaps you do, and the next round is shaped by what we hear from teams running it in production today. Which control is doing the most for you, and which one are you still waiting on? Let us know and thank you!

Getting started

• Create new SRE Agent - https://aka.ms/sreagent
• SRE Agent Documentation - https://aka.ms/sreagent/newdocs
• SRE Agent recipes - https://aka.ms/sreagent/recipes
• Build 2026 Announcement - https://aka.ms/Build26/blog/SREAgent
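
Since the post says a hook body can be written in Bash or Python, here is a hedged Python counterpart to the Bash example, modeled on the ls/dir story from the internal test agent. It is a sketch under assumptions, not a documented schema: only `tool_name`, `decision`, and `reason` appear in the post's own script. The shell tool being named `bash`, and the parameters arriving under a `tool_input` object with a `command` string, are guesses made for illustration, so check the real payload before relying on them.

```python
import json
import shlex

SHELL_TOOL = "bash"               # assumed name of the shell tool
LISTING_COMMANDS = {"ls", "dir"}  # commands the hook redirects away from


def run_hook(stdin_text: str) -> str:
    """Take the hook's stdin payload and return what to print on stdout.

    An empty string means "allow" (print nothing, exit 0).
    """
    event = json.loads(stdin_text)  # malformed input raises, so fail mode Block applies
    if event.get("tool_name") != SHELL_TOOL:
        return ""

    # Assumed payload shape: {"tool_input": {"command": "<shell command>"}}
    params = event.get("tool_input")
    command = params.get("command", "") if isinstance(params, dict) else ""
    try:
        words = shlex.split(str(command))
    except ValueError:
        words = str(command).split()  # unbalanced quotes: fall back to a plain split

    if words and words[0] in LISTING_COMMANDS:
        return json.dumps({
            "decision": "block",
            "reason": "Use the ListDir tool instead of listing files through the shell.",
        })
    return ""
```

Used as a hook body, the script would end with something like `sys.stdout.write(run_hook(sys.stdin.read()))` after importing `sys`. Keeping the decision in a pure function makes it easy to unit-test the policy without a running agent, and letting malformed input raise keeps the hook on the fail-closed side when paired with the Block fail mode.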
Dalibor_Kovacevic
Jun 10, 2026 Place Apps on Azure Blog