azure kubernetes service
245 TopicsWhat’s new in Observability at Build 2026
At Build 2026, Azure Monitor introduces major advancements in end-to-end observability, extending across AI agents, applications, and infrastructure with OpenTelemetry at its core. New capabilities with Azure Copilot Observability agent, SLI/SLO support, and smarter alerting help teams move faster from detection to root cause while reducing noise and manual effort. Together, these innovations enable developers and SREs to operate modern, AI-driven systems with greater insight, efficiency, and alignment to customer experience.308Views2likes0CommentsVNet integration for Azure SRE Agent (preview)
For many production systems, the logs, databases, private endpoints, repositories, and runbooks an SRE Agent needs to do its job are behind network boundaries your security team already governs. VNet integration for Azure SRE Agent, now in preview, puts the agent's outbound traffic under those same controls - your virtual network, your NSG rules, your private DNS - so it reaches only what your network allows. The principle is one your security team already applies to every other workload: a component's network access shouldn't depend on the component behaving correctly. Identity governs what the agent can reach. Permissions and hooks shape what it does within reach. The network sits beneath both: it blocks any request to a destination you haven't allowed no matter what the agent decides. Why egress control matters Two reasons. First, the agent reads sensitive things by design. Inspecting logs, code, configuration, and internal systems is the whole point during an incident, which means you have to decide where that data can go. Open egress gives that data a path out of your network - a risk you wouldn't accept for any other production-adjacent workload. Second, it reasons over text it didn't write - logs, issue descriptions, tool output — which is how prompt injection gets in. Handling that is partly model safety, and Azure SRE Agent runs under Microsoft's Responsible AI standard with safety work from OpenAI and Anthropic. Network controls add another layer: an instruction that tries to reach a destination you haven't allowed can't run, because the network blocks it. For example, an agent investigating an outage might query Log Analytics, read deployment configuration, and call an internal runbook - all private resources. With VNet integration, those calls follow the routes, DNS, and firewall rules your workloads already use. A request to an external endpoint you haven't allowed fails at the network boundary. It doesn't depend on the model recognizing the risk and refusing; the network stops it either way. Choose an egress mode Azure SRE Agent has three egress modes, and you don't have to start at the strongest. Unrestricted - all outbound traffic allowed Limited - deny all outbound, allow an explicit list of hosts. Gives you host-level control without setting up a full VNet Azure VNet - outbound traffic goes through a delegated subnet in your network, with your NSG rules and private DNS applied. The recommended mode for production and regulated workloads. How Azure VNet mode works Outbound traffic takes one of two paths, and every call takes exactly one. Your VNet. Everything not placed on the managed path goes through a delegated subnet in your own network, where your NSG rules, private DNS, and firewall all apply. The agent is just another workload on that subnet, so it can reach what the subnet can reach: databases behind private endpoints, internal services, monitoring stores, and key vaults -the parts of production that aren't reachable from the public internet. The resources that matter most during an incident are usually the private ones. If your network connects to on-premises over ExpressRoute or VPN, the agent can reach those systems too, as long as your existing routes and rules allow it. The managed infra path. Some destinations go through Azure SRE Agent's managed infrastructure network instead - platform services the agent needs, plus optional categories you turn on: package registries, code repositories, and remote MCP servers. This path skips your VNet, so your NSG rules and Firewall Policies don't apply to it. Treat it as a deliberate exception, used only where you need it. Why public services start on the managed path Public services are hard to allow by IP address. GitHub, PyPI, npm, NuGet, apt, and the container registries run on large, changing IP ranges, and they don't map to a single Azure service tag. If your NSG filters by IP and port, keeping those lists up to date is constant work, and when a list falls behind, the agent can't pull a package or read a repository - and an investigation stalls on a networking problem that has nothing to do with the incident. Each category has a toggle: package registries (PyPI, npm, NuGet, apt), code repositories (GitHub, GitHub Enterprise, Azure DevOps), remote MCP servers, and a list of additional hostnames. Starting with these on the managed path keeps the agent working reliably without maintaining an IP allowlist. For build-time dependencies, that's usually fine. If you want this traffic inspected too, the next step is name-based (FQDN) egress filtering in your own network. Once your firewall can allow github.com and pypi.org by name, you can move these categories off the managed path and route them through your VNet instead Configure it Two decisions: the subnet, and what (if anything) uses the bypass. Navigate to Settings > Workspace Configuration > Network Choose Azure VNet as the egress mode. Select a subnet that is /28 or larger and delegated to `Microsoft.App/environments`. Decide which categories, if any, use the bypass. Restrict who can change the egress mode and bypass toggles. These settings widen or narrow the agent's reach, so govern them like any production network control. Test the outbound behavior before using the agent with production data. A reasonable setup for most enterprises during preview: use Azure VNet mode, keep package registries and code repositories on the bypass if you need reliable access to them, and route everything else through your VNet. Stricter environments can turn those categories off and rely on their own name-based firewall rules. What it doesn't cover yet VNet integration is in preview, with two limitations to know. It covers outbound traffic only - reaching the agent privately from inside your network isn't part of this preview. And connector traffic still routes over the public internet; the governance and credential isolation in Connectors V2 still apply. Use VNet integration for outbound control of the agent workspace, and combine it with identity, RBAC, tool permissions, hooks, and connector governance for a complete set of controls. Where it fits VNet integration doesn't replace identity, RBAC, tool permissions, or connector governance. It controls where traffic can go. The agent still needs the right identity and permissions to access a resource in the first place. Identity is the foundation: your RBAC assignments decide what the agent can reach. Permissions and hooks shape what it does within reach: allow/ask/deny rules control what runs, and hooks let you inspect or change a tool call before it runs. VNet integration sits underneath, controlling where traffic can go no matter what the agent tries to do. You want the agent to be capable. You also want a boundary that holds whether or not it is. Get started Create an SRE Agent - https://aka.ms/sreagent Documentation - https://aka.ms/sreagent/newdocs Recipes - https://aka.ms/sreagent/recipes Build 2026 Announcement - https://aka.ms/Build26/blog/SREAgent471Views1like0CommentsInside ACR Artifact Cache: Pull-Through Caching at Scale
By: Akash Singhal, Luis Dieguez, Kiran Challa, Nathan Anderson, Tony Vargas, Caroline Barker, Ren Shao, Mabel Egba, Toddy Mladenov, Johnson Shi Introduction For many customers, Azure Container Registry (ACR) is the only registry their workloads can trust, even when images and artifacts originate from a different registry such as Docker Hub, Microsoft Artifact Registry, GitHub Container Registry, Quay, another ACR, or a private registry. ACR Artifact Cache makes this many-to-one model practical by letting a platform team map a downstream ACR repository path to an upstream source repository. Here, upstream means the source registry and repository ACR contacts on behalf of the customer, and downstream means the ACR-facing path customers pull from. From the outside, the experience looks like a normal pull from ACR. Inside the service, that pull moves through the same multi-tenant registry platform that serves ACR traffic across regions, clouds, and data plane stamps. This series is about the gap between that simple external experience and the internal system. The goal is to show what happens inside ACR, why the system is designed this way, and how those design choices shape the behavior customers ultimately observe. Some implementation details are simplified, and the system continues to evolve. The request paths and design constraints are representative, but this article intentionally avoids service-by-service internals that are not necessary to understand the feature. For this overview, the useful mental model is: serve now, hydrate for later. Later sections will show where that model helps, and where it creates engineering pressure. Why serve upstream content from ACR? Pulling directly from an upstream is often sufficient for development, but production systems need stronger guarantees from the pull path. The failure modes are familiar to anyone who has operated containerized workloads at scale: an upstream registry is slow or temporarily unavailable an upstream applies rate limits or burst protection credentials for various upstream sources need to be handled safely ACR-to-ACR scenarios should avoid customer-managed credentials entirely by using managed identity network policy expects pulls to stay inside an approved network boundary a platform team wants one shared, sanitized catalog of public content for first-party consumption while individual teams pull only what they need Let’s take Docker Hub as a concrete example. Docker Hub pull rate limits mean that unauthenticated users and Docker Personal users can exhaust their allowed pulls in a time window, causing shared build agents or Kubernetes nodes to receive rate-limit errors instead of images. That is a useful example because it makes the upstream dependency visible, but it is not the whole story. The broader engineering problem is that upstream-sourced artifacts should behave like local registry dependencies once a customer chooses to route them through ACR. Artifact Cache addresses that problem by letting customers map a downstream ACR namespace to an upstream namespace, pull through ACR, and allow ACR to materialize content locally as it is requested. A pull-through cache inside ACR Azure Container Registry operates across 60+ Azure regions and 6 public and sovereign clouds, serves hundreds of thousands of registries, and handles billions of requests per day. Artifact Cache is only one part of that larger service, but it is large enough to be a distributed systems problem in its own right: more than 100 million image pulls per day, petabyte-scale egress, upstreams with different behavior, and customers who expect registry pulls to remain predictable. This scale matters because Artifact Cache is not deployed beside ACR as a separate service. It is part of the same registry system that serves normal pushes, pulls, tag listing, catalog operations, authentication flows, private networking scenarios, and other registry API traffic. That means Artifact Cache has to fit into ACR's existing resource model and request-serving model. Customers configure cache rules and authentication boundaries through the control plane, then their pulls are served through the data plane. The next sections follow those two parts in order: first the resources customers create, then the runtime path those resources affect. The customer workflow The setup begins in the control plane, where customers define the relationship between an ACR namespace and an upstream source. A customer starts with an ACR and chooses an upstream repository. In the examples below, myregistry.azurecr.io is the customer's ACR login server. The dockerhub/library/node path is the downstream ACR namespace the customer wants to use for cached content. The authentication model depends on the upstream: For a public upstream, the cache rule may not need credentials. For a private upstream, the customer stores upstream credential material in their Azure Key Vault, creates a credential set that references those secrets, and then associates that credential set with a cache rule. At access time, ACR uses the system-assigned managed identity associated with the cache rule to read the referenced Key Vault secrets, so the customer controls access by granting that identity the required secret permissions. ACR materializes those credentials only when it needs to contact the upstream, so the customer-owned Key Vault remains the secret store. For an ACR-to-ACR upstream, the customer can use a user-assigned managed identity. In that scenario, credential sets are not part of the flow; managed identity replaces the credential-set and Key Vault path. At a high level, the customer defines a namespace mapping: docker pull myregistry.azurecr.io/dockerhub/library/node:latest maps to: docker pull docker.io/library/node:latest In ACR, that mapping is stored as a cache rule: a control-plane resource that maps a downstream ACR path to an upstream source path. If the upstream requires authentication, the cache rule links to the appropriate credential boundary: a credential set backed by customer-owned Key Vault secrets, or a user-assigned managed identity for ACR-to-ACR. This is where the control-plane/data-plane split shows up. The control plane manages registry configuration through surfaces such as CLI, portal, Bicep, ARM templates, and other Azure Resource Manager clients. ARM sends those resource operations to the ACR control plane, which creates or updates the cache rule and, when needed, the credential set as child resources under the registry. Those resources do not own customer secrets or identities directly; they link to existing Azure resources such as the customer's Key Vault or an optional user-assigned managed identity. Later, the data plane uses that persisted configuration to decide whether a runtime registry request, such as a pull or tag listing, should be handled by Artifact Cache. After setup, the runtime path begins with the simplest possible pull: docker pull myregistry.azurecr.io/dockerhub/library/node:latest To understand what happens after that command, we need a map of the ACR components that participate in the request path. The ACR components involved The architecture needed for this overview is much smaller than ACR's full internal service graph. ACR is a regionalized service. The control plane operates at the regional level, while data plane stamps serve hot-path registry traffic for the registries assigned to them. A registry is pinned to a stamp, and high-traffic regions may have more than one stamp. Stamp architecture is an ACR concept covered in more detail in the stamp rebalancing post; this article only needs the simplified model below. For this article, ACR has three important boundaries: The regional control plane manages registry resources and provisioning operations. The data plane stamp serves hot-path registry traffic for registries pinned to that stamp. The storage layer holds downstream registry metadata, blobs, and storage-backed event queues. At this level of detail, a data plane stamp is composed of a few major runtime substrates. The registry data plane virtual machine scale set (VMSS) is the core ACR data plane. It runs containerized services including the frontend, the registry API entry point that receives and routes OCI and ACR-specific requests. The data proxy VMSS also runs containerized services and serves selected blob-content paths. It serves eligible blob-content traffic behind ACR's dedicated data endpoint; see the ACR data endpoint documentation. The stamp also includes a runtime cluster for additional data plane services, including services that are not on the hot path. This article will not explain why ACR uses both VMSS-based services and a runtime cluster inside the data plane stamp. That tradeoff is useful context, but it belongs in a separate deep dive. For Artifact Cache, the important point is narrower: the stamp contains the runtime substrates that participate in data plane serving, including runtime-cluster services that process async import and hydration work. The component list is: Component Role Region control plane Manages registry resources and provisioning operations Data plane stamp Serves pinned registries in a region Registry data plane VMSS Core ACR data plane for OCI and ACR-specific APIs Frontend Handles OCI registry API traffic inside the registry data plane Data proxy VMSS Serves selected blob-content paths, including Artifact Cache Runtime Kubernetes Cluster Hosts additional data plane services, including async import and hydration workers Cache rule Maps downstream ACR path to upstream path Credential set or managed identity Provides the upstream authentication boundary when needed Cache Backend service Handles cache-rule-backed pulls Storage queue Regional storage resource used for hydration events Metadata/blob storage Stores downstream manifests, tags, digests, and layer blobs Import workers Run in the data plane runtime cluster and hydrate downstream content asynchronously Upstream registry Public, private, or another ACR registry used as the source The diagram below is a component map rather than a step-by-step pull trace. It shows one visible data plane stamp in West US for myregistry.azurecr.io, with a muted marker to indicate that larger regions can contain multiple stamps. The stamp contains a registry data plane VMSS, a data proxy VMSS, and a runtime Kubernetes cluster. Regional metadata/blob storage and the storage queue sit outside the stamp boundary. The storage queue is also outside the regional control plane cluster; it is a storage resource consumed by data plane runtime-cluster workers. First artifact pull Now return to the pull request: docker pull myregistry.azurecr.io/dockerhub/library/node:latest The request reaches the data plane stamp where myregistry is pinned. The frontend in the registry data plane VMSS handles the registry API request and forwards it to the Cache Backend Service, which checks whether the requested repository path matches a cache rule. If there is no matching cache rule, the request follows the normal ACR path. If a cache rule matches, Artifact Cache logic applies. The next check is local state. ACR looks at downstream metadata and blob storage to determine whether the requested manifest and blobs are already available locally. If the content is present, ACR can serve it from the downstream registry path. If the content is not available locally, ACR resolves the upstream repository path from the cache rule. If the upstream requires authentication, ACR uses the configured auth boundary for that upstream: a credential set for private upstreams, or a user-assigned managed identity for ACR-to-ACR upstreams. The request can then be served through the upstream-backed data path, with the data proxy handling the blob content path. The first pull does not need to wait for durable hydration to complete before the client receives content. Serving the pull and hydrating the downstream registry are related operations, but they are deliberately separated. The trace above follows the same node:latest image used in the setup example. On a cache miss, the data plane queues an async import event for the requested image while still serving the client request. Manifest content returns through the frontend path. For layer blobs, the frontend returns a redirect to the data proxy, and the client follows that redirect while the data proxy streams blob content from the upstream CDN. The data plane serves the customer request, but it also detects that durable downstream state needs to be populated. That durable work is where hydration comes in. Hydration Hydration is the process that materializes upstream content into the downstream ACR registry. ACR performs hydration asynchronously because the data plane workload can be bursty and variable. A deployment or scale-out event can cause many clients to request the same not-yet-hydrated image at nearly the same time. Image size, layer count, multi-platform manifest trees, upstream behavior, queue depth, and retry behavior all matter in a multi-tenant service. The north star is to coordinate those requests: collapse duplicate work, hydrate the content from upstream, and serve all waiting clients without turning one customer action into unnecessary upstream load. That coordination problem is challenging at ACR scale, and we are continuing to improve it. The existing async import path gives Artifact Cache a durable and scalable foundation while that serving path continues to evolve. At a high level, the data plane queues an import event. A notification service consumes the event and dispatches work to import workers in the data plane runtime cluster. Those workers fetch the required content from the upstream registry and write manifests, tags, digests, and layer blobs into ACR metadata and blob storage. When import workers complete, they notify the notification service, which can publish completion signals through ACR eventing surfaces such as Event Grid and webhooks. This allows customers to use webhooks to detect when cached content is fully available locally. You can read more about how it works here. The mental model is that the first pull can serve immediately, while hydration makes future local serving durable. A follow-up post will go deeper on the work ACR does to reduce upstream load during this hydration window. Later pulls After hydration completes, later pulls for the same content can be served from ACR. For digest references, the model is relatively direct because a digest is content-addressed. If ACR has the requested digest and its blobs downstream, the data plane can serve that content locally. Tags are more subtle because tags can change. A tag such as latest is a name that can point to different content over time. Artifact Cache therefore must care about freshness semantics for tag-based pulls. This is one of the reasons a pull-through cache becomes more complex than "fetch once and forget." The benefit is not only lower latency. ACR also reduces repeated dependency on the upstream for content that has already been materialized downstream. Guarding the pull path Once content is hydrated, ACR must serve that content from the customer's registry boundary even when the upstream is slow, unavailable, or returning errors. That distinction matters for tag-based pulls: ACR may need upstream checks to reason about freshness, but an upstream failure should not automatically prevent ACR from serving content that is already available downstream. Artifact Cache also must be careful about how it behaves when upstreams are unhealthy. If an upstream starts returning 5xx errors or throttling requests, ACR should avoid amplifying the problem by repeatedly sending customer-triggered requests upstream. Circuit breaking and upstream work minimization are part of being a good steward of both customer traffic and upstream registry limits. More details to follow in subsequent posts. There is a separate availability question inside ACR: what happens if Artifact Cache-specific components, such as the cache backend path, are operationally unavailable? ACR handles that case gracefully by falling back to normal registry pull behavior: it checks the customer's registry state and serves the image if the requested content already exists in ACR. In other words, cache-backend unavailability should not block pulls for content that is already present in the registry. What we will explore next This overview is the map for the rest of the series. The following posts will go deeper into the parts of the system where the design pressure is highest. Minimizing upstream work We will start with how Artifact Cache avoids making more upstream requests than necessary. This becomes difficult when many clients request the same not-yet-hydrated image at the same time. A Kubernetes scale-out event is the classic example: many nodes may ask for the same image concurrently, and the system must avoid turning one customer's action into unnecessary duplicate upstream work. Making Artifact Cache observable to customers We will also look at how customers understand whether their cache rule is healthy, whether credentials are usable, and why a pull failed. This is hard because a failed pull can involve customer configuration, Key Vault access, managed identity configuration, upstream credentials, upstream availability, data plane request handling, or asynchronous hydration. The engineering challenge is to expose the right customer-facing health and debug signals without turning internal topology into the user interface. Repository semantics in Artifact Cache Finally, we will look at repository semantics. Once upstream content becomes local, the repository is no longer just a mirror. Tags can move upstream, digest references are content-addressed, and customers may push their own content into downstream repositories. The visible repository state can involve both upstream-derived content and customer-owned downstream writes. Closing Artifact Cache is designed to make upstream-sourced artifacts behave like ACR-served content once customers choose to route those artifacts through their registry. The design goal is that customers can pull from ACR and reason about the result using ACR boundaries: registry configuration, local serving, customer-visible health, and predictable repository semantics.188Views2likes0CommentsIs Your Monitoring Actually Working? What's New in Monitoring Coverage
Monitoring is only useful when the right signals are collected, the right alerts are in place, and the data is actually flowing when teams need it. In large Azure environments, confirming all three across every VM and AKS cluster can still take too much manual work. At Microsoft Ignite, we introduced Monitoring Coverage in Azure Monitor, a centralized preview experience for finding coverage gaps and enabling recommended VM and container monitoring at scale. At Microsoft Build, we are expanding that experience with two new capabilities that make monitoring easier to operationalize: data flow status and at-scale recommended alert enablement for virtual machines and Azure Kubernetes Service (AKS). With these updates, teams can move beyond asking whether monitoring was configured. They can see whether recommended monitoring is enabled, whether important alert coverage is missing, and whether configuration issues may prevent monitoring data from reaching its destination. Monitoring Coverage overview with recommendations and data flow status. What is Monitoring Coverage? Monitoring Coverage in Azure Monitor gives you a single place to review recommended monitoring across supported Azure resources. The Overview page summarizes coverage across your selected scope, shows Azure Advisor observability recommendations, and provides quick actions to enable recommended monitoring settings. Coverage is grouped into basic, partial, and enhanced monitoring so you can quickly understand whether a resource is using only default monitoring or has the Microsoft-recommended configuration enabled. From there, you can drill into the Monitoring Details tab to review individual resources and take action. New: data flow status The most important question after enabling monitoring is simple: is the data flowing? Data flow status helps answer that question directly from Monitoring Coverage. The new data flow status summary shows how many resources need attention, passed initial checks, or are not configured for validation. It also highlights top resources that need attention so operators can start with the most important issues first. When you open data flow status for a resource, Azure Monitor shows validation checks across areas such as: Resource configuration Data collection rule associations Network connectivity Data flows to the configured destination Detected issues are prioritized at the top of the details pane, and each validation check includes a recommended action. After making a fix, you can run validation again to confirm that data flow issues are resolved. Data flow status details with validation checks and recommended actions. Alternatively, you can visualize your data flows and identify problems from there. New: enable recommended alerts at scale Monitoring Coverage now also helps close alerting gaps. From the Overview page, you can see recommendations such as Enable VM Recommended Alerts and Enable AKS Recommended Alerts, then select Apply to configure recommended alert rules from a centralized flow. For virtual machines, you can enable alerts across an entire subscription or choose selected resources. Subscription scope is useful when you want recommended alerts to apply broadly, including to future VMs in the selected subscription. Selected resource scope gives you more granular control when you want to enable alert rules for a specific set of VMs. The enablement flow lets you review recommended alert rules, adjust thresholds, and configure notification options such as email, Azure Resource Manager role notifications, Azure mobile app notifications, or an existing action group. Some VMs may already have alerts configured, and new rules are designed not to duplicate existing alerts. For AKS, Monitoring Coverage can surface recommended alert gaps and start the same guided pattern: review impacted resources, configure recommended alert settings, and use Review + Enable to create the alert rules. A resource-centric view for follow-up The Monitoring Details tab brings coverage and data flow into the same resource list. Two columns are especially useful for triage: Monitoring coverage and Data flow status. Select either value to open resource-level details. Monitoring coverage details show what is configured for the resource, including VM Insights, recommended alerts, data collection rules, data sources, destinations, and agent version when available. Data flow details show validation results and recommended remediation steps. This makes it easier to move from a high-level gap to the specific resource and configuration that needs attention. Getting started Monitoring Coverage is available in preview from the Azure portal. Open Monitor, select Monitoring Coverage (preview), and choose the subscriptions and resources you want to review. From the Overview page, you can: Review coverage across VMs and AKS resources. Apply recommendations to enable VM Insights, container monitoring, and recommended alerts. Use data flow status to find resources whose monitoring data needs attention. Open Monitoring Details for resource-level coverage and validation results. A few preview notes: enablement operations include up to 100 resources at a time, and enabling monitoring or alert rules may create data collection rules, deploy Azure Monitor Agent, configure destinations, or create alert rules. Data collection, workspace ingestion, and alert rules may incur costs based on the settings you enable. To learn more, see Monitoring coverage in Azure Monitor (preview). Looking ahead Monitoring Coverage is part of our continued work to make Azure Monitor easier to operationalize at scale. We want teams to spend less time hunting for monitoring gaps and more time acting on reliable, validated signals. We would love your feedback as you try these new Build updates and we look to expand support beyond this set of resource types. Use the Azure portal feedback options or share feedback through your Microsoft account team.171Views1like0CommentsManaged Connectors for SRE Agent (preview)- Govern what your agent can do
Giving an agent access to a tool is the easy part. The harder question is what it's allowed to do with that access. "Can the agent copy a file in OneDrive?" mostly answers itself. "Can it copy any file, to any destination, over one that's already there?" is the one that decides whether the integration has a governance layer. Managed Connectors is built around that second question. It expands the catalog of tools the agent can reach - OneDrive, SharePoint, Google Drive, GitLab, Power BI, Microsoft Security Copilot, with more being added regularly - and pairs it with a governance model that keeps the policy for those tools outside the agent's control. This is part of the Azure SRE Agent announcements at Build 2026 What's new Managed Connectors is the next generation of our connector experience. It significantly expands the catalog of third-party and first-party SaaS integrations available to SRE Agent and surfaces each one to the agent as a curated set of operations through the Model Context Protocol (MCP) - the same standard the agent already uses for every other tool source. Governance: the agent gets capability, you keep control The governance model is the headline of this release, so it's worth being concrete about it. When you add a connector, you walk through a short wizard - Set up connector, Configure tools, Review & Save - and the "Configure tools" step is where the policy is set. Three things make it different from "just wire the API up to the LLM": You choose what's exposed - it isn't automatic. A connector might offer 40+ operations; in the wizard you pick the ones the agent can use. The rest aren't shown to the model, so it can't call them. Parameter policy lives outside the agent. For each selected operation you can mark parameters as user-defined (pinned to a value you specify) or agent-defined (the agent fills it in). On the Microsoft Planner “Create a task” tool, for example, you can choose the group ID from a list of your joined groups – this means that the agent provides the task details but can’t assign it to any arbitrary group, because that isn’t a parameter it sees when invoking the tool. Per-tool approval is built in. Each operation has an Allow/Ask toggle integrated directly into the creation and edit wizards. "Ask" routes the call through the agent runtime human-in-the-loop approval flow before it executes. On that same Microsoft Planner connector, you might leave read-only tools like “List tasks” or "Get plan details” on Allow, but flip “Delete a task” to Ask so a human must confirm before anything is removed. This is enforced on the agent's runtime; it is not a prompt instruction the model can be talked out of following. Credential Isolation No long-lived secrets in the agent. No API keys, no client secrets, no certificates, no OAuth tokens. All service credentials are encrypted at rest and stored outside of the agent’s trust boundary Automatic token refreshed. Once you consent, the internal connector resource keeps your tokens valid. You won't be asked to re-authenticate unless your service itself requires it. You consent once, in your own browser, with your own service. SRE Agent never proxies your password or the sign-in flow. Per-connection authorization. Each connection is bound to the specific SRE Agent instance you set up on and cannot be used by external threat actors. How it fits together All of this is stored and evaluated outside the agent loop. Each configured connector becomes an MCP server that the SRE Agent runtime registers as a tool source, the same standard wire format the agent uses for everything else, so adoption on the model side is trivial. Each layer does one job, and the trust boundary between "what the model decided" and "what was actually sent" is explicit and inspectable: the agent never sees the operations you didn't select, never sees the parameter slots you pinned, and cannot bypass approval on operations you marked Ask. How to try it Open the SRE Agent portal and go to Builder > Connectors. Pick a connector from the catalog with the “Preview” label and go through the creation wizard steps. At the “Set up connector” step, choose how the connector authenticates. Start with “OAuth” if you just want to sign-in and see it working against your own account. At “Configure tools”, select the operations you want to expose, pin any parameters that shouldn't be agent-controlled, and mark sensitive operations as “Ask.” Review & Save. The connector is registered with the runtime and immediately available to your agent. You can enable/disable specific tools or connectors in the “Capabilities” section. Edit connector – after creating the new connector, at any point you can go back and authenticate it with a different account, add or remove operations, update tool parameters and configure approval policies Resources Create new SRE Agent — https://aka.ms/sreagent SRE Agent Documentation — https://aka.ms/sreagent/newdocs SRE Agent recipes — https://aka.ms/sreagent/recipes Build 2026 SRE Agent announcements - https://aka.ms/Build26/blog/SREAgent200Views1like0CommentsShaping what Azure SRE Agent does: Tool Permissions and Hooks
When an AI agent runs against production, the first question every security team asks is "What can it do, who decided it could, and what stops it from doing something it should not." Azure SRE Agent reached general availability in March. Since then, teams inside Microsoft and customers running it against real production workloads have asked for the same thing: finer-grained controls over what the agent can do on its own and a clear answer to who governs each call that reaches a tool. Today at Build 2026, we are releasing global tool access policies as one of a set of new governance controls. This post covers how they work. Tool access policies give security and platform teams a single place to define which tools the agent can invoke, under what conditions, and what requires human approval before it runs. Underneath those policies sits the identity the agent runs as the bedrock that every other control layer depends on. It is defense in depth applied to agent behavior: layers of control, each one holding on its own, so that governing the agent is something you can read, audit, and reason about as you scale it across production. Identity is the bedrock: managed identity today, agent identity next Start here, because nothing else matters if you skip it. The identity the SRE Agent runs as, and the Azure RBAC role assignments on that identity, are the most powerful boundary the agent works inside of. If your role assignments do not grant the agent access to a resource, none of the controls below come into play, because the agent cannot reach the resource to begin with. Network rules, tool permissions, hooks, and connector contracts all sit on top of an RBAC story that you write. The features in this post add layers above that floor. They do not replace it. Today the SRE Agent operates as a managed identity, and your RBAC role assignments on that identity govern what it can do. This is the bedrock, and it is the same model your other Azure workloads already use. You assign roles, you scope them, and the agent inherits exactly what you granted and nothing more. Everything that follows assumes the bedrock is in place. With identity settled, the next question is the obvious one: where is the agent allowed to send its traffic? Permissions: govern what the agent does with a tool Identity decides what the agent can reach. Permissions decide what the agent does with the access it has, down to the individual tool. Two levels cover the range: a point-and-click grid for the common cases, and hooks when a decision needs your own code. The grid is the easy mode. Every tool the agent can use, built-in tools along with MCP servers, services, and custom tools, shows up in one searchable list with two switches. On/Off sets whether the tool is available at all; turn it off and the agent cannot use it. Allow/Ask sets what happens when it is on: Allow lets the agent run the tool automatically, Ask requires a human to approve every time, except in Autonomous mode. Select tools in bulk to flip a whole category at once, filter by category or permission, and use the Advanced permissions tab when you want rules that apply at global, per-agent, or per-thread scope instead of tool by tool. Defaults stay put until you touch them, and the engine is fail-closed: if a rule cannot be evaluated, the call is blocked rather than allowed. That covers most of what teams need. Underneath those switches are three rules, allow, ask, and deny, and the Advanced tab is where you set them by scope. Global rules apply to every agent and thread, Agent rules to one custom agent, Thread rules to a single conversation. Deny is the hard one: it blocks the tool outright no matter the run mode, and a deny at a higher scope always wins, so an Allow at thread scope cannot reopen something denied globally. That split is deliberate. A platform team sets the Global guardrails that should never be crossed and the Asks that always need a human, and service teams add their own Allow rules at Agent scope for routine work, without being able to override the guardrails above them. Platform team, Global scope: deny: bash(az * delete *) - never delete, on any agent or thread deny: bash(kubectl delete *) ask: bash(az webapp restart *) - always confirm, even in Autonomous allow: bash(az monitor *) - auto-approve monitoring queries Service team, Agent scope: allow: bash(kubectl get *) - routine read-only work allow: bash(kubectl describe *) Two details make this safe to lean on. Rules match the canonicalized tool invocation rather than the raw text, so enforcement holds no matter how the command was assembled. And fail-closed has a softer edge than a hard stop: a cached last-known-good policy covers transient failures, so a blip in the policy store blocks the call rather than silently widening access. You can find these under Capabilities > Tools The layer worth spending time on is hooks. Allow and Ask answer "should this tool run." Hooks answer "should this specific call run, given exactly what it is about to do." A hook fires before the agent runs a tool and receives the actual call, parameters and all. Your code then decides the outcome and can reshape it: rewrite parameters before they are sent, inject extra context into the pipeline as a user message so the agent reconsiders before its next step, block the call outright, or redirect the agent toward a safer path. Because your code sees the real parameters, the decision can depend on anything you can express in code: which resource the call targets, whether a value falls outside an allowed range, the time of day, the result of an external policy lookup. This is where you write the rule the grid cannot. Two kinds of hook, mixable on the same agent. Command hooks are a script you write; reach for these when code is enough. Prompt hooks put a separate LLM in the loop as a judge that evaluates the call in context; reach for these when the decision needs reasoning rather than a fixed rule. A real example from our own internal test agent: when the agent tries to list files through the shell with ls or dir, a hook blocks the call. The agent absorbs the signal, reconsiders, and reaches for the ListDir tool instead. The hook did not argue with a human. It shaped what happened next. As with the grid, configure nothing and the agent behaves exactly as it does today. Both are additive. Authoring one is a short form. You name the hook, pick the event (Pre Tool Use, so it runs before the call), and set a tool matcher, either picked from the tool menu or written as a regex like (FetchWebpage|SearchMemory) with anchors and lookaheads when you need them, so the hook fires only on the calls you care about. You set a timeout and a fail mode (Block, so a hook that errors or hangs stops the call rather than waving it through), and you write the body in Bash or Python. A command hook reads the call as JSON on stdin, the event name, the tool name, its parameters, and the call id, and answers on stdout. Print nothing and exit zero to allow. Return a block decision with a reason to stop the call, and that reason is what the agent reads back. You can also substitute: run a cheaper or safer version yourself, block the real call, and hand your own output back as the result, so the agent never runs the expensive or risky original. #!/bin/bash input=$(cat) tool=$(echo "$input" | jq -r '.tool_name') # Block one tool, with a reason the agent will read if [ "$tool" = "ExampleToolName" ]; then echo '{"decision":"block","reason":"Blocked ExampleToolName by hook policy."}' exit 0 fi # Otherwise allow: print nothing and exit 0 exit 0 You can find these under Builder > Hooks Each layer holds on its own The layers stack. Identity is the floor: your RBAC assignments decide what the agent can reach at all. Permissions, the grid and hooks together, decide what it does with a tool. You author each layer, each one holds whether or not the layer above it behaves as expected, and all of it configures through the same ARM and Bicep surface your platform team already uses, reproducible the way the rest of your Azure estate is. The upgrade path is additive and non-breaking. Existing agents keep working. Turn on each control when you are ready, in the order your governance requires. There is more coming. We run Azure SRE Agent inside Microsoft on our own production workloads, so we feel the same gaps you do, and the next round is shaped by what we hear from teams running it in production today. Which control is doing the most for you, and which one are you still waiting on? Let us know and thank you! Getting started Create new SRE Agent — https://aka.ms/sreagent SRE Agent Documentation — https://aka.ms/sreagent/newdocs SRE Agent recipes — https://aka.ms/sreagent/recipes Build 2026 Announcement - https://aka.ms/Build26/blog/SREAgent181Views0likes0CommentsPrivate Plugins with Azure SRE Agent
SRE's and platform teams are building operational skills specific to their infrastructure: investigation runbooks, compliance checks, cost analysis playbooks, deployment verification procedures. The next step is making that work reusable across every agent in the organization without exposing it publicly. Today, SRE Agent supports plugin marketplaces hosted in private GitHub repositories, including GitHub Enterprise. This is part of the Azure SRE Agent announcements at Build 2026. You can now point SRE Agent at a private repo when adding a marketplace or installing a plugin. Authentication is handled per-marketplace, and supports OAuth, GitHub PATs, and GitHub Apps for GHE tenants. From one agent to an organization’s plugin catalog Most teams start with a single SRE Agent connected to their services. The agent learns their infrastructure, runs their runbooks, and handles their incidents. It works well. Then adoption grows. A second team stands up their own agent. Then a third. Platform engineering wants every agent to run the same compliance checks. Security needs approval hooks enforced consistently. FinOps has cost governance skills that should be standard across the organization. Suddenly the question isn’t “how do I set up my agent,” it’s “how do we share operational knowledge across all of them.” Without a distribution model, teams end up copying skill files between agents manually. A platform team writes a runbook, shares it over email or a wiki link, and each service team pastes it into their agent individually. When the runbook improves, some agents get updated, some don’t. There’s no version tracking, no central catalog, and no way to know which agent is running which version of which skill. Private marketplace support solves this. How Private Plugin marketplace meet enterprise needs A platform team publishes once, every agent installs. Codify best practices as plugins in a private GitHub repo. Service teams add that repo as a marketplace in their agents and install what they need. Compliance checks, cost governance thresholds, incident playbooks, deployment verification procedures all distributed through versioned plugins. Each team retains ownership. Security controls which plugins enforce approval hooks. FinOps locks cost thresholds into parameter values. Platform engineering governs infrastructure investigation patterns. The marketplace is the distribution layer for organizational standards. Versions are pinned, updates are explicit. Each installation locks to the commit at install time. A merged PR upstream does not change any agent’s behavior. Teams promote new versions on their own schedule: validate in dev, promote to staging, then production. Different agents can run different versions simultaneously. Reuse across environments and tools. The same plugin works across dev, staging, and production agents, and can be reused by local coding agents and other services that support plugins. One source of truth, not separate copies per environment. Accessing Private Plugin marketplaces Private repo support adds authentication to the SRE Agent's plugin workflow so your agent can clone and install from repos that require credentials. Authentication is configured once per marketplace. Every plugin within it inherits the credentials. Auth method When to use Setup OAuth github.com repos your agent can already access Uses your existing GitHub connection. One click. Personal access token Private repos in other orgs on github.com Per-marketplace PAT. Scoped to just that marketplace. GitHub App GitHub Enterprise (*.ghe.com) BYO App with private key in Azure Key Vault. Short-lived tokens minted at runtime. Getting started In SRE Agent, navigate to Builder > Plugins, then click Add Marketplace and enter the URL of the private marketplace you want to connect to. Then click Connect to GitHub to complete the OAuth sign-in. Click Add and you will see the plugins available from your connected marketplace. Click on the plugin to install and in the detail view you can browse the skills packaged with the plugin. click Install to install this plugin. You can now see the skills imported from plugins from Capabilities > Skills > Custom Skills The bottom line Private repo support turns the Plugin Marketplace from a public skill catalog into your organization’s internal distribution platform for operational automation. Your team writes the plugins. Your agents install them. Your GitHub permissions control who has access. Try it yourself: create a private repo with a marketplace.json and a few skills, add it as a marketplace in your agent, and install a plugin. Resources SRE Agent documentation — https://aka.ms/sreagent/newdocs SRE Agent overview — https://aka.ms/sreagent/newdocsoverview Plugin Marketplace capability page — https://aka.ms/sreagent/newdocs/capabilities/plugin-marketplace Build 2026 SRE Agent announcements - https://aka.ms/Build26/blog/SREAgent163Views0likes0CommentsAnyscale on Azure: Powering Enterprise AI at Massive Scale on Azure Kubernetes Service
Somewhere on your AI platform team, an engineer is on call this weekend — not for the model, not for the training run, but for the integration code holding five separate AI processing systems together. Data preparation on one. Training on a second. Evaluation on a third. Serving on a fourth. Observability bolted on across all of it. The glue between them has quietly evolved into a production system of its own, complete with its own failure modes and its own pager. This is what running AI at scale looks like for most enterprises in 2026. To process the full breadth of AI workloads, teams don’t have one platform, but a stack of multiple compute engines — stitched together and monitored around the clock. Training failures become increasingly costly as multi-node GPU clusters remain underutilized and difficult to operate. Inference costs climb in a straight line when they should be bending the other way. And the accelerators underneath, at six figures a year per node, sit at 30–40% utilization. None of this is a model problem. It is a systems problem, and it exposes a divide that is widening across the industry. The AI shift: Moving from API inference calls only to end-to-end AI Most enterprises start an AI journey by calling hosted model APIs. It’s the fastest way to experiment and ship. But as adoption grows, inference costs scale in a straight line while differentiation remains limited. The organizations pulling ahead are doing more than consuming models. They are customizing them with proprietary data, operating them at scale, and owning the infrastructure between their data and their models. Their unit economics improve as they scale. The dividing line isn’t budget. It isn’t ambition. It is a single architectural decision: whether the layer between your data and your models is something you rent in pieces or run as a single system. That unified system for end-to-end AI, almost without exception, is built on one runtime: Ray, an open-source framework widely adopted by AI-natives such as Cursor, Mistral and xAI to act as the engine that powers many of their workloads from multimodal data processing to reinforcement learning. Anyscale on Azure: Build and run end-to-end AI on your Azure subscription Anyscale on Azure brings the distributed compute runtime the AI industry has converged on — Ray— into your Azure tenant as an Azure Native service, that includes purpose-built developer tooling and unified pane for cluster management, built through deep engineering collaboration between Anyscale and Microsoft. Unlike other processing engines which either only support one hardware type (e.g. CPUs) or focus on a single workload (e.g. inference), Ray turns a heterogeneous cluster of CPUs and GPUs into a single Python runtime composing data preparation, distributed training, fine-tuning, reinforcement learning, high-throughput inference, and agentic execution as one program, not five interlocking systems. Anyscale created Ray and stewards the open-source Ray project, now governed by the PyTorch Foundation; the Anyscale Runtime is the production-grade layer that enterprises can utilize on critical paths from day one, bringing managed cluster operations, enterprise-grade support, and the operational reliability needed to run AI and data workloads at scale. On Azure, that runtime executes on your Azure Kubernetes Service (AKS) clusters, inside your subscription, and under Microsoft Entra ID workload identity. Your data, models, and weights never leave your cloud, and consumption is billed through Azure with drawdown against your existing Azure commitment (MACC). Sovereignty isn't a label bolted on after the fact. It is the architectural starting point: customer-owned data and models in the customer-owned tenant and governance boundary. The variable per-token economics of hosted APIs are replaced with compute you govern directly. Your proprietary data becomes a compounding advantage rather than a payload shipped to a third-party endpoint. A single runtime for the full AI lifecycle The cost profile of enterprise AI is largely architectural. Fragmented stacks — separate systems for prep, training, evaluation, and serving — produce a predictable set of failure modes such as Idle GPU time, Integration code and cross-system data movement. The result: production GPU utilization only in the 30–40% range, against accelerators that cost six figures per node per year. On the same fleet, Anyscale customers run those accelerators at 80%+ sustained utilization and report 40–60% lower GPU spend versus static, single-tenant clusters — driven by fractional GPU allocation (down to 0.2 of a device), bin-packing across complementary memory and compute profiles, gang scheduling for distributed training, priority-aware preemption that lets production inference take precedence over ad-hoc training, and spot integration with checkpoint-aware preemption so long-running jobs survive reclamation without lost work. Anyscale on Azure replaces this with a single Ray-powered runtime that spans the lifecycle as one distributed computation graph: Ray Data (distributed preparation) → Ray Train (fault-tolerant training) → Ray Tune (hyperparameter search) → Ray Serve (inference) — under one managed control plane. On top of open-source Ray, the Anyscale Runtime adds fault-tolerant training with checkpoint/restart, optimized scheduling, faster cluster bring-up, inference-aware autoscaling, and per-stage observability. Ray is the unifying layer that, rather than replacing, streamlines distributed processing of the framework stack the AI industry already uses: PyTorch, Hugging Face Transformers, FSDP, DeepSpeed, and Megatron for training, vLLM and SGLang for high-throughput inference with continuous batching, paged attention, and speculative decoding. Ray Train orchestrates the three parallelism patterns modern training requires — data parallel, model parallel, and hybrid 3D parallel (data + tensor + pipeline) — for trillion-parameter models, without requiring teams to write custom distributed code. The architectural payoff is direct: a single Python program defines a graph spanning CPU-heavy preparation and GPU-heavy training. The model produced by Ray Train is served by Ray Serve in the same cluster, against the same storage. The operational, identity, and observability surface is unified instead of fragmented. What enterprises deploy with Anyscale on Azure There are five workloads that power the development of modern AI systems, spanning data processing, training, inference, and simulation. But in most environments, each depends on separate engines, frameworks, and orchestration layers. The resulting fragmentation drives up infrastructure spend, latency, and engineering complexity. This makes a single Ray-based runtime under Anyscale’s managed control plane the operationally rational choice. Anyscale on Azure provides a complete platform to build and deploy AI applications using the same APIs as open-source Ray. While the data plane runs inside the customer’s AKS cluster, the managed control plane provides a unified interface for development, debugging, and cluster operations. AI in your trust boundary by design: the architecture Anyscale on Azure is an Azure Native product — discoverable via the Azure portal and provisioned through Azure Resource Manager with every resource tagged, scoped, and policy‑bound like any other in your subscription. Anyscale on Azure is a split-plane deployment: Control plane (managed by Anyscale) — scheduling, jobs, services, workspaces, and observability. Data plane (your Azure subscription) — Ray clusters run on your AKS, in your VNet, on your storage (Azure Blob / ADLS Gen2 via BlobFuse2). The trust boundary is what matters — more than any individual data plane feature — for regulated workloads (financial services, healthcare, public sector) and any enterprise where proprietary data is the differentiation. The execution model: Workloads run inside your AKS cluster — your subscription, your VNet. Model weights, training data, KV caches, checkpoints, and inference traffic never leave the boundary. Provisioning is ARM-native — resources tag, scope, and inherit Azure Policy like anything else in the subscription. Identity is Microsoft Entra ID end to end — workload identity issues pod credentials; RBAC governs access. No long-lived keys, no parallel secret store. Network controls are yours — Private Link, NSGs, Cilium-based Azure CNI policies, and customer-managed encryption keys via Key Vault. Audit is the Azure Activity Log — the same surface your compliance team already monitors. The Anyscale Operator is the only Anyscale-controlled component in your environment — it runs inside your AKS, communicates with the control plane via egress only, and accepts no inbound access from Anyscale. The result: code and data stay in your Azure subscription. Your existing compliance posture, audit surface, and data residency certifications carry forward — nothing new to attest. Billing rolls through the same Azure invoice with MACC drawdown — no second invoice, no parallel procurement. Production evidence Xoople planetary‑scale satellite imagery on Anyscale on Azure; multimodal AI turns spectral data into operational intelligence. "Anyscale lets our teams focus on models and outcomes rather than infrastructure, dramatically accelerating the path from experimentation to deployment," — Milos Colic, VP of Engineering, Xoople. Wayve trains the next generation of autonomous‑driving foundation models on Anyscale on Azure, running distributed ML and data pipelines across large CPU and GPU fleets. The operational driver is GPU‑capacity aggregation at a scale that no single region or cluster can deliver. Beyond Anyscale on Azure, the same Ray runtime is used in production at Cursor, Physical Intelligence, xAI, Coinbase, Bedrock Robotics, and Runway. Bedrock Robotics scaled compute 85x on Anyscale without linearly increasing costs. Currently with 12M+ weekly downloads (+400% YoY) and 42K+ GitHub stars and now openly governed under the PyTorch Foundation (Linux Foundation), Ray is becoming the de-factor open-source standard and is not a single-vendor runtime. Pricing Pricing is usage‑based and consolidates onto the same Azure invoice as the rest of the customer's subscription, including drawdown against existing Azure commitment (MACC): Azure infrastructure — standard Azure compute and GPU charges for the AKS substrate the workload runs on, scaling directly with actual usage. Anyscale service layer — pay‑as‑you‑go through Azure service meters with no upfront commitment, priced by CPU, memory, and GPU type. Where Anyscale on Azure fits Base-model intelligence is converging. Enterprises can buy access to the same frontier models, so the model itself is no longer the moat. What separates the enterprises pulling ahead is the layer underneath: how efficiently they run the full AI lifecycle at scale, how much compounding leverage they extract from their proprietary data, and whether they own the runtime that ties it all together. Anyscale on Azure is the Azure Native runtime layer for that posture — bringing the open-source distributed compute standard the AI industry has converged on into the same Azure governance, identity, and procurement model as the rest of the tenant. The shape of enterprise AI is settling. The teams pulling ahead are not the ones renting the most intelligence through APIs — they are the ones building and operating AI systems inside their own cloud, on their own data, under their own governance, and scaling those systems on the open distributed runtime the industry has already converged on. Anyscale on Azure is that runtime, delivered as an Azure Native product: Ray, productionized — the open‑source distributed compute standard for AI, hardened with the Anyscale Runtime, a managed control plane, and observability designed for foundation‑model‑scale workloads. One runtime, the full AI lifecycle — data preparation, training, fine‑tuning, reinforcement learning, inference, and agentic workloads in a single Python program, on a single substrate, with no cross‑system glue. Inside your Azure tenant, on the AKS you already run — customer‑owned data, customer‑owned models, customer‑owned governance. Entra identity, Azure RBAC, Private Link, Activity Log audit, and customer‑managed keys end to end. One Azure invoice — usage‑based pricing through the Marketplace with MACC drawdown; no parallel procurement, no second vendor contract. If your team is wrestling with GPU utilization, fragmented data‑to‑serving stacks, training jobs that exceed any single region's capacity, or hosted‑API costs that scale faster than your usage — this is the runtime built for that problem. Try it now Provision your first Anyscale Cloud by navigating to the Azure portal. Click on "Create" to begin creating the Anyscale cloud resource and link the necessary Azure resources. your Anyscale Cloud directly from Azure Portal. e. Explore the quickstart guides and documentation on Microsoft Learn to get started. For architectural deep‑dives, capacity planning, or a hands‑on workshop with the Anyscale on Azure solution architects, reach out through your Microsoft account team. Deepen your expertise and deep dive on best practices in the upcoming virtual webinar. Register here. The infrastructure for the next decade of enterprise AI is here. Build on it. Links and Resources Press Release: Anyscale Launches on Microsoft Azure as a Native Integration for Enterprises Announcing Anyscale on Azure public preview: Powered by Ray on AKS Youtube Video: Anyscale on Azure: Scale Python AI workloads with managed Ray on AKS Azure on Anyscale overview Architecture Create an Anyscale Cloud in Azure Portal Pricing Support model Terms and Conditions Frequently asked questions160Views0likes0CommentsIntroducing On-demand Sandboxes for Azure Durable Task Scheduler (Private Preview)
Maybe it needs a native toolchain. Maybe it runs untrusted customer or LLM-generated code. Maybe it needs Python from a .NET orchestrator, or bursty compute that should scale to zero when the work is done. Today, we're thrilled to announce On-demand Sandboxes for Azure Durable Task Scheduler, now available in private preview. On-demand Sandboxes lets you move those individual workflow steps to managed, isolated compute while your orchestrator stays exactly where it is. Tell DTS which steps should run in isolation, provide a container image with the step code, and DTS handles provisioning, scaling, and teardown. No infrastructure to manage, no idle costs, no orchestrator changes. Sign up for On-demand Sandboxes Private Preview Today → Availability: On-demand Sandboxes targets the standalone Durable Task SDKs used outside the Azure Functions host — for apps running on Azure Container Apps, Azure Kubernetes Service, App Service, or anywhere else you self-host. The private preview supports the .NET and Python Durable Task SDKs, with additional language SDKs and Azure Functions support coming soon. What is Azure Durable Task Scheduler? The Durable Task Scheduler is a fully managed backend for durable execution on Azure. It can serve as the backend for a Durable Function App using the Durable Functions extension, or as the backend for an app leveraging the Durable Task SDKs in other compute environments, such as Azure Container Apps, Azure Kubernetes Service, or Azure App Service. For a deeper introduction, see the Durable Task Scheduler overview or the full Durable Task documentation. Why On-demand Sandboxes? Most activities belong in-process. They're fast, simple, and co-located with your orchestrator. But sometimes you hit a step that doesn't fit: it needs a native binary, a different language runtime, per-invocation isolation, or bursty compute you don't want to keep warm. On-demand Sandboxes gives you a way to handle those exceptions without spinning up dedicated infrastructure or managing scaling policies in Azure Kubernetes Service or Azure Container Apps. Activity-level granularity. Move individual steps to managed compute, not your whole app. Per-activity or per-invocation isolation. Each execution runs in a clean, microVM-backed sandbox. Ideal for untrusted code, customer plugins, or LLM-generated logic. Cross-runtime flexibility. Run a Python inference step from a .NET orchestrator. No compromise on either side. Scale-to-zero. Pay for CPU and memory per second of execution, not infrastructure that waits. No orchestrator changes. Your orchestration code and hosting model don't change at all. Here are a few scenarios where On-demand Sandboxes shines: Native toolchains. Package ffmpeg, LibreOffice, or Pandoc in a container without dragging them into your main app. CPU-heavy preprocessing. OCR, layout extraction, or image processing can scale independently of the rest of your workflow. Cross-runtime workflows. A .NET orchestrator dispatches a Python inference step. No compromises. Sandboxed code execution. Run customer plugins or LLM-generated code with a clean boundary on every invocation. Multi-tenant isolation. Tenant-specific steps get dedicated boundaries while everything else stays in-process. Bursty event-driven workloads. Steps that spike hard but rarely may not justify always-on infrastructure. Sub-second cold starts mean you get capacity when you need it without paying to keep it warm. How it works On-demand Sandboxes uses a two-part model: a worker profile in your orchestrator app that tells DTS which activities to offload, and a worker image that contains those activity implementations. Your orchestrator still calls activities the same way it always has; the decision to run one activity in a sandbox lives in the profile configuration. 1. Declare a sandbox worker profile In the app that hosts your orchestrator, define a sandbox worker profile. The profile gives DTS the container image, resource shape, concurrency setting, and activity names that should run in a sandbox: using Microsoft.DurableTask.Worker.AzureManaged.Sandbox; [SandboxWorkerProfile("code-executor")] internal sealed class CodeSandboxWorkerProfile : ISandboxWorkerProfile { public void Configure(SandboxOptions options) { options.ContainerImage = Environment.GetEnvironmentVariable("DTS_SANDBOX_IMAGE") ?? throw new InvalidOperationException("DTS_SANDBOX_IMAGE is required."); options.Cpu = "1000m"; options.Memory = "2048Mi"; options.MaxConcurrentActivities = 1; options.AddActivity(TaskNames.ExecuteCode); } } Then enable on-demand sandbox discovery when you configure the Durable Task worker in the main app: workerBuilder.AddTasks(tasks => tasks.AddAllGeneratedTasks()); workerBuilder.UseDurableTaskScheduler(options => { options.EndpointAddress = Environment.GetEnvironmentVariable("DTS_ENDPOINT"); options.TaskHubName = Environment.GetEnvironmentVariable("DTS_TASK_HUB"); options.Credential = credential; }); workerBuilder.EnableSandboxes(); Here's what the profile configuration does: SandboxWorkerProfile: a friendly profile id for this sandbox setup. It groups the activity, image, and resource settings for monitoring and reuse across deployments. ContainerImage: the container image (from your registry) that contains the activity implementations. Cpu / Memory: the resource shape for each worker instance. Sized per your activity's needs. MaxConcurrentActivities: how many activities a single worker instance can process concurrently. AddActivity: the specific activity to offload. Only activities added to a sandbox worker profile execute in DTS-managed isolated compute; everything else stays in-process. The orchestrator call site doesn't change: ExecuteCodeOutput execution = await context.CallActivityAsync<ExecuteCodeOutput>( TaskNames.ExecuteCode, new ExecuteCodeInput(pythonCode, input.CsvData)); ExecuteCode is not registered in the main app's in-process activity list. When the orchestrator calls it, DTS uses the codegen profile to route the work to the sandbox image. 2. Build the worker image The worker image is a container you own. In most apps, this worker lives in a separate project from the orchestrator host so it can have its own entry point, dependencies, and container image. It registers the activity implementations it can run and opts in to managed execution with UseSandboxWorker(): builder.Services.AddDurableTaskWorker(workerBuilder => { workerBuilder.AddTasks(tasks => { tasks.AddActivity<ExecuteCodeActivity>(); }); workerBuilder.UseSandboxWorker(); }); UseSandboxWorker() is the key line. It signals that this worker runs in DTS-managed compute. The sandbox worker does not need to configure the DTS endpoint, task hub, profile id, or credentials; DTS injects the runtime settings when it starts the container. The activity implementations themselves are standard Durable Task activities. There's nothing special about the activity code: it can call a runtime with different dependencies, such as Python and pandas, while running in an isolated container instead of in your main app's process. Package the image like any containerized service, including whatever runtimes and native tools the activity needs. Push it to your container registry (e.g., Azure Container Registry) and reference the image in the worker profile's ContainerImage option. View logs in the DTS dashboard Once your sandbox activities are running, you can view their execution logs directly in the Durable Task Scheduler dashboard. The dashboard shows real-time output from your managed workers, including stdout, stderr, and activity lifecycle events. This gives you full visibility into what's happening inside the sandbox without needing to configure external log sinks or set up your own observability pipeline. Demo Get started On-demand Sandboxes is in private preview. To get access, sign up here. We'll enable the feature on your scheduler and help you get your first sandbox activity running. Once you're in, the workflow is straightforward: declare a sandbox worker profile in your orchestrator app, build and push a worker image, and DTS takes care of the rest. Sign up for On-demand Sandboxes Private Preview Today → Documentation: Durable Task Scheduler overview Samples: Azure-Samples/Durable-Task-Scheduler Pricing: Azure Durable Task Scheduler pricing Questions, feedback, or ideas? Open an issue in the Durable-Task-Scheduler GitHub repo. We'd love to hear from you.251Views0likes0CommentsAnnouncing Anyscale on Azure public preview: Powered by Ray on AKS
Today, I’m excited to announce the public preview of Anyscale on Azure, bringing Anyscale’s managed Ray platform and the Anyscale Runtime natively to Azure, all running on Azure Kubernetes Service (AKS). It is the fastest path I have seen from a single notebook to a multi-region distributed AI job, running on the AKS clusters your platform team already operates.264Views1like0Comments