updates

Network security perimeter for Azure Service Bus & also now available in Azure Gov. Regions
TL;DR
Network security perimeter support for Azure Service Bus is now Generally Available. You can now place your Service Bus namespace inside a central security boundary and apply perimeter-based governance to inbound and outbound network access, while keeping key PaaS-to-PaaS scenarios secure and auditable.

Introduction
We’re excited to announce that network security perimeter support for Azure Service Bus is now Generally Available (GA). This milestone brings one of Azure’s most widely used messaging services into the network security perimeter ecosystem, enabling customers to define a centralized security boundary across messaging and data services.

Alongside this, we are also expanding network security perimeter’s reach. It is now available in Azure Government regions, including:
- Texas
- Arizona
- Virginia
- DoD East
- DoD Central

This ensures that customers operating in regulated, sovereign, and mission-critical environments can adopt network security perimeter while meeting their compliance and regional requirements.

Why this matters
Modern applications rely heavily on messaging layers like Service Bus. These systems often connect microservices, data platforms, key management systems, and external integrations. As architectures scale, managing network access individually becomes complex and error-prone. Network security perimeter changes this by introducing a perimeter-based access model, where:
- Communication is restricted by default
- Access must be explicitly allowed
- Governance is applied consistently across services

With Service Bus now onboarded and network security perimeter extending into Azure Government regions, customers can apply this model across both commercial and sovereign environments.

What you can do with Service Bus + network security perimeter
- Confine communication within a security boundary: Service Bus namespaces communicate only with resources inside the perimeter by default, blocking unintended access.
- Secure PaaS-to-PaaS communication: Enable secure interactions between Service Bus, Azure Key Vault (for customer-managed key, or CMK, scenarios), and other network security perimeter-enabled services (see "What is a network security perimeter?" under Azure Private Link on Microsoft Learn).
- Define explicit access controls: Inbound rules → IP ranges and subscriptions. Outbound rules → FQDN-based filtering.
- Enable audit and compliance visibility: Diagnostic logs capture all access attempts, supporting compliance and investigation workflows.
- Use Private Link seamlessly: Private endpoint traffic continues to work without additional configuration inside the perimeter.

Azure Government availability
With this update, network security perimeter is now available in key Azure Government regions (Texas, Arizona, Virginia, DoD East, DoD Central), enabling:
- Consistent security across clouds: Apply the same network security perimeter model across public Azure regions and Azure Government environments.
- Support for regulated workloads: Customers in federal, defence, and highly regulated industries can now enforce perimeter-based governance, reduce exposure risks, and meet compliance requirements.
- Secure cross-service patterns in Gov clouds: Azure Government boundaries now support scenarios like CMK with Key Vault, service-to-service messaging, and controlled external access.

The onboarded PaaS services are listed in "What is a network security perimeter?" (Azure Private Link, Microsoft Learn).

What’s next
Service Bus GA further strengthens network security perimeter’s growing coverage across Azure PaaS services. We will continue to:
- Expand PaaS service onboarding
- Improve access rule capabilities (e.g. service tag-based access, identity-based access)
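The access model described in this post (deny by default, inbound rules by IP range or subscription, outbound rules by FQDN) can be sketched in a few lines of Python. This is an illustrative model only, not the Azure API or the service's actual enforcement logic: the `Perimeter` class, its field names, and the sample resource and subscription names are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Perimeter:
    """Hypothetical model of a network security perimeter (not an Azure SDK type)."""
    members: set = field(default_factory=set)                # resources inside the boundary
    inbound_cidrs: list = field(default_factory=list)        # allowed source IP ranges
    inbound_subscriptions: set = field(default_factory=set)  # allowed source subscriptions
    outbound_fqdns: set = field(default_factory=set)         # allowed destination names

    def allows_inbound(self, source_resource=None, source_ip=None, subscription=None):
        # Resources inside the same perimeter can talk to each other by default.
        if source_resource in self.members:
            return True
        # Otherwise access must be explicitly allowed by an inbound rule.
        if subscription in self.inbound_subscriptions:
            return True
        if source_ip is not None:
            addr = ipaddress.ip_address(source_ip)
            return any(addr in ipaddress.ip_network(cidr) for cidr in self.inbound_cidrs)
        return False  # everything else is denied

    def allows_outbound(self, destination_resource=None, fqdn=None):
        if destination_resource in self.members:
            return True
        return fqdn is not None and fqdn.lower() in self.outbound_fqdns


# Hypothetical example: a Service Bus namespace and a Key Vault share one perimeter.
perimeter = Perimeter(
    members={"sb-orders", "kv-orders"},
    inbound_cidrs=["203.0.113.0/24"],
    inbound_subscriptions={"sub-analytics"},
    outbound_fqdns={"partner.example.com"},
)
```

With this setup, a call from `kv-orders` is allowed because it is a perimeter member, a caller at `198.51.100.7` is denied because no inbound rule covers it, and outbound traffic is allowed only to `partner.example.com`.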
shashankamalladi
Jun 16, 2026 - Azure Networking Blog
Modern VM monitoring, powered by OpenTelemetry
At Build 2026, we're announcing the general availability of OpenTelemetry (OTel) Guest OS metrics for Azure VMs and Arc-enabled Servers. OTel provides a standards-based foundation for VM monitoring with consistent metrics across Windows and Linux, richer Guest OS and per-process visibility, and streamlined integration with open-source and cloud-native observability tools.

Alongside the GA, we're introducing an enhanced VM monitoring experience, recommended alerts, and out-of-the-box Grafana dashboards, all powered by OTel Guest OS metrics. We're also sharing upcoming VM troubleshooting capabilities in the Azure Copilot observability agent enriched by OTel Guest OS metrics.

What are OpenTelemetry Guest OS metrics
OTel Guest OS metrics are collected from inside a VM. Today's coverage includes a curated set of CPU, memory, disk I/O, networking, and per-process metrics, including CPU utilization, memory usage, uptime, and thread count. This list reflects what is supported today and will continue to expand as the OTel Host Metrics Receiver evolves upstream. This level of visibility helps customers diagnose operating system and application issues without manually signing into individual VMs.

Why they matter
1. Lower cost and faster queries: Default OTel Guest OS metrics are available at no additional cost. They are stored in Azure Monitor Workspace using metric-optimized storage and pricing, providing lower cost and faster query performance compared to Log Analytics (LA)-based metrics.
2. Per-process visibility for deeper troubleshooting: Customers can optionally enable per-process metrics for deeper visibility into VM resource consumption. This helps identify noisy processes, memory leaks, runaway jobs, or resource-intensive applications without manually signing into the VM.
3. Consistent metrics across Windows and Linux: Use the same metric names, dashboards, and alerts across operating systems without maintaining separate monitoring configurations.
4. Native PromQL support: Use PromQL with the scale and managed experience of Azure Monitor Workspace.
5. OpenTelemetry-based standardization: Use the same metrics across Azure Monitor, existing OTel pipelines, or other compatible observability backends.

Log Analytics (LA)-based metrics vs OTel-based metrics
Customers running workloads on Azure VMs and Arc-enabled Servers have long relied on Log Analytics (LA)-based metrics for fleet visibility. That experience continues to be generally available and trusted by thousands of customers. We recommend evaluating your requirements to determine which approach best suits your needs. LA-based metrics remain the foundation for customers who need advanced analytics and correlation, while OTel-based metrics open new possibilities for modern VM observability. Learn more.

New Capabilities Powered by OpenTelemetry

VM monitoring experience powered by OpenTelemetry (GA)
We're excited to announce the general availability of the enhanced monitoring experience for Azure VMs and Arc-enabled Servers. This experience brings comprehensive monitoring capabilities into a single, streamlined view, helping you observe, diagnose, and optimize your virtual machines more efficiently. The new experience offers two levels of insight within one unified interface:
- Basic view (Host OS-based): Available for all Azure VMs with no configuration required. This view surfaces key host-level metrics, including CPU, disk, and network performance, for quick health checks.
- Detailed view (Guest OS-based): Requires simple onboarding. Azure Monitor continues to support the GA detailed view powered by Log Analytics-based metrics. Customers can now choose to power the experience using OTel Guest OS metrics, which enable recommended alerts and provide expanded visibility into Guest OS and process-level resource consumption, including CPU, memory, disk I/O, and networking.
Dashboards with Grafana for VMs
For deeper analysis and customization, customers can use Azure Monitor dashboards with Grafana, powered by OTel Guest OS metrics and PromQL, at no additional cost. Built-in dashboards provide out-of-the-box visualizations for at-scale monitoring, host-level monitoring, Guest OS monitoring, and per-process monitoring, while still allowing teams to:
- Customize panels and dashboards
- Run ad hoc investigations
- Import dashboards from the Grafana community
- Share dashboards using Azure RBAC and ARM/Bicep deployment support

Together, the enhanced VM monitoring experience and Grafana dashboards provide both streamlined day-to-day monitoring and flexible deep troubleshooting capabilities for modern VM environments.

Query metrics in the context of your resources (GA)
We’re also announcing the general availability of resource-scope querying for Azure Monitor Workspace (AMW) metrics, including OTel Guest OS metrics. With resource-scope query, you can query metrics directly from the context of a resource, resource group, or subscription, without needing to know which workspace stores the data. This simplifies troubleshooting, aligns with Azure-native workflows, and enforces least-privilege access using Azure RBAC.

This capability powers scenarios such as querying OTel Guest OS metrics directly from the Virtual Machine resource in the Azure portal, or scoping resources as a dedicated data source in Grafana and querying them with PromQL. Both make it easier for application and infrastructure teams to monitor and troubleshoot in the context of their workloads.

Coming soon: Observability Agent Troubleshooting for VMs (Public Preview)
Today, the Observability Agent helps customers investigate issues by correlating applications, infrastructure signals, LA-based metrics, logs, alerts, health information, and recent changes into a guided investigation narrative.
Support for OTel Guest OS metrics is coming soon, extending investigations with richer Guest OS and per-process visibility. With OTel Guest OS metrics, the Observability Agent will be able to incorporate finer-grained operating system and process-level insights into its analysis, helping customers more quickly identify resource bottlenecks and understand their impact on application performance. Instead of manually piecing signals together across multiple tools and timelines, customers will receive a guided investigation summary with likely causes and recommended next steps. Combined with the new VM monitoring experience and Grafana dashboards, customers will have both AI-assisted investigations and powerful manual troubleshooting tools built on the same OTel foundation.

Onboarding VMs at scale to OpenTelemetry
Onboarding Azure VMs and Arc-enabled Servers to OTel Guest OS metrics is now simpler and more cost-efficient than ever. For teams getting started at scale, the easiest path is through the Monitoring Coverage experience in the Azure portal, where you can review recommended resources and onboard VMs through a guided workflow. Customers that prefer infrastructure-as-code can use ARM and Bicep templates to apply the same monitoring configuration programmatically.

Azure Advisor recommendations provide another entry point for onboarding, proactively identifying VMs that are not fully monitored and guiding customers to enable OTel-based monitoring with a few clicks. This helps teams continuously improve coverage across their fleet without needing to manually audit resources.

Customers can now also reuse an existing Data Collection Rule (DCR) during onboarding, making it easier to standardize monitoring across large VM fleets. After onboarding, teams can centrally evolve their monitoring configuration by updating that DCR to collect additional metrics and logs, with changes applying across all associated VMs.
Get Started
Explore the new OpenTelemetry-powered experiences today:
- Enable enhanced monitoring for an Azure virtual machine - Azure Monitor
- Migrate from logs-based to OpenTelemetry metrics for Azure virtual machines - Azure Monitor
- Metrics experience for virtual machines in Azure Monitor - Azure Monitor
- Use Dashboards with Grafana for Azure Virtual Machines - Azure Monitor
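To make the per-process troubleshooting idea concrete, here is a minimal Python sketch that ranks processes by average CPU utilization to surface a noisy one. It is an offline illustration, not an Azure Monitor or PromQL query: the sample records, their field names, and the use of `process.cpu.utilization` as the metric name are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-process data points, shaped loosely like exported metric samples.
samples = [
    {"metric": "process.cpu.utilization", "process": "java", "value": 0.5},
    {"metric": "process.cpu.utilization", "process": "java", "value": 1.0},
    {"metric": "process.cpu.utilization", "process": "nginx", "value": 0.25},
    {"metric": "process.cpu.utilization", "process": "nginx", "value": 0.25},
    {"metric": "process.cpu.utilization", "process": "backup", "value": 0.5},
    {"metric": "process.memory.usage", "process": "java", "value": 2048.0},
]


def noisiest_processes(samples, metric, top_n=3):
    """Return (process, average value) pairs for one metric, highest average first."""
    by_process = defaultdict(list)
    for sample in samples:
        if sample["metric"] == metric:
            by_process[sample["process"]].append(sample["value"])
    averages = {name: mean(values) for name, values in by_process.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top_n]


top = noisiest_processes(samples, "process.cpu.utilization", top_n=2)
```

In practice the same ranking would more likely be expressed as a PromQL query against the Azure Monitor Workspace; the sketch only shows the shape of the analysis.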
viviandiec
Jun 14, 2026 - Azure Observability Blog
PUBLIC PREVIEW - Azure Monitor - Collect Azure Resource Platform Logs at Scale with DCRs
How DCR-based platform logs simplify telemetry collection for organizations managing 1,000+ resources.
Mahesh_Sundaram
Jun 14, 2026 - Azure Observability Blog
VNet integration for Azure SRE Agent (preview)
For many production systems, the logs, databases, private endpoints, repositories, and runbooks an SRE Agent needs to do its job are behind network boundaries your security team already governs. VNet integration for Azure SRE Agent, now in preview, puts the agent's outbound traffic under those same controls - your virtual network, your NSG rules, your private DNS - so it reaches only what your network allows.

The principle is one your security team already applies to every other workload: a component's network access shouldn't depend on the component behaving correctly. Identity governs what the agent can reach. Permissions and hooks shape what it does within reach. The network sits beneath both: it blocks any request to a destination you haven't allowed, no matter what the agent decides.

Why egress control matters
Two reasons. First, the agent reads sensitive things by design. Inspecting logs, code, configuration, and internal systems is the whole point during an incident, which means you have to decide where that data can go. Open egress gives that data a path out of your network - a risk you wouldn't accept for any other production-adjacent workload.

Second, it reasons over text it didn't write - logs, issue descriptions, tool output - which is how prompt injection gets in. Handling that is partly model safety, and Azure SRE Agent runs under Microsoft's Responsible AI standard with safety work from OpenAI and Anthropic. Network controls add another layer: an instruction that tries to reach a destination you haven't allowed can't run, because the network blocks it.

For example, an agent investigating an outage might query Log Analytics, read deployment configuration, and call an internal runbook - all private resources. With VNet integration, those calls follow the routes, DNS, and firewall rules your workloads already use. A request to an external endpoint you haven't allowed fails at the network boundary.
It doesn't depend on the model recognizing the risk and refusing; the network stops it either way.

Choose an egress mode
Azure SRE Agent has three egress modes, and you don't have to start at the strongest.
- Unrestricted: all outbound traffic is allowed.
- Limited: deny all outbound traffic, then allow an explicit list of hosts. This gives you host-level control without setting up a full VNet.
- Azure VNet: outbound traffic goes through a delegated subnet in your network, with your NSG rules and private DNS applied. This is the recommended mode for production and regulated workloads.

How Azure VNet mode works
Outbound traffic takes one of two paths, and every call takes exactly one.

Your VNet. Everything not placed on the managed path goes through a delegated subnet in your own network, where your NSG rules, private DNS, and firewall all apply. The agent is just another workload on that subnet, so it can reach what the subnet can reach: databases behind private endpoints, internal services, monitoring stores, and key vaults - the parts of production that aren't reachable from the public internet. The resources that matter most during an incident are usually the private ones. If your network connects to on-premises over ExpressRoute or VPN, the agent can reach those systems too, as long as your existing routes and rules allow it.

The managed infra path. Some destinations go through Azure SRE Agent's managed infrastructure network instead: platform services the agent needs, plus optional categories you turn on (package registries, code repositories, and remote MCP servers). This path skips your VNet, so your NSG rules and Firewall Policies don't apply to it. Treat it as a deliberate exception, used only where you need it.

Why public services start on the managed path
Public services are hard to allow by IP address. GitHub, PyPI, npm, NuGet, apt, and the container registries run on large, changing IP ranges, and they don't map to a single Azure service tag.
If your NSG filters by IP and port, keeping those lists up to date is constant work, and when a list falls behind, the agent can't pull a package or read a repository, so an investigation stalls on a networking problem that has nothing to do with the incident.

Each category has a toggle: package registries (PyPI, npm, NuGet, apt), code repositories (GitHub, GitHub Enterprise, Azure DevOps), remote MCP servers, and a list of additional hostnames. Starting with these on the managed path keeps the agent working reliably without maintaining an IP allowlist. For build-time dependencies, that's usually fine.

If you want this traffic inspected too, the next step is name-based (FQDN) egress filtering in your own network. Once your firewall can allow github.com and pypi.org by name, you can move these categories off the managed path and route them through your VNet instead.

Configure it
Two decisions: the subnet, and what (if anything) uses the bypass, that is, the managed path.
1. Navigate to Settings > Workspace Configuration > Network.
2. Choose Azure VNet as the egress mode.
3. Select a subnet that is /27 or larger and delegated to `Microsoft.App/environments`.
4. Decide which categories, if any, use the bypass.
5. Restrict who can change the egress mode and bypass toggles. These settings widen or narrow the agent's reach, so govern them like any production network control.
6. Test the outbound behavior before using the agent with production data.

A reasonable setup for most enterprises during preview: use Azure VNet mode, keep package registries and code repositories on the bypass if you need reliable access to them, and route everything else through your VNet. Stricter environments can turn those categories off and rely on their own name-based firewall rules.

What it doesn't cover yet
VNet integration is in preview, with two limitations to know. It covers outbound traffic only: reaching the agent privately from inside your network isn't part of this preview.
And connector traffic still routes over the public internet; the governance and credential isolation in Connectors V2 still apply. Use VNet integration for outbound control of the agent workspace, and combine it with identity, RBAC, tool permissions, hooks, and connector governance for a complete set of controls.

Where it fits
VNet integration doesn't replace identity, RBAC, tool permissions, or connector governance. It controls where traffic can go. The agent still needs the right identity and permissions to access a resource in the first place.

Identity is the foundation: your RBAC assignments decide what the agent can reach. Permissions and hooks shape what it does within reach: allow/ask/deny rules control what runs, and hooks let you inspect or change a tool call before it runs. VNet integration sits underneath, controlling where traffic can go no matter what the agent tries to do.

You want the agent to be capable. You also want a boundary that holds whether or not it is.

Get started
- Create an SRE Agent - https://aka.ms/sreagent
- Documentation - https://aka.ms/sreagent/newdocs
- Recipes - https://aka.ms/sreagent/recipes
- Build 2026 Announcement - https://aka.ms/Build26/blog/SREAgent
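Two parts of this post lend themselves to a quick pre-flight check: the subnet requirement (/27 or larger, delegated to `Microsoft.App/environments`) and the rule that every outbound call takes exactly one of two paths. The Python sketch below models both under stated assumptions. It is not part of the SRE Agent product or any Azure SDK: the function names and the category labels (`platform`, `package_registries`, and so on) are hypothetical, and the subnet check assumes IPv4 CIDR notation.

```python
import ipaddress

REQUIRED_DELEGATION = "Microsoft.App/environments"
MAX_PREFIX_LENGTH = 27  # "/27 or larger" means a prefix length of 27 or less


def subnet_is_eligible(cidr, delegation):
    """Check the two subnet requirements named in the configuration steps."""
    network = ipaddress.ip_network(cidr, strict=True)
    return network.prefixlen <= MAX_PREFIX_LENGTH and delegation == REQUIRED_DELEGATION


def egress_path(category, bypass_categories):
    """Return which path a call takes in Azure VNet mode (hypothetical labels).

    Platform services always use the managed path; optional categories use it
    only when their toggle is on. Everything else goes through the customer VNet.
    """
    if category == "platform" or category in bypass_categories:
        return "managed"
    return "vnet"


# Hypothetical setup: registries and repositories stay on the bypass.
bypass = {"package_registries", "code_repositories"}
paths = {c: egress_path(c, bypass) for c in
         ["platform", "package_registries", "mcp_servers", "internal_database"]}
```

A /28 subnet fails the check because it is smaller than /27, and in this setup remote MCP servers and internal databases both route through the VNet, where NSG rules and private DNS apply.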
sanchitmehta
Jun 14, 2026 - Apps on Azure Blog
Announcing Microsoft Azure Network Adapter (MANA) support for Existing VM SKUs
As a leader in cloud infrastructure, Microsoft ensures that Azure’s IaaS customers always have access to the latest hardware. Our goal is to consistently deliver technology to support business-critical workloads with world-class efficiency, reliability, and security. Customers benefit from cutting-edge performance enhancements and features, helping them to future-proof their workloads while maintaining business continuity.

Azure will be deploying a new hardware generation to support capacity demands for existing VM Size Families. The hardware is optimized for these VM sizes, utilizing Intel’s Emerald Rapids CPU, native NVMe SSD support for higher storage bandwidth and lower latency, and Microsoft Azure Network Adapter (MANA). Deployment timelines will be communicated via Service Health Advisory updates. The intent is to provide the benefits of new server hardware to customers of existing VM SKUs as they work towards migrating to newer SKUs. The deployments will be based on capacity needs and won’t be restricted by region. Once the hardware is available in a region, VMs can be deployed to it as needed.

Workloads on operating systems which fully support MANA will benefit from sub-second Network Interface Card (NIC) firmware upgrades, higher throughput, lower latency, increased security, and Azure Boost-enabled data path accelerations. If your workload doesn't support MANA today, you'll still be able to access Azure’s network on MANA-enabled SKUs, but performance will be comparable to previous generation (non-MANA) hardware. Check out the Azure Boost Overview and the Microsoft Azure Network Adapter (MANA) overview for more detailed information and OS compatibility.

To determine whether your VMs are impacted and what actions (if any) you should take, start with MANA support for existing VM SKUs.
This article provides additional information about which VM Sizes are eligible to be deployed on the new MANA-enabled hardware, what actions (if any) you should take, and how to determine if the workload has been deployed on MANA-enabled hardware.
ali_sheriff
Jun 12, 2026 - Azure Infrastructure Blog
Azure Monitor SLIs now Generally Available
Azure Monitor SLIs are now generally available
Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in Azure Monitor are now generally available. Teams can now measure reliability based on customer experience, not just infrastructure signals.
- SLI: A quantitative measure of how well an application or service is performing from the customer’s point of view.
- SLO: A defined target for an SLI that represents how good or bad the SLI is over a given time period. This is also referred to as a baseline in Azure Monitor.

Traditional monitoring shows what is happening across your systems, but not always what customers are experiencing. A service can be technically available and still feel unreliable because of latency, partial failures, or dependency issues. SLIs help close that gap by measuring reliability from the customer’s point of view. With GA, Azure Monitor now brings SLI authoring, SLO tracking, error budgets, and burn rate-based alerting into one experience, helping teams focus on whether they are meeting the reliability their customers expect.

What Azure Monitor SLI helps you do
Azure Monitor SLI lets you measure availability and latency with either request-based or window-based evaluation methods. In Azure Monitor, SLIs are defined at the Service Group level, which provides a logical representation of your application across multiple resources. This gives teams a clearer view of application health, customer impact, and the signals that matter most.

SLIs continuously evaluate your service by using existing Azure Monitor metrics and store the resulting evaluations in your Azure Monitor Workspace. Azure Monitor uses these SLI evaluations to power error budgets, burn rate visualization, and alerting. This helps teams spot reliability issues earlier and make better release and incident response decisions.

Get started
To get started, you’ll need:
- A Service Group.
- Application metrics flowing into an Azure Monitor Workspace, for example through Managed Prometheus or OpenTelemetry (see "Collect and analyze OpenTelemetry data with Azure Monitor (Preview)" on Microsoft Learn).

Learn more here.

Summary
Azure Monitor SLI helps teams measure customer experience, track reliability against clear targets, and respond sooner with error budgets and burn rate-based alerting. Learn more in the product documentation and start defining SLIs for your services in Azure Monitor today.
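The arithmetic behind request-based and window-based SLIs, error budgets, and burn rate can be shown in a short Python sketch. These are the commonly used SRE definitions, offered here as an assumption: the post does not spell out the exact formulas Azure Monitor applies, and the function names and sample numbers are illustrative only.

```python
def request_based_sli(good_requests, total_requests):
    """Fraction of requests that were good (for example, successful or fast enough)."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_requests / total_requests


def window_based_sli(window_values, threshold):
    """Fraction of time windows whose value met the threshold (for example, latency in ms)."""
    if not window_values:
        return 1.0  # assumption: no windows counts as meeting the objective
    good_windows = sum(1 for value in window_values if value <= threshold)
    return good_windows / len(window_values)


def error_budget(slo_target):
    """Share of requests (or windows) allowed to be bad while still meeting the SLO."""
    return 1.0 - slo_target


def burn_rate(sli, slo_target):
    """How fast the error budget is being consumed: 1.0 means exactly on budget."""
    return (1.0 - sli) / error_budget(slo_target)


# Illustrative numbers: 9,990 good requests out of 10,000 against a 99.5% target.
availability = request_based_sli(9_990, 10_000)
rate = burn_rate(availability, slo_target=0.995)
latency_sli = window_based_sli([120, 180, 450, 90], threshold=200)
```

In this example the service is at 99.9% availability against a 99.5% target, so it is burning its error budget at roughly one fifth of the sustainable rate, and three of the four latency windows met the 200 ms threshold.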
Sokuma
Jun 11, 2026 - Azure Observability Blog
Azure Monitor Metrics Export Generally Available
Today, we’re excited to announce the general availability of Azure Monitor Metrics Export using data collection rules (DCRs). It is a scalable, flexible way to continuously export platform metrics with dimensional fidelity, lower latency, and more control over what you send downstream.

Azure Monitor Metrics Export is configured through data collection rules and can route platform metrics to Azure Storage accounts, Azure Event Hubs, or Azure Log Analytics workspaces. Compared to diagnostic settings, DCR-based metrics export supports multidimensional metrics, metric-name filtering, and improved scalability for large environments.

Here are some of the key benefits of Azure Monitor Metrics Export:
- Control what you export: You can export all supported metrics for a resource type or filter to specific metric names, helping reduce downstream volume and manage cost.
- Preserve dimensional fidelity: DCR-based metrics export supports multidimensional metrics, making downstream analysis and correlation more meaningful.
- Get faster export latency: End-to-end export latency is typically within about 3 minutes, improving time to insight for operational and analytics workflows.

With Azure Monitor Metrics Export, organizations can build more scalable observability pipelines, route metrics to the destinations that fit their architecture, and unlock richer analysis for operations, reporting, and integration scenarios.

What’s new in GA
With general availability, Azure Monitor Metrics Export offers a production-ready path to continuously stream supported platform metrics using data collection rules. Azure Monitor Metrics Export now covers 44 Azure regions, up from 12 regions previously. This expanded footprint helps more customers adopt DCR-based metrics export closer to where their resources run, improving rollout flexibility for global deployments.
Customers can export metrics to Azure Storage, Azure Event Hubs, or Azure Log Analytics, preserve metric dimensions, and filter by metric name to better control downstream volume and cost. Learn more about metrics export using data collection rules. We’re excited to make Azure Monitor Metrics Export generally available and look forward to seeing how customers use it to build more reliable, cost-conscious, and extensible monitoring solutions on Azure.
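As a rough picture of what metric-name filtering with preserved dimensions means downstream, the Python sketch below filters a batch of records by name while leaving each record's dimensions untouched. It is a conceptual illustration only: the record shape, the metric and dimension names, and the `select_for_export` helper are hypothetical and do not represent the DCR schema or the exported data format.

```python
# Hypothetical platform metric records; names and dimensions are illustrative.
records = [
    {"name": "Transactions", "value": 120, "dimensions": {"ApiName": "GetBlob", "ResponseType": "Success"}},
    {"name": "Transactions", "value": 3, "dimensions": {"ApiName": "PutBlob", "ResponseType": "ServerError"}},
    {"name": "Egress", "value": 5_242_880, "dimensions": {"ApiName": "GetBlob"}},
    {"name": "Availability", "value": 99.9, "dimensions": {}},
]


def select_for_export(records, metric_names=None):
    """Keep all records, or only those whose metric name is in the filter.

    Records are passed through unchanged, so every dimension survives the filter.
    """
    if metric_names is None:
        return list(records)
    wanted = set(metric_names)
    return [record for record in records if record["name"] in wanted]


exported = select_for_export(records, metric_names=["Transactions"])
```

Filtering to one metric name cuts the batch from four records to two, and each remaining record still carries its per-API and per-response dimensions for downstream analysis.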
Sokuma
Jun 11, 2026 - Azure Observability Blog
Private Plugins with Azure SRE Agent
SREs and platform teams are building operational skills specific to their infrastructure: investigation runbooks, compliance checks, cost analysis playbooks, deployment verification procedures. The next step is making that work reusable across every agent in the organization without exposing it publicly.

Today, SRE Agent supports plugin marketplaces hosted in private GitHub repositories, including GitHub Enterprise. This is part of the Azure SRE Agent announcements at Build 2026. You can now point SRE Agent at a private repo when adding a marketplace or installing a plugin. Authentication is handled per marketplace and supports OAuth, GitHub PATs, and GitHub Apps for GHE tenants.

From one agent to an organization’s plugin catalog
Most teams start with a single SRE Agent connected to their services. The agent learns their infrastructure, runs their runbooks, and handles their incidents. It works well.

Then adoption grows. A second team stands up their own agent. Then a third. Platform engineering wants every agent to run the same compliance checks. Security needs approval hooks enforced consistently. FinOps has cost governance skills that should be standard across the organization. Suddenly the question isn’t “how do I set up my agent,” it’s “how do we share operational knowledge across all of them.”

Without a distribution model, teams end up copying skill files between agents manually. A platform team writes a runbook, shares it over email or a wiki link, and each service team pastes it into their agent individually. When the runbook improves, some agents get updated, some don’t. There’s no version tracking, no central catalog, and no way to know which agent is running which version of which skill. Private marketplace support solves this.

How private plugin marketplaces meet enterprise needs
A platform team publishes once, every agent installs. Codify best practices as plugins in a private GitHub repo.
Service teams add that repo as a marketplace in their agents and install what they need. Compliance checks, cost governance thresholds, incident playbooks, and deployment verification procedures are all distributed through versioned plugins.

Each team retains ownership. Security controls which plugins enforce approval hooks. FinOps locks cost thresholds into parameter values. Platform engineering governs infrastructure investigation patterns. The marketplace is the distribution layer for organizational standards.

Versions are pinned, updates are explicit. Each installation locks to the commit at install time. A merged PR upstream does not change any agent’s behavior. Teams promote new versions on their own schedule: validate in dev, promote to staging, then production. Different agents can run different versions simultaneously.

Reuse across environments and tools. The same plugin works across dev, staging, and production agents, and can be reused by local coding agents and other services that support plugins. One source of truth, not separate copies per environment.

Accessing Private Plugin marketplaces
Private repo support adds authentication to the SRE Agent's plugin workflow so your agent can clone and install from repos that require credentials. Authentication is configured once per marketplace. Every plugin within it inherits the credentials.

| Auth method | When to use | Setup |
| --- | --- | --- |
| OAuth | github.com repos your agent can already access | Uses your existing GitHub connection. One click. |
| Personal access token | Private repos in other orgs on github.com | Per-marketplace PAT. Scoped to just that marketplace. |
| GitHub App | GitHub Enterprise (*.ghe.com) | BYO App with private key in Azure Key Vault. Short-lived tokens minted at runtime. |

Getting started
In SRE Agent, navigate to Builder > Plugins, then click Add Marketplace and enter the URL of the private marketplace you want to connect to. Then click Connect to GitHub to complete the OAuth sign-in.
Click Add and you will see the plugins available from your connected marketplace. Click a plugin to open its detail view, where you can browse the skills packaged with it, then click Install to install the plugin. You can then see the skills imported from plugins under Capabilities > Skills > Custom Skills.

The bottom line
Private repo support turns the Plugin Marketplace from a public skill catalog into your organization’s internal distribution platform for operational automation. Your team writes the plugins. Your agents install them. Your GitHub permissions control who has access.

Try it yourself: create a private repo with a marketplace.json and a few skills, add it as a marketplace in your agent, and install a plugin.

Resources
- SRE Agent documentation - https://aka.ms/sreagent/newdocs
- SRE Agent overview - https://aka.ms/sreagent/newdocsoverview
- Plugin Marketplace capability page - https://aka.ms/sreagent/newdocs/capabilities/plugin-marketplace
- Build 2026 SRE Agent announcements - https://aka.ms/Build26/blog/SREAgent
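The guidance on which authentication method fits which kind of repository can be expressed as a small decision function. The Python sketch below is a reading aid under stated assumptions, not SRE Agent code: `pick_auth_method`, its return labels, and the `agent_has_access` flag are hypothetical, and the example URLs are placeholders.

```python
from urllib.parse import urlparse


def pick_auth_method(marketplace_url, agent_has_access=False):
    """Suggest an auth method for a private marketplace URL (hypothetical helper).

    - *.ghe.com hosts              -> GitHub App (private key kept in Azure Key Vault)
    - github.com, already reachable -> OAuth via the existing GitHub connection
    - github.com, other orgs        -> per-marketplace personal access token
    """
    host = (urlparse(marketplace_url).hostname or "").lower()
    if host.endswith(".ghe.com"):
        return "github_app"
    if host == "github.com":
        return "oauth" if agent_has_access else "personal_access_token"
    raise ValueError(f"Unsupported marketplace host: {host!r}")


# Placeholder URLs for illustration.
own_org = pick_auth_method("https://github.com/contoso/sre-plugins", agent_has_access=True)
other_org = pick_auth_method("https://github.com/fabrikam/sre-plugins")
enterprise = pick_auth_method("https://contoso.ghe.com/platform/sre-plugins")
```

Because authentication is configured once per marketplace, a check like this only needs to run when a marketplace is added, not for each plugin installed from it.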
ebencarek
Jun 11, 2026 - Apps on Azure Blog
Shaping what Azure SRE Agent does: Tool Permissions and Hooks
When an AI agent runs against production, the first question every security team asks is "What can it do, who decided it could, and what stops it from doing something it should not?"

Azure SRE Agent reached general availability in March. Since then, teams inside Microsoft and customers running it against real production workloads have asked for the same thing: finer-grained controls over what the agent can do on its own, and a clear answer to who governs each call that reaches a tool. Today at Build 2026, we are releasing global tool access policies as one of a set of new governance controls. This post covers how they work.

Tool access policies give security and platform teams a single place to define which tools the agent can invoke, under what conditions, and what requires human approval before it runs. Underneath those policies sits the identity the agent runs as, the bedrock that every other control layer depends on. It is defense in depth applied to agent behavior: layers of control, each one holding on its own, so that governing the agent is something you can read, audit, and reason about as you scale it across production.

Identity is the bedrock: managed identity today, agent identity next
Start here, because nothing else matters if you skip it. The identity the SRE Agent runs as, and the Azure RBAC role assignments on that identity, are the most powerful boundary the agent works inside of. If your role assignments do not grant the agent access to a resource, none of the controls below come into play, because the agent cannot reach the resource to begin with. Network rules, tool permissions, hooks, and connector contracts all sit on top of an RBAC story that you write. The features in this post add layers above that floor. They do not replace it.

Today the SRE Agent operates as a managed identity, and your RBAC role assignments on that identity govern what it can do. This is the bedrock, and it is the same model your other Azure workloads already use.
You assign roles, you scope them, and the agent inherits exactly what you granted and nothing more. Everything that follows assumes the bedrock is in place. With identity settled, the next question is the obvious one: where is the agent allowed to send its traffic?

Permissions: govern what the agent does with a tool

Identity decides what the agent can reach. Permissions decide what the agent does with the access it has, down to the individual tool. Two levels cover the range: a point-and-click grid for the common cases, and hooks when a decision needs your own code.

The grid is the easy mode. Every tool the agent can use, built-in tools along with MCP servers, services, and custom tools, shows up in one searchable list with two switches. On/Off sets whether the tool is available at all; turn it off and the agent cannot use it. Allow/Ask sets what happens when it is on: Allow lets the agent run the tool automatically, Ask requires a human to approve every time, except in Autonomous mode. Select tools in bulk to flip a whole category at once, filter by category or permission, and use the Advanced permissions tab when you want rules that apply at global, per-agent, or per-thread scope instead of tool by tool. Defaults stay put until you touch them, and the engine is fail-closed: if a rule cannot be evaluated, the call is blocked rather than allowed. That covers most of what teams need.

Underneath those switches are three rules, allow, ask, and deny, and the Advanced tab is where you set them by scope. Global rules apply to every agent and thread, Agent rules to one custom agent, Thread rules to a single conversation. Deny is the hard one: it blocks the tool outright no matter the run mode, and a deny at a higher scope always wins, so an Allow at thread scope cannot reopen something denied globally. That split is deliberate.
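The scope precedence described here can be sketched in a few lines of Python. This is only an illustration and not the product's engine: the `evaluate` function, the rule layout, the "strictest matching rule wins" reading, and the `ask` fallback for unmatched calls are all assumptions made for the sketch, and shell-style wildcard matching stands in for whatever canonical matching the service actually performs.

```python
from fnmatch import fnmatchcase

# Assumed ordering for this sketch: deny beats ask, ask beats allow.
STRICTNESS = {"allow": 0, "ask": 1, "deny": 2}
SCOPES = ("global", "agent", "thread")


def evaluate(invocation, rules, default="ask"):
    """Return "allow", "ask", or "deny" for one tool invocation.

    rules maps a scope name to a list of (effect, pattern) pairs.
    The default for unmatched calls is a hypothetical choice.
    """
    try:
        matched = [
            effect
            for scope in SCOPES
            for effect, pattern in rules.get(scope, [])
            if fnmatchcase(invocation, pattern)
        ]
        if not matched:
            return default
        # Strictest match wins, so a lower-scope allow cannot reopen a deny.
        return max(matched, key=STRICTNESS.__getitem__)
    except Exception:
        # Fail closed: a rule set that cannot be evaluated blocks the call.
        return "deny"


RULES = {
    "global": [
        ("deny", "bash(az * delete *)"),
        ("deny", "bash(kubectl delete *)"),
        ("ask", "bash(az webapp restart *)"),
        ("allow", "bash(az monitor *)"),
    ],
    "agent": [("allow", "bash(kubectl get *)")],
    "thread": [("allow", "bash(kubectl delete *)")],  # tries to reopen a deny
}
```

With these rules, the thread-scope allow for `kubectl delete` has no effect because the global deny also matches, while `kubectl get` passes on the agent-scope allow alone.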
A platform team sets the Global guardrails that should never be crossed and the Asks that always need a human, and service teams add their own Allow rules at Agent scope for routine work, without being able to override the guardrails above them.

    Platform team, Global scope:
      deny:  bash(az * delete *)         - never delete, on any agent or thread
      deny:  bash(kubectl delete *)
      ask:   bash(az webapp restart *)   - always confirm, even in Autonomous
      allow: bash(az monitor *)          - auto-approve monitoring queries

    Service team, Agent scope:
      allow: bash(kubectl get *)         - routine read-only work
      allow: bash(kubectl describe *)

Two details make this safe to lean on. Rules match the canonicalized tool invocation rather than the raw text, so enforcement holds no matter how the command was assembled. And fail-closed has a softer edge than a hard stop: a cached last-known-good policy covers transient failures, so a blip in the policy store blocks the call rather than silently widening access.

You can find these under Capabilities > Tools missions.

The layer worth spending time on is hooks. Allow and Ask answer "should this tool run." Hooks answer "should this specific call run, given exactly what it is about to do." A hook fires before the agent runs a tool and receives the actual call, parameters and all. Your code then decides the outcome and can reshape it: rewrite parameters before they are sent, inject extra context into the pipeline as a user message so the agent reconsiders before its next step, block the call outright, or redirect the agent toward a safer path. Because your code sees the real parameters, the decision can depend on anything you can express in code: which resource the call targets, whether a value falls outside an allowed range, the time of day, the result of an external policy lookup. This is where you write the rule the grid cannot.

Two kinds of hook, mixable on the same agent. Command hooks are a script you write; reach for these when code is enough.
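As a rough illustration of a rule the grid cannot express, here is a minimal sketch of a command hook written in Python that blocks a call based on one of its parameters. It mirrors the `tool_name` field and the `{"decision": "block", "reason": ...}` reply used in this post's Bash example. Everything else is an assumption for illustration: the `tool_input` field name, the `resource` parameter, the `prod` marker, and the choice to block when the input cannot be parsed.

```python
#!/usr/bin/env python3
"""Sketch of a Pre Tool Use command hook; field names are partly assumed."""
import json
import sys

GUARDED_TOOL = "ExampleToolName"   # placeholder tool name
PROTECTED_MARKER = "prod"          # hypothetical naming convention


def decide(call):
    """Return a block decision dict, or None to allow the call."""
    tool = call.get("tool_name", "")
    # ASSUMPTION: parameters arrive under "tool_input"; the post does not
    # name this field, so check the real payload before relying on it.
    params = call.get("tool_input") or {}
    target = str(params.get("resource", ""))  # "resource" is made up
    if tool == GUARDED_TOOL and PROTECTED_MARKER in target:
        return {
            "decision": "block",
            "reason": f"{tool} may not target {target!r}; use a non-production resource.",
        }
    return None


def main():
    try:
        call = json.load(sys.stdin)
    except json.JSONDecodeError:
        # Sketch choice: treat unreadable input as a reason to block.
        print(json.dumps({"decision": "block", "reason": "Hook could not parse the call."}))
        return 0
    decision = decide(call)
    if decision is not None:
        print(json.dumps(decision))
    # Printing nothing and exiting zero allows the call.
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Keeping the decision in a pure function such as `decide` makes the policy easy to unit test without piping JSON through stdin.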
Prompt hooks put a separate LLM in the loop as a judge that evaluates the call in context; reach for these when the decision needs reasoning rather than a fixed rule.

A real example from our own internal test agent: when the agent tries to list files through the shell with ls or dir, a hook blocks the call. The agent absorbs the signal, reconsiders, and reaches for the ListDir tool instead. The hook did not argue with a human. It shaped what happened next. As with the grid, configure nothing and the agent behaves exactly as it does today. Both are additive.

Authoring one is a short form. You name the hook, pick the event (Pre Tool Use, so it runs before the call), and set a tool matcher, either picked from the tool menu or written as a regex like (FetchWebpage|SearchMemory) with anchors and lookaheads when you need them, so the hook fires only on the calls you care about. You set a timeout and a fail mode (Block, so a hook that errors or hangs stops the call rather than waving it through), and you write the body in Bash or Python.

A command hook reads the call as JSON on stdin (the event name, the tool name, its parameters, and the call id) and answers on stdout. Print nothing and exit zero to allow. Return a block decision with a reason to stop the call, and that reason is what the agent reads back. You can also substitute: run a cheaper or safer version yourself, block the real call, and hand your own output back as the result, so the agent never runs the expensive or risky original.

    #!/bin/bash
    input=$(cat)
    tool=$(echo "$input" | jq -r '.tool_name')

    # Block one tool, with a reason the agent will read
    if [ "$tool" = "ExampleToolName" ]; then
      echo '{"decision":"block","reason":"Blocked ExampleToolName by hook policy."}'
      exit 0
    fi

    # Otherwise allow: print nothing and exit 0
    exit 0

You can find these under Builder > Hooks.

Each layer holds on its own

The layers stack. Identity is the floor: your RBAC assignments decide what the agent can reach at all.
Permissions, the grid and hooks together, decide what it does with a tool. You author each layer, each one holds whether or not the layer above it behaves as expected, and all of it configures through the same ARM and Bicep surface your platform team already uses, reproducible the way the rest of your Azure estate is.

The upgrade path is additive and non-breaking. Existing agents keep working. Turn on each control when you are ready, in the order your governance requires.

There is more coming. We run Azure SRE Agent inside Microsoft on our own production workloads, so we feel the same gaps you do, and the next round is shaped by what we hear from teams running it in production today. Which control is doing the most for you, and which one are you still waiting on? Let us know and thank you!

Getting started

Create new SRE Agent - https://aka.ms/sreagent
SRE Agent Documentation - https://aka.ms/sreagent/newdocs
SRE Agent recipes - https://aka.ms/sreagent/recipes
Build 2026 Announcement - https://aka.ms/Build26/blog/SREAgent
Dalibor_Kovacevic
Jun 10, 2026, Apps on Azure Blog
Azure Monitor Copilot Observability Agent: What’s new at Build
The Observability agent in Azure Copilot is an AI-powered assistant built into Azure Monitor that helps engineers investigate issues and explore their systems using natural language. By grounding its analysis in telemetry data such as metrics, logs, and traces, it supports both open-ended exploration and guided troubleshooting. For more details, see the documentation.

Since our initial public preview, the Observability agent in Azure Copilot has continued to evolve with new capabilities and expanded coverage (you can read more about the initial release in our previous blog). At Build 2026, we’re introducing updates that expand the Observability agent’s capabilities and the range of scenarios it can support. These updates provide deeper analysis and more detailed responses for both exploration and investigation.

Expanded Investigation Scenarios

The Observability agent now supports a broader set of scenarios across applications and infrastructure. These can be accessed directly from relevant product experiences, without requiring a prior alert, allowing teams to explore data conversationally and initiate deeper investigations as signals emerge.

Integration with Microsoft Foundry AI Agents

The Observability agent integrates with Microsoft Foundry AI Agents, enabling correlation of signals across key generative AI and agent observability scenarios such as latency spikes, error patterns, and tool invocation failures. Teams can interact with the Observability agent either from alerts - including alerts based on Foundry telemetry - or directly within Application Insights, where the Agents details experience serves as the primary entry point. From there, users can use the Observability agent to diagnose errors, analyze trends, and explore their data across one or multiple agents.
Application Insights integration

The Observability agent enables investigation of failure scenarios directly from the Application Insights Failures blade, allowing teams to analyze application-level issues and move from symptom to root cause.

Azure Kubernetes Service (AKS) integration

The Observability agent enables deep investigation of issues in Azure Kubernetes Service (AKS) clusters. AKS investigations correlate signals from Azure Monitor with Kubernetes logs and events, and (coming soon) Prometheus metrics stored in an Azure Monitor Workspace. Together, these signals enable full-stack analysis of applications running on AKS. The Observability agent helps teams determine whether an issue originates from the application or from the underlying Kubernetes platform, reducing time to diagnosis and resolution.

Activity Logs integration

Investigations can be initiated based on Azure Resource Health events surfaced in Activity Logs, enabling analysis of service-impacting signals related to the Azure platform.

Deeper Insights across systems

Multiple Application Insights - Coming soon!

The Observability agent supports investigations that can span multiple Application Insights resources, enabling scenarios that involve multiple services within distributed applications. The agent can guide users to expand the investigation scope when cross-service issues are detected.

Integration with Azure Service Health

The Observability agent correlates investigation context with Azure Service Health events, helping teams understand potential platform impact as part of their investigation. This helps distinguish application-level issues from broader Azure platform conditions and prioritize active impacts.
Issue management Enhancements

Viewing issues

Issues can now be viewed in multiple places, depending on the required scope:
Azure Monitor: showing issues across all Azure Monitor Workspaces (AMWs) under the selected subscriptions
Azure Monitor Workspace: showing issues stored within a specific AMW

Issue actions & notifications

Issue actions trigger notifications when issues are created or updated, enabling integration with workflows such as email, webhooks, and automation.

Sharing and follow-up

You can now download investigation results as a PDF, including supported data, enabling teams to capture and share investigation context for incident reviews and reporting.

Coming Soon

Billing for the Observability agent starts on July 1, 2026. The agent uses a consumption-based pricing model, so customers pay only for the AI work the agent performs. Agent consumption is measured in Azure Agent Credit (AAC) units, which reflect how many LLM tokens the agent used. For more details, see the documentation.

Stay connected

Follow this blog for ongoing updates and deeper dives into new capabilities
Join our upcoming webinar for real-world scenarios, best practices, and a look at what’s coming next 👉 Register here

We’d love your feedback

The Observability Agent continues to evolve based on real-world usage and customer feedback. Share feedback through the Give Feedback option in the product or contact us at: azureobsagent@microsoft.com

Want to learn more?

Read our previous blog posts:
Public Preview Update: Azure Copilot Observability Agent | Microsoft Community Hub
The Azure Copilot Observability Agent Chat - Stop Writing Queries, Start Asking Questions. | Microsoft Community Hub

Explore our documentation:
Azure Copilot observability agent (preview) - Azure Monitor | Microsoft Learn
EfratNauerman
Jun 09, 2026, Azure Observability Blog