Migrating to the next generation of Virtual Nodes on Azure Container Instances (ACI)
What is ACI/Virtual Nodes?

Azure Container Instances (ACI) is a fully managed, serverless container platform which gives you the ability to run containers on demand without provisioning infrastructure. Virtual Nodes on ACI allows you to run Kubernetes pods managed by an AKS cluster in a serverless way on ACI instead of traditional VM‑backed node pools. From a developer’s perspective, Virtual Nodes look just like regular Kubernetes nodes, but under the hood the pods are executed on ACI’s serverless infrastructure, enabling fast scale‑out without waiting for new VMs to be provisioned. This makes Virtual Nodes ideal for bursty, unpredictable, or short‑lived workloads where speed and cost efficiency matter more than long‑running capacity planning.

Introducing the next generation of Virtual Nodes on ACI

The newer Virtual Nodes v2 implementation modernises this capability by removing many of the limitations of the original AKS managed add‑on and delivering a more Kubernetes‑native, flexible, and scalable experience when bursting workloads from AKS to ACI. In this article I will demonstrate how you can migrate an existing AKS cluster using the Virtual Nodes managed add-on (legacy) to the new generation of Virtual Nodes on ACI, which is deployed and managed via Helm.

More information about Virtual Nodes on Azure Container Instances can be found here, and the GitHub repo is available here. Advanced documentation for Virtual Nodes on ACI is also available here, and includes topics such as node customisation, release notes and a troubleshooting guide.

Please note that all code samples within this guide are examples only, and are provided without warranty/support.
Background

Virtual Nodes on ACI is rebuilt from the ground up, and includes several fixes and enhancements, for instance:

Added support/features
- VNet peering, and outbound traffic to the internet with network security groups
- Init containers
- Host aliases
- Arguments for exec in ACI
- Persistent Volumes and Persistent Volume Claims
- Container hooks
- Confidential containers (see the supported regions list here)
- ACI standby pools
- Support for image pulling via Private Link and Managed Identity (MSI)

Planned future enhancements
- Kubernetes network policies
- Support for IPv6
- Windows containers
- Port forwarding

Note: the new generation of the add-on is managed via Helm rather than as an AKS managed add-on.

Requirements & limitations
- Each Virtual Nodes on ACI deployment requires 3 vCPUs and 12 GiB of memory on one of the AKS cluster’s VMs
- Each Virtual Nodes node supports up to 200 pods
- DaemonSets are not supported
- Virtual Nodes on ACI requires AKS clusters with Azure CNI networking (kubenet is not supported, nor is overlay networking)

Migrating to the next generation of Virtual Nodes on Azure Container Instances via Helm chart

For this walkthrough, I'm using Bash via Windows Subsystem for Linux (WSL), along with the Azure CLI. Direct migration is not supported, and therefore the steps below show an example of removing the Virtual Nodes managed add-on and its resources, and then installing the Virtual Nodes on ACI Helm chart.

In this walkthrough I will explain how to delete and re-create the Virtual Nodes subnet; however, if you need to preserve the VNet and/or use a custom subnet name, refer to the Helm customisation steps here. Be sure to use a new subnet CIDR within the VNet address space which doesn't overlap with other subnets, nor with the AKS CIDRs for nodes/pods and ClusterIP services.

To minimise disruption, we'll first install the Virtual Nodes on ACI Helm chart before removing the legacy managed add-on and its resources.
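Before creating the new subnet, you can sanity-check that a candidate CIDR doesn't overlap your existing ranges using Python's standard-library ipaddress module. This is just a convenience sketch; the CIDR values below are placeholders, not your actual address space:

```python
import ipaddress

# Placeholder examples: AKS node subnet, pod CIDR, and ClusterIP service CIDR
in_use = ["10.240.0.0/16", "10.0.0.0/16", "172.16.0.0/16"]

# Proposed Virtual Nodes subnet (placeholder)
candidate = ipaddress.ip_network("10.241.0.0/24")

# Report any overlap between the candidate and the ranges already in use
conflicts = [c for c in in_use if candidate.overlaps(ipaddress.ip_network(c))]
if conflicts:
    print(f"{candidate} overlaps with: {conflicts}")
else:
    print(f"{candidate} is free of overlaps")
```

Substitute your VNet's real subnet prefixes and the cluster's pod/service CIDRs before relying on the result.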
Prerequisites
- A recent version of the Azure CLI
- An Azure subscription with sufficient ACI quota for your selected region
- Helm

Deployment steps

Initialise environment variables:

```bash
location=northeurope
rg=rg-virtualnode-demo
vnetName=vnet-virtualnode-demo
clusterName=aks-virtualnode-demo
aksSubnetName=subnet-aks
vnSubnetName=subnet-vn
```

Create the new Virtual Nodes on ACI subnet with the specific name value of cg (a custom subnet can be used by following the steps here):

```bash
vnSubnetId=$(az network vnet subnet create \
  --resource-group $rg \
  --vnet-name $vnetName \
  --name cg \
  --address-prefixes <your subnet CIDR> \
  --delegations Microsoft.ContainerInstance/containerGroups \
  --query id -o tsv)
```

Assign the cluster's -kubelet identity Contributor access to the infrastructure resource group, and Network Contributor access to the ACI subnet:

```bash
nodeRg=$(az aks show --resource-group $rg --name $clusterName --query nodeResourceGroup -o tsv)
nodeRgId=$(az group show -n $nodeRg --query id -o tsv)
agentPoolIdentityId=$(az aks show --resource-group $rg --name $clusterName --query "identityProfile.kubeletidentity.resourceId" -o tsv)
agentPoolIdentityObjectId=$(az identity show --ids $agentPoolIdentityId --query principalId -o tsv)

az role assignment create \
  --assignee-object-id "$agentPoolIdentityObjectId" \
  --assignee-principal-type ServicePrincipal \
  --role "Contributor" \
  --scope "$nodeRgId"

az role assignment create \
  --assignee-object-id "$agentPoolIdentityObjectId" \
  --assignee-principal-type ServicePrincipal \
  --role "Network Contributor" \
  --scope "$vnSubnetId"
```

Download the cluster's kubeconfig file:

```bash
az aks get-credentials -n $clusterName -g $rg
```

Clone the virtualnodesOnAzureContainerInstances GitHub repo:

```bash
git clone https://github.com/microsoft/virtualnodesOnAzureContainerInstances.git
```

Install the Virtual Nodes on ACI Helm chart:

```bash
helm install <yourReleaseName> <GitRepoRoot>/Helm/virtualnode
```

Confirm the Virtual Nodes node shows within the cluster and is in a Ready state
(virtualnode-n):

```
$ kubectl get node
NAME                                STATUS   ROLES    AGE     VERSION
aks-nodepool1-35702456-vmss000000   Ready    <none>   4h13m   v1.33.6
aks-nodepool1-35702456-vmss000001   Ready    <none>   4h13m   v1.33.6
virtualnode-0                       Ready    <none>   162m    v1.33.7
```

Scale down any running Virtual Nodes workloads (example below):

```bash
kubectl scale deploy <deploymentName> -n <namespace> --replicas=0
```

Cordon and drain the legacy Virtual Nodes node (kubectl drain cordons the node first):

```bash
kubectl drain virtual-node-aci-linux
```

Disable the Virtual Nodes managed add-on (legacy):

```bash
az aks disable-addons --resource-group $rg --name $clusterName --addons virtual-node
```

Export a backup of the original subnet configuration:

```bash
az network vnet subnet show --resource-group $rg --vnet-name $vnetName --name $vnSubnetName > subnetConfigOriginal.json
```

Delete the original subnet (subnets cannot be renamed and therefore must be re-created):

```bash
az network vnet subnet delete -g $rg -n $vnSubnetName --vnet-name $vnetName
```

Delete the previous (legacy) Virtual Nodes node from the cluster:

```bash
kubectl delete node virtual-node-aci-linux
```

Test and confirm pod scheduling on the Virtual Node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - 'counter=1; while true; do echo "Hello, World! Counter: $counter"; counter=$((counter+1)); sleep 1; done'
    image: mcr.microsoft.com/azure-cli
    name: hello-world-counter
    resources:
      limits:
        cpu: 2250m
        memory: 2256Mi
      requests:
        cpu: 100m
        memory: 128Mi
  nodeSelector:
    virtualization: virtualnode2
  tolerations:
  - effect: NoSchedule
    key: virtual-kubelet.io/provider
    operator: Exists
```

If the pod successfully starts on the Virtual Node, you should see output similar to the below:

```
$ kubectl get pod -o wide demo-pod
NAME       READY   STATUS    RESTARTS   AGE   IP           NODE                   NOMINATED NODE   READINESS GATES
demo-pod   1/1     Running   0          95s   10.241.0.4   vnode2-virtualnode-0   <none>           <none>
```

Finally, modify the nodeSelector and tolerations properties of your Virtual Nodes workloads to match the requirements of Virtual Nodes on ACI (see details below).

Modify your deployments to run on Virtual Nodes on ACI

For the Virtual Nodes managed add-on (legacy), the following nodeSelector and tolerations are used to run pods on Virtual Nodes:

```yaml
nodeSelector:
  kubernetes.io/role: agent
  kubernetes.io/os: linux
  type: virtual-kubelet
tolerations:
- key: virtual-kubelet.io/provider
  operator: Exists
- key: azure.com/aci
  effect: NoSchedule
```

For Virtual Nodes on ACI, the nodeSelector/tolerations are slightly different:

```yaml
nodeSelector:
  virtualization: virtualnode2
tolerations:
- effect: NoSchedule
  key: virtual-kubelet.io/provider
  operator: Exists
```

Troubleshooting

Check that the virtual-node-admission-controller and virtualnode-n pods are running within the vn2 namespace:

```
$ kubectl get pod -n vn2
NAME                                                 READY   STATUS    RESTARTS        AGE
virtual-node-admission-controller-54cb7568f5-b7hnr   1/1     Running   1 (5h21m ago)   5h21m
virtualnode-0                                        6/6     Running   6 (4h48m ago)   4h51m
```

If these pods are in a Pending state, your node pool(s) may not have enough resources available to schedule them (use kubectl describe pod to validate).
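If you have many manifests to update, the nodeSelector/tolerations migration described above is easy to script. A minimal sketch, operating on an already-parsed manifest dict (migrate_pod_spec is a hypothetical helper; YAML load/dump, e.g. with PyYAML, is omitted):

```python
def migrate_pod_spec(spec):
    """Replace legacy Virtual Nodes scheduling fields with the Virtual Nodes on ACI (v2) equivalents."""
    spec = dict(spec)  # shallow copy; leave the original untouched
    spec["nodeSelector"] = {"virtualization": "virtualnode2"}
    spec["tolerations"] = [{
        "key": "virtual-kubelet.io/provider",
        "operator": "Exists",
        "effect": "NoSchedule",
    }]
    return spec

# Example legacy pod spec (fields as used by the managed add-on)
legacy = {
    "nodeSelector": {
        "kubernetes.io/role": "agent",
        "kubernetes.io/os": "linux",
        "type": "virtual-kubelet",
    },
    "tolerations": [
        {"key": "virtual-kubelet.io/provider", "operator": "Exists"},
        {"key": "azure.com/aci", "effect": "NoSchedule"},
    ],
    "containers": [{"name": "hello-world-counter", "image": "mcr.microsoft.com/azure-cli"}],
}

migrated = migrate_pod_spec(legacy)
print(migrated["nodeSelector"])
```

Validate the rewritten manifests in a non-production cluster before rolling them out.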
If the virtualnode-n pod is crashing, check the logs of the proxycri container to see whether there are any Managed Identity permissions issues (the cluster's -agentpool MSI needs to have Contributor access on the infrastructure resource group):

```bash
kubectl logs -n vn2 virtualnode-0 -c proxycri
```

Further troubleshooting guidance is available within the official documentation.

Support

If you have issues deploying or using Virtual Nodes on ACI, add a GitHub issue here.

Security Review for Microsoft Edge version 146
We have reviewed the new settings in Microsoft Edge version 146 and determined that there are no additional security settings that require enforcement. The Microsoft Edge version 139 security baseline continues to be our recommended configuration, which can be downloaded from the Microsoft Security Compliance Toolkit.

Microsoft Edge version 146 introduced 9 new Computer and User settings; we have included a spreadsheet listing the new settings to make it easier for you to find them. As a friendly reminder, all available settings for Microsoft Edge are documented here, and all available settings for Microsoft Edge Update are documented here.

Please continue to give us feedback through the Security Baselines Discussion site or this post.

Announcing Microsoft Azure Network Adapter (MANA) support for Existing VM SKUs
As a leader in cloud infrastructure, Microsoft ensures that Azure’s IaaS customers always have access to the latest hardware. Our goal is to consistently deliver technology to support business-critical workloads with world-class efficiency, reliability, and security. Customers benefit from cutting-edge performance enhancements and features, helping them to future-proof their workloads while maintaining business continuity.

In April 2026, Azure will be deploying the Microsoft Azure Network Adapter (MANA) for existing VM size families. The intent is to provide the benefits of new server hardware to customers of existing VM SKUs as they work towards migrating to newer SKUs. The deployments will be based on capacity needs and won’t be restricted by region. Once the hardware is available in a region, VMs can be deployed to it as needed.

Workloads on operating systems which fully support MANA will benefit from sub-second Network Interface Card (NIC) firmware upgrades, higher throughput, lower latency, increased security, and Azure Boost-enabled data path accelerations. If your workload doesn't support MANA today, you'll still be able to access Azure’s network on MANA-enabled SKUs, but performance will be comparable to previous-generation (non-MANA) hardware.

Check out the Azure Boost overview and the Microsoft Azure Network Adapter (MANA) overview for more detailed information and OS compatibility. To determine whether your VMs are impacted and what actions (if any) you should take, start with MANA support for existing VM SKUs. This article provides additional information about which VM sizes are eligible to be deployed on the new MANA-enabled hardware, what actions (if any) you should take, and how to determine if the workload has been deployed on MANA-enabled hardware.

The Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself.

Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis.

First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives.

The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline.

That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible.

Where we started

In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction.
But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start.

We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived.

We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale.

The Inversion: Three bets

The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself.

The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance.
The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last.

Bet 1: The Filesystem as the Agent's World

Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint.

This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents.

But interface design is only half the story. The other half is what you put inside it.

Code Repositories: the highest-leverage context

Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it.
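The "repo is the schema" idea boils down to something any developer does by hand: given a log message, grep the source tree for the literal that produced it, which ties the log back to a real code path. A toy illustration (the file layout and log text are invented for the demo, not from any real service):

```python
import pathlib
import tempfile

def find_log_source(repo_root, fragment):
    """Return (filename, line_no) pairs where a log-message fragment appears in source files."""
    hits = []
    for path in sorted(pathlib.Path(repo_root).rglob("*.py")):
        for n, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if fragment in line:
                hits.append((path.name, n))
    return hits

# Tiny stand-in repo with one instrumented code path
repo = tempfile.mkdtemp()
pathlib.Path(repo, "cache.py").write_text(
    'def put(key, value):\n'
    '    if len(key) < MIN_TOKENS:\n'
    '        log.warning("request below cache minimum")\n'
)

print(find_log_source(repo, "below cache minimum"))
```

An agent with shell access gets the same effect from plain grep; the point is that the emitting code, not a hand-written query, is the ground truth.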
When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible. But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide:

- Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation.
- Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors.
- Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them.

Memory as a filesystem, not a vector store

Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In the SRE context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance. We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface.
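File-based memory navigation is simple enough to sketch: start from an entry-point file and follow Markdown links until a file mentions the topic at hand. The file names and contents below are illustrative stand-ins, not the agent's real memory format:

```python
import pathlib
import re
import tempfile

MEMORY = tempfile.mkdtemp()

def write(name, text):
    pathlib.Path(MEMORY, name).write_text(text)

# Illustrative two-tier memory: an entry point linking to a deeper file
write("overview.md", "# svc-overview\nCache regressions: see [debugging notes](debugging.md)\n")
write("debugging.md", "# debugging\nPrompt-prefix changes can break KV cache hit rates.\n")

def navigate(start, topic):
    """Follow markdown links from an entry file until some file mentions the topic."""
    text = pathlib.Path(MEMORY, start).read_text()
    if topic.lower() in text.lower():
        return start
    for target in re.findall(r"\]\(([^)]+\.md)\)", text):
        found = navigate(target, topic)
        if found:
            return found
    return None

print(navigate("overview.md", "KV cache"))
```

Unlike embedding retrieval, there is no query to get right up front: the model reads the entry point and drills down only where the links lead.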
The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carries just enough context to orient the agent, with links to deeper files when needed.

The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically.

One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it.

The sandbox as epistemic boundary

The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism.

Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with the gh CLI, like the prompt-ordering fix from the KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed.
No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free.

Bet 2: Context Layering

Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it. We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action.

- Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration.
- Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions.
- Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed.
- Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope.

Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start.

Bet 3: Frugal Context Management

Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast.
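One concrete form of that frugality: write a large tool result to disk and filter it with ordinary tools, so only the distilled slice ever enters the model's context. A minimal sketch with an invented payload shape (a real agent might equally run jq or grep over the file):

```python
import json
import os
import tempfile

# Hypothetical oversized tool result: 500 records, only a handful of which matter
payload = {"items": [{"id": i, "status": "error" if i % 50 == 0 else "ok"}
                     for i in range(1, 501)]}

# Materialize the result as a file instead of pasting it into the prompt
path = os.path.join(tempfile.mkdtemp(), "tool_output.json")
with open(path, "w") as f:
    json.dump(payload, f)

# Filter outside the model interface; only this summary would be surfaced into context
with open(path) as f:
    errors = [item["id"] for item in json.load(f)["items"] if item["status"] == "error"]

print(f"{len(errors)} errors: {errors}")
```

The raw payload stays on disk for later re-querying; the context window only pays for the ten-item summary.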
Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal.

Tool result compression via the filesystem

Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive.

Context Pruning and Auto-Compact

Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work.

Parallel subagents

The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another.

The Feedback loop

These architectural bets have enabled us to close the original scaling gap.
Instead of debugging the agent at human speed, we could finally start using it to fix itself. As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway, and some conversations broke entirely. So we set up a daily monitoring task for these failures. The agent searches the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%.

Over the last month, we have successfully used our agent across a wide range of scenarios:

- Analyzed our user churn rate and built dashboards we now review weekly.
- Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase.
- Ran security analysis and found vulnerabilities in the read path.
- Helped fill out parts of its own Responsible AI review, with strict human review.
- Handles customer-reported issues and LiveSite alerts end to end.

Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users.

What We Learned

We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space.
Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations.

The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one.

We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one.

Thanks to visagarwal for co-authoring this post.

Update To API Management Workspaces Breaking Changes: Built-in Gateway & Tiers Support
What’s changing?

If your API Management service uses preview workspaces on the built-in gateway and meets the tier-based limits below, those workspaces will continue to function as-is and will automatically transition to general availability once built-in gateway support is fully announced.

| API Management tier | Limit of workspaces on built-in gateway |
| --- | --- |
| Premium and Premium v2 | Up to 30 workspaces |
| Standard and Standard v2 | Up to 5 workspaces |
| Basic and Basic v2 | Up to 1 workspace |
| Developer | Up to 1 workspace |

Why this change?

We introduced the requirement for workspace gateways to improve reliability and scalability in large, federated API environments. While we continue to recommend workspace gateways, especially for scenarios that require greater scalability, isolation, and long-term flexibility, we understand that many customers have established workflows using the preview workspaces model or need workspaces support in non-Premium tiers.

What’s not changing?

Other aspects of the workspace-related breaking changes remain in effect. For example, service-level managed identities are not available within workspaces. In addition to workspaces support on the built-in gateway described in the section above, Premium and Premium v2 services will continue to support deploying workspaces with workspace gateways.

Resources

- Workspaces in Azure API Management
- Original breaking changes announcements
- Reduced tier availability
- Requirement for workspace gateways

Azure SRE Agent Now Builds Expertise Like Your Best Engineer: Introducing Deep Context
What if SRE Agent already knew your system before the next incident?

Your most experienced SRE didn't become an expert overnight. Day one: reading runbooks, studying architecture diagrams, asking a lot of questions. Month three: knowing which services are fragile, which config changes cascade, which log patterns mean real trouble. Year two: diagnosing a production issue at 2 AM from a single alert because they'd built deep, living context about your systems. That learning process, absorbing documentation, reading code, handling incidents, building intuition from every interaction, is what makes an expert. Azure SRE Agent could do the same thing.

From pulling context to living in it

Azure SRE Agent already connects to Azure Monitor, PagerDuty, and ServiceNow. It queries Kusto logs, checks resource health, reads your code, and delivers root cause analysis, often resolving incidents without waking anyone up. Thousands of incidents handled. Thousands of engineering hours saved.

Deep Context takes this to the next level. Instead of accessing context on demand, your agent now lives in it — continuously reading your code, building persistent memory from every interaction, and evolving its understanding of your systems in the background.

Three things make Deep Context work:

- Continuous access. Source code, terminal, Python runtime, and Azure environment are available whenever the agent needs them. Connected repos are cloned into the agent's workspace automatically. The agent knows your code structure from the first message.
- Persistent memory. Insights from previous investigations, architecture understanding, team context — it all persists across sessions. The next time the agent picks up an alert, it already knows what happened last time.
- Background intelligence. Even when you're not chatting, background services continuously learn. After every conversation, the agent extracts what worked, what failed, what the root cause was.
It aggregates these across all past investigations to build evolving operational insights. The agent recognizes patterns you haven’t noticed yet. One example: connected to Kusto, background scanning auto-discovers every table, documents schemas, and builds reusable query templates. But this learning applies broadly: every conversation, every incident, every data source makes the agent sharper.

Expertise that compounds with every incident

| | New on-call engineer | SRE Agent with Deep Context |
| --- | --- | --- |
| Alert fires | Opens a runbook, looks up which service this maps to | Already knows the service, its dependencies, and failure patterns from prior incidents |
| Investigation | Reads logs, searches code, asks teammates | Goes straight to the relevant code path, correlates with logs and persistent insights from similar incidents |
| After 100 incidents | Becomes the team expert, with irreplaceable institutional knowledge | Same institutional knowledge: always available, never forgets, scales across your entire organization |

A human expert takes months to build this depth. An agent with Deep Context builds it in days, and the knowledge compounds with every interaction.

You shape what your agent learns. Deep Context learns automatically, but the best results come when your team actively guides what the agent retains. Type #remember in chat to save important facts your agent should always know: environment details, escalation paths, team preferences. For example: "#remember our Redis cache uses Premium tier with 6GB" or "#remember database failover takes approximately 15 minutes." These are recalled automatically during future investigations.

Turn investigations into knowledge. After a good investigation, ask your agent to turn the resolution into a runbook: "Create a troubleshooting guide from the steps we just followed and save it to Knowledge settings."
The agent generates a structured document, uploads it, and indexes it, so the next time a similar issue occurs, the agent finds and follows that guide automatically.

The agent captures insights from every conversation on its own. Your guidance tells it which ones matter most. This is exactly how Microsoft’s own SRE team gets the best results: “Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn’t fail that class of problem again.” Read the full story in The Agent That Investigates Itself.

See it in action: an Azure Monitor alert, end to end

An HTTP 5xx spike fires on your container app. Your agent is in autonomous mode. It acknowledges the alert, checks resource health, reads logs, and delivers a diagnosis. That’s what it already does well. Deep Context makes this dramatically better. Two things change everything:

- The agent already knows your environment. It’s already read your code and runbooks, and built context from previous investigations. Your route handlers, database layer, deployment configs, operational procedures: it knows all of it. So when this alert fires, it doesn’t start from scratch. It goes straight to the relevant code path, correlates a recent connection pooling commit with the deployment timeline, and confirms the root cause.
- The agent remembers. It’s seen this pattern before: a similar incident last week that was investigated but never permanently fixed. It recognizes the recurrence from persistent memory, skips rediscovery, confirms the issue is still in the code, and this time fixes it.

Because it’s in autonomous mode, the agent edits the source code, restarts the container, pushes the fix to a new branch, creates a PR, opens a GitHub Issue, and verifies service health, all before you wake up. The agent delivers a complete remediation summary including the alert, root cause with code references, fix applied, and PR created, without a single message from you.

Code access turns diagnosis into action.
Persistent memory turns recurring problems into solved problems.

Give your agent your code: here’s why it matters

If you’re on an IT operations, SRE, or DevOps team, you might think: "Code access? That’s for developers." We’d encourage you to rethink that. Your infrastructure-as-code, deployment configs, Helm charts, Terraform files, and pipeline definitions are all code. And they are exactly the context your agent needs to go from good to extraordinary.

When your agent can read your actual configuration and infrastructure code, investigations transform. Instead of generic troubleshooting, you get root cause analysis that points to the exact file, the exact line, the exact config change. It correlates a deployment failure with a specific commit. It reads your Helm values and spots the misconfiguration that caused the pod crash loop.

"Will the agent modify our production code?" No. The agent works in a secure sandbox: a copy of your repository, not your production environment. When it identifies a fix, it creates a pull request on a new branch. Your code review process, your CI/CD pipeline, your approval gates: all untouched. The agent proposes. Your team decides.

Whether you’re a developer, an SRE, or an IT operator managing infrastructure you didn’t write, connecting your code is the single highest-impact thing you can do to make your agent smarter.

The compound effects

Deep Context amplifies every other SRE Agent capability:

- Deep Context + Incident management → Alerts fire, and the agent correlates logs with actual code. Root cause references specific files and line numbers.
- Deep Context + Scheduled tasks → Automated code analysis, compliance checks, and drift detection, inspecting your actual infrastructure code, not just metrics.
- Deep Context + MCP connectors → Datadog, Splunk, and PagerDuty data combined with source code context. The full picture in one conversation.
- Deep Context + Knowledge files → Upload runbooks, architecture docs, and postmortems, in any format.
The agent cross-references your team’s knowledge with live code, logs, and infrastructure state. Logs tell the agent what happened. Code tells it why. Your knowledge files tell it what to do about it.

Get started

Deep Context is available today as part of Azure SRE Agent GA. New agents have it enabled by default. For a step-by-step walkthrough of connecting your code, logs, incidents, and knowledge files, see What It Takes to Give an SRE Agent a Useful Starting Point.

Resources

- SRE Agent GA announcement blog: https://aka.ms/sreagent/ga
- SRE Agent GA what’s new post: https://aka.ms/sreagent/blog/whatsnewGA
- SRE Agent documentation: https://aka.ms/sreagent/newdocs
- SRE Agent overview: https://aka.ms/sreagent/newdocsoverview

Building Reusable Custom Images for Azure Confidential VMs Using Azure Compute Gallery
Overview

Azure Confidential Virtual Machines (CVMs) provide hardware-enforced protection for sensitive workloads by encrypting data in use using AMD SEV-SNP technology. In enterprise environments, organizations typically need to:

- Create hardened golden images
- Standardize baseline configurations
- Support both Platform Managed Keys (PMK) and Customer Managed Keys (CMK)
- Version and replicate images across regions

This guide walks through the correct and production-supported approach for building reusable custom images for Confidential VMs using:

- PowerShell (Az module)
- Azure Portal
- Disk Encryption Sets (CMK)
- Azure Compute Gallery

Key Design Principles

Before diving into implementation steps, it is important to clarify two architectural truths that become clear during real-world implementations:

✅1️⃣ The Same Image Supports PMK and CMK

The encryption model (PMK vs CMK) is not embedded in the image. Encryption is applied:

- At VM deployment time
- Through disk configuration (default PMK, or a Disk Encryption Set for CMK)

This means you build one golden image and deploy it using PMK or CMK depending on compliance requirements. This simplifies lifecycle management significantly.

✅2️⃣ Confidential VM Image Versions Must Use a Source VHD

When publishing to Azure Compute Gallery, Confidential VMs require a source VHD (mandatory requirement). This is a platform requirement for Confidential security type support. Therefore, the correct workflow is:

1. Deploy a base Confidential VM
2. Harden and configure
3. Generalize
4. Export the OS disk as a VHD
5. Upload to storage
6. Publish to Azure Compute Gallery
7. Deploy using PMK or CMK

Security Stack Breakdown

| Protection Area | Technology |
| --- | --- |
| Data in Use | AMD SEV-SNP |
| Boot Integrity | Secure Boot + vTPM |
| Image Lifecycle | Azure Compute Gallery |
| Disk Encryption | PMK or CMK |
| Compliance Control | Disk Encryption Set (CMK) |

Implementation Steps

🖥️ Step 1 – Deploy a Base Windows Confidential VM

This VM will serve as the image builder.
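Before working through Step 1, it can help to confirm that confidential VM sizes are actually offered in your target region and subscription. A minimal sketch using the Az PowerShell module; the region name and size filters below are illustrative assumptions, not part of the original guide:

```powershell
# Assumes Connect-AzAccount has already been run.
# Lists DCasv5/ECasv5-series confidential VM sizes offered in the target region;
# an empty result, or Restricted = True, means you should pick another region or size.
Get-AzComputeResourceSku -Location "northeurope" |
    Where-Object {
        $_.ResourceType -eq "virtualMachines" -and
        ($_.Name -like "Standard_DC*as_v5" -or $_.Name -like "Standard_EC*as_v5")
    } |
    Select-Object Name, @{ n = "Restricted"; e = { $_.Restrictions.Count -gt 0 } }
```

If the size you plan to use (for example Standard_DC2as_v5) does not appear, choose a different region before proceeding.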
Key Requirements

- Gen2 image
- Confidential SKUs (such as the DCasv5 or ECasv5 series)
- SecurityType = ConfidentialVM
- Secure Boot enabled
- vTPM enabled
- Confidential OS encryption enabled

Reference Code Snippets (PowerShell)

```powershell
$rg = "rg-cvm-gi-pr-sbx-01"
$location = "NorthEurope"
$vmName = "cvmwingiprsbx01"

New-AzResourceGroup -Name $rg -Location $location

$cred = Get-Credential

$vmConfig = New-AzVMConfig `
    -VMName $vmName `
    -VMSize "Standard_DC2as_v5" `
    -SecurityType "ConfidentialVM"

$vmConfig = Set-AzVMOperatingSystem `
    -VM $vmConfig `
    -Windows `
    -ComputerName $vmName `
    -Credential $cred

$vmConfig = Set-AzVMSourceImage `
    -VM $vmConfig `
    -PublisherName "MicrosoftWindowsServer" `
    -Offer "WindowsServer" `
    -Skus "2022-datacenter-azure-edition" `
    -Version "latest"

$vmConfig = Set-AzVMOSDisk `
    -VM $vmConfig `
    -CreateOption FromImage `
    -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithPlatformKey"

New-AzVM -ResourceGroupName $rg -Location $location -VM $vmConfig
```

📸 Reference Screenshots

🔧 Step 2 – Harden and Customize the OS

This is where you:

- Install monitoring agents
- Install Defender for Endpoint
- Apply a CIS baseline
- Install security agents
- Remove unwanted services
- Install application dependencies

This is your enterprise golden baseline, adjusted to individual organizational requirements.

🔄 Step 3 – Generalize the Windows Confidential VM (Production-Ready Method)

Confidential VMs often enable BitLocker automatically, and improper Sysprep handling can cause failures. Generalizing a Windows Confidential VM properly is critical to avoid:

- Sysprep failures
- BitLocker conflicts
- Image corruption
- Deployment errors later

Follow these steps carefully inside the VM, and later through Azure PowerShell.

1. Remove the Panther Folder

The Panther folder stores logs from previous Sysprep operations. If leftover logs exist, Sysprep can fail. The following command safely removes old Sysprep metadata.
```cmd
rd /s /q C:\Windows\Panther
```

✔ This step prevents common “Sysprep was not able to validate your Windows installation” errors.

2. Run Sysprep

Navigate to the Sysprep directory and run Sysprep:

```cmd
cd %windir%\system32\sysprep
sysprep.exe /generalize /shutdown
```

Parameters explained:

| Parameter | Purpose |
| --- | --- |
| /generalize | Removes machine-specific info (SID, drivers) |
| /shutdown | Powers off the VM after completion |

⚠️ Handling BitLocker issues (common in Confidential VMs): Confidential VMs may automatically enable BitLocker. If Sysprep fails due to encryption, follow the next steps to resolve the issue and execute Sysprep again.

3. Check BitLocker Status and Turn It Off

```cmd
manage-bde -status
```

If Protection Status is “Protection On”:

```cmd
manage-bde -off C:
```

Wait for decryption to complete fully. ⚠️ Do not run Sysprep again until decryption reaches 100%.

4. Reboot and Run Sysprep Again

After decryption completes:

- Reboot the VM
- Open Command Prompt as Administrator
- Navigate to the Sysprep folder and run Sysprep:

```cmd
cd %windir%\system32\sysprep
sysprep.exe /generalize /shutdown
```

✔ The VM will shut down automatically.

5. Mark the VM as Generalized in Azure

Now switch to Azure PowerShell:

```powershell
Stop-AzVM -Name $vmName -ResourceGroupName $rg -Force
Set-AzVM -Name $vmName -ResourceGroupName $rg -Generalized
```

✔ This marks the VM as ready for image capture.

🧠 Why These Extra Steps Matter in Confidential VMs

Confidential VMs differ from standard VMs because:

- They use vTPM
- They may auto-enable BitLocker
- They enforce Secure Boot
- They use Gen2 images

Improper handling can cause:

- Sysprep failures
- Image capture errors
- “VM provisioning failed” errors when deploying from the image

These cleanup steps dramatically increase the success rate.

💾 Step 4 – Export the OS Disk as a VHD

Gallery image definitions with security type 'TrustedLaunchAndConfidentialVmSupported' require a source VHD, because support for a source image VM is not available. Generate a SAS URL for the OS disk of the virtual machine.
Copy it to a storage account as a .vhd file. Use Get-AzStorageBlobCopyState to validate the copy status and wait for completion.

```powershell
$vm = Get-AzVM -Name $vmName -ResourceGroupName $rg
$osDiskName = $vm.StorageProfile.OsDisk.Name

# Grant short-lived read access to the OS disk and obtain a SAS URL
$sas = Grant-AzDiskAccess `
    -ResourceGroupName $rg `
    -DiskName $osDiskName `
    -Access Read `
    -DurationInSecond 3600

$storageAccountName = "stcvmgiprsbx01"
$storageContainerName = "images"
$destinationVHDFileName = "cvmwingiprsbx01-OsDisk-VHD.vhd"

# Authenticate to the storage account with the signed-in Azure account
$destinationContext = New-AzStorageContext -StorageAccountName $storageAccountName -UseConnectedAccount

Start-AzStorageBlobCopy -AbsoluteUri $sas.AccessSAS -DestContainer $storageContainerName -DestContext $destinationContext -DestBlob $destinationVHDFileName

# Poll the copy status until it reports completion
Get-AzStorageBlobCopyState -Blob $destinationVHDFileName -Container $storageContainerName -Context $destinationContext
```

🏢 Step 5 – Create the Azure Compute Gallery and Image Version

Instead of creating a standalone managed image, we will:

- Create an Azure Compute Gallery
- Create an image definition
- Publish a gallery image version from the generalized Confidential VM

This enables:

- Versioning
- Regional replication
- Staged rollouts
- Enterprise image lifecycle management

1. Create the Azure Compute Gallery

```powershell
$galleryName = "cvmImageGallery"

New-AzGallery `
    -GalleryName $galleryName `
    -ResourceGroupName $rg `
    -Location $location `
    -Description "Confidential VM Image Gallery"
```

2.
Create the Image Definition for the Windows Confidential VM

Important settings:

- OS State = Generalized
- OS Type = Windows
- Hyper-V Generation = V2
- Security Type = TrustedLaunchAndConfidentialVmSupported

```powershell
$imageDefName = "img-win-cvm-gi-pr-sbx-01"
$confidentialVMSupported = @{ Name = 'SecurityType'; Value = 'TrustedLaunchAndConfidentialVmSupported' }
$features = @($confidentialVMSupported)

New-AzGalleryImageDefinition `
    -GalleryName $galleryName `
    -ResourceGroupName $rg `
    -Location $location `
    -Name $imageDefName `
    -OsState Generalized `
    -OsType Windows `
    -Publisher "prImages" `
    -Offer "WindowsServerCVM" `
    -Sku "2022-dc-azure-edition" `
    -HyperVGeneration V2 `
    -Feature $features
```

✔ HyperVGeneration must be V2 for Confidential VMs.

📸 Reference Screenshot

3. Create the Gallery Image Version from the Generalized VM

Now publish version 1.0.0 from the generalized VM's OS disk VHD to the image definition:

- There is no support for performing this step with Azure PowerShell, so the Azure Portal must be used
- Ensure the right network and RBAC access to the storage account is in place
- Replication to multiple regions can be enabled on the image version for enterprises

✅ Why Azure Compute Gallery is the Right Choice

| Feature | Managed Image | Azure Compute Gallery |
| --- | --- | --- |
| Versioning | ❌ | ✅ |
| Cross-region replication | ❌ | ✅ |
| Enterprise lifecycle | Limited | Full |
| Recommended for production | ❌ | ✅ |

For enterprise confidential workloads, Azure Compute Gallery is strongly recommended.

🚀 Step 6 – Deploy a Confidential VM from the Gallery Image

🔹 Using PMK (Default)

If you do not specify a Disk Encryption Set, Azure uses Platform Managed Keys automatically.
```powershell
$imageId = (Get-AzGalleryImageVersion `
    -GalleryName $galleryName `
    -GalleryImageDefinitionName $imageDefName `
    -ResourceGroupName $rg `
    -Name "1.0.0").Id

$vmConfig = New-AzVMConfig `
    -VMName "cvmwingiprsbx02" `
    -VMSize "Standard_DC2as_v5" `
    -SecurityType "ConfidentialVM"

$vmConfig = Set-AzVMOSDisk `
    -VM $vmConfig `
    -CreateOption FromImage `
    -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithPlatformKey"

$vmConfig = Set-AzVMSourceImage -VM $vmConfig -Id $imageId
$vmConfig = Set-AzVMOperatingSystem -VM $vmConfig -Windows -ComputerName "cvmwingiprsbx02" -Credential (Get-Credential)

New-AzVM -ResourceGroupName $rg -Location $location -VM $vmConfig
```

🔹 Using CMK (Same Image!)

If compliance requires CMK:

- Create a Disk Encryption Set
- Associate it with a Key Vault or Managed HSM
- Attach the DES during deployment

```powershell
$vmConfig = Set-AzVMOSDisk `
    -VM $vmConfig `
    -CreateOption FromImage `
    -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithCustomerKey" `
    -DiskEncryptionSetId $des.Id
```

✔ Same image ✔ Different encryption model ✔ Encryption applied at deployment

🔎 Validation

Check Confidential security:

```powershell
Get-AzVM -Name "cvmwingiprsbx02" -ResourceGroupName $rg | Select SecurityProfile
```

Check disk encryption:

```powershell
Get-AzDisk -ResourceGroupName $rg
```

Architectural Summary

- Confidential VM security is independent of the disk encryption model
- The encryption choice is applied at deployment
- One image supports multiple compliance models
- A source VHD is required for Confidential VM gallery publishing
- Azure Compute Gallery enables an enterprise image lifecycle

PMK vs CMK Decision Matrix

| Scenario | Recommended Model |
| --- | --- |
| Standard enterprise workloads | PMK |
| Financial services / regulated | CMK |
| BYOK requirement | CMK |
| Simplicity prioritized | PMK |

🏢 Enterprise Recommendations

- ✔ Always use Azure Compute Gallery
- ✔ Use semantic versioning (1.0.0, 1.0.1)
- ✔ Automate using Azure Image Builder
- ✔ Enforce Confidential VM via Azure Policy
- ✔ Enable Guest Attestation
- ✔ Monitor with Defender for Cloud

Final Thoughts

Creating custom images for Azure
Confidential VMs allows organizations to combine the security benefits of Confidential Computing with the operational efficiency of standardized deployments. By baking security baselines, monitoring agents, and required configurations directly into a golden image, every new VM starts from a consistent and trusted foundation.

A key advantage of this approach is flexibility. The custom image itself is independent of the disk encryption model, meaning the same image can be deployed using Platform Managed Keys (PMK) for simplicity or Customer Managed Keys (CMK) to meet stricter compliance requirements. This allows platform teams to maintain a single image pipeline while supporting multiple security scenarios.

By publishing images through Azure Compute Gallery, organizations can version, replicate, and manage their Confidential VM images more effectively. Combined with proper VM generalization and hardening practices, custom images become a reliable way to ensure secure, consistent, and scalable deployments of Confidential workloads in Azure.

As Confidential Computing continues to gain adoption across industries handling sensitive data, investing in a well-designed custom image pipeline will enable organizations to scale securely while maintaining consistency, compliance, and operational efficiency across their cloud environments.

Calling all Microsoft Q&A contributors: Join Product Champions Program
🎉 Sign-ups are open for the Microsoft Q&A Product Champions Program (2026)!

✅ Sign up: https://aka.ms/AAzhkru
📘 Learn more + Welcome Guide: https://aka.ms/ProductChampionsWelcome

If you love answering questions and helping others on Microsoft Q&A, we’d love to have you join.

Announcing the General Availability of the Azure Maps Geocode Autocomplete API
It was just last September that we introduced the Azure Maps Geocode Autocomplete API, and now we’re excited to announce its general availability (GA): a production-ready REST service that delivers fast, intelligent, and structured location suggestions. After a successful public preview, we have integrated your feedback, and this API is now fully supported for mission-critical workloads, providing a modern, scalable successor to the Bing Maps Autosuggest REST API and empowering developers to deliver high-quality, responsive location experiences.

Why This API Matters

Today’s applications increasingly rely on smart, real-time location suggestions, whether for store locators, delivery routing, rideshare dispatch, address entry, or dynamic search UIs. The Azure Maps Geocode Autocomplete API brings:

- Instant, relevant, typed-ahead suggestions
- Granular result type filtering for the Place or Address categories, or even subtypes of the Place category
- Structured, developer-friendly outputs
- Support for multilingual experiences and localized ranking
- Proximity-aware and popularity-aware results

Whether you are modernizing an existing solution or building something new, this API gives you precision, flexibility, and performance at scale.

From Preview to GA: What’s New

✔ Stable GA API version: The service is now available on the 2026-01-01 GA version, offering a reliable, long-term contract for production workloads.

✔ Improved suggestion ranking and language handling: This version introduces refined ranking logic and improved language handling, enabling more accurate, geo-aware, and language-appropriate suggestions across global markets.

✔ Updated documentation and samples: The GA release includes new and refreshed REST API documentation, updated samples, and an improved how-to guide to help developers build complete geocoding workflows and migrate from Bing Maps more easily.
What It Can Do

The Geocode Autocomplete API is purpose-built to deliver accurate, structured suggestions as users type. Core capabilities include:

- Entity suggestions: Place entities (e.g., administrative divisions, populated places, landmarks, postal codes) and Address entities (e.g., roads, point addresses)
- Structured output: Consistent address components for easy downstream integration
- Relevance-based ranking: Popularity, proximity, and bounding-box awareness
- Multilingual support: Honor language preferences via Accept-Language
- Flexible filtering: Target specific countries or entity types

These capabilities help you create intuitive, high-quality user experiences across global applications.

How to Use It

Use the GA version of Geocode Autocomplete with the following endpoint:

```
https://atlas.microsoft.com/search/geocode:autocomplete?api-version=2026-01-01&query={query}
```

Common optional parameters include:

- coordinates – Bias suggestions near the provided location
- bbox – Bias suggestions to entities that fall within the specified bounding box of the visible map area
- top – Limit the number of returned suggestions
- resultTypeGroups / resultTypes – Filter for Places or Addresses or their subtypes
- countryRegion – Restrict results to a specific country/region

A typical workflow:

1. Call Autocomplete as the user types
2. Use the selected suggestion with the Azure Maps Geocoding service to retrieve coordinates and complete the interaction (e.g., map display, routing, address validation)

Example: Autocomplete for a Place Entity

Query:

```
GET https://atlas.microsoft.com/search/geocode:autocomplete?api-version=2026-01-01
    &subscription-key={YourAzureMapsKey}
    &coordinates={coordinates}
    &query=new yo
    &top=3
```

This request returns the top 3 structured suggestions for the partial input “new yo”, helping users quickly find places like New York City or New York State based on the user's location.
```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "typeGroup": "Place",
        "type": "PopulatedPlace",
        "geometry": null,
        "address": {
          "locality": "New York",
          "adminDistricts": [ { "name": "New York", "shortName": "N.Y." } ],
          "countryRegions": { "ISO": "US", "name": "United States" },
          "formattedAddress": "New York, N.Y."
        }
      }
    },
    {
      "type": "Feature",
      "properties": {
        "typeGroup": "Place",
        "type": "AdminDivision1",
        "geometry": null,
        "address": {
          "locality": "",
          "adminDistricts": [ { "name": "New York", "shortName": "N.Y." } ],
          "countryRegions": { "ISO": "US", "name": "United States" },
          "formattedAddress": "New York"
        }
      }
    },
    {
      "type": "Feature",
      "properties": {
        "typeGroup": "Place",
        "type": "AdminDivision2",
        "geometry": null,
        "address": {
          "locality": "",
          "adminDistricts": [
            { "name": "New York", "shortName": "N.Y." },
            { "name": "New York County" }
          ],
          "countryRegions": { "ISO": "US", "name": "United States" },
          "formattedAddress": "New York County"
        }
      }
    }
  ]
}
```

Example: Autocomplete for an Address Entity

Query:

```
GET https://atlas.microsoft.com/search/geocode:autocomplete?api-version=2026-01-01
    &subscription-key={YourAzureMapsKey}
    &bbox={bbox}
    &query=One Micro
    &top=3
    &countryRegion=US
```

This request returns a structured address suggestion for NE One Microsoft Way, categorized as an Address entity.
```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "typeGroup": "Address",
        "type": "RoadBlock",
        "geometry": null,
        "address": {
          "locality": "Redmond",
          "adminDistricts": [
            { "name": "Washington", "shortName": "WA" },
            { "name": "King County" }
          ],
          "countryRegions": { "ISO": "US", "name": "United States" },
          "postalCode": "98052",
          "streetName": "NE One Microsoft Way",
          "addressLine": "",
          "formattedAddress": "NE One Microsoft Way, Redmond, WA 98052, United States"
        }
      }
    }
  ]
}
```

Example: Integrate with a Web Application

The sample below shows a user entering a query, with the autocomplete service providing a series of suggestions based on the user's input and location.

Ready to Bring Autocomplete to Production?

The Azure Maps Geocode Autocomplete API is now fully production-ready, giving you the stability, performance, and flexibility needed to power fast, intuitive, location-driven experiences. Whether you're migrating from Bing Maps Autosuggest or enhancing your existing address or search workflows, the GA release makes it easier to deliver high-quality suggestions at scale.

If you're ready to upgrade your location experiences, now's the perfect time to start. Explore the updated Azure Maps Autocomplete documentation, try the new endpoint, and let us know how it works for you. We'd love to hear your feedback. Let's keep building great location intelligence experiences together.

Resources to Get Started

- Geocode Autocomplete REST API Documentation
- How-to Guide for Azure Maps Search Service
- Geocode Autocomplete Samples
- Migrate from Bing Maps to Azure Maps
- How to Use Azure Maps APIs
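As a closing illustration, the query patterns shown above can be wrapped in a small client-side helper that builds the autocomplete request URL. A minimal sketch in Python: the endpoint, parameter names, and API version come from the examples in this post, while the helper itself, its name, and the lon,lat ordering of the coordinates bias are illustrative assumptions.

```python
from urllib.parse import urlencode

# Endpoint from the examples in this post
BASE_URL = "https://atlas.microsoft.com/search/geocode:autocomplete"

def build_autocomplete_url(query, subscription_key, api_version="2026-01-01",
                           coordinates=None, top=None, country_region=None):
    """Build a Geocode Autocomplete request URL.

    `coordinates` is assumed here to be a (lon, lat) pair used to bias
    suggestions near the user's location; `top` limits the result count.
    """
    params = {
        "api-version": api_version,
        "subscription-key": subscription_key,
        "query": query,
    }
    if coordinates is not None:
        lon, lat = coordinates
        params["coordinates"] = f"{lon},{lat}"
    if top is not None:
        params["top"] = top
    if country_region is not None:
        params["countryRegion"] = country_region
    # urlencode handles percent-escaping of the query text and key
    return f"{BASE_URL}?{urlencode(params)}"

# Example: top-3 suggestions for the partial input "new yo",
# biased toward an assumed user location in Seattle.
url = build_autocomplete_url("new yo", "{YourAzureMapsKey}",
                             coordinates=(-122.33, 47.61), top=3)
print(url)
```

Each keystroke can call this helper and issue the GET with your HTTP client of choice; the selected suggestion is then passed to the Azure Maps Geocoding service, per the typical workflow described above.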