azure monitor
1349 Topics

Accelerating AKS troubleshooting with the Azure Copilot Observability Agent
AKS incidents rarely stay within one Kubernetes object, signal, or tool. A latency spike might first appear in application telemetry, but the root cause may sit elsewhere: pod restarts, node pressure, scheduling failures, or a recent configuration change. The Azure Copilot Observability Agent in Azure Monitor helps connect these signals into an explainable investigation, so teams can move from symptoms to evidence-backed next steps.

Why AKS troubleshooting is complex

Troubleshooting Azure Kubernetes Service (AKS) is complex because failures can originate in workloads, platform components, infrastructure, or the application code running on the cluster. For example, pods stuck in Pending may indicate capacity or scheduling issues, while application latency may be caused by throttling, failed probes, pod restarts, or node pressure below the app.

During an incident, simply having more telemetry is not enough. Teams need a way to test likely causes, rule out unrelated signals, and keep the investigation tied to the affected workload and time window.

From signal to root cause: the investigation flow

The Observability Agent follows a consistent investigation pipeline:

Scope the problem by identifying the most likely infrastructure resources involved, plus connected dependencies.
Collect data across metrics, logs, traces, change history, and related signals.
Detect anomalies using learned baselines (for metrics) and log analysis.
Correlate across resources spanning infrastructure and application layers.
Run deep diagnostics by invoking resource-specific tools when needed to pinpoint root cause.
Summarize findings in a structured format: what happened, why it happened, and what to do next.

AKS investigation data sources

The agent works with telemetry already available in your Azure Monitor environment.
Investigation depth improves as more relevant signals are enabled, including Container insights logs, Kubernetes events and state, Azure managed service for Prometheus, container and pod logs, Application Insights telemetry for AKS-hosted workloads, Azure Activity Log changes, control plane logs routed through diagnostic settings, and resource metadata for the cluster, node pools, workloads, and related Azure resources.

Figure 1. AKS investigation data sources

You don’t need to enable every telemetry source to get started. The Observability Agent uses the data already available in Azure Monitor, and its findings become more complete as more AKS and application signals are collected.

Example 1: AKS infrastructure - explaining why new pods never start

Consider a workload rollout on AKS where replacement pods remain stuck in Pending state. What looks like a failed release may stem from the workload definition, cluster state, or underlying infrastructure.

Investigation walkthrough

Symptom: rollout is blocked
Replacement pods remain in Pending during rollout, and Kubernetes events show repeated scheduling failures. This indicates that the rollout is blocked before new pods can start.

Workload evidence: scheduling, not startup
Pod state identifies the affected workload, while Kubernetes events show repeated placement failures. The issue is therefore tied to scheduling rather than application startup or container crash behavior.

Cluster evidence: capacity pressure
When enabled, Prometheus node metrics show CPU and memory utilization near capacity. Cluster-level trends show resource pressure increasing at the same time as pending pods and scheduling failures.

Likely cause: insufficient schedulable capacity
The scheduler cannot place new pods because the relevant node pool does not have enough available capacity. The failed rollout is best explained by capacity pressure in the target node pool rather than an application crash or image startup failure.
Recommended action
Scale out the affected node pool or adjust workload resource requests, then retry the rollout once schedulable capacity is restored.

Figure 2. AKS investigation flow

The Observability Agent connects pod state, scheduling events, and node pressure to explain why the rollout is blocked and which capacity action to consider next.

Example 2: Joint app-AKS investigation - tracing application latency to pod restarts

Now consider a customer-facing application where users see increased latency and intermittent HTTP 5xx errors after deployment. The first symptom appears in application telemetry, but the unhealthy requests are served by pods that are repeatedly restarting in AKS.

Investigation walkthrough

Symptom: customer-facing service degradation
After deployment, application telemetry shows increased latency and HTTP 5xx errors. The first visible impact appears at the application layer.

AKS evidence: unstable pods
Affected pods enter CrashLoopBackOff, restart counts increase, and Kubernetes events show back-off restarts, probe failures, or image or command errors. Container logs point to startup exceptions, missing configuration, or crash details.

Resource evidence: workload-specific pressure
Container memory usage approaches configured limits before restarts, while node metrics show no broad node pressure. This suggests the issue is workload-specific rather than cluster-wide capacity related.

Change evidence: deployment correlation
Deployment history shows a new image or configuration change shortly before restarts began, with no matching platform health event. The timing points to the latest deployment or configuration change.

Recommended action
Review the latest image or configuration change, inspect container logs, adjust memory limits, or roll back if needed. Focus remediation on the workload change rather than node pool scaling.

This pattern shows how an application symptom can map back to AKS workload behavior.
Application telemetry establishes the user impact, while Kubernetes events, container logs, and resource metrics help explain why the affected pods keep failing.

Operational impact

For site reliability engineers, platform teams, and IT professionals, the Observability Agent reduces the time spent moving between application and AKS telemetry. It brings relevant signals into one investigation, surfaces supporting evidence, and applies Azure Monitor and AKS context so your team can review the findings, validate the recommended path, and decide which production changes to make.

Figure 3. AKS investigation results

Using the Observability Agent

You can start using the Observability Agent from the Azure portal in two common AKS troubleshooting flows:

Investigation mode: Start an investigation from an Azure Monitor alert on an AKS resource or from an Application Insights alert for an AKS-hosted workload. The agent uses the alert context to scope the incident, correlate application and cluster telemetry, and summarize the likely cause with recommended next steps.
Chat-based exploration: Open the Monitor experience in AKS and select the Observability Agent button to chat with your telemetry. Use natural language to ask follow-up questions, explore logs and metrics, detect and inspect anomalies, and narrow down likely causes.

Figure 4. Starting Observability Agent from AKS Monitor experience

Next steps

Azure Copilot Observability Agent overview
Monitor Azure Kubernetes Service with Azure Monitor

Stay connected

Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what's coming next.

Live webinar: A walkthrough of real Observability Agent scenarios, best practices, and what's available today, along with a look at what's coming next and live Q&A with the product team. Register for the Observability Agent webinar.

We'd love your feedback

The Observability Agent continues to evolve based on real-world usage and operator feedback.
Share your thoughts directly through the Give Feedback option in the experience, or reach us at: azureobsagent@microsoft.com

Anomaly detection made easy with Dynamic thresholds for Log search alerts
We’re excited to announce the General Availability of dynamic thresholds for log search alerts in Azure Monitor. Dynamic thresholds make anomaly detection easier by using machine learning to learn normal behavior from your historical log query results, automatically account for patterns such as hourly, daily, and weekly seasonality, and adapt as your environment changes. Instead of manually choosing static limits that can quickly become outdated, you can let Azure Monitor automatically determine the right threshold for each alert rule.

Dynamic thresholds for Log search alerts are available at no extra charge - you pay the standard log search alert rule rate.

Why it matters

Simplified configuration: No need to fine-tune thresholds manually.
Adaptive monitoring: Alerts automatically adapt to changing usage patterns and trends.
At-scale intelligence: For multi-dimensional monitoring, thresholds are calculated per dimension combination.

Example use cases

AKS Pod restart spike anomaly detection

Scenario: Monitor Kubernetes Pod logs for spikes in pod restarts across clusters.

Why dynamic thresholds help: AKS workloads often scale dynamically; static thresholds can’t account for autoscaling patterns. Dynamic thresholds adapt to normal fluctuations in node/pod counts and alert only on true anomalies.

Example query:

KubePodInventory | summarize restartCount = sum(PodRestartCount) by bin(TimeGenerated, 10m), ClusterName, Namespace, Name

Dynamic threshold settings:

Measure: restartCount (the aggregated column from the query).
Split by dimensions (optional):
Namespace (for workload-level baselines).
Name (for per-pod granularity if needed).
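
To make the "thresholds are calculated per dimension combination" point concrete, here is a minimal Python sketch. It is not the algorithm Azure Monitor uses (the post only says it is machine-learning based and seasonality-aware). It assumes a much simpler stand-in, a mean plus or minus k standard deviations band per dimension key, and the names `build_baselines`, `is_anomalous`, and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baselines(history, sensitivity=3.0):
    """Build a (lower, upper) band for every dimension combination.

    history: iterable of (dimension_key, value) pairs, for example
             (("prod", "checkout"), 1) for Namespace/Name and a restart count.
    sensitivity: band half-width in standard deviations (an assumption of
                 this sketch, not an Azure Monitor setting).
    """
    grouped = defaultdict(list)
    for key, value in history:
        grouped[key].append(value)

    bands = {}
    for key, values in grouped.items():
        center = mean(values)
        spread = pstdev(values)  # population standard deviation of the history
        bands[key] = (center - sensitivity * spread, center + sensitivity * spread)
    return bands


def is_anomalous(bands, key, value):
    """Return True when value falls outside the band learned for this key."""
    lower, upper = bands[key]
    return value < lower or value > upper


# Hypothetical restart counts per 10-minute bin for two workloads.
history = (
    [(("prod", "checkout"), v) for v in [0, 1, 0, 2, 1, 0, 1, 1]]
    + [(("batch", "etl"), v) for v in [10, 12, 9, 11, 10, 13, 11, 12]]
)
bands = build_baselines(history)

# The same observed value is unusual for one workload and normal for the other.
checkout_flag = is_anomalous(bands, ("prod", "checkout"), 12)
etl_flag = is_anomalous(bands, ("batch", "etl"), 12)
```

A single static threshold would have to be either too loose for the quiet workload or too noisy for the busy one; splitting by dimension avoids that trade-off. This sketch ignores seasonality and trend, which the real feature is described as handling.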
Resource Inventory Drift Detection (Azure Resource Graph)

Scenario: Use the Log search alerts integration with Azure Resource Graph to detect sudden spikes in resource creation or deletion across subscriptions or management groups, which may indicate runaway deployments.

Why dynamic thresholds help: Large organizations often have thousands of resources with varying deployment patterns. Static thresholds can’t account for seasonal changes (e.g., monthly deployments, scaling events). Dynamic thresholds adapt per subscription or resource type, reducing false positives.

Example query:

arg("").Resources | summarize resourceCount = count() by type, subscriptionId

Dynamic threshold settings:

Measure: resourceCount (the aggregated column from the query).
Split by dimensions (optional):
type (for specific resource type changes).
subscriptionId (for per-subscription granularity).

Getting Started

Learn more about Log search alerts with dynamic thresholds and how to set up alert rules in Azure Monitor.

General Availability: Simple log alerts in Azure Monitor
We are excited to announce the General Availability of Simple log alerts in Azure Monitor! This feature is designed to provide a simplified and more intuitive experience for monitoring and alerting, enhancing your ability to detect and respond to problems in near real-time.

Simple log alerts are a type of Log search alert in Azure Monitor, designed to provide a simpler alternative to traditional Log search alerts. Unlike Log search alerts that aggregate rows over a defined period, Simple log alerts evaluate each row individually.

Simple log alerts also support Basic logs. Previously, choosing Basic logs for cost optimization (for example, configuring the traces table in Application Insights with the Basic logs plan) meant giving up the ability to alert on that data. Simple log alerts close that gap, so you can keep the cost savings and still alert on telemetry stored in Basic logs.

🌐 When to use

Simple log alerts are ideal for monitoring applications or network traffic where unaggregated, real-time detection and quick incident response are critical. Example scenarios:

Failed automation jobs - get notified the moment a backup job, scheduled task, or any automated process fails, rather than waiting for an aggregation window.
Windows events affecting storage or security - alert on individual event log entries that signal disk failures, security breaches, or service disruptions.

🔁 Flexible Trigger Recurrence

By default, Simple log alerts fire on every matching row, but you can tune this to reduce noise. Choose to alert only when the condition is met at least once, twice, three times, or a custom number of times within a minute, giving you control over sensitivity without sacrificing low latency.

💰 Pricing Information

Simple log alert rules evaluate your data every minute, so billing is the same as 1-minute frequency alert rules. For detailed pricing information, refer to Pricing - Azure Monitor | Microsoft Azure.
You will see these rules in your billing statement tagged with kind:simplelogalert.

📚 Documentation and Links

Create a simple log search alert in Azure Monitor - Azure Monitor | Microsoft Learn
Overview of Azure Monitor alerts - Azure Monitor | Microsoft Learn

The Azure Copilot Observability Agent Chat - Stop Writing Queries, Start Asking Questions.
Services and applications produce massive amounts of telemetry, and making sense of all this data takes effort. Data is often spread across different stores, which means the path to clear insights goes through careful querying, refinement, and correlation. The Azure Copilot Observability agent now has a chat experience that simplifies this dramatically: you just ask, in plain, natural language.

Ask questions. Get answers.

To start chatting with the Observability agent, select a resource in the Azure Portal, and choose Logs from the resource menu. Click the Observability agent button. Soon, additional Azure observability experiences will show this or similar buttons so you can chat with the agent throughout your observability process.

The Observability agent chat opens with a short intro message and a few suggested prompts. Select one of the suggestions or type your question in natural language:

“What errors increased in the last 24 hours?”
“¿Existen anomalías de latencia?” (are there any latency anomalies?)
“どの依存関係が失敗しているか” (which dependencies are failing?)

The agent translates your prompt into queries across all relevant data sources, analyzes your data, and returns clear, data-backed insights, so you don't need to write KQL queries, switch between logs and metrics experiences, or dive into the schemas of your data store.

Explore your data - interactively

The chat experience is designed for an interactive process of data exploration and troubleshooting. Through the chat you can explore trends in logs and metrics, identify anomalies, and visualize results directly in the chat, all from one interface.

Note: The agent operates here as your personal observability assistant. It can only query data on your behalf and access resources that you can access.

The chat with the agent has a progressive exploration flow, instead of isolated queries.
Still, at each step in the conversation the agent provides a clear chain of thought, including the actual queries it used, so you can keep track of how it understood your prompt and produced the output. Results are shown clearly and explained.

In the example shown here, we follow up and ask the agent to create a time chart of the failed operations impacted by the errors it reported earlier. The result is clear: GET Customers/Details was impacted significantly, reaching 100K failed requests over a long time.

From exploration to guided investigation

The chat is very useful for guided investigations that go as deep as you choose, just as you would with the classic analysis tools over logs or metrics. Following the example shown above, we ask the agent to show exceptions or traces correlated with the failed requests.

The agent found an association with NullReferenceException, and suggests going deeper and using the operation_Id field to clearly identify the request -> dependency -> exception sequence. We'll accept the recommendation and choose the first suggestion: Pull full transaction timeline.

And here it is: each step of the transaction timeline explained, and the culprit found, a failed Azure Table dependency.

We didn't have to write queries, review metrics, join tables, or even know which tables are there. We used standard terms to ask questions in natural language, and we were able to get as deep as we wanted, and can dive deeper still. For example, you can tell the agent to:

Map this call chain into a sequence-diagram style summary showing request, SQL dependency, table write, and exception.
Calculate the average request latency during the last 6 hours, split by client type, location and OS
Find anomalies in the exceptions logged over the last 4 hours
Create a time chart to show the top 3 anomalies
How many users were impacted by each of the top 3 anomalies found?
Break down the exception counts by request operation

Launching a deep investigation

Through the chat with the observability agent, you can also trigger a full, deep investigation process. A deep investigation doesn't handle just one question, but investigates an incident thoroughly: it maps all related resources, identifies anomalies, performs correlations, analyzes root causes, and eventually provides a detailed report, including findings and recommendations.

To start a deep investigation, select it from the suggestions provided during the conversation, or ask the agent explicitly, for example: run a deep investigation on the NullReferenceException anomaly.

Final thought

If observability used to start with queries, it now starts with a conversation. You can either guide the agent through the process you want to go through, or let it investigate on its own. Just ask.

Stay connected

Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what’s coming next. Check out our recent public preview update of the Azure Copilot Observability agent.

Live webinar

A walkthrough of real Observability agent scenarios, best practices, and what’s available today - along with a look at what’s coming next, and live Q&A with the product team.

👉 Register here

We’d love your feedback

The Observability agent continues to evolve based on real-world usage and operator feedback. Share your thoughts directly through the Give Feedback option in the experience, or reach us at: azureobsagent@microsoft.com

Inside the Observability Agent: How Deep Investigations and Reasoning Work
Deep investigation in the Azure Copilot Observability Agent turns observability data into a verified, data-backed explanation of what happened during an incident, correlating application, infrastructure, and platform signals across time, scope, and type to produce a structured root cause analysis.

Modern VM monitoring, powered by OpenTelemetry
At Build 2026, we're announcing the general availability of OpenTelemetry (OTel) Guest OS metrics for Azure VMs and Arc-enabled Servers. OTel provides a standards-based foundation for VM monitoring with consistent metrics across Windows and Linux, richer Guest OS and per-process visibility, and streamlined integration with open-source and cloud-native observability tools.

Alongside the GA, we're introducing an enhanced VM monitoring experience, recommended alerts, and out-of-the-box Grafana dashboards, all powered by OTel Guest OS metrics. We're also sharing upcoming VM troubleshooting capabilities in the Azure Copilot observability agent enriched by OTel Guest OS metrics.

What are OpenTelemetry Guest OS metrics

OTel Guest OS metrics are collected from inside a VM. Today's coverage includes a curated set of CPU, memory, disk I/O, networking, and per-process metrics including CPU utilization, memory usage, uptime, and thread count. The supported set is point-in-time and will continue to expand as the OTel Host Metrics Receiver evolves upstream. This level of visibility helps customers diagnose operating system and application issues without manually signing into individual VMs.

Why they matter

1. Lower cost and faster queries
Default OTel Guest OS metrics are available at no additional cost. They are stored in Azure Monitor Workspace using metric-optimized storage and pricing, providing lower cost and faster query performance compared to LA-based metrics.

2. Per-process visibility for deeper troubleshooting
Customers can optionally enable per-process metrics for deeper visibility into VM resource consumption. This helps identify noisy processes, memory leaks, runaway jobs, or resource-intensive applications without manually signing into the VM.

3. Consistent metrics across Windows and Linux
Use the same metric names, dashboards, and alerts across operating systems without maintaining separate monitoring configurations.

4. Native PromQL support
Use PromQL with the scale and managed experience of Azure Monitor Workspace.

5. OpenTelemetry-based standardization
Use the same metrics across Azure Monitor, existing OTel pipelines, or other compatible observability backends.

Log Analytics (LA)-based metrics vs OTel-based metrics

Customers running workloads on Azure VMs and Arc-enabled Servers have long relied on Log Analytics (LA)-based metrics for fleet visibility. That experience continues to be generally available and trusted by thousands of customers. We recommend evaluating your requirements to determine which approach best suits your needs. LA-based metrics remain the foundation for customers who need advanced analytics and correlation, while OTel-based metrics open new possibilities for modern VM observability. Learn more.

New Capabilities Powered by OpenTelemetry

VM monitoring experience powered by OpenTelemetry (GA)

We're excited to announce the general availability of the enhanced monitoring experience for Azure VMs and Arc servers. This experience brings comprehensive monitoring capabilities in a single, streamlined view, helping you more efficiently observe, diagnose, and optimize your virtual machines.

The new experience offers two levels of insight within one unified interface:

Basic view (Host OS-based): Available for all Azure VMs with no configuration required. This view surfaces key host-level metrics including CPU, disk, and network performance for quick health checks.
Detailed view (Guest OS-based): Requires simple onboarding. Azure Monitor continues to support the GA detailed view powered by Log Analytics-based metrics. Customers can now choose to power the experience using OTel Guest OS metrics, which enable recommended alerts and provide expanded visibility into Guest OS and process-level resource consumption, including CPU, memory, disk I/O, and networking.
Dashboards with Grafana for VMs

For deeper analysis and customization, customers can leverage Azure Monitor dashboards with Grafana powered by OTel Guest OS metrics and PromQL at no additional cost. Built-in dashboards provide out-of-the-box visualizations for at-scale monitoring, host-level monitoring, Guest OS monitoring, and per-process monitoring, while still allowing teams to:

Customize panels and dashboards
Run ad hoc investigations
Import dashboards from the Grafana community
Share dashboards using Azure RBAC and ARM/Bicep deployment support

Together, the enhanced VM monitoring experience and Grafana dashboards provide both streamlined day-to-day monitoring and flexible deep troubleshooting capabilities for modern VM environments.

Query metrics in the context of your resources (GA)

We’re also announcing the general availability of resource-scope querying for Azure Monitor Workspace (AMW) metrics, including OTel Guest OS metrics. With resource-scope query, you can query metrics directly from the context of a resource, resource group, or subscription, without needing to know which workspace stores the data. This simplifies troubleshooting, aligns with Azure-native workflows, and enforces least-privilege access using Azure RBAC.

This capability powers scenarios like querying OTel Guest OS metrics directly from the Virtual Machine resource in the Azure portal, or scoping resources as a dedicated data source in Grafana to query with PromQL, making it easier for application and infrastructure teams to monitor and troubleshoot in the context of their workloads.

Coming soon: Observability Agent Troubleshooting for VMs (Public Preview)

Today, the Observability Agent helps customers investigate issues by correlating applications, infrastructure signals, LA-based metrics, logs, alerts, health information, and recent changes into a guided investigation narrative.
Support for OTel Guest OS metrics is coming soon, extending investigations with richer Guest OS and per-process visibility. With OTel Guest OS metrics, the Observability Agent will be able to incorporate finer-grained operating system and process-level insights into its analysis, helping customers more quickly identify resource bottlenecks and understand their impact on application performance.

Instead of manually piecing signals together across multiple tools and timelines, customers will receive a guided investigation summary with likely causes and recommended next steps. Combined with the new VM monitoring experience and Grafana dashboards, customers will have both AI-assisted investigations and powerful manual troubleshooting tools built on the same OTel foundation.

Onboarding VMs at scale to OpenTelemetry

Onboarding Azure VMs and Arc-enabled Servers to OTel Guest OS metrics is now simpler and more cost-efficient than ever. For teams getting started at scale, the easiest path is through the Monitoring Coverage experience in the Azure portal, where you can review recommended resources and onboard VMs through a guided workflow. Customers that prefer infrastructure-as-code can use ARM and Bicep templates to apply the same monitoring configuration programmatically.

Azure Advisor recommendations provide another seamless entry point for onboarding, proactively identifying VMs that are not fully monitored and guiding customers to enable OTel-based monitoring with a few clicks. This helps teams continuously improve coverage across their fleet without needing to manually audit resources.

Customers can now also reuse an existing Data Collection Rule (DCR) during onboarding, making it easier to standardize monitoring across large VM fleets. After onboarding, teams can centrally evolve their monitoring configuration by updating that DCR to collect additional metrics and logs, with changes applying across all associated VMs.
Get Started

Explore the new OpenTelemetry-powered experiences today:

Enable enhanced monitoring for an Azure virtual machine - Azure Monitor
Migrate from logs-based to OpenTelemetry metrics for Azure virtual machines - Azure Monitor
Metrics experience for virtual machines in Azure Monitor - Azure Monitor
Use Dashboards with Grafana for Azure Virtual Machines - Azure Monitor

PUBLIC PREVIEW - Azure Monitor - Collect Azure Resource Platform Logs at Scale with DCRs
How DCR-based platform logs simplify telemetry collection for organizations managing 1,000+ resources.

Azure Monitor SLIs now Generally Available
Azure Monitor SLIs are now generally available

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in Azure Monitor are now generally available. Teams can now measure reliability based on customer experience, not just infrastructure signals.

SLI: A quantitative measure of how well an application or service is performing from the customer’s point of view.
SLO: A defined target for an SLI that represents how good or bad the SLI is over a given time period. This is also referred to as a baseline in Azure Monitor.

Traditional monitoring shows what is happening across your systems, but not always what customers are experiencing. A service can be technically available and still feel unreliable because of latency, partial failures, or dependency issues. SLIs help close that gap by measuring reliability from the customer’s point of view.

With GA, Azure Monitor now brings SLI authoring, SLO tracking, error budgets, and burn rate-based alerting into one experience, helping teams focus on whether they are meeting the reliability their customers expect.

What Azure Monitor SLI helps you do

Azure Monitor SLI lets you measure availability and latency with either request-based or window-based evaluation methods. In Azure Monitor, SLIs are defined at the Service Group level, which provides a logical representation of your application across multiple resources. This gives teams a clearer view of application health, customer impact, and the signals that matter most.

SLIs continuously evaluate your service by using existing Azure Monitor metrics and store the resulting evaluations in your Azure Monitor Workspace. Azure Monitor uses these SLI evaluations to power error budgets, burn rate visualization, and alerting. This helps teams spot reliability issues earlier and make better release and incident response decisions.

Get started

To get started, you’ll need:

A Service Group.
Application metrics flowing into an Azure Monitor Workspace, for example through Managed Prometheus or OpenTelemetry (see: Collect and analyze OpenTelemetry data with Azure Monitor (Preview) - Azure Monitor | Microsoft Learn).

Learn more here.

Summary

Azure Monitor SLI helps teams measure customer experience, track reliability against clear targets, and respond sooner with error budgets and burn rate-based alerting. Learn more in the product documentation and start defining SLIs for your services in Azure Monitor today.

Azure Monitor Metrics Export Generally Available
Today, we’re excited to announce the general availability of Azure Monitor Metrics Export using data collection rules (DCRs): a scalable, flexible way to continuously export platform metrics with dimensional fidelity, lower latency, and more control over what you send downstream.

Azure Monitor Metrics Export is configured through data collection rules and can route platform metrics to Azure Storage accounts, Azure Event Hubs, or Azure Log Analytics workspaces. Compared to diagnostic settings, DCR-based metrics export supports multidimensional metrics, metric-name filtering, and improved scalability for large environments.

Here are some of the key benefits of Azure Monitor Metrics Export:

Control what you export: You can export all supported metrics for a resource type or filter to specific metric names, helping reduce downstream volume and manage cost.
Preserve dimensional fidelity: The DCR-based metric export supports multidimensional metrics, making downstream analysis and correlation more meaningful.
Get faster export latency: End-to-end export latency is typically within about 3 minutes, improving time to insight for operational and analytics workflows.

With Azure Monitor Metrics Export, organizations can build more scalable observability pipelines, route metrics to the destinations that fit their architecture, and unlock richer analysis for operations, reporting, and integration scenarios.

What’s new in GA

With general availability, Azure Monitor Metrics Export offers a production-ready path to continuously stream supported platform metrics using data collection rules.

Azure Monitor Metrics Export now covers 44 Azure regions, up from 12 regions previously. This expanded footprint helps more customers adopt DCR-based metrics export closer to where their resources run, improving rollout flexibility for global deployments.
Customers can export metrics to Azure Storage, Azure Event Hubs, or Azure Log Analytics, preserve metric dimensions, and filter by metric name to better control downstream volume and cost. Learn more about metrics export using data collection rules.

We’re excited to make Azure Monitor Metrics Export generally available and look forward to seeing how customers use it to build more reliable, cost-conscious, and extensible monitoring solutions on Azure.

Azure Monitor Copilot Observability Agent: What’s new at Build
The Observability agent in Azure Copilot is an AI-powered assistant built into Azure Monitor that helps engineers investigate issues and explore their systems using natural language. By grounding its analysis in telemetry data such as metrics, logs, and traces, it supports both open-ended exploration and guided troubleshooting. For more details, see the documentation.

Since our initial public preview, the Observability agent in Azure Copilot has continued to evolve with new capabilities and expanded coverage. (You can read more about the initial release in our previous blog.)

At Build 2026, we’re introducing updates that expand the Observability agent’s capabilities and the range of scenarios it can support. These updates provide deeper analysis and more detailed responses for both exploration and investigation.

Expanded Investigation Scenarios

The Observability agent now supports a broader set of scenarios across applications and infrastructure. These can be accessed directly from relevant product experiences, without requiring a prior alert, allowing teams to explore data conversationally and initiate deeper investigations as signals emerge.

Integration with Microsoft Foundry AI Agents

The Observability agent integrates with Microsoft Foundry AI Agents, enabling correlation of signals across key generative AI and agent observability scenarios such as latency spikes, error patterns, and tool invocation failures. Teams can interact with the Observability agent either from alerts (including alerts based on Foundry telemetry) or directly within Application Insights, where the Agents details experience serves as the primary entry point. From there, users can use the Observability agent to diagnose errors, analyze trends, and explore their data across one or multiple agents.
Application Insights integration

The Observability agent enables investigation of failure scenarios directly from the Application Insights Failures blade, allowing teams to analyze application-level issues and move from symptom to root cause.

Azure Kubernetes Service (AKS) integration

The Observability agent enables deep investigation of issues in Azure Kubernetes Service (AKS) clusters. AKS investigations correlate signals from Azure Monitor with Kubernetes logs and events, and (coming soon) Prometheus metrics stored in an Azure Monitor Workspace. Together, these signals enable full-stack analysis of applications running on AKS. The Observability agent helps teams determine whether an issue originates from the application or from the underlying Kubernetes platform, reducing time to diagnosis and resolution.

Activity Logs integration

Investigations can be initiated based on Azure Resource Health events surfaced in Activity Logs, enabling analysis of service-impacting signals related to the Azure platform.

Deeper Insights across systems

Multiple Application Insights resources (coming soon)

The Observability agent supports investigations that can span multiple Application Insights resources, enabling scenarios that involve multiple services within distributed applications. The agent can guide users to expand the investigation scope when cross-service issues are detected.

Integration with Azure Service Health

The Observability agent correlates investigation context with Azure Service Health events, helping teams understand potential platform impact as part of their investigation. This helps distinguish application-level issues from broader Azure platform conditions and prioritize active impacts.
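As a rough illustration of what correlating an investigation with Service Health events involves, the Python sketch below checks which platform events overlap an incident time window. It is not the agent's actual logic. The `HealthEvent` type, the `overlapping_events` function, the 15-minute padding default, and the sample events are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class HealthEvent:
    """Hypothetical, simplified stand-in for a platform health event."""
    title: str
    start: datetime
    end: datetime


def overlapping_events(window_start, window_end, events,
                       padding=timedelta(minutes=15)):
    """Return events whose time span intersects the padded incident window.

    Two intervals overlap when each one starts before the other ends.
    The padding widens the incident window to catch events that end
    shortly before, or start shortly after, the observed symptoms.
    """
    lo = window_start - padding
    hi = window_end + padding
    return [e for e in events if e.start <= hi and e.end >= lo]


# Sample data: one incident window and three made-up platform events.
incident_start = datetime(2026, 1, 1, 10, 0)
incident_end = datetime(2026, 1, 1, 10, 30)

events = [
    HealthEvent("A", datetime(2026, 1, 1, 9, 50), datetime(2026, 1, 1, 10, 20)),
    HealthEvent("B", datetime(2026, 1, 1, 12, 0), datetime(2026, 1, 1, 13, 0)),
    HealthEvent("C", datetime(2026, 1, 1, 9, 0), datetime(2026, 1, 1, 9, 50)),
]

candidates = overlapping_events(incident_start, incident_end, events)
strict = overlapping_events(incident_start, incident_end, events,
                            padding=timedelta(0))
```

An empty result suggests the symptoms are more likely application-level. A non-empty result is only a candidate for platform impact and would still need to be matched on region and affected service.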
Issue management enhancements

Viewing issues

Issues can now be viewed in multiple places, depending on the required scope:

Azure Monitor: showing issues across all Azure Monitor Workspaces (AMWs) under the selected subscriptions
Azure Monitor Workspace: showing issues stored within a specific AMW

Issue actions & notifications

Issue actions trigger notifications when issues are created or updated, enabling integration with workflows such as email, webhooks, and automation.

Sharing and follow-up

You can now download investigation results as a PDF, including supported data, enabling teams to capture and share investigation context for incident reviews and reporting.

Coming soon

Billing for the Observability agent starts on July 1, 2026. The agent uses a consumption-based pricing model, so customers pay only for the AI work the agent performs. Agent consumption is measured in Azure Agent Credit (AAC) units, which reflect how many LLM tokens the agent used. For more details, see the documentation.

Stay connected

Follow this blog for ongoing updates and deeper dives into new capabilities.
Join our upcoming webinar for real-world scenarios, best practices, and a look at what’s coming next. 👉 Register here

We’d love your feedback

The Observability agent continues to evolve based on real-world usage and customer feedback. Share feedback through the Give Feedback option in the product or contact us at: azureobsagent@microsoft.com

Want to learn more?

Read our previous blog posts:
Public Preview Update: Azure Copilot Observability Agent | Microsoft Community Hub
The Azure Copilot Observability Agent Chat - Stop Writing Queries, Start Asking Questions. | Microsoft Community Hub

Explore our documentation:
Azure Copilot observability agent (preview) - Azure Monitor | Microsoft Learn