application insights

52 Topics

Azure Monitor Observability Agent goes autonomous (preview)
Autonomous operations for the Azure Copilot Observability Agent are now in public preview, alongside the agent's general availability. With autonomous operations enabled, the Observability Agent listens to your alerts as they fire, triages them in the background, and runs deep investigations on the issues it creates. Along the way, it correlates related alerts into a single issue - so your team starts from a small set of explained, investigated issues instead of a stream of raw alerts. Until now, teams invoked the agent when they needed it - an interactive assistant, ready to investigate when you pointed it at a problem. Now it also prepares triage context continuously, on its own, while people stay responsible for decisions and any change to the environment. See autonomous operations in action. The Observability Agent triages incoming alerts, correlates related ones into a single Azure Monitor issue, and runs a deep investigation automatically, with no human trigger. From alerts to answers Azure Monitor already gives you strong signals when something is wrong - across both metric and log alerts. Dynamic thresholds learn normal behavior and flag anomalies automatically, and that same anomaly detection now extends to log search alerts and, in preview, to Prometheus and OpenTelemetry metrics. Smart detection in Application Insights surfaces failures and performance anomalies without manual rules. The hard part is what happens next: connecting dozens of alerts, working out what they share, and figuring out what's actually going on - before anyone can act. That's the work that still lands on a person, often in the middle of the night. It's exactly where the Observability Agent comes in. What's in the public preview In public preview, you can enable the Observability Agent to: Promote individual prominent alerts into issues when you configure that with custom instructions. Run a deep investigation automatically on every issue it creates. 
Correlate related alerts into a single Azure Monitor issue, with a natural-language explanation of why they belong together. You provision the agent once as a resource in your Azure environment - a dedicated identity to scope, govern, and assign autonomous tasks to - then turn on autonomous operations and it gets to work. What it changes for your team The outcome is fewer things to look at and faster triage: Your team works from a short queue of meaningful issues, not a constant stream of alerts. Each issue arrives with context, reasoning, and an investigation already attached. Low-priority issues can be reviewed and dismissed in seconds. The assembly work that used to come first now happens before anyone is paged. People still make every decision and every change. The agent just makes sure they start with full context. How it works Your own instructions. Topology shows how services connect, but your team knows which boundaries matter: ownership, escalation paths, and the alerts that should always become issues. Custom instructions let you capture that in plain language and apply it going forward. For example: "The billing service is owned by a different team with a separate on-call rotation. Even when billing alerts fire alongside clinical service alerts, treat them as separate issues." Instructions shape how the agent correlates and creates issues. They don't grant permissions, bypass Azure RBAC, or change resources. Automatic topology discovery. Point the agent at your Application Insights resource and it maps services, dependencies, and how they relate. That map becomes persisted knowledge the agent builds and reuses - the same context that grounds both its correlation decisions and its deep investigations, so reasoning reflects your real architecture instead of starting from scratch each time. Deeper investigations. 
When the agent investigates a correlated issue, it starts from the whole picture: every related alert, every impacted resource, and the reasoning correlation already produced. The result is sharper root-cause hypotheses and recommendations that account for the full scope of impact. In practice A database latency spike triggers alerts across checkout, billing, and recommendation services. Without autonomous operations, each alert is triaged on its own. With autonomous operations enabled, the Observability Agent groups the related alerts into one issue, explains the shared timeline, and starts investigating automatically. Because your custom instructions define billing as a separate ownership boundary, its alerts become a distinct issue routed to that team's rotation. Responders start from two clear, ownership-aligned issues - each already investigated - instead of dozens of isolated alerts. What's next Autonomous operations mark the next step for the Observability Agent: from user-invoked analysis to continuous preparation. The agent assembles the context, explains the issue, and runs the investigation; your team reviews the evidence and decides what to do. And once issues are created, you can act on them. Azure Monitor issues connect to Action Groups, so approved actions can flow into your existing workflows - more on that in a future post. Next steps Learn how to get started with the Azure Copilot Observability Agent. Review the preview details in Autonomous operations in the Observability Agent. Explore how investigations work in Deep investigations in the Observability Agent. Learn how teams preserve context with Azure Monitor issues. Stay connected Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what's coming next. Live webinar A walkthrough of real Observability Agent scenarios, best practices, and what's available today, along with a look at what's coming next and live Q&A with the product team. 
Register for the Observability Agent webinar We'd love your feedback The Observability Agent continues to evolve based on real-world usage and operator feedback. Share your thoughts directly through the Give Feedback option in the experience or reach us at azureobsagent@microsoft.com.
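The "In practice" scenario in this post can be pictured with a small sketch. The Python below is an illustration only and not the agent's actual correlation logic: it groups alerts that fire close together in time into one issue, and keeps any service that custom instructions mark as a separate ownership boundary in its own issue. The `Alert` type, the `correlate` function, the `separate_owners` parameter, and the 10-minute window are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; not an Azure Monitor type."""
    service: str
    fired_at: datetime


def correlate(alerts, separate_owners=frozenset(), window=timedelta(minutes=10)):
    """Group alerts into issues by time proximity and ownership boundary.

    Alerts from a service listed in `separate_owners` go to that service's
    own stream; all other alerts share one stream. Within a stream, a new
    issue starts when the gap to the previous alert exceeds `window`.
    """
    streams = {}  # stream key -> list of issues (each issue is a list of alerts)
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        key = alert.service if alert.service in separate_owners else "shared"
        stream = streams.setdefault(key, [])
        if stream and alert.fired_at - stream[-1][-1].fired_at <= window:
            stream[-1].append(alert)   # close in time: same issue
        else:
            stream.append([alert])     # first alert or large gap: new issue
    return [issue for stream in streams.values() for issue in stream]


t0 = datetime(2026, 7, 8, 2, 0)
alerts = [
    Alert("checkout", t0),
    Alert("billing", t0 + timedelta(minutes=1)),
    Alert("recommendation", t0 + timedelta(minutes=2)),
    Alert("billing", t0 + timedelta(minutes=3)),
    Alert("checkout", t0 + timedelta(minutes=4)),
]
issues = correlate(alerts, separate_owners={"billing"})
for issue in issues:
    print(sorted({a.service for a in issue}), len(issue))
```

With these five sample alerts the sketch yields two issues: one covering checkout and recommendation, and one for billing alone. That mirrors the two ownership-aligned issues described in the scenario. A real correlation step would also draw on topology and alert content, which this toy deliberately leaves out.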
Efrat_Ben_Porat
Jul 08, 2026 Place Azure Observability Blog
284 Views
1 like
0 Comments
Find anomalies in Prometheus and OpenTelemetry metrics with Dynamic Thresholds (Preview)
Dynamic thresholds now extend to query-based metric alerts in Azure Monitor, so you can detect and alert on anomalies in Azure Monitor managed Prometheus metrics and OpenTelemetry metrics stored in an Azure Monitor Workspace. This follows the introduction of Dynamic Thresholds for Log search alerts. Azure Monitor now offers consistent Dynamic Thresholds support across logs and metrics: platform metrics, log search queries, and now query-based metric alerts. The result is a consistent anomaly-detection approach, wherever your signals live. Dynamic thresholds are not a single static formula. They apply a range of machine-learning models and algorithms to historical query results, learn each series’ normal rhythm (including hourly, daily, and weekly seasonality), and automatically fit the most appropriate baseline separately to every time series. This way, a single alert rule can monitor many resources or dimensions while each one gets its own independent, self-refining baseline. Why Dynamic Thresholds Matter Simpler configuration: Reduce the need to define, maintain, and continuously tune static thresholds inside PromQL alert logic. Adaptive monitoring: Let alert thresholds adjust to changing workload behavior, recurring traffic peaks, and seasonal usage patterns. At-scale intelligence: Monitor multiple time series with a single alert rule, while Azure Monitor learns an independent baseline for each resource or dimension combination. Example 1 - Spot CPU anomalies in AKS workloads Scenario: Monitor container CPU utilization across pods or deployments in AKS with a query-based metric alert built on Prometheus metrics. 
Example query: sum by (microsoft_resource_id, namespace, deployment, container) (rate(container_cpu_usage_seconds_total[5m])) / sum by (microsoft_resource_id, namespace, deployment, container) (container_spec_cpu_quota / container_spec_cpu_period) Why dynamic thresholds help: CPU usage of a Kubernetes workload changes with workload mix, deployment timing, scaling activity, and traffic patterns. Static thresholds can be difficult to tune across namespaces, deployments, and containers. Dynamic thresholds learn a separate baseline for each monitored time series (in this example, for every namespace, deployment, and container combination that the query groups by), so genuine CPU spikes stand out while expected variation from autoscaling and traffic mix stays quiet. Example 2 - Catch application latency regressions sooner Scenario: Detect abnormal latency patterns in an application by alerting on custom OpenTelemetry metrics stored in an Azure Monitor Workspace. Example query: histogram_quantile(0.95, sum by (le, service_name, http_route, http_method) (rate(http_server_duration_seconds_bucket[5m]))) Why dynamic thresholds help: Application latency naturally changes with traffic, user behavior, and release cadence. Fixed thresholds can be noisy during peak periods and too loose during quiet ones. Dynamic thresholds learn a separate baseline for each time series (here, for every service, route, and method), so real p95 latency regressions surface even as traffic and release cadence shift throughout the day. Best practices for better results To get the best results from dynamic thresholds for PromQL-based alerts, design your query so Azure Monitor can learn a clear, stable signal over time: Keep the expression numeric. Dynamic thresholds work best when the query returns a continuous numeric signal rather than a Boolean true/false result. For example, use an expression that calculates CPU usage, not a Boolean comparison like CPU > 0.8. Use meaningful dimensions. 
Split by dimensions such as namespace, deployment, service, or route when you want separate baselines for different workloads or endpoints. Prefer stable entities. Use longer-lived dimensions or aggregate across short-lived entities so the model has enough consistent history to learn from. In Kubernetes, for example, deployment is usually a better baseline dimension than individual pod ID. Choose the right threshold behavior. Decide whether the alert should trigger on values above the learned upper bound, below the lower bound, or both. Start with medium sensitivity. Use Medium as a balanced default, then tune up or down based on noise and missed anomalies. Allow enough historical data. Dynamic thresholds improve as more history is collected. Initial seasonal patterns use recent history, and weekly seasonality becomes more effective after several weeks of data. Get started Ready to try it? Create a query-based metric alert with dynamic thresholds on your metrics in Azure Monitor Workspace. You can create such rules in the Azure portal, where the built-in preview chart shows when your dynamic threshold alert would have fired based on historical baseline analysis. Use the preview chart to tune both the PromQL query and the dynamic threshold sensitivity before enabling the rule. You can also create query-based metric alert rules using programmatic interfaces or resource templates. Figure 1. Dynamic thresholds preview chart showing the learned baseline and the points where an alert would have fired. Dynamic thresholds cut alert noise where it starts — at detection. The alerts that do fire connect into Azure Monitor’s broader AIOps experience, where the Azure Copilot Observability Agent can help correlate signals into investigated issues with explainable reasoning — with humans in control. 
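The post does not describe the models behind dynamic thresholds, so the sketch below is only a toy illustration of the core idea: one baseline per time series instead of one static number for all of them. It uses a plain mean plus or minus k standard deviations. The function names, the sensitivity-to-width mapping, and the sample values are assumptions made for this example, not Azure Monitor behavior.

```python
from statistics import mean, stdev

# Assumed mapping from sensitivity to band width; illustrative values only.
BAND_WIDTH = {"high": 2.0, "medium": 3.0, "low": 4.0}


def learn_baselines(history, sensitivity="medium"):
    """Return {series_key: (lower, upper)} from per-series historical values."""
    k = BAND_WIDTH[sensitivity]
    baselines = {}
    for key, values in history.items():
        mu, sigma = mean(values), stdev(values)
        baselines[key] = (mu - k * sigma, mu + k * sigma)
    return baselines


def find_anomalies(baselines, latest, direction="both"):
    """Flag series whose latest value falls outside its own learned band."""
    flagged = []
    for key, value in latest.items():
        lower, upper = baselines[key]
        above = value > upper and direction in ("above", "both")
        below = value < lower and direction in ("below", "both")
        if above or below:
            flagged.append(key)
    return flagged


# Keys mirror the (namespace, deployment) dimensions a query might split by.
history = {
    ("shop", "checkout"): [0.20, 0.22, 0.19, 0.21, 0.20, 0.23],
    ("shop", "batch"): [0.70, 0.85, 0.65, 0.90, 0.75, 0.80],
}
latest = {("shop", "checkout"): 0.60, ("shop", "batch"): 0.88}

baselines = learn_baselines(history)
anomalies = find_anomalies(baselines, latest)
print(anomalies)
```

In this sample data a CPU ratio of 0.60 is flagged for the normally quiet checkout deployment, while 0.88 is within the usual range for the batch deployment. A single static threshold such as 0.8 would have done the opposite, which is the tuning problem the post describes. The sketch ignores seasonality entirely.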
Next steps Related blog: Anomaly detection made easy with Dynamic thresholds for Log search alerts Dynamic thresholds in Azure Monitor Query-based metric alerts overview Create query-based metric alerts Prometheus metrics in Azure Monitor OpenTelemetry on Azure Monitor Stay connected Follow the Azure Observability Blog for more updates on Azure Monitor, Prometheus-based monitoring, alerting, and troubleshooting experiences. We’ll continue sharing product updates, practical guidance, and examples to help you improve observability across your Azure environments. Feedback We’d love to hear how dynamic thresholds for query-based metric alerts work for your scenarios. Share your feedback through your Microsoft account team, Azure support channels, or the feedback options in the Azure portal so we can continue improving the experience.
yairgil
Jul 02, 2026 Place Azure Observability Blog
114 Views
0 likes
0 Comments
Azure Copilot Observability Agent is generally available, with autonomous operations in preview
Complex cloud environments have outpaced manual operations. Agentic cloud operations connect people, tools, and data to streamline investigation workflows and move teams from scattered signals to evidence-backed next steps. With unified observability, teams can investigate Azure-monitored applications, Azure Kubernetes Service (AKS) environments, VMs, Foundry telemetry, infrastructure, and platform signals with greater context and control. Powered by Azure Monitor, the Azure Copilot Observability Agent is now generally available. It helps engineering, SRE, DevOps, and operations teams move from telemetry and alert noise to investigated issues, explainable reasoning, and recommended next steps that can reduce Time-To-Mitigate (TTM). Autonomous operations are also available in public preview. They help prepare context and reduce triage work while people remain responsible for mitigation decisions and any changes to the environment. From alert noise to investigated issues The Observability Agent helps teams reduce the effort required to understand operational problems. Instead of starting every investigation from a dashboard, query editor, or alert payload, teams can work with an AI companion that reasons across telemetry, Azure resource context, discovered topology, and custom instructions to identify what changed, what is correlated, and what evidence supports the conclusion. Teams can start with natural-language exploration and continue into deeper investigations when an issue requires more evidence. That light-to-deep workflow helps responders move from broad questions to a structured investigation without losing the reasoning trail. Here's what this looks like in practice: after a deployment, several alerts might fire across an app, database dependency, and compute resource. 
The Observability Agent can group those signals around the affected service, identify when the regression started, compare related dependencies and infrastructure metrics, and capture the findings in an Azure Monitor issue. The responder can then validate the evidence, add team context, route work to the right owner, and decide whether a rollback, configuration change, or code fix is appropriate. Explainable investigations across Azure-monitored signals Operations teams need more than a chatbot that answers questions. The Observability Agent follows an investigation workflow: it frames hypotheses, gathers evidence, compares signals by time, scope, and type, rules out weak explanations, and shows the reasoning path behind its findings. The Observability Agent can help teams: investigate incidents and alerts across Azure-monitored applications, Azure Kubernetes Service (AKS) environments, VMs, Foundry telemetry, infrastructure, and platform signals; correlate related signals to reduce noise and surface higher-signal issues with context; explore telemetry using natural language while preserving transparency into the supporting data; compare signals by time, scope, and type to separate likely causes from coincidental changes; provide a reasoning trail that shows what the agent found, what it ruled out, and why; and recommend next steps that engineers can review before deciding how to act. This same investigation model applies to specialized skills and issue types, including customer application, Azure Kubernetes Service (AKS), Foundry, VM, and GenAI issues. When the relevant telemetry is available, the Observability Agent can correlate logs, metrics, traces, alerts, dependencies, resource graph, resource health, activity logs, Foundry telemetry, and changes. 
This helps teams investigate customer-visible issues with evidence, including latency, token spikes, tool-call failures, agent errors, hallucinations, deployments, API failures, performance regressions, infrastructure dependencies, and platform incidents. This explainability is central to the product. In production operations, trust is earned through evidence. The Observability Agent is built to support human judgment, not bypass it. Azure expertise, with context from your environment Context matters in every investigation. The same symptom can mean different things depending on application architecture, recent deployments, dependencies, historical incidents, and team practices. The Observability Agent brings Microsoft and Azure operational knowledge into the investigation experience. It can use discovered topology, Azure resource context, logs, metrics, traces, and custom instructions to ground investigations in signals that are more relevant to your environment. Native to Azure Monitor, with humans in control Because the Observability Agent is built into Azure Monitor, teams can use it close to the telemetry, alerts, and workflows they already rely on. Investigations can also be captured as Azure Monitor issues, creating a shared case file for humans and agents to collaborate on evidence, reasoning, and next steps. The Observability Agent is designed for governed AI operations inside Azure Monitor. Interactive chat and investigations use the signed-in user's identity and Azure role-based access control (RBAC). Prompts and responses are not used to train foundation models, and the agent doesn't restart resources, change configuration, or resolve issues on its own. Autonomous operations in public preview Alongside general availability, autonomous operations for the Observability Agent are available in public preview. 
When enabled, the agent can analyze alerts in the background, correlate related alerts when they likely represent the same incident, create Azure Monitor issues automatically, and run deep investigations on agent-created issues. This automatic triage helps reduce alert noise by turning streams of individual alerts into higher-signal issues with context, findings, and recommended next steps. Teams can review the issue, continue the investigation, and decide what action to take. Autonomous operations are designed to prepare context and reduce triage work, not to remove human control. Engineers remain responsible for decisions, approvals, and any changes to the environment. Next steps Check out our latest announcements and related blogs: Azure Blog and OMB Blog. Learn how to use the Observability Agent in Azure Copilot Observability Agent. Explore how investigations work in Deep investigations in the Azure Copilot Observability Agent. Learn more about how to Chat with your observability data. Learn how teams preserve context in Azure Monitor issues. Review preview details in Autonomous operations in the Azure Copilot Observability Agent. Stay connected Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what's coming next. Live webinar - a walkthrough of real Observability Agent scenarios, best practices, and what's available today, along with a look at what's coming next and live Q&A with the product team. Register for the Observability Agent webinar. We'd love your feedback The Observability Agent continues to evolve based on real-world usage and operator feedback. Share your thoughts directly through the Give Feedback option in the experience, or reach us at enauerman@microsoft.com.
EfratNauerman
Jun 23, 2026 Place Azure Observability Blog
9.2K Views
6 likes
0 Comments
Accelerating AKS troubleshooting with the Azure Copilot Observability Agent
AKS incidents rarely stay within one Kubernetes object, signal, or tool. A latency spike might first appear in application telemetry, but the root cause may sit elsewhere: pod restarts, node pressure, scheduling failures, or a recent configuration change. The Azure Copilot Observability Agent in Azure Monitor helps connect these signals into an explainable investigation, so teams can move from symptoms to evidence-backed next steps. Why AKS troubleshooting is complex Troubleshooting Azure Kubernetes Service (AKS) is complex because failures can originate in workloads, platform components, infrastructure, or the application code running on the cluster. For example, pods stuck in Pending may indicate capacity or scheduling issues, while application latency may be caused by throttling, failed probes, pod restarts, or node pressure below the app. During an incident, simply having more telemetry is not enough. Teams need a way to test likely causes, rule out unrelated signals, and keep the investigation tied to the affected workload and time window. From signal to root cause: the investigation flow The Observability Agent follows a consistent investigation pipeline: Scope the problem by identifying the most likely infrastructure resources involved, plus connected dependencies. Collect data across metrics, logs, traces, change history, and related signals. Detect anomalies using learned baselines (for metrics) and log analysis. Correlate across resources spanning infrastructure and application layers. Run deep diagnostics by invoking resource-specific tools when needed to pinpoint root cause. Summarize findings in a structured format: what happened, why it happened, and what to do next. AKS investigation data sources The agent works with telemetry already available in your Azure Monitor environment. 
Investigation depth improves as more relevant signals are enabled, including Container insights logs, Kubernetes events and state, Azure Monitor managed service for Prometheus, container and pod logs, Application Insights telemetry for AKS-hosted workloads, Azure Activity Log changes, control plane logs routed through diagnostic settings, and resource metadata for the cluster, node pools, workloads, and related Azure resources. Figure 1. AKS investigation data sources You don’t need to enable every telemetry source to get started. The Observability Agent uses the data already available in Azure Monitor, and its findings become more complete as more AKS and application signals are collected. Example 1: AKS infrastructure - explaining why new pods never start Consider a workload rollout on AKS where replacement pods remain stuck in Pending state. What looks like a failed release may stem from the workload definition, cluster state, or underlying infrastructure. Investigation walkthrough Symptom: rollout is blocked Replacement pods remain in Pending during rollout, and Kubernetes events show repeated scheduling failures. This indicates that the rollout is blocked before new pods can start. Workload evidence: scheduling, not startup Pod state identifies the affected workload, while Kubernetes events show repeated placement failures. The issue is therefore tied to scheduling rather than application startup or container crash behavior. Cluster evidence: capacity pressure When enabled, Prometheus node metrics show CPU and memory utilization near capacity. Cluster-level trends show resource pressure increasing at the same time as pending pods and scheduling failures. Likely cause: insufficient schedulable capacity The scheduler cannot place new pods because the relevant node pool does not have enough available capacity. The failed rollout is best explained by capacity pressure in the target node pool rather than an application crash or image startup failure. 
Recommended action Scale out the affected node pool or adjust workload resource requests, then retry the rollout once schedulable capacity is restored. Figure 2. AKS investigation flow The Observability Agent connects pod state, scheduling events, and node pressure to explain why the rollout is blocked and which capacity action to consider next. Example 2: Joint app-AKS investigation — tracing application latency to pod restarts Now consider a customer-facing application where users see increased latency and intermittent HTTP 5xx errors after deployment. The first symptom appears in application telemetry, but the unhealthy requests are served by pods that are repeatedly restarting in AKS. Investigation walkthrough Symptom: customer-facing service degradation After deployment, application telemetry shows increased latency and HTTP 5xx errors. The first visible impact appears at the application layer. AKS evidence: unstable pods Affected pods enter CrashLoopBackOff, restart counts increase, and Kubernetes events show back-off restarts, probe failures, or image or command errors. Container logs point to startup exceptions, missing configuration, or crash details. Resource evidence: workload-specific pressure Container memory usage approaches configured limits before restarts, while node metrics show no broad node pressure. This suggests the issue is workload-specific rather than cluster-wide capacity related. Change evidence: deployment correlation Deployment history shows a new image or configuration change shortly before restarts began, with no matching platform health event. The timing points to the latest deployment or configuration change. Recommended action Review the latest image or configuration change, inspect container logs, adjust memory limits, or roll back if needed. Focus remediation on the workload change rather than node pool scaling. This pattern shows how an application symptom can map back to AKS workload behavior. 
Application telemetry establishes the user impact, while Kubernetes events, container logs, and resource metrics help explain why the affected pods keep failing. Operational impact For site reliability engineers, platform teams, and IT professionals, the Observability Agent reduces the time spent moving between application and AKS telemetry. It brings relevant signals into one investigation, surfaces supporting evidence, and applies Azure Monitor and AKS context so your team can review the findings, validate the recommended path, and decide which production changes to make. Figure 3. AKS investigation results Using the Observability Agent You can start using the Observability Agent from the Azure portal in two common AKS troubleshooting flows: Investigation mode: Start an investigation from an Azure Monitor alert on an AKS resource or from an Application Insights alert for an AKS-hosted workload. The agent uses the alert context to scope the incident, correlate application and cluster telemetry, and summarize the likely cause with recommended next steps. Chat-based exploration: Open the Monitor experience in AKS and select the Observability Agent button to chat with your telemetry. Use natural language to ask follow-up questions, explore logs and metrics, detect and inspect anomalies, and narrow down likely causes. Figure 4. Starting Observability Agent from AKS Monitor experience Next steps Azure Copilot Observability Agent overview Monitor Azure Kubernetes Service with Azure Monitor Stay connected Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what's coming next. Live webinar — A walkthrough of real Observability Agent scenarios, best practices, and what's available today, along with a look at what's coming next and live Q&A with the product team. Register for the Observability Agent webinar. We'd love your feedback The Observability Agent continues to evolve based on real-world usage and operator feedback. 
Share your thoughts directly through the Give Feedback option in the experience, or reach us at: azureobsagent@microsoft.com
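The reasoning in the two AKS walkthroughs can be condensed into a small decision sketch. This is an illustration of how the described evidence points to different causes, not how the Observability Agent is implemented. The `Evidence` type, the `likely_cause` function, and the 0.9 pressure threshold are assumptions made for this example. `FailedScheduling` and `BackOff` are the standard Kubernetes event reasons for unschedulable pods and crash-looping containers.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Hypothetical summary of signals gathered for one workload."""
    pending_pods: int = 0
    event_reasons: set = field(default_factory=set)  # Kubernetes event reasons
    node_cpu_util: float = 0.0           # 0..1, highest across the node pool
    node_memory_util: float = 0.0        # 0..1, highest across the node pool
    container_memory_ratio: float = 0.0  # usage / limit before restarts
    recent_deployment: bool = False


def likely_cause(ev, pressure=0.9):
    """Return a coarse hypothesis string; the threshold is illustrative."""
    node_pressure = max(ev.node_cpu_util, ev.node_memory_util) >= pressure
    # Example 1: pods never start, scheduler cannot place them, nodes are full.
    if ev.pending_pods and "FailedScheduling" in ev.event_reasons and node_pressure:
        return "insufficient schedulable capacity in the node pool"
    # Example 2: pods crash-loop while nodes are healthy, so look at the workload.
    if "BackOff" in ev.event_reasons and not node_pressure:
        if ev.container_memory_ratio >= pressure or ev.recent_deployment:
            return "workload-specific failure tied to the latest change"
    return "inconclusive; gather more evidence"


rollout = Evidence(
    pending_pods=4,
    event_reasons={"FailedScheduling"},
    node_cpu_util=0.95,
    node_memory_util=0.92,
)
crash_loop = Evidence(
    event_reasons={"BackOff", "Unhealthy"},
    node_cpu_util=0.40,
    node_memory_util=0.50,
    container_memory_ratio=0.97,
    recent_deployment=True,
)
print(likely_cause(rollout))
print(likely_cause(crash_loop))
```

The first case maps to the recommended action of scaling the node pool or adjusting resource requests. The second maps to reviewing or rolling back the latest image or configuration change. Anything that fits neither pattern is reported as inconclusive, without forcing a guess.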
yairgil
Jun 17, 2026 Place Azure Observability Blog
260 Views
0 likes
0 Comments
Inside the Observability Agent: How Deep Investigations and Reasoning Work
Deep investigation in the Azure Copilot Observability Agent turns observability data into a verified, data-backed explanation of what happened during an incident, correlating application, infrastructure, and platform signals across time, scope, and type to produce a structured root cause analysis.
yalavi
Jun 15, 2026 Place Azure Observability Blog
484 Views
1 like
0 Comments
PUBLIC PREVIEW - Azure Monitor - Collect Azure Resource Platform Logs at Scale with DCRs
How DCR-based platform logs simplify telemetry collection for organizations managing 1,000+ resources.
Mahesh_Sundaram
Jun 14, 2026 Place Azure Observability Blog
914 Views
2 likes
1 Comment
Azure Monitor Copilot Observability Agent: What’s new at Build
The Observability agent in Azure Copilot is an AI-powered assistant built into Azure Monitor that helps engineers investigate issues and explore their systems using natural language. By grounding its analysis in telemetry data such as metrics, logs, and traces, it supports both open-ended exploration and guided troubleshooting. For more details, see the documentation. Since our initial public preview, the Observability agent in Azure Copilot has continued to evolve with new capabilities and expanded coverage. (You can read more about the initial release in our previous blog.) At Build 2026, we’re introducing updates that expand the Observability agent’s capabilities and the range of scenarios it can support. These updates provide deeper analysis and more detailed responses for both exploration and investigation. Expanded Investigation Scenarios The Observability agent now supports a broader set of scenarios across applications and infrastructure. These can be accessed directly from relevant product experiences, without requiring a prior alert, allowing teams to explore data conversationally and initiate deeper investigations as signals emerge. Integration with Microsoft Foundry AI Agents The Observability agent integrates with Microsoft Foundry AI Agents, enabling correlation of signals across key generative AI and agent observability scenarios such as latency spikes, error patterns, and tool invocation failures. Teams can interact with the Observability agent either from alerts - including alerts based on Foundry telemetry - or directly within Application Insights, where the Agents details experience serves as the primary entry point. From there, users can use the Observability agent to diagnose errors, analyze trends, and explore their data across one or multiple agents. 
Application Insights integration The Observability agent enables investigation of failure scenarios directly from the Application Insights Failures blade, allowing teams to analyze application-level issues and move from symptom to root cause. Azure Kubernetes Service (AKS) integration The Observability agent enables deep investigation of issues in Azure Kubernetes Service (AKS) clusters. AKS investigations correlate signals from Azure Monitor with Kubernetes logs and events, and (coming soon) Prometheus metrics stored in an Azure Monitor Workspace. Together, these signals enable full-stack analysis of applications running on AKS. The Observability agent helps teams determine whether an issue originates from the application or from the underlying Kubernetes platform, reducing time to diagnosis and resolution. Activity Logs integration Investigations can be initiated based on Azure Resource Health events surfaced in Activity Logs, enabling analysis of service-impacting signals related to the Azure platform. Deeper Insights across systems Multiple Application Insights resources (coming soon) The Observability agent supports investigations that can span multiple Application Insights resources, enabling scenarios that involve multiple services within distributed applications. The agent can guide users to expand the investigation scope when cross-service issues are detected. Integration with Azure Service Health The Observability agent correlates investigation context with Azure Service Health events, helping teams understand potential platform impact as part of their investigation. This helps distinguish application-level issues from broader Azure platform conditions and prioritize active impacts. 
Issue management Enhancements Viewing issues Issues can now be viewed in multiple places, depending on the required scope: Azure Monitor: showing issues across all Azure Monitor Workspaces (AMWs) under the selected subscriptions Azure Monitor Workspace: showing issues stored within a specific AMW Issue actions & notifications Issue actions trigger notifications when issues are created or updated, enabling integration with workflows such as email, webhooks, and automation. Sharing and follow-up You can now download investigation results as a PDF, including supported data, enabling teams to capture and share investigation context for incident reviews and reporting. Coming Soon Billing for the Observability agent starts on July 1, 2026. The agent uses a consumption-based pricing model, so customers pay only for the AI work the agent performs. Agent consumption is measured in Azure Agent Credit (AAC) units, which reflect how many LLM tokens the agent used. For more details, see the documentation. Stay connected Follow this blog for ongoing updates and deeper dives into new capabilities Join our upcoming webinar for real-world scenarios, best practices, and a look at what’s coming next 👉 Register here We’d love your feedback The Observability Agent continues to evolve based on real-world usage and customer feedback. Share feedback through the Give Feedback option in the product or contact us at: azureobsagent@microsoft.com Want to learn more? Read our previous blog posts - Public Preview Update: Azure Copilot Observability Agent | Microsoft Community Hub The Azure Copilot Observability Agent Chat - Stop Writing Queries, Start Asking Questions. | Microsoft Community Hub Explore our documentation - Azure Copilot observability agent (preview) - Azure Monitor | Microsoft Learn
EfratNauerman
Jun 09, 2026, Azure Observability Blog
Connect Metrics to Traces with Exemplars in Azure Monitor
Following Microsoft’s recent GA announcement for OpenTelemetry (OTel) support, we are excited to announce support for Exemplars for customers instrumenting metrics with Prometheus or OpenTelemetry and traces using OpenTelemetry, enhancing Azure Monitor’s integrated observability experience for cloud-native applications. Modern cloud-native applications generate enormous volumes of telemetry. Metrics help teams detect that something is wrong, but traces explain why. Exemplars bridge these two worlds by attaching trace references directly to metric data points, making it dramatically easier to pivot from a spike in latency or errors to the exact distributed trace responsible for the issue. With Azure Monitor, customers can now ingest metrics with exemplars and visualize them in Azure Managed Grafana. This enables seamless correlation between metrics and traces, helping engineering teams troubleshoot issues faster and reduce mean time to resolution (MTTR). Why Exemplars Matter Traditional monitoring workflows often require users to manually correlate data across multiple systems. Exemplars simplify this workflow by embedding trace context directly into metric samples. For example, if a latency metric spikes at a specific timestamp, the exemplar associated with that data point can link directly to the distributed trace responsible for the outlier. 
This provides several benefits: faster root cause analysis; a quicker transition from aggregate metrics to request-level details; simplified debugging workflows for SRE and platform teams; and better observability experiences for microservices and distributed applications. Unified Observability with Azure Monitor With Azure Monitor and Azure Managed Grafana, you can now: ingest OTLP or Prometheus metrics with exemplars into an Azure Monitor Workspace; store and analyze traces in Azure Monitor Application Insights; visualize exemplar markers directly in Grafana charts; and navigate from a metric spike to the exact distributed trace associated with that data point. By combining these signals in a single observability platform, organizations can correlate infrastructure health, application behavior, and request traces without context switching between tools. How It Works Once metrics, exemplars, and traces are ingested into Azure Monitor, Azure Managed Grafana can consume exemplar information from the configured Prometheus data source. When exemplars are enabled in Grafana dashboards, users see markers associated with individual metric data points. Selecting an exemplar opens the associated trace in Azure Monitor, providing end-to-end diagnostic context. Getting Started Set up data ingestion: Instrument your application to emit OpenTelemetry traces and OpenTelemetry or Prometheus metrics with exemplars, and enable ingestion of both into Azure Monitor using the OpenTelemetry Collector. Follow the instructions in Ingest OTLP Data into Azure Monitor with OTel Collector - Azure Monitor | Microsoft Learn. After this step, you will have the Log Analytics Workspace, Azure Monitor Workspace, and Application Insights resources set up to store the telemetry data. Create an Azure Managed Grafana instance and connect it to the Azure Monitor Workspace by navigating to your Azure Monitor Workspace in the Azure portal and then clicking “Linked Grafana workspaces”.
To learn more, see Manage an Azure Monitor workspace - Azure Monitor | Microsoft Learn. Optionally, enable Azure Managed Prometheus on your AKS cluster, or use remote-write, and configure it to use the same Azure Monitor Workspace to centralize infrastructure and application metrics. Enable Exemplars in Azure Managed Grafana: After setting up data ingestion, ensure that logs and traces are flowing into the Log Analytics Workspace and metrics are flowing into the Azure Monitor Workspace. Step 1: Enable Exemplars on Prometheus Data Source in Azure Managed Grafana Navigate to Connections -> Data Sources in Azure Managed Grafana. Since you have connected Azure Managed Grafana to the Azure Monitor Workspace, you will see the data source (Managed_Prometheus_<AMW-Name>) already configured. If the data source is not configured, follow the steps here to add your Azure Monitor Workspace as a data source. Open the data source configuration. Click Add Exemplars to enable exemplar support. Step 2: Configure Trace Linking with Azure Monitor In the exemplar configuration section, toggle Internal Link to On. Select Azure Monitor as the data source. In Label Name, enter the name of the field in the labels object that holds the trace ID, e.g. trace_id. Click Save & Test. This configuration enables direct navigation from exemplar markers in Grafana charts to the associated traces stored in Azure Monitor. Azure Managed Grafana also supports trace correlation with other tracing solutions, such as Jaeger. To use your own trace solution, use the appropriate links. Step 3: Enable Exemplars in Dashboards Navigate to a Grafana dashboard that uses your configured Prometheus data source. Open the panel options for a metrics chart. Toggle Exemplars to On. Once enabled, exemplar markers appear on supported metric visualizations. Clicking a marker shows the exemplar details along with an option to open the corresponding distributed trace in Azure Monitor.
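To make the Label Name setting from Step 2 more concrete, here is a minimal sketch of what an exemplar carries. It renders a histogram bucket sample with an attached exemplar in the OpenMetrics text format, then reads the trace ID back out of the exemplar's label set. It uses only the Python standard library. The metric name, the trace ID value, and the helper names (render_sample, trace_id_from_sample) are hypothetical and are not part of Azure Monitor, Grafana, or any SDK; in practice an OpenTelemetry or Prometheus client library attaches exemplars for you.

```python
import re
from typing import Optional


def _escape(value: str) -> str:
    # Escape backslash, double quote, and newline inside a label value.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def _label_set(labels: dict) -> str:
    pairs = ",".join(f'{name}="{_escape(str(value))}"' for name, value in labels.items())
    return "{" + pairs + "}"


def render_sample(name, labels, value, exemplar_labels=None, exemplar_value=None) -> str:
    """Render one sample line; the exemplar follows the ' # ' separator."""
    line = f"{name}{_label_set(labels)} {value}"
    if exemplar_labels is not None and exemplar_value is not None:
        line += f" # {_label_set(exemplar_labels)} {exemplar_value}"
    return line


def trace_id_from_sample(line: str, label_name: str = "trace_id") -> Optional[str]:
    """Return the exemplar label used for trace linking, or None if absent.

    Simplified for illustration: assumes ' # ' does not occur inside label values.
    """
    _, separator, exemplar = line.partition(" # ")
    if not separator:
        return None
    match = re.search(rf'{re.escape(label_name)}="([^"]*)"', exemplar)
    return match.group(1) if match else None


# Hypothetical latency bucket with one exemplar pointing at a slow request.
line = render_sample(
    "http_request_duration_seconds_bucket",
    {"le": "0.5"},
    3,
    exemplar_labels={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    exemplar_value=0.43,
)
print(line)
# http_request_duration_seconds_bucket{le="0.5"} 3 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.43
print(trace_id_from_sample(line))
# 4bf92f3577b34da6a3ce929d0e0e4736
```

If your instrumentation emits the trace identifier under a different label, pass that name as label_name here and enter the same name in the Label Name field.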
To learn more, visit https://aka.ms/azmon-exemplars
sunayanasingh
Jun 08, 2026, Azure Observability Blog
New Capabilities to Observe Agents in Azure Monitor
Over the last six months, we have been listening to you and building new capabilities to help you observe your agents. You’ve been sharing with us that quality issues are tricky and evaluation is critical, that agent reasoning needs to be understood, that humans must be in the loop to review select agent interactions, and that security and privacy are essential. To address these concerns, we’re announcing several new capabilities that make agents a first-class artifact in Azure Monitor, so you can debug them in the context of your broader distributed application alongside non-agentic components. Microsoft Foundry remains the surface for building and evaluating agents within the context of your project, while Azure Monitor provides the full-stack observability platform and underlying data foundation that powers those experiences. Today, we’re announcing new capabilities in Azure Monitor across ingestion, performance, evaluation workflows, agent debugging, and instrumentation updates to help teams get telemetry faster, inspect agent behavior more deeply, and standardize observability across hosting environments and frameworks. What’s new Reducing pipeline latency from more than 60 seconds to 7.5 seconds at P90. This makes telemetry available faster for teams troubleshooting agents at scale. Emitting events up to 1MB and up to 256kB per attribute. Prompts and responses can get large, and this helps avoid data truncation. Introducing a new view that shows a list of all agents being monitored. Whether you use Microsoft Agent Framework, LangChain, Microsoft Copilot Studio, Foundry Hosting, AKS Hosting, or something else, they all show up here. Improving drill-in from Evaluations to underlying prompts/responses. Evaluations in Azure Monitor are powered by Foundry, and we continue to improve visuals. Showing conversation context in end-to-end transaction view. In chat agents, conversations have become critical glue that connects traces and eases debugging. 
Searching by text and showing prompt previews in end-to-end transaction view. Prompts and responses are essential to understanding agent logic, and now you can search by keyword in the Search and End-to-end transaction details views. Showing evaluation scores in end-to-end transaction details and sorting by evaluation score in Search. Evaluation is emerging as a “4th pillar” of telemetry, and you’ll see it surface more prominently across Azure Monitor Application Insights. Accessing the entire JSON blob of prompt/response text. This makes it easier to get to your underlying data and copy it out of Azure Monitor for custom analysis/evaluation. Adding a “trace tree” to enhance traversing the agent’s reasoning logic. This new addition to end-to-end transaction view makes traversing long traces much easier. Enabling builders to annotate (i.e., manual evaluations) from transaction details. Get rid of spreadsheets on the side and annotate from within Azure Monitor. Enabling capture of end-user feedback (i.e., thumbs up/down). Brings end-user feedback alongside other telemetry for more powerful troubleshooting. Extending AI-powered troubleshooting to agents. Observability agent offers full-stack, AI-powered troubleshooting and surfaces findings in an issue. Learn More. Observability of Coding Agents. Get end-to-end visibility into agent and model usage, performance, and cost with Azure Monitor Application Insights and built-in Grafana dashboards. Learn More. A unified “Microsoft OpenTelemetry Distro” to observe agents hosted anywhere. This gives teams a single starting point across Foundry, Azure Monitor, and A365, reducing fragmentation and simplifying onboarding (GH Repos: Python, .NET, JavaScript). Skills-based enablement. Getting started is easier. Just point your agent to a skill for AI-assisted instrumentation. We also plan to upgrade tools for instrumentation in Azure MCP.
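The event size limits listed above (up to 1MB per event and up to 256kB per attribute) can be checked before telemetry is sent. The sketch below is a minimal, standard-library illustration under stated assumptions: it treats the limits as decimal byte counts of UTF-8 text and approximates event size as the sum of attribute names and values, which may differ from how the ingestion pipeline actually measures size. The function name oversize_report and the attribute names are hypothetical.

```python
# Assumption: limits interpreted as decimal bytes; the service may count differently.
MAX_EVENT_BYTES = 1_000_000      # "up to 1MB" per event
MAX_ATTRIBUTE_BYTES = 256_000    # "up to 256kB" per attribute


def utf8_len(text: str) -> int:
    """Size of a string once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def oversize_report(attributes: dict) -> list:
    """Return human-readable problems; an empty list means the event fits."""
    problems = []
    total = 0
    for name, value in attributes.items():
        size = utf8_len(value)
        total += utf8_len(name) + size
        if size > MAX_ATTRIBUTE_BYTES:
            problems.append(
                f"attribute {name!r} is {size} bytes (limit {MAX_ATTRIBUTE_BYTES})"
            )
    # Rough estimate only: ignores serialization overhead and non-string fields.
    if total > MAX_EVENT_BYTES:
        problems.append(f"event is roughly {total} bytes (limit {MAX_EVENT_BYTES})")
    return problems


# Hypothetical agent event with a very long prompt.
event = {"prompt": "x" * 300_000, "response": "ok"}
for problem in oversize_report(event):
    print(problem)
```

A check like this is mainly useful for deciding whether to summarize or split a large prompt yourself rather than risk truncation.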
What’s next We’re continuing to invest in this area, with upcoming work focused on stronger security controls for prompts and responses, better cost transparency for agents, and clearer ways to measure ROI across your agent fleet. These updates make it possible to observe agents without adopting a separate toolchain. Explore the new capabilities, and if you see gaps, let us know so we can continue shaping the roadmap based on your feedback. Learn More.
MattMc
Jun 05, 2026, Azure Observability Blog
What’s new in Observability at Build 2026
At Build 2026, Azure Monitor introduces major advancements in end-to-end observability, extending across AI agents, applications, and infrastructure with OpenTelemetry at its core. New capabilities with Azure Copilot Observability agent, SLI/SLO support, and smarter alerting help teams move faster from detection to root cause while reducing noise and manual effort. Together, these innovations enable developers and SREs to operate modern, AI-driven systems with greater insight, efficiency, and alignment to customer experience.
Priyanka Nanda
Jun 03, 2026, Azure Observability Blog