azure monitor


Understanding billing for the Azure Copilot Observability Agent
The Azure Copilot Observability Agent brings an agentic investigation experience directly into Azure Monitor. Teams can chat with their observability data, run deep investigations across application and infrastructure signals, and, in preview, use autonomous operations to correlate alerts and create issues for review. The common thread across these experiences is that the agent performs AI work on behalf of the user or configured workflow, and that work has a cost. Azure Copilot Observability Agent billing went into effect July 1, 2026. This post explains the billing model at a practical level: what is measured, which agent operations are billable today, how usage appears to users, and how this differs from the standard Azure Monitor costs customers already manage for telemetry ingestion, retention, alerting, and other monitoring capabilities. For current list pricing and the most detailed billing guidance, always refer to the official billing documentation and the Azure Monitor pricing page. A consumption-based model for agentic work The Observability Agent uses a consumption-based pricing model: customers pay for the AI work the agent performs. This consumption is measured in Azure Agent Credits, or AAC. AAC provides a consistent unit for agent work across models and tokens used. AAC is designed to reflect the amount of agentic processing required to complete a task. Simple questions, such as "what was the maximum latency of this app yesterday?", typically use few tokens. A deep investigation consumes more agent and tool work, and therefore typically incurs higher cost. Note that a single agent operation - be it a chat question or a deep investigation - is currently capped at 500 AACs. Charges are scoped to the Azure subscription of the monitored resource, or to the subscription of the named agent instance if one is used (required for autonomous operations). 
This keeps the cost associated with the environment where the agent is being used, and lets teams review agent consumption alongside other subscription-level Azure costs. What is billable today There are three main usage patterns to understand. Chat - the agent's chat allows users to explore and analyze their observability data through natural-language questions about their Azure resources and their logs, metrics, traces, or related telemetry, and the agent performs the work needed to answer. This is typically the lowest-cost pattern because the scope is often focused and iterative. Deep investigation - can be initiated through a number of entry points in the Azure Portal, and also through the chat (users can tell the agent to run a deep investigation). A deep investigation performs a broader analysis - it gathers signals, correlates findings, reasons across application, infrastructure, and Azure platform context, and produces an investigation report. Because this workflow runs multiple agent and tool steps, it typically consumes more AAC than chat. Autonomous operations are currently in preview. Autonomous alert processing, triage and optional correlation can run in the background to group related alerts and reduce noise. Alert correlation itself isn’t billed during preview. If autonomous operations automatically run a deep investigation on an agent-created issue, that deep investigation is billable. This is an important distinction: preview correlation and issue creation are different from the investigation work that may be triggered as part of that flow. How users see usage in the product Cost transparency is part of the experience. After the agent returns a response in chat, users can open the usage indicator (hexagon-shaped icon) located next to the thumbs-up/down icons, to see how many AACs were used to generate that response. This makes consumption visible at the point where the user sees the value of the answer, rather than only later in a billing report. 
This is especially useful because not all agent interactions are equal. A short question that summarizes a recent trend can require much less agent work than a long-running investigation that reviews multiple signals and hypotheses. Showing AAC usage per response helps users understand that relationship and adjust how they use the agent when needed. How costs appear in Azure Cost Management Teams can review the overall agent cost in subscription Cost Management. The product name appears as Azure Monitor Observability Agent, and the meter name appears as Observability Agent Azure Agent Credits. This gives admins a familiar place to monitor consumption. The agent cost is not a replacement for standard Azure Monitor charges. Existing Azure Monitor costs — such as logs ingestion, retention, alerting, web tests, and other metered monitoring capabilities — continue to follow their own billing models. Through Cost Analysis smart views, such as Services, users can select the Azure Monitor service and review specific entries of the Azure Monitor Observability Agent. Practical guidance for teams Start by using chat for focused exploration: ask about trends, errors, performance, anomalies, or a specific resource. Use deep investigations when you need a broader, multi-signal analysis of an incident or suspected root cause. Review the AAC usage shown after agent responses so users can build intuition about which prompts stay lightweight and which workflows require deeper analysis. Use Azure Cost Management to monitor the subscription-level cost of agent usage, and keep the Observability Agent cost distinct from standard Azure Monitor telemetry costs such as logs ingestion, retention, alerting, and web tests. For current pricing details, billable behavior, and any updates to what is billed in preview or GA experiences, use the official billing documentation as the source of truth. Coming up... 
Looking ahead, we plan to introduce billing caps for the Observability Agent, giving customers greater control over monthly token consumption, capacity usage, and overall costs. Learn more Billing and cost management for Azure Copilot Observability Agent Azure Copilot Observability Agent overview Chat with your observability data Deep investigations in the Azure Copilot Observability Agent Autonomous operations in the Observability Agent Azure Monitor pricing We’d love your feedback The Observability Agent continues to evolve based on real-world usage and operator feedback. Share feedback through the Give Feedback option in the product, or reach us at noakuper@microsoft.com.
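The billing rules described in this post can be sketched in a few lines of Python. This is an illustrative model only, not an official calculator: the class, function, and kind names are hypothetical, and the 500 AAC cap is modeled as an upper bound on what a single operation can be billed, which is an assumption about how the cap behaves. Actual prices per AAC are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# From the post: a single agent operation is currently capped at 500 AAC.
MAX_AAC_PER_OPERATION = 500

# Hypothetical labels for this sketch, not product identifiers.
BILLABLE_KINDS = {"chat", "deep_investigation"}
NON_BILLABLE_PREVIEW_KINDS = {"alert_correlation"}


@dataclass
class AgentOperation:
    kind: str                                  # one of the labels above
    aac_used: float                            # AAC reported for the operation
    resource_subscription: str                 # subscription of the monitored resource
    agent_subscription: Optional[str] = None   # subscription of a named agent instance


def billed_aac(op: AgentOperation) -> float:
    """AAC billed for one operation under the rules described in the post."""
    if op.kind in NON_BILLABLE_PREVIEW_KINDS:
        return 0.0  # alert correlation is not billed during preview
    if op.kind not in BILLABLE_KINDS:
        raise ValueError(f"unknown operation kind: {op.kind}")
    # Assumption: the per-operation cap bounds the billed amount.
    return min(op.aac_used, MAX_AAC_PER_OPERATION)


def billing_subscription(op: AgentOperation) -> str:
    """Charges go to the named agent instance's subscription when one is used."""
    return op.agent_subscription or op.resource_subscription


def aac_by_subscription(ops: List[AgentOperation]) -> Dict[str, float]:
    """Sum billed AAC per subscription."""
    totals: Dict[str, float] = {}
    for op in ops:
        sub = billing_subscription(op)
        totals[sub] = totals.get(sub, 0.0) + billed_aac(op)
    return totals
```

A deep investigation that autonomous operations start on an agent-created issue would be recorded here as a `deep_investigation` with `agent_subscription` set, so it is billed even though the correlation that created the issue is not.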
Noa Kuperberg
Jul 16, 2026 Place Azure Observability Blog
Public Preview: Advanced platform metrics in Azure Monitor
We are excited to announce the Public Preview of advanced platform metrics for Azure Monitor, delivering more granular telemetry to help customers monitor and optimize their workloads more effectively. This new capability builds on Azure Monitor platform metrics, which continue to provide broad insight into the health, activity, and consumption of Azure resources. Advanced platform metrics add finer-grained signals, helping customers pinpoint changes and trends within resources more quickly and accurately. Azure Storage is the first Azure resource to provide advanced platform metrics to customers. Today, Azure Storage users rely on platform metrics to understand overall storage account trends, but this account-level telemetry does not always show what is driving change. For example, a storage account may show steady capacity growth without revealing which specific container is responsible. That growth could be coming from one container used for backups, another storing application logs, or a staging container used by a data pipeline. Advanced platform metrics for Azure Monitor address this scenario by providing container-level visibility, helping customers quickly identify where growth is occurring, investigate unexpected consumption increases, and make more informed cost and capacity planning decisions. What is available in Public Preview? In Public Preview, the following advanced platform metrics are available for Azure Storage across all Azure public cloud regions: Container Blob Capacity: The amount of storage used by a specific container in a storage account. Container Blob Count: The number of blob objects in a specific container in a storage account. Pricing and billing Advanced platform metrics for Azure Monitor are offered as a paid capability during Public Preview. For the latest pricing details, see Azure Monitor pricing. Getting started Advanced platform metrics can be enabled per storage account through PowerShell or Azure CLI. 
For instructions on enabling, managing, and viewing Azure Storage advanced platform metrics, see Azure Platform Metrics for Azure Blob Storage (preview). After advanced platform metrics are enabled for a storage account, they can be queried, visualized, and used for alerts through the same existing platform metrics experiences. Container Blob Capacity and Container Blob Count will appear in the Metric dropdown menu in Metrics Explorer, alongside all existing Azure Storage platform metrics. Users can then select Apply splitting and choose Container name to view metrics for individual containers. The chart below shows container-level capacity data for three containers. Azure Storage scenarios enabled by advanced platform metrics Standard Azure Monitor platform metrics provide visibility into the storage account as a whole, such as total blob capacity or total object count. With the addition of advanced platform metrics for Azure Storage, customers can understand which individual containers are contributing to growth, object count increases, or operational issues, enabling scenarios such as: Cost analysis and internal attribution: In shared storage accounts, identify which containers are consuming the most storage so teams can better understand which applications, environments, or business functions are driving usage without building custom reporting pipelines for this scenario. Capacity planning and growth forecasting: Track storage growth at the container level to see which workloads are driving overall account growth and make more informed planning and budgeting decisions. Runaway storage growth detection: Quickly isolate individual containers experiencing unexpected increases in capacity or object count, reducing investigation time when usage changes unexpectedly. What's next? 
As the feature moves toward General Availability (GA) and beyond, customers can expect to see more advanced platform metrics for Azure Storage, as well as new advanced platform metrics for other Azure resources. Continue to follow the Azure Observability Blog for the latest updates. Feedback We would love to hear your feedback on advanced platform metrics, including how your teams are using the feature to optimize workloads and additional advanced platform metrics that you would like to see onboarded to Azure Monitor. Please fill out this Azure Monitor advanced platform metrics feedback form, or email advancedplatformmetrics@microsoft.com.
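To make the runaway-growth scenario from this post concrete, here is a minimal Python sketch that ranks containers by capacity growth. It assumes you have already obtained Container Blob Capacity samples split by container name (for example, exported from Metrics Explorer); the function name and the sample data are hypothetical, and no Azure SDK call is shown.

```python
from typing import Dict, List, Tuple


def rank_container_growth(
    samples: Dict[str, List[float]],
) -> List[Tuple[str, float, float]]:
    """Rank containers by capacity growth between the first and last sample.

    Returns (container, growth, share_of_total_positive_growth) tuples,
    largest growth first.
    """
    growth = {
        name: series[-1] - series[0]
        for name, series in samples.items()
        if series
    }
    total = sum(g for g in growth.values() if g > 0)
    ranked = sorted(growth.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, g, g / total if total > 0 and g > 0 else 0.0)
        for name, g in ranked
    ]


if __name__ == "__main__":
    # Hypothetical capacity samples in GiB, oldest first.
    capacity = {
        "backups": [100.0, 120.0, 180.0],
        "app-logs": [50.0, 55.0, 70.0],
        "staging": [30.0, 30.0, 30.0],
    }
    for name, grew, share in rank_container_growth(capacity):
        print(f"{name}: +{grew:.0f} GiB ({share:.0%} of account growth)")
```

The account-level metric would only show the combined increase; splitting by container is what attributes it to a specific workload.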
alyssaschimm
Jul 15, 2026 Place Azure Observability Blog
Azure Monitor Observability Agent goes autonomous (preview)
Autonomous operations for the Azure Copilot Observability Agent are now in public preview, alongside the agent's general availability. With autonomous operations enabled, the Observability Agent listens to your alerts as they fire, triages them in the background, and runs deep investigations on the issues it creates. Along the way, it correlates related alerts into a single issue - so your team starts from a small set of explained, investigated issues instead of a stream of raw alerts. Until now, teams invoked the agent when they needed it - an interactive assistant, ready to investigate when you pointed it at a problem. Now it also prepares triage context continuously, on its own, while people stay responsible for decisions and any change to the environment. See autonomous operations in action. The Observability Agent triages incoming alerts, correlates related ones into a single Azure Monitor issue, and runs a deep investigation automatically, with no human trigger. From alerts to answers Azure Monitor already gives you strong signals when something is wrong - across both metric and log alerts. Dynamic thresholds learn normal behavior and flag anomalies automatically, and that same anomaly detection now extends to log search alerts and, in preview, to Prometheus and OpenTelemetry metrics. Smart detection in Application Insights surfaces failures and performance anomalies without manual rules. The hard part is what happens next: connecting dozens of alerts, working out what they share, and figuring out what's actually going on - before anyone can act. That's the work that still lands on a person, often in the middle of the night. It's exactly where the Observability Agent comes in. What's in the public preview In public preview, you can enable the Observability Agent to: Promote individual prominent alerts into issues when you configure that with custom instructions. Run a deep investigation automatically on every issue it creates. 
Correlate related alerts into a single Azure Monitor issue, with a natural-language explanation of why they belong together. You provision the agent once as a resource in your Azure environment - a dedicated identity to scope, govern, and assign autonomous tasks to - then turn on autonomous operations and it gets to work. What it changes for your team The outcome is fewer things to look at and faster triage: Your team works from a short queue of meaningful issues, not a constant stream of alerts. Each issue arrives with context, reasoning, and an investigation already attached. Low-priority issues can be reviewed and dismissed in seconds. The assembly work that used to come first now happens before anyone is paged. People still make every decision and every change. The agent just makes sure they start with full context. How it works Your own instructions. Topology shows how services connect, but your team knows which boundaries matter: ownership, escalation paths, and the alerts that should always become issues. Custom instructions let you capture that in plain language and apply it going forward. For example: "The billing service is owned by a different team with a separate on-call rotation. Even when billing alerts fire alongside clinical service alerts, treat them as separate issues." Instructions shape how the agent correlates and creates issues. They don't grant permissions, bypass Azure RBAC, or change resources. Automatic topology discovery. Point the agent at your Application Insights resource and it maps services, dependencies, and how they relate. That map becomes persisted knowledge the agent builds and reuses - the same context that grounds both its correlation decisions and its deep investigations, so reasoning reflects your real architecture instead of starting from scratch each time. Deeper investigations. 
When the agent investigates a correlated issue, it starts from the whole picture: every related alert, every impacted resource, and the reasoning correlation already produced. The result is sharper root-cause hypotheses and recommendations that account for the full scope of impact. In practice A database latency spike triggers alerts across checkout, billing, and recommendation services. Without autonomous operations, each alert is triaged on its own. With autonomous operations enabled, the Observability Agent groups the related alerts into one issue, explains the shared timeline, and starts investigating automatically. Because your custom instructions define billing as a separate ownership boundary, its alerts become a distinct issue routed to that team's rotation. Responders start from two clear, ownership-aligned issues - each already investigated - instead of dozens of isolated alerts. What's next Autonomous operations mark the next step for the Observability Agent: from user-invoked analysis to continuous preparation. The agent assembles the context, explains the issue, and runs the investigation; your team reviews the evidence and decides what to do. And once issues are created, you can act on them. Azure Monitor issues connect to Action Groups, so approved actions can flow into your existing workflows - more on that in a future post. Next steps Learn how to get started with the Azure Copilot Observability Agent. Review the preview details in Autonomous operations in the Observability Agent. Explore how investigations work in Deep investigations in the Observability Agent. Learn how teams preserve context with Azure Monitor issues. Stay connected Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what's coming next. Live webinar A walkthrough of real Observability Agent scenarios, best practices, and what's available today, along with a look at what's coming next and live Q&A with the product team. 
Register for the Observability Agent webinar We'd love your feedback The Observability Agent continues to evolve based on real-world usage and operator feedback. Share your thoughts directly through the Give Feedback option in the experience or reach us at azureobsagent@microsoft.com.
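The "In practice" scenario in this post can be illustrated with a small Python sketch. It is not how the Observability Agent correlates alerts (the post describes topology, custom instructions, and AI reasoning); it is a deliberately simple time-window heuristic that shows the outcome of an ownership boundary. All names, the 15-minute window, and the `owner_of` mapping are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Alert:
    service: str
    fired_at: datetime


def group_alerts(
    alerts: List[Alert],
    owner_of: Dict[str, str],
    window: timedelta = timedelta(minutes=15),
) -> List[dict]:
    """Group alerts that fire close together, never mixing ownership boundaries.

    `owner_of` plays the role of a custom instruction: services listed there
    are kept in their own issues; everything else shares a "default" owner.
    """
    issues: List[dict] = []
    latest_issue: Dict[str, dict] = {}  # owner -> most recent issue
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        owner = owner_of.get(alert.service, "default")
        issue = latest_issue.get(owner)
        # Start a new issue when the owner has none yet or the gap is too large.
        if issue is None or alert.fired_at - issue["alerts"][-1].fired_at > window:
            issue = {"owner": owner, "alerts": []}
            issues.append(issue)
            latest_issue[owner] = issue
        issue["alerts"].append(alert)
    return issues
```

With alerts from checkout, billing, and recommendation firing within minutes of each other and `owner_of = {"billing": "billing-team"}`, the sketch yields two issues, mirroring the two ownership-aligned issues described above.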
Efrat_Ben_Porat
Jul 08, 2026 Place Azure Observability Blog
Export historical data from Log Analytics workspace with Export Job (preview)
Log Analytics Export Job is now available in public preview. It gives you a straightforward way to export historical log data from your workspace to Azure Blob Storage, without writing custom scripts or disrupting live operations. You submit a job with a query and a time range, and the service handles the rest asynchronously.

Historical data had no built-in exit path

Your Log Analytics workspace accumulates months, sometimes years, of telemetry. That data has real value beyond the workspace: training security models, satisfying compliance requirements, supporting forensic investigations with external tools, or migrating to a new analytics platform. The challenge has always been getting it out. Log Analytics supports continuous data export for ongoing ingestion, but that doesn’t help with data that already exists. Teams that needed to export historical data had to build their own solutions: scripted query loops, Logic Apps, or Azure Functions calling the query API in batches and stitching results into storage. These approaches were slow, brittle, and hard to operationalize at scale. Export Job closes that gap.

One job per table, across Analytics and Basic tiers

You target a specific table, define a KQL filter on that table, and set a time range. The job exports the matching data, whether it sits in the Analytics or Basic tier, and writes the results directly to your storage account as Parquet files.

Figure: End-to-end flow of a Log Analytics Export Job

You can filter with KQL to scope the export to exactly the columns and records you need, reducing cost and downstream processing time. Output is gzip-compressed Parquet, the standard columnar format for data lakes, Spark, Azure Data Explorer, and most ML frameworks, with no conversion step required. Data is written to your blob storage in hourly folders. Billing is based on two existing meters: data scanned, using existing Log Analytics scan rates, and data volume exported as measured in your storage account.
Resilient execution

Large exports can be interrupted by network issues, transient storage errors, or downtime. Export Job includes a built-in retry mechanism to recover from these interruptions automatically. The service splits the job into hourly bins, each tracked and written independently to your storage container. Transient failures are retried without any action on your part. If a bin still fails after retries are exhausted, or because the job reached its seven-day timeout, you can retry it manually within 7 days of job completion, without restarting the entire job or re-exporting data that already completed successfully. Before a retry writes new data, any partial output from the failed bin is automatically cleaned up, so there is no risk of duplicates in your storage account.

Getting started

Log Analytics Export Job is available in public preview today. Configuration is programmatic through the Azure Monitor REST API, letting you create, check status, cancel, and retry jobs.

Before your first job:

- Enable the workspace Managed Identity in your Log Analytics workspace settings.
- Assign the Storage Blob Data Contributor and Log Analytics Reader roles to the workspace Managed Identity on your destination storage account.
- Ensure the destination storage account is in the same Azure region as the workspace (cross-region support is on the roadmap).
- Enable the Jobs category in your workspace’s diagnostic settings to route job execution records to the LAJobLogs table. This gives you creation time, job parameters, and bin-level status for every job you run.
- Assess the expected export volume and run duration using the suggested query in the Export Job article.

Consider the Export Job boundaries:

- The maximum time range per job is one year.
- The maximum run duration per job is seven days. When a job reaches this limit because of volume, you can retry it to continue the export from where it stopped.
- Five concurrent jobs are supported.

Once prerequisites are in place, create a job with a single API call:

    POST https://api.loganalytics.azure.com/v2/subscriptions/{subscriptionId}/resourcegroups/{resourcegroup}/providers/Microsoft.OperationalInsights/workspaces/{workspace}/jobs/export?api-version=2023-09-01-preview
    Authorization: {credential}
    content-type: application/json

    {
      "startTime": "2025-01-01T00:00:00Z",
      "endTime": "2025-06-30T23:59:59Z",
      "query": "{query}",
      "destinationStorageAccounts": [
        "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Storage/storageAccounts/{storageAccountName}"
      ],
      "containerName": "{containerName}",
      "outputDataFormat": "Parquet",
      "dateTimeFormat": "yyyy-MM-ddTHH"
    }

Copy the job ID returned in the response. You can use it to poll status, cancel the job, or retry individual failed bins.

Learn more: https://aka.ms/LogsExportJob

Share your feedback as we continue to improve the feature.
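The request above can also be assembled programmatically. The following Python sketch only builds the URL and JSON body from the fields shown in the post and checks the one-year range limit before anything is sent. The function and parameter names are hypothetical, "one year" is modeled as at most 366 days (an assumption), and sending the request with a valid `Authorization` header is left out because the post does not describe how the credential is obtained.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Tuple

API_VERSION = "2023-09-01-preview"
BASE_URL = "https://api.loganalytics.azure.com/v2"
MAX_RANGE = timedelta(days=366)  # assumption: "one year" as at most 366 days


def _iso(ts: datetime) -> str:
    """Format a timezone-aware datetime as UTC, e.g. 2025-01-01T00:00:00Z."""
    if ts.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_export_job_request(
    subscription_id: str,
    resource_group: str,
    workspace: str,
    storage_account_id: str,
    container_name: str,
    query: str,
    start: datetime,
    end: datetime,
) -> Tuple[str, str]:
    """Return (url, json_body) for the export job POST shown in the post."""
    if end <= start:
        raise ValueError("end must be after start")
    if end - start > MAX_RANGE:
        raise ValueError("time range exceeds the one-year limit per job")
    url = (
        f"{BASE_URL}/subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/providers/Microsoft.OperationalInsights/workspaces/{workspace}"
        f"/jobs/export?api-version={API_VERSION}"
    )
    body = {
        "startTime": _iso(start),
        "endTime": _iso(end),
        "query": query,
        "destinationStorageAccounts": [storage_account_id],
        "containerName": container_name,
        "outputDataFormat": "Parquet",
        "dateTimeFormat": "yyyy-MM-ddTHH",
    }
    return url, json.dumps(body)
```

Validating the range on the client side avoids submitting a job that the documented limits would reject; longer histories would need to be split into several jobs, keeping the limit of five concurrent jobs in mind.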
YossiY
Jul 07, 2026 Place Azure Observability Blog
Find anomalies in Prometheus and OpenTelemetry metrics with Dynamic Thresholds (Preview)
Dynamic thresholds now extend to query-based metric alerts in Azure Monitor, so you can detect and alert on anomalies in Azure Monitor managed Prometheus metrics and OpenTelemetry metrics stored in an Azure Monitor Workspace. This follows the introduction of Dynamic Thresholds for Log search alerts. Azure Monitor now offers consistent Dynamic Thresholds support across logs and metrics: platform metrics, log search queries, and query-based metric alerts. A consistent anomaly-detection approach, wherever your signals live.

Dynamic thresholds are not a single static formula. They apply a range of machine-learning models and algorithms to historical query results, learn each series’ normal rhythm (including hourly, daily, and weekly seasonality), and automatically fit the most appropriate baseline separately to every time series. This way, a single alert rule can monitor many resources or dimensions while each one gets its own independent, self-refining baseline.

Why Dynamic Thresholds Matter

Simpler configuration: Reduce the need to define, maintain, and continuously tune static thresholds inside PromQL alert logic.
Adaptive monitoring: Let alert thresholds adjust to changing workload behavior, recurring traffic peaks, and seasonal usage patterns.
At-scale intelligence: Monitor multiple time series with a single alert rule, while Azure Monitor learns an independent baseline for each resource or dimension combination.

Example 1: Spot CPU anomalies in AKS workloads

Scenario: Monitor container CPU utilization across pods or deployments in AKS with a query-based metric alert built on Prometheus metrics.
Example query:

    sum by (microsoft_resource_id, namespace, deployment, container) (rate(container_cpu_usage_seconds_total[5m])) / sum by (microsoft_resource_id, namespace, deployment, container) (container_spec_cpu_quota / container_spec_cpu_period)

Why dynamic thresholds help: CPU usage of a Kubernetes workload changes with workload mix, deployment timing, scaling activity, and traffic patterns. Static thresholds can be difficult to tune across namespaces, deployments, and containers. Dynamic thresholds learn a separate baseline for each monitored time series (in this example, for every namespace, deployment, and container combination), so genuine CPU spikes stand out while expected variation from autoscaling and traffic mix stays quiet.

Example 2: Catch application latency regressions sooner

Scenario: Detect abnormal latency patterns in an application by alerting on custom OpenTelemetry metrics stored in an Azure Monitor Workspace.

Example query:

    histogram_quantile(0.95, sum by (le, service_name, http_route, http_method) (rate(http_server_duration_seconds_bucket[5m])))

Why dynamic thresholds help: Application latency naturally changes with traffic, user behavior, and release cadence. Fixed thresholds can be noisy during peak periods and too loose during quiet ones. Dynamic thresholds learn a separate baseline for each time series (here, for every service, route, and method), so real p95 latency regressions surface even as traffic and release cadence shift throughout the day.

Best practices for better results

To get the best results from dynamic thresholds for PromQL-based alerts, design your query so Azure Monitor can learn a clear, stable signal over time: Keep the expression numeric. Dynamic thresholds work best when the query returns a continuous numeric signal rather than a Boolean true/false result. For example, use an expression that calculates CPU usage, not a Boolean comparison like CPU > 0.8. Use meaningful dimensions.
Split by dimensions such as namespace, deployment, service, or route when you want separate baselines for different workloads or endpoints. Prefer stable entities. Use longer-lived dimensions or aggregate across short-lived entities so the model has enough consistent history to learn from. In Kubernetes, for example, deployment is usually a better baseline dimension than individual pod ID. Choose the right threshold behavior. Decide whether the alert should trigger on values above the learned upper bound, below the lower bound, or both. Start with medium sensitivity. Use Medium as a balanced default, then tune up or down based on noise and missed anomalies. Allow enough historical data. Dynamic thresholds improve as more history is collected. Initial seasonal patterns use recent history, and weekly seasonality becomes more effective after several weeks of data. Get started Ready to try it? Create a query-based metric alert with dynamic thresholds on your metrics in Azure Monitor Workspace. You can create such rules in the Azure portal, where the built-in preview chart shows when your dynamic threshold alert would have fired based on historical baseline analysis. Use the preview chart to tune both the PromQL query and the dynamic threshold sensitivity before enabling the rule. You can also create query-based metric alert rules using programmatic interfaces or resource templates. Figure 1. Dynamic thresholds preview chart showing the learned baseline and the points where an alert would have fired. Dynamic thresholds cut alert noise where it starts — at detection. The alerts that do fire connect into Azure Monitor’s broader AIOps experience, where the Azure Copilot Observability Agent can help correlate signals into investigated issues with explainable reasoning — with humans in control. 
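The idea of a separate learned baseline per time series can be illustrated with a short Python sketch. It uses a deliberately simple stand-in (mean plus or minus k standard deviations over recent history), not the machine-learning models Azure Monitor applies, and it ignores seasonality. The function names, the `k` parameter, and the sample values are all hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def learn_bounds(history: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) bounds for one series from its own history."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two data points
    return mu - k * sigma, mu + k * sigma


def find_anomalies(
    history_by_series: Dict[str, List[float]],
    latest_by_series: Dict[str, float],
    k: float = 3.0,
    direction: str = "both",  # "above", "below", or "both"
) -> Dict[str, float]:
    """Flag series whose latest value falls outside their own learned bounds."""
    anomalies: Dict[str, float] = {}
    for series, history in history_by_series.items():
        lower, upper = learn_bounds(history, k)
        value = latest_by_series[series]
        too_high = direction in ("above", "both") and value > upper
        too_low = direction in ("below", "both") and value < lower
        if too_high or too_low:
            anomalies[series] = value
    return anomalies


if __name__ == "__main__":
    # Hypothetical CPU utilization ratios per deployment.
    history = {
        "shop/checkout": [0.20, 0.22, 0.21, 0.19, 0.20, 0.21],
        "shop/batch": [0.70, 0.75, 0.72, 0.68, 0.74, 0.71],
    }
    latest = {"shop/checkout": 0.60, "shop/batch": 0.73}
    print(find_anomalies(history, latest))
```

In this sample a single static threshold of 0.65 would miss the checkout spike and fire constantly on the batch workload, whereas per-series bounds flag only checkout. The `direction` argument mirrors the choice between alerting above the upper bound, below the lower bound, or both.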
Next steps Related blog: Anomaly detection made easy with Dynamic thresholds for Log search alerts Dynamic thresholds in Azure Monitor Query-based metric alerts overview Create query-based metric alerts Prometheus metrics in Azure Monitor OpenTelemetry on Azure Monitor Stay connected Follow the Azure Observability Blog for more updates on Azure Monitor, Prometheus-based monitoring, alerting, and troubleshooting experiences. We’ll continue sharing product updates, practical guidance, and examples to help you improve observability across your Azure environments. Feedback We’d love to hear how dynamic thresholds for query-based metric alerts work for your scenarios. Share your feedback through your Microsoft account team, Azure support channels, or the feedback options in the Azure portal so we can continue improving the experience.
yairgil
Jul 02, 2026 Place Azure Observability Blog
Azure Copilot Observability Agent is generally available, with autonomous operations in preview
Complex cloud environments have outpaced manual operations. Agentic cloud operations connect people, tools, and data to streamline investigation workflows and move teams from scattered signals to evidence-backed next steps. With unified observability, teams can investigate Azure-monitored applications, Azure Kubernetes Service (AKS) environments, VMs, Foundry telemetry, infrastructure, and platform signals with greater context and control. Powered by Azure Monitor, the Azure Copilot Observability Agent is now generally available. It helps engineering, SRE, DevOps, and operations teams move from telemetry and alert noise to investigated issues, explainable reasoning, and recommended next steps that can reduce Time-To-Mitigate (TTM). Autonomous operations are also available in public preview. They help prepare context and reduce triage work while people remain responsible for mitigation decisions and any changes to the environment. From alert noise to investigated issues The Observability Agent helps teams reduce the effort required to understand operational problems. Instead of starting every investigation from a dashboard, query editor, or alert payload, teams can work with an AI companion that reasons across telemetry, Azure resource context, discovered topology, and custom instructions to identify what changed, what is correlated, and what evidence supports the conclusion. Teams can start with natural-language exploration and continue into deeper investigations when an issue requires more evidence. That light-to-deep workflow helps responders move from broad questions to a structured investigation without losing the reasoning trail. Here's what this looks like in practice: after a deployment, several alerts might fire across an app, database dependency, and compute resource. 
The Observability Agent can group those signals around the affected service, identify when the regression started, compare related dependencies and infrastructure metrics, and capture the findings in an Azure Monitor issue. The responder can then validate the evidence, add team context, route work to the right owner, and decide whether a rollback, configuration change, or code fix is appropriate.

Explainable investigations across Azure-monitored signals

Operations teams need more than a chatbot that answers questions. The Observability Agent follows an investigation workflow: it frames hypotheses, gathers evidence, compares signals by time, scope, and type, rules out weak explanations, and shows the reasoning path behind its findings.

The Observability Agent can help teams:

- Investigate incidents and alerts across Azure-monitored applications, Azure Kubernetes Service (AKS) environments, VMs, Foundry telemetry, infrastructure, and platform signals
- Correlate related signals to reduce noise and surface higher-signal issues with context
- Explore telemetry using natural language while preserving transparency into the supporting data
- Compare signals by time, scope, and type to separate likely causes from coincidental changes
- Provide a reasoning trail that shows what the agent found, what it ruled out, and why
- Recommend next steps that engineers can review before deciding how to act

This same investigation model applies to specialized skills and issue types, including customer application, Azure Kubernetes Service (AKS), Foundry, VM, and GenAI issues. When the relevant telemetry is available, the Observability Agent can correlate logs, metrics, traces, alerts, dependencies, resource graph, resource health, activity logs, Foundry telemetry, and changes.
This helps teams investigate customer-visible issues with evidence, including latency, token spikes, tool-call failures, agent errors, hallucinations, deployments, API failures, performance regressions, infrastructure dependencies, and platform incidents. This explainability is central to the product. In production operations, trust is earned through evidence. The Observability Agent is built to support human judgment, not bypass it.

Azure expertise, with context from your environment

Context matters in every investigation. The same symptom can mean different things depending on application architecture, recent deployments, dependencies, historical incidents, and team practices. The Observability Agent brings Microsoft and Azure operational knowledge into the investigation experience. It can use discovered topology, Azure resource context, logs, metrics, traces, and custom instructions to ground investigations in signals that are more relevant to your environment.

Native to Azure Monitor, with humans in control

Because the Observability Agent is built into Azure Monitor, teams can use it close to the telemetry, alerts, and workflows they already rely on. Investigations can also be captured as Azure Monitor issues, creating a shared case file for humans and agents to collaborate on evidence, reasoning, and next steps. The Observability Agent is designed for governed AI operations inside Azure Monitor. Interactive chat and investigations use the signed-in user's identity and Azure role-based access control (RBAC). Prompts and responses are not used to train foundation models, and the agent doesn't restart resources, change configuration, or resolve issues on its own.

Autonomous operations in public preview

Alongside general availability, autonomous operations for the Observability Agent are available in public preview.
When enabled, the agent can analyze alerts in the background, correlate related alerts when they likely represent the same incident, create Azure Monitor issues automatically, and run deep investigations on agent-created issues. This automatic triage helps reduce alert noise by turning streams of individual alerts into higher-signal issues with context, findings, and recommended next steps. Teams can review the issue, continue the investigation, and decide what action to take. Autonomous operations are designed to prepare context and reduce triage work, not to remove human control. Engineers remain responsible for decisions, approvals, and any changes to the environment. Next steps Check out our latest announcements and related blogs: Azure Blog and OMB Blog. Learn how to use the Observability Agent in Azure Copilot Observability Agent. Explore how investigations work in Deep investigations in the Azure Copilot Observability Agent. Learn more on how to Chat with your observability data Learn how teams preserve context in Azure Monitor issues. Review preview details in Autonomous operations in the Azure Copilot Observability Agent. Stay connected Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what's coming next. Live webinar - a walkthrough of real Observability Agent scenarios, best practices, and what's available today - along with a look at what's coming next, and live Q&A with the product team. Register for the Observability Agent webinar. We'd love your feedback The Observability agent continues to evolve based on real-world usage and operator feedback. Share your thoughts directly through the Give Feedback option in the experience, or reach us at enauerman@microsoft.com.
EfratNauerman
Jun 23, 2026 Place Azure Observability Blog
9.2KViews
6likes
0Comments
Accelerating AKS troubleshooting with the Azure Copilot Observability Agent
AKS incidents rarely stay within one Kubernetes object, signal, or tool. A latency spike might first appear in application telemetry, but the root cause may sit elsewhere: pod restarts, node pressure, scheduling failures, or a recent configuration change. The Azure Copilot Observability Agent in Azure Monitor helps connect these signals into an explainable investigation, so teams can move from symptoms to evidence-backed next steps. Why AKS troubleshooting is complex Troubleshooting Azure Kubernetes Service (AKS) is complex because failures can originate in workloads, platform components, infrastructure, or the application code running on the cluster. For example, pods stuck in Pending may indicate capacity or scheduling issues, while application latency may be caused by throttling, failed probes, pod restarts, or node pressure below the app. During an incident, simply having more telemetry is not enough. Teams need a way to test likely causes, rule out unrelated signals, and keep the investigation tied to the affected workload and time window. From signal to root cause: the investigation flow The Observability Agent follows a consistent investigation pipeline: Scope the problem by identifying the most likely infrastructure resources involved, plus connected dependencies. Collect data across metrics, logs, traces, change history, and related signals. Detect anomalies using learned baselines (for metrics) and log analysis. Correlate across resources spanning infrastructure and application layers. Run deep diagnostics by invoking resource-specific tools when needed to pinpoint root cause. Summarize findings in a structured format: what happened, why it happened, and what to do next. AKS investigation data sources The agent works with telemetry already available in your Azure Monitor environment. 
Investigation depth improves as more relevant signals are enabled, including Container insights logs, Kubernetes events and state, Azure managed service for Prometheus, container and pod logs, Application Insights telemetry for AKS-hosted workloads, Azure Activity Log changes, control plane logs routed through diagnostic settings, and resource metadata for the cluster, node pools, workloads, and related Azure resources. Figure 1. AKS investigation data sources You don’t need to enable every telemetry source to get started. The Observability Agent uses the data already available in Azure Monitor, and its findings become more complete as more AKS and application signals are collected. Example 1: AKS infrastructure — explaining why new pods never start Consider a workload rollout on AKS where replacement pods remain stuck in Pending state. What looks like a failed release may stem from the workload definition, cluster state, or underlying infrastructure. Investigation walkthrough Symptom: rollout is blocked Replacement pods remain in Pending during rollout, and Kubernetes events show repeated scheduling failures. This indicates that the rollout is blocked before new pods can start. Workload evidence: scheduling, not startup Pod state identifies the affected workload, while Kubernetes events show repeated placement failures. The issue is therefore tied to scheduling rather than application startup or container crash behavior. Cluster evidence: capacity pressure When enabled, Prometheus node metrics show CPU and memory utilization near capacity. Cluster-level trends show resource pressure increasing at the same time as pending pods and scheduling failures. Likely cause: insufficient schedulable capacity The scheduler cannot place new pods because the relevant node pool does not have enough available capacity. The failed rollout is best explained by capacity pressure in the target node pool rather than an application crash or image startup failure. 
Recommended action Scale out the affected node pool or adjust workload resource requests, then retry the rollout once schedulable capacity is restored. Figure 2. AKS investigation flow The Observability Agent connects pod state, scheduling events, and node pressure to explain why the rollout is blocked and which capacity action to consider next. Example 2: Joint app-AKS investigation — tracing application latency to pod restarts Now consider a customer-facing application where users see increased latency and intermittent HTTP 5xx errors after deployment. The first symptom appears in application telemetry, but the unhealthy requests are served by pods that are repeatedly restarting in AKS. Investigation walkthrough Symptom: customer-facing service degradation After deployment, application telemetry shows increased latency and HTTP 5xx errors. The first visible impact appears at the application layer. AKS evidence: unstable pods Affected pods enter CrashLoopBackOff, restart counts increase, and Kubernetes events show back-off restarts, probe failures, or image or command errors. Container logs point to startup exceptions, missing configuration, or crash details. Resource evidence: workload-specific pressure Container memory usage approaches configured limits before restarts, while node metrics show no broad node pressure. This suggests the issue is workload-specific rather than cluster-wide capacity related. Change evidence: deployment correlation Deployment history shows a new image or configuration change shortly before restarts began, with no matching platform health event. The timing points to the latest deployment or configuration change. Recommended action Review the latest image or configuration change, inspect container logs, adjust memory limits, or roll back if needed. Focus remediation on the workload change rather than node pool scaling. This pattern shows how an application symptom can map back to AKS workload behavior. 
Application telemetry establishes the user impact, while Kubernetes events, container logs, and resource metrics help explain why the affected pods keep failing. Operational impact For site reliability engineers, platform teams, and IT professionals, the Observability Agent reduces the time spent moving between application and AKS telemetry. It brings relevant signals into one investigation, surfaces supporting evidence, and applies Azure Monitor and AKS context so your team can review the findings, validate the recommended path, and decide which production changes to make. Figure 3. AKS investigation results Using the Observability Agent You can start using the Observability Agent from the Azure portal in two common AKS troubleshooting flows: Investigation mode: Start an investigation from an Azure Monitor alert on an AKS resource or from an Application Insights alert for an AKS-hosted workload. The agent uses the alert context to scope the incident, correlate application and cluster telemetry, and summarize the likely cause with recommended next steps. Chat-based exploration: Open the Monitor experience in AKS and select the Observability Agent button to chat with your telemetry. Use natural language to ask follow-up questions, explore logs and metrics, detect and inspect anomalies, and narrow down likely causes. Figure 4. Starting Observability Agent from AKS Monitor experience Next steps Azure Copilot Observability Agent overview Monitor Azure Kubernetes Service with Azure Monitor Stay connected Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what's coming next. Live webinar — A walkthrough of real Observability Agent scenarios, best practices, and what's available today, along with a look at what's coming next and live Q&A with the product team. Register for the Observability Agent webinar. We'd love your feedback The Observability Agent continues to evolve based on real-world usage and operator feedback. 
Share your thoughts directly through the Give Feedback option in the experience, or reach us at: azureobsagent@microsoft.com
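The two AKS examples above follow the same pattern: collect a few pieces of evidence, then choose the explanation that fits all of them. A minimal rule-based sketch of that reasoning follows. The `Evidence` fields and the `likely_cause` function are hypothetical names made up for illustration; this is not how the Observability Agent is implemented, only a way to make the evidence-to-cause mapping explicit.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    # Hypothetical flags summarizing the signals discussed in the two examples.
    pods_pending: bool = False
    scheduling_failures: bool = False
    node_pressure: bool = False
    crash_loop: bool = False
    memory_near_limit: bool = False
    recent_deployment: bool = False


def likely_cause(e: Evidence) -> str:
    """Map collected evidence to the explanation that fits it best."""
    # Example 1: pods never start, the scheduler cannot place them, nodes are full.
    if e.pods_pending and e.scheduling_failures and e.node_pressure:
        return "insufficient schedulable capacity"
    # Example 2: pods start but keep crashing, with no cluster-wide pressure.
    if e.crash_loop and not e.node_pressure and (e.memory_near_limit or e.recent_deployment):
        return "workload change"
    return "undetermined"


blocked_rollout = Evidence(pods_pending=True, scheduling_failures=True, node_pressure=True)
unstable_pods = Evidence(crash_loop=True, memory_near_limit=True, recent_deployment=True)
```

In this sketch the absence of node pressure is what separates the two cases, which mirrors the recommendation in Example 2 to focus on the workload change rather than on node pool scaling.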
yairgil
Jun 17, 2026 Place Azure Observability Blog
260Views
0likes
0Comments
Anomaly detection made easy with Dynamic thresholds for Log search alerts
We’re excited to announce the General Availability of dynamic thresholds for log search alerts in Azure Monitor. Dynamic thresholds make anomaly detection easier by using machine learning to learn normal behavior from your historical log query results, automatically account for patterns such as hourly, daily, and weekly seasonality, and adapt as your environment changes. Instead of manually choosing static limits that can quickly become outdated, you can let Azure Monitor automatically determine the right threshold for each alert rule. Dynamic thresholds for Log search alerts are available at no extra charge - you pay the standard log search alert rule rate.

Why it matters

Simplified configuration: No need to fine-tune thresholds manually.
Adaptive monitoring: Alerts automatically adapt to changing usage patterns and trends.
At-scale intelligence: For multi-dimensional monitoring, thresholds are calculated per dimension combination.

Example use cases

AKS Pod restart spike anomaly detection

Scenario: Monitor Kubernetes Pod logs for spikes in pod restarts across clusters.
Why dynamic thresholds help: AKS workloads often scale dynamically; static thresholds can’t account for autoscaling patterns. Dynamic thresholds adapt to normal fluctuations in node/pod counts and alert only on true anomalies.
Example query:
KubePodInventory | summarize restartCount = sum(PodRestartCount) by bin(TimeGenerated, 10m), ClusterName, Namespace, Name
Dynamic threshold settings:
Measure: restartCount (the aggregated column from the query).
Split by dimensions (optional): Namespace (for workload-level baselines), Name (for per-pod granularity if needed).
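To make the idea of a learned threshold more concrete, here is a minimal sketch of a seasonal baseline. It is not the algorithm Azure Monitor uses. It assumes a simple "mean plus a multiple of the standard deviation" rule per hour of day, and the function names, the `sensitivity` parameter, and the sample restart counts are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_thresholds(history, sensitivity=3.0):
    """Learn an upper threshold per hour of day from (hour, value) samples."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    # Threshold = mean + sensitivity * population standard deviation for that hour.
    return {hour: mean(vals) + sensitivity * pstdev(vals) for hour, vals in buckets.items()}


def is_anomalous(thresholds, hour, value):
    """Flag a value only if a baseline exists for that hour and the value exceeds it."""
    return hour in thresholds and value > thresholds[hour]


# Hypothetical restart counts: quiet at 03:00, busier at 14:00.
history = [(3, 1), (3, 0), (3, 2), (3, 1), (14, 10), (14, 12), (14, 9), (14, 11)]
thresholds = learn_thresholds(history)

night_spike = is_anomalous(thresholds, 3, 8)    # 8 restarts at 03:00
afternoon_ok = is_anomalous(thresholds, 14, 8)  # 8 restarts at 14:00
```

With this made-up history, the same value of 8 restarts is flagged at 03:00 but not at 14:00, which is the behavior a single static threshold cannot reproduce.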
Resource Inventory Drift Detection (Azure Resource Graph)

Scenario: Using the Log search alerts integration with Azure Resource Graph, detect sudden spikes in resource creation or deletion across subscriptions or management groups that may indicate runaway deployments.
Why dynamic thresholds help: Large organizations often have thousands of resources with varying deployment patterns. Static thresholds can’t account for seasonal changes (e.g., monthly deployments, scaling events). Dynamic thresholds adapt per subscription or resource type, reducing false positives.
Example query:
arg("").Resources | summarize resourceCount = count() by type, subscriptionId
Dynamic threshold settings:
Measure: resourceCount (the aggregated column from the query).
Split by dimensions (optional): type (for specific resource type changes), subscriptionId (for per-subscription granularity).

Getting Started

Learn more about Log search alerts with dynamic thresholds and how to set up alert rules in Azure Monitor.
Efrat_Ben_Porat
Jun 16, 2026 Place Azure Observability Blog
250Views
0likes
0Comments
General Availability: Simple log alerts in Azure Monitor
We are excited to announce the General Availability of Simple log alerts in Azure Monitor! This feature is designed to provide a simplified and more intuitive experience for monitoring and alerting, enhancing your ability to detect and respond to problems in near real time. Simple log alerts are a type of Log search alert in Azure Monitor, designed to provide a simpler alternative to traditional Log search alerts. Unlike Log search alerts that aggregate rows over a defined period, Simple log alerts evaluate each row individually. Simple log alerts are also supported on Basic logs. Previously, choosing Basic logs for cost optimization - for example, configuring the traces table in Application Insights with the Basic logs plan - meant giving up the ability to alert on that data. Simple log alerts close that gap, so you can keep the cost savings and still alert on telemetry stored in Basic logs.

🌐 When to use

Simple log alerts are ideal for monitoring applications or network traffic where unaggregated, real-time detection and quick incident response are critical. Example scenarios:
Failed automation jobs - get notified the moment a backup job, scheduled task, or any automated process fails, rather than waiting for an aggregation window.
Windows events affecting storage or security - alert on individual event log entries that signal disk failures, security breaches, or service disruptions.

🔁 Flexible Trigger Recurrence

By default, Simple log alerts fire on every matching row, but you can tune this to reduce noise. Choose to alert only when the condition is met at least once, twice, three times, or a custom number of times within a minute, giving you control over sensitivity without sacrificing low latency.

💰 Pricing Information

Simple log alert rules evaluate your data every minute, so billing is the same as 1-minute frequency alert rules. For detailed pricing information, refer to Pricing - Azure Monitor | Microsoft Azure.
You will see these rules in your billing statement tagged with kind:simplelogalert. 📚 Documentation and Links Create a simple log search alert in Azure Monitor - Azure Monitor | Microsoft Learn Overview of Azure Monitor alerts - Azure Monitor | Microsoft Learn
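The trigger recurrence setting described above can be pictured as a per-minute count of matching rows. The sketch below is only an illustration under that assumption. The function name, the row shape, and the `min_matches` parameter are hypothetical and do not correspond to an Azure Monitor API, and the sketch reports the minutes that would fire rather than each individual row.

```python
from collections import Counter
from datetime import datetime


def minutes_that_fire(rows, matches, min_matches=1):
    """Return the minute buckets in which at least min_matches rows meet the condition."""
    counts = Counter(
        ts.replace(second=0, microsecond=0)  # truncate each timestamp to its minute
        for ts, row in rows
        if matches(row)
    )
    return sorted(minute for minute, n in counts.items() if n >= min_matches)


# Hypothetical job-status rows as (timestamp, record) pairs.
rows = [
    (datetime(2026, 6, 16, 10, 0, 5), {"status": "Failed"}),
    (datetime(2026, 6, 16, 10, 0, 40), {"status": "Failed"}),
    (datetime(2026, 6, 16, 10, 0, 50), {"status": "Succeeded"}),
    (datetime(2026, 6, 16, 10, 1, 10), {"status": "Failed"}),
]


def failed(row):
    return row["status"] == "Failed"


every_match = minutes_that_fire(rows, failed, min_matches=1)
at_least_twice = minutes_that_fire(rows, failed, min_matches=2)
```

Raising `min_matches` from 1 to 2 drops the 10:01 minute, which had a single failure, while the 10:00 minute with two failures still fires.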
Efrat_Ben_Porat
Jun 16, 2026 Place Azure Observability Blog
431Views
1like
0Comments
The Azure Copilot Observability Agent Chat - Stop Writing Queries, Start Asking Questions.
Services and applications produce massive amounts of telemetry, and making sense of all this data takes effort. Data is often spread across different stores, which means the path to clear insights goes through careful querying, refinement, and correlation. The Azure Copilot Observability agent now has a chat experience that simplifies this dramatically: you just ask, in your own plain, natural language.

Ask questions. Get answers.

To start chatting with the Observability agent, select a resource in the Azure Portal, and choose Logs from the resource menu. Click the Observability agent button. Soon, additional Azure observability experiences will show this or similar buttons so you can chat with the agent throughout your observability process. The Observability agent chat opens with a short intro message and a few suggested prompts. Select one of the suggestions or type your question in natural language:
“What errors increased in the last 24 hours?”
“¿Existen anomalías de latencia?” (are there any latency anomalies?)
“どの依存関係が失敗しているか” (which dependencies are failing?)
The agent translates your prompt into queries across all relevant data sources, analyzes your data, and returns clear, data-backed insights, so you don't need to write KQL queries, switch between logs and metrics experiences, or dive into the schemas of your data store.

Explore your data - interactively

The chat experience is designed for an interactive process of data exploration and troubleshooting. Through the chat you can explore trends in logs and metrics, identify anomalies, and visualize results directly in the chat, all from one interface. Note: The agent operates here as your personal observability assistant. It can only query data on your behalf and access resources that you can access. The chat with the agent has a progressive exploration flow, instead of isolated queries.
Still, at each step in the conversation the agent provides a clear chain of thought, including the actual queries it used, so you can keep track of how it understood your prompt and how it produced the output. Results are shown clearly and explained. In the example shown here, we follow up and ask the agent to create a time chart of the failed operations impacted by the errors it reported earlier. The result is clear - GET Customers/Details was impacted significantly, reaching 100K failed requests over a long time.

From exploration to guided investigation

The chat is very useful for guided investigations that go as deep as you choose, just as you would with the classic analysis tools over logs or metrics. Following the example shown above, we ask the agent to show exceptions or traces correlated with the failed requests. The agent found an association with NullReferenceException, and suggests going deeper and using the operation_Id field to clearly identify the request -> dependency -> exception sequence. We'll accept the recommendation and choose the first suggestion: Pull full transaction timeline. And here it is - each step of the transaction timeline explained, and the culprit is found - a failed Azure Table dependency. We didn't have to write queries, review metrics, join tables, or even know which tables are there. We used standard terms to ask questions in natural language, and we were able to get as deep as we wanted, and can dive deeper still. For example, you can tell the agent to:
“Map this call chain into a sequence-diagram style summary showing request, SQL dependency, table write, and exception.”
“Calculate the average request latency during the last 6 hours, split by client type, location and OS.”
“Find anomalies in the exceptions logged over the last 4 hours.”
“Create a time chart to show the top 3 anomalies.”
“How many users were impacted by each of the top 3 anomalies found?”
“Break down the exception counts by request operation.”

Launching a deep investigation

Through the chat with the Observability agent, you can also trigger a full, deep investigation process. A deep investigation doesn't handle just one question, but investigates an incident thoroughly: it maps all related resources, identifies anomalies, performs correlations, analyzes root causes, and eventually provides a detailed report, including findings and recommendations. To start a deep investigation, select it from the suggestions provided during the conversation, or ask the agent explicitly, for example: run a deep investigation on the NullReferenceException anomaly.

Final thought

If observability used to start with queries, it now starts with a conversation. You can either guide the agent through the process you want to go through, or let it investigate on its own. Just ask.

Stay connected

Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what’s coming next. Check out our recent public preview update of the Azure Copilot Observability agent. Live webinar - a walkthrough of real Observability agent scenarios, best practices, and what’s available today, along with a look at what’s coming next and live Q&A with the product team. 👉 Register here

We’d love your feedback

The Observability agent continues to evolve based on real-world usage and operator feedback. Share your thoughts directly through the Give Feedback option in the experience, or reach us at: azureobsagent@microsoft.com
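The transaction timeline mentioned above relies on one idea: telemetry rows that share an operation_Id belong to the same end-to-end operation. The sketch below shows that grouping in plain Python, assuming rows are already available as dictionaries. The function name and the sample rows are made up for illustration, and this is not how the agent builds its timeline.

```python
from collections import defaultdict


def build_timelines(rows):
    """Group telemetry rows by operation_Id and order each group by timestamp."""
    timelines = defaultdict(list)
    for row in rows:
        timelines[row["operation_Id"]].append(row)
    for steps in timelines.values():
        steps.sort(key=lambda r: r["timestamp"])
    return dict(timelines)


# Hypothetical rows, deliberately out of order; timestamps are simple integers here.
rows = [
    {"operation_Id": "op-1", "timestamp": 3, "itemType": "exception", "name": "NullReferenceException"},
    {"operation_Id": "op-1", "timestamp": 1, "itemType": "request", "name": "GET Customers/Details"},
    {"operation_Id": "op-2", "timestamp": 1, "itemType": "request", "name": "GET Home/Index"},
    {"operation_Id": "op-1", "timestamp": 2, "itemType": "dependency", "name": "Azure Table write"},
]

timelines = build_timelines(rows)
sequence = [step["itemType"] for step in timelines["op-1"]]
```

For the made-up operation op-1, `sequence` comes out in request, dependency, exception order, matching the request -> dependency -> exception chain described in the walkthrough.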
Noa Kuperberg
Jun 15, 2026 Place Azure Observability Blog
791Views
1like
2Comments