Azure Monitor managed service for Prometheus
Introducing a unified Azure Monitor cloud native offering for Kubernetes monitoring
Azure Monitor now offers a managed Prometheus service alongside Azure Managed Grafana and Azure Monitor Container insights, fostering an ecosystem of open-source tooling and vendor neutrality. You can now get the power of cloud native monitoring tools with the enterprise-grade security and reliability of Azure Monitor.

What is Azure Monitor’s unified cloud native offering for Kubernetes monitoring?

We are excited to announce the preview of Azure Monitor managed service for Prometheus for monitoring your Kubernetes clusters. It is more than “just another service” to monitor your Kubernetes cluster: managed service for Prometheus is deeply integrated with Azure Monitor tools. You can now set up end-to-end monitoring for your Kubernetes cluster without the headache of patching together various tools and manual integrations. With Azure Monitor’s unified cloud native offering, you get:

- A fully managed service that handles ingestion, storage for long-term data retention, and querying of your Prometheus metrics using the Prometheus query language (PromQL), with Azure Monitor managed service for Prometheus.
- Advanced troubleshooting capabilities with Azure Monitor Container insights, which offers visualizations and troubleshooting logs for nodes, controllers, and containers.
- Full stack observability with Azure Managed Grafana: a single-click configuration seamlessly links your data sources and provides popular out-of-the-box Grafana dashboards.

Get started with Azure Monitor’s cloud native offering

Getting started with Azure Monitor’s cloud native offering for Kubernetes monitoring is a breeze. Once you create your Kubernetes cluster, you can enable monitoring using the Azure portal, Azure Resource Manager templates, or the CLI. In the Azure portal, you can enable Prometheus metrics by selecting ‘Insights’ in your cluster view. The Prometheus metrics are stored in an Azure Monitor workspace. Additionally, you can pick a Log Analytics workspace and enable log collection for advanced troubleshooting.
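For the CLI path mentioned above, onboarding can be sketched roughly as follows. This is a hedged sketch, not the definitive onboarding procedure: the resource names in angle brackets are placeholders for your own resources, and the exact flags available depend on your Azure CLI version (a preview extension may be required while the feature is in preview).

```
# Sketch: enable Managed Prometheus metrics collection on an existing AKS
# cluster, sending metrics to an Azure Monitor workspace and linking a
# Managed Grafana instance. <...> values are placeholders.
az aks update \
  --name <cluster-name> \
  --resource-group <resource-group> \
  --enable-azure-monitor-metrics \
  --azure-monitor-workspace-resource-id <azure-monitor-workspace-resource-id> \
  --grafana-resource-id <grafana-workspace-resource-id>
```

The portal flow described above accomplishes the same configuration interactively.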
Finally, link your Grafana workspace (or create a new one) to visualize your monitoring data. Once configured, you can unleash the power of cloud native monitoring for your Kubernetes cluster with Azure Monitor.

Open-source compatibility and portability for Prometheus metrics

Azure Monitor managed service for Prometheus is a fully managed service that handles ingestion of the native Prometheus metric types. It is available on its own or as an integrated component of Azure Monitor Container insights and Azure Managed Grafana. Additional benefits include:

- 18-month data retention and a PromQL-based query service
- A highly available, scalable, and enterprise-grade secure service
- Easy enablement of managed Prometheus using our collector, or use as a drop-in replacement for self-managed Prometheus
- Retention of your existing Prometheus configurations, recording rules, and alert rules

Monitor health and performance of your Kubernetes cluster with Azure Monitor Container insights

Azure Monitor Container insights collects critical logs, enables alerts to identify issues, and provides visualizations to monitor the health and performance of your Kubernetes cluster. It complements CNCF-backed open-source tools for end-to-end Kubernetes monitoring, including log collection for advanced troubleshooting. Enabling Container insights together with managed Prometheus opens possibilities such as:

- Correlating spikes in Prometheus metrics with troubleshooting logs for the Kubernetes cluster
- Identifying capacity needs and determining the maximum load the cluster can sustain by understanding its behavior under average and peak loads
- Advanced diagnostics with collection of container logs (stdout/stderr), events, and pod metrics

We understand that your monitoring needs may vary depending on scale, topology, organizational roles, and multi-cluster tenancy.
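To make the correlation scenario concrete, queries like the following are the kind you might run against the workspace when investigating a spike. This is a sketch using standard cAdvisor and kube-state-metrics series names; the exact series available depend on which default scrape targets are enabled in your cluster.

```
# Per-node CPU usage rate over the last 5 minutes (cAdvisor metric)
sum by (node) (rate(container_cpu_usage_seconds_total[5m]))

# Pods with container restarts in the last hour (kube-state-metrics),
# a natural candidate to cross-reference against container logs
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h])) > 0
```

A restart spike found this way can then be cross-referenced with the container logs collected by Container insights for the same pod and time window.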
Learn more about the best practices for monitoring each layer, from infrastructure up through applications, with Prometheus metrics and logs collected with Container insights.

Get full stack observability with Azure Managed Grafana

Getting the right monitoring data is only half of getting started with monitoring. It is imperative to have a single pane of glass to identify performance issues and quickly mitigate them. Azure Managed Grafana is integrated with Azure Monitor and gives you full stack observability across multiple data sources on a single screen. With Azure Managed Grafana, you can:

- Get popular out-of-the-box Grafana dashboards for your Prometheus metrics
- Easily add your existing Grafana dashboards or Azure Monitor visualizations in a single view
- Combine application metrics and infrastructure metrics from various data sources into a single dashboard for full stack visibility
- Add Grafana dashboards from the open-source community

That is all you need to get started with Kubernetes monitoring on Azure Monitor. We’d love to hear what you like and don’t like about this feature, and where you’d like us to take it. Please provide feedback on the Azure Monitor Community under the Managed Prometheus category. If you wish to learn more, you can always find a wealth of learning content in our documentation.

Learn more

- Azure Monitor managed service for Prometheus (preview) | Microsoft Learn
- Overview of Container insights - Azure Monitor | Microsoft Learn
- Quickstart: create an Azure Managed Grafana instance using the Azure portal | Microsoft Learn
- Monitor Azure services and applications using Grafana - Azure Monitor | Microsoft Learn

Monitoring GPU Metrics in AKS with Azure Managed Prometheus, DCGM Exporter and Managed Grafana
Azure Managed Prometheus and Azure Managed Grafana (https://aka.ms/managedpromdocumentation) provide a production-grade solution for monitoring without the hassle of installation and maintenance. By leveraging these managed services, we can focus on extracting insights from our metrics and logs rather than managing the underlying infrastructure. The integration of essential GPU metrics (such as framebuffer memory usage, GPU utilization, Tensor Core utilization, and SM clock frequencies) into Azure Managed Prometheus and Grafana enhances the visualization of actionable insights. This integration facilitates a comprehensive understanding of GPU consumption patterns, enabling more informed decisions regarding optimization and resource allocation. Azure Managed Prometheus announced the preview (https://aka.ms/ampcrdblog) of Operator and CRD support, which enables customers to customize metrics collection and add scraping of metrics from workloads and applications using Service and Pod Monitors, similar to the OSS Prometheus Operator. This blog demonstrates how we leveraged the CRD/Operator support in Azure Managed Prometheus together with the Nvidia DCGM Exporter and Grafana to enable GPU monitoring.

GPU monitoring

As the use of GPUs has skyrocketed for deploying large language models (LLMs) for both inference and fine-tuning, monitoring these resources becomes critical to ensure optimal performance and utilization. Prometheus (https://prometheus.io/docs/introduction/overview/), an open-source monitoring and alerting toolkit, coupled with Grafana (https://grafana.com/docs/grafana/latest/fundamentals/), a powerful dashboarding and visualization tool, provides an excellent solution for collecting, visualizing, and acting on these metrics. Essential metrics such as framebuffer memory usage, GPU utilization, Tensor Core utilization, and SM clock frequencies serve as fundamental indicators of GPU consumption, offering invaluable insights into the performance and efficiency of graphics processing units, and thereby enabling us to reduce our COGS and improve operations.
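Once the setup in this post is complete, those essential indicators map onto counters exposed by the DCGM Exporter. The queries below are a sketch using standard DCGM metric names; the labels available depend on your scrape configuration, and profiling counters (the DCGM_FI_PROF_* family) may require additional DCGM configuration on your nodes.

```
# GPU utilization, percent, averaged across all exporter targets
avg(DCGM_FI_DEV_GPU_UTIL)

# Framebuffer memory used per exporter instance (MiB)
max by (instance) (DCGM_FI_DEV_FB_USED)

# SM clock frequency (MHz)
DCGM_FI_DEV_SM_CLOCK

# Tensor pipe activity ratio (profiling counter; may need extra DCGM config)
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
```

These can be run in the Azure Monitor workspace or used as panels in the Grafana dashboard described later in this post.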
Using Nvidia’s DCGM Exporter with Azure Managed Prometheus

The DCGM Exporter (https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html) is a tool developed by Nvidia to collect and export GPU metrics. It runs as a pod on Kubernetes clusters and gathers various metrics from Nvidia GPUs, such as utilization, memory usage, temperature, and power consumption. These metrics are crucial for monitoring and managing GPU performance. You can integrate this exporter with Azure Managed Prometheus. The sections below describe the steps and changes needed to deploy the DCGM Exporter successfully.

Prerequisites

Before jumping straight to the installation, ensure your AKS cluster meets the following requirements:

- GPU node pool: create a node pool (https://learn.microsoft.com/azure/aks/create-node-pools) with a VM SKU that includes GPU support.
- GPU driver: ensure the Nvidia GPU driver (https://learn.microsoft.com/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool) is running as a DaemonSet on your GPU nodes.
- Enable Azure Managed Prometheus and Azure Managed Grafana (https://learn.microsoft.com/azure/azure-monitor/containers/kubernetes-monitoring-enable?tabs=cli) on your AKS cluster.

Refactoring Nvidia DCGM Exporter for AKS: Code Changes and Deployment Guide

Updating API Versions and Configurations for Seamless Integration

As per the official documentation, the best way to get started with the DCGM Exporter is to install it using Helm. When installing on AKS with Managed Prometheus, you might encounter the error below:

Error: Installation Failed: Unable to build Kubernetes objects from release manifest: resource mapping not found for name: "dcgm-exporter-xxxxx" namespace: "default" from "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1". Ensure CRDs are installed first.
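For context, the error above is what you would typically see from the standard Helm install documented by Nvidia, sketched below. The repository URL is taken from Nvidia's DCGM Exporter documentation and may change, so treat this as an illustration rather than an authoritative command sequence.

```
# Standard install per Nvidia's docs; on AKS with Managed Prometheus this
# fails because the chart's ServiceMonitor targets monitoring.coreos.com/v1
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter
```

The resolution that follows modifies the chart so its custom resources target the CRDs that Azure Managed Prometheus actually deploys.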
To resolve this, make the following changes in the DCGM Exporter repository (https://github.com/NVIDIA/dcgm-exporter):

1. Clone the project: go to the GitHub repository of the DCGM Exporter and clone the project or download it to your local machine.

2. Navigate to the template folder: the code used to deploy the DCGM Exporter is located in the template folder within the deployment folder.

3. Modify the service-monitor.yaml file: update the apiVersion key in this file from monitoring.coreos.com/v1 to azmonitoring.coreos.com/v1. This change allows the DCGM Exporter to use the Azure Managed Prometheus CRD.

   apiVersion: azmonitoring.coreos.com/v1

4. Handle node selectors and tolerations: GPU node pools often have tolerations and node selector tags. Modify the values.yaml file in the deployment folder to handle these configurations:

   nodeSelector:
     accelerator: nvidia
   tolerations:
     - key: "sku"
       operator: "Equal"
       value: "gpu"
       effect: "NoSchedule"

Helm: Packaging, Pushing, and Installation on Azure Container Registry

We followed the Azure Container Registry documentation for Helm repositories (https://learn.microsoft.com/azure/container-registry/container-registry-helm-repos) for pushing and installing the package through Helm. For a comprehensive understanding, refer to that documentation. Here are the quick steps for installation. After making all the necessary changes in the deployment folder, stay in that directory to package the code, and log in to your registry to proceed.

1. Package the Helm chart and log in to your container registry:

   helm package .
   helm registry login <container-registry-url> --username $USER_NAME --password $PASSWORD

2. Push the Helm chart to the registry:

   helm push dcgm-exporter-3.4.2.tgz oci://<container-registry-url>/helm

3. Verify that the package has been pushed to the registry in the Azure portal.

4. Install the chart and verify the installation:

   helm install dcgm-nvidia oci://<container-registry-url>/helm/dcgm-exporter -n gpu-resources

   # Check the installation on your AKS cluster by running:
   helm list -n gpu-resources

   # Verify the DCGM Exporter:
   kubectl get po -n gpu-resources
   kubectl get ds -n gpu-resources

You can now confirm that the DCGM Exporter is running on the GPU nodes as a DaemonSet.

Exporting GPU Metrics and Configuring the Azure Managed Grafana Dashboard

Once the DCGM Exporter DaemonSet is running across all GPU node pools, you need to export the GPU metrics generated by this workload to Azure Managed Prometheus. This is accomplished by deploying a PodMonitor resource. Follow these steps:

1. Deploy the PodMonitor by applying the following YAML configuration:

   apiVersion: azmonitoring.coreos.com/v1
   kind: PodMonitor
   metadata:
     name: nvidia-dcgm-exporter
     labels:
       app.kubernetes.io/name: nvidia-dcgm-exporter
   spec:
     selector:
       matchLabels:
         app.kubernetes.io/name: nvidia-dcgm-exporter
     podMetricsEndpoints:
       - port: metrics
         interval: 30s
     podTargetLabels:

2. Check that the PodMonitor is deployed and running:

   kubectl get podmonitor -n <namespace>

3. Verify the metrics export: ensure that the metrics are being exported to Azure Managed Prometheus by navigating to the "Metrics" page of your Azure Monitor workspace in the portal.

Create the DCGM Dashboard on Azure Managed Grafana

The DCGM Exporter GitHub repository provides the JSON definition for the Grafana dashboard (https://github.com/NVIDIA/dcgm-exporter/blob/main/grafana/dcgm-exporter-dashboard.json). Follow the Azure Managed Grafana documentation (https://learn.microsoft.com/azure/managed-grafana/how-to-create-dashboard?tabs=azure-portal) to import this JSON into your Managed Grafana instance. After importing the JSON, the dashboard displaying GPU metrics will be visible in Grafana.

Introducing Query editor: Empowering Users with PromQL in Azure Monitor Metrics!
We're thrilled to announce the public preview of Query editor in Azure Monitor Metrics, a feature that allows customers to write and run PromQL queries directly within their Azure Monitor workspace (AMW). This long-awaited addition comes as a direct response to the growing demand from our customers, and we're excited to finally deliver this capability to you.

What’s new?

Unlocking the power of PromQL: Prometheus query language (PromQL) has emerged as a standard in the realm of monitoring and observability, offering users flexibility and expressiveness in querying metric data. With the Query editor in Azure Monitor Metrics, users can now harness the full potential of PromQL to derive actionable insights for their resources. Previously, users could not query in the Azure portal the Prometheus metrics that their AKS or Arc-enabled clusters sent to an Azure Monitor workspace via Azure Managed Prometheus. With this new capability, users can query Prometheus metrics for their AKS or Arc-enabled clusters directly in the Query editor within the portal.

Seamless querying experience: with the Query editor, users can compose and execute PromQL queries directly within the Azure Monitor workspace they are emitting metrics to. This streamlines the monitoring workflow, enabling users to stay focused and productive without the hassle of context switching while querying different types of metric data.

Benefits of Query editor with PromQL:

- Rich query language: PromQL offers a rich set of functions and operators for querying metric data, allowing users to perform complex aggregations, transformations, and calculations with ease.
- Familiarity and interoperability: for users familiar with Prometheus-based monitoring solutions, the Query editor provides a familiar environment for querying Azure metrics, facilitating a smoother transition and interoperability between platforms.

How it works?

Using the Query editor is simple.
Just navigate to your Azure Monitor workspace (AMW), select the Azure Monitor Metrics Query editor, and start writing your PromQL queries.

Get started today

The public preview of Query editor in Azure Monitor Metrics is now available, and we invite you to try it out and share your feedback with us. Your input is invaluable as we continue to refine and improve this feature to better serve your monitoring and analytics needs. Please note that the Query editor currently only supports querying metrics stored in an Azure Monitor workspace; we plan to offer support for platform metrics in the future.

- Query editor preview: https://aka.ms/queryEditorPreview
- Azure Monitor workspace overview: https://learn.microsoft.com/en-Us/azure/azure-monitor/essentials/azure-monitor-workspace-overview?tabs=azure-portal

Stay tuned for more updates and enhancements as we work towards delivering even more value to our valued Azure customers.

Operator/CRD support with Azure Monitor managed service for Prometheus is now Generally Available
We are excited to announce that custom resource definition (CRD) support with Azure Monitor managed service for Prometheus is now generally available. Azure Monitor managed service for Prometheus is a component of Azure Monitor Metrics that lets you collect and analyze metrics at scale using a Prometheus-compatible monitoring solution, based on the Prometheus project from the Cloud Native Computing Foundation. This fully managed service enables using the Prometheus query language (PromQL) to analyze and alert on the performance of monitored infrastructure and workloads.

What's new?

With this update, customers can customize scraping targets using custom resources (Pod Monitors and Service Monitors), similar to the OSS Prometheus Operator. Enabling the Managed Prometheus add-on in an AKS cluster deploys the Pod and Service Monitor custom resource definitions, allowing you to create your own custom resources. If you are already using Prometheus Service and Pod Monitors to collect metrics from your workloads, you can simply change the apiVersion in the Service/Pod Monitor definitions to use them with Azure Managed Prometheus. Previously, customers who did not have access to the kube-system namespace were not able to customize metrics collection. With this update, customers can create custom resources to enable custom configuration of scrape jobs in any namespace. This is especially useful in multitenancy scenarios where customers run workloads in different namespaces. Here is how a leading public sector Banking, Financial Services and Insurance (BFSI) company in India has used Service and Pod Monitor custom resources to enable monitoring of GPU metrics with Azure Managed Prometheus, the DCGM Exporter, and Azure Managed Grafana:

“Azure Monitor managed service for Prometheus provides a production-grade solution for monitoring without the hassle of installation and maintenance.
By leveraging these managed services, we can focus on extracting insights from our metrics and logs rather than managing the underlying infrastructure. The integration of essential GPU metrics, such as framebuffer memory usage, GPU utilization, Tensor Core utilization, and SM clock frequencies, into Azure Managed Prometheus and Grafana enhances the visualization of actionable insights. This integration facilitates a comprehensive understanding of GPU consumption patterns, enabling more informed decisions regarding optimization and resource allocation.”

- A leading public sector BFSI company in India

Get started today!

To use CRD support with Azure Managed Prometheus, enable the Managed Prometheus add-on on your AKS cluster. This automatically deploys the custom resource definitions (CRDs) for Service and Pod Monitors. To add Prometheus exporters that collect metrics from third-party workloads or other applications, and to see a list of workloads with curated configurations and instructions, see Integrate common workloads with Azure Managed Prometheus - Azure Monitor | Microsoft Learn. For more details, refer to this article or our documentation. We would love to hear from you: please share your feedback and suggestions in Azure Monitor · Community.
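As noted above, an existing Service Monitor can be pointed at Azure Managed Prometheus simply by changing its apiVersion. A minimal sketch follows; the application name, namespace, and port are hypothetical placeholders for your own workload, not values from this announcement.

```
apiVersion: azmonitoring.coreos.com/v1   # changed from monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor        # hypothetical name
  namespace: my-app           # any namespace works; kube-system access not required
spec:
  selector:
    matchLabels:
      app: my-app             # labels on the Service exposing your metrics
  endpoints:
    - port: metrics           # named port on the Service
      path: /metrics
      interval: 30s
```

Apart from the apiVersion, the resource shape follows the familiar OSS Prometheus Operator conventions, which is what makes existing monitors easy to carry over.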