Azure Monitor managed service for Prometheus provides a production-grade solution for monitoring without the hassle of installation and maintenance. By leveraging these managed services, we can focus on extracting insights from your metrics and logs rather than managing the underlying infrastructure.
The integration of essential GPU metrics—such as Framebuffer Memory Usage, GPU Utilization, Tensor Core Utilization, and SM Clock Frequencies—into Azure Managed Prometheus and Grafana enhances the visualization of actionable insights. This integration facilitates a comprehensive understanding of GPU consumption patterns, enabling more informed decisions regarding optimization and resource allocation.
Azure Managed Prometheus recently announced general availability of Operator and CRD support, which will enable customers to customize metrics collection and add scraping of metrics from workloads and applications using Service and Pod Monitors, similar to the OSS Prometheus Operator.
This blog will demonstrate how we leveraged the CRD/Operator support in Azure Managed Prometheus and used the Nvidia DCGM Exporter and Grafana to enable GPU monitoring.
As the use of GPUs has skyrocketed for deploying large language models (LLMs) for both inference and fine-tuning, monitoring these resources becomes critical to ensure optimal performance and utilization. Prometheus, an open-source monitoring and alerting toolkit, coupled with Grafana, a powerful dashboarding and visualization tool, provides an excellent solution for collecting, visualizing, and acting on these metrics.
Essential metrics such as Framebuffer Memory Usage, GPU Utilization, Tensor Core Utilization, and SM Clock Frequencies serve as fundamental indicators of GPU consumption, offering invaluable insights into the performance and efficiency of graphics processing units, and thereby enabling us to reduce our COGs and improve operations.
The DGCM Exporter is a tool developed by Nvidia to collect and export GPU metrics. It runs as a pod on Kubernetes clusters and gathers various metrics from Nvidia GPUs, such as utilization, memory usage, temperature, and power consumption. These metrics are crucial for monitoring and managing the performance of GPUs.
You can integrate this exporter with Azure Managed Prometheus. The section below in blog describes the steps and changes needed to deploy the DCGM Exporter successfully.
Before we jump straight to the installation, ensure your AKS cluster meets the following requirements:
As per the official documentation, the best way to get started with DGCM Exporter is to install it using Helm. When installing over AKS with Managed Prometheus, you might encounter the below error:
Error: Installation Failed: Unable to build Kubernetes objects from release manifest: resource mapping not found for name: "dcgm-exporter-xxxxx" namespace: "default" from "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1". Ensure CRDs are installed first.
To resolve this, follow these steps to make necessary changes in the DCGM code:
apiVersion: azmonitoring.coreos.com/v1
4. Handle Node Selectors and Tolerations: GPU node pools often have tolerations and node selector tags. Modify the values.yaml file in the deployment folder to handle these configurations:
nodeSelector:
accelerator: nvidia
tolerations:
- key: "sku"
operator: "Equal"
value: "gpu"
effect: "NoSchedule"
We followed the MS Learn documentation for pushing and installing the package through Helm on Azure Container Registry. For a comprehensive understanding, you can refer to the documentation. Here are the quick steps for installation:
After making all the necessary changes in the deployment folder on the source code, be on that directory to package the code. Log in to your registry to proceed further.
1. Package the Helm chart and login to your container registry:
helm package .
helm registry login <container-registry-url> --username $USER_NAME --password $PASSWORD
2. Push the Helm Chart to the Registry:
helm push dcgm-exporter-3.4.2.tgz oci://<container-registry-url>/helm
3. Verify that the package has been pushed to the registry on Azure portal.
4. Install the chart and verify the installation:
helm install dcgm-nvidia oci://<container-registry-url>/helm/dcgm-exporter -n gpu-resources
#Check the installation on your AKS cluster by running:
helm list -n gpu-resources
#Verify the DGCM Exporter:
Kubectl get po -n gpu-resources
Kubectl get ds -n gpu-resources
You can now check that the DGCM Exporter is running on the GPU nodes as a DaemonSet.
Once the DGCM Exporter DaemonSet is running across all GPU node pools, you need to export the GPU metrics generated by this workload to Azure Managed Prometheus. This is accomplished by deploying a PodMonitor resource. Follow these steps:
apiVersion: azmonitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: nvidia-dcgm-exporter
labels:
app.kubernetes.io/name: nvidia-dcgm-exporter
spec:
selector:
matchLabels:
app.kubernetes.io/name: nvidia-dcgm-exporter
podMetricsEndpoints:
- port: metrics
interval: 30s
podTargetLabels:
2. Check if the PodMonitor is deployed and running by executing:
kubectl get podmonitor -n <namespace>
3. Verify Metrics export: Ensure that the metrics are being exported to Azure Managed Prometheus on the portal by navigating to the "Metrics" page on your Azure Monitor Workspace.
The GitHub repository for the DGCM Exporter includes a JSON file for the Grafana dashboard. Follow the MS Learn documentation to import this JSON into your Managed Grafana instance.
After importing the JSON, the dashboard displaying GPU metrics will be visible on Grafana.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.