Contributed by: Rafael Salas
Moneo is Microsoft’s solution for characterizing hardware on distributed systems. To learn the basics of Moneo, please visit the previous blog post HERE.
We’ve recently expanded Moneo to include CPU, memory, and Ethernet metrics by default. This unlocks new optimization experiences for MPI workloads: we can observe how CPU and InfiniBand (IB) traffic interplay on a single node or a collection of nodes to see if the rank design should be changed. If the pre-configured metrics still don’t provide the time-series telemetry you need, we also offer custom metrics, which can be anything that can be polled on the node as a function call.
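To make the idea of a custom metric concrete, here is a minimal sketch of "anything that can be polled on the node as a function call". It is not Moneo's actual plugin interface: the names `read_load_average` and `poll_metrics` are hypothetical, and the load average is just a stand-in for whatever value you want to track.

```python
import os
import time
from typing import Callable, Dict, List, Tuple

# One (timestamp, value) pair per poll.
Sample = Tuple[float, float]


def read_load_average() -> float:
    """Hypothetical custom metric: the node's 1-minute load average (Unix only)."""
    return os.getloadavg()[0]


def poll_metrics(
    metrics: Dict[str, Callable[[], float]],
    samples: int,
    interval_s: float,
) -> Dict[str, List[Sample]]:
    """Call every metric function `samples` times and collect a small time series.

    This only illustrates the polling pattern; it is an assumption about how a
    custom metric could be sampled, not Moneo's own exporter code.
    """
    series: Dict[str, List[Sample]] = {name: [] for name in metrics}
    for i in range(samples):
        now = time.time()
        for name, poll in metrics.items():
            series[name].append((now, float(poll())))
        # Sleep between polls, but not after the last one.
        if i < samples - 1:
            time.sleep(interval_s)
    return series
```

Calling `poll_metrics({"load_1m": read_load_average}, samples=3, interval_s=1.0)` would return three timestamped readings for the hypothetical `load_1m` metric. Any other zero-argument function that returns a number could be registered the same way.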
Moneo works well alongside a compute cluster enabled with any job scheduler. A recommended solution in Azure is deploying a SLURM cluster using Azure CycleCloud and enabling Moneo (documentation for deployment resides HERE). SLURM offers a scheduling interface to the end user, Azure CycleCloud provides dynamic resources as needed by the SLURM scheduling environment, and Moneo takes us through the last mile by showing the underlying resource consumption of the jobs.
Let’s look at how managing and troubleshooting a deep learning workload might look using Moneo and SLURM in Azure. A managed environment is deployed in Azure using ND A100 v4 VMs, which are optimized for deep learning workloads. In Moneo, all the nodes emit telemetry to the database, and a Grafana-enabled web host serves the different cluster views. A neural-network training job lands on the cluster and uses NCCL collectives to synchronize the computation across the nodes. You can see the resource consumption cycle across compute iterations.
The primary cluster-wide view covers metrics such as GPU utilization, memory utilization, GPU power, and max throttle code. It is extremely useful when trying to understand the health of all the nodes at a glance. For example, the diagram below shows that the node circled in red is running hotter than the others. This lens can be narrowed further to show only the nodes that are working on a particular job, as defined by a host file.
Fig 1: Cluster wide view showing an outlier node
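The same "which node is running hotter than the rest?" question can also be asked of exported numbers. The sketch below is only an illustration under stated assumptions: the host file is taken to list one hostname per line (optionally followed by fields such as `slots=8`), the temperature readings are made-up inputs, and the function names and the 10 °C margin are hypothetical choices, not Moneo features.

```python
from statistics import median
from typing import Dict, Iterable, List


def parse_host_file(text: str) -> List[str]:
    """Return hostnames from host-file text: first token of each non-empty line.

    Assumes one host per line; anything after '#' is treated as a comment.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])
    return hosts


def hot_outliers(
    temps_c: Dict[str, float],
    hosts: Iterable[str],
    margin_c: float = 10.0,
) -> List[str]:
    """Flag job nodes whose temperature exceeds the job's median by more than margin_c."""
    # Narrow the view to nodes that belong to the job and have a reading.
    selected = {h: temps_c[h] for h in hosts if h in temps_c}
    if not selected:
        return []
    mid = median(selected.values())
    return sorted(h for h, t in selected.items() if t - mid > margin_c)


if __name__ == "__main__":
    host_file = "node1 slots=8\nnode2 slots=8\nnode3 slots=8\nnode4 slots=8\n"
    readings = {"node1": 55.0, "node2": 57.0, "node3": 78.0, "node4": 56.0, "node5": 90.0}
    # node5 is hot but not part of this job, so only node3 is reported.
    print(hot_outliers(readings, parse_host_file(host_file)))
```

Comparing against the median rather than the mean keeps a single very hot node from dragging the baseline up and hiding itself.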
The detailed view of Moneo helps narrow in on the specifics. For example, in the diagram below we can see that the outlier node above, while running hot, is not throttling. Moneo gives a good understanding of how well the system is performing.
Fig 2: Outlier node running hotter than others
Using Moneo, you can easily track down scenarios like a zombie process, i.e. the job has ended but a process is still running on a node. The primary cluster-wide view can help narrow down the specifics.
Fig 3: Zombie process
Looking at the detailed view below, we can easily identify the node that still has a process running. This is especially useful when running a workload on hundreds of nodes.
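At that scale, the check behind this view can be expressed as a simple rule: a node that no longer has a scheduled job but still shows GPU activity deserves a closer look. The sketch below assumes you already have per-GPU utilization for each node and the set of nodes the scheduler considers busy. The function name, the input shapes, and the 5% idle threshold are all hypothetical.

```python
from typing import Dict, Iterable, List


def nodes_with_leftover_work(
    gpu_util_pct: Dict[str, List[float]],
    busy_nodes: Iterable[str],
    idle_threshold_pct: float = 5.0,
) -> List[str]:
    """Return nodes with no scheduled job whose busiest GPU is above the idle threshold.

    gpu_util_pct maps a node name to its per-GPU utilization percentages.
    busy_nodes are the nodes the scheduler still reports as running a job.
    """
    busy = set(busy_nodes)
    suspects = []
    for node, utilization in gpu_util_pct.items():
        if node in busy or not utilization:
            continue
        if max(utilization) > idle_threshold_pct:
            suspects.append(node)
    return sorted(suspects)


if __name__ == "__main__":
    utilization = {
        "node1": [0.0, 0.0, 0.0, 0.0],    # idle after the job ended
        "node2": [0.0, 97.0, 0.0, 0.0],   # one GPU still busy: possible zombie
        "node3": [95.0, 96.0, 94.0, 97.0],  # busy, but a job is scheduled here
    }
    print(nodes_with_leftover_work(utilization, busy_nodes=["node3"]))
```

A flagged node is only a candidate. Confirming a zombie process still means checking the process list on that node.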
To learn more about Moneo and see it in action, come visit our booth at SC (2433).
#AzureHPCAI #MakeAIYourReality