There are workarounds, such as logging into each VM and running "watch nvidia-smi", but this simply does not scale and is complex to manage across an estate of machines or clusters.
So the request is: how can I do this simply, and get a clear visual of usage across my class or cohort?
Wouldn't it be great to have a single view of utilisation in some form of dashboard?
Well, now you can! Thanks to some Microsoft colleagues, Mathew Salvaris and Miguel Fierro, who have created an app for monitoring GPUs both on a single machine and across a cluster.
You can use it to record various GPU measurements over a specific period using the context-based loggers, or continuously using the gpumon CLI command. The context logger can record either to a file, which can be read back into a dataframe, or to an InfluxDB database.
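As a rough sketch of what the file-based context logger looks like in use (the import path, function name, and signature here are assumptions based on the description above, so do check the project's README for the definitive API):

```python
# Sketch only: the gpumon import path and log_context signature below are
# assumptions, not confirmed API; see the project's README for the real calls.
from gpumon.file import log_context

def train_model():
    # Placeholder for your actual GPU workload (e.g. a training loop).
    pass

# Record GPU measurements to a file for the duration of the block.
with log_context('gpu_log.txt'):
    train_model()

# gpu_log.txt now contains the recorded measurements, which the library
# can read back into a dataframe for analysis.
```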
Data in InfluxDB can then be accessed using the Python InfluxDB client, or viewed in real time using dashboards such as Grafana.
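For example, a minimal query with the Python InfluxDB client might look like this (the host, the database name gpudata, and the measurement name gpuseries are placeholders; use whatever names the logger was configured with):

```python
from influxdb import InfluxDBClient

# Connect to the InfluxDB instance the logger writes to.
# Host, database, and measurement names here are placeholders.
client = InfluxDBClient(host='localhost', port=8086, database='gpudata')

# Pull the last hour of recorded GPU measurements.
result = client.query("SELECT * FROM gpuseries WHERE time > now() - 1h")

# Print a few points to check what was captured.
for point in list(result.get_points())[:5]:
    print(point)
```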
They have a great example, available as a Jupyter notebook, which can be found in the project's GitHub repository.
Below is an example dashboard built using the InfluxDB log context and Grafana.