Profiling on HB Series with AMD uProf
Published Feb 01 2023 03:52 PM

Profiling can be performed on Azure HPC VMs with various tools. Today we are going to look at MPI application profiling with AMD uProf.

AMD uProf is a tool for performance and system analysis. It can gather time-based and instruction-based profiles, and it can trace and visualize MPI processes and threads.


Profiling is the analysis of an application’s execution through the measurement of system metrics. When profiling, hardware counters can be sampled to measure the occurrence and frequency of certain hardware events, such as L1 cache misses, or to derive metrics such as CPU cycles per instruction. Profiling can also include time-based measurements that relate to specific instructions in an application call stack.
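As a small illustration of a metric derived from hardware counters, cycles per instruction (CPI) is the ratio of elapsed CPU cycles to retired instructions. The sketch below only shows the arithmetic; the `cpi` helper is hypothetical and the counter values are made up for the example, not measured on any VM.

```shell
#!/usr/bin/env bash
# Hypothetical counter values, chosen only to illustrate the arithmetic.
cycles=3000000
instructions=2000000

# CPI = cycles / retired instructions; awk handles the floating-point division.
cpi() {
    awk -v c="$1" -v i="$2" 'BEGIN { printf "%.2f\n", c / i }'
}

cpi "$cycles" "$instructions"   # prints 1.50
```

A lower CPI generally means the core retires more work per cycle, which is why it is a common first metric to look at.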


The insight from profiling an application’s runtime can be used to improve latency, throughput, and scalability on HPC systems.


Modes of operation

AMD uProf is cross-platform. It provides a CLI as well as a GUI. The CLI can generate a CSV report that can be analyzed later. For this experiment, however, we will use the CLI to gather metrics and then switch over to the GUI to analyze and visualize the data.

Environment setup

Collecting and Generating Reports with CLI


Collect profile data:

PROF_CMD="AMDuProfCLI collect --config tbp -g --mpi --output-dir <output dir>"

mpirun <MPI flags> $PROF_CMD ./application.out

Collect MPI trace:

PROF_CMD="$AMD_PERF_DIR/AMDuProfCLI collect --trace mpi=openmpi,full -g --mpi --output-dir <output dir>"

mpirun <MPI flags> $PROF_CMD ./application.out

Generate report:

AMDuProfCLI report -g --detail --input-dir <output dir/prof dir>


To avoid delays caused by processing too much data, consider the following:

  • Collect profile data from only a few ranks, or specify a short duration.
  • Use an MPI config file to designate which ranks are profiled.
  • Perform the MPI trace separately from profile data collection.

Other options:

The --config flag can be used to provide a config file that designates which events should be sampled. Example config files are provided within the AMD uProf directory. The --config flag also accepts predefined arguments, each of which captures a preset list of metrics. Use the following to list them:

./AMDuProfCLI info --list collect-configs

Alternatively, the --events flag can be used to pick out performance monitoring unit (PMU) events to collect. The flag can be repeated to collect more than one PMU event. To view the list of PMU events, use the following:

./AMDuProfCLI info --list pmu-events

Profiling and Tracing WRF

  1. First, we will collect WRF profiling data.
    1. We want to specify what type of events to collect. We will use the predefined config options to collect time-based and CPI (cycles per instruction) profiles.
    2. To keep things neat, we will set an environment variable to a string holding the AMD uProf executable and its arguments:

PROF_CMD="./AMDuProfCLI collect --config tbp --config cpi -g --mpi --output-dir <output dir>"

  1. The -g option enables call stack sampling, and --mpi specifies that an MPI application is being profiled. (Refer to section 6.4 of the documentation for a complete list of options and descriptions.)
  2. Now that we have the environment variable, we can place it in our MPI command. For this I used an MPI config file to specify the ranks I want to profile.



-np 1 $PROF_CMD ./wrf.exe
-np 119 ./wrf.exe
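One way to produce such a config file is to generate it from a script, so that the number of profiled ranks becomes a parameter. This is a minimal sketch, not the method used for the runs in this post: the `write_appfile` function, its arguments, and the `/tmp/profout` output directory are hypothetical. It writes the profiler command into the file already expanded, on the assumption that mpirun does not perform shell variable expansion on the file's contents.

```shell
#!/usr/bin/env bash
# write_appfile FILE PROFILED TOTAL PROF_CMD EXE
# Writes an app file in which the first PROFILED ranks run under the
# profiler command and the remaining ranks run the executable directly.
write_appfile() {
    local file="$1" profiled="$2" total="$3" prof_cmd="$4" exe="$5"
    local rest=$(( total - profiled ))
    if (( profiled < 0 || rest < 0 )); then
        echo "profiled rank count must be between 0 and $total" >&2
        return 1
    fi
    : > "$file"   # truncate or create the file
    (( profiled > 0 )) && echo "-np $profiled $prof_cmd $exe" >> "$file"
    (( rest > 0 )) && echo "-np $rest $exe" >> "$file"
    return 0
}

# Example: profile 1 rank out of 120 (output directory is a placeholder).
PROF_CMD="./AMDuProfCLI collect --config tbp --config cpi -g --mpi --output-dir /tmp/profout"
appfile="${TMPDIR:-/tmp}/mpi_config.txt"
write_appfile "$appfile" 1 120 "$PROF_CMD" ./wrf.exe
cat "$appfile"
```

Raising the second argument profiles more ranks at the cost of more data to process in the report step.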


mpirun command:

mpirun --allow-run-as-root $PIN_PROCESSOR_LIST --rank-by slot -mca coll ^hcoll -x LD_LIBRARY_PATH -x PATH -x PWD --app mpi_config.txt


Note: The $PIN_PROCESSOR_LIST variable is a string like this:

"--bind-to cpulist:ordered --cpu-set 0,1,2,3,4,5,6,7,8"

which ensures that the ranks are pinned, in order, to the listed cores.
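Typing the core list by hand gets tedious for larger rank counts. Here is a minimal sketch that builds the same kind of string with a loop. The `make_pin_list` helper is hypothetical, and it assumes cores are numbered contiguously from 0, which may not be the pinning you want on a given VM size.

```shell
#!/usr/bin/env bash
# make_pin_list N: emit binding flags for cores 0..N-1, in the same form
# as the example string in the note above.
make_pin_list() {
    local n="$1"
    if ! [[ "$n" =~ ^[1-9][0-9]*$ ]]; then
        echo "core count must be a positive integer" >&2
        return 1
    fi
    local list="0" i
    for (( i = 1; i < n; i++ )); do
        list+=",$i"
    done
    echo "--bind-to cpulist:ordered --cpu-set $list"
}

PIN_PROCESSOR_LIST="$(make_pin_list 9)"
echo "$PIN_PROCESSOR_LIST"
# prints: --bind-to cpulist:ordered --cpu-set 0,1,2,3,4,5,6,7,8
```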

  1. When the WRF simulation is launched, it will report that profiling has begun and, later, that it has completed:





  1. Now we are ready to generate the report:

~/AMDuProf_Linux_x64_4.0.341/bin/AMDuProfCLI report -g --detail --input-dir …/profout/AMDuProf-wrf-Custom_MPI

  1. This will generate a report CSV, a CPU database, and a uProf session file.
  2. Collecting trace data follows the same steps, except for the CLI options used and the fact that all MPI ranks are traced.
  3. We can now compress the output directory containing the report data and transfer it to a commodity machine, where we can use the AMD uProf GUI.
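The compress-and-transfer step might look like the sketch below. The `pack_results` helper and the directory names are hypothetical, and the `scp` destination is a placeholder that is left commented out.

```shell
#!/usr/bin/env bash
# pack_results DIR: create DIR.tar.gz next to DIR and print the archive path.
pack_results() {
    local dir="${1%/}"   # drop a trailing slash, if any
    if [[ ! -d "$dir" ]]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    local archive="$dir.tar.gz"
    # -C keeps the paths inside the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")" || return 1
    echo "$archive"
}

# Example (paths and host are placeholders):
#   archive="$(pack_results ./profout)"
#   scp "$archive" user@workstation:~/
```

Archiving relative to the parent directory means the output unpacks into a single folder on the destination machine, which can then be opened from the GUI.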

Visualization and Analysis Overview

Please refer to the AMD uProf documentation (sections 5 and 5.5) for how to import profile data into the AMD uProf GUI and use it.

Summary Hot Spots view:



Analyze Function Hotspots



Analyze Metrics

Shows a similar call stack, but with filtering by process, thread, and module.

Analyze Flame Graph

Analyze Call Graph




Maps function calls to assembly instructions for a detailed view.


MPI flat profile

Time chart:







Support for hardware counters

Hardware counters are made visible inside the VM through the Virtual Performance Monitoring Unit (VPMU). VPMU is enabled on the following VM SKUs:

  • HC
  • HBv2
  • HBv3
  • HBv4/HX


Last update: Feb 03 2023 10:03 AM