Azure High Performance Computing (HPC) Blog

Centralized cluster performance metrics with ReFrame HPC and Azure Log Analytics

jimpaine, Microsoft
Feb 06, 2026

Imagine having several clusters across different environments (dev, test and prod), planning a migration from PBS to Slurm, or porting codes to a different system. Each of these can seem like a daunting task.

This is where the combination of ReFrame HPC, a powerful and feature-rich testing framework, and Azure Log Analytics can help improve confidence and assurance in the performance and accuracy of a system.

Here we will look at how to configure ReFrame HPC specifically for Azure: deploying the required Azure resources, running a test, and capturing the results in Log Analytics for analysis.

Deploying the required Azure Resources

First, deploy the required resources in Azure by using this Bicep from GitHub. The deployment includes the creation and configuration of everything required for ReFrame HPC: a data collection endpoint, a data collection rule and a Log Analytics workspace.
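As a minimal sketch, the deployment can be run with the Azure CLI. The template file name and resource group below are placeholders; substitute the Bicep file from the GitHub repository and your own resource group:

$ # deploy the Bicep template into an existing resource group
$ az deployment group create \
      --resource-group <resource-group> \
      --template-file main.bicep
$ # the outputs (including the ingestion URL) can be re-read later;
$ # the deployment name defaults to the template file name
$ az deployment group show \
      --resource-group <resource-group> \
      --name main \
      --query properties.outputs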

 

[Diagram: a Data Collection Endpoint and a Data Collection Rule feeding into a Log Analytics Workspace]

 

⚠️ Make sure to capture the URL from the Bicep output.

The structure of the endpoint URL needed later is complex, but the Bicep template generates it and outputs it at the end of the deployment, so capture it now.

Running ior via ReFrame HPC

To demonstrate running a test and capturing the results in Azure from start to finish, here is a simple ior test which runs both a write and a read operation against the shared storage.

import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class SimplePerfTest(rfm.RunOnlyRegressionTest):
    valid_systems = ["*"]
    valid_prog_environs = ["+ior"]
    executable = 'ior'
    executable_opts = [
        '-a POSIX -w -r -C -e -g -F -b 2M -t 2M -s 25600 -o /data/demo/test.bin -D 300'
    ]
    reference = {
        'tst:hbv4': {
            'write_bandwidth_mib': (500, -0.05, 0.1, 'MiB/s'),
            'read_bandwidth_mib': (350, -0.05, 0.5, 'MiB/s'),
        }
    }

    @sanity_function
    def validate_run(self):
        return sn.assert_found(r'Summary of all tests:', self.stdout)

    @performance_function('MiB/s')
    def write_bandwidth_mib(self):
        return sn.extractsingle(r'^write\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

    @performance_function('MiB/s')
    def read_bandwidth_mib(self):
        return sn.extractsingle(r'^read\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

 

Test explanation

Set the binary to be executed to ior, along with its arguments. In brief, the options request a POSIX, file-per-process run (-a POSIX -F) that writes then reads (-w -r) 2 MiB transfers in 2 MiB blocks (-t 2M -b 2M) across 25600 segments (-s 25600), with task reordering, fsync and barriers (-C -e -g), targeting /data/demo/test.bin with a 300-second stonewall limit (-D 300).

executable = 'ior'
executable_opts = [
    '-a POSIX -w -r -C -e -g -F -b 2M -t 2M -s 25600 -o /data/demo/test.bin -D 300'
]

 

Specify which systems the test should run on. Here, valid_systems = ["*"] allows any system, while valid_prog_environs = ["+ior"] restricts the test to programming environments that declare the ior feature. Look at the ReFrame HPC documentation to get a better understanding of the options available.

valid_systems = ["*"]
valid_prog_environs = ["+ior"]

 

Verify the stdout of the job by searching for a specific string to assert that it ran successfully.

@sanity_function
def validate_run(self):
    return sn.assert_found(r'Summary of all tests:', self.stdout)

 

If the sanity function passes, ReFrame HPC then extracts the performance metrics from the stdout of the job. The naming of the methods is important, as the method names become the metric names stored in the results later.

@performance_function('MiB/s')
def write_bandwidth_mib(self):
    return sn.extractsingle(r'^write\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

@performance_function('MiB/s')
def read_bandwidth_mib(self):
    return sn.extractsingle(r'^read\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

 

Performance references are used to determine whether the current cluster has met the requirement, with margins allowed in either direction. Each tuple is (reference value, lower threshold, upper threshold, unit), where the thresholds are fractions of the reference: (500, -0.05, 0.1, 'MiB/s') means a write bandwidth between 475 and 550 MiB/s passes. The 'tst:hbv4' key scopes these references to the system and partition named in the configuration.

reference = {
    'tst:hbv4': {
        'write_bandwidth_mib': (500, -0.05, 0.1, 'MiB/s'),
        'read_bandwidth_mib': (350, -0.05, 0.5, 'MiB/s'),
    }
}

 

ReFrame HPC Configuration

The ReFrame HPC configuration is key to determining how and where the test will run. It is also where the logic allowing ReFrame HPC to use Azure for centralized logging is defined. The full configuration file is vast and is covered in detail within the ReFrame HPC documentation. For the purpose of this test an example can be found on GitHub. Below is a breakdown of the key parts that allow ReFrame HPC to push its results into Azure Log Analytics.

Logging Handler

The most important part of this configuration is the logging section; without it, ReFrame HPC will not attempt to log the results. A handler_perflog of type httpjson is added to enable the logs to be sent to an HTTP endpoint, with the specific values covered below.

'logging': [
    {
        'perflog_multiline': True,
        'handlers_perflog': [
            {
                'type': 'httpjson',
                'url': 'REDACTED',
                'level': 'info',
                'debug': False,
                'extra_headers': {'Authorization': f'Bearer {_get_token()}'},
                'extras': {
                    'TimeGenerated': f'{datetime.now(timezone.utc).isoformat()}',
                    'facility': 'reframe',
                    'reframe_azure_data_version': '1.0',
                },
                'ignore_keys': ['check_perfvalues'],
                'json_formatter': _format_record
            }
        ]
    }
]

Multiline Perflog

To ensure this works with Azure, enable perflog_multiline, which sends a single record per metric to Log Analytics. This is the cleanest way to output the results: with it set to False, the metric names are moved into column names, which means the schema differs for each test and becomes hard to maintain.

Extra Headers

A bearer token is required to authenticate the request. ReFrame HPC allows headers to be added via the extra_headers property, so a simple Python function obtains a token scoped to Azure Monitor and appends it as the Authorization header. DefaultAzureCredential resolves whichever credential is available in the environment, such as the managed identity used here.

from azure.identity import DefaultAzureCredential

def _get_token(scope='https://monitor.azure.com/.default') -> str:
    credential = DefaultAzureCredential()
    token = credential.get_token(scope)
    return token.token

URL Structure

The URL can be found in the output of the Bicep deployment which was run previously. It can also be obtained via the portal. Here is the structure of the URL for reference.

'${dce.properties.logsIngestion.endpoint}/dataCollectionRules/${dcr.properties.immutableId}/streams/Custom-${table.name}?api-version=2023-01-01'
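Resolved, the URL follows the shape of the Azure Monitor Logs Ingestion API; the endpoint host, immutable ID and table name below are placeholders for illustration:

https://<dce-name>.<region>.ingest.monitor.azure.com/dataCollectionRules/<dcr-immutable-id>/streams/Custom-<TableName>?api-version=2023-01-01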

 

JSON Formatter

A small workaround is needed, as the Data Collection Rule expects an array of items while ReFrame HPC outputs a single record. To resolve this, another Python function simply wraps the record in an array. In this example it also tidies up the record, removing some items that are not required and would cause issues with the JSON serialization.

import json

def _format_record(record, extras, ignore_keys):
    data = {}
    for attr, val in record.__dict__.items():
        if attr in ignore_keys or attr.startswith('_'):
            continue
        data[attr] = val
    data.update(extras)
    return json.dumps([data])
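To illustrate the wrapping behaviour, here is a quick sketch using a stand-in record; the attribute names are hypothetical, not the exact fields ReFrame HPC emits:

from types import SimpleNamespace

# A stand-in for a log record: public attributes are kept,
# underscore-prefixed ones are dropped by the formatter.
rec = SimpleNamespace(check_name='SimplePerfTest',
                      check_perf_value=512.3,
                      _internal='dropped')
print(_format_record(rec, {'facility': 'reframe'}, []))
# -> [{"check_name": "SimplePerfTest", "check_perf_value": 512.3, "facility": "reframe"}]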

 

Running the Test

Now that the infrastructure has been deployed, the test has been defined and is correctly configured, we can run the test. 

Start by logging in. Here I am using the managed identity of the node, but user authentication and user-assigned managed identities are also supported.

$ az login --identity

 

ReFrame HPC can be installed via Spack or Python and, while I am using Spack for packages on the cluster, I find the simplest approach is to activate a Python environment and install ReFrame HPC along with test-specific Python dependencies.

$ python3 -m venv .venv
$ . .venv/bin/activate
$ python -m pip install -U pip
$ pip install -r requirements.txt

 

Now, using the ReFrame HPC CLI, the test can be run: -C points at the configuration file, -c at the test file, -r runs the selected tests, and --performance-report prints a summary of the captured performance values.

$ reframe -C config.py -c simple_perf.py --performance-report -r

 

ReFrame HPC will now run the test against the system/cluster defined in the configuration. For this example it is a Slurm cluster with a partition of HBv4 nodes, and running squeue confirms that.

$ squeue
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   955      hbv4 rfm_Simp jim.pain  R  0:28     1 tst4-hbv4-97

 

Results

And there we have it: results are now appearing in Azure! From here we can use KQL to query and filter the results. Only a subset of the values is shown here; the full records include a wide range of fields that are extremely helpful.
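As a starting point, here is a minimal KQL sketch; the table name ReFrame_CL and the column names are assumptions, so match them to the schema defined in your Data Collection Rule:

// Daily average bandwidth per metric over the last 30 days
ReFrame_CL
| where TimeGenerated > ago(30d)
| where check_perf_var in ('write_bandwidth_mib', 'read_bandwidth_mib')
| summarize avg_mib_s = avg(todouble(check_perf_value)) by check_perf_var, bin(TimeGenerated, 1d)
| render timechart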

Summary

By standardizing on the combination of ReFrame HPC and Azure Log Analytics for testing and reporting performance data across your clusters, whether Slurm-based, Azure CycleCloud or existing on-prem clusters, you can gain unprecedented visibility and confidence in the systems you manage and the codes you deploy that were previously hard to obtain, enabling:

  • 🔎 Fast cross-cluster comparisons
  • 📈 Trend analysis over long running periods
  • 📊 Standardized metrics regardless of scheduler or system
  • ☁️ Unified monitoring and reporting across clusters

 

 

ReFrame HPC is suitable for a wide range of testing, so if testing is something you have been looking to implement, take a look at ReFrame HPC.

 

 