Centralized cluster performance metrics with ReFrame HPC and Azure Log Analytics
Imagine having several clusters across different environments (dev, test and prod), planning a migration between PBS and Slurm, or porting codes to a different system. These can all seem like daunting tasks. This is where the combination of ReFrame HPC, a powerful and feature-rich testing framework, and Azure Log Analytics can help improve confidence in the performance and accuracy of a system. Here we will look at how to configure ReFrame HPC specifically for Azure: deploying the required Azure resources, running a test and capturing the results in Log Analytics for analysis.

Deploying the required Azure Resources

First, deploy the required resources in Azure using this Bicep from GitHub. The deployment creates and configures everything required for ReFrame HPC, including a data collection endpoint, a data collection rule and a Log Analytics workspace.

Running ior via ReFrame HPC

To demonstrate running a test and capturing the results in Azure from start to finish, here is a simple ior test that runs both a read and a write operation against the shared storage.

    import reframe as rfm
    import reframe.utility.sanity as sn

    @rfm.simple_test
    class SimplePerfTest(rfm.RunOnlyRegressionTest):
        valid_systems = ["*"]
        valid_prog_environs = ["+ior"]
        executable = 'ior'
        executable_opts = [
            '-a POSIX -w -r -C -e -g -F -b 2M -t 2M -s 25600 -o /data/demo/test.bin -D 300'
        ]
        reference = {
            'tst:hbv4': {
                'write_bandwidth_mib': (500, -0.05, 0.1, 'MiB/s'),
                'read_bandwidth_mib': (350, -0.05, 0.5, 'MiB/s'),
            }
        }

        @sanity_function
        def validate_run(self):
            return sn.assert_found(r'Summary of all tests:', self.stdout)

        @performance_function('MiB/s')
        def write_bandwidth_mib(self):
            return sn.extractsingle(r'^write\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

        @performance_function('MiB/s')
        def read_bandwidth_mib(self):
            return sn.extractsingle(r'^read\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

Test explanation

Set the binary to be executed to ior, along with its arguments:

    executable = 'ior'
    executable_opts = [
        '-a POSIX -w -r -C -e -g -F -b 2M -t 2M -s 25600 -o /data/demo/test.bin -D 300'
    ]

Specify which systems the test should run on. In this case, any system/cluster known to have ior available will be selected. See the ReFrame HPC documentation for a better understanding of the options available:

    valid_systems = ["*"]
    valid_prog_environs = ["+ior"]

Verify the stdout of the job by searching for a specific value to assert that it ran successfully:

    @sanity_function
    def validate_run(self):
        return sn.assert_found(r'Summary of all tests:', self.stdout)

If the sanity check passes, the performance metrics are then extracted from the stdout of the job. The naming of the methods is important, as it determines how the metrics are stored in the results later:

    @performance_function('MiB/s')
    def write_bandwidth_mib(self):
        return sn.extractsingle(r'^write\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

    @performance_function('MiB/s')
    def read_bandwidth_mib(self):
        return sn.extractsingle(r'^read\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

Performance references determine whether the current cluster has met the requirement. They also allow margins to be specified in either direction:

    reference = {
        'tst:hbv4': {
            'write_bandwidth_mib': (500, -0.05, 0.1, 'MiB/s'),
            'read_bandwidth_mib': (350, -0.05, 0.5, 'MiB/s'),
        }
    }
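The four-element reference tuple follows ReFrame's (target, lower_threshold, upper_threshold, unit) convention, where the thresholds are fractional deviations from the target. As a quick illustration (a standalone sketch, not part of the test itself), the write reference above translates into the following acceptance band:

    # Illustration of how ReFrame interprets a (target, lower, upper, unit) reference tuple.
    def acceptable_range(target, lower_frac, upper_frac):
        """Return the (min, max) band implied by fractional thresholds."""
        return target * (1 + lower_frac), target * (1 + upper_frac)

    lo, hi = acceptable_range(500, -0.05, 0.1)
    print(f"write_bandwidth_mib must land between {lo:.0f} and {hi:.0f} MiB/s")  # 475 and 550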
ReFrame HPC Configuration

The ReFrame HPC configuration is key to determining how and where the test will run. It is also where the logic that allows ReFrame HPC to use Azure for centralized logging is defined. The full configuration file is vast and is covered in detail in the ReFrame HPC documentation; for the purpose of this test, an example can be found on GitHub. Below is a breakdown of the key parts that allow ReFrame HPC to push its results into Azure Log Analytics.

Logging Handler

The most important part of this configuration is the logging section; without it, ReFrame HPC will not attempt to log the results. A handlers_perflog entry of type httpjson is added to enable the logs to be sent to an HTTP endpoint, with specific values which are covered below.

    'logging': [
        {
            'perflog_multiline': True,
            'handlers_perflog': [
                {
                    'type': 'httpjson',
                    'url': 'REDACTED',
                    'level': 'info',
                    'debug': False,
                    'extra_headers': {'Authorization': f'Bearer {_get_token()}'},
                    'extras': {
                        'TimeGenerated': f'{datetime.now(timezone.utc).isoformat()}',
                        'facility': 'reframe',
                        'reframe_azure_data_version': '1.0',
                    },
                    'ignore_keys': ['check_perfvalues'],
                    'json_formatter': _format_record
                }
            ]
        }
    ]

Multiline Perflog

To ensure this works with Azure, enable perflog_multiline. This ensures a single record per metric is sent to Log Analytics, which is the cleanest way to output the results. Setting it to False moves the metric names into column names, which means the schema will be different for each test and will become hard to maintain.

Extra Headers

A bearer token is required to authenticate the request. ReFrame HPC allows headers to be added via the extra_headers property, and a simple Python function obtains a scoped token that is appended in the Authorization header.

    def _get_token(scope='https://monitor.azure.com/.default') -> str:
        credential = DefaultAzureCredential()
        token = credential.get_token(scope)
        return token.token

Url Structure

The url can be found in the output of the Bicep deployment run previously. It can also be obtained via the portal. Here is the structure of the url for reference:

    '${dce.properties.logsIngestion.endpoint}/dataCollectionRules/${dcr.properties.immutableId}/streams/Custom-${table.name}?api-version=2023-01-01'

json Formatter

A small workaround is needed because the Data Collection Rule expects an array of items, while ReFrame HPC outputs a single record. Another Python function can be used which simply wraps the record in an array. In this example it also tidies up and removes some items that are not required and would cause issues with the JSON serialization.

    def _format_record(record, extras, ignore_keys):
        data = {}
        for attr, val in record.__dict__.items():
            if attr in ignore_keys or attr.startswith('_'):
                continue
            data[attr] = val
        data.update(extras)
        return json.dumps([data])

Running the Test

Now that the infrastructure has been deployed and the test has been defined and correctly configured, we can run the test. Start by logging in. Here I am using the managed identity of the node, but user auth and user-assigned managed identities are also supported.

    $ az login --identity

ReFrame HPC can be installed via Spack or Python and, while I am using Spack for packages on the cluster, I find the simplest approach is to activate a Python environment and install ReFrame HPC along with test-specific Python dependencies.

    $ python3 -m venv .venv
    $ . .venv/bin/activate
    $ python -m pip install -U pip
    $ pip install -r requirements.txt
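Before running the full test, it can be worth confirming that the token, URL and payload shape are accepted by the Logs Ingestion API. Below is a minimal smoke-test sketch; the endpoint, DCR immutable ID and stream name are placeholders to be replaced with the values emitted by the Bicep deployment, and the record fields mirror the extras used in the logging handler above.

    import json
    from datetime import datetime, timezone

    import requests
    from azure.identity import DefaultAzureCredential

    # Placeholders - substitute the values from the Bicep deployment output.
    url = ("https://<dce-endpoint>.ingest.monitor.azure.com"
           "/dataCollectionRules/<dcr-immutable-id>/streams/Custom-<table-name>"
           "?api-version=2023-01-01")

    token = DefaultAzureCredential().get_token('https://monitor.azure.com/.default').token
    payload = [{
        'TimeGenerated': datetime.now(timezone.utc).isoformat(),
        'facility': 'reframe',
        'reframe_azure_data_version': '1.0',
    }]

    # The Logs Ingestion API expects a JSON array plus a bearer token.
    resp = requests.post(
        url,
        headers={'Authorization': f'Bearer {token}', 'Content-Type': 'application/json'},
        data=json.dumps(payload),
    )
    print(resp.status_code)  # 204 normally indicates the record was accepted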
Now, using the ReFrame HPC CLI, the test can be run with the configuration file and the test file:

    $ reframe -C config.py -c simple_perf.py --performance-report -r

ReFrame HPC will now run the test against the system/cluster defined in the configuration. For this example it is a Slurm cluster with a partition of HBv4 nodes, and running squeue confirms that:

    $ squeue
        JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          955      hbv4 rfm_Simp jim.pain  R       0:28      1 tst4-hbv4-97

Results

And there we have it: results are now appearing in Azure. From here we can use KQL to query and filter the results. This is just a subset of the values available; the dataset is vast and includes a huge range of values that are extremely helpful.
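If you prefer to pull these results programmatically rather than through the portal, the azure-monitor-query package can run the same KQL from Python. A minimal sketch follows; the workspace ID and the custom table name (shown here as a hypothetical ReFrame_CL) need to be replaced with the values from your deployment.

    from datetime import timedelta

    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import LogsQueryClient

    client = LogsQueryClient(DefaultAzureCredential())

    workspace_id = "<log-analytics-workspace-id>"  # placeholder
    query = """
    ReFrame_CL
    | where TimeGenerated > ago(7d)
    | order by TimeGenerated desc
    | take 20
    """

    response = client.query_workspace(workspace_id, query, timespan=timedelta(days=7))
    for table in response.tables:
        print(table.columns)
        for row in table.rows:
            print(list(row))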
Summary

By standardizing on the combination of ReFrame HPC and Azure Log Analytics for testing and reporting of performance data across your clusters, whether Slurm based, Azure CycleCloud or existing on-premises systems, you can gain visibility into and confidence in the systems you manage and the codes you deploy that were previously hard to obtain. This enables:

- Fast cross-cluster comparisons
- Trend analysis over long running periods
- Standardized metrics regardless of scheduler or system
- Unified monitoring and reporting across clusters

ReFrame HPC is suitable for a wide range of testing, so if testing is something you have been looking to implement, take a look at ReFrame HPC.

Comprehensive Nvidia GPU Monitoring for Azure N-Series VMs Using Telegraf with Azure Monitor

Unlocking Nvidia GPU monitoring for Azure N-series VMs with Telegraf and Azure Monitor. In the world of AI and HPC, optimizing GPU performance is critical for avoiding inefficiencies that can bottleneck workflows and drive up costs. While Azure Monitor tracks key resources such as CPU and memory, it falls short on native GPU monitoring for Azure N-series VMs. Enter Telegraf, a powerful tool that integrates seamlessly with Azure Monitor to bridge this gap. In this blog, discover how to harness Telegraf for comprehensive GPU monitoring and ensure your GPUs perform at peak efficiency in the cloud.
Scaling physics-based digital twins: Neural Concept on Azure delivers a New Record in Industrial AI

Automotive Design and the DrivAerNet++ Benchmark

In automotive design, external aerodynamics have a direct impact on performance, energy efficiency, and development cost. Even small reductions in drag can translate into significant fuel savings or extended EV range. As development timelines accelerate, engineering teams increasingly rely on data-driven methods to augment or replace traditional CFD workflows.

MIT's DrivAerNet++ dataset is the largest open multimodal dataset for automotive aerodynamics, offering a large-scale benchmark for evaluating learning-based approaches that capture the physical signals required by engineers. It includes 8,000 vehicle geometries across three variants (fastback, notchback and estate-back) and aggregates 39 TB of high-fidelity CFD outputs such as surface pressure, wall shear stress, volumetric flow fields, and drag coefficients.

Benchmark Highlights

Neural Concept trained its geometry-native Geometric Regressor, designed to handle any type of engineering data. The benchmark was executed on Azure HPC infrastructure to evaluate the capabilities of the geometry-native platform under transparent, scalable, and fully reproducible conditions.

- Surface pressure: lowest prediction error recorded on the benchmark, revealing where high- and low-pressure zones form.
- Wall shear stress: outperforms all competing methods in detecting flow attachment and separation for drag and stability control.
- Volumetric velocity field: more than 50% lower error than the previous best, capturing full flow structure for wake stability analysis.
- Drag coefficient Cd: R² of 0.978 on the test set, accurate enough for early design screening without full CFD runs.
- Dataset scale and ingestion: 39 TB of data was ingested into Neural Concept's platform through a parallel conversion task with 128 workers and 5 GB of RAM each, which finished in about one hour and produced a compact 3 TB dataset in the platform's native format.
- Data pre-processing: pre-processing the dataset required both large-scale parallelization and the application of our domain-specific best practices for handling external aerodynamics workflows.
- Model training and deployment: training completed in 24 hours on 4 A100 GPUs, with the best model obtained after 16 hours. The final model is compact, and real-time predictions can be served on a single 16 GB GPU for industrial use.

Neural Concept outperformed all other competing methods, achieving state-of-the-art prediction performance on all metrics and physical quantities within a week.

"Neural Concept's breakthrough demonstrates the power of combining advanced AI with the scalability of Microsoft Azure," said Jack Kabat, Partner, Azure HPC and AI Infrastructure Products, Microsoft. "By running training and deployment on Azure's high-performance infrastructure, specifically the NC A100 virtual machines, Neural Concept was able to transform 39 terabytes of data into a production-ready workflow in just one week. This shows how Azure accelerates innovation and helps automotive manufacturers bring better products to market faster."

For additional benchmark metrics and comparisons, please refer to the Detailed Quantitative Results section at the end of the article.

From State-of-the-Art Benchmark Accuracy to Proven Industrial Impact

Model accuracy alone is necessary, but not sufficient, for industrial impact. Transformative gains at scale and over time are only revealed once high-performing models are deployed into maintainable and repeatable workflows across organizations.
Customers using Neural Concept's platform have achieved:

- 30% shorter design cycles
- $20M in savings on a 100,000-unit vehicle program

These outcomes fundamentally result from a transformed, systematic approach to design, unlocking better and faster data-driven decisions. The Design Lab interface, described in the next section, is at the core of this transformation. Within Neural Concept's ecosystem, validated geometry and physics models can be deployed directly into the Design Lab, a collaborative environment where aerodynamicists and designers evaluate concepts in real time. AI copilots provide instant performance feedback, geometry-aware improvement suggestions, and live KPI updates, effectively reconnecting aerodynamic analysis with the pace of modern vehicle design.

CES 2026: See how OEMs are transforming product development with Engineering Intelligence

Neural Concept and Microsoft will showcase how AI-native aerodynamic workflows can reshape vehicle development, from real-time design exploration to enterprise-scale deployment. Visit the Microsoft booth to see DrivAerNet++ running on Azure HPC and meet the teams shaping the future of automotive engineering. Neural Concept's executive team will also be at CES to share flagship results achieved by leading OEMs and Tier-1 suppliers already using the platform in production. Learn more at: https://www.neuralconcept.com/ces-2026

Credits

Microsoft: Hugo Meiland (Principal Program Manager), Guy Bursell (Director Business Strategy, Manufacturing), Fernando Aznar Cornejo (Product Marketing Manager) and Dr. Lukasz Miroslaw (Sr. Industry Advisor)
Neural Concept: Theophile Allard (CTO), Benoit Guillard (Senior ML Research Scientist), Alexander Gorgin (Product Marketing Engineer), Konstantinos Samaras-Tsakiris (Software Engineer)

Detailed Quantitative Results

In the sections that follow, we share the results obtained by applying Neural Concept's aerodynamics predictive model training template to DrivAerNet++. We evaluated our model's prediction errors using the official train/test split and the standard evaluation strategy. For comparison, metrics from other methods were taken from the public leaderboard. We report both Mean Squared Error (MSE) and Mean Absolute Error (MAE) to quantify prediction accuracy. Lower values for either metric indicate closer agreement with the ground-truth simulations, meaning better predictions.

1. Surface Field Predictions: Pressure and Wall Shear Stress

We began by evaluating predictions for the two physical quantities defined on the vehicle surface.

Surface Pressure

The Geometric Regressor achieved substantially better performance than all existing methods in predicting the surface pressure distribution.

    Rank  Deep Learning Model         MSE (×10⁻², lower = better)  MAE (×10⁻¹, lower = better)
    #1    Neural Concept              3.98                         1.08
    #2    GAOT (May 2025)             4.94                         1.10
    #3    FIGConvNet (February 2025)  4.99                         1.22
    #4    TripNet (March 2025)        5.14                         1.25
    #5    RegDGCNN (June 2024)        8.29                         1.61

Table 1: Neural Concept's Geometric Regressor predicts surface pressure more accurately than previously published state-of-the-art methods. The dates indicate when the competing model architectures were published.

Figure 1: Side-by-side comparison of the ground truth pressure field (left), Neural Concept model's prediction (middle), and the corresponding error for a representative test sample (right).

Wall Shear Stress

Similarly, the model delivered top-tier results, outperforming all competing methods.
    Rank  Deep Learning Model       MSE (×10⁻², lower = better)  MAE (×10⁻¹, lower = better)
    #1    Neural Concept            7.80                         1.44
    #2    GAOT (May 2025)           8.74                         1.57
    #3    TripNet (March 2025)      9.52                         2.15
    #4    FIGConvNet (Feb. 2025)    9.86                         2.22
    #5    RegDGCNN (June 2024)      13.82                        3.64

Table 2: Neural Concept's Geometric Regressor predicts wall shear stress more accurately than previously published state-of-the-art methods.

Figure 2: Side-by-side comparison of the ground truth magnitude of the wall shear stress, Neural Concept model's prediction, and the corresponding error for a representative test sample.

Across both surface fields (pressure and wall shear stress), the Geometric Regressor achieved the lowest MSE and MAE by a clear margin. The baseline methods represent several recent, high-quality academic works (the earliest from June 2024), yet our architecture established a new state of the art in predictive performance.

2. Volumetric Predictions: Velocity

Beyond surface quantities, DrivAerNet++ provides 3D velocity fields in the flow volume surrounding the vehicle, which we also predicted using the Geometric Regressor.

    Rank  Deep Learning Model    MSE (lower = better)  MAE (×10⁻¹, lower = better)
    #1    Neural Concept         3.11                  9.22
    #2    TripNet (March 2025)   6.71                  15.2

Table 3: Neural Concept's Geometric Regressor predicts velocity more accurately than the previously published state-of-the-art method.

The illustration below shows the velocity magnitude for two test samples. Note that only a single 2D slice of the 3D volumetric domain is shown here, focusing on the wake region behind the car. In practice, the network predicts velocity at any location within the full 3D domain, not just on this slice.

Figure 3: Velocity magnitude for two test samples, arranged in two columns (left and right). For each sample, the top row displays the simulated velocity field, the middle row shows the prediction from the network, and the bottom row presents the error between the two.

3. Scalar Predictions: Drag Coefficient

The drag coefficient (Cd) is the most critical parameter in automotive aerodynamics, as reducing it directly translates to lower fuel consumption in combustion vehicles and increased range in electric vehicles. Using the same underlying architecture, our model achieved state-of-the-art performance in Cd prediction. In addition to MSE and MAE, we report the Maximum Absolute Error (Max AE) to reflect worst-case accuracy. We also include the Coefficient of Determination (R² score), which measures the proportion of variance explained by the model; an R² value of 1 indicates a perfect fit to the target data.

    Rank  Deep Learning Model  MSE (×10⁻⁵)  MAE (×10⁻³)  Max AE (×10⁻²)  R²
    #1    Neural Concept       0.8          2.22         1.13            0.978
    #2    TripNet              9.1          7.19         7.70            0.957
    #3    PointNet             14.9         9.60         12.45           0.643
    #4    RegDGCNN             14.2         9.31         12.79           0.641
    #5    GCNN                 17.1         10.43        15.03           0.596

On the official split, the model shows tight agreement with CFD (R² of 0.978) across the test set, which is sufficient for early design screening where engineers need to rank variants confidently and spot meaningful gains without running full simulations for every change.
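For reference, the error metrics quoted in these tables can be reproduced with a few lines of NumPy. This is a generic illustration of the metric definitions, not Neural Concept's evaluation code; the y_true and y_pred arrays stand in for CFD ground truth and model predictions.

    import numpy as np

    def regression_metrics(y_true, y_pred):
        """MSE, MAE, worst-case absolute error and R², as reported in the benchmark tables."""
        err = y_pred - y_true
        mse = np.mean(err ** 2)
        mae = np.mean(np.abs(err))
        max_ae = np.max(np.abs(err))
        r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
        return mse, mae, max_ae, r2

    # Toy example with made-up drag coefficients.
    y_true = np.array([0.280, 0.295, 0.310, 0.330])
    y_pred = np.array([0.282, 0.293, 0.312, 0.327])
    print(regression_metrics(y_true, y_pred))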
4. Compute Efficiency and Azure HPC & AI Collaboration

Executing the full DrivAerNet++ benchmark at industrial scale required Neural Concept's full software and infrastructure stack combined with seamless cloud integration on Microsoft Azure to dynamically scale computing resources on demand. The entire pipeline runs natively on Microsoft Azure and can scale within minutes, allowing us to process new industrial datasets that contain thousands of geometries without complex capacity planning.

Dataset Scale and Ingestion

The DrivAerNet++ dataset contains 8,000 car designs along with their corresponding CFD simulations. The raw dataset occupies approximately 39 TB of storage; generating the simulations required a total of about 3 million CPU hours by MIT's DeCoDE Lab. Ingestion into Neural Concept's platform is the first step of the pipeline. To convert the raw data, we use a Conversion task that transforms raw files into the platform's optimized native format. This task was parallelized across 128 workers, each allocated 5 GB of RAM, and the entire conversion process completed in approximately one hour. After converting the relevant data (car geometry, wall shear stress, pressure, and velocity), the full dataset occupies approximately 3 TB in Neural Concept's native format.

Data Pre-Processing

Pre-processing the dataset required both large-scale parallelization and the application of our domain-specific best practices. During this phase, workloads were distributed across multiple compute nodes, with peak memory usage reaching approximately 1.5 TB of RAM. The pre-processing pipeline consists of two main stages. In the first stage, we repaired the car meshes and pre-computed geometric features needed for training. The second stage involved filtering the volumetric domain and re-sampling points to follow a spatial distribution that is more efficient for training our deep learning model. We scaled the compute resources so that each of the two stages completes in 1 to 3 hours when processing the full dataset. The first stage is the most computationally intensive; to handle it efficiently, we parallelized the task across 256 independent workers, each allocated 6 GB of RAM.

Model Training and Deployment

While we use state-of-the-art hardware for training, our performance gains come primarily from model design. Once trained, the model remains lightweight and cost-effective to run. Training was performed on an Azure Standard_NC96ads_A100_v4 node, which provided access to four A100 GPUs, each with 80 GB of memory. The model was trained for approximately 24 hours. Neural Concept's Geometric Regressor achieved the best reported performance on the official benchmark for surface pressure, wall shear stress, volumetric velocity and drag prediction.
mpi-stage: High-Performance File Distribution for HPC Clusters

When running containerized workloads on HPC clusters, one of the first problems you hit is getting container images onto the nodes quickly and repeatably. A .sqsh file is a Squashfs image (commonly used by container runtimes on HPC). In some environments you can run a Squashfs image directly from shared storage, but at scale that often turns the shared filesystem into a hot spot. Copying the image to local NVMe keeps startup time predictable and avoids hundreds of nodes hammering the same source during job launch. In this post, I'll introduce mpi-stage, a lightweight tool that uses MPI broadcasts to distribute large files across cluster nodes at speeds that can saturate the backend network.

The Problem: Staging Files at Scale

On an Azure CycleCloud Workspace for Slurm cluster with GB300 GPU nodes, I needed to stage a large Squashfs container image from shared storage onto each node's local NVMe storage before launching training jobs. At small scale you can often get away with ad-hoc copies, but once hundreds of nodes are all trying to read the same source file, the shared source filesystem quickly becomes the bottleneck. I tried several approaches.

Attempt 1: Slurm's sbcast

Slurm's built-in sbcast seemed like the natural choice. In my quick testing it was slower than I wanted, and the overwrite/skip-existing behavior didn't match the "fast no-op if already present" workflow I was after. I didn't spend much time exploring all the configuration options before moving on.

Attempt 2: Shell Script Fan-Out

I wrote a shell script using a tree-based fan-out approach: copy to N nodes, then each of those copies to N more, and so on. This worked and scaled reasonably, but had some drawbacks:

- Multiple stages: the script required orchestrating multiple rounds of copy commands, adding complexity.
- Source filesystem stress: even with fan-out, the initial copies still hit the source filesystem simultaneously; a fan-out of 4 meant 4 nodes competing for source bandwidth.
- Frontend network: copies went over the Ethernet network by default. I could have configured IPoIB, but that added more setup.

The Solution: MPI Broadcasts

The key insight was that MPI's broadcast primitive (MPI_Bcast) is specifically optimized for one-to-many data distribution. Modern MPI implementations like HPC-X use tree-based algorithms that efficiently utilize the high-bandwidth, low-latency InfiniBand network. With mpi-stage:

- Single source read: only one node reads from the source filesystem.
- Backend network utilization: data flows over InfiniBand using optimized MPI collectives.
- Intelligent skipping: nodes that already have the file (verified by size or checksum) skip the copy entirely.

Combined, this keeps the shared source (NFS, Lustre, blobfuse, etc.) from being hammered by many concurrent readers while still taking full advantage of the backend fabric.

How It Works

mpi-stage is designed around a simple workflow: the source node reads the file in chunks and streams each chunk via MPI_Bcast, and destination nodes write each chunk to local storage immediately upon receipt. This streaming approach means the entire file never needs to fit in memory; only a small buffer is required.
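The same chunked-broadcast pattern can be sketched in a few lines of mpi4py. This is only an illustration of the idea, not the mpi-stage implementation itself (which is built on the C MPI API and adds validation and double buffering); broadcasting the file size first lets every rank compute identical chunk counts, so all ranks stay in lockstep through the loop.

    # Minimal chunked file broadcast with mpi4py: rank 0 reads, the other ranks write.
    # Run with one rank per node, e.g. srun --ntasks-per-node=1 python bcast_file.py src dest
    import os
    import sys
    from mpi4py import MPI

    CHUNK = 64 * 1024 * 1024  # 64 MiB per broadcast

    def stage(source, dest):
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        size = os.path.getsize(source) if rank == 0 else None
        size = comm.bcast(size, root=0)  # every rank learns the file size

        buf = bytearray(CHUNK)
        src = open(source, 'rb') if rank == 0 else None
        dst = open(dest, 'wb') if rank != 0 else None

        sent = 0
        while sent < size:
            n = min(CHUNK, size - sent)
            if rank == 0:
                src.readinto(memoryview(buf)[:n])       # source node reads a chunk
            comm.Bcast([buf, n, MPI.BYTE], root=0)      # stream the chunk to all ranks
            if rank != 0:
                dst.write(memoryview(buf)[:n])          # destination nodes write immediately
            sent += n

        if src is not None:
            src.close()
        if dst is not None:
            dst.close()

    if __name__ == '__main__':
        stage(sys.argv[1], sys.argv[2])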
Key Features

Pre-copy Validation

Before any data is transferred, each node checks whether the destination file already exists and matches the source. You can choose between:

- Size check (default): a fast comparison of file sizes, sufficient for most use cases.
- Checksum: stronger validation, but it requires reading the full file and is therefore slower.

If all nodes already have the correct file, mpi-stage completes in milliseconds with no data transfer.

Double-Buffered Transfers

The implementation uses double-buffered, chunked transfers to overlap network communication with disk I/O. While one buffer is being broadcast, the next chunk is being read from the source.

Post-copy Validation

Optionally verify that all nodes received the file correctly after the copy completes.

Single-Writer Per Node

The tool enforces one MPI rank per node to prevent filesystem contention and ensure predictable performance.

Real-World Performance

In one run using 156 GPU nodes, distributing a container image achieved approximately 3 GB/s effective distribution rate (file_size/time), completing in just over 5 seconds:

    [0] Copy required: yes
    [0] Starting copy phase (source writes: yes)
    [0] Copy complete, Bandwidth: 3007.14 MB/s
    [0] Post-validation complete
    [0] Timings (s):
        Topology check:   5.22463
        Source metadata:  0.00803746
        Pre-validation:   0.0046786
        Copy phase:       5.21189
        Post-validation:  2.2944e-05
        Total time:       5.2563

Because every node writes the file to its own local NVMe, the cumulative write rate across the cluster is roughly this number times the node count: ~3 GB/s × 156 ≈ ~468 GB/s of total local writes.

Workflow: Container Image Distribution

The primary use case is distributing Squashfs images to local NVMe before launching containerized workloads. Run mpi-stage as a job step before your main application:

    #!/bin/bash
    #SBATCH --job-name=my-training-job
    #SBATCH --ntasks-per-node=1
    #SBATCH --exclusive

    # Stage the container image
    srun --mpi=pmix ./mpi_stage \
        --source /shared/images/pytorch.sqsh \
        --dest /nvme/images/pytorch.sqsh \
        --pre-validate size \
        --verbose

    # Run the actual job (from local NVMe - much faster!)
    srun --container-image=/nvme/images/pytorch.sqsh ...

mpi-stage will create the destination directory if it doesn't exist. If your container runtime supports running the image directly from shared storage, you may not strictly need this step, but staging to local NVMe tends to be faster and more predictable at large scale. Because of the pre-validation, you can include this step in every job script without penalty: if the image is already present, it completes in milliseconds.

Getting Started

    git clone https://github.com/edwardsp/mpi-stage.git
    cd mpi-stage
    make

For detailed usage and options, see the README.

Summary

mpi-stage started as a solution to a very specific problem (staging large container images efficiently across a large GPU cluster), but the same pattern may be useful in other scenarios where many nodes need the same large file. By using MPI broadcasts, only a single node reads from the source filesystem, while data is distributed over the backend network using optimized collectives. In practice, this can significantly reduce load on shared filesystems and cloud-backed mounts, such as Azure Blob Storage accessed via blobfuse2, where hundreds of concurrent readers can otherwise become a bottleneck. While container images were the initial focus, this approach could also be applied to staging training datasets, distributing model checkpoints or pretrained weights, or copying large binaries to local NVMe before a job starts. Anywhere a "many nodes, same file" pattern exists is a potential fit.
If you're running large-scale containerized workloads on Azure HPC infrastructure, give it a try. If you use mpi-stage in other workflows, I'd love to hear what worked (and what didn't). Feedback and contributions are welcome. Have questions or feedback? Leave a comment below or open an issue on GitHub.
Azure V710 v5 Series - AMD Radeon GPU - Validation of Siemens CAD NX

Overview of Siemens NX

Siemens NX is a next-generation integrated CAD/CAM/CAE platform used by aerospace, automotive, industrial machinery, energy, medical, robotics, and defense manufacturers. It spans:

- Complex 3D modeling
- Assemblies containing thousands to millions of parts
- Surfacing and composites
- Tolerance engineering
- CAM and machining simulation
- Integrated multiphysics through Simcenter / NX Nastran

Because NX is used to design real-world engineered systems (aircraft structures, automotive platforms, satellites, robotic arms, injection molds), its usability and performance directly affect engineering velocity and product timelines.

NX Needs GPU Acceleration

NX is highly visual. It leans heavily on:

- OpenGL acceleration
- Shader-based rendering
- Hidden line removal
- Real-time shading / material rendering
- Ray Traced Studio for photorealistic output
- Switching shading modes: CAD content must stay readable
- Zoom, section, annotate: requires stable frame pacing

NVads V710 v5-Series on Azure

The NVads V710 v5-series virtual machines on Azure are designed for GPU-accelerated workloads and virtual desktop environments. Key highlights:

Hardware specs:
- GPU: AMD Radeon Pro V710 (up to 24 GiB frame buffer; fractional GPU options available)
- CPU: AMD EPYC 9V64F (Genoa) with SMT, base frequency 3.95 GHz, peak 4.3 GHz
- Memory: 16 GiB to 160 GiB
- Storage: NVMe-based ephemeral local storage supported

VM sizes:
- Range from Standard_NV4ads_V710_v5 (4 vCPUs, 16 GiB RAM, 1/6 GPU) to Standard_NV28adms_V710_v5 (28 vCPUs, 160 GiB RAM, full GPU)

Supported features:
- Premium storage, accelerated networking, ephemeral OS disk
- Both Windows and Linux VMs supported
- No additional GPU licensing required

AMD Radeon PRO GPUs offer:
- An optimized OpenGL professional driver stack
- Stable interactive performance with large assemblies

Business Scenarios Enabled by NX + Cloud GPU

- Engineering anywhere: distributed teams can securely work on the same assemblies from any geographic region.
- Supplier ecosystem collaboration: Tier-1/2 manufacturers and engineering partners can access controlled models without local high-end workstations.
- Secure IP protection: data stays in Azure; files never leave the controlled workspace.
- Faster engineering cycles: visualization and simulation accelerate design reviews, decision making, and manufacturability evaluations.
- Scalable cost model: pay for compute only when needed, ideal for burst design cycles and testing workloads.

Architecture Overview: Siemens NX on Azure NVads_v710

Key architecture elements:

- Create the Azure virtual machine (NVads_v710_24)
- Install the Azure AMD V710 GPU drivers
- Deploy Azure file-based storage hosting assemblies, metadata, drawing packages, PMI, and simulation data
- Configure a VNet with accelerated networking
- Install NX licenses and software
- Install the NXCP and ATS test suites on the virtual machine

Qualitative Benchmark on Azure NVads_v710_24

Siemens has approved the following qualitative test results. The certification matrix update is currently in progress.

Technical variant: complex assemblies with thousands of components maintained smooth rotation, zooming, and selection, even under concurrent session load.

NXCP and ATS test results on NVads_v710_24

Non-interactive test results (note: execution times in seconds): ATS non-interactive test results validate the correctness and stability of Siemens NX graphical rendering by comparing generated images against approved reference outputs.
The minimal or zero pixel differences confirm deterministic and visually consistent rendering, indicating a stable GPU driver and visualization pipeline. The reported test execution times (in seconds) represent the duration required to complete each automated graphics validation scenario, demonstrating predictable and repeatable processing performance under non-interactive conditions.

Interactive test results on Azure NVads_v710_24 (note: execution times in seconds): ATS interactive test results evaluate Siemens NX graphics behavior during real-time user interactions such as rotation, zoom, pan, sectioning, and view manipulation. The results demonstrate stable and consistent rendering during interactive workflows, confirming that the GPU driver and visualization stack reliably support user-driven NX operations. The measured execution times (in seconds) reflect the responsiveness of each interactive graphics operation, indicating predictable behavior under live, user-controlled conditions rather than peak performance tuning.

NX CAD functions (ATS)                                      Automatic Tests  Interactive Tests
Grace1 Basic Tests
  GrPlayer_xp64.exe <FILE> Basic_Features.tgl               Passed!          Passed!
  GrPlayer_xp64.exe <FILE> Fog_Measurement_Clipping.tgl     Passed!          Passed!
  GrPlayer_xp64.exe <FILE> lighting.tgl                     Passed!          Passed!
  GrPlayer_xp64.exe <FILE> Shadow_Bump_Environment.tgl      Passed!          Passed!
  GrPlayer_xp64.exe <FILE> Texture_Map.tgl                  Passed!          Passed!
Grace2 Graphics Tests
  GrPlayer_64.exe <FILE> GrACETrace.tgl                     Passed!          Passed!

NXCP Test Scenarios                                         Automatic Tests
NXCP Gdat Tests
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_1.cgi       Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_2.cgi       Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_4.cgi       Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_5.cgi       Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_6.cgi       Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_7.cgi       Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_8.cgi       Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_9.cgi       Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_10.cgi      Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_11.cgi      Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_12.cgi      Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_13.cgi      Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_14.cgi      Passed!
  gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_15.cgi      Passed!

Benefits of Azure NVads_v710 (AMD GPU platform) for NX

- Workstation-class AMD Radeon PRO graphics drivers baked into Azure: ensures an ISV-validated driver pipeline.
- Excellent performance for CAD workloads: makes GPU-accelerated NX accessible to wider user bases.
- Remote engineering enablement: critical for companies that now operate global design teams.
- Elastic scale: spin up GPUs when development peaks; scale down when idle.

Conclusion

Siemens NX on Azure NVads_v710 powered by AMD GPUs enables enterprise-class CAD/CAM/CAE experiences in the cloud. NX benefits directly from workstation-grade OpenGL optimization, shading stability, and Ray Traced Studio acceleration, allowing engineers to interact smoothly with large assemblies, run visualization workloads, and perform design reviews without local hardware dependencies.

Right-sized GPUs deliver a workstation-class experience at lower cost. The family enables fractional GPU allocation (down to 1/6 of a Radeon Pro V710), allowing Siemens NX deployments to be right-sized per user role.
This avoids over-provisioning full GPUs while still delivering ISV-grade OpenGL and visualization stability, resulting in a lower per-engineer cost compared to fixed full-GPU cloud or on-prem workstations.

Elastic scale improves cost efficiency for burst engineering workloads. NVads_V710_v5 instances support on-demand scaling and ephemeral NVMe storage, allowing NX environments to scale up for design reviews, supplier collaboration, or peak integration cycles and scale down when idle. This consumption model provides a cost advantage over fixed on-prem workstations that remain underutilized outside peak engineering periods.

NX visualization pipelines benefit from a balanced CPU-GPU architecture. The combination of high-frequency AMD EPYC Genoa CPUs (up to 4.3 GHz) and Radeon Pro V710 GPUs addresses Siemens NX's mixed CPU-GPU workload profile, where scene graph processing, tessellation, and OpenGL submission are CPU-sensitive. This balance reduces idle GPU cycles, improving effective utilization and overall cost efficiency compared with GPU-heavy but CPU-constrained configurations.

The result is a scalable, secure, and cost-efficient engineering platform that supports distributed innovation, supplier collaboration, and digital product development workflows, all backed by the rendering and interaction consistency of AMD GPU virtualization on Azure.
Announcing Azure CycleCloud Workspace for Slurm: Version 2025.12.01 Release

The Azure CycleCloud Workspace for Slurm 2025.12.01 release introduces major upgrades that strengthen performance, monitoring, authentication, and platform flexibility for HPC environments. This update integrates Prometheus self-agent monitoring and Azure Managed Grafana, giving teams real-time visibility into node metrics, Slurm jobs, and cluster health through ready-to-use dashboards. The release also adds Entra ID Single Sign-On (SSO) to streamline secure access across CycleCloud and Open OnDemand. With centralized identity management and support for MFA, organizations can simplify user onboarding while improving security. Additionally, the update expands platform support with ARM64 compute nodes and compatibility with Ubuntu 24.04 and AlmaLinux 9, enabling more flexible and efficient HPC cluster deployments. Overall, this version focuses on improved observability, stronger security, and broader infrastructure options for technical and scientific HPC teams.
Monitoring HPC & AI Workloads on Azure H/N VMs Using Telegraf and Azure Monitor (GPU & InfiniBand)

As HPC & AI workloads continue to scale in complexity and performance demands, ensuring visibility into the underlying infrastructure becomes critical. This guide presents an essential monitoring solution for AI infrastructure deployed on Azure RDMA-enabled virtual machines (VMs), focusing on NVIDIA GPUs and Mellanox InfiniBand devices. By leveraging the Telegraf agent and Azure Monitor, this setup enables real-time collection and visualization of key hardware metrics, including GPU utilization, GPU memory usage, InfiniBand port errors, and link flaps. It provides operational insights vital for debugging, performance tuning, and capacity planning in high-performance AI environments.

In this blog, we'll walk through the process of configuring Telegraf to collect and send GPU and InfiniBand monitoring metrics to Azure Monitor. This end-to-end guide covers all the essential steps to enable robust monitoring for NVIDIA GPUs and Mellanox InfiniBand devices, empowering you to track, analyze, and optimize performance across your HPC & AI infrastructure on Azure.

DISCLAIMER: This is an unofficial configuration guide and is not supported by Microsoft. Please use it at your own discretion. The setup is provided "as-is" without any warranties, guarantees, or official support.

While Azure Monitor offers robust monitoring capabilities for CPU, memory, storage, and networking, it does not natively support GPU or InfiniBand metrics for Azure H- or N-series VMs. To monitor GPU and InfiniBand performance, additional configuration using third-party tools, such as Telegraf, is required. As of the time of writing, Azure Monitor does not include built-in support for these metrics without external integrations.

Update: Supported Monitoring Option Now Available

Update (December 2025): At the time this guide was written, monitoring InfiniBand (IB) and GPU metrics on Azure H-series and N-series VMs required a largely unofficial approach using Telegraf and Azure Monitor. Microsoft has since introduced a supported solution: Azure Managed Prometheus on VM / VM Scale Sets (VMSS), currently available in private preview. This new capability provides a native, managed Prometheus experience for collecting infrastructure and accelerator metrics directly from VMs and VMSS. It significantly simplifies deployment, lifecycle management, and long-term support compared to custom Telegraf-based setups. For new deployments, customers are encouraged to evaluate Azure Managed Prometheus on VM / VMSS as the preferred and supported approach for HPC and AI workload monitoring. Official announcement: Private Preview: Azure Managed Prometheus on VM / VMSS

Step 1: Prepare Azure to Receive GPU and InfiniBand Metrics from Telegraf Agents on a VM or VMSS

Register the microsoft.insights resource provider in your Azure subscription. Refer: Resource providers and resource types - Azure Resource Manager | Microsoft Learn

Step 2: Enable a Managed Identity to Authenticate the Azure VM or VMSS

In this example we are using a system-assigned managed identity for authentication. You can also use a user-assigned managed identity or a service principal to authenticate the VM. Refer: telegraf/plugins/outputs/azure_monitor at release-1.15 (influxdata/telegraf on GitHub)
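Once the managed identity is enabled, a quick way to confirm the VM can actually obtain a token for Azure Monitor is a short check with the azure-identity Python package. This is an optional sanity check, not part of the Telegraf setup itself; a failure here usually means the identity is not enabled or the required role assignment has not been granted or propagated yet.

    from azure.identity import ManagedIdentityCredential

    # Request a token for Azure Monitor using the VM's system-assigned managed identity.
    credential = ManagedIdentityCredential()
    token = credential.get_token("https://monitor.azure.com/.default")
    print("Token acquired, expires at (epoch seconds):", token.expires_on)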
Step 3: Set Up the Telegraf Agent Inside the VM or VMSS to Send Data to Azure Monitor

In this example, I'll use an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image to configure the environment for VMSS. The Ubuntu-HPC 2204 image comes with pre-installed NVIDIA GPU drivers, CUDA, and InfiniBand drivers. If you opt for a different image, ensure that you manually install the necessary GPU drivers, CUDA toolkit, and InfiniBand drivers.

Next, download and run the gpu-ib-mon_setup.sh script to install the Telegraf agent on Ubuntu 22.04. This script also configures the NVIDIA SMI and InfiniBand input plugins, along with the Telegraf configuration needed to send data to Azure Monitor.

Note: The gpu-ib-mon_setup.sh script is currently supported and tested only on Ubuntu 22.04. For background on the InfiniBand counters collected by Telegraf, see: https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters

Run the following commands:

    wget https://raw.githubusercontent.com/vinil-v/gpu-ib-monitoring/refs/heads/main/scripts/gpu-ib-mon_setup.sh -O gpu-ib-mon_setup.sh
    chmod +x gpu-ib-mon_setup.sh
    ./gpu-ib-mon_setup.sh

Test the Telegraf configuration by executing the following command:

    sudo telegraf --config /etc/telegraf/telegraf.conf --test

Step 4: Creating Dashboards in Azure Monitor to Check NVIDIA GPU and InfiniBand Usage

Telegraf includes an output plugin specifically designed for Azure Monitor, allowing custom metrics to be sent directly to the platform. Since Azure Monitor supports a metric resolution of one minute, the Telegraf output plugin aggregates metrics into one-minute intervals and sends them to Azure Monitor at each flush cycle. Metrics from each Telegraf input plugin are stored in a separate Azure Monitor namespace, typically prefixed with Telegraf/ for easy identification.

To visualize NVIDIA GPU usage, go to the Metrics section in the Azure portal:

- Set the scope to your VM.
- Choose the metric namespace Telegraf/nvidia-smi.
- Select and display GPU metrics such as utilization, memory usage, temperature, and more. In this example we are using the GPU memory_used metric.
- Use filters and splits to analyze data across multiple GPUs or over time.

To monitor InfiniBand performance, repeat the same process:

- In the Metrics section, set the scope to your VM.
- Select the metric namespace Telegraf/infiniband.
- Visualize metrics such as port status, data transmitted/received, and error counters. In this example, we are using the link flap metric (link_downed) to check for InfiniBand link flaps.
- Use filters to break down the data by port or metric type for deeper insights.

Note: the link_downed metric returns incorrect values with the Count aggregation; use the Max or Min aggregation instead. The port_rcv_data metric can be charted in the same way.

Creating custom dashboards in Azure Monitor with both the Telegraf/nvidia-smi and Telegraf/infiniband namespaces allows for unified visibility into GPU and InfiniBand.
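The same Telegraf namespaces can also be queried programmatically, which is handy for quick spot checks outside the portal. Below is a minimal sketch using the azure-monitor-query package; the resource ID is a placeholder, and the metric name (memory_used) is assumed to match what the Telegraf nvidia_smi plugin emits in your environment.

    from datetime import timedelta

    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import MetricsQueryClient

    client = MetricsQueryClient(DefaultAzureCredential())

    # Placeholder resource ID of the GPU VM.
    resource_id = ("/subscriptions/<sub-id>/resourceGroups/<rg>"
                   "/providers/Microsoft.Compute/virtualMachines/<vm-name>")

    result = client.query_resource(
        resource_id,
        metric_names=["memory_used"],
        metric_namespace="Telegraf/nvidia-smi",
        timespan=timedelta(hours=1),
        granularity=timedelta(minutes=1),
        aggregations=["Average"],
    )

    for metric in result.metrics:
        for series in metric.timeseries:
            for point in series.data:
                print(point.timestamp, point.average)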
Testing InfiniBand and GPU Usage

If you're testing GPU metrics and need a reliable way to simulate multi-GPU workloads, especially over InfiniBand, here's a straightforward approach using the NCCL benchmark suite. This method is ideal for verifying GPU and network monitoring setups. The NCCL benchmarks and OpenMPI are part of the Ubuntu HPC 22.04 image. Update the variables according to your environment and add your node hostnames to the hostfile.

    module load mpi/hpcx-v2.13.1
    export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5
    mpirun -np 16 --map-by ppr:8:node -hostfile hostfile \
        -mca coll_hcoll_enable 0 --bind-to numa \
        -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
        -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_IFNAME=eth0 \
        -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
        -x NCCL_DEBUG=WARN \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -c 1

Alternate: GPU Load Simulation Using TensorFlow

If you're looking for a more application-like load (e.g., distributed training), I've prepared a script that sets up a multi-GPU TensorFlow training environment using Anaconda. This is a great way to simulate real-world GPU workloads and validate your monitoring pipelines. To get started, run the following:

    wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh
    chmod +x gpu_test_program.sh
    ./gpu_test_program.sh

With either method, NCCL benchmarks or TensorFlow training, you'll be able to simulate realistic GPU usage and validate your GPU and InfiniBand monitoring setup with confidence. Happy testing!

References:
- Ubuntu HPC on Azure
- ND A100 v4-series GPU VM Sizes
- Telegraf Azure Monitor Output Plugin (v1.15)
- Telegraf NVIDIA SMI Input Plugin (v1.15)
- Telegraf InfiniBand Input Plugin Documentation
HIGH PERFORMANCE COMPUTING (HPC): OIL AND GAS IN AZURE

The goal of this blog is to share our experience running key Oil and Gas workloads in Azure. We have worked with multiple customers running these workloads successfully in Azure. There is now great potential for using the cloud in the Oil and Gas industry to optimize business workflows that were previously limited by capacity and aging hardware.
Automating HPC Workflows with Copilot Agents

High Performance Computing (HPC) workloads are complex, requiring precise job submission scripts and careful resource management. Manual scripting for platforms like OpenFOAM is time-consuming, error-prone, and often frustrating. At SC25, we showcased how Copilot Agents, powered by AI, are transforming HPC workflows by automating Slurm submission scripts, making scientific computing more efficient and accessible.