hpc

255 Topics

Performance and Scalability of Azure HBv5-series Virtual Machines
Azure HBv5-series virtual machines (VMs) for CPU-based high performance computing (HPC) are now Generally Available. This blog provides in-depth information about the technical underpinnings, performance, cost, and management implications of these HPC-optimized VMs.

Azure HBv5 VMs bring leadership levels of performance, cost optimization, and server (VM) consolidation for a variety of workloads driven by memory performance, such as computational fluid dynamics, weather simulation, geoscience simulations, and finite element analysis. For these applications, and compared to HBv4 VMs (previously the highest performance offering for these workloads), HBv5 provides up to:

- 5x higher performance for CFD workloads with 43% lower costs
- 3.2x higher performance for weather simulation with 16% lower costs
- 2.8x higher performance for geoscience workloads at the same costs

HBv5-series Technical Overview & VM Sizes

Each HBv5 VM features several new technologies for HPC customers, including:

- Up to 6.6 TB/s of memory bandwidth (STREAM TRIAD) and 432 GB memory capacity
- Up to 368 physical cores per VM (user configurable) with custom AMD EPYC CPUs, Zen4 microarchitecture (SMT disabled)
- Base clock of 3.5 GHz (~1 GHz higher than other 96-core EPYC CPUs), and boost clock of 4 GHz across all cores
- 800 Gb/s NVIDIA Quantum-2 InfiniBand (4 x 200 Gb/s CX-7) (~2x higher than HBv4 VMs)
- 180 Gb/s Azure Accelerated Networking (~2.2x higher than HBv4 VMs)
- 15 TB local NVMe SSD with up to 50 GB/s (read) and 30 GB/s (write) of bandwidth (~4x higher than HBv4 VMs)

The highlight feature of HBv5 VMs is their use of high-bandwidth memory (HBM). HBv5 VMs utilize a custom AMD CPU that increases memory bandwidth by ~9x v. dual-socket 4th Gen EPYC (Zen4, “Genoa”) server platforms, and ~7x v. dual-socket EPYC (Zen5, “Turin”) server platforms. HBv5 delivers similar levels of memory bandwidth improvement compared to the highest end alternatives from the Intel Xeon and ARM CPU ecosystems.
HBv5-series VMs are available in the following sizes with specifications as shown below. Just like existing H-series VMs, HBv5-series includes constrained-core VM sizes, enabling customers to optimize their VM dimensions for a variety of scenarios:

- ISV licensing constraining a job to a targeted number of cores
- Maximum performance per VM or maximum performance per core
- Minimum RAM per core (1.2 GB, suitable for strong scaling workloads) to maximum memory per core (9 GB, suitable for large datasets and weak scaling workloads)

Table 1: Technical specifications of HBv5-series VMs

Note: Maximum clock frequencies (FMAX) are based on product specifications of the AMD EPYC 9V64H processor. Clock frequencies experienced by a customer are a function of a variety of factors, including but not limited to the arithmetic intensity (SIMD) and parallelism of an application.

For more information, see the official documentation for HBv5-series VMs.

Microbenchmark Performance

This section focuses on microbenchmarks that characterize performance of the memory subsystem, compute capabilities, and InfiniBand network of HBv5 VMs.

Memory & Compute Performance

To capture synthetic performance, we ran the following industry standard benchmarks:

- STREAM: memory bandwidth
- High Performance Conjugate Gradient (HPCG): sparse linear algebra
- High Performance Linpack (HPL): dense linear algebra

Absolute results and comparisons to HBv4 VMs are shown in Table 2, below:

Table 2: Results of HBv5 running the STREAM, HPCG, and HPL benchmarks.

Note: STREAM was run with the following CLI parameters:

    OMP_NUM_THREADS=368 OMP_PROC_BIND=true OMP_PLACES=cores ./amd_zen_stream

STREAM data size: 2621440000 bytes

InfiniBand Networking Performance

Each HBv5-series VM is equipped with four NVIDIA Quantum-2 network interface cards (NICs), each operating at 200 Gb/s for an aggregate bandwidth of 800 Gb/s per VM (node).
We ran industry-standard InfiniBand tests (IB perftest and OSU benchmarks) across two (2) HBv5-series VMs, with results shown in Figures 1-3, below.

Note: all results below are for a single 200 Gb/s (uni-directional) link only. At a VM level, all bandwidth results below are 4x higher as there are four (4) InfiniBand links per HBv5 server.

Unidirectional bandwidth:

    numactl -c 0 ib_send_bw -aF -q 2

Figure 1: results showing 99% achieved uni-directional bandwidth v. theoretical peak.

Bi-directional bandwidth:

    numactl -c 0 ib_send_bw -aF -q 2 -b

Figure 2: results showing 99% achieved bi-directional bandwidth v. theoretical peak.

Latency:

Figure 3: results measuring latencies as low as 1.25 microseconds among HBv5 VMs. Latencies experienced by users will depend on message sizes employed by applications.

Application Performance, Cost/Performance, and Server (VM) Consolidation

This section focuses on characterizing HBv5-series VMs when running common, real-world HPC applications, with an emphasis on those known to be meaningfully bound by memory performance, as that is the focus of the HB-series family.
We characterize HBv5 below in three (3) ways of high relevance to customer interests:

- Performance (“how much faster can it do the work”)
- Cost/Performance (“how much can it reduce the costs to complete the work”)
- Fleet consolidation (“how much can a customer simplify the size and scale of compute fleet management while still being able to do the work”)

Where possible, we have included comparisons to other Azure HPC VMs, including:

- Azure HBv4/HX series with 176 physical cores of 4th Gen AMD EPYC CPUs with 3D V-Cache (“Genoa-X”) (HBv4 specifications, HX specifications)
- Azure HBv3 with 120 physical cores of 3rd Gen AMD EPYC CPUs with 3D V-Cache (“Milan-X”) (HBv3 specifications)
- Azure HBv2 with 120 physical cores of 2nd Gen AMD EPYC CPUs (“Rome”) (full specifications)

Unless otherwise noted, all tests shown below were performed with:

- Alma Linux 8.10 (image URN: almalinux:almalinux-hpc:8_10-hpc-gen2:latest); for scaling, image URN: almalinux:almalinux-hpc:8_6-hpc-gen2:latest
- NVIDIA HPC-X MPI

Further, all Cost/Performance comparisons use list-price, Pay-As-You-Go (PAYG) rates found on Azure Linux Virtual Machines Pricing. Absolute costs will be a function of a customer’s workload, model, and consumption approach (PAYG v. Reserved Instance, etc.). That said, the relative cost/performance comparisons illustrated below should hold for the workload and model combinations shown, regardless of the consumption approach.

Computational Fluid Dynamics (CFD)

OpenFOAM - version 2306 with 100M Cell Motorbike case

Figure 4: On OpenFOAM with the Motorbike 100M cell case, HBv5 VMs provide a 4.8x performance increase over HBv4 VMs.

Figure 5: The cost to complete the OpenFOAM Motorbike 100M case is just 57% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running OpenFOAM cases similar to the size and complexity of the 100M cell Motorbike problem, organizations can consolidate their server (VM) deployments by approximately a factor of five (5).

Palabos - version 1.01 with 3D Cavity, 1001 x 1001 x 1001 cells case

Figure 6: On Palabos, a Lattice Boltzmann solver using a streaming memory access pattern, HBv5 VMs provide a 4.4x performance increase over HBv4 VMs.

Figure 7: The cost to complete the Palabos 3D Cavity case is just 62% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running Palabos with cases similar to the size and complexity of the 1001 x 1001 x 1001 cell 3D Cavity problem, organizations can consolidate their server (VM) deployments by a factor of ~4.5.

Ansys Fluent - version 2025 R2 with F1 Racecar 140M case

Figure 8: On Ansys Fluent, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.

Figure 9: The cost to complete the Ansys Fluent F1 Racecar 140M case is just 81% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running Ansys Fluent with cases similar to the size and complexity of the 140M cell F1 Racecar problem, organizations can consolidate their server (VM) deployments by a factor of ~3.5.

Siemens Star-CCM+ - version 17.04.005 with AeroSUV Steady Coupled 106M case

Figure 10: On Star-CCM+, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.

Figure 11: The cost to complete the Siemens Star-CCM+ AeroSUV Steady Coupled 106M case is just 81% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running Star-CCM+ with cases similar to the size and complexity of the 106M cell AeroSUV Steady Coupled problem, organizations can consolidate their server (VM) deployments by a factor of ~3.5.
Weather Modeling

WRF - version 4.2.2 with CONUS 2.5km case

Figure 12: On WRF, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.

Figure 13: The cost to complete the WRF CONUS 2.5km case is just 84% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running WRF with cases similar to the size and complexity of the 2.5km CONUS case, organizations can consolidate their server (VM) deployments by a factor of ~3.

Energy Research

Devito - version 4.8.7 with Acoustic Forward case

Figure 14: On Devito, HBv5 VMs provide a 2.8x performance increase over HBv4 VMs.

Figure 15: The cost to complete the Devito Acoustic Forward OP case is equivalent to what it costs to complete the same case on HBv4.

Above, we can see that for customers running Devito with cases similar to the size and complexity of the Acoustic Forward OP case, organizations can consolidate their server (VM) deployments by a factor of ~3.

Molecular Dynamics

NAMD - version 2.15a2 with STMV 20M case

Figure 16: On NAMD, HBv5 VMs provide a 2.18x performance increase over HBv4 VMs.

Figure 17: The cost to complete the NAMD STMV 20M case is 26% higher on HBv5 than what it costs to complete the same case on HBv4.

Above, we can see that for customers running NAMD with cases similar to the size and complexity of the STMV 20M case, organizations can consolidate their server (VM) deployments by a factor of ~2.

Notably, NAMD is a compute-bound case rather than a memory-performance-bound one. We include it here to illustrate that not all workloads are a good fit for HBv5. This latest Azure HPC VM is the fastest at this workload on the Microsoft Cloud, but the workload does not benefit substantially from HBv5’s premium levels of memory bandwidth. NAMD would instead perform more cost efficiently with a CPU that supports AVX512 instructions natively or, much better still, a modern GPU.
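The cost figures quoted for each application above can be related to the speedups with one line of arithmetic: relative cost per job is the hourly price ratio divided by the speedup. The sketch below is a minimal illustration of that relationship, not the authors' method. The function names are hypothetical, and the price ratio of roughly 2.75 is an assumption back-computed from the speedup and cost pairs quoted in this post, not a number taken from the Azure pricing page.

```shell
#!/bin/sh
# Sketch only: relates speedup, hourly price ratio, and relative cost per job.
# Function names are hypothetical helpers, not part of any Azure tooling.

# implied_price_ratio SPEEDUP RELATIVE_COST
# Back-computes the HBv5/HBv4 hourly price ratio implied by a quoted pair.
implied_price_ratio() {
  awk -v s="$1" -v c="$2" 'BEGIN { printf "%.2f\n", s * c }'
}

# relative_cost PRICE_RATIO SPEEDUP
# Cost to finish the same job on the faster VM, relative to the slower VM.
relative_cost() {
  awk -v p="$1" -v s="$2" 'BEGIN { printf "%.2f\n", p / s }'
}

# Price ratios implied by pairs quoted in the text
implied_price_ratio 4.8 0.57    # OpenFOAM: prints 2.74
implied_price_ratio 3.4 0.81    # Ansys Fluent: prints 2.75
implied_price_ratio 2.18 1.26   # NAMD: prints 2.75

# ASSUMPTION: a price ratio of 2.75, inferred above, not an official rate
relative_cost 2.75 4.8          # OpenFOAM: prints 0.57
relative_cost 2.75 3.27         # WRF: prints 0.84
relative_cost 2.75 2.18         # NAMD: prints 1.26 (26% more expensive)
```

Under this assumed ratio, the break-even point sits at a speedup of about 2.75x: workloads that accelerate more than that get cheaper per job on HBv5, and workloads that accelerate less, such as NAMD, get more expensive.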
Scalability of HBv5-series VMs

Weak Scaling

Weak scaling measures how well a parallel application or system performs when both the number of processing elements and the problem size increase proportionally, so that the workload per processor remains constant. Weak scaling cases are often employed when time-to-solution is fixed (e.g. it is acceptable to solve a problem within a specified period) but a user desires a simulation of higher fidelity or resolution. A common example is operational weather forecasting.

To illustrate weak scaling on HBv5 VMs, we ran Palabos with the same 3D cavity problem as shown earlier:

Figure 18: On Palabos with the 3D Cavity model, HBv5 scales linearly as the 3D cavity size is proportionately increased.

Strong Scaling

Strong scaling is characterized by the efficiency with which execution time is reduced as the number of processor elements (CPUs, GPUs, etc.) is increased while the problem size is kept constant. Strong scaling cases are often employed when the fidelity or resolution of the simulation is acceptable, but a user requires faster time to completion. A common example is product engineering validation, when an organization wants to bring a product to market faster but must complete a broad range of validation and verification scenarios before doing so.

To illustrate strong scaling on HBv5 VMs, we ran NAMD with two different problems, each intended to illustrate how expectations for strong scaling efficiency change depending on problem size and the ordering of computation v. communication in distributed memory workloads.

First, let us examine NAMD with the 20M STMV benchmark.

Figure 19: On NAMD with the STMV 20M case, HBv5 scales linearly to two (2) VMs, after which parallel efficiency degrades.
As illustrated above, for strong scaling cases in which compute time is continuously reduced (by leveraging more and more processor elements) but communication time remains constant, scaling efficiency will only stay high for so long. That principle is well represented by the STMV 20M case, for which parallel efficiency remains linear (i.e. cost/job remains flat) at two (2) nodes but degrades after that. This is because while compute is being sped up, the MPI time remains relatively flat. As such, the relatively static MPI time comes to dominate end-to-end wall clock time as VM scaling increases. Said another way, HBv5 features so much compute performance that, even for a moderate-sized problem like STMV 20M, scaling the infrastructure can only take performance so far before cost/job begins to increase.

If we examine HBv5 against the 210M case, however, with 10.5x as many elements to compute as its 20M sibling, the scaling efficiency story changes significantly.

Figure 20: On NAMD with the STMV 210M case, HBv5 scales linearly out to 32 VMs (or more than 11,000 CPU cores).

As illustrated above, larger cases with significant compute requirements will continue to scale efficiently with larger amounts of HBv5 infrastructure. While MPI time remains relatively flat for this case (as it does with the smaller STMV 20M case), the compute demands remain the dominant fraction of end-to-end wall clock time. As such, HBv5 scales these problems with very high levels of efficiency, and job costs to the user remain flat despite up to 8x as many VMs being leveraged compared to the four (4) VM baseline.

The key takeaways for strong scaling scenarios are two-fold. First, users should run scaling tests with their applications and models to find a sweet spot of faster performance with constant job costs. This will depend heavily on model size.
Second, as new and very high-end compute platforms like HBv5 emerge that accelerate compute time, application developers will need to find ways to reduce the share of wall clock time that is bottlenecked on communication (MPI). Recommended approaches include using fewer MPI processes and, ideally, restructuring applications to overlap communication with compute phases.
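The strong scaling behavior described above can be illustrated with a toy model in which compute time divides evenly across VMs while MPI time stays constant: T(n) = Tc/n + Tm. The sketch below is a simplification under that assumption and is not derived from the measured NAMD runs. The function names and every input number are hypothetical, chosen only so that the second case has 10.5x the compute of the first.

```shell
#!/bin/sh
# Toy strong-scaling model: T(n) = TC/n + TM, with TM (MPI time) held constant.
# All names and numbers here are illustrative assumptions, not measurements.

# parallel_efficiency TC TM BASE N
# Efficiency at N VMs relative to a BASE-VM run:
#   (BASE * T(BASE)) / (N * T(N)) = (TC + BASE*TM) / (TC + N*TM)
parallel_efficiency() {
  awk -v tc="$1" -v tm="$2" -v b="$3" -v n="$4" \
    'BEGIN { printf "%.2f\n", (tc + b * tm) / (tc + n * tm) }'
}

# relative_job_cost TC TM BASE N
# Cost per job is proportional to N * T(N), so it is the inverse of efficiency.
relative_job_cost() {
  awk -v tc="$1" -v tm="$2" -v b="$3" -v n="$4" \
    'BEGIN { printf "%.2f\n", (tc + n * tm) / (tc + b * tm) }'
}

# Smaller problem (hypothetical): 2000 s of compute on one VM, 100 s of MPI time
parallel_efficiency 2000 100 1 2     # prints 0.95
parallel_efficiency 2000 100 1 8     # prints 0.75
relative_job_cost   2000 100 1 8     # prints 1.33

# Problem with 10.5x the compute and the same MPI time, 4-VM baseline
parallel_efficiency 21000 100 4 32   # prints 0.88
```

In this model the larger problem keeps far more of its efficiency at 32 VMs than the smaller one does at 8, which matches the qualitative point that model size determines where cost per job starts to climb.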
jvenkatesh
Nov 05, 2025 | Azure High Performance Computing (HPC) Blog
The Complete Guide to Renewing an Expired Certificate in Microsoft HPC Pack 2019 (Single Head Node)
Managing certificates in an HPC Pack 2019 cluster is critical for secure communication between nodes. However, if your certificate has expired, your cluster services (Scheduler, Broker, Web Components, etc.) may stop functioning properly, preventing nodes from communicating or jobs from scheduling. When the HPC Pack certificate expires, HPC Cluster Manager will fail to launch, and you may encounter error messages.

This guide walks you through how to renew an already expired HPC Pack certificate on a single-head-node setup and bring your cluster back online.

Step 1: Check the Current Certificate Expiry

Start by checking the existing certificate and its expiry date.

    Get-ChildItem -Path Cert:\LocalMachine\Root | Where-Object { $_.Subject -like "*HPC*" }
    $thumbprint = "<Thumbprint value from the previous command>".ToUpper()
    $cert = Get-ChildItem -Path Cert:\LocalMachine\My | Where-Object { $_.Thumbprint -eq $thumbprint }
    $cert | Select-Object Subject, NotBefore, NotAfter, Thumbprint

You can also confirm the system date using the PowerShell date command:

    Get-Date

This ensures you’re viewing the correct validity period for the currently installed certificate.

Step 2: Prepare a New Self-Signed Certificate

Next, we’ll create a new certificate that meets the HPC communication requirements.

Certificate Requirements:

- Must have a private key capable of key exchange.
- Key usage should include: Digital Signature, Key Encipherment, Key Agreement, and Certificate Signing.
- Enhanced key usage should include: Client Authentication and Server Authentication.
- If two certificates are used (private/public), both must have the same subject name.

When you prepare a new certificate, make sure that you use the same subject name as that of the old certificate. Run the following PowerShell commands on the HPC node to get the subject name of your certificate.
You can verify the existing certificate’s subject name using the following commands:

    $thumbprint = (Get-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\HPC -Name SSLThumbprint).SSLThumbPrint
    $subjectName = (Get-Item Cert:\LocalMachine\My\$thumbprint).Subject
    $subjectName

Use the same subject name when generating the new certificate.

Step 3: Create a New Certificate

Use the commands below to create and export a new self-signed certificate (valid for 1 year).

    $subjectName = "HPC Pack Node Communication"
    $pfxcert = New-SelfSignedCertificate -Subject $subjectName -KeySpec KeyExchange -KeyLength 2048 -HashAlgorithm SHA256 -TextExtension @("2.5.29.37={text}1.3.6.1.5.5.7.3.1,1.3.6.1.5.5.7.3.2") -Provider "Microsoft Enhanced RSA and AES Cryptographic Provider" -CertStoreLocation Cert:\CurrentUser\My -KeyExportPolicy Exportable -NotAfter (Get-Date).AddYears(1) -NotBefore (Get-Date).AddDays(-1)
    $certThumbprint = $pfxcert.Thumbprint
    $null = New-Item $env:Temp\$certThumbprint -ItemType Directory
    $pfxPassword = Get-Credential -UserName 'Protection password' -Message 'Enter protection password below'
    Export-PfxCertificate -Cert Cert:\CurrentUser\My\$certThumbprint -FilePath "$env:Temp\$certThumbprint\PrivateCert.pfx" -Password $pfxPassword.Password
    Export-Certificate -Cert Cert:\CurrentUser\My\$certThumbprint -FilePath "$env:Temp\$certThumbprint\PublicCert.cer" -Type CERT -Force
    start "$env:Temp\$certThumbprint"

This will generate both .pfx (private) and .cer (public) files in a temporary directory.

Step 4: Copy Certificate to Install Share

On the master (head) node, copy the newly created certificate to the following path:

    C:\Program Files\Microsoft HPC Pack 2019\Data\InstallShare\Certificates

This ensures the certificate is available to all compute nodes in the cluster.

Step 5: Rotate Certificates on Compute Nodes

Important: Always rotate certificates on compute nodes first, before the head node.
If you update the head node first, compute nodes will reject the new certificate, forcing manual reconfiguration. After rotating compute node certificates, expect them to appear as Offline in HPC Cluster Manager. This is normal until the head node certificate is updated.

Download the PowerShell script Update-HpcNodeCertificate.ps1 and place it in your HPC install share:

    \\<headnode>\REMINST

On each compute node, open PowerShell as Administrator and run:

    PowerShell.exe -ExecutionPolicy ByPass -Command "\\<headnode>\REMINST\Update-HpcNodeCertificate.ps1 -PfxFilePath \\<headnode>\REMINST\Certificates\HpcCnCommunication.pfx -Password <password>"

This updates the certificate on each compute node.

Step 6: Update Certificate on the Master (Head) Node

On the head node, run the following commands in PowerShell as Administrator:

    $certPassword = ConvertTo-SecureString -String "YourPassword" -AsPlainText -Force
    Import-PfxCertificate -FilePath "C:\Program Files\Microsoft HPC Pack 2019\Data\InstallShare\Certificates\PrivateCert.pfx" -CertStoreLocation "Cert:\LocalMachine\My" -Password $certPassword
    PowerShell.exe -ExecutionPolicy ByPass -Command "Import-certificate -FilePath \\master\REMINST\Certificates\PublicCert.cer -CertStoreLocation cert:\LocalMachine\Root"
    Set-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\HPC" -Name SSLThumbprint -Value "<Thumbprint>"
    Set-ItemProperty -Path "HKLM:\SOFTWARE\Wow6432Node\Microsoft\HPC" -Name SSLThumbprint -Value "<Thumbprint>"

Step 7: Update Thumbprint in SQL Database

You’ll also need to update the certificate thumbprint stored in the HPCHAStorage database.

1. Install SQL Server Management Studio (SSMS) (latest version).
2. Open SSMS and connect to the HPC database.
3. Navigate to: HPCHAStorage → Tables → dbo.DataTable
4. Right-click and select “Select Top 1000 Rows” to view the current SSL thumbprint.
5. Open a new query window and run the following command with the updated thumbprint:

    Update dbo.DataTable set dvalue='<NewThumbprint>' where dpath = 'HKEY_LOCAL_MACHINE\Software\Microsoft\HPC' and dkey = 'SSLThumbprint'

This updates the stored certificate reference used by the HPC services.

Step 8: Reboot the Master Node

Once everything is updated, reboot the head node to apply the changes. After the system restarts, open HPC Cluster Manager. Your cluster should now be fully functional with the new certificate in place.

Summary

By following these steps, you can safely renew an expired HPC Pack 2019 certificate and restore secure communication across your cluster without needing to reinstall or reconfigure HPC Pack components. This guide helps administrators handle expired certificates with confidence and maintain business continuity for HPC workloads.

If this guide helped you resolve your certificate issues, please give it a 👍 thumbs up and share your feedback or questions in the comments section below.
vinilv
Oct 30, 2025 | Azure High Performance Computing (HPC) Blog
Use Entra IDs to run jobs on your HPC cluster
Introduction

This blog demonstrates a practical implementation of the System Security Services Daemon (SSSD) with the recently introduced “idp” provider, which can be used on Azure Linux 3.0 HPC clusters to provide consistent usernames, UIDs, and GIDs across the cluster, all rooted in Microsoft Entra ID.

Having consistent identities across the cluster is a fundamental requirement that is commonly met using SSSD and a provider such as LDAP, FreeIPA, or ADDS or, if no IdP is available, by managing local accounts across all nodes. SSSD 2.11.0 introduced a new generic “idp” provider that can integrate Linux systems with Microsoft Entra ID via OAuth2/OpenID Connect. This means we can now define a domain in sssd.conf with id_provider = idp and idp_type = entra_id, along with Entra tenant and app credentials. With SSSD configured and running, getent can resolve users and groups via Entra ID, fetching each Entra user’s POSIX info consistently across the cluster.

As this capability is very new (it’s being included in the Fedora 43 pre-release), this blog covers the steps required to implement it on Azure Linux 3.0 for those who would like to explore it on their own VMs and clusters.

Implementation

1. Build RPMs

As we are deploying on Azure Linux 3.0 and RPMs are not available in packages.microsoft.com (PMC), we must download the release package 2.11.0 from Releases · SSSD/sssd and follow the guidance from Building SSSD - sssd.io.

A virtual machine running Azure Linux 3.0 HPC edition was used, as it provides many of the required build tools and is our target operating system. A number of dependencies must still be installed to perform the make, but these are all available from PMC and the make runs without issue.
    # Install dependencies
    sudo tdnf -y install \
        c-ares-devel \
        cifs-utils-devel \
        curl-devel \
        cyrus-sasl-devel \
        dbus-devel \
        jansson-devel \
        krb5-devel \
        libcap-devel \
        libdhash-devel \
        libldb-devel \
        libini_config-devel \
        libjose-devel \
        libnfsidmap-devel \
        libsemanage-devel \
        libsmbclient-devel \
        libtalloc-devel \
        libtdb-devel \
        libtevent-devel \
        libunistring-devel \
        libwbclient-devel \
        p11-kit-devel \
        samba-devel \
        samba-winbind
    sudo ln -s /etc/alternatives/libwbclient.so-64 /usr/lib/libwbclient.so.0

    # Build SSSD from source
    wget https://github.com/SSSD/sssd/releases/download/2.11.0/sssd-2.11.0.tar.gz
    tar -xvf sssd-2.11.0.tar.gz
    cd sssd-2.11.0
    autoreconf -if
    ./configure --enable-nsslibdir=/lib64 --enable-pammoddir=/lib64/security --enable-silent-rules --with-smb-idmap-interface-version=6
    make
    # Success!!

Building the RPMs is more complex: there are many more dependencies, some of them are not available on PMC, and we are reusing the generic sssd.spec file. However, it can be done, and it produces a working set of the required SSSD RPMs.
First install the dependencies available from PMC:

    # Add dependencies for rpmbuild
    sudo tdnf -y install \
        doxygen \
        libcmocka-devel \
        nss_wrapper \
        pam_wrapper \
        po4a \
        shadow-utils-subid-devel \
        softhsm \
        systemtap-sdt-devel \
        uid_wrapper

The remaining four dependencies are sourced from Fedora 42 and EPEL 8 builds and may be installed using tdnf:

    # gdm-pam-extensions-devel
    wget https://kojipkgs.fedoraproject.org//packages/gdm/48.0/3.fc42/x86_64/gdm-pam-extensions-devel-48.0-3.fc42.x86_64.rpm
    sudo tdnf install ./gdm-pam-extensions-devel-48.0-3.fc42.x86_64.rpm

    # libfido2-devel
    wget https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/l/libcbor-0.7.0-6.el8.x86_64.rpm
    wget https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/l/libfido2-1.11.0-2.el8.x86_64.rpm
    wget https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/l/libfido2-devel-1.11.0-2.el8.x86_64.rpm
    sudo tdnf install ./libcbor-0.7.0-6.el8.x86_64.rpm --nogpgcheck
    sudo tdnf install ./libfido2-1.11.0-2.el8.x86_64.rpm --nogpgcheck
    sudo tdnf install ./libfido2-devel-1.11.0-2.el8.x86_64.rpm --nogpgcheck

The sudo make rpms step can now be run. It will fail, but it sets up much of what we need for a successful rpmbuild using the following steps:

    # rpmbuild
    sudo make rpms
    # will error with: File /rpmbuild/SOURCES/sssd-2.11.0.tar.gz: No such file or directory
    sudo cp ../sssd-2.11.0.tar.gz /rpmbuild/SOURCES/
    cd /rpmbuild
    sudo vi SPECS/sssd.spec
    # edit build_passkey 1 in SPECS/sssd.spec to 0 to skip passkey support
    sudo rpmbuild --define "_topdir /rpmbuild" -ba SPECS/sssd.spec

And we have RPMs:

    # RPMS!!!
    libipa_hbac-2.11.0-0.azl3.x86_64.rpm
    libipa_hbac-devel-2.11.0-0.azl3.x86_64.rpm
    libsss_autofs-2.11.0-0.azl3.x86_64.rpm
    libsss_certmap-2.11.0-0.azl3.x86_64.rpm
    libsss_certmap-devel-2.11.0-0.azl3.x86_64.rpm
    libsss_idmap-2.11.0-0.azl3.x86_64.rpm
    libsss_idmap-devel-2.11.0-0.azl3.x86_64.rpm
    libsss_nss_idmap-2.11.0-0.azl3.x86_64.rpm
    libsss_nss_idmap-devel-2.11.0-0.azl3.x86_64.rpm
    libsss_sudo-2.11.0-0.azl3.x86_64.rpm
    python3-libipa_hbac-2.11.0-0.azl3.x86_64.rpm
    python3-libsss_nss_idmap-2.11.0-0.azl3.x86_64.rpm
    python3-sss-2.11.0-0.azl3.x86_64.rpm
    python3-sss-murmur-2.11.0-0.azl3.x86_64.rpm
    python3-sssdconfig-2.11.0-0.azl3.noarch.rpm
    sssd-2.11.0-0.azl3.x86_64.rpm
    sssd-ad-2.11.0-0.azl3.x86_64.rpm
    sssd-client-2.11.0-0.azl3.x86_64.rpm
    sssd-common-2.11.0-0.azl3.x86_64.rpm
    sssd-common-pac-2.11.0-0.azl3.x86_64.rpm
    sssd-dbus-2.11.0-0.azl3.x86_64.rpm
    sssd-debuginfo-2.11.0-0.azl3.x86_64.rpm
    sssd-idp-2.11.0-0.azl3.x86_64.rpm
    sssd-ipa-2.11.0-0.azl3.x86_64.rpm
    sssd-kcm-2.11.0-0.azl3.x86_64.rpm
    sssd-krb5-2.11.0-0.azl3.x86_64.rpm
    sssd-krb5-common-2.11.0-0.azl3.x86_64.rpm
    sssd-ldap-2.11.0-0.azl3.x86_64.rpm
    sssd-nfs-idmap-2.11.0-0.azl3.x86_64.rpm
    sssd-proxy-2.11.0-0.azl3.x86_64.rpm
    sssd-tools-2.11.0-0.azl3.x86_64.rpm
    sssd-winbind-idmap-2.11.0-0.azl3.x86_64.rpm

2. Deploy RPMs

With the RPMs created, we can now move on to installing them on our cluster. In my case I am using a customised image with other tunings and packages, so the RPMs can be included in my Ansible playbook and an updated image produced.
The following details the rpms (a subset of the 30 or so created) installed into the image:

    # Pre install sssd rpms
    - name: Copy sssd rpms onto host
      ansible.builtin.copy:
        src: sssd-2.11.0/
        dest: /tmp/sssd/

    - name: Install sssd rpms
      ansible.builtin.shell: |
        tdnf -y install /tmp/sssd/libsss_certmap-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/libsss_certmap-devel-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/libsss_idmap-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/libsss_nss_idmap-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-client-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/libsss_sudo-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-nfs-idmap-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-common-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-common-pac-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-idp-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-krb5-common-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-ad-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/libipa_hbac-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-ipa-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-krb5-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-ldap-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-proxy-2.11.0-0.azl3.x86_64.rpm
        tdnf -y install /tmp/sssd/sssd-2.11.0-0.azl3.x86_64.rpm

3. Create an App Registration

For the SSSD “idp” provider to be able to read Entra ID user and group attributes, we must create an Application ID with a secret in our Entra tenant. The Application will require the following API permissions:

Additionally, the Application must be assigned the Directory Readers role over the directory.
This can be done through the Graph API using the following template: POST https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignments { "principalId": "<ObjectId of your SPN>", "roleDefinitionId": "<RoleDefinitionId for Directory Readers>", "directoryScopeId": "/" } Note the Application (client) ID and its secret as these will be required to for the SSSD configuration. 4. Configure SSSD & NSSWITCH For these I have used cloud init to add the sssd.conf and amend the nsswitch.conf during deployment across Slurm Controllers, Login nodes and Compute nodes. The SSSD service is also enabled and started. The resulting files should look like the following customized to your own domain, app Id and secret. /etc/sssd/sssd.conf [sssd] config_file_version = 2 services = nss, pam domains = mydomain.onmicrosoft.com [domain/mydomain.onmicrosoft.com] id_provider = idp idp_type = entra_id idp_client_id = ########-####-####-####-############ idp_client_secret = ######################################## idp_token_endpoint = https://login.microsoftonline.com/937d5829-df9d-46b6-ad5a-718ebc33371e/oauth2/v2.0/token idp_userinfo_endpoint = https://graph.microsoft.com/v1.0/me idp_device_auth_endpoint = https://login.microsoftonline.com/937d5829-df9d-46b6-ad5a-718ebc33371e/oauth2/v2.0/devicecode idp_id_scope = https%3A%2F%2Fgraph.microsoft.com%2F.default idp_auth_scope = openid profile email auto_private_groups = true use_fully_qualified_names = false cache_credentials = true entry_cache_timeout = 5400 entry_cache_nowait_percentage = 50 refresh_expired_interval = 4050 enumerate = false debug_level = 2 [nss] debug_level = 2 default_shell = /bin/bash fallback_homedir = /shared/home/%u [pam] debug_level = 2 /etc/nsswitch.conf # Begin /etc/nsswitch.conf passwd: files sss group: files sss shadow: files sss hosts: files dns networks: files protocols: files services: files ethers: files rpc: files # End /etc/nsswitch.conf 5. 
Create User home directories The use of Device Auth for Entra users over SSH is not currently supported so for now my Entra users will authenticate using SSH Public Key Auth. For that to work their $HOME directories must be pre-created, and their public keys added to .ssh/authorized_keys. This is simplified by having SSSD in place as we can use getent passwd to get a user’s $HOME and set directory and file permissions using the usual chown command. The following example script will create the users directory, add their public key, and creates a keypair for internal use across the cluster: #!/bin/bash # Script to create a user home directory and populate it with a given SSH public key. # Must be executed as root or via sudo. USER_NAME=$1 USER_PUBKEY=$2 if [ -z "${USER_NAME}" ] || [ -z "${USER_PUBKEY}" ]; then echo "Usage: $0 " exit 1 fi entry=$(getent passwd "${USER_NAME}") export USER_UID=$(echo "$entry" | awk -F: '{print $3}') export USER_HOME=$(echo "$entry" | awk -F: '{print $6}') #if directory exists, we're good if [ -d "${USER_HOME}" ]; then echo "Directory ${USER_HOME} exists, do not modify." else mkdir -p "${USER_HOME}" chown $USER_UID:$USER_UID $USER_HOME chmod 700 $USER_HOME cp -r /etc/skel/. $USER_HOME mkdir -p $USER_HOME/.ssh chmod 700 $USER_HOME/.ssh touch $USER_HOME/.ssh/authorized_keys chmod 644 $USER_HOME/.ssh/authorized_keys echo "${USER_PUBKEY}" >> $USER_HOME/.ssh/authorized_keys { echo "# Automatically generated - StrictHostKeyChecking is disabled to allow for passwordless SSH between Azure nodes" echo "Host *" echo " StrictHostKeyChecking no" } >> "$USER_HOME/.ssh/config" chmod 644 "$USER_HOME/.ssh/config" chown -R $USER_UID:$USER_UID $USER_HOME sudo -u $USER_NAME ssh-keygen -f $USER_HOME/.ssh/id_ed25519 -N "" -q cat $USER_HOME/.ssh/id_ed25519.pub >> $USER_HOME/.ssh/authorized_keys fi 6. 
Run jobs as Entra user
Logged in Entra user:
john.doe@tst4-login-0 [ ~ ]$ id
uid=1137116670(john.doe) gid=1137116670(john.doe) groups=1137116670(john.doe)
john.doe@tst4-login-0 [ ~ ]$ getent passwd john.doe
john.doe:*:1137116670:1137116670::/shared/home/john.doe:/bin/bash
And running an MPI job:
john.doe@tst4-login-0 [ ~ ]$ sbatch -p hbv4 /cvmfs/az.pe/1.2.6/tests/imb/imb-env-intel-oneapi.sh
Submitted batch job 330
john.doe@tst4-login-0 [ ~ ]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
330 hbv4 imb-env- john.doe R 0:06 2 tst4-hbv4-[114-115]
john.doe@tst4-login-0 [ ~ ]$ cat slurm-imb-env-intel-oneapi-330.out
Testing IMB using Spack environment intel-oneapi ...
Setting up Azure PE version 1.2.6 for azurelinux3.0 on x86_64
Testing IMB using srun...
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.7, MPI-1 part
#----------------------------------------------------------------
# Date : Mon Sep 29 15:24:24 2025
# Machine : x86_64
# System : Linux
# Release : 6.6.96.1-1.azl3
# Version : #1 SMP PREEMPT_DYNAMIC Tue Jul 29 02:44:24 UTC 2025
# MPI Version : 4.1
# MPI Thread Environment:
Summary
So, it is early days and this requires a little prep, but hopefully this demonstrates that with the new SSSD “idp” provider we can finally use Entra as the source of user identities on our HPC clusters.
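The lookup that the home-directory script in step 5 performs with getent passwd and awk can also be sketched in Python. This is a minimal illustration rather than part of the original setup: the helper name parse_passwd_entry is hypothetical, and the sample line is the getent passwd output shown above.

```python
from typing import NamedTuple


class PasswdEntry(NamedTuple):
    name: str
    uid: int
    gid: int
    home: str
    shell: str


def parse_passwd_entry(line: str) -> PasswdEntry:
    """Split one passwd-format line (name:passwd:uid:gid:gecos:home:shell)."""
    fields = line.strip().split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    name, _passwd, uid, gid, _gecos, home, shell = fields
    return PasswdEntry(name, int(uid), int(gid), home, shell)


# Sample taken from the getent passwd output shown in step 6
entry = parse_passwd_entry(
    "john.doe:*:1137116670:1137116670::/shared/home/john.doe:/bin/bash"
)
print(entry.uid, entry.home)
```

On a node where SSSD is running, the standard-library call pwd.getpwnam("john.doe") goes through the same NSS lookup (the "passwd: files sss" line in nsswitch.conf) and returns these fields directly.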
trcooper
Oct 01, 2025 Place Azure High Performance Computing (HPC) Blog
Explore HPC & AI Innovation: Microsoft + AMD at HPC Roundtable 2025
The HPC Roundtable 2025 in Turin brings together industry leaders, engineers, and technologists to explore the future of high-performance computing (HPC) and artificial intelligence (AI) infrastructure. Hosted by DoITNow, the event features Microsoft and AMD as key participants, with sessions highlighting real-world innovations such as Polestar’s adoption of Microsoft Azure HPC for Computer-Aided Engineering (CAE). Attendees will gain insights into cloud-native HPC, hybrid compute environments, and the convergence of simulation and machine learning. The roundtable offers networking opportunities, strategic discussions, and showcases how Microsoft Azure and AMD are accelerating engineering innovation and intelligent workloads in automotive and other industries.
Fernando_Aznar
Sep 29, 2025 Place Azure High Performance Computing (HPC) Blog
CycleCloud + Hammerspace
Abstract
The theme of this blog is “Simplicity”. Today’s HPC user has an overabundance of choices when it comes to HPC schedulers, clouds, infrastructure in those clouds, and data management solutions. Let's simplify it! Using CycleCloud as the nucleus, my intent is to show how simple it is to deploy a Slurm cluster on the Hammerspace data platform while using a standard NFS protocol. And for good measure, we will use a new feature in CycleCloud called Scheduled Events, which will automatically unmount the NFS share when the VMs are shut down.
CycleCloud and SLURM
Azure CycleCloud Workspace for Slurm is an Azure Marketplace solution template that delivers a fully managed SLURM workload environment on Azure. This occurs without requiring manually configured infrastructure or Slurm settings. To get started, go to the Azure Marketplace and type “Azure CycleCloud for Slurm”. I have not provided a detailed breakdown of the steps for Azure CycleCloud for Slurm as Kiran Buchetti does an excellent job of that in the blog here. It is a worthwhile read so please take a minute to review. Getting back to the theme of this blog, the simplicity of Azure CycleCloud Workspace for Slurm is one of its most important value propositions. Please see below for my top reasons why:
CycleCloud Workspace for Slurm is a simple template for creating an entire cluster. Without it, a user would have to manually install CycleCloud, install Slurm, configure the compute partitions, attach storage, etc. Instead, you fill out a marketplace template and a working cluster is live in 15-20 minutes.
Preconfigured best practices: prebuilt Slurm nodes, partitions, network and security rules are done for the end user. No deep knowledge of HPC or SLURM is required!
Automatic cost control: Workspace for Slurm is designed to deploy compute nodes only when a job is submitted. From there, the solution will automatically shut down after a job is complete. Moreover, Workspace for Slurm comes with preconfigured partitions (GPU partition, HTC spot partition), so end users can submit jobs to the right partition based on performance and budget.
Now that we have a cluster built, let's turn our attention to data management. I have chosen to highlight the Hammerspace Data Platform in this blog. Why? Namely, because it is a powerful solution that provides high performance and global access to CycleCloud HPC/AI nodes. Sticking true to our theme... it is also incredibly simple to integrate with CycleCloud.
Who is Hammerspace?
Before discussing integration, let's take a minute to introduce you to Hammerspace. Hammerspace is a software-defined data orchestration platform that provides a global file system across on-premises infrastructure and public clouds. It enables users and applications to access and manage unstructured data anywhere at any time. This all comes without the need to copy, migrate, or manually manage data. Hammerspace’s core philosophy is that “Data should follow the user, not the other way around”. More information on Hammerspace is available at the following link: Hammerspace Whitepapers
Linux Native
Hammerspace's foundation as a data platform is built natively into the Linux kernel, requiring no additional software installation on any nodes. The company’s goal is to deliver a High-Performance Plug and Play model, using standard NFS protocols (v3, v4, pNFS), that makes high performance and scalable file access familiar to any Linux system administrator. Let’s break down why the native kernel approach is important to a CycleCloud Workspace for Slurm user:
POSIX compliant high performance file access with no changes in code required.
No agents needed on the hosts, no additional CycleCloud templates needed. From a CycleCloud perspective, Hammerspace is simply an “external NFS”.
No re-staging of jobs required. It's NFS: all the compute nodes can access the same data (regardless of where it resides).
The days of copying / moving data between compute nodes are over. Seamless Mounting. Native NFS mounts can be added easily in CycleCloud and files are instantly available for SLURM jobs with no unnecessary job prep time. We will take a deeper dive into this topic in the next section. How to export NFS Native NFS mounts can be added easily to CycleCloud such as the example below... NFS mounts can be entered on the Marketplace template or alternatively via the scheduler. For Hammerspace – click on External NFS. Put in the IP of the Hammerspace Anvil Metadata server, add in your mount options, and that’s it. The example below uses NFS mounts for /sched and /data Once the nodes are provisioned, log into any of the nodes and they will be mounted. On the Hammerspace user interface, we see the /sched share deployed with any relevant IOPS, growth, and files That’s it. That’s all it takes to mount a powerful parallel file system to CycleCloud. Now let's look at the benefits of a Hammerspace/CycleCloud implementation Simplified data management: CycleCloud orchestrates HPC infrastructure on demand – Hammerspace ensures that the data is immediately available whenever the compute comes up. Hammerspace will also place data in the right location or tier based on its policy driven management. This reduces the need for manual scripting to put data on lower cost tiers of storage. No application refactoring: Applications do not need to add additional agents, nor do they have to change to benefit from using a Global Access system like Hammerspace. CycleCloud Scheduled Events The last piece of the story is the shutdown/termination process. The HPC jobs are complete, now it is time to shut down the nodes and save costs. What happens to the NFS mounts that are on each node? Prior to CycleCloud 8.2.2 – if nodes were not unmounted properly, NFS mounts could hang indefinitely waiting for IO. 
Users can now take advantage of “Scheduled Events” in CycleCloud – a feature that lets you put a script on your HPC nodes to automatically be executed when a supported event occurs. In our case, our supported event is a node termination. The following is taken straight from the CycleCloud Main page here. CycleCloud supports enabling Terminate Notification on scaleset VMs (e.g., execute nodes). To do this, set EnableTerminateNotification to true on the nodearray. This will enable it for scalesets created for this nodearray. To override the timeout allowed, you can set TerminateNotificationTimeout to a new time. For example, in a cluster template: The script to unmount a NFS share during a terminate event is not trivial: Add it to your project project.spec Attach it to the shutdown task: Simple! Now a user can run a job and terminate the nodes after job completion without worrying about what it does to the backend storage. No more cleanup! This is cost savings, operational efficiency, and resource cleanliness (no more stale azure resources like IP’s, NICs, and disks cluttering up a subscription). Conclusion Azure CycleCloud along with Slurm and the Hammerspace Data Platform provides a powerful, scalable and cost-efficient solution for HPC in the cloud. CycleCloud automates the provisioning (and the elastic scaling up and down) of the Infrastructure, SLURM manages the task of job scheduling, and Hammerspace delivers a global data environment with high performance parallel NFS. Ultimately, the most important element of the solution is the simplicity. Hammerspace enables HPC organizations to focus on solving core problems vs the headache of managing infrastructure, setup, and unpredictable storage mounts. By reducing the administrative overhead needed to run HPC environments, the solution described in this blog will help organizations accelerate time to results, lower costs, and drive innovation across all industries.
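As a rough illustration of what an unmount-on-terminate script from the Scheduled Events section has to do, the sketch below parses a mount table in /proc/mounts format and lists the NFS mount points to unmount, deepest path first. It is not the script from the CycleCloud documentation or from this blog; the function name and the sample mount table (including the 10.0.0.10 server address) are hypothetical.

```python
from typing import List

NFS_TYPES = {"nfs", "nfs4"}


def nfs_mount_points(mounts_text: str) -> List[str]:
    """Return NFS mount points from /proc/mounts-style text, deepest first."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] in NFS_TYPES:
            points.append(fields[1])
    # Unmount nested paths before their parents
    return sorted(points, key=lambda p: p.count("/"), reverse=True)


# Hypothetical mount table for illustration only
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
10.0.0.10:/sched /sched nfs rw,vers=3 0 0
10.0.0.10:/data /data nfs4 rw,vers=4.2 0 0
10.0.0.10:/data/scratch /data/scratch nfs4 rw,vers=4.2 0 0
"""

for point in nfs_mount_points(SAMPLE):
    # A real script would run this command (for example via subprocess.run)
    print("umount", point)
```

On a real node the text would come from reading /proc/mounts, and each returned path would be passed to umount before the terminate notification timeout expires.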
anhoward
Sep 25, 2025 Place Azure High Performance Computing (HPC) Blog
Performance analysis of DeepSeek R1 AI Inference using vLLM on ND-H100-v5
Introduction The DeepSeek R1 model represents a new frontier in large-scale reasoning for AI applications. Designed to tackle complex inference tasks, R1 pushes the boundaries of what’s possible—but not without significant infrastructure demands. To deploy DeepSeek R1 effectively in an inference service like vLLM, high-performance hardware is essential. Specifically, the model requires two Azure ND_H100_v5 nodes, each equipped with 8 NVIDIA H100 GPUs, totaling 16 H100s. These nodes are interconnected via InfiniBand and NVLink, ensuring the bandwidth and latency characteristics necessary to support the model’s massive memory footprint and parallel processing needs. In this post, we’ll present inference benchmark results for DeepSeek R1, measuring performance across GPU utilization, memory throughput, and interconnect efficiency. While R1 excels in reasoning tasks, it’s important to recognize that such models are not universally optimal. For many general-purpose AI applications, smaller models like Llama 3.1 8B offer a compelling alternative, delivering sufficient accuracy and performance at a fraction of the cost. We explore the performance characteristics of DeepSeek R1 and help you decide when a large reasoning model is worth the investment—and when a leaner solution might be the better choice. Benchmark environment 2 ND_H100_v5 nodes (16 H100) were required to load the DeepSeek R1 model into GPU memory. vLLM was used to set-up the online inference server using a single API server endpoint. vLLM bench (specifically hugging face AI-MO/aimo-validation-aime) was used to run the inference benchmark. Detailed specifications are available in the tables below. Results Reasoning model’s additional chain of thought tokens We show that DeepSeek R1 generates many additional tokens (for chain of thought), which helps reasoning models answer complex logical questions. In this example the DeepSeek R1 generates 1162 completion tokens, but Llama 3.1 8B only generates 37. 
This example also demonstrates that while non-reasoning models may struggle with this seemingly simple logical question, reasoning models like DeepSeek R1 can solve it—though they require significantly more intermediate tokens to do so. #!/bin/bash curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "deepseek-ai/DeepSeek-R1", "messages": [ {"role": "user", "content": "9.11 and 9.8, which is greater? Explain your reasoning"} ] }' Note: The REST API json output is formatted to make the output more readable. (./run_restapi_query.sh 2> /dev/null | python3 ./convert_json_2_txt.py >& run_query.out_$$) Reasoning: Okay, so I need to figure out whether 9.11 is greater than 9.8 or if 9.8 is greater. Hmm, let me start by looking at both numbers. They both have a 9 in the units place, so that part is equal. Now, I need to compare the decimal parts: .11 and .8. Wait, but decimals can be tricky because the places matter. The first number is 9.11, which has a 1 in the tenths place and a 1 in the hundredths place. The second number is 9.8, which is the same as 9.80 when you add a zero to the hundredths place for comparison. So, 9.80 versus 9.11. Now, looking at the tenths place: 8 is greater than 1. Therefore, 9.80 should be greater than 9.11. But let me double-check because sometimes when numbers have different decimal lengths, it's easy to make a mistake. If I convert both to fractions, maybe that will help. 9.11 is the same as 9 + 0.11, and 0.11 is 11/100. On the other hand, 9.8 is 9 + 0.8, which is 9 + 8/10. To compare 8/10 and 11/100, I can convert them to have the same denominator. 8/10 is equal to 80/100. Now, 80/100 is much larger than 11/100. So yes, 9.8 is indeed greater than 9.11. Another way to see it is by subtracting the two numbers. If I subtract 9.11 from 9.8, what do I get? Let's see: 9.8 - 9.11. Since 9.8 is the same as 9.80, subtracting 9.11 gives 0.69. Since the result is positive, that means 9.8 is larger. 
Wait, maybe I can also think about money to make it clearer. If I have $9.11 and you have $9.80, which is more? $9.80 is $9 and 80 cents, whereas $9.11 is $9 and 11 cents. Definitely, 80 cents is more than 11 cents, so $9.80 is more. That helps visualize it. Alternatively, using a number line. If I imagine a number line with 9.0, 9.1, 9.2,... up to 10.0. 9.11 is just a little past 9.1, while 9.8 is almost at 10.0. So clearly, 9.8 is much further to the right, meaning it's a larger number. I think that covers different methods: decimal place comparison, converting to fractions, subtraction, real-life money examples, and number line visualization. All of these methods point to 9.8 being greater than 9.11. I don't see any contradictions here. Maybe I can also consider writing both numbers with the same number of decimal places to make it even clearer. 9.11 is already two decimal places, and 9.8 can be written as 9.80. Now, comparing 9.80 and 9.11 digit by digit: - The units place: 9 vs. 9 (equal) - Tenths place: 8 vs. 1 (8 is greater) - Since the tenths place of 9.80 is greater, the entire number is larger, regardless of the hundredths place. So yes, that confirms it again. The key idea here is that when comparing decimals, you start from the leftmost digit after the decimal point and move right until you find a difference. In this case, the tenths place already shows that 9.8 is larger. I guess another common mistake could be thinking that because 11 is more than 8, 0.11 is more than 0.8, but that's not correct because the places are different. Tenths versus hundredths. 0.8 is equivalent to 0.80, which is way bigger than 0.11. To prevent that mistake, always align the decimal points and add trailing zeros if necessary. Then compare each digit one by one from left to right. That method avoids confusion. So, all in all, through multiple approaches, 9.8 is definitely greater than 9.11. 
Answer:
To determine which number is greater between **9.11** and **9.8**, follow these steps:
**Align the decimal places** by rewriting 9.8 as **9.80**.
**Compare digit by digit** from left to right:
- **Units place**: Both have **9** (equal).
- **Tenths place**: **8** (in 9.80) vs. **1** (in 9.11). Since \(8 > 1\), **9.80 is greater** at this point.
- No need to compare hundredths place once a difference is found.
**Conclusion**: \(9.8\) (or \(9.80\)) is greater than \(9.11\) because **8 tenths** (\(0.8\)) is larger than **1 tenth + 1 hundredth** (\(0.11\)).
\(\boxed{9.8 \text{ is greater}}\)
Completion Info:
Finish reason: stop
Prompt tokens: 21
Completion tokens: 1162
Total tokens: 1183
When running using Llama 3.1 8B (not a reasoning model):
Answer:
To determine which is greater, we need to compare the two numbers. 9.11 is greater than 9.8 because 9.11 is a larger decimal value.
Completion Info:
Finish reason: stop
Prompt tokens: 51
Completion tokens: 37
Total tokens: 88
Throughput and latency results
Cost comparison
In this cost analysis we use the ND-H100-v5 and ND-H200-v5 pay-as-you-go pricing in the South Central US region and the measured total throughput (tokens/sec) to compute the $/(1K tokens). Note: ND-H200-v5 pricing was estimated at 20% more than ND-H100-v5 pricing.
Analysis
DeepSeek R1 is a large, complex reasoning model that is costlier and slower than smaller models. It needs 16 H100 GPUs for FP8 precision and generates many more intermediate tokens in its chain of thought process (about 31 times more than Llama 3.1 8B), but at a much slower rate (~54 times slower). Its latency is also higher, with TTFT (time to first token) and ITL (inter-token latency) being roughly 6 and 3 times slower, respectively.
The DeepSeek R1 model has small intranode and internode network requirements (~14% of available InfiniBand network bandwidth and < 1% of available NVLink bandwidth were used). GPUs with higher memory bandwidth and higher FLOPS would help improve its performance.
The cost analysis shows that generating DeepSeek R1 tokens is ~54 times more expensive than generating Llama 3.1 8B tokens on the same 16 H100 GPUs, and ~34 times more expensive on 8 H200 GPUs. The DeepSeek R1 model is very capable, but due to its higher TCO it should only be used in specific AI applications that require its strong reasoning abilities.
Conclusion
The DeepSeek R1 model demonstrates exceptional reasoning capabilities, but its deployment demands substantial infrastructure and incurs high latency and cost. While it excels in generating detailed chains of thought, its throughput and efficiency lag significantly behind smaller models like Llama 3.1 8B. For applications requiring deep logical analysis, DeepSeek R1 is a powerful tool. However, for general-purpose inference tasks, more lightweight models offer better performance and cost-effectiveness. Strategic use of DeepSeek R1 should be reserved for scenarios where its advanced reasoning justifies the resource investment.
References
DeepSeek R1 model on Hugging Face https://huggingface.co/deepseek-ai/DeepSeek-R1
vLLM GitHub repository https://github.com/vllm-project/vllm
Azure ND H100 v5 documentation https://learn.microsoft.com/en-us/azure/virtual-machines/nd-h100-v5-series
FlashInfer GitHub repository https://github.com/flashinfer-ai/flashinfer
DeepGEMM GitHub repository https://github.com/deepseek-ai/DeepGEMM
AI-MO validation dataset on Hugging Face https://huggingface.co/datasets/AI-MO/aimo-validation-aime
Appendix
Install vLLM
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv myvllm --python 3.11 --seed
source myvllm/bin/activate
uv pip install vllm --torch-backend=auto
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
uv pip install ninja
cd flashinfer
uv pip install --no-build-isolation --verbose .
Install DeepSeek DeepEP
git clone https://github.com/vllm-project/vllm.git
cd ~/vllm/tools/ep_kernels
export CUDA_HOME=/usr/local/cuda-12.8
# For Hopper
TORCH_CUDA_ARCH_LIST="9.0" bash install_python_libraries.sh 2>&1 | tee install_python_libraries.log_$$
sudo bash configure_system_drivers.sh 2>&1 | tee configure_system_drivers.log_$$
sudo reboot
Install DeepSeek DeepGEMM
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
./install.sh 2>&1 | tee install.log_$$
Configure DeepSeek R1 with vLLM on 2 ND_H100_v5
Second node configuration
Execute this script on the second node before running the script on the primary node.
#!/bin/bash
MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000
export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0
# Headless worker: hosts data-parallel ranks 8-15 and connects to the primary node
vllm serve "$MODEL" --port "$PORT" \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-start-rank 8 \
    --data-parallel-address 10.0.0.6 \
    --data-parallel-rpc-port 23345 \
    --headless \
    --max-model-len 32768 \
    --reasoning-parser deepseek_r1
Primary node configuration
#!/bin/bash
MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000
export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0
# Primary node: hosts the first 8 data-parallel ranks and the single API server
vllm serve "$MODEL" --port "$PORT" \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-address 10.0.0.6 \
    --data-parallel-rpc-port 23345 \
    --api-server-count 1 \
    --max-model-len 32768 \
    --reasoning-parser deepseek_r1
Install vLLM benchmark environment
cd vllm
uv pip install "vllm[bench]"
Run vLLM benchmark
#!/bin/bash
vllm bench serve \
    --backend vllm \
    --model deepseek-ai/DeepSeek-R1 \
    --endpoint /v1/completions \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --ramp-up-strategy linear \
    --ramp-up-start-rps 1 \
    --ramp-up-end-rps 10 \
    --num-prompts 400 \
    --seed 42
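The "about 31 times more" figure in the Analysis follows from the two Completion Info blocks in the Results section. A small sketch of that arithmetic, using only the token counts reported above (the dictionary layout is just for illustration):

```python
# Token counts copied from the two "Completion Info" blocks in the Results section
usage = {
    "DeepSeek R1": {"prompt": 21, "completion": 1162, "total": 1183},
    "Llama 3.1 8B": {"prompt": 51, "completion": 37, "total": 88},
}

# Total tokens should equal prompt tokens plus completion tokens
for model, u in usage.items():
    assert u["prompt"] + u["completion"] == u["total"], model

ratio = usage["DeepSeek R1"]["completion"] / usage["Llama 3.1 8B"]["completion"]
print(f"DeepSeek R1 generated {ratio:.1f}x more completion tokens")  # 31.4x
```

This covers a single query only; the throughput and cost ratios quoted in the Analysis come from the full benchmark run, not from this one example.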
CormacGarvey
Aug 29, 2025 Place Azure High Performance Computing (HPC) Blog
Teamcenter Simulation Process Data Management Architecture on Azure CycleCloud- Slurm cluster
Introduction: Many customers run multiple Teamcenter-SPDM solutions across the enterprise, mixing multiple instances, multiple ISV vendors, and hybrid cloud/on-prem implementations. This fragmentation reduces the customer’s ability to uniformly access data. Consolidating Teamcenter-SPDM on Azure can speed the shift to one consistent, harmonized PLM experience, enterprise wide. What is Teamcenter Simulation? Teamcenter Simulation integrates simulation data, processes, and results into the broader PLM (Product Lifecycle Management) environment. Instead of engineers running simulations in silos on local drives, it provides: A single source of truth for CAD, simulation models, inputs, and results. Traceability across design, analysis, and manufacturing. Support for multi-CAD, multi-CAE tools (e.g., NX Nastran, ANSYS, Abaqus, Star-CCM+). Primary benefit Teamcenter Simulation SPDM gives you full traceability from source to solution. SPDM is a single source of truth where CAE analysis of a product design testing is related to a corresponding item in original CAD. This relationship of CAD and SIM data is a key to determine which CAD revision is captured in a particular CAE analysis. Architecture: Siemens Teamcenter SPDM baseline architecture has two major blocks of architectures which are connected. Teamcenter PLM core deployment StarCCM deployed on HPC Cyclecloud Slurm Workspace Teamcenter PLM Core Deployment: It has four distributed tiers (client, web, enterprise, and resource) in a single availability zone. Each tier aligns to function and communication flows between these tiers. All four tiers use their own virtual machines in a single virtual network. The Teamcenter Simulation aka CAE manage is core business functionality of SPDM runs on a central server in the enterprise tier and users access it through a web-based or thick-client interface. 
You can deploy multiple instances in Dev and Test environments by adding extra virtual machines and storage on virtual networks separate from production virtual networks.
StarCCM HPC CycleCloud Slurm cluster architecture:
Siemens StarCCM simulation software is deployed on the Azure CycleCloud HPC scheduler node. The CAE analyst submits simulation jobs from the Teamcenter Active Workspace or Rich Client UI. Azure HPC then spins up HPC nodes, which process the jobs submitted by the CAE analyst based on the runtime parameters. StarCCM completes the simulation iterations and generates the .sim output file.
Workflow
CAE analysts, SPDM users, and Teamcenter users access the Teamcenter application via an HTTPS-based public URL endpoint. Users access the application through two user interfaces: (1) a Rich Client and (2) an Active Workspace client. CAE engineers and simulation analysts access Teamcenter through the Teamcenter Simulation client, a lightweight thin client that runs on the user’s desktop.
User access is authenticated via the company’s Azure Entra ID. Azure Entra ID with SAML configuration allows single sign-on (SSO) to the Teamcenter application.
Azure Firewall and the Azure backbone form the security component, which filters traffic using threat intelligence feeds directly from Microsoft Cyber Security. HTTPS traffic is directed to the Azure Application Gateway. The Hub virtual network and Spoke virtual network are peered so they can communicate over the Azure backbone network.
Azure Application Gateway routes traffic to Teamcenter’s web server virtual machines (VMs) in the Web tier.
Siemens PLM Teamcenter is deployed on Azure. For detailed information about the Teamcenter architecture on Azure, refer to this url.
The Teamcenter Simulation client runs on the Teamcenter user’s desktop. CAE Manager is deployed as an integral part of the Teamcenter package.
Teamcenter Simulation on Azure HPC: CAE Engineer executes the following typical workflow with Azure HPC cluster Step 1: CAD Data & Product Structures CAD models (e.g., from NX, CATIA, SolidWorks) are managed in Teamcenter. Simulation engineer links simulation models directly to Teamcenter product structures. Ensures simulation always uses the latest or correct version of the design. Step 2: Build Simulation Model (Pre-processing) Simulation templates define solver type (FEA, CFD, Multiphysics) and required inputs. Engineers use tools like NX CAE, Simcenter 3D, ANSYS, Abaqus, or Star-CCM+ integrated with Teamcenter. Meshes, boundary conditions, loads, and materials are associated with the correct design revision. Step 3: Manage Simulation Data All input decks, scripts, and models stored in Teamcenter for version control. Metadata (e.g., load case, solver settings) captured for searchability & re-use. Supports process automation: simulation workflows can be pre-configured for repeatable tasks. Step 4: Run Simulation Jobs (Enhanced with Azure CycleCloud Benefits) Jobs submitted to local HPC clusters or cloud HPC (Azure CycleCloud,) directly from Teamcenter. Teamcenter stores solver logs, job status, and output files. Following diagram show end to end workflow starts with Teamcenter CAE manager--> StarCCM -->HPC cluster ->Simulation processing Sim file -->Sim file back to Teamcenter Teamcenter CAE manager--> StarCCM running on HPC cluster Teamcenter generates the job file on the HPC node HPC Cluster creating HPC nodes Squeue monitoring on HPC node Job monitoring on Teamcenter UI Simulation output file generated by Sbatch job File copied over to Teamcenter shared file location Step 5: Post-processing & Results Management Results imported back into Teamcenter: stress plots, temperature distributions, flow fields, etc. Visualization via Simcenter 3D, JT format (lightweight 3D), or web-based viewers. 
Results tied back to: Design versions Simulation setup Load cases This creates a traceable digital thread from requirements → design → simulation → results. Step 6: Review, Sign-off, and Collaboration Results shared with design, manufacturing, and management teams in Teamcenter. Review workflows, e-signatures, and approvals integrated into PLM processes. Simulation results influence design changes and product validation reports. Azure CycleCloud adds several key advantages: On-demand scaling: Automatically provisions Azure compute nodes when workloads spike, then scales down when jobs complete to reduce costs. HPC Slurm scheduler integration: Supports popular schedulers like Slurm enabling smooth job submission from Teamcenter. Multi-VM sizes & GPU support: Allows selecting the right mix of CPU/GPU VMs for different simulation workloads (e.g., CFD, FEA, ML-driven simulations). Hybrid flexibility: Combine on-prem HPC with Azure bursting to handle peak demand without over-provisioning local hardware. Cost governance: Built-in cost controls, job quotas, and reporting to track simulation expenses. Security & compliance: Leverages Azure security, VNet isolation, and role-based access control for simulation data and compute resources. Integration with Azure Storage: Simplifies access to input/output files using Azure Blob, Azure NetApp Files, or Lustre for HPC-grade throughput. Conclusion: Siemens Teamcenter SPDM, when deployed on Azure HPC CycleCloud Workspaces, delivers a scalable and high-performance simulation data management solution. The integration with Azure CycleCloud enables dynamic provisioning of compute resources, allowing simulation workloads to scale elastically based on demand. This ensures optimal resource utilization and cost efficiency, especially during peak simulation cycles. 
With support for Slurm scheduling, multi-VM configurations, and GPU acceleration, SPDM on HPC CCWs empowers engineering teams to run complex simulations faster and more reliably. The architecture’s hybrid flexibility—combining on-premises and cloud bursting—further enhances throughput without overcommitting infrastructure, making it a robust foundation for enterprise-wide digital thread and product validation workflows.
Sunita_AZ0708
Aug 28, 2025 Place Azure High Performance Computing (HPC) Blog
Inference performance of Llama 3.1 8B using vLLM across various GPUs and CPUs
Introduction Following our previous evaluation of Llama 3.1 8B inference performance on Azure’s ND-H100-v5 infrastructure using vLLM, this report broadens the scope to compare inference performance across a range of GPU and CPU platforms. Using the Hugging Face inference benchmarker, we assess not only throughput and latency but also the cost-efficiency of each configuration—an increasingly critical factor for enterprise deployment. As organizations seek scalable and budget-conscious solutions for deploying large language models (LLMs), understanding the trade-offs between compute-bound and memory-bound stages of inference becomes essential. Smaller models like Llama 3.1 8B offer a compelling balance between capability and resource demand, but the underlying hardware and software stack can dramatically influence both performance and operational cost. This report presents a comparative analysis of inference performance across multiple hardware platforms, factoring in: Token throughput and latency across chat, classification, and code generation workloads. Resource utilization, including KV cache utilization and efficiency. Cost per token, derived from cloud pricing models and hardware utilization metrics. By combining performance metrics with cost analysis, we aim to identify the most effective deployment strategies for enterprise-grade LLMs, whether optimizing for speed, scalability, or budget. Benchmark environment Inference benchmark The Hugging face Inference benchmarking code was used for the AI Inference benchmark. Three different popular AI inference profiles were examined. Chat: Probably the most common use case, question and answer format on a wide range of topics. Classification: Providing various documents and requesting a summary of its contents. Code generation: Providing code and requesting code generation, e.g. create a new function. 
| Profile | Data set | Input prompt | Output prompt |
|---|---|---|---|
| Chat | hlarcher/inference-benchmarker/share_gpt_turns.json | N/A | min=50, max=800, variance=100 |
| Classification | hlarcher/inference-benchmarker/classification.json | min=8000, max=12000, variance=5000 | min=30, max=80, variance=10 |
| Code generation | hlarcher/inference-benchmarker/github_code.json | min=3000, max=6000, variance=1000 | min=30, max=80, variance=10 |

| Hugging Face Llama 3.1 8B models used | Precision | Model Size (GiB) |
|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | FP16 | 14.9 |

| vLLM parameters | Default value |
|---|---|
| gpu_memory_utilization | 0.9 |
| max_num_seqs | 1024 |
| max_num_batched_tokens | 2048 (A100), 8192 (H100, H200) |
| enable_chunked_prefill | True |
| enable_prefix_caching | True |

VM Configuration

| Platform | Configuration |
|---|---|
| GPU | ND-H100-v5, ND-H200-v5, ND-A100-v4 (8 A100 80GB & 40GB) running HPC Ubuntu 22.04 (PyTorch 2.7.0+cu128, GPU driver: 535.161.08 and NCCL 2.21.5-1). 1 GPU was used in benchmark tests. |
| CPU | Ubuntu 22.04 (HPC and Canonical/jammy) |

Results

| GPU | Profile | Avg prompt throughput (tokens/s) | Avg generation throughput (tokens/s) | Max # requests waiting | Max KV cache usage % | Avg KV cache hit rate % |
|---|---|---|---|---|---|---|
| H100 | Chat | ~2667 | ~6067 | 0 | ~14% | ~75% |
| H100 | Classification | ~254149 | ~1291 | 0 | ~46% | ~98% |
| H100 | Code generation | ~22269 | ~266 | ~111 | ~93% | ~1% |
| H200 | Chat | ~3271 | ~7464 | 0 | ~2% | ~77% |
| H200 | Classification | ~337301 | ~1635 | 0 | ~24% | ~99% |
| H200 | Code generation | ~22726 | ~274 | ~57 | ~46% | ~1% |
| A100 | Chat | ~1177 | ~2622 | 0 | ~2% | ~75% |
| A100 | Classification | ~64526 | ~333 | 0 | ~45% | ~97% |
| A100 | Code generation | ~7926 | ~95 | ~106 | ~21% | ~1% |
| A100_40G | Chat | ~1069 | ~2459 | 0 | ~27% | ~75% |
| A100_40G | Classification | ~7846 | ~39 | ~116 | ~68% | ~5% |
| A100_40G | Code generation | ~7836 | ~94 | ~123 | ~66% | ~1% |

Cost analysis

Cost analysis used pay-as-you-go pricing for the South Central US region and measured throughput in tokens per second to calculate the metric $/(1K tokens).

CPU performance and takeaways

The Hugging Face AI-MO/aimo-validation-aime dataset was used by vllm bench to test the performance of Llama 3.1 8B on various VM types (left graph below).
Running Llama 3.1 8B on CPU VMs is a struggle because of insufficient FLOPs and memory bandwidth. Even the best-performing CPU VM (HB176-96_v4) has significantly lower throughput and higher latency than the A100_40GB GPU.

Tips
- Enable and use AVX512 (avx512f, avx512_bf16, avx512_vnni, etc.). Run lscpu to see what is supported and available.
- Put the AI model on a single socket if it has sufficient memory. For larger models, you can use tensor parallelism to split the model across sockets.
- Use pinning to specify which cores the threads will run on (in vLLM, VLLM_CPU_OMP_THREADS_BIND=0-22).
- Specify a large enough KV cache in CPU memory (in vLLM, VLLM_CPU_KVCACHE_SPACE=100).

Analysis

Throughput & Latency
- H200 outperforms all other GPUs across all workloads, with the highest prompt and generation throughput.
- H100 is a close second, showing strong performance especially in classification and code generation.
- A100 and A100_40G lag significantly behind, particularly in classification tasks, where throughput on A100_40G drops by an order of magnitude because of its smaller GPU memory and lower KV cache hit rate.

KV Cache Utilization
- H200 and H100 show efficient cache usage with high hit rates (up to 99%) and few waiting requests. The exception is code generation, which has low hit rates (~1%).
- A100_40G suffers from high KV cache usage and low hit rates, especially in classification and code generation, indicating memory bottlenecks. The strain on the inference server shows up as a higher number of waiting requests.

Cost Efficiency
- Chat profiles: The A100 GPU (40G) offers the best value.
- Classification profiles: The H200 is most cost-effective.
- Code-generation profiles: The H100 provides the greatest cost efficiency.

CPU vs GPU
- Llama 3.1 8B can run on CPU VMs, but the throughput and latency are so poor compared to GPUs that it does not make practical or financial sense to do so.
- Smaller AI models (<= 1B parameters) may be acceptable on CPUs for some lightweight inference services (like chat).
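The $/(1K tokens) metric used in the cost analysis can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' script: the function name, the `gpus_per_vm` split of the VM price, and the hourly price in the example are all assumptions, and real prices should come from the Azure pricing calculator.

```python
def cost_per_1k_tokens(hourly_price_usd: float,
                       tokens_per_second: float,
                       gpus_per_vm: int = 1) -> float:
    """Return USD per 1,000 tokens at a steady throughput.

    Assumption: when only one GPU of a multi-GPU VM is benchmarked, the VM's
    hourly price is split evenly across its GPUs (gpus_per_vm).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if gpus_per_vm < 1:
        raise ValueError("gpus_per_vm must be at least 1")
    price_per_gpu_hour = hourly_price_usd / gpus_per_vm
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour / tokens_per_hour * 1000


# Hypothetical hourly price, NOT an Azure quote; replace with a real one.
EXAMPLE_VM_PRICE_USD_PER_HOUR = 100.0

# ~6067 tokens/s is the H100 chat generation throughput from the results table.
example = cost_per_1k_tokens(EXAMPLE_VM_PRICE_USD_PER_HOUR, 6067, gpus_per_vm=8)
print(f"${example:.6f} per 1K generated tokens")
```

Because the metric is price divided by throughput, a cheaper GPU can win on cost even when it loses on raw speed, which is consistent with the A100 (40G) offering the best value for chat.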
Conclusion

The benchmarking results clearly demonstrate that hardware choice significantly impacts the inference performance and cost-efficiency of Llama 3.1 8B deployments. The H200 GPU consistently delivers the highest throughput and cache efficiency across workloads, making it the top performer overall. H100 follows closely, especially excelling in code generation tasks. While A100 and A100_40G offer budget-friendly options for chat workloads, their limitations in memory and cache performance make them less suitable for more demanding tasks. CPU virtual machines do not offer adequate throughput or latency for running AI models comparable in size to Llama 3.1 8B. These insights provide a practical foundation for selecting optimal infrastructure based on inference workload type and cost constraints.

References
- Hugging Face Inference Benchmarker: https://github.com/huggingface/inference-benchmarker
- Datasets used for benchmarking:
  - Chat: hlarcher/inference-benchmarker/share_gpt_turns.json
  - Classification: hlarcher/inference-benchmarker/classification.json
  - Code generation: hlarcher/inference-benchmarker/github_code.json
- Model: meta-llama/Llama-3.1-8B-Instruct on Hugging Face: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
- vLLM Inference Engine: https://github.com/vllm-project/vllm
- Azure ND-Series GPU Infrastructure: https://learn.microsoft.com/en-us/azure/virtual-machines/nd-series
- PyTorch 2.7.0 + CUDA 12.8: https://pytorch.org
- NVIDIA GPU Drivers and NCCL (Driver: 535.161.08, NCCL: 2.21.5-1): https://developer.nvidia.com/nccl
- Azure Pricing Calculator (South Central US region): https://azure.microsoft.com/en-us/pricing/calculator

CPU - vLLM Appendix

Install vLLM on CPU VMs

    git clone https://github.com/vllm-project/vllm.git vllm_source
    cd vllm_source

Edit the Dockerfiles (copies of vllm_source/docker/Dockerfile.cpu)

    cp docker/Dockerfile.cpu docker/Dockerfile_serve.cpu

Change the last line of Dockerfile_serve.cpu to “ENTRYPOINT ["/opt/venv/bin/vllm","serve"]”

    cp docker/Dockerfile.cpu docker/Dockerfile_bench.cpu

Change the last line of Dockerfile_bench.cpu to “ENTRYPOINT ["/opt/venv/bin/vllm","bench","serve"]”

Build images (enable the AVX512 features your CPU supports; see lscpu)

    docker build -f docker/Dockerfile_serve.cpu \
      --build-arg VLLM_CPU_AVX512BF16=true \
      --build-arg VLLM_CPU_AVX512VNNI=true \
      --build-arg VLLM_CPU_DISABLE_AVX512=false \
      --tag vllm-serve-cpu-env --target vllm-openai .

    docker build -f docker/Dockerfile_bench.cpu \
      --build-arg VLLM_CPU_AVX512BF16=true \
      --build-arg VLLM_CPU_AVX512VNNI=true \
      --build-arg VLLM_CPU_DISABLE_AVX512=false \
      --tag vllm-bench-cpu-env --target vllm-openai .

Start vLLM server

Remember to set <SIZE in GiB>, <CPU CORE RANGE> and <YOUR HF TOKEN>.

    docker run --rm --privileged=true --shm-size=8g -p 8000:8000 \
      -e VLLM_CPU_KVCACHE_SPACE=<SIZE in GiB> \
      -e VLLM_CPU_OMP_THREADS_BIND=<CPU CORE RANGE> \
      -e HF_TOKEN=<YOUR HF TOKEN> \
      -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" \
      vllm-serve-cpu-env meta-llama/Llama-3.1-8B-Instruct --port 8000 --dtype=bfloat16

Run vLLM benchmark

Remember to set <YOUR HF TOKEN>.

    docker run --rm --privileged=true --shm-size=4g \
      -e HF_TOKEN=<YOUR HF TOKEN> \
      -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" \
      vllm-bench-cpu-env --backend vllm --model=meta-llama/Llama-3.1-8B-Instruct \
      --endpoint /v1/completions --dataset-name hf --dataset-path AI-MO/aimo-validation-aime \
      --ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 2 \
      --num-prompts 200 --seed 42 --host 10.0.0.4
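When choosing a value for VLLM_CPU_KVCACHE_SPACE, a rough estimate of KV cache size per token can help. The sketch below is an approximation under stated assumptions, not vLLM's own accounting: the layer count, KV-head count, and head dimension are taken from the published Llama 3.1 8B configuration (verify them against the model's config.json), 2 bytes per value matches --dtype=bfloat16, and block-allocation overhead is ignored. All names in the code are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    """Model dimensions that determine KV cache size (illustrative helper)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int

    def bytes_per_token(self) -> int:
        # Factor 2: one key tensor and one value tensor are cached per layer.
        return (2 * self.num_layers * self.num_kv_heads
                * self.head_dim * self.bytes_per_value)


# Assumed Llama 3.1 8B shape (32 layers, 8 KV heads, head_dim 128), bfloat16.
LLAMA_3_1_8B_BF16 = KVShape(num_layers=32, num_kv_heads=8,
                            head_dim=128, bytes_per_value=2)

GIB = 1024 ** 3


def tokens_that_fit(kv_cache_gib: float, shape: KVShape) -> int:
    """Number of cached tokens a KV cache of the given size can hold."""
    return int(kv_cache_gib * GIB) // shape.bytes_per_token()


def gib_needed(concurrent_sequences: int, tokens_per_sequence: int,
               shape: KVShape) -> float:
    """KV cache size in GiB for a target concurrency and sequence length."""
    total_bytes = (concurrent_sequences * tokens_per_sequence
                   * shape.bytes_per_token())
    return total_bytes / GIB


print(LLAMA_3_1_8B_BF16.bytes_per_token())          # bytes of KV cache per token
print(tokens_that_fit(100, LLAMA_3_1_8B_BF16))      # capacity of a 100 GiB cache
print(gib_needed(16, 12000, LLAMA_3_1_8B_BF16))     # 16 prompts of 12000 tokens
```

Under these assumptions each token costs 128 KiB of cache, so sixteen concurrent 12000-token classification prompts would need roughly 23.4 GiB before any generated tokens are counted.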
CormacGarvey
Aug 26, 2025 Place Azure High Performance Computing (HPC) Blog
Ansys Minerva Simulation & Process Data Management Architecture on Azure
Architecture

The Ansys Minerva baseline architecture has four distributed tiers (client, web, enterprise, and resource) in a single Azure availability zone. Each tier aligns to a function, and communication flows between the tiers. All four tiers use their own virtual machines in a single virtual network. The Minerva core business functionality runs on a central core server in the enterprise tier, and users access it through a web-based URL client. You can deploy multiple instances in Dev and Test environments on virtual machines and storage in Dev/Test virtual networks that are separate from production virtual networks.

Workflow

SPDM users access the Minerva application over the internet through a public HTTPS endpoint URL.

Microsoft Entra ID with SAML configuration allows single sign-on authentication to the Minerva application. The user is authenticated using a Minerva credential that a Minerva administrator creates in Minerva.

Azure Firewall is an Azure backbone component that filters traffic using threat intelligence feeds directly from Microsoft Cyber Security. HTTPS traffic is directed to Azure Application Gateway. The hub virtual network and spoke virtual network are peered to communicate over the Azure backbone network.

Azure Application Gateway routes traffic to Minerva’s web server virtual machines (VMs) in the Web tier. Application Gateway with Web Application Firewall inspects incoming HTTP traffic to continuously monitor Minerva against exploits. It integrates with other Azure services (App Service, VMSS, AKS, etc.), making it easier to build cloud-native solutions, and it supports sticky sessions for applications that require session persistence.

Web tier subnet: Users access the core component of Minerva through the Web tier, which runs an IIS application server. To ensure consistent and reliable performance for your application, all virtual machines should have the recommended VM size and disk configuration.
Depending on your needs, you may want to use HPC (High Performance Computing) VM SKUs. Make sure all VM instances are created from the same base OS image and configuration.

Enterprise subnet: Runs the core Minerva components. Individual user access is granted based on valid Minerva and Aras Innovator feature licenses. These feature licenses are separate from the Aras Innovator server licenses. Enterprise tier VMs run the core business logic components of Minerva: the Minerva Simulation & Process Data Management core server, agent server, vault server, metadata extraction server, and license servers.

Core components: Minerva’s central processing server is an IIS application server. The agent server runs the agent services that are responsible for various platform orchestration activities. All the core components must be deployed in an Azure proximity placement group to minimize latency.

Distributed components: The vault server and the metadata extraction server. The vault server stores the files and is paired with other servers dedicated to metadata extraction. An IIS web server acts as a front end to the file repository. There can be any number of data vaults distributed throughout the organization, based on specific needs and criteria, and all vaults communicate with the centralized core components. The scope of the Minerva vault server can be expanded to interact with any HPC cluster.

Extraction server: Metadata extraction is very memory, processor, and disk intensive, and it can open large files. This activity requires sufficient Azure virtual machine and storage capacity. SKU recommendations are given below.

MS SQL Server: You can deploy the Standard or Enterprise edition of SQL Server based on your company’s requirements. The Minerva SQL Server database stores metadata objects only; no binary files are stored in the database.

Database subnet: Runs a SQL Server database using an infrastructure-as-a-service (IaaS) deployment.
The database deployment uses SQL Server Always On availability groups for asynchronous replication. A Minerva deployment could also run an Oracle Database server on this IaaS deployment.

Storage subnet: Uses Azure Files Premium and/or Azure NetApp Files.

On-premises network: Allows the customer support team and system administrators to connect to Azure through an Azure VPN connection and to access any virtual machine instance via Remote Desktop Protocol (RDP) from Azure Bastion.

Minerva Core Component & Vault reliability

Use multiple VMs in the Web tier. To enhance the resiliency and scalability of the Ansys Minerva application running on Azure, distribute the four logical tiers across multiple virtual machines. Running multiple parallel web servers is recommended for load balancing, increased reliability, or both.

Use multiple VMs in the Enterprise tier. Install the Enterprise tier on multiple Azure virtual machines. This setup provides failover support and enables load balancing to optimize performance. Application Gateway load balances across the web server VMs in the Web subnet. Distributing software functions over a network lets the application achieve high availability and improves overall system reliability. This configuration is particularly beneficial for production environments where uninterrupted operation and efficient resource utilization are crucial. With the workload distributed across multiple virtual machines, the Minerva application can handle increased demand and provide a robust and responsive user experience. Following this recommended architecture lets you use the scalability and resilience capabilities of Azure to optimize the performance of the Ansys Minerva application, and it helps ensure uninterrupted access to critical product lifecycle management functionalities.

Resource tier reliability

Configure database backups. For SQL Server, one approach is to use Azure Backup with a Recovery Services vault to back up SQL Server databases that run on VMs.
With this solution, you can perform most of the key backup management operations without being limited to the scope of an individual vault. For more information on Oracle, see Oracle Database in Azure Virtual Machines backup strategies.

Use the native backup utility. Using Azure backups is recommended. When performing server-level backups, avoid backing up the active database files directly, because the backup may not capture the complete state of the database files at the time of backup. Instead, server-level backups should back up the backup file generated by the database backup utility. This approach gives a more reliable and consistent backup of the application's database. It protects the integrity and availability of your Minerva application data and enables efficient recovery after unforeseen issues or data loss.

Configure volume backups. Azure Files can take snapshots of file shares, creating point-in-time, read-only copies of your data. Use Azure Files or Azure NetApp Files snapshots to establish a general-purpose backup solution that safeguards against accidental deletions or unintended changes to the data. For the Minerva volume server, use file volume backups. This configuration backs up the data stored on the volume server and enables easy recovery after data loss or system failures, which mitigates the risks associated with data loss or unauthorized modifications.

Test database and storage backups. Carefully plan, document, and test the backup and recovery strategy for the Minerva database and file manager servers.

Configure backup frequency. Determine backup needs based on business requirements, considering the increasing number of users.
A daily backup may not be sufficient for optimal protection, so adjust the frequency accordingly.

Coordinate volume data with database backups. Ensure that backups for the volume servers are coordinated with database backups. This allows you to sync the actual files with the file metadata.

Enhance database reliability. Provision SQL Server VMs in availability sets to improve database reliability. Availability sets deploy virtual machines across fault domains and update domains, mitigating downtime events within the datacenter. Create the availability set during VM provisioning. Additionally, consider replicating Azure storage across different Azure datacenters for additional redundancy. For Oracle databases, Azure offers availability zones and availability sets. Use availability sets only in regions where availability zones are unavailable. In addition to Azure tools, Oracle provides the Oracle Data Guard and GoldenGate solutions.

Use Always On availability groups. Configure the database server with an Always On availability group for SQL Server on Azure Virtual Machines. This option uses the underlying Windows Server Failover Clustering (WSFC) service and helps ensure high availability. For more information, see Overview of SQL Server Always On availability groups and Windows Server Failover Clustering (WSFC).

Security

Azure Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.

Recommended SKUs for Minerva to run on Azure

| Role of the server | SKU |
|---|---|
| Core server | Standard_F16s_v2 |
| Agent server | Standard_F8s_v2 |
| License server | Standard_D4d_v5 |
| Extraction server | Standard_F8s_v2 |
| Database servers | Standard_E32-16ds_v4 |
| Volume server | Standard_L32s_v3 |
Sunita_AZ0708
Jul 30, 2025 Place Azure High Performance Computing (HPC) Blog
Creating a Slurm Job Submission App in Open OnDemand with Copilot Agent
High Performance Computing (HPC) environments are essential for research, engineering, and data-intensive workloads. To efficiently manage compute resources and job submissions, organizations rely on robust scheduling and orchestration tools. In this blog post, we'll explore how to use Copilot Agent in Visual Studio Code (VSCode) to build an Open OnDemand application that submits Slurm jobs in a CycleCloud Workspace for Slurm (CCWS) environment. We'll start with a brief overview of CCWS and Open OnDemand, then dive into the integration workflow.
xpillons
Jul 21, 2025 Place Azure High Performance Computing (HPC) Blog