Performance analysis of DeepSeek R1 AI Inference using vLLM on ND-H100-v5
Introduction

The DeepSeek R1 model represents a new frontier in large-scale reasoning for AI applications. Designed to tackle complex inference tasks, R1 pushes the boundaries of what’s possible, but not without significant infrastructure demands. To deploy DeepSeek R1 effectively in an inference service like vLLM, high-performance hardware is essential. Specifically, the model requires two Azure ND_H100_v5 nodes, each equipped with 8 NVIDIA H100 GPUs, for a total of 16 H100s. These nodes are interconnected via InfiniBand and NVLink, providing the bandwidth and latency characteristics needed to support the model’s massive memory footprint and parallel processing needs.

In this post, we present inference benchmark results for DeepSeek R1, measuring performance across GPU utilization, memory throughput, and interconnect efficiency. While R1 excels in reasoning tasks, it’s important to recognize that such models are not universally optimal. For many general-purpose AI applications, smaller models like Llama 3.1 8B offer a compelling alternative, delivering sufficient accuracy and performance at a fraction of the cost. We explore the performance characteristics of DeepSeek R1 and help you decide when a large reasoning model is worth the investment, and when a leaner solution might be the better choice.

Benchmark environment

Two ND_H100_v5 nodes (16 H100 GPUs) were required to load the DeepSeek R1 model into GPU memory. vLLM was used to set up the online inference server with a single API server endpoint. vLLM bench (specifically with the Hugging Face dataset AI-MO/aimo-validation-aime) was used to run the inference benchmark. Detailed specifications are available in the tables below.

Results

Reasoning model’s additional chain of thought tokens

We show that DeepSeek R1 generates many additional tokens (for chain of thought), which helps reasoning models answer complex logical questions. In this example, DeepSeek R1 generates 1162 completion tokens, but Llama 3.1 8B generates only 37.
This example also demonstrates that while non-reasoning models may struggle with this seemingly simple logical question, reasoning models like DeepSeek R1 can solve it, though they require significantly more intermediate tokens to do so.

#!/bin/bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
      {"role": "user", "content": "9.11 and 9.8, which is greater? Explain your reasoning"}
    ]
  }'

Note: The REST API JSON output is formatted to make the output more readable. (./run_restapi_query.sh 2> /dev/null | python3 ./convert_json_2_txt.py >& run_query.out_$$)

Reasoning: Okay, so I need to figure out whether 9.11 is greater than 9.8 or if 9.8 is greater. Hmm, let me start by looking at both numbers. They both have a 9 in the units place, so that part is equal. Now, I need to compare the decimal parts: .11 and .8. Wait, but decimals can be tricky because the places matter. The first number is 9.11, which has a 1 in the tenths place and a 1 in the hundredths place. The second number is 9.8, which is the same as 9.80 when you add a zero to the hundredths place for comparison. So, 9.80 versus 9.11. Now, looking at the tenths place: 8 is greater than 1. Therefore, 9.80 should be greater than 9.11. But let me double-check because sometimes when numbers have different decimal lengths, it's easy to make a mistake. If I convert both to fractions, maybe that will help. 9.11 is the same as 9 + 0.11, and 0.11 is 11/100. On the other hand, 9.8 is 9 + 0.8, which is 9 + 8/10. To compare 8/10 and 11/100, I can convert them to have the same denominator. 8/10 is equal to 80/100. Now, 80/100 is much larger than 11/100. So yes, 9.8 is indeed greater than 9.11. Another way to see it is by subtracting the two numbers. If I subtract 9.11 from 9.8, what do I get? Let's see: 9.8 - 9.11. Since 9.8 is the same as 9.80, subtracting 9.11 gives 0.69. Since the result is positive, that means 9.8 is larger.
Wait, maybe I can also think about money to make it clearer. If I have $9.11 and you have $9.80, which is more? $9.80 is $9 and 80 cents, whereas $9.11 is $9 and 11 cents. Definitely, 80 cents is more than 11 cents, so $9.80 is more. That helps visualize it. Alternatively, using a number line. If I imagine a number line with 9.0, 9.1, 9.2,... up to 10.0. 9.11 is just a little past 9.1, while 9.8 is almost at 10.0. So clearly, 9.8 is much further to the right, meaning it's a larger number. I think that covers different methods: decimal place comparison, converting to fractions, subtraction, real-life money examples, and number line visualization. All of these methods point to 9.8 being greater than 9.11. I don't see any contradictions here. Maybe I can also consider writing both numbers with the same number of decimal places to make it even clearer. 9.11 is already two decimal places, and 9.8 can be written as 9.80. Now, comparing 9.80 and 9.11 digit by digit: - The units place: 9 vs. 9 (equal) - Tenths place: 8 vs. 1 (8 is greater) - Since the tenths place of 9.80 is greater, the entire number is larger, regardless of the hundredths place. So yes, that confirms it again. The key idea here is that when comparing decimals, you start from the leftmost digit after the decimal point and move right until you find a difference. In this case, the tenths place already shows that 9.8 is larger. I guess another common mistake could be thinking that because 11 is more than 8, 0.11 is more than 0.8, but that's not correct because the places are different. Tenths versus hundredths. 0.8 is equivalent to 0.80, which is way bigger than 0.11. To prevent that mistake, always align the decimal points and add trailing zeros if necessary. Then compare each digit one by one from left to right. That method avoids confusion. So, all in all, through multiple approaches, 9.8 is definitely greater than 9.11. 
Answer: To determine which number is greater between **9.11** and **9.8**, follow these steps:
**Align the decimal places** by rewriting 9.8 as **9.80**.
**Compare digit by digit** from left to right:
- **Units place**: Both have **9** (equal).
- **Tenths place**: **8** (in 9.80) vs. **1** (in 9.11). Since \(8 > 1\), **9.80 is greater** at this point.
- No need to compare hundredths place once a difference is found.
**Conclusion**: \(9.8\) (or \(9.80\)) is greater than \(9.11\) because **8 tenths** (\(0.8\)) is larger than **1 tenth + 1 hundredth** (\(0.11\)). \(\boxed{9.8 \text{ is greater}}\)

Completion Info:
Finish reason: stop
Prompt tokens: 21
Completion tokens: 1162
Total tokens: 1183

The same question run against Llama 3.1 8B (not a reasoning model):

Answer: To determine which is greater, we need to compare the two numbers. 9.11 is greater than 9.8 because 9.11 is a larger decimal value.

Completion Info:
Finish reason: stop
Prompt tokens: 51
Completion tokens: 37
Total tokens: 88

Throughput and latency results

Cost comparison

In this cost analysis we use the ND-H100-v5 and ND-H200-v5 pay-as-you-go pricing in the South Central US region and the measured total throughput (tokens/sec) to compute the cost in $/(1K tokens). Note: ND-H200-v5 pricing was estimated at 20% more than ND-H100-v5 pricing.

Analysis

DeepSeek R1 is a large, complex reasoning model that is costlier and slower than smaller models. It needs 16 H100 GPUs at FP8 precision and generates many more intermediate tokens in its chain-of-thought process (about 31 times more than Llama 3.1 8B in the example above), but at a much slower rate (~54 times slower). Its latency is also higher, with time to first token (TTFT) and inter-token latency (ITL) roughly 6 and 3 times higher, respectively. The DeepSeek R1 model has small intranode and internode network requirements: ~14% of the available InfiniBand bandwidth and < 1% of the available NVLink bandwidth were used. GPUs with higher memory bandwidth and higher FLOPS would help improve its performance.
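The $/(1K tokens) metric described in the cost comparison above is simple arithmetic on an hourly VM price and a measured throughput. The sketch below is a minimal illustration, not the authors' script: the function names are made up for this example, and the price and throughput values in the demo are placeholders chosen so the result is easy to check, not measured or published figures. Only the 20% H200 uplift and the 1162 vs. 37 completion-token counts come from the text.

```python
def cost_per_1k_tokens(price_per_vm_hour, tokens_per_sec, num_vms=1):
    """Return the cost in dollars to generate 1,000 tokens.

    price_per_vm_hour: pay-as-you-go price of one VM in $/hour (placeholder input).
    tokens_per_sec: measured total throughput of the whole deployment.
    num_vms: number of VMs serving the model (2 for DeepSeek R1 on ND-H100-v5).
    """
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_vm_hour * num_vms * 1000 / tokens_per_hour


def estimate_h200_price(h100_price_per_hour, uplift=0.20):
    """Estimate ND-H200-v5 pricing as ND-H100-v5 pricing plus 20%, as stated in the post."""
    return h100_price_per_hour * (1 + uplift)


# Ratio of completion tokens in the 9.11 vs. 9.8 example (DeepSeek R1 vs. Llama 3.1 8B).
token_ratio = 1162 / 37

if __name__ == "__main__":
    # Placeholder numbers: $3600/hour at 1000 tokens/sec is $1 per 1K tokens.
    print(cost_per_1k_tokens(3600.0, 1000.0))
    print(round(token_ratio, 1))
```

Because the metric divides by total throughput, a model that needs twice as many VMs and also generates tokens more slowly is penalized on both factors, which is why the cost gap between the two models is so large.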
The cost analysis shows that the cost to generate DeepSeek R1 tokens is ~54 times higher than for Llama 3.1 8B on the same 16 H100 GPUs and ~34 times higher on 8 H200 GPUs. The DeepSeek R1 model is very capable, but due to its higher TCO it should only be used in specific AI applications that require its strong reasoning abilities.

Conclusion

The DeepSeek R1 model demonstrates exceptional reasoning capabilities, but its deployment demands substantial infrastructure and incurs high latency and cost. While it excels in generating detailed chains of thought, its throughput and efficiency lag significantly behind smaller models like Llama 3.1 8B. For applications requiring deep logical analysis, DeepSeek R1 is a powerful tool. However, for general-purpose inference tasks, more lightweight models offer better performance and cost-effectiveness. Strategic use of DeepSeek R1 should be reserved for scenarios where its advanced reasoning justifies the resource investment.

References

DeepSeek R1 model on Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-R1
vLLM GitHub repository: https://github.com/vllm-project/vllm
Azure ND H100 v5 documentation: https://learn.microsoft.com/en-us/azure/virtual-machines/nd-h100-v5-series
FlashInfer GitHub repository: https://github.com/flashinfer-ai/flashinfer
DeepGEMM GitHub repository: https://github.com/deepseek-ai/DeepGEMM
AI-MO validation dataset on Hugging Face: https://huggingface.co/datasets/AI-MO/aimo-validation-aime

Appendix

Install vLLM

curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv myvllm --python 3.11 --seed
source myvllm/bin/activate
uv pip install vllm --torch-backend=auto
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
uv pip install ninja
cd flashinfer
uv pip install --no-build-isolation --verbose .

Install DeepSeek DeepEP

git clone https://github.com/vllm-project/vllm.git
cd ~/vllm/tools/ep_kernels
export CUDA_HOME=/usr/local/cuda-12.8
# TORCH_CUDA_ARCH_LIST="9.0" targets Hopper
TORCH_CUDA_ARCH_LIST="9.0" bash install_python_libraries.sh 2>&1 | tee install_python_libraries.log_$$
sudo bash configure_system_drivers.sh 2>&1 | tee configure_system_drivers.log_$$
sudo reboot

Install DeepSeek DeepGEMM

git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
./install.sh 2>&1 | tee install.log_$$

Configure DeepSeek R1 with vLLM on 2 ND_H100_v5

Second node configuration

Execute this script on the second node before running the script on the primary node.

#!/bin/bash
MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000
export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0
vllm serve $MODEL \
  --port $PORT \
  --tensor-parallel-size 1 \
  --enable-expert-parallel \
  --data-parallel-size 16 \
  --data-parallel-size-local 8 \
  --data-parallel-start-rank 8 \
  --data-parallel-address 10.0.0.6 \
  --data-parallel-rpc-port 23345 \
  --headless \
  --max-model-len 32768 \
  --reasoning-parser deepseek_r1

Primary node configuration

#!/bin/bash
MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000
export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0
vllm serve $MODEL \
  --port $PORT \
  --tensor-parallel-size 1 \
  --enable-expert-parallel \
  --data-parallel-size 16 \
  --data-parallel-size-local 8 \
  --data-parallel-address 10.0.0.6 \
  --data-parallel-rpc-port 23345 \
  --api-server-count 1 \
  --max-model-len 32768 \
  --reasoning-parser deepseek_r1

Install vLLM benchmark environment

cd vllm
uv pip install vllm[bench]

Run vLLM benchmark

#!/bin/bash
vllm bench serve \
  --backend vllm \
  --model deepseek-ai/DeepSeek-R1 \
  --endpoint /v1/completions \
  --dataset-name hf \
  --dataset-path AI-MO/aimo-validation-aime \
  --ramp-up-strategy linear \
  --ramp-up-start-rps 1 \
  --ramp-up-end-rps 10 \
  --num-prompts 400 \
  --seed 42

Teamcenter Simulation Process Data Management Architecture on Azure CycleCloud - Slurm cluster
Introduction:

Many customers run multiple Teamcenter-SPDM solutions across the enterprise, mixing multiple instances, multiple ISV vendors, and hybrid cloud/on-prem implementations. This fragmentation reduces the customer’s ability to uniformly access data. Consolidating Teamcenter-SPDM on Azure can speed the shift to one consistent, harmonized PLM experience, enterprise wide.

What is Teamcenter Simulation?

Teamcenter Simulation integrates simulation data, processes, and results into the broader PLM (Product Lifecycle Management) environment. Instead of engineers running simulations in silos on local drives, it provides:
- A single source of truth for CAD, simulation models, inputs, and results.
- Traceability across design, analysis, and manufacturing.
- Support for multi-CAD, multi-CAE tools (e.g., NX Nastran, ANSYS, Abaqus, Star-CCM+).

Primary benefit

Teamcenter Simulation SPDM gives you full traceability from source to solution. SPDM is a single source of truth in which the CAE analysis of a product design is related to the corresponding item in the original CAD. This relationship between CAD and simulation data is the key to determining which CAD revision is captured in a particular CAE analysis.

Architecture:

The Siemens Teamcenter SPDM baseline architecture has two major connected blocks:
- Teamcenter PLM core deployment
- StarCCM deployed on an HPC CycleCloud Slurm Workspace

Teamcenter PLM Core Deployment: It has four distributed tiers (client, web, enterprise, and resource) in a single availability zone. Each tier aligns to a function, and communication flows between these tiers. All four tiers use their own virtual machines in a single virtual network. Teamcenter Simulation (also known as CAE Manager), the core business functionality of SPDM, runs on a central server in the enterprise tier, and users access it through a web-based or thick-client interface.
You can deploy multiple instances in Dev and Test environments by adding extra virtual machines and storage on virtual networks separate from production virtual networks.

StarCCM HPC CycleCloud Slurm cluster architecture: Siemens StarCCM simulation software is deployed on the Azure CycleCloud HPC scheduler node. The CAE analyst launches simulation jobs from the Teamcenter Active Workspace or Rich Client UI. Azure HPC then spins up HPC nodes, which process the jobs submitted by the CAE analyst based on the runtime parameters. StarCCM completes the simulation iterations and generates a .sim output file.

Workflow

- CAE analysts, SPDM users, and Teamcenter users access the Teamcenter application via a public HTTPS endpoint URL. Users access the application through two user interfaces: (1) a Rich Client and (2) an Active Workspace client. CAE engineers and simulation analysts access Teamcenter through the Teamcenter Simulation client, a lightweight thin client that runs on the user’s desktop.
- User access is authenticated via the company’s Azure Entra ID. Azure Entra ID with SAML configuration allows single sign-on (SSO) to the Teamcenter application.
- Azure Firewall is a security component on the Azure backbone that filters traffic using threat intelligence feeds directly from Microsoft Cyber Security. HTTPS traffic is directed to the Azure Application Gateway. The hub virtual network and spoke virtual network are peered so they can communicate over the Azure backbone network.
- Azure Application Gateway routes traffic to Teamcenter’s web server virtual machines (VMs) in the Web tier.
- Siemens PLM Teamcenter deployment on Azure: for detailed information about the Teamcenter architecture on Azure, refer to this URL.
- The Teamcenter Simulation client runs on the Teamcenter user’s desktop. CAE Manager is deployed as an integral part of the Teamcenter package.
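The hand-off described in the StarCCM HPC architecture above (Teamcenter produces a job file, Slurm runs it on CycleCloud nodes) can be pictured with a small sketch. This is an illustration under stated assumptions, not the Teamcenter integration itself: the `StarccmJob` class, the `render_sbatch_script` function, the partition name, and the file paths are all hypothetical, and the `starccm+ -batch -np` invocation is a generic batch-mode call that leaves out site-specific licensing and multi-node (machine file) options.

```python
from dataclasses import dataclass


@dataclass
class StarccmJob:
    """Hypothetical description of one simulation job handed over by Teamcenter."""
    job_name: str
    sim_file: str            # path to the .sim file on shared storage (assumed)
    nodes: int = 1
    tasks_per_node: int = 8
    partition: str = "hpc"   # assumed Slurm partition name


def render_sbatch_script(job: StarccmJob) -> str:
    """Render a Slurm batch script that runs Star-CCM+ in batch mode."""
    total_tasks = job.nodes * job.tasks_per_node
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.job_name}",
        f"#SBATCH --partition={job.partition}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --ntasks-per-node={job.tasks_per_node}",
        f"#SBATCH --output={job.job_name}.%j.log",
        "",
        "# Assumed invocation; license and multi-node options are site-specific and omitted.",
        f"starccm+ -batch -np {total_tasks} {job.sim_file}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch_script(StarccmJob(job_name="wing_cfd", sim_file="/shared/wing.sim")))
```

A script rendered this way would be submitted with sbatch and watched with squeue on the scheduler node; CycleCloud autoscaling then starts the compute nodes the job requests.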

Teamcenter Simulation on Azure HPC:

The CAE engineer executes the following typical workflow with the Azure HPC cluster.

Step 1: CAD Data & Product Structures
- CAD models (e.g., from NX, CATIA, SolidWorks) are managed in Teamcenter.
- The simulation engineer links simulation models directly to Teamcenter product structures.
- This ensures the simulation always uses the latest or correct version of the design.

Step 2: Build Simulation Model (Pre-processing)
- Simulation templates define the solver type (FEA, CFD, Multiphysics) and required inputs.
- Engineers use tools like NX CAE, Simcenter 3D, ANSYS, Abaqus, or Star-CCM+ integrated with Teamcenter.
- Meshes, boundary conditions, loads, and materials are associated with the correct design revision.

Step 3: Manage Simulation Data
- All input decks, scripts, and models are stored in Teamcenter for version control.
- Metadata (e.g., load case, solver settings) is captured for searchability and re-use.
- Process automation is supported: simulation workflows can be pre-configured for repeatable tasks.

Step 4: Run Simulation Jobs (Enhanced with Azure CycleCloud Benefits)
- Jobs are submitted to local HPC clusters or cloud HPC (Azure CycleCloud) directly from Teamcenter.
- Teamcenter stores solver logs, job status, and output files.

The following diagrams show the end-to-end workflow: Teamcenter CAE Manager → StarCCM → HPC cluster → simulation processing → .sim file → .sim file back to Teamcenter. Figures, in order:
- Teamcenter CAE Manager → StarCCM running on the HPC cluster
- Teamcenter generates the job file on the HPC node
- HPC cluster creating HPC nodes
- squeue monitoring on the HPC node
- Job monitoring in the Teamcenter UI
- Simulation output file generated by the sbatch job
- File copied over to the Teamcenter shared file location

Step 5: Post-processing & Results Management
- Results are imported back into Teamcenter: stress plots, temperature distributions, flow fields, etc.
- Visualization is available via Simcenter 3D, JT format (lightweight 3D), or web-based viewers.
- Results are tied back to design versions, simulation setup, and load cases.
- This creates a traceable digital thread from requirements → design → simulation → results.

Step 6: Review, Sign-off, and Collaboration
- Results are shared with design, manufacturing, and management teams in Teamcenter.
- Review workflows, e-signatures, and approvals are integrated into PLM processes.
- Simulation results influence design changes and product validation reports.

Azure CycleCloud adds several key advantages:
- On-demand scaling: Automatically provisions Azure compute nodes when workloads spike, then scales down when jobs complete to reduce costs.
- HPC Slurm scheduler integration: Supports popular schedulers like Slurm, enabling smooth job submission from Teamcenter.
- Multi-VM sizes & GPU support: Allows selecting the right mix of CPU/GPU VMs for different simulation workloads (e.g., CFD, FEA, ML-driven simulations).
- Hybrid flexibility: Combine on-prem HPC with Azure bursting to handle peak demand without over-provisioning local hardware.
- Cost governance: Built-in cost controls, job quotas, and reporting to track simulation expenses.
- Security & compliance: Leverages Azure security, VNet isolation, and role-based access control for simulation data and compute resources.
- Integration with Azure Storage: Simplifies access to input/output files using Azure Blob, Azure NetApp Files, or Lustre for HPC-grade throughput.

Conclusion:

Siemens Teamcenter SPDM, when deployed on Azure HPC CycleCloud Workspaces, delivers a scalable and high-performance simulation data management solution. The integration with Azure CycleCloud enables dynamic provisioning of compute resources, allowing simulation workloads to scale elastically based on demand. This ensures optimal resource utilization and cost efficiency, especially during peak simulation cycles.
With support for Slurm scheduling, multi-VM configurations, and GPU acceleration, SPDM on HPC CycleCloud Workspaces (CCWs) empowers engineering teams to run complex simulations faster and more reliably. The architecture’s hybrid flexibility, which combines on-premises capacity with cloud bursting, further enhances throughput without overcommitting infrastructure, making it a robust foundation for enterprise-wide digital thread and product validation workflows.

Inference performance of Llama 3.1 8B using vLLM across various GPUs and CPUs
Introduction

Following our previous evaluation of Llama 3.1 8B inference performance on Azure’s ND-H100-v5 infrastructure using vLLM, this report broadens the scope to compare inference performance across a range of GPU and CPU platforms. Using the Hugging Face inference benchmarker, we assess not only throughput and latency but also the cost-efficiency of each configuration, an increasingly critical factor for enterprise deployment.

As organizations seek scalable and budget-conscious solutions for deploying large language models (LLMs), understanding the trade-offs between compute-bound and memory-bound stages of inference becomes essential. Smaller models like Llama 3.1 8B offer a compelling balance between capability and resource demand, but the underlying hardware and software stack can dramatically influence both performance and operational cost.

This report presents a comparative analysis of inference performance across multiple hardware platforms, factoring in:
- Token throughput and latency across chat, classification, and code generation workloads.
- Resource utilization, including KV cache utilization and efficiency.
- Cost per token, derived from cloud pricing models and hardware utilization metrics.

By combining performance metrics with cost analysis, we aim to identify the most effective deployment strategies for enterprise-grade LLMs, whether optimizing for speed, scalability, or budget.

Benchmark environment

Inference benchmark

The Hugging Face inference benchmarking code was used for the AI inference benchmark. Three popular AI inference profiles were examined:
- Chat: Probably the most common use case, a question-and-answer format on a wide range of topics.
- Classification: Providing various documents and requesting a summary of their contents.
- Code generation: Providing code and requesting code generation, e.g. create a new function.

| Profile | Data set | Input prompt | Output prompt |
| Chat | hlarcher/inference-benchmarker/share_gpt_turns.json | N/A | min=50, max=800, variance=100 |
| Classification | hlarcher/inference-benchmarker/classification.json | min=8000, max=12000, variance=5000 | min=30, max=80, variance=10 |
| Code generation | hlarcher/inference-benchmarker/github_code.json | min=3000, max=6000, variance=1000 | min=30, max=80, variance=10 |

| Hugging Face Llama 3.1 8B models used | Precision | Model size (GiB) |
| meta-llama/Llama-3.1-8B-Instruct | FP16 | 14.9 |

| vLLM parameters | Default value |
| gpu_memory_utilization | 0.9 |
| max_num_seqs | 1024 |
| max_num_batched_tokens | 2048 (A100), 8192 (H100, H200) |
| enable_chunked_prefill | True |
| enable_prefix_caching | True |

VM Configuration

GPU: ND-H100-v5, ND-H200-v5, and ND-A100-v4 (A100 80GB and 40GB) running HPC Ubuntu 22.04 (PyTorch 2.7.0+cu128, GPU driver 535.161.08, and NCCL 2.21.5-1). 1 GPU was used in the benchmark tests.

CPU: Ubuntu 22.04 (HPC and Canonical/jammy)

Results

| GPU | Profile | Avg prompt throughput | Avg generation throughput | Max # requests waiting | Max KV cache usage % | Avg KV cache hit rate % |
| H100 | Chat | ~2667 | ~6067 | 0 | ~14% | ~75% |
| H100 | Classification | ~254149 | ~1291 | 0 | ~46% | ~98% |
| H100 | Code generation | ~22269 | ~266 | ~111 | ~93% | ~1% |
| H200 | Chat | ~3271 | ~7464 | 0 | ~2% | ~77% |
| H200 | Classification | ~337301 | ~1635 | 0 | ~24% | ~99% |
| H200 | Code generation | ~22726 | ~274 | ~57 | ~46% | ~1% |
| A100 | Chat | ~1177 | ~2622 | 0 | ~2% | ~75% |
| A100 | Classification | ~64526 | ~333 | 0 | ~45% | ~97% |
| A100 | Code generation | ~7926 | ~95 | ~106 | ~21% | ~1% |
| A100_40G | Chat | ~1069 | ~2459 | 0 | ~27% | ~75% |
| A100_40G | Classification | ~7846 | ~39 | ~116 | ~68% | ~5% |
| A100_40G | Code generation | ~7836 | ~94 | ~123 | ~66% | ~1% |

Cost analysis

The cost analysis used pay-as-you-go pricing for the South Central US region and the measured throughput in tokens per second to calculate the metric $/(1K tokens).

CPU performance and takeaways

The Hugging Face AI-MO/aimo-validation-aime dataset was used with vllm bench to test the performance of Llama 3.1 8B on various VM types (left graph below).
Running Llama 3.1 8B on CPU VMs is a struggle (insufficient FLOPs and memory bandwidth): even the best-performing CPU VM (HB176-96_v4) is significantly slower in throughput and latency than the A100_40GB GPU.

Tips
- Enable and use AVX512 (avx512f, avx512_bf16, avx512_vnni, etc.). Check what is supported with lscpu.
- Put the AI model on a single socket (if it has sufficient memory). For larger models you can use tensor parallelism to split the model across sockets.
- Use pinning to specify which cores the threads will run on (in vLLM, VLLM_CPU_OMP_THREADS_BIND=0-22).
- Specify a large enough KV cache in CPU memory (in vLLM, VLLM_CPU_KVCACHE_SPACE=100).

Analysis

Throughput & Latency
- H200 outperforms all other GPUs across all workloads, with the highest prompt and generation throughput.
- H100 is a close second, showing strong performance especially in classification and code generation.
- A100 and A100_40G lag significantly behind, particularly in classification tasks, where throughput drops by an order of magnitude on A100_40G due to its smaller GPU memory and lower KV cache hit percentage.

KV Cache Utilization
- H200 and H100 show efficient cache usage with high hit rates (up to 99%) and few waiting requests. The exception is code generation, which has low hit rates (~1%).
- A100_40G suffers from high KV cache usage and low hit rates, especially in classification and code generation, indicating memory bottlenecks. The strain on the inference server shows up as a higher number of waiting requests.

Cost Efficiency
- Chat profiles: The A100 GPU (40G) offers the best value.
- Classification profiles: The H200 is most cost-effective.
- Code-generation profiles: The H100 provides the greatest cost efficiency.

CPU vs GPU
- Llama 3.1 8B can run on CPU VMs, but the throughput and latency are so poor compared to GPUs that it does not make any practical or financial sense to do so.
- Smaller AI models (<= 1B parameters) may be OK on CPUs for some lightweight inference services (like chat).
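The relative claims in the analysis can be checked directly against the results table. The sketch below copies the average throughput values from that table and computes speedups against a chosen baseline; the `RESULTS` layout and the `speedup` helper are illustrative names introduced here, not part of the benchmark tooling.

```python
# Average throughput values copied from the GPU results table:
# (avg prompt throughput, avg generation throughput) per GPU and profile.
RESULTS = {
    "H100": {"chat": (2667, 6067), "classification": (254149, 1291), "code": (22269, 266)},
    "H200": {"chat": (3271, 7464), "classification": (337301, 1635), "code": (22726, 274)},
    "A100": {"chat": (1177, 2622), "classification": (64526, 333), "code": (7926, 95)},
    "A100_40G": {"chat": (1069, 2459), "classification": (7846, 39), "code": (7836, 94)},
}

METRIC_INDEX = {"prompt": 0, "generation": 1}


def speedup(gpu, baseline, profile, metric="generation"):
    """Return how many times faster `gpu` is than `baseline` for one profile."""
    if metric not in METRIC_INDEX:
        raise ValueError(f"unknown metric: {metric}")
    index = METRIC_INDEX[metric]
    return RESULTS[gpu][profile][index] / RESULTS[baseline][profile][index]


if __name__ == "__main__":
    for profile in ("chat", "classification", "code"):
        for gpu in ("H200", "H100", "A100"):
            ratio = speedup(gpu, "A100_40G", profile)
            print(f"{gpu} vs A100_40G, {profile}: {ratio:.1f}x generation throughput")
```

For example, the 80GB A100 comes out between 8 and 9 times faster than the A100_40G on the classification profile for both prompt and generation throughput, which is the order-of-magnitude drop noted in the analysis.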

Conclusion

The benchmarking results clearly demonstrate that hardware choice significantly impacts the inference performance and cost-efficiency of Llama 3.1 8B deployments. The H200 GPU consistently delivers the highest throughput and cache efficiency across workloads, making it the top performer overall. H100 follows closely, especially excelling in code generation tasks. While A100 and A100_40G offer budget-friendly options for chat workloads, their limitations in memory and cache performance make them less suitable for more demanding tasks. CPU virtual machines do not offer adequate throughput or latency for running AI models comparable in size to Llama 3.1 8B. These insights provide a practical foundation for selecting optimal infrastructure based on inference workload type and cost constraints.

References

Hugging Face Inference Benchmarker: https://github.com/huggingface/inference-benchmarker
Datasets used for benchmarking:
- Chat: hlarcher/inference-benchmarker/share_gpt_turns.json
- Classification: hlarcher/inference-benchmarker/classification.json
- Code generation: hlarcher/inference-benchmarker/github_code.json
Model: meta-llama/Llama-3.1-8B-Instruct on Hugging Face: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
vLLM Inference Engine: https://github.com/vllm-project/vllm
Azure ND-Series GPU Infrastructure: https://learn.microsoft.com/en-us/azure/virtual-machines/nd-series
PyTorch 2.7.0 + CUDA 12.8: https://pytorch.org
NVIDIA GPU Drivers and NCCL (Driver: 535.161.08, NCCL: 2.21.5-1): https://developer.nvidia.com/nccl
Azure Pricing Calculator (South Central US Region): https://azure.microsoft.com/en-us/pricing/calculator
CPU - vLLM

Appendix

Install vLLM on CPU VMs

git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source

Edit the Dockerfiles (copies of vllm_source/docker/Dockerfile.cpu):

cp docker/Dockerfile.cpu docker/Dockerfile_serve.cpu
  (change the last line to: ENTRYPOINT ["/opt/venv/bin/vllm","serve"])
cp docker/Dockerfile.cpu docker/Dockerfile_bench.cpu
  (change the last line to: ENTRYPOINT ["/opt/venv/bin/vllm","bench","serve"])

Build images (enable the AVX512 features that are supported, see lscpu)

docker build -f docker/Dockerfile_serve.cpu \
  --build-arg VLLM_CPU_AVX512BF16=true \
  --build-arg VLLM_CPU_AVX512VNNI=true \
  --build-arg VLLM_CPU_DISABLE_AVX512=false \
  --tag vllm-serve-cpu-env --target vllm-openai .

docker build -f docker/Dockerfile_bench.cpu \
  --build-arg VLLM_CPU_AVX512BF16=true \
  --build-arg VLLM_CPU_AVX512VNNI=true \
  --build-arg VLLM_CPU_DISABLE_AVX512=false \
  --tag vllm-bench-cpu-env --target vllm-openai .

Start vLLM server

Remember to set <SIZE in GiB>, <CPU CORE RANGE>, and <YOUR HF TOKEN>.

docker run --rm --privileged=true --shm-size=8g -p 8000:8000 \
  -e VLLM_CPU_KVCACHE_SPACE=<SIZE in GiB> \
  -e VLLM_CPU_OMP_THREADS_BIND=<CPU CORE RANGE> \
  -e HF_TOKEN=<YOUR HF TOKEN> \
  -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" \
  vllm-serve-cpu-env meta-llama/Llama-3.1-8B-Instruct --port 8000 --dtype=bfloat16

Run vLLM benchmark

Remember to set <YOUR HF TOKEN>.

docker run --rm --privileged=true --shm-size=4g \
  -e HF_TOKEN=<YOUR HF TOKEN> \
  -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" \
  vllm-bench-cpu-env \
  --backend vllm \
  --model=meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/completions \
  --dataset-name hf \
  --dataset-path AI-MO/aimo-validation-aime \
  --ramp-up-strategy linear \
  --ramp-up-start-rps 1 \
  --ramp-up-end-rps 2 \
  --num-prompts 200 \
  --seed 42 \
  --host 10.0.0.4

Ansys Minerva Simulation & Process Data Management Architecture on Azure
Architecture

The Ansys Minerva baseline architecture has four distributed tiers (client, web, enterprise, and resource) in a single Azure availability zone. Each tier aligns to a function, and communication flows between these tiers. All four tiers use their own virtual machines in a single virtual network. The Minerva core business functionality runs on a central core server in the enterprise tier, and users access it through a web-based URL client. You can deploy multiple instances in Dev and Test environments on virtual machines and storage on Dev/Test virtual networks separate from production virtual networks.

Workflow

- SPDM users access the Minerva application via a public HTTPS endpoint URL. Users access the application through the web URL over the internet.
- Azure Entra ID with SAML configuration allows single sign-on authentication to the Minerva application. The user is authenticated using a Minerva credential that a Minerva administrator creates in Minerva.
- Azure Firewall is an Azure backbone component that filters traffic using threat intelligence feeds directly from Microsoft Cyber Security. HTTPS traffic is directed to the Azure Application Gateway. The hub virtual network and spoke virtual network are peered so they can communicate over the Azure backbone network.
- Azure Application Gateway routes traffic to Minerva’s web server virtual machines (VMs) in the Web tier. Azure Application Gateway with Web Application Firewall inspects incoming HTTP traffic to continuously monitor Minerva against exploits. It integrates with other Azure services (App Service, VMSS, AKS, etc.), making it easier to build cloud-native solutions. Application Gateway also supports sticky sessions for applications that require session persistence.
- Web tier subnet: Users access the core component of Minerva via the Web tier, which runs an IIS application server. To ensure consistent and reliable performance for your application, all virtual machines should have the recommended VM size and disk configuration.
Depending on your needs, you may want to use HPC (High Performance Computing) VM SKUs. Make sure all VM instances are created from the same base OS image and configuration.
- The Enterprise subnet runs the core Minerva components. Individual user access is granted based on valid Minerva and Aras Innovator feature licenses. These feature licenses are separate from the Aras Innovator server licenses. Enterprise tier VMs run the core business logic components of Minerva: the Minerva Simulation Product Data Management core server, Agent server, Vault server, metadata extraction server, and license servers.
- Core components: Minerva’s central processing server is an IIS application server. The Agent server runs the agent services that are responsible for various platform orchestration activities. All core components must be deployed in an Azure proximity placement group to minimize latency.
- Distributed components: Vault server and metadata extraction server. The Vault server stores the files and is paired with other servers dedicated to metadata extraction. An IIS web server acts as a front end to the file repository. There can be any number of Data Vaults distributed throughout the organization, based on specific needs and criteria, and all Vaults communicate with the centralized core components. The scope of the Minerva Vault server can be expanded to interact with any HPC cluster.
- Extraction server: Metadata extraction is very memory-, processor-, and disk-intensive, and potentially opens large files. Sufficient Azure virtual machine and storage capacity is required for this activity. The SKU recommendation is given below.
- MS SQL Server: You can deploy the SQL Server Standard or Enterprise edition based on your company’s requirements. The Minerva SQL Server stores metadata objects only; no binary files are stored in the database.
- The Database subnet runs a SQL Server database using an infrastructure-as-a-service deployment.
It uses SQL Server Always On availability groups for asynchronous replication. A Minerva deployment can alternatively run an Oracle Database server on this IaaS deployment.
The Storage subnet uses Azure Files Premium and/or Azure NetApp Files.
The on-premises network allows the customer support team and system administrators to connect to Azure over an Azure VPN connection and to reach any virtual machine instance via Remote Desktop Protocol (RDP) through Azure Bastion.

Minerva Core Component & Vault reliability
Use multiple VMs in the Web tier. To enhance the resiliency and scalability of the Ansys Minerva application running on Azure, distribute the four logical tiers across multiple virtual machines. Run multiple parallel web servers for load balancing, increased reliability, or both.
Use multiple VMs in the Enterprise tier. Install the Enterprise tier on multiple Azure virtual machines. This setup provides failover support and enables load balancing to optimize performance. Application Gateway load balances between the web server VMs in the Web subnet. Distributing software functions over a network gives the application high availability and improves overall system reliability, which is particularly beneficial for production environments where uninterrupted operation and efficient resource utilization are crucial. With the workload distributed across multiple virtual machines, the Minerva application can handle increased demand and provide a robust and responsive user experience. Following this recommended architecture lets you use the scalability and resilience capabilities of Azure to optimize the performance of the Ansys Minerva application and helps ensure uninterrupted access to critical product lifecycle management functionality.

Resource tier reliability
Configure database backups. For SQL Server, one approach is to use Azure Backup with a Recovery Services vault to back up SQL Server databases that run on VMs.
With this solution, you can perform most of the key backup management operations without being limited to the scope of an individual vault. For more information on Oracle, see Oracle Database in Azure Virtual Machines backup strategies. Use the native backup utility. It’s recommended to use the Azure backups. When performing server-level backups, you should avoid backing up the active database files directly. This is because the backup may not capture the complete state of the database files at the time of backup. Instead, server-level backups should focus on backing up the backup file generated by using the database backup utility. This approach ensures a more reliable and consistent backup of the application's database. By following this recommendation, you can effectively protect the integrity and availability of your Minerva application data, safeguarding critical information and enabling efficient recovery in case of any unforeseen issues or data loss. Configure volume backups. Azure Files provides the capability to take snapshots of file shares, creating point-in-time, read-only copies of your data. By using Azure Files or Azure NetApp Files snapshots, establish a general-purpose backup solution that safeguards against accidental deletions or unintended changes to the data. For the Minerva volume server, use File volume backups. This configuration ensures effective backup of the data stored in the volume server, enabling easy recovery in case of data loss or system failures. Implementing these recommendations enhances the data protection and resilience of the Minerva application, mitigating the risks associated with data loss or unauthorized modifications. Test database and storage backups. You should carefully plan, document, and test the backup and recovery strategy for the Minerva database and file manager servers. Configure backup frequency. Determine backup needs based on business requirements, considering the increasing number of users. 
A daily backup may not be sufficient for optimal protection, so adjust the frequency accordingly.
Coordinate volume data with database backups. Ensure that backups for the volume servers are coordinated with database backups. This allows you to sync the actual files with the file metadata.
Enhance database reliability. Provision SQL Server VMs in availability sets to improve database reliability. Availability sets deploy virtual machines across fault domains and update domains, mitigating downtime events within the datacenter. Create the availability set during VM provisioning. Additionally, consider replicating Azure storage across different Azure datacenters for additional redundancy. For Oracle databases, Azure offers availability zones and availability sets. You should only use availability sets in regions where availability zones are unavailable. In addition to Azure tools, Oracle provides the Oracle Data Guard and GoldenGate solutions.
Use an Always On availability group. Configure the database server with an "Always On" availability group for SQL Server on Azure Virtual Machines. This option uses the underlying Windows Server Failover Clustering (WSFC) service and helps ensure high availability. For more information, see Overview of SQL Server Always On availability groups and Windows Server Failover Clustering (WSFC).

Security
Azure Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.

Recommended SKUs for Minerva to run on Azure
Role of the server    SKU
Core server           Standard_F16s_v2
Agent server          Standard_F8s_v2
License server        Standard_D4d_v5
Extraction server     Standard_F8s_v2
Database servers      Standard_E32-16ds_v4
Volume server         Standard_L32s_v3

Creating a Slurm Job Submission App in Open OnDemand with Copilot Agent
High Performance Computing (HPC) environments are essential for research, engineering, and data-intensive workloads. To efficiently manage compute resources and job submissions, organizations rely on robust scheduling and orchestration tools. In this blog post, we'll explore how to use Copilot Agent in Visual Studio Code (VS Code) to build an Open OnDemand application that submits Slurm jobs in a CycleCloud Workspace for Slurm (CCWS) environment. We'll start with a brief overview of CCWS and Open OnDemand, then dive into the integration workflow.

Performance at Scale: The Role of Interconnects in Azure HPC & AI Infrastructure
Microsoft Azure’s high-performance computing (HPC) & AI infrastructure is designed from the ground up to support the world’s most demanding workloads. High-performance AI workloads are bandwidth-hungry and latency-sensitive. As models scale in size and complexity, the efficiency of the interconnect fabric (how CPUs, GPUs, and storage communicate) becomes a critical factor in overall system performance. Even with the fastest GPUs, poor interconnect design can lead to bottlenecks, underutilized hardware, and extended time-to-results. In this blog post, we will highlight one of the key enabling features for running large-scale distributed workloads on Azure: a highly tuned HPC-class interconnect. Azure has invested years of system-level engineering in the InfiniBand interconnect, turning it into ready-to-use configurations that are available to customers on Azure’s HB-series and N-series virtual machines (VMs).

Using Gromacs through EESSI on NC_A100_v4
In earlier blogs we showed how to use EESSI to get access to highly optimized applications for different CPU architectures, for example: Accessing the EESSI Common Stack of Scientific Software using Azure HPC-on-demand | Microsoft Community Hub. Over the last few months, we have been working with the EESSI team and Inuits (Inuits - Open Source Innovators) to extend the EESSI software stack and to make using optimized codes on NVIDIA GPUs just as simple. The EESSI install script below installs cvmfs and configures the software.eessi.io repository. It also creates a common /opt/eessi/nvidia directory that holds links to the compatible local drivers and CUDA installation, which allows CUDA-compiled software to interact with the NVIDIA GPU. The script needs to be run as root, for example during start-up of the VM:

yum install -y https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest.noarch.rpm
yum install -y cvmfs
yum install -y https://github.com/EESSI/filesystem-layer/releases/download/latest/cvmfs-config-eessi-latest.noarch.rpm
bash -c "echo 'CVMFS_CLIENT_PROFILE="single"' > /etc/cvmfs/default.local"
bash -c "echo 'CVMFS_QUOTA_LIMIT=10000' >> /etc/cvmfs/default.local"
cvmfs_config setup
/cvmfs/software.eessi.io/versions/2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh

Once the VM is prepared, a user can access the EESSI applications by running the source command. After this, not only the optimized CPU-architecture modules are loaded but also the GPU-enabled applications, see below:

$> source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash
$> module avail

As can be seen, GROMACS (https://www.gromacs.org/) is one of the available software packages.
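Before loading modules, it can help to confirm that the setup steps left the expected paths in place. The following is a minimal sketch, not part of EESSI: the function name `missing_eessi_paths` and the choice of paths to check are assumptions made for illustration, with the paths themselves taken from the commands in this post.

```python
from pathlib import Path

# Paths referenced by the setup commands in this post, written relative to
# the filesystem root. Which paths to check is an illustrative choice.
REQUIRED_PATHS = [
    "cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash",
    "opt/eessi/nvidia",
    "etc/cvmfs/default.local",
]


def missing_eessi_paths(root="/"):
    """Return the entries of REQUIRED_PATHS that do not exist under root."""
    base = Path(root)
    return [p for p in REQUIRED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_eessi_paths()
    if missing:
        print("Missing after setup:", ", ".join(missing))
    else:
        print("EESSI GPU setup paths are present")
```

The `root` parameter exists only so the check can be pointed at a test directory; on a prepared VM the default of `/` applies.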
Now it is time to run a model. Carsten Kutzner at the Max Planck Institute provides a set of GROMACS benchmarks that are extremely useful for validation: https://www.mpinat.mpg.de/grubmueller/bench

$> ml avail GROMACS
$> ml load GROMACS/2024.4-foss-2023b-CUDA-12.4.0
$> wget https://www.mpinat.mpg.de/benchPEP-h; unzip benchPEP-h
$> gmx mdrun -s benchPEP-h.tpr -maxh 0.50 -resethway -noconfout -nsteps 100000 -g logfile -pme gpu -update gpu -bonded gpu

During the run, nvidia-smi will show that the GPU is fully utilized:

I hope this shows how accessible NVIDIA GPUs have become for running scientific codes through the EESSI software stack. At the time of writing, only a few CPU-GPU combinations are fully supported; this blog demonstrates the zen3 + cc80 combination of the NCas_A100_v4. More information can be found on the EESSI blog: GPU Support in EESSI: From Zero to Science in Seconds - European Environment for Scientific Software Installations (EESSI)

Benchmark Different Capacities for EDA Workloads on Microsoft HPC Storages
Overview Semiconductor (or Electronic Design Automation [EDA]) companies prioritize reducing time to market (TTM), which depends on how quickly tasks such as chip design validation and pre-foundry work can be completed. Faster TTM also helps save on EDA licensing costs, as less time spent on work means more time available for the licenses. To achieve shorter TTM, storage solutions are crucial. As illustrated in the article “Benefits of using Azure NetApp Files for Electronic Design Automation (EDA)” (1*), with Large Volume feature, which requires a minimum size of 50TB, Azure NetApp Files can be boosted to reach up to 652,260 I/O rate at 2ms latency, and 826,379 at performance edge (~7 ms) for one Large Volume. Objective In real-world production, EDA files—such as tools, libraries, temporary files, and output—are usually stored in different volumes with varying capacities. Not every EDA job needs extremely high I/O rates or throughput. Additionally, cost is a key consideration, since larger volumes are more expensive. The objective of this article is to share benchmark results for different storage volume sizes: 50TB, 100TB, and 500TB, all using the Large Volume features. We also included a 32TB case—where Large Volume features aren't available on ANF—for comparison with Azure Managed Lustre File System (AMLFS), another Microsoft HPC storage solution. These benchmark results can help customers evaluate their real-world needs, considering factors like capacity size, I/O rate, throughput, and cost. Testing Method EDA workloads are classified into two primary types—Frontend and Backend, each with distinct requirements for the underlying storage and compute infrastructure. Frontend workloads focus on logic design and functional aspects of chip design and consist of thousands of short-duration parallel jobs with an I/O pattern characterized by frequent random reads and writes across millions of small files. 
Backend workloads focus on translating logic design to physical design for manufacturing and consist of hundreds of jobs involving sequential reads and writes of fewer, larger files. Choosing a storage solution that meets this unique mix of frontend and backend workload patterns is non-trivial. Frontend and backend EDA workloads are very demanding on storage solutions: standard industry benchmarks indicate a high I/O profile for the workloads described above, including a substantial amount of NFS access, lookup, create, getattr, link, and unlink operations, as well as small and large file read and write operations. This blog contains the output from performance testing with an industry standard benchmark for EDA. For this particular workload, the benchmark represents the I/O blend typical of a company running both frontend and backend EDA workloads in parallel.

Testing Environment
We used 10 E64dsv5 client VMs connected to a single ANF or AMLFS volume, with the nconnect mount option (for ANF), to ensure that enough load was generated for the benchmark. The client VM tuning and configuration are the same as specified in (1*).
ANF mount options: nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,noatime,nconnect=8
AMLFS mount: sudo mount -t lustre -o noatime,flock
All resources reside in the same VNET and, when possible, the same Proximity Placement Group to ensure low network latency.
Figure 1. High level architecture of the testing environment

Benchmark Results
EDA jobs are highly latency sensitive. For today’s more complex chip designs, 2 milliseconds of latency per EDA operation is generally seen as the ideal target, while the performance edge is around 7 milliseconds. We list the I/O rates achieved at both latency points for easier reference. Throughput (in MB/s) is also included, as it is essential for many backend tasks and the output phase (Figure 2, Figure 3, Figure 4, and Table 1).
For cases where the Large Volume feature is enabled, we observe the following:
100TB with the Ultra tier and 500TB with the Standard, Premium, or Ultra tier can reach an I/O rate of over 640,000 at 2ms latency. This is consistent with the 652,260 stated in (1*). The Ultra 500TB volume can even reach an I/O rate of 705,500 at 2ms latency.
For workloads that do not need as high an I/O rate, either 50TB with the Ultra tier or 100TB with the Premium tier can reach an I/O rate of 500,000.
For even smaller jobs, 50TB with the Premium tier can reach 255,000 and is less expensive.
For scenarios where throughput is critical, 500TB with the Standard, Premium, or Ultra tier can all reach 10 to 12 GB/s of throughput.
Figure 2. Latency vs. I/O rate: Azure NetApp Files, one Large Volume
Figure 3. Achieved I/O rate at 2ms latency & performance edge (~7ms): Azure NetApp Files, one Large Volume
Figure 4. Achieved throughput (MB/s) at 2ms latency & performance edge (~7ms): Azure NetApp Files, one Large Volume
Table 1. Achieved I/O rate and throughput at both latency points: Azure NetApp Files, one Large Volume
For cases with less than 50TB of capacity, where the Large Volume feature is not available for ANF, we included Azure Managed Lustre File System (AMLFS) for comparison. With the same 32TB volume size, a regular ANF volume achieves an I/O rate of about 90,000 at 2ms latency, while an AMLFS Ultra volume (500 MB/s/TiB) can reach roughly double that, around 195,000. This shows that AMLFS is the better choice for performance when the Large Volume feature isn't available on ANF (Figure 5).
Figure 5. Achieved I/O rate at 2ms latency: ANF regular volume vs. AMLFS

Summary
This article shared benchmark results for different storage capacities needed for EDA workloads, including 50TB, 100TB, and 500TB volumes with the Large Volume feature enabled. It also compared a 32TB volume, where the Large Volume feature isn’t available on ANF, to Azure Managed Lustre File System (AMLFS), another Microsoft HPC storage option.
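To make the trade-off between capacity and I/O rate concrete, the I/O rates quoted in this article can be put in a small lookup. The sketch below is illustrative only: the function name `smallest_config` is hypothetical, the "over 640,000" results are stored as 640,000, and ranking by capacity first and tier second is an assumed stand-in for cost, not actual Azure pricing.

```python
# I/O rates at 2 ms latency as quoted in this article for one ANF Large Volume.
# Values described as "over 640,000" are stored as 640,000 (a lower bound).
ANF_LARGE_VOLUME_RESULTS = [
    # (capacity_tb, tier, io_rate_at_2ms)
    (50, "Premium", 255_000),
    (50, "Ultra", 500_000),
    (100, "Premium", 500_000),
    (100, "Ultra", 640_000),
    (500, "Standard", 640_000),
    (500, "Premium", 640_000),
    (500, "Ultra", 705_500),
]

# Assumed ordering used as a rough cost proxy; not based on a price list.
TIER_ORDER = ["Standard", "Premium", "Ultra"]


def smallest_config(required_io_rate):
    """Return the smallest benchmarked configuration meeting the target, or None."""
    candidates = [r for r in ANF_LARGE_VOLUME_RESULTS if r[2] >= required_io_rate]
    if not candidates:
        return None
    return min(candidates, key=lambda r: (r[0], TIER_ORDER.index(r[1])))


print(smallest_config(200_000))  # (50, 'Premium', 255000)
print(smallest_config(600_000))  # (100, 'Ultra', 640000)
print(smallest_config(800_000))  # None
```

A real sizing exercise would replace the capacity-then-tier ranking with current pricing and would also weigh throughput, which this lookup ignores.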
These results can help customers choose or design storage that best fits their needs by balancing capacity, I/O rate, throughput, and cost. With the Large Volume feature, 100TB Ultra and 500TB Standard, Premium, or Ultra tiers can achieve an I/O rate of over 640,000 at 2ms latency. For jobs that need less I/O, 50TB Ultra or 100TB Premium can reach 500,000, while 50TB Premium offers 255,000 at a lower cost. When throughput matters most, 500TB volumes across all tiers can deliver 10 to 12 GB/s. If you have a smaller job or can’t use the Large Volume feature, Azure Managed Lustre File System (AMLFS) gives you better performance than a regular ANF volume. As a final reminder, this article primarily provides benchmark results to help semiconductor customers design their storage solutions, considering capacity size, I/O rate, throughput, and cost. It does not address other criteria, such as heterogeneous integration or legacy compliance, which are also important when selecting an appropriate storage solution.

References
(1*) Benefits of using Azure NetApp Files for Electronic Design Automation (EDA)
Learn more about Azure Managed Lustre

Announcing Azure HBv5 Virtual Machines: A Breakthrough in Memory Bandwidth for HPC
Discover the new Azure HBv5 Virtual Machines, unveiled at Microsoft Ignite, designed for high-performance computing applications. With up to 7 TB/s of memory bandwidth and custom 4th Generation EPYC processors, these VMs are optimized for the most memory-intensive HPC workloads. Sign up for the preview starting in the first half of 2025 and see them in action at Supercomputing 2024 in Atlanta.

Deploying Siemens NX/X on Azure Virtual Desktop: Multi-Session GPU Sharing for CAD Workloads
Workflow
NX users access the NX application deployed on Azure Virtual Desktop through the Remote Desktop application (RDP). Users can access NX either by logging in to the Session Desktop or through a remote application (streaming).
Users’ identities and credentials must be set up in the Siemens Azure subscription to access valid NX X licenses. User access is granted through their company’s Microsoft Entra ID. A Siemens cloud subscription is required for the NX X SaaS application.
When a user connects to a workspace, they authenticate via Entra ID, which enables modern security controls such as Conditional Access policies and multi-factor authentication (MFA). This Microsoft Entra ID integration means you can enforce who can access the virtual desktops and under what conditions (for example, requiring MFA or compliant devices).
The Azure Virtual Desktop control plane manages web access, the gateway, the broker, diagnostics, and extensibility components such as REST APIs.
Azure Virtual Desktop host pools manage the session hosts, application groups, and user assignments.
Session hosts: Session hosts are GPU-enabled Azure virtual machines. The administrator deploys NX on the session hosts using the common NX product image.
Application groups: AVD supports multiple application groups. An application group is a collection of the Session Desktop and the NX remote application, the M360 application, or any other CAD applications.
Workspace: All these application groups should be grouped under a workspace, which is displayed in the Remote Desktop app. In this case, the Session Desktop and the remote NX app appear in the RDP client.
User assignments: CAD users are assigned to the workspace and application groups.
NX CAD files are stored in Azure Files or Azure NetApp Files storage.
User profiles are managed by FSLogix and stored in Azure storage (an Azure file share or Azure NetApp Files).
See the detailed instructions for setting up AVD on multi-session hosts.
How to configure NX/X on Azure Virtual Desktop for Multisession host

Azure Compute Sizing for AVD VMs
For NX deployment and certification, we used NVadsA10v5-series virtual machines. These virtual machines are powered by NVIDIA A10 GPUs and by CPUs with a base frequency of 3.2 GHz and an all-cores peak frequency of 4.0 GHz. With the NVadsA10v5-series, Azure introduced virtual machines with partial NVIDIA GPUs. The following two configurations were used for testing NX.

VM size name              vCPUs (qty.)   Memory (GB)   GPU accelerator
Standard_NV12ads_A10_v5   12             110           1/3
Standard_NV18ads_A10_v5   18             220           1/2

Deployment OS & image: Windows 11 Enterprise multi-session

Remote Desktop Client & Connection:
Users subscribe with their ID in the Remote Desktop application. After subscribing, they can see the Session Desktop and the NX application.
Users can access the Session Desktop with their credentials. A separate desktop session opens, from which they can access NX. Users can create parts and drawings and save them in that session desktop, which stores them on the VM where the user is logged in.
Users can access the NX remote app from the Remote Desktop application after entering their credentials. The NX application opens locally; users can create parts and drawings and save them to the VM from which NX is launched.

Testing & Workloads:
NX/X testing was performed using the CAD industry standard NXCP (NX certification pack) and the ATS test suites provided by Siemens. Tests were run in automatic, interactive, and manual modes. To test the performance of NX on an Azure Virtual Desktop host with multiple users, we followed the methodology below. Considering the NX hardware requirements and the virtual machine SKU capacity, we dedicated 5 users to each AVD session host. Breadth-first load balancing assigns new user sessions to the session host that has the least number of connections, which helps create an even distribution.
Depth-first load balancing fills up one session host before moving on to the next, which can provide efficiency during low-usage periods. For more information, see Configure host pool load balancing in Azure Virtual Desktop. We selected both the breadth-first and depth-first options, for horizontal and vertical assignment of session hosts, to distribute the user load. To check the performance of NX on an AVD host with multiple users, we ran the ATS tests (non-interactive tests) with multiple users at a time.

Test Scenarios: NXCP/ATS & Manual listed here
GPU utilization

Conclusion:
All the listed tests passed successfully in the Azure Virtual Desktop multi-session setup. As a result, NV_A10_v5 through AVD is now certified for multi-session use by the Siemens NX team. You can view the of NX on AVD here.
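The difference between the breadth-first and depth-first assignment described in the testing methodology can be seen in a small simulation. This is a simplified sketch and not the Azure Virtual Desktop implementation: the function names are hypothetical, ties are broken by host order, and the limit of 5 sessions per host follows the 5 users per session host used in the tests.

```python
def pick_breadth_first(sessions, max_sessions):
    """Pick the host with the fewest sessions among hosts that still have room."""
    candidates = [i for i, n in enumerate(sessions) if n < max_sessions]
    if not candidates:
        return None
    return min(candidates, key=lambda i: sessions[i])


def pick_depth_first(sessions, max_sessions):
    """Pick the fullest host that has not yet reached the session limit."""
    candidates = [i for i, n in enumerate(sessions) if n < max_sessions]
    if not candidates:
        return None
    return max(candidates, key=lambda i: sessions[i])


def simulate(n_users, n_hosts, max_sessions, mode):
    """Assign users one at a time and return the session count per host."""
    pick = pick_breadth_first if mode == "breadth" else pick_depth_first
    sessions = [0] * n_hosts
    for _ in range(n_users):
        host = pick(sessions, max_sessions)
        if host is None:  # every host is at the session limit
            break
        sessions[host] += 1
    return sessions


print(simulate(7, 3, 5, "breadth"))  # [3, 2, 2]
print(simulate(7, 3, 5, "depth"))    # [5, 2, 0]
```

With seven users and three hosts, breadth-first spreads the sessions evenly, while depth-first leaves the third host idle, which is why it can be more efficient during low-usage periods.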