virtual machines
Announcing preview of new Azure Dasv7, Easv7, Fasv7-series VMs based on AMD EPYC™ ‘Turin’ processor
Today, Microsoft is announcing the preview of the new Azure AMD-based Virtual Machines (VMs), powered by 5th Generation AMD EPYC™ (Turin) processors. The preview includes general purpose (Dasv7 & Dalsv7 series), memory-optimized (Easv7 series) and compute-optimized (Fasv7, Falsv7, Famsv7 series) VMs, available with and without local disks. These VMs are in preview in the following Azure regions: East US 2, North Europe, and West US 3. To request access to the preview, please fill out the Preview-Signup form.

The latest Azure AMD-based VMs deliver significant enhancements over the previous generation (v6) AMD-based VMs: improved CPU performance, greater scalability, and expanded configuration options to meet the needs of a wide range of workloads. Key improvements include:
- Up to 35% CPU performance improvement compared to equivalently sized (v6) AMD-based VMs.
- Significant performance gains on other workloads: up to 25% for Java-based workloads, up to 65% for in-memory cache applications, up to 80% for crypto workloads, and up to 130% for web server applications.
- Maximum boost CPU frequency of 4.5 GHz, enabling faster operations for compute-intensive workloads.
- Expanded VM sizes: the Dasv7-series, Dalsv7-series and Easv7-series now scale up to 160 vCPUs, and the Fasv7-series supports up to 80 vCPUs, with a new 1-core size.
- Increased memory capacity: the Dasv7-series now offers up to 640 GiB of memory, and the Easv7-series scales up to 1280 GiB, ideal for memory-intensive applications.
- Enhanced remote storage performance: up to 20% higher IOPS and up to 50% greater throughput compared to similarly sized previous generation (v6) VMs.
- New VM families: Fadsv7, Faldsv7, and Famdsv7 are now available with local disk support.
- Expanded constrained-core offerings: new constrained-core sizes for Easv7 and Famsv7, available with and without local disks, helping to optimize licensing costs for core-based software licensing.

These enhancements make the latest VMs a compelling choice for customers seeking high performance, cost efficiency, and workload flexibility on Azure. Additionally, these VMs leverage the latest Azure Boost technology to enhance their performance and security. The new VMs utilize the Microsoft Azure Network Adapter (MANA), a next-generation network interface that provides stable, forward-compatible drivers for Windows and Linux operating systems. These VMs also support the NVMe protocol for both local and remote disks.

The 5th Generation AMD EPYC™ processor family, based on the newest ‘Zen 5’ core, provides enhanced capabilities for these new Azure AMD-based VM series, such as AVX-512 with a full 512-bit data path for vector and floating-point operations, higher memory bandwidth, and improved instructions per clock compared to the previous generation. These updates provide increased throughput and the ability to scale for compute-intensive tasks like AI and machine learning, scientific simulations, and financial analytics, among others. AMD Infinity Guard hardware-based security features, such as Transparent Secure Memory Encryption (TSME), continue in this generation to ensure sensitive information remains secure.

These VMs support three memory (GiB)-to-vCPU ratios: 2:1 (Dalsv7-series, Daldsv7-series, Falsv7-series and Faldsv7-series), 4:1 (Dasv7-series, Dadsv7-series, Fasv7-series and Fadsv7-series), and 8:1 (Easv7-series, Eadsv7-series, Famsv7-series and Famdsv7-series).
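Once a subscription has been approved for the preview, a quick way to confirm that the new size families are visible in a preview region is the Azure CLI. The sketch below is illustrative only: the name filter is an assumption about the final SKU naming, and the sizes will only appear after preview access has been granted.

#!/bin/bash
# List the VM SKUs exposed to this subscription in a preview region and filter for
# the new v7 AMD-based families (the grep pattern is an assumed naming convention).
az vm list-skus \
  --location eastus2 \
  --resource-type virtualMachines \
  --output table | grep -Ei 'as_v7|ads_v7|als_v7|alds_v7|ams_v7|amds_v7'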
The Dalsv7-series VMs are ideal for workloads that require less RAM per vCPU, which can reduce costs when running non-memory-intensive applications, including web servers, video encoding, batch processing and more. The Dasv7-series VMs work well for many general computing workloads, such as e-commerce systems, web front ends, desktop virtualization solutions, customer relationship management applications, entry-level and mid-range databases, application servers, and more. The Easv7-series VMs are ideal for workloads such as memory-intensive enterprise applications, data warehousing, business intelligence, in-memory analytics, and financial transactions. The new Falsv7, Fasv7 and Famsv7 VM series do not use Simultaneous Multithreading (SMT), meaning a vCPU equals a full core, which makes these VMs well-suited for compute-intensive workloads needing the highest CPU performance, such as scientific simulations, financial modeling and risk analysis, gaming, and more. In addition to the standard sizes, the latest VM series are available in constrained-core sizes, with the vCPU count constrained to one-half or one-quarter of the original VM size, giving you the flexibility to select the core and memory configuration that best fits your workloads.

In addition to the new VM capabilities, the previously announced Azure Integrated HSM (Hardware Security Module) will be in preview soon with the latest Azure AMD-based VMs. Azure Integrated HSM is an ephemeral HSM cache that enables secure key management within Azure virtual machines by ensuring that cryptographic keys remain protected inside a FIPS 140-3 Level 3-compliant boundary throughout their lifecycle. To explore this new feature, please sign up using the form provided below.

Use of these latest Azure AMD-based VMs will be charged during preview; pricing information will be shared with access to the VMs. Eligible new Azure customers can sign up for a free account and receive a $200 Azure credit. The new VMs support all remote disk types. To learn more about the disk types and their regional availability, please refer to Azure managed disk type. Disk storage is billed separately from virtual machines.

You can learn more about these latest Azure AMD-based VMs by visiting the specification pages at Dasv7-series, Dadsv7-series, Dalsv7-series, Daldsv7-series, Easv7-series, Eadsv7-series, Fasv7-series, Fadsv7-series, Falsv7-series, Faldsv7-series, Famsv7-series and Famdsv7-series. The latest Azure AMD-based VMs provide options for a wide range of computing needs. Explore the new VMs today and discover how they can enhance your workload performance and lower your costs. To request access to the preview, please fill out the Preview-Signup form. Have questions? Please reach us at Azure Support and our experts will be there to help you with your Azure journey.

Increase security for Azure VMs: Trusted launch in-place upgrade support now available!
Introduction

We’re excited to announce that Trusted launch in-place upgrade support is now available to help you strengthen the security of your Azure virtual machines and scale set resources—without the need for complex migrations or rebuilds. It is:
- Generally available for existing Gen1 & Gen2 virtual machines (VMs), and for Gen1 & Gen2 VM Uniform scale sets
- In private preview for Gen1 & Gen2 VM Flex scale sets

Trusted launch is strongly recommended by Microsoft as the secure path from the Unified Extensible Firmware Interface (UEFI) through the Windows kernel Trusted Boot sequence. It helps prevent bootkit malware in the boot process, ensuring your workloads start in a verified and uncompromised state. Disabling Trusted launch puts your infrastructure at risk of bootkit infections, making this upgrade not just beneficial—but essential. By leveraging in-place upgrade support, you can seamlessly enhance foundational security for your existing virtual machine and scale set resources with Trusted launch at no additional cost, ensuring protection against modern threats and readiness for future compliance needs.

What is Trusted launch?

Trusted launch is a built-in Azure virtual machine and scale set capability that helps protect your virtual machines from advanced threats—right from the moment they start. It adds a layer of foundational security to your VMs by enabling:
- Secure Boot: Prevents unauthorized code like rootkits and bootkits from loading during startup.
- vTPM: Acts as a secure vault for encryption keys and boot measurements, enabling attestation of your VM’s integrity.
- Boot Integrity Monitoring: The guest attestation extension continuously checks that your VM boots into a trusted, uncompromised state.

Trusted launch enhances the security posture of a VM through cryptographic verification and ensures the VM boots to a desired secure state, protecting it from attacks that modify operating system processes. This maintains the trust of the guest OS and adds defense-in-depth. It is essential for maintaining compliance with various regulatory requirements, including Azure Security Benchmark, FedRAMP, Cloud Computing SRG (STIG), HIPAA, PCI-DSS, and others. It’s a simple yet powerful way to enhance the foundational security of your virtual machine and scale set resources—without changing how you deploy or manage your workloads.

Upgrade security of existing VMs and Scale sets to Trusted launch

The following list summarizes the high-level steps associated with the Trusted launch upgrade of Gen1 and Gen2 VMs and Scale sets, with links to the public documentation that contains the detailed steps:
- Gen1 virtual machine: Learn more at Upgrade existing Azure Gen1 VMs to Trusted launch
- Gen2 virtual machine: Learn more at Enable Trusted launch on existing Azure Gen2 VMs
- Virtual machine scale set: Learn more at Upgrade existing Azure Scale set to Trusted launch

Conclusion

We treat the security of our cloud computing platform as a priority, and this change is an important step towards ensuring that Azure VMs provide a more secure environment for your applications and services. Upgrading your Azure VMs and Scale Sets to Trusted launch is a simple yet powerful way to strengthen foundational infrastructure security—without disrupting your existing workloads. With in-place upgrade support now available, you can take advantage of foundational security features like Secure Boot and vTPM to protect against modern threats and meet compliance requirements—all at no additional cost.
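As a concrete illustration of the Gen2 upgrade path described above, the sketch below follows the documented flow of deallocating the VM, switching its security type to Trusted launch, and enabling Secure Boot and vTPM. The resource group and VM names are placeholders; review the linked step-by-step guide (prerequisites, known limitations, and rollback steps) before applying this to production resources.

#!/bin/bash
# Minimal sketch: enable Trusted launch on an existing Gen2 VM (names are placeholders).
RG="my-resource-group"
VM="my-gen2-vm"

# The VM must be stopped/deallocated before its security type can be changed.
az vm deallocate --resource-group "$RG" --name "$VM"

# Switch the security type and turn on Secure Boot and vTPM.
az vm update --resource-group "$RG" --name "$VM" \
  --security-type TrustedLaunch \
  --enable-secure-boot true \
  --enable-vtpm true

az vm start --resource-group "$RG" --name "$VM"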
Next steps

Whether you're running Gen1 (BIOS) or Gen2 (UEFI) VM resources, don’t wait to secure your infrastructure—upgrade your VMs and Scale sets to Trusted launch today. This upgrade can be completed with minimal effort and downtime.
- Upgrade your Gen1 VMs to Trusted launch using generally available upgrade support with the step-by-step guide.
- Upgrade your Gen2 VMs to Trusted launch using generally available upgrade support with the step-by-step guide.
- Upgrade your Gen1 or Gen2 Uniform Scale sets to Trusted launch using generally available upgrade support with the step-by-step guide.
- For Gen1 or Gen2 Flex Scale sets, private preview access is now open – sign up for the preview and get early access to the Trusted launch upgrade experience for Flex scale sets.

Trusted launch is your first line of defense against bootkit malware, and upgrading ensures your VMs meet modern security and compliance standards. Act now to protect your workloads and make them resilient against future threats.

Frequently Asked Questions

Are all upgrade features generally available? The following list summarizes the status of each upgrade feature:
- Gen1 virtual machine: Generally available. Learn more: Upgrade existing Azure Gen1 VMs to Trusted launch
- Gen2-only virtual machine: Generally available. Learn more: Enable Trusted launch on existing Azure Gen2 VMs
- Scale set (Uniform): Generally available. Learn more: Upgrade existing Azure Scale set to Trusted launch
- Scale set (Flex): Private preview. Sign up for the preview at Enable Trusted Launch on Existing Flex Scale Sets (PREVIEW)

What are the pre-requisites to enable Trusted launch? Before planning the upgrade of an existing VM or Scale set to Trusted launch, ensure that:
- The VM size of the given VM or Scale set is supported for Trusted launch. Change the VM size to a Trusted launch supported VM size if needed to support the upgrade.
- The VM or Scale set is running an operating system supported with Trusted launch. For Scale set resources, you can change the OS image reference to a supported OS version along with the Trusted launch upgrade.
- The VM or Scale set is not dependent on Azure features currently not supported with Trusted launch.
- Azure Backup, if enabled for VMs, is configured with the Enhanced Backup policy. Existing Azure VM backups can be migrated from the Standard to the Enhanced policy.
- Azure Site Recovery (ASR), if enabled for VMs, is disabled prior to the upgrade. You can re-enable ASR replication after the Trusted launch upgrade completes.

What are the best practices to consider before upgrade? We recommend following certain best practices before you execute the upgrade to Trusted launch for VMs and Scale sets hosting production workloads:
- Review the step-by-step guide published for Gen1 and Gen2 VMs and Scale sets, including known limitations, issues, and roll-back steps.
- Enable Trusted launch on a test VM or Scale set and determine if any changes are required to meet the prerequisites.
- Create restore points for VMs associated with production workloads before you enable the Trusted launch security type. You can use the restore points to re-create the disks and VM with the previous well-known state.

Can I enable Trusted launch without changing the OS from Gen1 (BIOS) to Gen2 (UEFI)? Trusted launch security capabilities (Secure Boot, vTPM) can be enabled only for Gen2 UEFI-based operating systems; they cannot be enabled for Gen1 BIOS-based operating systems.

How will my new or other VMs or Scale sets be affected? The upgrade is executed on the specific VM or Scale set resource only.
It does not impact new or other existing Azure VMs or Scale set clusters already running in your environment.

Can I roll back the Trusted launch upgrade to the Gen1 (BIOS) configuration? For virtual machines, you can roll back the Trusted launch upgrade to a Gen2 VM without Trusted launch. You cannot roll back in place from Trusted launch to a Gen1 VM. To restore the Gen1 configuration, you’ll need to restore the entire VM and disks from the backup or restore point of the VM taken prior to the upgrade. For scale sets, you can roll back the changes to the previous known good configuration, including the Gen1 configuration.

Azure’s ND GB200 v6 Delivers Record Performance for Inference Workloads
Achieving peak AI performance requires both cutting-edge hardware and a finely optimized infrastructure. Azure’s ND GB200 v6 Virtual Machines, accelerated by NVIDIA GB200 Blackwell GPUs, have already demonstrated world-record performance of 865,000 tokens/s for inferencing on the industry-standard Llama 2 70B benchmark.

Performance analysis of DeepSeek R1 AI Inference using vLLM on ND-H100-v5
Introduction

The DeepSeek R1 model represents a new frontier in large-scale reasoning for AI applications. Designed to tackle complex inference tasks, R1 pushes the boundaries of what’s possible—but not without significant infrastructure demands. To deploy DeepSeek R1 effectively in an inference service like vLLM, high-performance hardware is essential. Specifically, the model requires two Azure ND_H100_v5 nodes, each equipped with 8 NVIDIA H100 GPUs, totaling 16 H100s. These nodes are interconnected via InfiniBand and NVLink, ensuring the bandwidth and latency characteristics necessary to support the model’s massive memory footprint and parallel processing needs.

In this post, we’ll present inference benchmark results for DeepSeek R1, measuring performance across GPU utilization, memory throughput, and interconnect efficiency. While R1 excels in reasoning tasks, it’s important to recognize that such models are not universally optimal. For many general-purpose AI applications, smaller models like Llama 3.1 8B offer a compelling alternative, delivering sufficient accuracy and performance at a fraction of the cost. We explore the performance characteristics of DeepSeek R1 and help you decide when a large reasoning model is worth the investment—and when a leaner solution might be the better choice.

Benchmark environment

Two ND_H100_v5 nodes (16 H100) were required to load the DeepSeek R1 model into GPU memory. vLLM was used to set up the online inference server using a single API server endpoint. vLLM bench (specifically the Hugging Face AI-MO/aimo-validation-aime dataset) was used to run the inference benchmark. Detailed specifications are available in the tables below.

Results

Reasoning model’s additional chain-of-thought tokens

We show that DeepSeek R1 generates many additional tokens (for chain of thought), which helps reasoning models answer complex logical questions. In this example DeepSeek R1 generates 1162 completion tokens, but Llama 3.1 8B only generates 37. This example also demonstrates that while non-reasoning models may struggle with this seemingly simple logical question, reasoning models like DeepSeek R1 can solve it—though they require significantly more intermediate tokens to do so.

#!/bin/bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
      {"role": "user", "content": "9.11 and 9.8, which is greater? Explain your reasoning"}
    ]
  }'

Note: The REST API JSON output is formatted to make the output more readable.
(./run_restapi_query.sh 2> /dev/null | python3 ./convert_json_2_txt.py >& run_query.out_$$)

Reasoning: Okay, so I need to figure out whether 9.11 is greater than 9.8 or if 9.8 is greater. Hmm, let me start by looking at both numbers. They both have a 9 in the units place, so that part is equal. Now, I need to compare the decimal parts: .11 and .8. Wait, but decimals can be tricky because the places matter. The first number is 9.11, which has a 1 in the tenths place and a 1 in the hundredths place. The second number is 9.8, which is the same as 9.80 when you add a zero to the hundredths place for comparison. So, 9.80 versus 9.11. Now, looking at the tenths place: 8 is greater than 1. Therefore, 9.80 should be greater than 9.11. But let me double-check because sometimes when numbers have different decimal lengths, it's easy to make a mistake. If I convert both to fractions, maybe that will help. 9.11 is the same as 9 + 0.11, and 0.11 is 11/100.
On the other hand, 9.8 is 9 + 0.8, which is 9 + 8/10. To compare 8/10 and 11/100, I can convert them to have the same denominator. 8/10 is equal to 80/100. Now, 80/100 is much larger than 11/100. So yes, 9.8 is indeed greater than 9.11. Another way to see it is by subtracting the two numbers. If I subtract 9.11 from 9.8, what do I get? Let's see: 9.8 - 9.11. Since 9.8 is the same as 9.80, subtracting 9.11 gives 0.69. Since the result is positive, that means 9.8 is larger. Wait, maybe I can also think about money to make it clearer. If I have $9.11 and you have $9.80, which is more? $9.80 is $9 and 80 cents, whereas $9.11 is $9 and 11 cents. Definitely, 80 cents is more than 11 cents, so $9.80 is more. That helps visualize it. Alternatively, using a number line. If I imagine a number line with 9.0, 9.1, 9.2,... up to 10.0. 9.11 is just a little past 9.1, while 9.8 is almost at 10.0. So clearly, 9.8 is much further to the right, meaning it's a larger number. I think that covers different methods: decimal place comparison, converting to fractions, subtraction, real-life money examples, and number line visualization. All of these methods point to 9.8 being greater than 9.11. I don't see any contradictions here. Maybe I can also consider writing both numbers with the same number of decimal places to make it even clearer. 9.11 is already two decimal places, and 9.8 can be written as 9.80. Now, comparing 9.80 and 9.11 digit by digit: - The units place: 9 vs. 9 (equal) - Tenths place: 8 vs. 1 (8 is greater) - Since the tenths place of 9.80 is greater, the entire number is larger, regardless of the hundredths place. So yes, that confirms it again. The key idea here is that when comparing decimals, you start from the leftmost digit after the decimal point and move right until you find a difference. In this case, the tenths place already shows that 9.8 is larger. I guess another common mistake could be thinking that because 11 is more than 8, 0.11 is more than 0.8, but that's not correct because the places are different. Tenths versus hundredths. 0.8 is equivalent to 0.80, which is way bigger than 0.11. To prevent that mistake, always align the decimal points and add trailing zeros if necessary. Then compare each digit one by one from left to right. That method avoids confusion. So, all in all, through multiple approaches, 9.8 is definitely greater than 9.11. Answer: To determine which number is greater between **9.11** and **9.8**, follow these steps: **Align the decimal places** by rewriting 9.8 as **9.80**. **Compare digit by digit** from left to right: - **Units place**: Both have **9** (equal). - **Tenths place**: **8** (in 9.80) vs. **1** (in 9.11). Since \(8 > 1\), **9.80 is greater** at this point. - No need to compare hundredths place once a difference is found. **Conclusion**: \(9.8\) (or \(9.80\)) is greater than \(9.11\) because **8 tenths** (\(0.8\)) is larger than **1 tenth + 1 hundredth** (\(0.11\)). \(\boxed{9.8 \text{ is greater}}\) Completion Info: Finish reason: stop Prompt tokens: 21 Completion tokens: 1162 Total tokens: 1183 When running using Llama 3.1 8B (Not reasoning model). Answer: To determine which is greater, we need to compare the two numbers. 9.11 is greater than 9.8 because 9.11 is a larger decimal value. 
Completion Info:
Finish reason: stop
Prompt tokens: 51
Completion tokens: 37
Total tokens: 88

Throughput and latency results

Cost comparison

In this cost analysis we use the ND-H100-v5 and ND-H200-v5 pay-as-you-go pricing in the South Central US region and the measured total throughput (tokens/sec) to compute the $/(1K tokens). Note: ND-H200-v5 pricing was estimated at 20% more than ND-H100-v5 pricing.

Analysis

DeepSeek R1 is a large, complex reasoning model that is costlier and slower than smaller models. It needs 16 H100 GPUs for FP8 precision and generates many more intermediate tokens in its chain-of-thought process—about 31 times more than Llama 3.1 8B—but at a much slower rate (~54 times slower). Its latency is also higher, with TTFT and ITL being roughly 6 and 3 times slower, respectively. The DeepSeek R1 model has small intranode and internode network requirements (~14% of available InfiniBand network bandwidth was used, and < 1% of available NVLink bandwidth was used). GPUs with higher memory bandwidth and higher FLOPS would help improve its performance. The cost analysis shows that the cost to generate DeepSeek R1 tokens is ~54 times more expensive than Llama 3.1 8B on the same 16 H100 GPUs and ~34 times more expensive on 8 H200 GPUs. The DeepSeek R1 model is very capable, but due to its higher TCO it should only be used in specific AI applications that require its strong reasoning abilities.

Conclusion

The DeepSeek R1 model demonstrates exceptional reasoning capabilities, but its deployment demands substantial infrastructure and incurs high latency and cost. While it excels in generating detailed chains of thought, its throughput and efficiency lag significantly behind smaller models like Llama 3.1 8B. For applications requiring deep logical analysis, DeepSeek R1 is a powerful tool. However, for general-purpose inference tasks, more lightweight models offer better performance and cost-effectiveness. Strategic use of DeepSeek R1 should be reserved for scenarios where its advanced reasoning justifies the resource investment.

References
DeepSeek R1 model on Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-R1
vLLM GitHub repository: https://github.com/vllm-project/vllm
Azure ND H100 v5 documentation: https://learn.microsoft.com/en-us/azure/virtual-machines/nd-h100-v5-series
FlashInfer GitHub repository: https://github.com/flashinfer-ai/flashinfer
DeepGEMM GitHub repository: https://github.com/deepseek-ai/DeepGEMM
AI-MO validation dataset on Hugging Face: https://huggingface.co/datasets/AI-MO/aimo-validation-aime

Appendix

Install vLLM
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv myvllm --python 3.11 --seed
source myvllm/bin/activate
uv pip install vllm --torch-backend=auto
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
uv pip install ninja
cd flashinfer
uv pip install --no-build-isolation --verbose .

Install DeepSeek DeepEP
git clone https://github.com/vllm-project/vllm.git
cd ~/vllm/tools/ep_kernels
export CUDA_HOME=/usr/local/cuda-12.8
export TORCH_CUDA_ARCH_LIST="9.0"   # for Hopper
bash install_python_libraries.sh 2>&1 | tee install_python_libraries.log_$$
sudo bash configure_system_drivers.sh 2>&1 | tee configure_system_drivers.log_$$
sudo reboot

Install DeepSeek DeepGEMM
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
./install.sh 2>&1 | tee install.log_$$

Configure DeepSeek R1 with vLLM on 2 ND_H100_v5

Second node configuration
Execute this script on the second node before the script on the primary node.
#!/bin/bash
MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000
export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0
vllm serve $MODEL --port $PORT --tensor-parallel-size 1 --enable-expert-parallel \
  --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-start-rank 8 \
  --data-parallel-address 10.0.0.6 --data-parallel-rpc-port 23345 --headless \
  --max-model-len 32768 --reasoning-parser deepseek_r1

Primary node configuration

#!/bin/bash
MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000
export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0
vllm serve $MODEL --port $PORT --tensor-parallel-size 1 --enable-expert-parallel \
  --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address 10.0.0.6 \
  --data-parallel-rpc-port 23345 --api-server-count 1 --max-model-len 32768 \
  --reasoning-parser deepseek_r1

Install vLLM benchmark environment
cd vllm
uv pip install vllm[bench]

Run vLLM benchmark

#!/bin/bash
vllm bench serve \
  --backend vllm \
  --model deepseek-ai/DeepSeek-R1 \
  --endpoint /v1/completions \
  --dataset-name hf \
  --dataset-path AI-MO/aimo-validation-aime \
  --ramp-up-strategy linear \
  --ramp-up-start-rps 1 \
  --ramp-up-end-rps 10 \
  --num-prompts 400 \
  --seed 42

Inference performance of Llama 3.1 8B using vLLM across various GPUs and CPUs
Introduction

Following our previous evaluation of Llama 3.1 8B inference performance on Azure’s ND-H100-v5 infrastructure using vLLM, this report broadens the scope to compare inference performance across a range of GPU and CPU platforms. Using the Hugging Face inference benchmarker, we assess not only throughput and latency but also the cost-efficiency of each configuration—an increasingly critical factor for enterprise deployment. As organizations seek scalable and budget-conscious solutions for deploying large language models (LLMs), understanding the trade-offs between compute-bound and memory-bound stages of inference becomes essential. Smaller models like Llama 3.1 8B offer a compelling balance between capability and resource demand, but the underlying hardware and software stack can dramatically influence both performance and operational cost.

This report presents a comparative analysis of inference performance across multiple hardware platforms, factoring in:
- Token throughput and latency across chat, classification, and code generation workloads.
- Resource utilization, including KV cache utilization and efficiency.
- Cost per token, derived from cloud pricing models and hardware utilization metrics.

By combining performance metrics with cost analysis, we aim to identify the most effective deployment strategies for enterprise-grade LLMs, whether optimizing for speed, scalability, or budget.

Benchmark environment

Inference benchmark

The Hugging Face inference benchmarking code was used for the AI inference benchmark. Three popular AI inference profiles were examined:
- Chat: Probably the most common use case; a question-and-answer format on a wide range of topics.
- Classification: Providing various documents and requesting a summary of their contents.
- Code generation: Providing code and requesting code generation, e.g. create a new function.

Profile | Data set | Input prompt | Output prompt
Chat | hlarcher/inference-benchmarker/share_gpt_turns.json | N/A | min=50, max=800, variance=100
Classification | hlarcher/inference-benchmarker/classification.json | min=8000, max=12000, variance=5000 | min=30, max=80, variance=10
Code generation | hlarcher/inference-benchmarker/github_code.json | min=3000, max=6000, variance=1000 | min=30, max=80, variance=10

Hugging Face Llama 3.1 8B model used

Model | Precision | Model size (GiB)
meta-llama/Llama-3.1-8B-Instruct | FP16 | 14.9

vLLM parameters and default values
gpu_memory_utilization: 0.9
max_num_seqs: 1024
max_num_batched_tokens: 2048 (A100), 8192 (H100, H200)
enable_chunked_prefill: True
enable_prefix_caching: True

VM Configuration

GPU: ND-H100-v5, ND-H200-v5, ND-A100-v4 (8 GPUs per VM; A100 in 80GB and 40GB variants) running HPC Ubuntu 22.04 (PyTorch 2.7.0+cu128, GPU driver 535.161.08 and NCCL 2.21.5-1). 1 GPU was used in the benchmark tests.
CPU: Ubuntu 22.04 (HPC and Canonical/jammy)

Results

GPU | Profile | Avg prompt throughput | Avg generation throughput | Max # requests waiting | Max KV cache usage % | Avg KV cache hit rate %
H100 | Chat | ~2667 | ~6067 | 0 | ~14% | ~75%
H100 | Classification | ~254149 | ~1291 | 0 | ~46% | ~98%
H100 | Code generation | ~22269 | ~266 | ~111 | ~93% | ~1%
H200 | Chat | ~3271 | ~7464 | 0 | ~2% | ~77%
H200 | Classification | ~337301 | ~1635 | 0 | ~24% | ~99%
H200 | Code generation | ~22726 | ~274 | ~57 | ~46% | ~1%
A100 | Chat | ~1177 | ~2622 | 0 | ~2% | ~75%
A100 | Classification | ~64526 | ~333 | 0 | ~45% | ~97%
A100 | Code generation | ~7926 | ~95 | ~106 | ~21% | ~1%
A100_40G | Chat | ~1069 | ~2459 | 0 | ~27% | ~75%
A100_40G | Classification | ~7846 | ~39 | ~116 | ~68% | ~5%
A100_40G | Code generation | ~7836 | ~94 | ~123 | ~66% | ~1%

Cost analysis

The cost analysis used pay-as-you-go pricing for the South Central US region and the measured throughput in tokens per second to calculate the metric $/(1K tokens).

CPU performance and takeaways

The Hugging Face AI-MO/aimo-validation-aime data was used by vllm bench to test the performance of Llama 3.1 8B on various VM types (left graph below). It is a struggle (insufficient FLOPS and memory bandwidth) to run Llama 3.1 8B on CPU VMs; even the best-performing CPU VM (HB176-96_v4) delivers throughput and latency significantly worse than the A100 40GB GPU.

Tips:
- Enable/use AVX512 (avx512f, avx512_bf16, avx512_vnni, etc.); see what is supported/available via lscpu.
- Put the AI model on a single socket (if it has sufficient memory). For larger models you can use tensor parallelism to split the model across sockets.
- Use pinning to specify which cores the threads will run on (in vLLM, VLLM_CPU_OMP_THREADS_BIND=0-22).
- Specify a large enough KV cache in CPU memory (in vLLM, VLLM_CPU_KVCACHE_SPACE=100).

Analysis

Throughput & latency: H200 outperforms all other GPUs across all workloads, with the highest prompt and generation throughput. H100 is a close second, showing strong performance especially in classification and code generation. A100 and A100_40G lag significantly behind, particularly in classification tasks where throughput drops by an order of magnitude (on A100_40G, due to smaller GPU memory and a lower KV cache hit percentage).

KV cache utilization: H200 and H100 show efficient cache usage with high hit rates (up to 99%) and low waiting requests (the exception is code generation, which has low hit rates of ~1%). A100_40G suffers from high KV cache usage and low hit rates, especially in classification and code generation, indicating memory bottlenecks. The strain on the inference server is observed in the higher number of waiting requests.

Cost efficiency: For chat profiles, the A100 (40G) GPU offers the best value. For classification profiles, the H200 is most cost-effective. For code-generation profiles, the H100 provides the greatest cost efficiency.

CPU vs GPU: Llama 3.1 8B can run on CPU VMs, but the throughput and latency are so poor compared to GPUs that it does not make practical or financial sense to do so. Smaller AI models (<= 1B parameters) may be acceptable on CPUs for some lightweight inference services (like chat).

Conclusion

The benchmarking results clearly demonstrate that hardware choice significantly impacts the inference performance and cost-efficiency of Llama 3.1 8B deployments. The H200 GPU consistently delivers the highest throughput and cache efficiency across workloads, making it the top performer overall. H100 follows closely, especially excelling in code generation tasks. While A100 and A100_40G offer budget-friendly options for chat workloads, their limitations in memory and cache performance make them less suitable for more demanding tasks.
CPU virtual machines do not offer adequate performance—in terms of throughput and latency—for running AI models comparable in size to Llama 3.1 8B. These insights provide a practical foundation for selecting optimal infrastructure based on inference workload type and cost constraints.

References
- Hugging Face Inference Benchmarker: https://github.com/huggingface/inference-benchmarker
- Datasets used for benchmarking: Chat: hlarcher/inference-benchmarker/share_gpt_turns.json; Classification: hlarcher/inference-benchmarker/classification.json; Code generation: hlarcher/inference-benchmarker/github_code.json
- Model: meta-llama/Llama-3.1-8B-Instruct on Hugging Face: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
- vLLM Inference Engine: https://github.com/vllm-project/vllm
- Azure ND-Series GPU Infrastructure: https://learn.microsoft.com/en-us/azure/virtual-machines/nd-series
- PyTorch 2.7.0 + CUDA 12.8: https://pytorch.org
- NVIDIA GPU Drivers and NCCL (driver 535.161.08, NCCL 2.21.5-1): https://developer.nvidia.com/nccl
- Azure Pricing Calculator (South Central US region): https://azure.microsoft.com/en-us/pricing/calculator
- CPU - vLLM

Appendix

Install vLLM on CPU VMs
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source

Edit the Dockerfiles (vllm_source/docker/Dockerfile.cpu):
cp Dockerfile.cpu Dockerfile_serve.cpu
# change the last line to: ENTRYPOINT ["/opt/venv/bin/vllm","serve"]
cp Dockerfile.cpu Dockerfile_bench.cpu
# change the last line to: ENTRYPOINT ["/opt/venv/bin/vllm","bench","serve"]

Build the images (enable the AVX512 features supported by the CPU; see lscpu):
docker build -f docker/Dockerfile_serve.cpu --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_DISABLE_AVX512=false --tag vllm-serve-cpu-env --target vllm-openai .
docker build -f docker/Dockerfile_bench.cpu --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_DISABLE_AVX512=false --tag vllm-bench-cpu-env --target vllm-openai .

Start the vLLM server (remember to set <YOUR HF TOKEN> and <CPU CORE RANGE>):
docker run --rm --privileged=true --shm-size=8g -p 8000:8000 -e VLLM_CPU_KVCACHE_SPACE=<SIZE in GiB> -e VLLM_CPU_OMP_THREADS_BIND=<CPU CORE RANGE> -e HF_TOKEN=<YOUR HF TOKEN> -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" vllm-serve-cpu-env meta-llama/Llama-3.1-8B-Instruct --port 8000 --dtype=bfloat16

Run the vLLM benchmark (remember to set <YOUR HF TOKEN>):
docker run --rm --privileged=true --shm-size=4g -e HF_TOKEN=<YOUR HF TOKEN> -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" vllm-bench-cpu-env --backend vllm --model=meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name hf --dataset-path AI-MO/aimo-validation-aime --ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 2 --num-prompts 200 --seed 42 --host 10.0.0.4

Performance of Llama 3.1 8B AI Inference using vLLM on ND-H100-v5
Introduction

The pace of development in large language models (LLMs) has continued to accelerate as the global AI community races toward the goal of artificial general intelligence (AGI). Today’s most advanced models boast trillions of parameters, pushing the boundaries of what machines can understand and generate. However, this scale comes at a steep cost—both in terms of training and inference—due to the immense GPU resources required to host and operate these models. Yet, innovation is not limited to those with access to the largest AI supercomputers. DeepSeek has demonstrated that it is possible to build highly competitive models without relying on the latest, most expensive infrastructure. At the same time, a renewed wave of open-source collaboration is challenging the closed-source strategies of leading AI companies, offering more accessible and customizable alternatives.

For enterprise customers, the focus is shifting toward practical, cost-effective solutions. Rather than deploying trillion-parameter giants, many organizations are turning to smaller models—such as those with around 8 billion parameters—that strike a balance between accuracy and efficiency. These models are not only easier to fine-tune and deploy but also significantly reduce the cost per token, making them ideal for real-world business applications. In this paper, we explore the capabilities of the Llama 3.1 8B model as a representative example of a modern, enterprise-ready LLM. We benchmark its inference performance on Azure’s ND-H100 v5 infrastructure using the vLLM engine and present our findings along with recommendations for enterprise deployment.

AI Inference architecture

Inference in transformer-based large language models (LLMs) is typically divided into two primary stages: prefill and decode. Understanding the characteristics and resource demands of each stage is essential for optimizing performance, especially when deploying models like Llama 3.1 8B in enterprise environments.

Prefill Stage: Compute-Bound Initialization

The prefill stage is responsible for processing the input prompt. It involves tokenizing the input and performing a forward pass through the model to populate the key-value (KV) cache. This stage is compute-intensive, as it requires full attention computation across all input tokens. The performance bottleneck here is typically the GPU's compute throughput, especially for long prompts or large batch sizes.

Decode Stage: Memory-Bound Token Generation

Once the KV cache is populated, the decode stage begins. This stage generates one token at a time, using the cached context to predict the next token. The decode step is memory-bound, as it relies heavily on fast access to the KV cache. When the model achieves a KV cache hit, it can skip re-computation of earlier tokens, significantly reducing latency and compute load. This makes cache efficiency a critical factor in real-time inference performance.

Fig 1. High-level architecture of AI inference, showing that efficient use of the KV cache can increase token throughput and reduce AI inference latency.

Overall Inference Characteristics

In general, AI inference is memory-bound, particularly during the decode phase. The ability to serve multiple concurrent requests efficiently depends on how well the system can manage memory bandwidth and cache locality. As such, optimizing memory access patterns and minimizing cache misses are key to achieving high throughput and low latency.
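To make the memory-bound nature of the decode stage concrete, the back-of-envelope sketch below estimates the KV cache footprint for Llama 3.1 8B in FP16. The architectural figures (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and the ~15 GiB weight footprint are assumptions based on the published model card, so treat the output as a rough estimate rather than a vLLM measurement.

#!/bin/bash
# Rough KV cache sizing for Llama 3.1 8B in FP16 (assumed: 32 layers, 8 KV heads, head dim 128).
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; BYTES_PER_ELEM=2
# Both K and V are cached per layer, hence the factor of 2.
PER_TOKEN_BYTES=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM))
echo "KV cache per token: ${PER_TOKEN_BYTES} bytes ($((PER_TOKEN_BYTES / 1024)) KiB)"

# One 80 GiB H100 at gpu_memory_utilization=0.9, minus roughly 15 GiB of FP16 weights.
USABLE_GIB=$(( 80 * 9 / 10 - 15 ))
MAX_CACHED_TOKENS=$(( USABLE_GIB * 1024 * 1024 * 1024 / PER_TOKEN_BYTES ))
echo "Approximate KV cache capacity: ${MAX_CACHED_TOKENS} tokens (~$((MAX_CACHED_TOKENS / 1000))K)"

At roughly 128 KiB per token, a single 80 GiB H100 can hold hundreds of thousands of cached tokens, which is why the KV cache hit rate, rather than raw cache capacity, tends to be the limiting factor in the results that follow.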
Techniques for Optimization

To maximize GPU utilization and token throughput while minimizing latency, several architectural strategies are employed:
- KV cache management: Efficient reuse and eviction policies to maintain high cache hit rates. vLLM uses PagedAttention (inspired by the virtual memory and paging techniques used in operating systems) to manage the KV cache using blocks/pages. This allows vLLM to utilize HBM memory efficiently and dynamically and minimizes memory fragmentation.
- Batching and scheduling: Grouping similar requests to improve parallelism and reduce idle GPU time. vLLM has a few parameters to control batching/parallelism: MAX_NUM_SEQS (how many input requests to process in parallel) and MAX_NUM_BATCHED_TOKENS (the number of tokens to process in parallel in the forward pass through the deep neural network). Note: larger values may not always be optimal; you could improve token throughput at the expense of latency.
- Weight and activation quantization (FP8): Reducing the precision of AI model weights and activations can free memory to load larger AI models or hold a larger KV cache. Lowering the precision also allows computations to be performed on more efficient (higher FLOPS) GPU computational units.
- Parallelization techniques: Tensor parallelism, pipeline parallelism, expert parallelism or data parallelism can be used to split larger models across multiple nodes/GPUs. Tensor parallelism shards the model across the GPUs, with each GPU holding a portion of every layer. Pipeline parallelism involves dividing the model by layers, where a group of nodes (or GPUs) is responsible for processing its assigned DNN layers. Expert parallelism supports Mixture of Experts (MoE) models, where different expert networks can be distributed across GPUs. Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.
- Speculative decoding: Predicting multiple tokens ahead to reduce the number of forward passes.
- Prefill/decode decoupling: Recent advancements, such as those implemented in vLLM (and NVIDIA Dynamo), decouple the prefill and decode stages, allowing each to be assigned dedicated GPU or CPU resources. This separation enables better resource allocation and parallelism, especially in multi-tenant or high-throughput environments.

By leveraging these techniques, vLLM provides a highly efficient inference engine that is well-suited for serving modern LLMs like Llama 3.1 8B. This makes it a compelling choice for enterprise applications where cost, latency, and scalability are critical considerations.

Benchmark environment

Inference benchmark

The Hugging Face inference benchmarking code was used for the AI inference benchmark. Three popular AI inference profiles were examined:
- Chat: Probably the most common use case; a question-and-answer format on a wide range of topics.
- Classification: Providing various documents and requesting a summary of their contents.
- Code generation: Providing code and requesting code generation, e.g. create a new function.
Profile | Data set | Input prompt | Output prompt
Chat | hlarcher/inference-benchmarker/share_gpt_turns.json | N/A | min=50, max=800, variance=100
Classification | hlarcher/inference-benchmarker/classification.json | min=8000, max=12000, variance=5000 | min=30, max=80, variance=10
Code generation | hlarcher/inference-benchmarker/github_code.json | min=3000, max=6000, variance=1000 | min=30, max=80, variance=10

Hugging Face Llama 3.1 8B models used

Model | Precision | Model size (GiB)
meta-llama/Llama-3.1-8B-Instruct | FP16 | 14.9
neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 | FP8 | 8.4
nvidia/Llama-3.1-8B-Instruct-FP8 | FP8 | 8.4

vLLM parameters and default values
gpu_memory_utilization: 0.9
max_num_seqs: 1024
max_num_batched_tokens: 8192
enable_chunked_prefill: True
enable_prefix_caching: True

Dynamo parameters and default values
block-size: 64
max-model-len: 16384
kv_connector: DynamoNixlConnector
router: round-robin
remote-prefill: True
conditional-disagg: True
max-local-prefill-length: 10
max-prefill-queue-size: 2
max-num-batched-tokens: 16384
Environment: Local
No-operation (Planner): True

Results

Fig 2: AI inference performance comparison between the chat, classification and code generation profiles on ND-H100-v5 (8 H100).

Profile | Avg prompt throughput | Avg generation throughput | Max # requests waiting | Max KV cache usage % | Avg KV cache hit rate %
Chat | ~7500 | ~17600 | 0 | ~2% | ~78%
Classification | ~55300 | ~2900 | 0 | ~5% | ~100%
Code generation | ~130400 | ~1450 | ~37 | ~10% | ~2%

Fig 3: Shows the impact of modifying MAX_NUM_BATCHED_TOKENS on the code-generation inference benchmark. (This parameter has a greater impact on the code-generation benchmark compared to chat/classification because of the low KV cache hit percentage.)

Fig 4: Code generation inference benchmark run on 1 H100 showing the performance impact of using FP8 quantization.

Fig 5: Code generation benchmark run on 1, 2, 4 & 8 H100. The results indicate that higher token throughput could be achieved by running inference with one copy of the model on each GPU instead of distributing the model (via tensor parallelism) amongst the 8 GPUs.

Fig 6: Impact on throughput (tokens/sec) of adjusting the AI inference configuration (vLLM) on ND-H100-v5.

Results comparing Dynamo vs traditional vLLM

Fig 7: Dynamo vs traditional vLLM throughput (tokens/sec) comparison on 1 ND_H100_v5 (8 GPUs). The best traditional vLLM configuration (8 x vLLM, tensor_parallel=1) throughput performance is compared with various Dynamo configurations (i.e. different combinations of GPUs assigned to decode and prefill).

Note: vLLM does have an experimental disaggregated prefill capability with various connector types. I attempted to run vLLM experimental disaggregated prefill using kv-connector = LMCacheConnectorV1 (with Nixl enabled). I got mixed results, eventually running into the following issues (and decided to switch to NVIDIA Dynamo instead): limited control over allocating GPUs to decode vs prefill (used the tensor parallel option, but limited to specific ratios by the number of attention multi-heads); and memory management problems, with OOM errors even though there was plenty of HBM memory available on the GPUs (HBM was not distributed evenly amongst the GPUs).

Analysis

The performance of the inference prefill is determined by the length and number of input prompts; this phase is compute-bound (effectively a matrix-matrix operation), and we see that the code-generation profile does best primarily because it had the largest number of input tokens.
The decode phase is a memory-bound operation (effectively a vector-matrix operation), and performance in this phase is heavily dependent on the KV cache hit percentage. The code-generation profile only had a ~1.7% KV cache hit percentage (there was plenty of HBM capacity; only ~10% of the available KV cache capacity was used), which resulted in slow decode performance, impacting its overall throughput and latency especially at higher QPS (the code-generation benchmark was the only one which had requests backing up and waiting). The classification profile did well in the decode phase primarily due to its high KV cache hit percentage (~100%), but it struggled in overall throughput due to the small length of the output tokens.

Adjusting the size of MAX_NUM_BATCHED_TOKENS had very little impact on the chat and classification benchmarks, probably because they had high KV cache hit percentages, but it did impact the performance of the code-generation benchmark (a ~3% improvement in tokens/sec using a larger value).

Quantization of the AI model can free up HBM memory to allow a larger model to load, or improve AI inference performance by providing more memory for KV caching; it can also improve performance by performing computations in a lower-precision (e.g. FP8) format that delivers higher FLOPS. In this case there is plenty of HBM and none of the three profiles used all of the available KV cache space, so FP8 quantization does not improve KV caching efficiency. Improvements in compute performance are observed with quantization, especially for the code-generation profile, which had a low KV cache hit percentage; the code-generation tokens/sec on 1 GPU improved by ~38%.

Since the Llama 3.1 8B model can easily fit on 1 H100, you can get significantly better total throughput (tokens/sec) on ND-H100-v5 if a complete model is loaded onto each separate GPU instead of splitting the model across multiple GPUs (via tensor parallelism). The chat, classification and code generation inference throughput improved 4.2, 5.2 and 1.9 times, respectively.

Newer inference server architectures feature disaggregated prefill, which allows you to decouple prefill from decode and assign resources (GPUs, CPUs) to each type of worker (prefill or decode). This is especially suited for large reasoning models with large context windows running on large GPU inference clusters, where significant performance gains have been reported. In this case we have a modest-size (8B parameter) NLP model running on a single ND_H100_v5 node (only 8 GPUs), so we were not expecting any significant performance improvements. The traditional aggregated vLLM was much faster than the best Dynamo configuration when running this inference benchmark on the Llama 3.1 8B model on ND_H100_v5. In this case the model can fit in a single GPU, and the overhead of disaggregation might outweigh any parallelism gains when one GPU can already handle both phases efficiently.

Conclusions

When analyzing the performance of AI inference, it’s important not only to focus on the number of input and output tokens but also on the type of AI inference application, which can have a big impact on KV cache effectiveness. Smaller AI models provide opportunities and more options to configure your environment in an optimal way to maximize tokens/sec throughput.

References
- Welcome to vLLM — vLLM
- Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog
- huggingface/inference-benchmarker: Inference server benchmarking tool
- meta-llama/llama-models: Utilities intended for use with Llama models.
- [2309.06180] Efficient Memory Management for Large Language Model Serving with PagedAttention
- ND-H100-v5 size series - Azure Virtual Machines | Microsoft Learn
- Hugging Face – The AI community building the future.
- Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models | NVIDIA Technical Blog

Appendix

Installation of the Hugging Face inference benchmarker

Install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Build and install the inference benchmarker:
cargo install --git https://github.com/huggingface/inference-benchmarker/

Installation of vLLM

Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh

Set up the Python workspace:
uv venv --python 3.10 --seed
source .venv/bin/activate

Install vLLM:
uv pip install vllm --torch-backend=auto

Install FlashInfer:
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
pip install ninja
cd flashinfer
pip install --no-build-isolation --verbose .

Start the vLLM server:
#!/bin/bash
NUM_GPUS=8
#NUM_GPUS=4
#NUM_GPUS=2
#NUM_GPUS=1
MODEL=meta-llama/Llama-3.1-8B-Instruct
#MODEL=nvidia/Llama-3.1-8B-Instruct-FP8
#MODEL=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
PORT=5000
export VLLM_LOGGING_LEVEL=INFO
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port $PORT --model $MODEL --tensor-parallel-size $NUM_GPUS --dtype auto

Run the inference benchmark:
TOKENIZER_NAME="meta-llama/Llama-3.1-8B-Instruct"
#TOKENIZER_NAME="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8"
#TOKENIZER_NAME="nvidia/Llama-3.1-8B-Instruct-FP8"
inference-benchmarker \
  --tokenizer-name $TOKENIZER_NAME \
  --url http://localhost:5000 \
  --profile code-generation -n

NVIDIA Dynamo installation

Install Python virtual environment support:
sudo apt install python3.10-venv

Create the Dynamo virtual workspace:
python3 -m venv mydynamo

Activate the virtual environment:
source /home/azureuser/mydynamo/bin/activate

Check out the Dynamo GitHub code:
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

Install the Dynamo prerequisites:
pip install "ai-dynamo[all]"

Install some additional Python modules and packages:
pip install tensorboardX
pip install pynvml
sudo apt install etcd
systemctl status nats

Restart etcd:
systemctl restart etcd

Set up nats-server:
wget https://github.com/nats-io/nats-server/releases/download/v2.10.22/nats-server-v2.10.22-linux-amd64.zip
unzip nats-server-v2.10.22-linux-amd64.zip
sudo mv nats-server-v2.10.22-linux-amd64/nats-server /usr/local/bin/
sudo chmod +x /usr/local/bin/nats-server

Create/edit /etc/systemd/system/nats.service:
[Unit]
Description=NATS Server
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/nats-server -js
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target

export NATS_SERVER="nats://localhost:4222"

Start the Dynamo server/service:
cd $DYNAMO_HOME/examples/llm
# edit the disagg.yaml file to modify parameters
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml

Set the Dynamo environment variables:
export DYNAMO_HOME=$(pwd)
cd $DYNAMO_HOME/examples/llm

Optimizing Large-Scale AI Performance with Pretraining Validation on a Single Azure ND GB200 v6
Small performance gaps on a single virtual machine lead to large and costly performance losses at scale. Running small-scale pretraining jobs enables single-VM validation and allows for fine-grained identification of issues such as performance degradation, hardware bottlenecks, or software inefficiencies ahead of large-scale runs. In this post, we present a practical methodology for benchmarking the ND GB200 v6 virtual machines (VMs). A single ND GB200 v6 VM on Azure is powered by two NVIDIA Grace CPUs and four NVIDIA Blackwell GPUs. To ensure reliability in production workloads, we used automated pretraining of lightweight Llama models with the NVIDIA NeMo framework. By systematically exploring and tuning key performance parameters and rigorously cross-checking results with detailed telemetry, we identify the conditions that most significantly stress the GPUs. You can reproduce and reuse these pretraining workloads from our fully automated Azure AI Benchmarking guide.

Announcing the General Availability of Azure FXv2-series Virtual Machines
Today, Microsoft is excited to announce the General Availability of the new Azure FXv2-series Virtual Machines (VMs), powered by the 5th Generation Intel® Xeon® Platinum 8573C (Emerald Rapids) processor. This release includes the compute-optimized FXmsv2-series and FXmdsv2-series VMs. The FXv2-series VMs are optimally designed and purpose-built for compute-intensive workloads such as databases, data analytics, and electronic design automation (EDA), demanding substantial memory, high-performance storage, and I/O bandwidth.

FXv2-series VMs offer key advantages for SQL Server workloads, including enhanced performance that ensures efficient management of large volumes of data and transactions. This is essential for applications requiring high I/O operations per second (IOPS) and low latency. Additionally, FXv2-series VMs offer EDA customers an enhanced Intel-based option with larger L3 caches, higher instructions per clock, increased memory capacity, and faster storage compared to the previous generation FXv1 VMs. These improvements are especially beneficial for customers in the semiconductor industry, as they help to reduce chip design turnaround time, accelerate time-to-market, and enhance license utilization to lower overall TCO.

Azure FXv2-series VMs have been engineered to deliver exceptional CPU performance, up to 50% better than the previous generation FXv1-series VMs, and feature an all-core-turbo frequency of up to 4.0 GHz. The FXv2-series offers VM sizes up to 96 vCPUs, which is twice the number of vCPUs compared to the previous generation. Furthermore, these new VMs offer up to 1,832 GiB of memory with a memory-to-core ratio of 21 GiB/vCPU.

FXv2-series VMs support the NVMe protocol for both local and remote disks, delivering fast, low-latency storage performance. NVMe enables up to 2x IOPS and up to 5x throughput in remote storage performance compared to the previous generation. These VMs support all remote disk types, including Premium SSD v2 and Ultra Disk, offering up to 400K IOPS and up to 11.25 GBps throughput. Additionally, these VMs offer up to 70 Gbps of network bandwidth. Enhancements from Azure Boost significantly improve networking, storage, overall CPU performance, and security of the VMs. Azure Boost also utilizes the new Microsoft Azure Network Adapter (MANA), a next-generation network interface that provides stable, forward-compatible drivers for Windows and Linux operating systems.

These VMs have enhanced AI capabilities with Intel® Advanced Matrix Extensions (AMX), delivering higher inference and training performance, and enhanced security capabilities with Intel® Total Memory Encryption (TME) technology, providing enhanced protection for data in system memory. These VMs are offered in standard sizes and constrained-core configurations, which gives you flexibility to choose the core and memory setup that best suits your workload. You may also choose VMs with or without a local disk, with the FXmdsv2-series and FXmsv2-series, respectively. You can learn more about the new Azure FXv2-series VMs by visiting the specification pages at FXmdsv2-series, FXmsv2-series and the associated constrained-core sizes.

The new FXv2-series VMs are broadly available in the following Azure regions: Australia East, Canada Central, Central US, East US, East US 2, Germany West Central, Japan East, Korea Central, South Africa North, South Central US, Sweden Central, Switzerland North, West Europe, West US 3. Additional regions will be coming in 2025.
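To put the remote-storage numbers above in context, the sketch below provisions a Premium SSD v2 data disk with explicitly chosen IOPS and throughput that could then be attached to an FXv2-series VM. The resource names and the 64,000 IOPS / 1,200 MB/s values are illustrative assumptions only; actual limits depend on the disk size, the region/zone, and the per-VM caps of the chosen FXv2 size.

#!/bin/bash
# Illustrative sketch: create a Premium SSD v2 disk with provisioned performance
# (names and performance targets are placeholders, not recommended settings).
RG="my-resource-group"
LOCATION="southcentralus"

az disk create \
  --resource-group "$RG" \
  --name "fxv2-data-disk" \
  --location "$LOCATION" \
  --zone 1 \
  --sku PremiumV2_LRS \
  --size-gb 1024 \
  --disk-iops-read-write 64000 \
  --disk-mbps-read-write 1200

# Attach the disk to an existing VM (the VM name is a placeholder).
az vm disk attach --resource-group "$RG" --vm-name "my-fxv2-vm" --name "fxv2-data-disk"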
To learn more, refer to Product Availability by Region.

Our customers have shared their perspective:

“The new Azure FXv2 series VMs offer a perfect mix of capabilities for running Oracle databases and RAC clusters. Database workloads benefit from a large number of CPU cores, large memory size, high network bandwidth, and fast storage. The FXv2 have all of the above! The most notable, however, is the 400,000 IOPS and 11,250 MBPS storage throughput. In a 3-node FlashGrid Cluster running Oracle RAC this means 1,200,000 IOPS and 33,750 MBPS, elevating the database performance to new levels. We expect the FXv2 series soon to become the #1 choice for FlashGrid customers running high-performance Oracle databases on Azure.” - Art Danielov, CEO, FlashGrid

“At OMP, we are committed to optimizing our Unison Planning™ solution that drives supply chain excellence. In our partnership with Microsoft, we have always extensively used the FXv1 series. We have recently partnered with Microsoft to preview the FXv2 series and the upgraded hardware has already shown a performance increase of up to 40%. Moreover, the ability to scale up to 96 vCPUs with 1.8 TB of memory doubles our per-VM performance, significantly improving capacity management for our solution.” - Michiel De Palmeneire, Platform Manager, OMP

“The new Azure FXv2-Series Virtual Machines represent a huge step forward for optimizing performance. Through our collaboration, we’ve seen performance gains up to 5X that will deliver transformative impact to customers taking advantage of our premium storage services with Pure Storage Cloud on Azure. Combining Microsoft’s high performance VMs with Pure Storage’s rich feature set accelerates the adoption of mission critical applications in Azure for our mutual customers.” - Cody Hosterman, Senior Director, Cloud Product Management, Pure Storage

“With the performance results we have achieved on the new FXv2-series, we have confidence in providing our largest Unisys ClearPath MCP System clients a seamless transition to Azure, without needing to modify or migrate any of their code.” - Ken Henry, VP, Unisys Solutions Development, Enterprise Computing Solutions

Here’s what our technology partners are saying:

“Canonical is pleased to see the significant performance enhancements with Microsoft Azure’s new FXv2-series virtual machines. Ubuntu and Ubuntu Pro are fully compatible with these new instances, allowing customers to use a proven and widely used platform ideal for these compute-intensive workloads. Ubuntu Pro on the new FXv2-series offers comprehensive features for enterprise security, compliance, and operational stability. This enables users to take full advantage of Ubuntu’s capabilities and vast software ecosystem on Azure.” - Jehudi Sierra-Castro, Public Cloud Alliance Director, Canonical

To learn more about FXv2-series VM pricing, please visit the Azure Virtual Machines pricing page. To learn more about remote disk types such as Standard SSD, Standard HDD, Premium SSD (v1), Premium SSD v2, and Ultra Disk storage, please refer to Azure managed disk types.

The FXv2-series VMs are precisely engineered for compute-intensive tasks, making them ideal for high-performance data processing, complex operations, and intensive analytics. They deliver exceptional performance and efficiency, providing a versatile solution for a wide range of computing needs. Explore the FXv2-series VMs today and discover how they can enhance your performance and operational efficiency.
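To try the FXv2-series yourself, the following Azure CLI sketch shows one way to check which FXv2 sizes are offered in a region and deploy a test VM. The region, resource group, and VM names are placeholders, and the exact size should be taken from the FXmsv2-series and FXmdsv2-series specification pages.

```bash
# List FXv2-series SKUs offered in a region (the region is an example).
az vm list-skus --location eastus2 --size Standard_FX --output table

# Create a test VM; replace <fxv2-size> with a size from the list above
# (resource group and VM name are placeholders).
az vm create \
  --resource-group my-fxv2-rg \
  --name fxv2-test-vm \
  --image Ubuntu2204 \
  --size <fxv2-size> \
  --accelerated-networking true \
  --zone 1
```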
Have questions? Please reach us at Azure Support and our experts will be there to help you with your Azure journey.

Announcing General Availability of Azure E128 & E192 Sizes in the Esv6 and Edsv6-series VM Families
Authored by Misha Bansal, Product Manager

We’re excited to announce the addition of two new sizes, E128 and E192, in the Azure Esv6 and Edsv6-series Virtual Machine families, powered by the 5th Generation Intel® Xeon® Platinum 8573C (Emerald Rapids) processor. These sizes are generally available with diskful and diskless options in select Azure regions. The new VM sizes offer up to 192 vCPUs and 1832 GiB of RAM, making them ideal for enterprise-scale applications such as in-memory analytics, large relational databases, and in-memory cache workloads. With Intel® Total Memory Encryption (Intel TME) and NVMe-enabled local and remote storage, these VMs deliver both performance and security at scale.

Key Highlights

- Memory-Optimized Performance: Up to 1832 GiB of RAM for memory-intensive workloads.
- Azure Boost: Up to 400K IOPS and 12 GB/s of remote storage throughput with 200 Gbps of network bandwidth.
- Security First: Intel TME ensures data protection in system memory.
- NVMe Interface: 3x improvement in local storage IOPS for low-latency access.

Specifications of the E128 and E192 Sizes

The new Esv6-series and Edsv6-series virtual machines are built for memory-optimized workloads. Compared to the previous Ev5-series, they deliver up to 30% improved performance and feature an NVMe interface along with increased local and remote storage capacity. Below is an overview of the specifications of the E128 and E192 sizes. Please refer to the Azure Esv6 and Edsv6-series pages for all sizes and specifications.

| VM Family | Size | vCPU | Memory (GiB) | Local Disk (GiB) | Max Data Disks | Network (Gbps) |
|---|---|---|---|---|---|---|
| Esv6-series | Standard_E128s_v6 | 128 | 1024 | N/A | 64 | 54 |
| Esv6-series | Standard_E192is_v6 ¹ | 192 | 1832 | N/A | 64 | 200 |
| Edsv6-series | Standard_E128ds_v6 | 128 | 1024 | 4x1760 | 64 | 54 |
| Edsv6-series | Standard_E192ids_v6 | 192 | 1832 | 6x1760 | 64 | 200 |

¹ “i” refers to an Azure isolated VM size.

Regional Availability

The E128 and E192 VM sizes are broadly available in the following Azure regions: Australia East, Canada Central, Central US, East US, East US 2, Germany West Central, Japan East, Korea Central, South Africa North, South Central US, Sweden Central, Switzerland North, West Europe, and West US 3. Additional regions will be coming in 2025. To learn more, refer to Product Availability by Region.

Pricing

To learn more about Esv6 and Edsv6-series VM pricing, please visit the Azure Virtual Machines pricing page. To learn more about remote disk types such as Standard SSD, Standard HDD, Premium SSD (v1), Premium SSD v2, and Ultra Disk storage, please refer to Azure managed disk types.

Getting Started

The new sizes are now available in the Azure portal in select regions. Learn more about the Azure Dsv6, Ddsv6, Dlsv6, Dldsv6, Esv6, and Edsv6 VMs here: Announcing General Availability of Azure Dl/D/E v6 VMs powered by Intel EMR processor & Azure Boost | Microsoft Community Hub
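To get started with the new sizes from the command line, here is a minimal Azure CLI sketch; the region, resource group, and VM names are placeholders, and the size name comes from the specification table above.

```bash
# Confirm that the E128 and E192 sizes are offered in a region (region is an example).
az vm list-skus --location eastus2 --size Standard_E128 --output table
az vm list-skus --location eastus2 --size Standard_E192 --output table

# Deploy a Standard_E128s_v6 VM (resource group and VM name are placeholders).
az vm create \
  --resource-group my-esv6-rg \
  --name esv6-demo-vm \
  --image Ubuntu2204 \
  --size Standard_E128s_v6 \
  --zone 1
```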
Ansys Minerva Simulation & Process Data Management Architecture on Azure

Architecture

The Ansys Minerva baseline architecture has four distributed tiers (client, web, enterprise, and resource) deployed in a single Azure availability zone. Each tier aligns to a specific function, and communication flows between the tiers. All four tiers use their own virtual machines in a single virtual network. The Minerva core business functionality runs on a central core server in the enterprise tier, and users access it through a web-based client URL. You can deploy additional Dev and Test instances on virtual machines and storage in Dev/Test virtual networks that are separate from the production virtual networks.

Workflow

1. SPDM users access the Minerva application over the internet through an HTTPS public URL endpoint.
2. Microsoft Entra ID with SAML configuration provides single sign-on authentication to the Minerva application. Each user is also authenticated with a Minerva credential that a Minerva administrator creates in Minerva.
3. Azure Firewall, an Azure backbone component, filters traffic using threat intelligence feeds that come directly from Microsoft cybersecurity. HTTPS traffic is directed to Azure Application Gateway. The hub virtual network and spoke virtual network are peered and communicate over the Azure backbone network.
4. Azure Application Gateway routes traffic to Minerva’s web server virtual machines (VMs) in the web tier. Application Gateway with Web Application Firewall inspects the incoming HTTP traffic to continuously monitor Minerva against exploits. It integrates seamlessly with other Azure services (App Service, Virtual Machine Scale Sets, AKS, and so on) and supports sticky sessions for applications that require session persistence.
5. Web tier subnet: Users access the core components of Minerva through the web tier, which runs an IIS application server. To ensure consistent and reliable performance, give all virtual machines the recommended VM size and disk configuration; depending on your needs, you may want to use HPC (high-performance computing) VM SKUs. Make sure all VM instances are created from the same base OS image and configuration.
6. Enterprise tier subnet: Enterprise tier VMs run the core business logic components of Minerva: the Simulation and Process Data Management core server, agent server, vault server, metadata extraction server, and license servers. Individual user access is granted based on valid Minerva and Aras Innovator feature licenses, which are separate from the Aras Innovator server licenses.
   - Core components: Minerva’s central processing server is an IIS application server. The agent server runs the agent services that are responsible for various platform orchestration activities. Deploy all the core components in an Azure proximity placement group to minimize latency (see the sketch after this workflow).
   - Distributed components: the vault server and the metadata extraction server. The vault server stores the files and is paired with other servers dedicated to metadata extraction; an IIS web server acts as a front end to the file repository. There can be any number of data vaults distributed throughout the organization, based on specific needs and criteria, and all vaults communicate with the centralized core components. The scope of the Minerva vault server can be expanded to interact with any HPC cluster.
   - Extraction server: Metadata extraction is very memory, processor, and disk intensive and can open large files. Sufficient Azure virtual machine and storage capacity is required for this activity; SKU recommendations are given at the end of this article.
7. Database subnet: Runs a SQL Server database using an infrastructure-as-a-service (IaaS) deployment and uses SQL Server Always On availability groups for asynchronous replication. You can deploy SQL Server Standard or Enterprise edition based on your company’s requirements. The Minerva SQL Server database stores metadata objects only; no binary files are stored in the database. A Minerva deployment could instead run an Oracle Database server on this IaaS deployment.
8. Storage subnet: Uses Azure Files Premium and/or Azure NetApp Files.
9. On-premises network: The customer support team and system administrators connect to Azure over an Azure VPN connection and gain access to virtual machine instances via Remote Desktop Protocol (RDP) through Azure Bastion.
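As noted in the workflow above, the Minerva core components should be deployed in a proximity placement group to minimize latency. The following is a minimal Azure CLI sketch under that recommendation; the resource group, region, VM names, and image are placeholders, and the sizes follow the SKU table at the end of this article.

```bash
# Create a resource group and a proximity placement group for the Minerva core
# components (names and region are placeholders).
az group create --name minerva-rg --location eastus2
az ppg create --name minerva-core-ppg --resource-group minerva-rg --location eastus2

# Place the core server and agent server in the same proximity placement group
# to keep inter-VM latency low. Replace <windows-server-image> with the image
# required by your Minerva installation.
az vm create \
  --resource-group minerva-rg \
  --name minerva-core-01 \
  --image <windows-server-image> \
  --size Standard_F16s_v2 \
  --ppg minerva-core-ppg

az vm create \
  --resource-group minerva-rg \
  --name minerva-agent-01 \
  --image <windows-server-image> \
  --size Standard_F8s_v2 \
  --ppg minerva-core-ppg
```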
Minerva Core Component & Vault reliability

Use multiple VMs in the web tier. To enhance the resiliency and scalability of the Ansys Minerva application running on Azure, distribute the four logical tiers across multiple virtual machines. It is recommended to run multiple parallel web servers for load balancing and/or increased reliability.

Use multiple VMs in the enterprise tier. Install the enterprise tier on multiple Azure virtual machines. This setup ensures failover support and enables load balancing to optimize performance. Application Gateway load balances across the web server VMs in the web subnet.

By distributing software functions over the network, the application achieves high availability and improves overall system reliability. This configuration is particularly beneficial for production environments where uninterrupted operation and efficient resource utilization are crucial. With the workload distributed across multiple virtual machines, the Minerva application can handle increased demand and provide a robust, responsive user experience. By following this recommended architecture, you can leverage the scalability and resilience capabilities of Azure to optimize the performance of the Ansys Minerva application and help ensure uninterrupted access to critical product lifecycle management functionality.

Resource tier reliability

Configure database backups. For SQL Server, one approach is to use Azure Backup with a Recovery Services vault to back up SQL Server databases that run on VMs. With this solution, you can perform most of the key backup management operations without being limited to the scope of an individual vault; a minimal CLI sketch follows below. For more information on Oracle, see Oracle Database in Azure Virtual Machines backup strategies.

Use the native backup utility. Azure backups are recommended. When performing server-level backups, avoid backing up the active database files directly, because the backup may not capture the complete state of the database files at the time of backup. Instead, server-level backups should back up the backup file generated by the database backup utility. This approach ensures a more reliable and consistent backup of the application’s database, protecting the integrity and availability of your Minerva application data and enabling efficient recovery in case of unforeseen issues or data loss.
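As a companion to the database backup guidance above, here is a minimal Azure CLI sketch of creating a Recovery Services vault and enrolling a database VM in VM-level backup. Resource names are placeholders; SQL Server-aware workload backup is configured on top of this in the vault, and the policy shown is the vault's default.

```bash
# Create a Recovery Services vault for the Minerva database VMs
# (vault name, resource group, and region are placeholders).
az backup vault create \
  --name minerva-rsv \
  --resource-group minerva-rg \
  --location eastus2

# Enable VM-level backup for a database VM using the default policy.
# SQL Server-aware (workload) backup is configured separately in the vault.
az backup protection enable-for-vm \
  --resource-group minerva-rg \
  --vault-name minerva-rsv \
  --vm minerva-db-01 \
  --policy-name DefaultPolicy
```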
Configure volume backups. Azure Files provides the capability to take snapshots of file shares, creating point-in-time, read-only copies of your data. By using Azure Files or Azure NetApp Files snapshots, you can establish a general-purpose backup solution that safeguards against accidental deletions or unintended changes to the data. For the Minerva volume server, use file volume backups. This configuration ensures effective backup of the data stored on the volume server, enabling easy recovery in case of data loss or system failure. Implementing these recommendations enhances the data protection and resilience of the Minerva application, mitigating the risks associated with data loss or unauthorized modifications.

Test database and storage backups. Carefully plan, document, and test the backup and recovery strategy for the Minerva database and file manager servers.

Configure backup frequency. Determine backup needs based on business requirements, considering the increasing number of users. A daily backup may not be sufficient for optimal protection, so adjust the frequency accordingly.

Coordinate volume data with database backups. Ensure that backups for the volume servers are coordinated with database backups so that the actual files stay in sync with the file metadata.

Enhance database reliability. Provision SQL Server VMs in availability sets to improve database reliability (a minimal sketch appears after the SKU table below). Availability sets deploy virtual machines across fault domains and update domains, mitigating downtime events within the datacenter. Create the availability set during VM provisioning. Additionally, consider replicating Azure storage across different Azure datacenters for additional redundancy. For Oracle databases, Azure offers availability zones and availability sets; use availability sets only in regions where availability zones are unavailable. In addition to Azure tools, Oracle provides Oracle Data Guard and GoldenGate solutions.

Use an Always On availability group. Configure the database server with an Always On availability group for SQL Server on Azure Virtual Machines. This option uses the underlying Windows Server Failover Clustering (WSFC) service and helps ensure high availability. For more information, see Overview of SQL Server Always On availability groups and Windows Server Failover Clustering (WSFC).

Security

Azure security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.

Recommended SKUs for Minerva on Azure

| Role of the server | SKU |
|---|---|
| Core server | Standard_F16s_v2 |
| Agent server | Standard_F8s_v2 |
| License server | Standard_D4d_v5 |
| Extraction server | Standard_F8s_v2 |
| Database servers | Standard_E32-16ds_v4 |
| Volume server | Standard_L32s_v3 |
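Finally, tying the reliability guidance to the SKU table above, here is a minimal Azure CLI sketch of provisioning a database VM in an availability set. The resource group, names, fault/update domain counts, and image URN are placeholders; choose a SQL Server marketplace image that matches your licensing.

```bash
# Create an availability set for the Minerva database VMs
# (names and fault/update domain counts are placeholders).
az vm availability-set create \
  --name minerva-db-avset \
  --resource-group minerva-rg \
  --platform-fault-domain-count 2 \
  --platform-update-domain-count 5

# Provision a database VM in the availability set using the constrained-core
# size from the SKU table. Replace <sql-server-image-urn> with a SQL Server
# marketplace image URN.
az vm create \
  --resource-group minerva-rg \
  --name minerva-db-01 \
  --availability-set minerva-db-avset \
  --image <sql-server-image-urn> \
  --size Standard_E32-16ds_v4
```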