Azure High Performance Computing (HPC) Blog

Performance analysis of DeepSeek R1 AI Inference using vLLM on ND-H100-v5

CormacGarvey
Aug 28, 2025

Introduction

The DeepSeek R1 model represents a new frontier in large-scale reasoning for AI applications. Designed to tackle complex inference tasks, R1 pushes the boundaries of what’s possible—but not without significant infrastructure demands. To deploy DeepSeek R1 effectively in an inference service like vLLM, high-performance hardware is essential. Specifically, the model requires two Azure ND_H100_v5 nodes, each equipped with 8 NVIDIA H100 GPUs, totaling 16 H100s. These nodes are interconnected via InfiniBand and NVLink, ensuring the bandwidth and latency characteristics necessary to support the model’s massive memory footprint and parallel processing needs.

In this post, we’ll present inference benchmark results for DeepSeek R1, measuring performance across GPU utilization, memory throughput, and interconnect efficiency. While R1 excels in reasoning tasks, it’s important to recognize that such models are not universally optimal. For many general-purpose AI applications, smaller models like Llama 3.1 8B offer a compelling alternative, delivering sufficient accuracy and performance at a fraction of the cost.

We explore the performance characteristics of DeepSeek R1 and help you decide when a large reasoning model is worth the investment—and when a leaner solution might be the better choice.

Benchmark environment

Two ND-H100-v5 nodes (16 H100 GPUs) were required to load the DeepSeek R1 model into GPU memory. vLLM was used to set up the online inference server with a single API server endpoint, and vLLM bench (with the Hugging Face AI-MO/aimo-validation-aime dataset) was used to run the inference benchmark. Detailed specifications are available in the tables below; the full installation and launch steps are listed in the Appendix.

 

Results

Reasoning model’s additional chain of thought tokens

We show that DeepSeek R1 generates many additional tokens (for chain of thought), which help reasoning models answer complex logical questions. In this example, DeepSeek R1 generates 1162 completion tokens, while Llama 3.1 8B generates only 37. This example also demonstrates that while non-reasoning models may struggle with this seemingly simple logical question, reasoning models like DeepSeek R1 can solve it, though they require significantly more intermediate tokens to do so.

#!/bin/bash

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [
            {"role": "user", "content": "9.11 and 9.8, which is greater? Explain your reasoning"}
        ]
    }'

 

Note: The REST API JSON output is reformatted to make it more readable. (./run_restapi_query.sh 2> /dev/null | python3 ./convert_json_2_txt.py >& run_query.out_$$)
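The conversion script itself is not reproduced in this post; a minimal sketch of what it could look like (assuming the standard OpenAI-compatible chat completion JSON that vLLM returns, with reasoning_content populated by the deepseek_r1 reasoning parser) is:

#!/usr/bin/env python3
# Illustrative sketch of a convert_json_2_txt.py helper (the original script
# is not shown in the post). Reads the chat completion JSON from stdin and
# prints the reasoning, answer, and token usage.
import json
import sys

resp = json.load(sys.stdin)
choice = resp["choices"][0]
msg = choice["message"]

print("Reasoning:\n")
print(msg.get("reasoning_content", ""))
print("\nAnswer:\n")
print(msg.get("content", ""))

usage = resp.get("usage", {})
print("\nCompletion Info:")
print(f"  Finish reason: {choice.get('finish_reason')}")
print(f"  Prompt tokens: {usage.get('prompt_tokens')}")
print(f"  Completion tokens: {usage.get('completion_tokens')}")
print(f"  Total tokens: {usage.get('total_tokens')}")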

 

Reasoning:

Okay, so I need to figure out whether 9.11 is greater than 9.8 or if 9.8 is greater. Hmm, let me start by looking at both numbers. They both have a 9 in the units place, so that part is equal. Now, I need to compare the decimal parts: .11 and .8.

 

Wait, but decimals can be tricky because the places matter. The first number is 9.11, which has a 1 in the tenths place and a 1 in the hundredths place. The second number is 9.8, which is the same as 9.80 when you add a zero to the hundredths place for comparison. So, 9.80 versus 9.11. Now, looking at the tenths place: 8 is greater than 1. Therefore, 9.80 should be greater than 9.11.

 

But let me double-check because sometimes when numbers have different decimal lengths, it's easy to make a mistake. If I convert both to fractions, maybe that will help.

 

9.11 is the same as 9 + 0.11, and 0.11 is 11/100. On the other hand, 9.8 is 9 + 0.8, which is 9 + 8/10. To compare 8/10 and 11/100, I can convert them to have the same denominator. 8/10 is equal to 80/100. Now, 80/100 is much larger than 11/100. So yes, 9.8 is indeed greater than 9.11.

 

Another way to see it is by subtracting the two numbers. If I subtract 9.11 from 9.8, what do I get? Let's see: 9.8 - 9.11. Since 9.8 is the same as 9.80, subtracting 9.11 gives 0.69. Since the result is positive, that means 9.8 is larger.

 

Wait, maybe I can also think about money to make it clearer. If I have $9.11 and you have $9.80, which is more? $9.80 is $9 and 80 cents, whereas $9.11 is $9 and 11 cents. Definitely, 80 cents is more than 11 cents, so $9.80 is more. That helps visualize it.

 

Alternatively, using a number line. If I imagine a number line with 9.0, 9.1, 9.2,... up to 10.0. 9.11 is just a little past 9.1, while 9.8 is almost at 10.0. So clearly, 9.8 is much further to the right, meaning it's a larger number.

 

I think that covers different methods: decimal place comparison, converting to fractions, subtraction, real-life money examples, and number line visualization. All of these methods point to 9.8 being greater than 9.11. I don't see any contradictions here. Maybe I can also consider writing both numbers with the same number of decimal places to make it even clearer.

 

9.11 is already two decimal places, and 9.8 can be written as 9.80. Now, comparing 9.80 and 9.11 digit by digit:

 

- The units place: 9 vs. 9 (equal)

- Tenths place: 8 vs. 1 (8 is greater)

- Since the tenths place of 9.80 is greater, the entire number is larger, regardless of the hundredths place.

 

So yes, that confirms it again. The key idea here is that when comparing decimals, you start from the leftmost digit after the decimal point and move right until you find a difference. In this case, the tenths place already shows that 9.8 is larger.

 

I guess another common mistake could be thinking that because 11 is more than 8, 0.11 is more than 0.8, but that's not correct because the places are different. Tenths versus hundredths. 0.8 is equivalent to 0.80, which is way bigger than 0.11.

 

To prevent that mistake, always align the decimal points and add trailing zeros if necessary. Then compare each digit one by one from left to right. That method avoids confusion.

 

So, all in all, through multiple approaches, 9.8 is definitely greater than 9.11.

Answer:

To determine which number is greater between **9.11** and **9.8**, follow these steps:

 

  1. **Align the decimal places** by rewriting 9.8 as **9.80**.
  2. **Compare digit by digit** from left to right:

   - **Units place**: Both have **9** (equal).

   - **Tenths place**: **8** (in 9.80) vs. **1** (in 9.11). Since \(8 > 1\), **9.80 is greater** at this point.

   - No need to compare hundredths place once a difference is found.

 

**Conclusion**:

\(9.8\) (or \(9.80\)) is greater than \(9.11\) because **8 tenths** (\(0.8\)) is larger than **1 tenth + 1 hundredth** (\(0.11\)).

 

\(\boxed{9.8 \text{ is greater}}\)

 

Completion Info:

  Finish reason: stop

  Prompt tokens: 21

  Completion tokens: 1162

  Total tokens: 1183

 

When running the same query with Llama 3.1 8B (not a reasoning model):

Answer:

To determine which is greater, we need to compare the two numbers.

 

9.11 is greater than 9.8 because 9.11 is a larger decimal value.

 

Completion Info:

  Finish reason: stop

  Prompt tokens: 51

  Completion tokens: 37

  Total tokens: 88

 

Throughput and latency results

Cost comparison

In this cost analysis we use the ND-H100-v5 and ND-H200-v5 pay-as-you-go pricing in the South Central US region and the measured total throughput (tokens/sec) to compute the cost per 1K tokens ($/1K tokens).

Note: ND-H200-v5 pricing was estimated at 20% more than ND-H100-v5 pricing.
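As an illustration of the arithmetic (the hourly price and throughput values below are placeholders, not measured values or Azure list prices):

#!/usr/bin/env python3
# Illustrative $/1K-tokens calculation. Substitute the current pay-as-you-go
# rate for your region and the measured total throughput.
nodes = 2                    # ND-H100-v5 nodes serving DeepSeek R1
price_per_node_hour = 100.0  # placeholder $/hour per node
throughput_tps = 500.0       # placeholder measured total tokens/sec

cost_per_hour = nodes * price_per_node_hour
tokens_per_hour = throughput_tps * 3600
cost_per_1k_tokens = 1000 * cost_per_hour / tokens_per_hour
print(f"${cost_per_1k_tokens:.4f} per 1K tokens")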

Analysis

DeepSeek R1 is a large, complex reasoning model that is costlier and slower than smaller models. It needs 16 H100 GPUs at FP8 precision and generates many more intermediate tokens in its chain-of-thought process (about 31 times more than Llama 3.1 8B, consistent with the 1162 vs. 37 completion tokens in the example above), but at a much slower rate (~54 times slower). Its latency is also higher, with time to first token (TTFT) and inter-token latency (ITL) roughly 6 and 3 times slower, respectively.

The DeepSeek R1 model has modest intranode and internode network requirements: ~14% of the available InfiniBand network bandwidth and < 1% of the available NVLink bandwidth were used. GPUs with higher memory bandwidth and higher FLOPS would help improve its performance.

The cost analysis shows that generating DeepSeek R1 tokens is ~54 times more expensive than Llama 3.1 8B on the same 16 H100 GPUs, and ~34 times more expensive on 8 H200 GPUs.

The DeepSeek R1 model is very capable, but due to its higher TCO it should only be used in AI applications that require its strong reasoning abilities.

 

Conclusion

The DeepSeek R1 model demonstrates exceptional reasoning capabilities, but its deployment demands substantial infrastructure and incurs high latency and cost. While it excels in generating detailed chains of thought, its throughput and efficiency lag significantly behind smaller models like Llama 3.1 8B. For applications requiring deep logical analysis, DeepSeek R1 is a powerful tool. However, for general-purpose inference tasks, more lightweight models offer better performance and cost-effectiveness. Strategic use of DeepSeek R1 should be reserved for scenarios where its advanced reasoning justifies the resource investment.

 

References

  1. DeepSeek R1 model on Hugging Face
    https://huggingface.co/deepseek-ai/DeepSeek-R1
  2. vLLM GitHub repository
    https://github.com/vllm-project/vllm
  3. Azure ND H100 v5 documentation
    https://learn.microsoft.com/en-us/azure/virtual-machines/nd-h100-v5-series
  4. FlashInfer GitHub repository
    https://github.com/flashinfer-ai/flashinfer
  5. DeepGEMM GitHub repository
    https://github.com/deepseek-ai/DeepGEMM
  6. AI-MO validation dataset on Hugging Face
    https://huggingface.co/datasets/AI-MO/aimo-validation-aime

Appendix

Install vLLM

curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv myvllm --python 3.11 --seed
source myvllm/bin/activate
uv pip install vllm --torch-backend=auto
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
uv pip install ninja
cd flashinfer
uv pip install --no-build-isolation --verbose .

Install DeepSeek DeepEP

git clone https://github.com/vllm-project/vllm.git
cd ~/vllm/tools/ep_kernels
export CUDA_HOME=/usr/local/cuda-12.8
export TORCH_CUDA_ARCH_LIST="9.0"   # For Hopper
bash install_python_libraries.sh 2>&1 | tee install_python_libraries.log_$$
sudo bash configure_system_drivers.sh 2>&1 | tee configure_system_drivers.log_$$
sudo reboot

 

Install DeepSeek DeepGEMM

git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
./install.sh 2>&1 | tee install.log_$$

 

Configure DeepSeek R1 with vLLM on 2 ND_H100_v5

Second node configuration

Execute this script on the second node before running the script on the primary node.

#!/bin/bash

MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000
export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0

vllm serve $MODEL --port $PORT --tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-start-rank 8 --data-parallel-address 10.0.0.6 --data-parallel-rpc-port 23345 --headless --max-model-len 32768 --reasoning-parser deepseek_r1

 

Primary node configuration

#!/bin/bash

MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000
export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0

vllm serve $MODEL --port $PORT --tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address 10.0.0.6 --data-parallel-rpc-port 23345 --api-server-count 1 --max-model-len 32768 --reasoning-parser deepseek_r1
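
Before running the benchmark, it can be useful to confirm that the primary node's API server is serving the model. A minimal check (a sketch, assuming the endpoint is reachable at http://localhost:8000 on the primary node) is:

#!/usr/bin/env python3
# Illustrative endpoint check; assumes the vLLM OpenAI-compatible API server
# is reachable at http://localhost:8000 on the primary node.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    models = json.load(resp)

for model in models.get("data", []):
    print(model["id"])  # expect deepseek-ai/DeepSeek-R1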

 

Install vLLM benchmark environment

cd vllm
uv pip install vllm[bench]

Run vLLM benchmark

#!/bin/bash

vllm bench serve \
    --backend vllm \
    --model deepseek-ai/DeepSeek-R1 \
    --endpoint /v1/completions \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --ramp-up-strategy linear \
    --ramp-up-start-rps 1 \
    --ramp-up-end-rps 10 \
    --num-prompts 400 \
    --seed 42

 

Updated Aug 29, 2025
Version 2.0