Introduction
The DeepSeek R1 model represents a new frontier in large-scale reasoning for AI applications. Designed to tackle complex inference tasks, R1 pushes the boundaries of what’s possible—but not without significant infrastructure demands. To deploy DeepSeek R1 effectively in an inference service like vLLM, high-performance hardware is essential. Specifically, the model requires two Azure ND_H100_v5 nodes, each equipped with 8 NVIDIA H100 GPUs, totaling 16 H100s. These nodes are interconnected via InfiniBand and NVLink, ensuring the bandwidth and latency characteristics necessary to support the model’s massive memory footprint and parallel processing needs.
In this post, we’ll present inference benchmark results for DeepSeek R1, measuring performance across GPU utilization, memory throughput, and interconnect efficiency. While R1 excels in reasoning tasks, it’s important to recognize that such models are not universally optimal. For many general-purpose AI applications, smaller models like Llama 3.1 8B offer a compelling alternative, delivering sufficient accuracy and performance at a fraction of the cost.
We explore the performance characteristics of DeepSeek R1 and help you decide when a large reasoning model is worth the investment—and when a leaner solution might be the better choice.
Benchmark environment
Two ND_H100_v5 nodes (16 H100 GPUs in total) were required to load the DeepSeek R1 model into GPU memory. vLLM was used to set up the online inference server behind a single API server endpoint. vLLM bench (with the Hugging Face AI-MO/aimo-validation-aime dataset) was used to run the inference benchmark. Detailed specifications are available in the tables below.
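Before launching the server, a quick sanity check (a sketch, assuming the NVIDIA drivers are installed and nvidia-smi is on the path) confirms that each node exposes all 8 H100 GPUs:
# Run on each ND_H100_v5 node; expect 8 rows of H100 GPUs with ~80 GB each.
nvidia-smi --query-gpu=name,memory.total --format=csv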
Results
Reasoning model’s additional chain of thought tokens
We show that DeepSeek R1 generates many additional chain-of-thought tokens, which help reasoning models answer complex logical questions. In this example, DeepSeek R1 generates 1162 completion tokens, while Llama 3.1 8B generates only 37. The example also demonstrates that while non-reasoning models may struggle with this seemingly simple logical question, reasoning models like DeepSeek R1 can solve it, though they require significantly more intermediate tokens to do so.
#!/bin/bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{"role": "user", "content": "9.11 and 9.8, which is greater? Explain your reasoning"}
]
}'
Note: The REST API JSON output below has been reformatted for readability (./run_restapi_query.sh 2> /dev/null | python3 ./convert_json_2_txt.py >& run_query.out_$$).
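If you do not have the convert_json_2_txt.py helper, a minimal jq sketch can extract the same fields. This assumes jq is installed and that the server was started with --reasoning-parser deepseek_r1, so the response message carries reasoning_content and content fields; adjust the field names if your configuration differs.
# Extract the reasoning, the final answer, and the completion token count.
./run_restapi_query.sh 2> /dev/null | jq -r '
  "Reasoning:\n" + (.choices[0].message.reasoning_content // "") +
  "\n\nAnswer:\n" + (.choices[0].message.content // "") +
  "\n\nCompletion tokens: " + (.usage.completion_tokens | tostring)'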
Reasoning:
Okay, so I need to figure out whether 9.11 is greater than 9.8 or if 9.8 is greater. Hmm, let me start by looking at both numbers. They both have a 9 in the units place, so that part is equal. Now, I need to compare the decimal parts: .11 and .8.
Wait, but decimals can be tricky because the places matter. The first number is 9.11, which has a 1 in the tenths place and a 1 in the hundredths place. The second number is 9.8, which is the same as 9.80 when you add a zero to the hundredths place for comparison. So, 9.80 versus 9.11. Now, looking at the tenths place: 8 is greater than 1. Therefore, 9.80 should be greater than 9.11.
But let me double-check because sometimes when numbers have different decimal lengths, it's easy to make a mistake. If I convert both to fractions, maybe that will help.
9.11 is the same as 9 + 0.11, and 0.11 is 11/100. On the other hand, 9.8 is 9 + 0.8, which is 9 + 8/10. To compare 8/10 and 11/100, I can convert them to have the same denominator. 8/10 is equal to 80/100. Now, 80/100 is much larger than 11/100. So yes, 9.8 is indeed greater than 9.11.
Another way to see it is by subtracting the two numbers. If I subtract 9.11 from 9.8, what do I get? Let's see: 9.8 - 9.11. Since 9.8 is the same as 9.80, subtracting 9.11 gives 0.69. Since the result is positive, that means 9.8 is larger.
Wait, maybe I can also think about money to make it clearer. If I have $9.11 and you have $9.80, which is more? $9.80 is $9 and 80 cents, whereas $9.11 is $9 and 11 cents. Definitely, 80 cents is more than 11 cents, so $9.80 is more. That helps visualize it.
Alternatively, using a number line. If I imagine a number line with 9.0, 9.1, 9.2,... up to 10.0. 9.11 is just a little past 9.1, while 9.8 is almost at 10.0. So clearly, 9.8 is much further to the right, meaning it's a larger number.
I think that covers different methods: decimal place comparison, converting to fractions, subtraction, real-life money examples, and number line visualization. All of these methods point to 9.8 being greater than 9.11. I don't see any contradictions here. Maybe I can also consider writing both numbers with the same number of decimal places to make it even clearer.
9.11 is already two decimal places, and 9.8 can be written as 9.80. Now, comparing 9.80 and 9.11 digit by digit:
- The units place: 9 vs. 9 (equal)
- Tenths place: 8 vs. 1 (8 is greater)
- Since the tenths place of 9.80 is greater, the entire number is larger, regardless of the hundredths place.
So yes, that confirms it again. The key idea here is that when comparing decimals, you start from the leftmost digit after the decimal point and move right until you find a difference. In this case, the tenths place already shows that 9.8 is larger.
I guess another common mistake could be thinking that because 11 is more than 8, 0.11 is more than 0.8, but that's not correct because the places are different. Tenths versus hundredths. 0.8 is equivalent to 0.80, which is way bigger than 0.11.
To prevent that mistake, always align the decimal points and add trailing zeros if necessary. Then compare each digit one by one from left to right. That method avoids confusion.
So, all in all, through multiple approaches, 9.8 is definitely greater than 9.11.
Answer:
To determine which number is greater between **9.11** and **9.8**, follow these steps:
- **Align the decimal places** by rewriting 9.8 as **9.80**.
- **Compare digit by digit** from left to right:
- **Units place**: Both have **9** (equal).
- **Tenths place**: **8** (in 9.80) vs. **1** (in 9.11). Since \(8 > 1\), **9.80 is greater** at this point.
- No need to compare hundredths place once a difference is found.
**Conclusion**:
\(9.8\) (or \(9.80\)) is greater than \(9.11\) because **8 tenths** (\(0.8\)) is larger than **1 tenth + 1 hundredth** (\(0.11\)).
\(\boxed{9.8 \text{ is greater}}\)
Completion Info:
Finish reason: stop
Prompt tokens: 21
Completion tokens: 1162
Total tokens: 1183
When running with Llama 3.1 8B (a non-reasoning model):
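For reference, the same question can be sent to a Llama 3.1 8B endpoint with a request like the sketch below; this assumes the model is served separately with vLLM (here on port 8001, with the meta-llama/Llama-3.1-8B-Instruct checkpoint), so adjust the host, port, and model name to your deployment.
#!/bin/bash
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "9.11 and 9.8, which is greater? Explain your reasoning"}
]
}'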
Answer:
To determine which is greater, we need to compare the two numbers.
9.11 is greater than 9.8 because 9.11 is a larger decimal value.
Completion Info:
Finish reason: stop
Prompt tokens: 51
Completion tokens: 37
Total tokens: 88
Throughput and latency results
Cost comparison
In this cost analysis we use the ND-H100-v5 and ND-H200-v5 pay-as-you-go pricing in the South Central US region and the measured total throughput (tokens/sec) to compute the cost per 1K tokens ($/1K tokens).
Note: ND-H200-v5 pricing was estimated at 20% more than ND-H100-v5 pricing.
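As a sketch, the cost per 1K tokens follows directly from the hourly VM price and the measured throughput; the values below are placeholders for illustration, not the measured results.
#!/bin/bash
# $/1K tokens = (hourly price / 3600 s) / (throughput tokens/s) * 1000
HOURLY_PRICE_USD=100.00   # placeholder pay-as-you-go price for the deployment ($/hour)
THROUGHPUT_TOK_S=500      # placeholder measured total throughput (tokens/sec)
awk -v p="$HOURLY_PRICE_USD" -v t="$THROUGHPUT_TOK_S" \
    'BEGIN { printf "$%.4f per 1K tokens\n", (p / 3600) / t * 1000 }'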
Analysis
DeepSeek R1 is a large, complex reasoning model that is costlier and slower than smaller models. It needs 16 H100 GPUs to run at FP8 precision, and it generates many more intermediate tokens in its chain-of-thought process (about 31 times more than Llama 3.1 8B) at a much slower rate (~54 times slower). Its latency is also higher, with time to first token (TTFT) and inter-token latency (ITL) roughly 6 and 3 times slower, respectively.
The DeepSeek R1 model has modest intranode and internode network requirements: ~14% of the available InfiniBand bandwidth and < 1% of the available NVLink bandwidth were used. GPUs with higher memory bandwidth and higher FLOPS would help improve its performance.
The cost analysis shows that generating DeepSeek R1 tokens is ~54 times more expensive than Llama 3.1 8B on the same 16 H100 GPUs and ~34 times more expensive on 8 H200 GPUs.
The DeepSeek R1 model is very capable, but because of its higher TCO it should only be used in AI applications that require its strong reasoning abilities.
Conclusion
The DeepSeek R1 model demonstrates exceptional reasoning capabilities, but its deployment demands substantial infrastructure and incurs high latency and cost. While it excels in generating detailed chains of thought, its throughput and efficiency lag significantly behind smaller models like Llama 3.1 8B. For applications requiring deep logical analysis, DeepSeek R1 is a powerful tool. However, for general-purpose inference tasks, more lightweight models offer better performance and cost-effectiveness. Strategic use of DeepSeek R1 should be reserved for scenarios where its advanced reasoning justifies the resource investment.
References
- DeepSeek R1 model on Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-R1
- vLLM GitHub repository: https://github.com/vllm-project/vllm
- Azure ND H100 v5 documentation: https://learn.microsoft.com/en-us/azure/virtual-machines/nd-h100-v5-series
- FlashInfer GitHub repository: https://github.com/flashinfer-ai/flashinfer
- DeepGEMM GitHub repository: https://github.com/deepseek-ai/DeepGEMM
- AI-MO validation dataset on Hugging Face: https://huggingface.co/datasets/AI-MO/aimo-validation-aime
Appendix
Install vLLM
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv myvllm --python 3.11 --seed
source myvllm/bin/activate
uv pip install vllm --torch-backend=auto
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
uv pip install ninja
cd flashinfer
uv pip install --no-build-isolation --verbose .
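An optional sanity check (sketch) confirms both packages import cleanly inside the myvllm virtual environment:
# Verify vLLM and FlashInfer import from the active venv.
python -c "import vllm, flashinfer; print('vLLM', vllm.__version__)"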
Install DeepSeek DeepEP
git clone https://github.com/vllm-project/vllm.git
cd vllm/tools/ep_kernels
export CUDA_HOME=/usr/local/cuda-12.8
export TORCH_CUDA_ARCH_LIST="9.0"   # 9.0 = Hopper (H100)
bash install_python_libraries.sh 2>&1 | tee install_python_libraries.log_$$
sudo bash configure_system_drivers.sh 2>&1 | tee configure_system_drivers.log_$$
sudo reboot
Install DeepSeek DeepGEMM
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
./install.sh 2>&1 | tee install.log_$$
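An optional import check (sketch; the Python module name deep_gemm is assumed from the repository layout) confirms the kernels built correctly:
# Verify the DeepGEMM kernels import after install.sh completes.
python -c "import deep_gemm; print('DeepGEMM import OK')"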
Configure DeepSeek R1 with vLLM on 2 ND_H100_v5
Second node configuration
Execute this script on the second node before running the script on the primary node.
#!/bin/bash
MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000
export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0
vllm serve $MODEL --port $PORT --tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-start-rank 8 --data-parallel-address 10.0.0.6 --data-parallel-rpc-port 23345 --headless --max-model-len 32768 --reasoning-parser deepseek_r1
Primary node configuration
#!/bin/bash
MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000
export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0
vllm serve $MODEL --port $PORT --tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address 10.0.0.6 --data-parallel-rpc-port 23345 --api-server-count 1 --max-model-len 32768 --reasoning-parser deepseek_r1
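Once both scripts are running, a quick check (sketch) of the OpenAI-compatible endpoint on the primary node should list DeepSeek-R1 after the model has finished loading across all 16 GPUs:
# Query the primary node's API server; deepseek-ai/DeepSeek-R1 should be listed.
curl -s http://localhost:8000/v1/models | python3 -m json.tool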
Install vLLM benchmark environment
cd vllm
uv pip install "vllm[bench]"
Run vLLM benchmark
#!/bin/bash
vllm bench serve \
--backend vllm \
--model deepseek-ai/DeepSeek-R1 \
--endpoint /v1/completions \
--dataset-name hf \
--dataset-path AI-MO/aimo-validation-aime \
--ramp-up-strategy linear \
--ramp-up-start-rps 1 \
--ramp-up-end-rps 10 \
--num-prompts 400 \
--seed 42