Introduction
Following our previous evaluation of Llama 3.1 8B inference performance on Azure’s ND-H100-v5 infrastructure using vLLM, this report broadens the scope to compare inference performance across a range of GPU and CPU platforms. Using the Hugging Face inference benchmarker, we assess not only throughput and latency but also the cost-efficiency of each configuration—an increasingly critical factor for enterprise deployment.
As organizations seek scalable and budget-conscious solutions for deploying large language models (LLMs), understanding the trade-offs between compute-bound and memory-bound stages of inference becomes essential. Smaller models like Llama 3.1 8B offer a compelling balance between capability and resource demand, but the underlying hardware and software stack can dramatically influence both performance and operational cost.
This report presents a comparative analysis of inference performance across multiple hardware platforms, factoring in:
- Token throughput and latency across chat, classification, and code generation workloads.
- Resource utilization, including KV cache usage and hit rate.
- Cost per token, derived from cloud pricing models and hardware utilization metrics.
By combining performance metrics with cost analysis, we aim to identify the most effective deployment strategies for enterprise-grade LLMs, whether optimizing for speed, scalability, or budget.
Benchmark environment
Inference benchmark
The Hugging Face inference-benchmarker was used for the AI inference benchmarks. Three popular inference profiles were examined:
- Chat: Probably the most common use case; a question-and-answer format on a wide range of topics.
- Classification: Providing various documents and requesting a summary of their contents.
- Code generation: Providing code and requesting new code, e.g., creating a new function.
| Profile | Data set | Input prompt (tokens) | Output prompt (tokens) |
| --- | --- | --- | --- |
| Chat | hlarcher/inference-benchmarker/share_gpt_turns.json | N/A | min=50, max=800, variance=100 |
| Classification | hlarcher/inference-benchmarker/classification.json | min=8000, max=12000, variance=5000 | min=30, max=80, variance=10 |
| Code generation | hlarcher/inference-benchmarker/github_code.json | min=3000, max=6000, variance=1000 | min=30, max=80, variance=10 |
| Hugging Face Llama 3.1 8B model used | Precision | Model size (GiB) |
| --- | --- | --- |
| meta-llama/Llama-3.1-8B-Instruct | FP16 | 14.9 |
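As a quick sanity check on the size in the table, the FP16 footprint follows directly from the parameter count: roughly 8.03 × 10⁹ parameters × 2 bytes per parameter ≈ 16.1 GB ≈ 14.9 GiB (model weights only, excluding KV cache and activation memory).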
| vLLM parameter | Default value |
| --- | --- |
| gpu_memory_utilization | 0.9 |
| max_num_seqs | 1024 |
| max_num_batched_tokens | 2048 (A100), 8192 (H100, H200) |
| enable_chunked_prefill | True |
| enable_prefix_caching | True |
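For reference, a minimal sketch of how these settings translate into a vLLM server launch on an H100/H200 (the flags below are standard vLLM engine arguments; the exact invocation used in the benchmark runs may differ):
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 1024 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --enable-prefix-caching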
VM Configuration
GPU
ND-H100-v5, ND-H200-v5, and ND-A100-v4 (A100 80 GB and 40 GB) VMs running HPC Ubuntu 22.04 (PyTorch 2.7.0+cu128, GPU driver 535.161.08, NCCL 2.21.5-1). A single GPU was used in the benchmark tests.
CPU
Ubuntu 22.04 (HPC and Canonical/jammy images)
Results
| GPU | Profile | Avg prompt throughput (tokens/s) | Avg generation throughput (tokens/s) | Max # requests waiting | Max KV cache usage | Avg KV cache hit rate |
| --- | --- | --- | --- | --- | --- | --- |
| H100 | Chat | ~2667 | ~6067 | 0 | ~14% | ~75% |
| H100 | Classification | ~254149 | ~1291 | 0 | ~46% | ~98% |
| H100 | Code generation | ~22269 | ~266 | ~111 | ~93% | ~1% |
| H200 | Chat | ~3271 | ~7464 | 0 | ~2% | ~77% |
| H200 | Classification | ~337301 | ~1635 | 0 | ~24% | ~99% |
| H200 | Code generation | ~22726 | ~274 | ~57 | ~46% | ~1% |
| A100 | Chat | ~1177 | ~2622 | 0 | ~2% | ~75% |
| A100 | Classification | ~64526 | ~333 | 0 | ~45% | ~97% |
| A100 | Code generation | ~7926 | ~95 | ~106 | ~21% | ~1% |
| A100_40G | Chat | ~1069 | ~2459 | 0 | ~27% | ~75% |
| A100_40G | Classification | ~7846 | ~39 | ~116 | ~68% | ~5% |
| A100_40G | Code generation | ~7836 | ~94 | ~123 | ~66% | ~1% |
Cost analysis
The cost analysis uses pay-as-you-go pricing for the South Central US region and the measured throughput in tokens per second to calculate the metric $/(1K tokens).
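Concretely, the metric is derived as: $/1K tokens = (VM price per hour ÷ 3600) ÷ (measured tokens per second) × 1000. As an illustration with hypothetical numbers (not the measured results), a VM priced at $10/hour sustaining 2,000 tokens/s works out to (10 ÷ 3600) ÷ 2000 × 1000 ≈ $0.0014 per 1K tokens.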
CPU performance and takeaways
The Hugging Face AI-MO/aimo-validation-aime dataset was used with vllm bench to test the performance of Llama 3.1 8B on various VM types (left graph below). Running Llama 3.1 8B on CPU VMs is a struggle (insufficient FLOPs and memory bandwidth): even the best-performing CPU VM (HB176-96_v4) is significantly slower than the A100_40GB GPU in both throughput and latency.
Tips
- Enable/use AVX-512 (avx512f, avx512_bf16, avx512_vnni, etc.); check what is supported via lscpu.
- Place the model on a single socket if it has sufficient memory; for larger models, tensor parallelism can be used to split the model across sockets.
- Use pinning to specify which cores the threads run on (in vLLM, VLLM_CPU_OMP_THREADS_BIND=0-22).
- Specify a large enough KV cache in CPU memory (in vLLM, VLLM_CPU_KVCACHE_SPACE=100); see the example below.
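A minimal sketch that pulls these tips together for a non-containerized launch (assuming a CPU build of vLLM is installed; the core range and KV cache size are illustrative and should be matched to the VM, and the appendix shows the Docker-based setup used in our tests):
# list the AVX-512 features the CPU exposes
lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u
# pin vLLM worker threads to one socket and reserve 100 GiB of CPU memory for the KV cache
export VLLM_CPU_OMP_THREADS_BIND=0-47
export VLLM_CPU_KVCACHE_SPACE=100
vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype bfloat16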
Analysis
Throughput & Latency
- H200 outperforms all other GPUs across all workloads, with the highest prompt and generation throughput.
- H100 is a close second, showing strong performance especially in classification and code generation.
- A100 and A100_40G lag significantly behind, particularly in classification tasks where throughput drops by an order of magnitude (on A100_40G, due to smaller GPU memory and lower KV Cache hit percentage).
KV Cache Utilization
- H200 and H100 show efficient cache usage, with high hit rates (up to 99%) and low numbers of waiting requests; the exception is code generation, which has low hit rates (~1%).
- A100_40G suffers from high KV cache usage and low hit rates, especially in classification and code generation, indicating memory bottlenecks. The strain on the inference server shows up as a higher number of waiting requests.
Cost Efficiency
- Chat profiles: The A100 GPU (40G) offers the best value.
- Classification profiles: The H200 is most cost-effective.
- Code-generation profiles: The H100 provides the greatest cost efficiency.
CPU vs GPU
- Llama 3.1 8B can run on CPU VMs, but its throughput and latency are so poor compared to GPUs that doing so makes neither practical nor financial sense.
- Smaller AI models (<= 1B parameters) may be acceptable on CPUs for some lightweight inference services (such as chat).
Conclusion
The benchmarking results clearly demonstrate that hardware choice significantly impacts the inference performance and cost-efficiency of Llama 3.1 8B deployments. The H200 GPU consistently delivers the highest throughput and cache efficiency across workloads, making it the top performer overall. H100 follows closely, especially excelling in code generation tasks. While A100 and A100_40G offer budget-friendly options for chat workloads, their limitations in memory and cache performance make them less suitable for more demanding tasks. CPU virtual machines do not offer adequate performance—in terms of throughput and latency—for running AI models comparable in size to Llama 3.1 8B. These insights provide a practical foundation for selecting optimal infrastructure based on inference workload type and cost constraints.
References
- Hugging Face Inference Benchmarker: https://github.com/huggingface/inference-benchmarker
- Datasets used for benchmarking:
  - Chat: hlarcher/inference-benchmarker/share_gpt_turns.json
  - Classification: hlarcher/inference-benchmarker/classification.json
  - Code generation: hlarcher/inference-benchmarker/github_code.json
- Model: meta-llama/Llama-3.1-8B-Instruct on Hugging Face: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
- vLLM inference engine: https://github.com/vllm-project/vllm
- Azure ND-series GPU infrastructure: https://learn.microsoft.com/en-us/azure/virtual-machines/nd-series
- PyTorch 2.7.0 + CUDA 12.8: https://pytorch.org
- NVIDIA GPU driver (535.161.08) and NCCL (2.21.5-1): https://developer.nvidia.com/nccl
- Azure Pricing Calculator (South Central US region): https://azure.microsoft.com/en-us/pricing/calculator
- vLLM on CPU
Appendix
Install vLLM on CPU VMs
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
Edit the Dockerfiles (in vllm_source/docker):
cp Dockerfile.cpu Dockerfile_serve.cpu
    change the last line to ENTRYPOINT ["/opt/venv/bin/vllm","serve"]
cp Dockerfile.cpu Dockerfile_bench.cpu
    change the last line to ENTRYPOINT ["/opt/venv/bin/vllm","bench","serve"]
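The last-line edit can also be scripted; this sketch assumes (as the steps above state) that ENTRYPOINT is the final line of Dockerfile.cpu:
cd docker
cp Dockerfile.cpu Dockerfile_serve.cpu
sed -i '$d' Dockerfile_serve.cpu
echo 'ENTRYPOINT ["/opt/venv/bin/vllm","serve"]' >> Dockerfile_serve.cpu
cp Dockerfile.cpu Dockerfile_bench.cpu
sed -i '$d' Dockerfile_bench.cpu
echo 'ENTRYPOINT ["/opt/venv/bin/vllm","bench","serve"]' >> Dockerfile_bench.cpu
cd ..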
Build the images, enabling the AVX-512 features the CPU supports (see lscpu):
docker build -f docker/Dockerfile_serve.cpu --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_DISABLE_AVX512=false --tag vllm-serve-cpu-env --target vllm-openai .
docker build -f docker/Dockerfile_bench.cpu --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_DISABLE_AVX512=false --tag vllm-bench-cpu-env --target vllm-openai .
Start the vLLM server
Remember to set <YOUR HF TOKEN>, <CPU CORE RANGE>, and <SIZE in GiB>
docker run --rm --privileged=true --shm-size=8g -p 8000:8000 -e VLLM_CPU_KVCACHE_SPACE=<SIZE in GiB> -e VLLM_CPU_OMP_THREADS_BIND=<CPU CORE RANGE> -e HF_TOKEN=<YOUR HF TOKEN> -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" vllm-serve-cpu-env meta-llama/Llama-3.1-8B-Instruct --port 8000 --dtype=bfloat16
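Once the server is up, a quick sanity check against the OpenAI-compatible completions endpoint can be run with curl (the prompt and max_tokens below are arbitrary):
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello, my name is", "max_tokens": 16}'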
Run vLLM benchmark
Remember to set <YOUR HF TOKEN>
docker run --rm --privileged=true --shm-size=4g -e HF_TOKEN=<YOUR HF TOKEN> -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" vllm-bench-cpu-env --backend vllm --model=meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name hf --dataset-path AI-MO/aimo-validation-aime --ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 2 --num-prompts 200 --seed 42 --host 10.0.0.4