Introduction
Following our previous evaluation of Llama 3.1 8B inference performance on Azure’s ND-H100-v5 infrastructure using vLLM, this report broadens the scope to compare inference performance across a range of GPU and CPU platforms. Using the Hugging Face inference benchmarker, we assess not only throughput and latency but also the cost-efficiency of each configuration—an increasingly critical factor for enterprise deployment.
As organizations seek scalable and budget-conscious solutions for deploying large language models (LLMs), understanding the trade-offs between compute-bound and memory-bound stages of inference becomes essential. Smaller models like Llama 3.1 8B offer a compelling balance between capability and resource demand, but the underlying hardware and software stack can dramatically influence both performance and operational cost.
This report presents a comparative analysis of inference performance across multiple hardware platforms, factoring in:
- Token throughput and latency across chat, classification, and code generation workloads.
- Resource utilization, including KV cache usage and hit rate.
- Cost per token, derived from cloud pricing models and hardware utilization metrics.
By combining performance metrics with cost analysis, we aim to identify the most effective deployment strategies for enterprise-grade LLMs, whether optimizing for speed, scalability, or budget.
Benchmark environment
Inference benchmark
The Hugging Face inference-benchmarker was used for the AI inference benchmarks. Three popular inference profiles were examined:
- Chat: Probably the most common use case; a question-and-answer format on a wide range of topics.
- Classification: Providing various documents and requesting a summary of their contents.
- Code generation: Providing code and requesting new code, e.g., creating a new function.
| Profile | Data set | Input prompt (tokens) | Output prompt (tokens) |
| --- | --- | --- | --- |
| Chat | hlarcher/inference-benchmarker/share_gpt_turns.json | N/A | min=50, max=800, variance=100 |
| Classification | hlarcher/inference-benchmarker/classification.json | min=8000, max=12000, variance=5000 | min=30, max=80, variance=10 |
| Code generation | hlarcher/inference-benchmarker/github_code.json | min=3000, max=6000, variance=1000 | min=30, max=80, variance=10 |
| Hugging Face Llama 3.1 8B model used | Precision | Model size (GiB) |
| --- | --- | --- |
| meta-llama/Llama-3.1-8B-Instruct | FP16 | 14.9 |
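As a quick sanity check on the size in the table, the FP16 footprint follows directly from the parameter count: roughly 8.03 × 10⁹ parameters × 2 bytes per parameter ≈ 16.1 GB ≈ 14.9 GiB (model weights only, excluding KV cache and activation memory).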
| vLLM parameter | Default value |
| --- | --- |
| gpu_memory_utilization | 0.9 |
| max_num_seqs | 1024 |
| max_num_batched_tokens | 2048 (A100), 8192 (H100, H200) |
| enable_chunked_prefill | True |
| enable_prefix_caching | True |
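For reference, a minimal sketch of how these settings translate into a vLLM server launch on an H100/H200 (the flags below are standard vLLM engine arguments; the exact invocation used in the benchmark runs may differ):
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 1024 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --enable-prefix-caching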
VM Configuration
GPU
ND-H100-v5, ND-H200-v5, and ND-A100-v4 (A100 80 GB and 40 GB) VMs running HPC Ubuntu 22.04 (PyTorch 2.7.0+cu128, GPU driver 535.161.08, NCCL 2.21.5-1). A single GPU was used in the benchmark tests.
CPU
Ubuntu 22.04 (HPC and Canonical/jammy images)
Results
| GPU | Profile | Avg prompt throughput (tokens/s) | Avg generation throughput (tokens/s) | Max # requests waiting | Max KV cache usage | Avg KV cache hit rate |
| --- | --- | --- | --- | --- | --- | --- |
| H100 | Chat | ~2667 | ~6067 | 0 | ~14% | ~75% |
| H100 | Classification | ~254149 | ~1291 | 0 | ~46% | ~98% |
| H100 | Code generation | ~22269 | ~266 | ~111 | ~93% | ~1% |
| H200 | Chat | ~3271 | ~7464 | 0 | ~2% | ~77% |
| H200 | Classification | ~337301 | ~1635 | 0 | ~24% | ~99% |
| H200 | Code generation | ~22726 | ~274 | ~57 | ~46% | ~1% |
| A100 | Chat | ~1177 | ~2622 | 0 | ~2% | ~75% |
| A100 | Classification | ~64526 | ~333 | 0 | ~45% | ~97% |
| A100 | Code generation | ~7926 | ~95 | ~106 | ~21% | ~1% |
| A100_40G | Chat | ~1069 | ~2459 | 0 | ~27% | ~75% |
| A100_40G | Classification | ~7846 | ~39 | ~116 | ~68% | ~5% |
| A100_40G | Code generation | ~7836 | ~94 | ~123 | ~66% | ~1% |
Cost analysis
The cost analysis uses pay-as-you-go pricing for the South Central US region and the measured throughput in tokens per second to calculate the metric $/(1K tokens).
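Concretely, the metric is derived as: $/1K tokens = (VM price per hour ÷ 3600) ÷ (measured tokens per second) × 1000. As an illustration with hypothetical numbers (not the measured results), a VM priced at $10/hour sustaining 2,000 tokens/s works out to (10 ÷ 3600) ÷ 2000 × 1000 ≈ $0.0014 per 1K tokens.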
CPU performance and takeaways
The Hugging Face AI-MO/aimo-validation-aime dataset was used with vllm bench to test the performance of Llama 3.1 8B on various VM types (left graph below). Running Llama 3.1 8B on CPU VMs is a struggle (insufficient FLOPs and memory bandwidth): even the best-performing CPU VM (HB176-96_v4) is significantly slower than the A100_40GB GPU in both throughput and latency.
Tips
- Enable/use AVX-512 (avx512f, avx512_bf16, avx512_vnni, etc.); check what is supported via lscpu.
- Place the model on a single socket if it has sufficient memory; for larger models, tensor parallelism can be used to split the model across sockets.
- Use pinning to specify which cores the threads run on (in vLLM, VLLM_CPU_OMP_THREADS_BIND=0-22).
- Specify a large enough KV cache in CPU memory (in vLLM, VLLM_CPU_KVCACHE_SPACE=100); see the example below.
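A minimal sketch that pulls these tips together for a non-containerized launch (assuming a CPU build of vLLM is installed; the core range and KV cache size are illustrative and should be matched to the VM, and the appendix shows the Docker-based setup used in our tests):
# list the AVX-512 features the CPU exposes
lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u
# pin vLLM worker threads to one socket and reserve 100 GiB of CPU memory for the KV cache
export VLLM_CPU_OMP_THREADS_BIND=0-47
export VLLM_CPU_KVCACHE_SPACE=100
vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype bfloat16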
Analysis
Throughput & Latency
- H200 outperforms all other GPUs across all workloads, with the highest prompt and generation throughput.
- H100 is a close second, showing strong performance especially in classification and code generation.
- A100 and A100_40G lag significantly behind, particularly in classification tasks where throughput drops by an order of magnitude (on A100_40G, due to smaller GPU memory and lower KV Cache hit percentage).
KV Cache Utilization
- H200 and H100 show efficient cache usage, with high hit rates (up to 99%) and low numbers of waiting requests; the exception is code generation, which has low hit rates (~1%).
- A100_40G suffers from high KV cache usage and low hit rates, especially in classification and code generation, indicating memory bottlenecks. The strain on the inference server shows up as a higher number of waiting requests.
Cost Efficiency
- Chat profiles: The A100 GPU (40G) offers the best value.
- Classification profiles: The H200 is most cost-effective.
- Code-generation profiles: The H100 provides the greatest cost efficiency.
CPU vs GPU
- Llama 3.1 8B can run on CPU VMs, but its throughput and latency are so poor compared to GPUs that doing so makes neither practical nor financial sense.
- Smaller AI models (<= 1B parameters) may be acceptable on CPUs for some lightweight inference services (such as chat).
Conclusion
The benchmarking results clearly demonstrate that hardware choice significantly impacts the inference performance and cost-efficiency of Llama 3.1 8B deployments. The H200 GPU consistently delivers the highest throughput and cache efficiency across workloads, making it the top performer overall. H100 follows closely, especially excelling in code generation tasks. While A100 and A100_40G offer budget-friendly options for chat workloads, their limitations in memory and cache performance make them less suitable for more demanding tasks. CPU virtual machines do not offer adequate performance—in terms of throughput and latency—for running AI models comparable in size to Llama 3.1 8B. These insights provide a practical foundation for selecting optimal infrastructure based on inference workload type and cost constraints.
References
- Hugging Face Inference Benchmarker: https://github.com/huggingface/inference-benchmarker
- Datasets used for benchmarking:
  - Chat: hlarcher/inference-benchmarker/share_gpt_turns.json
  - Classification: hlarcher/inference-benchmarker/classification.json
  - Code generation: hlarcher/inference-benchmarker/github_code.json
- Model: meta-llama/Llama-3.1-8B-Instruct on Hugging Face: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
- vLLM inference engine: https://github.com/vllm-project/vllm
- Azure ND-series GPU infrastructure: https://learn.microsoft.com/en-us/azure/virtual-machines/nd-series
- PyTorch 2.7.0 + CUDA 12.8: https://pytorch.org
- NVIDIA GPU driver (535.161.08) and NCCL (2.21.5-1): https://developer.nvidia.com/nccl
- Azure Pricing Calculator (South Central US region): https://azure.microsoft.com/en-us/pricing/calculator
- vLLM on CPU
Appendix
Install vLLM on CPU VMs
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
Edit the Dockerfiles (in vllm_source/docker):
cp Dockerfile.cpu Dockerfile_serve.cpu
    change the last line to ENTRYPOINT ["/opt/venv/bin/vllm","serve"]
cp Dockerfile.cpu Dockerfile_bench.cpu
    change the last line to ENTRYPOINT ["/opt/venv/bin/vllm","bench","serve"]
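The last-line edit can also be scripted; this sketch assumes (as the steps above state) that ENTRYPOINT is the final line of Dockerfile.cpu:
cd docker
cp Dockerfile.cpu Dockerfile_serve.cpu
sed -i '$d' Dockerfile_serve.cpu
echo 'ENTRYPOINT ["/opt/venv/bin/vllm","serve"]' >> Dockerfile_serve.cpu
cp Dockerfile.cpu Dockerfile_bench.cpu
sed -i '$d' Dockerfile_bench.cpu
echo 'ENTRYPOINT ["/opt/venv/bin/vllm","bench","serve"]' >> Dockerfile_bench.cpu
cd ..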
Build the images, enabling the AVX-512 features the CPU supports (see lscpu):
docker build -f docker/Dockerfile_serve.cpu --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_DISABLE_AVX512=false --tag vllm-serve-cpu-env --target vllm-openai .
docker build -f docker/Dockerfile_bench.cpu --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_DISABLE_AVX512=false --tag vllm-bench-cpu-env --target vllm-openai .
Start the vLLM server
Remember to set <YOUR HF TOKEN>, <CPU CORE RANGE>, and <SIZE in GiB>
docker run --rm --privileged=true --shm-size=8g -p 8000:8000 -e VLLM_CPU_KVCACHE_SPACE=<SIZE in GiB> -e VLLM_CPU_OMP_THREADS_BIND=<CPU CORE RANGE> -e HF_TOKEN=<YOUR HF TOKEN> -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" vllm-serve-cpu-env meta-llama/Llama-3.1-8B-Instruct --port 8000 --dtype=bfloat16
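Once the server is up, a quick sanity check against the OpenAI-compatible completions endpoint can be run with curl (the prompt and max_tokens below are arbitrary):
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello, my name is", "max_tokens": 16}'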
Run vLLM benchmark
Remember to set <YOUR HF TOKEN>
docker run --rm --privileged=true --shm-size=4g -e HF_TOKEN=<YOUR HF TOKEN> -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" vllm-bench-cpu-env --backend vllm --model=meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name hf --dataset-path AI-MO/aimo-validation-aime --ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 2 --num-prompts 200 --seed 42 --host 10.0.0.4