
Azure High Performance Computing (HPC) Blog
6 MIN READ

Inference performance of Llama 3.1 8B using vLLM across various GPUs and CPUs

CormacGarvey
Microsoft
Aug 26, 2025

Introduction

Following our previous evaluation of Llama 3.1 8B inference performance on Azure’s ND-H100-v5 infrastructure using vLLM, this report broadens the scope to compare inference performance across a range of GPU and CPU platforms. Using the Hugging Face inference benchmarker, we assess not only throughput and latency but also the cost-efficiency of each configuration—an increasingly critical factor for enterprise deployment.

As organizations seek scalable and budget-conscious solutions for deploying large language models (LLMs), understanding the trade-offs between compute-bound and memory-bound stages of inference becomes essential. Smaller models like Llama 3.1 8B offer a compelling balance between capability and resource demand, but the underlying hardware and software stack can dramatically influence both performance and operational cost.

This report presents a comparative analysis of inference performance across multiple hardware platforms, factoring in:

  • Token throughput and latency across chat, classification, and code generation workloads.
  • Resource utilization, including KV cache utilization and efficiency.
  • Cost per token, derived from cloud pricing models and hardware utilization metrics.

By combining performance metrics with cost analysis, we aim to identify the most effective deployment strategies for enterprise-grade LLMs, whether optimizing for speed, scalability, or budget.

Benchmark environment

Inference benchmark

The Hugging Face inference-benchmarker was used for the AI inference benchmark. Three popular AI inference profiles were examined.

  • Chat: Probably the most common use case; question-and-answer exchanges on a wide range of topics.
  • Classification: Providing various documents and requesting a summary of their contents.
  • Code generation: Providing code and requesting new code, e.g. creating a new function.

 

Profile | Data set | Input prompt | Output prompt
------- | -------- | ------------ | -------------
Chat | hlarcher/inference-benchmarker/share_gpt_turns.json | N/A | min=50, max=800, variance=100
Classification | hlarcher/inference-benchmarker/classification.json | min=8000, max=12000, variance=5000 | min=30, max=80, variance=10
Code generation | hlarcher/inference-benchmarker/github_code.json | min=3000, max=6000, variance=1000 | min=30, max=80, variance=10
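For reference, a minimal sketch of pointing the benchmarker at a running vLLM endpoint is shown below. The flag names (--tokenizer-name, --url, --profile) are assumptions based on the inference-benchmarker README at the time of writing; check the repository for the exact, current syntax.

# Hedged sketch: run the Hugging Face inference-benchmarker against a local vLLM server.
# Flag names are assumptions from the project README and may change between releases.
inference-benchmarker \
  --tokenizer-name "meta-llama/Llama-3.1-8B-Instruct" \
  --url http://localhost:8000 \
  --profile chat        # or: classification, code-generation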

 

Hugging Face Llama 3.1 8B model used | Precision | Model size (GiB)
------------------------------------ | --------- | ----------------
meta-llama/Llama-3.1-8B-Instruct | FP16 | 14.9

 

vLLM parameter | Default value
-------------- | -------------
gpu_memory_utilization | 0.9
max_num_seqs | 1024
max_num_batched_tokens | 2048 (A100), 8192 (H100, H200)
enable_chunked_prefill | True
enable_prefix_caching | True
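These settings correspond to vLLM's standard engine arguments. A minimal sketch of launching a single-GPU server with them spelled out explicitly is shown below; the values follow the table above, with 8192 being the H100/H200 setting (use 2048 on A100).

# Minimal sketch: serve Llama 3.1 8B on one GPU with the parameters above made explicit.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 1024 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 8000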

VM Configuration

GPU | ND-H100-v5, ND-H200-v5, ND-A100-v4 (A100 80 GB & 40 GB) running HPC Ubuntu 22.04 (PyTorch 2.7.0+cu128, GPU driver 535.161.08, NCCL 2.21.5-1). One GPU was used in the benchmark tests.
CPU | Ubuntu 22.04 (HPC and Canonical/jammy)

Results

 

 


GPU | Profile | Avg prompt throughput (tokens/s) | Avg generation throughput (tokens/s) | Max # requests waiting | Max KV cache usage % | Avg KV cache hit rate %
--- | ------- | -------------------------------- | ------------------------------------ | ---------------------- | -------------------- | -----------------------
H100 | Chat | ~2667 | ~6067 | 0 | ~14% | ~75%
H100 | Classification | ~254149 | ~1291 | 0 | ~46% | ~98%
H100 | Code generation | ~22269 | ~266 | ~111 | ~93% | ~1%
H200 | Chat | ~3271 | ~7464 | 0 | ~2% | ~77%
H200 | Classification | ~337301 | ~1635 | 0 | ~24% | ~99%
H200 | Code generation | ~22726 | ~274 | ~57 | ~46% | ~1%
A100 | Chat | ~1177 | ~2622 | 0 | ~2% | ~75%
A100 | Classification | ~64526 | ~333 | 0 | ~45% | ~97%
A100 | Code generation | ~7926 | ~95 | ~106 | ~21% | ~1%
A100_40G | Chat | ~1069 | ~2459 | 0 | ~27% | ~75%
A100_40G | Classification | ~7846 | ~39 | ~116 | ~68% | ~5%
A100_40G | Code generation | ~7836 | ~94 | ~123 | ~66% | ~1%

 

Cost analysis

The cost analysis used pay-as-you-go pricing for the South Central US region together with the measured throughput (tokens per second) to calculate the metric $/1K tokens.
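As an illustration of the conversion, the sketch below computes $/1K tokens from an hourly VM price and a token rate. The $40/hour figure is a placeholder, not an actual Azure price; 6067 tokens/s is the H100 chat generation throughput from the table above.

# Hedged example: $/1K tokens = (hourly VM price / 3600 s) / (tokens per second) * 1000
awk 'BEGIN { price = 40.0; tps = 6067; printf "$%.5f per 1K tokens\n", price/3600/tps*1000 }'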

CPU performance and takeaways

The Hugging Face AI-MO/aimo-validation-aime dataset was used by vllm bench to test the performance of Llama 3.1 8B on various CPU VM types (left graph below). It is a struggle (insufficient FLOPS and memory bandwidth) to run Llama 3.1 8B on CPU VMs; even the best-performing CPU VM (HB176-96_v4) delivers throughput and latency significantly worse than the A100_40GB GPU.

Tips

  • Enable/use AVX-512 (avx512f, avx512_bf16, avx512_vnni, etc.). Check what is supported/available via lscpu.
  • Put the AI model on a single socket (if it has sufficient memory). For larger models you can use tensor parallelism to split the model across sockets.
  • Use pinning to specify which cores the threads will run on (in vLLM, VLLM_CPU_OMP_THREADS_BIND=0-22).
  • Specify a large enough KV cache in CPU memory (in vLLM, VLLM_CPU_KVCACHE_SPACE=100). A combined sketch is shown after this list.
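A minimal sketch combining these tips for a CPU-backend launch is shown below; the core range and KV cache size are illustrative values and should be adapted to your VM.

# Illustrative CPU-backend settings (values are examples; adapt them to your VM).
lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u    # confirm which AVX-512 features are available
export VLLM_CPU_OMP_THREADS_BIND=0-22           # pin worker threads to cores 0-22 (one socket)
export VLLM_CPU_KVCACHE_SPACE=100               # reserve 100 GiB of CPU memory for the KV cache
vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype bfloat16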

Analysis

Throughput & Latency

  • H200 outperforms all other GPUs across all workloads, with the highest prompt and generation throughput.
  • H100 is a close second, showing strong performance especially in classification and code generation.
  • A100 and A100_40G lag significantly behind, particularly in classification tasks where throughput drops by an order of magnitude (on A100_40G, due to smaller GPU memory and lower KV Cache hit percentage).

KV Cache Utilization

  • H200 and H100 show efficient cache usage, with high hit rates (up to 99%) and few waiting requests. (The exception is code generation, which has low hit rates (~1%).)
  • A100_40G suffers from high KV cache usage and low hit rates, especially in classification and code generation, indicating memory bottlenecks. The strain on the inference server shows up as a higher number of waiting requests.

Cost Efficiency

  • Chat profiles: The A100 GPU (40G) offers the best value.
  • Classification profiles: The H200 is most cost-effective.
  • Code-generation profiles: The H100 provides the greatest cost efficiency.

CPU vs GPU

  • Llama 3.1 8B can run on CPU VMs, but the throughput and latency are so poor compared to GPUs that it does not make practical or financial sense to do so.
  • Smaller AI models (<= 1B parameters) may be acceptable on CPUs for some lightweight inference services (such as chat).

Conclusion

The benchmarking results clearly demonstrate that hardware choice significantly impacts the inference performance and cost-efficiency of Llama 3.1 8B deployments. The H200 GPU consistently delivers the highest throughput and cache efficiency across workloads, making it the top performer overall. H100 follows closely, especially excelling in code generation tasks. While A100 and A100_40G offer budget-friendly options for chat workloads, their limitations in memory and cache performance make them less suitable for more demanding tasks. CPU virtual machines do not offer adequate performance—in terms of throughput and latency—for running AI models comparable in size to Llama 3.1 8B. These insights provide a practical foundation for selecting optimal infrastructure based on inference workload type and cost constraints.

References

  1. Hugging Face Inference Benchmarker
    https://github.com/huggingface/inference-benchmarker
  2. Datasets used for benchmarking:
    • Chat: hlarcher/inference-benchmarker/share_gpt_turns.json
    • Classification: hlarcher/inference-benchmarker/classification.json
    • Code Generation: hlarcher/inference-benchmarker/github_code.json
  3. Model:
    • meta-llama/Llama-3.1-8B-Instruct on Hugging Face
      https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
  4. vLLM Inference Engine
    https://github.com/vllm-project/vllm
  5. Azure ND-Series GPU Infrastructure
    https://learn.microsoft.com/en-us/azure/virtual-machines/nd-series
  6. PyTorch 2.7.0 + CUDA 12.8
    https://pytorch.org
  7. NVIDIA GPU Drivers and NCCL
    • Driver: 535.161.08
    • NCCL: 2.21.5-1
      https://developer.nvidia.com/nccl
  8. Azure Pricing Calculator (South-Central US Region)
    https://azure.microsoft.com/en-us/pricing/calculator
  9. CPU - vLLM

Appendix

Install vLLM on CPU VMs

git clone https://github.com/vllm-project/vllm.git vllm_source

cd vllm_source

Edit the Dockerfile (vllm_source/docker/Dockerfile.cpu) to create two variants:

cp docker/Dockerfile.cpu docker/Dockerfile_serve.cpu

change the last line to ENTRYPOINT ["/opt/venv/bin/vllm","serve"]

cp docker/Dockerfile.cpu docker/Dockerfile_bench.cpu

change the last line to ENTRYPOINT ["/opt/venv/bin/vllm","bench","serve"]
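One way to script the ENTRYPOINT change is sketched below; it assumes the stock Dockerfile.cpu ends with an ENTRYPOINT line, which was the case at the time of writing.

# Hedged sketch: rewrite the final ENTRYPOINT in each copied Dockerfile.
sed -i 's|^ENTRYPOINT.*|ENTRYPOINT ["/opt/venv/bin/vllm","serve"]|' docker/Dockerfile_serve.cpu
sed -i 's|^ENTRYPOINT.*|ENTRYPOINT ["/opt/venv/bin/vllm","bench","serve"]|' docker/Dockerfile_bench.cpu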

Build the images, enabling the AVX-512 features supported by your CPU (see lscpu):

docker build -f docker/Dockerfile_serve.cpu --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_DISABLE_AVX512=false --tag vllm-serve-cpu-env --target vllm-openai .

docker build -f docker/Dockerfile_bench.cpu --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_DISABLE_AVX512=false --tag vllm-bench-cpu-env --target vllm-openai .

Start the vLLM server

Remember to set <SIZE in GiB>, <CPU CORE RANGE> and <YOUR HF TOKEN>.

docker run --rm --privileged=true --shm-size=8g -p 8000:8000 -e VLLM_CPU_KVCACHE_SPACE=<SIZE in GiB> -e VLLM_CPU_OMP_THREADS_BIND=<CPU CORE RANGE> -e HF_TOKEN=<YOUR HF TOKEN> -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" vllm-serve-cpu-env meta-llama/Llama-3.1-8B-Instruct --port 8000 --dtype=bfloat16
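Once the container is running, a quick smoke test against the OpenAI-compatible completions endpoint (the prompt and token count are arbitrary):

# Simple check that the server is answering on the OpenAI-compatible API.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello, my name is", "max_tokens": 32}'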

Run vLLM benchmark

Remember to set <YOUR HF TOKEN> and adjust --host to the IP address of the VM running the vLLM server.

docker run --rm --privileged=true --shm-size=4g -e HF_TOKEN=<YOUR HF TOKEN> -e LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" vllm-bench-cpu-env --backend vllm --model=meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name hf --dataset-path AI-MO/aimo-validation-aime --ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 2 --num-prompts 200 --seed 42 --host 10.0.0.4

Updated Aug 26, 2025
Version 1.0