Azure High Performance Computing (HPC) Blog

Performance of Llama 3.1 8B AI Inference using vLLM on ND-H100-v5

CormacGarvey
Aug 26, 2025

Introduction

The pace of development in large language models (LLMs) has continued to accelerate as the global AI community races toward the goal of artificial general intelligence (AGI). Today’s most advanced models boast trillions of parameters, pushing the boundaries of what machines can understand and generate. However, this scale comes at a steep cost—both in terms of training and inference—due to the immense GPU resources required to host and operate these models.

Yet, innovation is not limited to those with access to the largest AI supercomputers. DeepSeek has demonstrated that it is possible to build highly competitive models without relying on the latest, most expensive infrastructure. At the same time, a renewed wave of open-source collaboration is challenging the closed-source strategies of leading AI companies, offering more accessible and customizable alternatives.

For enterprise customers, the focus is shifting toward practical, cost-effective solutions. Rather than deploying trillion-parameter giants, many organizations are turning to smaller models—such as those with around 8 billion parameters—that strike a balance between accuracy and efficiency. These models are not only easier to fine-tune and deploy but also significantly reduce the cost per token, making them ideal for real-world business applications.

In this post, we explore the capabilities of the Llama 3.1 8B model as a representative example of a modern, enterprise-ready LLM. We benchmark its inference performance on Azure's ND-H100-v5 infrastructure using the vLLM engine and present our findings along with recommendations for enterprise deployment.

AI Inference architecture

Inference in transformer-based large language models (LLMs) is typically divided into two primary stages: prefill and decode. Understanding the characteristics and resource demands of each stage is essential for optimizing performance, especially when deploying models like Llama 3.1 8B in enterprise environments.

Prefill Stage: Compute-Bound Initialization

The prefill stage is responsible for processing the input prompt. It involves tokenizing the input and performing a forward pass through the model to populate the key-value (KV) cache. This stage is compute-intensive, as it requires full attention computation across all input tokens. The performance bottleneck here is typically the GPU's compute throughput, especially for long prompts or large batch sizes.

Decode Stage: Memory-Bound Token Generation

Once the KV cache is populated, the decode stage begins. This stage generates one token at a time, using the cached context to predict the next token. The decode step is memory-bound, as it relies heavily on fast access to the KV cache. When the model achieves a KV cache hit, it can skip re-computation of earlier tokens, significantly reducing latency and compute load. This makes cache efficiency a critical factor in real-time inference performance.

Fig 1. High-level architecture of AI inference, showing how efficient use of the KV cache can increase token throughput and reduce inference latency.
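
To make the two stages concrete, the following is a minimal NumPy sketch of a single attention head with a KV cache. It is an illustration only, not Llama's actual attention (no multi-head attention, RoPE, masking, or batching); the head dimension, prompt length, and random projection matrices are placeholders.

# Minimal single-head attention with a KV cache (illustrative shapes, random weights).
import numpy as np

d = 64                                         # head dimension (placeholder)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d)              # (1, t): query against every cached key
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                               # (1, d)

# Prefill: one pass over all prompt tokens populates the KV cache (compute-bound).
prompt = rng.standard_normal((128, d))         # 128 prompt-token embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: each step attends over the whole cache but appends only one K/V row
# (memory-bound: dominated by reading K_cache/V_cache, not by FLOPs).
x = rng.standard_normal((1, d))                # embedding of the latest token
for _ in range(8):
    q = x @ Wq
    out = attend(q, K_cache, V_cache)
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    x = out                                    # stand-in for the next token's embedding

The prefill loop touches every prompt token once and is dominated by matrix-matrix compute, while each decode step reads the entire cache to produce a single new row, which is why decode throughput tracks memory bandwidth rather than FLOPS.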

Overall Inference Characteristics

In general, AI inference is memory-bound, particularly during the decode phase. The ability to serve multiple concurrent requests efficiently depends on how well the system can manage memory bandwidth and cache locality. As such, optimizing memory access patterns and minimizing cache misses are key to achieving high throughput and low latency.
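
A rough back-of-the-envelope calculation illustrates why decode is memory-bound. Assuming each generated token requires streaming the full set of FP16 model weights from HBM (ignoring KV cache reads and any overlap), and using NVIDIA's published ~3.35 TB/s HBM3 bandwidth figure for an H100 SXM GPU, single-stream decode is capped at roughly 200 tokens/sec; batching amortizes the weight reads across many requests, which is why the measured throughputs later in this post are far higher.

# Back-of-the-envelope decode bound (assumptions: FP16 weights, one full weight
# read per generated token, ~3.35 TB/s H100 SXM HBM bandwidth, KV cache reads ignored).
params = 8e9                          # Llama 3.1 8B parameters
bytes_per_param = 2                   # FP16
hbm_bandwidth = 3.35e12               # bytes/sec (approximate published figure)

weight_bytes = params * bytes_per_param            # ~16 GB per forward pass
tokens_per_sec = hbm_bandwidth / weight_bytes
print(f"Single-sequence decode bound: ~{tokens_per_sec:.0f} tokens/sec")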

Techniques for Optimization

To maximize GPU utilization and token throughput while minimizing latency, several architectural strategies are employed:

  • KV Cache Management: Efficient reuse and eviction policies to maintain high cache hit rates. vLLM uses PagedAttention (inspired by the virtual memory and paging techniques used in operating systems) to manage the KV cache in fixed-size blocks/pages. This lets vLLM use HBM dynamically and efficiently and minimizes memory fragmentation (a toy sketch of this block-based bookkeeping follows this list).
  • Batching and Scheduling: Grouping requests to improve parallelism and reduce idle GPU time. vLLM exposes a few parameters to control batching/parallelism:
    • MAX_NUM_SEQS: How many input requests to process in parallel.
    • MAX_NUM_BATCHED_TOKENS: How many tokens to process in parallel in a single forward pass through the network.

Note: Larger values are not always optimal; they can improve token throughput at the expense of latency.

  • Weight and Activation Quantization (FP8): Reducing the precision of model weights and activations frees HBM that can be used to load larger models or hold a larger KV cache. Lower precision also allows computations to run on the GPU's higher-throughput (higher FLOPS) compute units.
  • Parallelization Techniques: Tensor parallelism, pipeline parallelism, expert parallelism, and data parallelism can be used to split larger models across multiple nodes/GPUs. Tensor parallelism splits the weights within each layer across GPUs, so every GPU computes part of every layer. Pipeline parallelism divides the model by layers, with each GPU (or group of GPUs) responsible for its assigned set of DNN layers. Expert parallelism supports Mixture of Experts (MoE) models by distributing the different expert networks across GPUs. Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.
  • Speculative Decoding: Predicting multiple tokens ahead to reduce the number of forward passes.
  • Prefill/Decode Decoupling: Recent advancements, such as those implemented in vLLM (and NVIDIA Dynamo), decouple the prefill and decode stages, allowing each to be assigned dedicated GPU or CPU resources. This separation enables better resource allocation and parallelism, especially in multi-tenant or high-throughput environments.
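
The block-based bookkeeping referenced in the KV Cache Management bullet can be illustrated with a small toy allocator. This is a sketch of the idea only (fixed-size pages handed out on demand and returned when a request finishes); vLLM's real PagedAttention allocator additionally handles prefix sharing, copy-on-write, eviction, and GPU-resident block tables.

# Toy page/block allocator in the spirit of PagedAttention (illustration only).
BLOCK_SIZE = 16                                     # tokens per KV-cache block (placeholder)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))      # pool of physical block ids
        self.block_tables: dict[str, list[int]] = {}    # request -> physical blocks used
        self.num_tokens: dict[str, int] = {}            # request -> tokens cached so far

    def append_token(self, request_id: str) -> None:
        """Reserve KV-cache space for one new token of a request."""
        n = self.num_tokens.get(request_id, 0)
        if n % BLOCK_SIZE == 0:                         # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait or be preempted")
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool (no fragmentation)."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

# Two concurrent requests share one physical pool without pre-reserving a
# worst-case contiguous region per request.
cache = PagedKVCache(num_blocks=8)
for _ in range(40):
    cache.append_token("req-A")                     # 40 tokens -> 3 blocks
for _ in range(10):
    cache.append_token("req-B")                     # 10 tokens -> 1 block
cache.release("req-A")                              # blocks go straight back to the free pool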

By leveraging these techniques, vLLM provides a highly efficient inference engine that is well-suited for serving modern LLMs like Llama 3.1 8B. This makes it a compelling choice for enterprise applications where cost, latency, and scalability are critical considerations.

Benchmark environment

Inference benchmark

The Hugging Face inference-benchmarker tool was used for the AI inference benchmarks. Three popular AI inference profiles were examined.

  • Chat: Probably the most common use case; question-and-answer format on a wide range of topics.
  • Classification: Providing various documents and requesting a summary of their contents.
  • Code generation: Providing code and requesting code generation, e.g. creating a new function.

 

Profile         | Data set                                            | Input prompt length (tokens)       | Output prompt length (tokens)
----------------|-----------------------------------------------------|------------------------------------|------------------------------
Chat            | hlarcher/inference-benchmarker/share_gpt_turns.json | N/A                                | min=50, max=800, variance=100
Classification  | hlarcher/inference-benchmarker/classification.json  | min=8000, max=12000, variance=5000 | min=30, max=80, variance=10
Code generation | hlarcher/inference-benchmarker/github_code.json     | min=3000, max=6000, variance=1000  | min=30, max=80, variance=10

 

Hugging Face Llama 3.1 8B models used

Model                                      | Precision | Model size (GiB)
-------------------------------------------|-----------|-----------------
meta-llama/Llama-3.1-8B-Instruct           | FP16      | 14.9
neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 | FP8       | 8.4
nvidia/Llama-3.1-8B-Instruct-FP8           | FP8       | 8.4

 

vLLM parameter          | Default value
------------------------|--------------
gpu_memory_utilization  | 0.9
max_num_seqs            | 1024
max_num_batched_tokens  | 8192
enable_chunked_prefill  | True
enable_prefix_caching   | True
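
As a reference point, the sketch below shows how the defaults in the table above map onto vLLM's offline LLM API (the same knobs exist as server flags, e.g. --max-num-seqs and --max-num-batched-tokens). The prompt and sampling settings are illustrative, and the argument names should be checked against your installed vLLM version.

# vLLM defaults from the table, expressed through the offline LLM API (sketch).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,          # one model copy per GPU (see Fig 5 discussion)
    gpu_memory_utilization=0.9,      # fraction of HBM vLLM may claim (weights + KV cache)
    max_num_seqs=1024,               # requests scheduled concurrently
    max_num_batched_tokens=8192,     # tokens processed per forward pass
    enable_chunked_prefill=True,     # interleave long prefills with ongoing decodes
    enable_prefix_caching=True,      # reuse KV blocks for shared prompt prefixes
)

outputs = llm.generate(
    ["Summarize the difference between the prefill and decode stages in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)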

 

Dynamo parameter          | Default value
--------------------------|---------------------
block-size                | 64
max-model-len             | 16384
kv_connector              | DynamoNixlConnector
router                    | round-robin
remote-prefill            | True
conditional-disagg        | True
max-local-prefill-length  | 10
max-prefill-queue-size    | 2
max-num-batched-tokens    | 16384
Environment               | Local
No-operation (Planner)    | True

 

Results

Fig 2: AI Inference performance comparison between the chat, classification and code generation profiles on ND-H100-v5 (8 H100).

Profile         | Avg prompt throughput (tokens/sec) | Avg generation throughput (tokens/sec) | Max # requests waiting | Max KV cache usage % | Avg KV cache hit rate %
----------------|------------------------------------|----------------------------------------|------------------------|----------------------|------------------------
Chat            | ~7500                              | ~17600                                 | 0                      | ~2%                  | ~78%
Classification  | ~55300                             | ~2900                                  | 0                      | ~5%                  | ~100%
Code generation | ~130400                            | ~1450                                  | ~37                    | ~10%                 | ~2%

 

Fig 3: The impact of modifying MAX_NUM_BATCHED_TOKENS on the code-generation inference benchmark. (This parameter has a greater impact on the code-generation benchmark than on the chat/classification benchmarks because of its low KV cache hit percentage.)

Fig 4: Code generation inference benchmark run on 1 H100, showing the performance impact of FP8 quantization.

Fig 5: Code generation benchmark run on 1, 2, 4 & 8 H100 GPUs. The results indicate that higher total token throughput can be achieved by running one copy of the model on each GPU instead of distributing the model (via tensor parallelism) across the 8 GPUs.

Fig 6: Impact on throughput (tokens/sec) of adjusting the AI inference configuration (vLLM) on ND-H100-v5.

Results comparing Dynamo vs traditional vLLM

Fig 7: Dynamo vs traditional vLLM throughput (tokens/sec) comparison on 1 ND-H100-v5 (8 GPUs). The throughput of the best traditional vLLM configuration (8 x vLLM, tensor_parallel=1) is compared with various Dynamo configurations (i.e., different combinations of GPUs assigned to decode and prefill).

Note: vLLM does have an experimental disaggregated-prefill capability with various connector types. I attempted to run it using kv-connector = LMCacheConnectorV1 (with NIXL enabled). I got mixed results and eventually ran into the following issues, which led me to switch to NVIDIA Dynamo instead.

  • Limited control over allocating GPUs to decode vs prefill (the tensor-parallel option can be used, but it is limited to specific ratios tied to the number of attention heads).
  • Memory management problems: OOM errors occurred even though there was plenty of HBM available on the GPUs (HBM was not distributed evenly among the GPUs).

Analysis

The performance of the prefill phase is determined by the length and number of input prompts. This phase is compute-bound (effectively a matrix-matrix operation), and the code-generation profile does best here primarily because it had the largest number of input tokens. The decode phase is memory-bound (effectively a vector-matrix operation), and performance in this phase depends heavily on the KV cache hit percentage. The code-generation profile had only a ~1.7% KV cache hit percentage (there was plenty of HBM capacity; only ~10% of the available KV cache capacity was used), which resulted in slow decode performance that hurt its overall throughput and latency, especially at higher QPS (the code-generation benchmark was the only one in which requests backed up and had to wait). The classification profile did well in the decode phase primarily due to its high KV cache hit percentage (~100%), but it struggled in overall generation throughput because of its short output lengths.

Adjusting MAX_NUM_BATCHED_TOKENS had very little impact on the chat and classification benchmarks, probably because they had high KV cache hit percentages, but it did affect the code-generation benchmark (a ~3% improvement in tokens/sec with a larger value).

Quantizing the model can free up HBM, allowing a larger model to be loaded or providing more memory for KV caching, and it can also improve performance by running computations in a lower-precision, higher-FLOPS format (e.g. FP8). In this case there was plenty of HBM and none of the three profiles used all of the available KV cache space, so FP8 quantization does not improve KV caching efficiency. Improvements in compute performance are observed with quantization, however, especially for the code-generation profile, which had a low KV cache hit percentage: its tokens/sec on 1 GPU improved by ~38%.

Since the Llama 3.1 8B model easily fits on 1 H100, you can get significantly better total throughput (tokens/sec) on ND-H100-v5 by loading a complete copy of the model onto each GPU instead of splitting the model across multiple GPUs via tensor parallelism. The chat, classification, and code-generation inference throughput improved by 4.2x, 5.2x, and 1.9x respectively.
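
One way to realize this one-copy-per-GPU layout (a hedged sketch, not the exact scripts used for the benchmark) is to start eight independent single-GPU vLLM servers, each pinned to its own H100 via CUDA_VISIBLE_DEVICES and listening on its own port, and then spread benchmark requests across the ports with a simple load balancer or round-robin client. The entrypoint and flags mirror the appendix script; the port range is an assumption.

# Sketch: launch eight single-GPU vLLM servers (ports 5000-5007), one per H100.
import os
import subprocess

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
procs = []
for gpu in range(8):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))   # pin this server to one GPU
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--host", "0.0.0.0", "--port", str(5000 + gpu),
         "--model", MODEL, "--tensor-parallel-size", "1", "--dtype", "auto"],
        env=env,
    ))

for p in procs:                                              # keep the servers in the foreground
    p.wait()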

Newer inference server architectures feature disaggregated prefill, which decouples prefill from decode and assigns resources (GPUs, CPUs) to each type of worker (prefill or decode). This approach is especially suited to large reasoning models with large context windows running on large GPU inference clusters, where significant performance gains have been reported. In this case we have a modest-size (8B-parameter) model running on a single ND-H100-v5 node (only 8 GPUs), so we were not expecting significant performance improvements. Indeed, traditional aggregated vLLM was much faster than the best Dynamo configuration when running this inference benchmark with the Llama 3.1 8B model on ND-H100-v5. Here the model fits on a single GPU, and the overhead of disaggregation can outweigh any parallelism gains when one GPU can already handle both phases efficiently.

Conclusions

When analyzing AI inference performance, it is important to focus not only on the number of input and output tokens but also on the type of inference application, which can have a big impact on KV cache effectiveness.

Smaller AI models provide more opportunities and options to configure your environment optimally and maximize tokens/sec throughput.

References

  1. Welcome to vLLM — vLLM
  2. Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog
  3. huggingface/inference-benchmarker: Inference server benchmarking tool
  4. meta-llama/llama-models: Utilities intended for use with Llama models.
  5. [2309.06180] Efficient Memory Management for Large Language Model Serving with PagedAttention
  6. ND-H100-v5 size series - Azure Virtual Machines | Microsoft Learn
  7. Hugging Face – The AI community building the future.
  8. Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models | NVIDIA Technical Blog

Appendix

Installation of the Hugging Face inference benchmarker

Install Rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

 

Build and install inference benchmarker

cargo install --git https://github.com/huggingface/inference-benchmarker/

Installation of vLLM

Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

 

Set-up python workspace

uv venv --python 3.10 --seed

source .venv/bin/activate

Install vLLM

uv pip install vllm --torch-backend=auto

 

Install FlashInfer

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive

pip install ninja

cd flashinfer

pip install --no-build-isolation --verbose .

 

Start vLLM server

#!/bin/bash

NUM_GPUS=8

#NUM_GPUS=4

#NUM_GPUS=2

#NUM_GPUS=1

MODEL=meta-llama/Llama-3.1-8B-Instruct

#MODEL=nvidia/Llama-3.1-8B-Instruct-FP8

#MODEL=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8

PORT=5000

export VLLM_LOGGING_LEVEL=INFO

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port $PORT --model $MODEL --tensor-parallel-size $NUM_GPUS --dtype auto

Run Inference benchmark

TOKENIZER_NAME="meta-llama/Llama-3.1-8B-Instruct"

#TOKENIZER_NAME="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8"

#TOKENIZER_NAME="nvidia/Llama-3.1-8B-Instruct-FP8"

inference-benchmarker \

    --tokenizer-name $TOKENIZER_NAME \

    --url http://localhost:5000 \

    --profile code-generation -n

 

Nvidia Dynamo installation

Install python virtual environments

sudo apt install python3.10-venv

Create dynamo virtual workspace

python3 -m venv mydynamo

Activate virtual environment

source /home/azureuser/mydynamo/bin/activate

Check out dynamo github code.

git clone https://github.com/ai-dynamo/dynamo.git

cd dynamo

git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

Install dynamo prerequisites

pip install "ai-dynamo[all]"

Install additional Python modules and system packages

pip install tensorboardX

pip install pynvml

sudo apt install etcd

systemctl status nats

Restart etcd

systemctl restart etcd

Set-up nats-server

wget https://github.com/nats-io/nats-server/releases/download/v2.10.22/nats-server-v2.10.22-linux-amd64.zip

unzip nats-server-v2.10.22-linux-amd64.zip

sudo mv nats-server-v2.10.22-linux-amd64/nats-server /usr/local/bin/

sudo chmod +x /usr/local/bin/nats-server

 

Create/edit /etc/systemd/system/nats.service

[Unit]

Description=NATS Server

After=network.target



[Service]

Type=simple

ExecStart=/usr/local/bin/nats-server -js

Restart=always

RestartSec=10s

LimitNOFILE=40000



[Install]

WantedBy=multi-user.target

 

export NATS_SERVER="nats://localhost:4222"

 

Set Dynamo environment variables (run from the root of the dynamo repository)

export DYNAMO_HOME=$(pwd)

cd $DYNAMO_HOME/examples/llm

Start Dynamo server/service

Edit the disagg.yaml file to modify parameters, then start the service:

dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml

 
