Azure High Performance Computing (HPC) Blog

Optimizing Language Model Inference on Azure

Oct 02, 2024

By Shantanu Deepak Patankar, Software Engineer Intern, and Hugo Affaticati, Technical Program Manager 2

 

Poorly optimized inference can lead to skyrocketing costs for customers, making it crucial to establish clear performance benchmarks. This blog sets the standard for expected performance, helping customers make informed decisions that maximize efficiency and minimize expenses with the new Azure ND H200 v5-series.

 

We evaluated the inference performance of the new Azure ND H200 v5-series for Small Language Models (SLMs) and Large Language Models (LLMs). The ND H200 v5-series, powered by eight NVIDIA H200 Tensor Core GPUs, offers a 76% increase in memory over the NVIDIA H100 Tensor Core GPU of the ND H100 v5-series. We compared three model families: Phi-3 (128k context variant), Mistral v0.1 (7B parameters), and Llama 3.1 (8B, 70B, and 405B parameters) to set performance standards and empower Azure customers to optimize their workloads for time or resources.

 

Model Architecture

Achieving optimal performance requires a clear understanding of where time is spent during the inference workload, enabling effective optimization. The first critical step is to carefully examine the parameters that directly impact performance. For the models discussed, and more broadly, these key parameters include input sequence length, output sequence length, batch size, and tensor parallelism. In this article, we measured the impact of these variables using two essential metrics: throughput and first token latency.

 

The inference process can be categorized into three primary components: pure computation phases (e.g., local GEMMs), pure communication phases (e.g., all-reduce), and attention phases. Analyzing the Llama 3 8B model on the new ND H200 v5 virtual machine revealed that computation consistently accounts for at least 50% and up to 85% of total inference time. Communication time ranges from 10% to 25%, growing as the number of GPUs increases from 2 to 8. In contrast, attention mechanisms consistently represent less than 10% of the total time, as shown in Table 1. This article aims to guide customers in striking the right balance between computation and communication when selecting their AI inference architecture, based on whether time efficiency or cost-effectiveness is their primary goal.

 

Tensor Parallelism | Computation (% of time spent) | Communication (% of time spent) | Attention (% of time spent)
1 GPU  | 83.3 | 0    | 9.2
2 GPUs | 70.7 | 10.8 | 7.4
4 GPUs | 56.7 | 24.7 | 6.1
8 GPUs | 57.2 | 25.1 | 8.2

Table 1: Breakdown of time spent per mechanism for LLAMA 3 8B inference on the ND H200 v5 virtual machine, with an input sequence length of 1024, output sequence length of 128, and batch size of 32.
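
For readers who want to produce a similar breakdown on their own workload, the sketch below shows one possible approach: profiling a forward pass with PyTorch's torch.profiler and bucketing GPU kernel time by kernel name. This is an illustrative sketch, not the tooling used to produce Table 1, and the name-matching heuristics (nccl, gemm, flash, etc.) are assumptions that may need adjusting for a given inference engine.

```python
# Illustrative sketch only: approximate a compute / communication / attention split
# like Table 1 by profiling one forward pass and bucketing CUDA kernel time by name.
# The kernel-name heuristics below are assumptions and may need tuning per engine.
import torch
from torch.profiler import profile, ProfilerActivity

def phase_breakdown(model, inputs):
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(**inputs)

    buckets = {"computation": 0.0, "communication": 0.0, "attention": 0.0, "other": 0.0}
    for evt in prof.key_averages():
        name = evt.key.lower()
        gpu_us = evt.self_cuda_time_total  # GPU time for this kernel, in microseconds
        if "nccl" in name or "all_reduce" in name or "allreduce" in name:
            buckets["communication"] += gpu_us
        elif "flash" in name or "fmha" in name or "attention" in name:
            buckets["attention"] += gpu_us
        elif "gemm" in name or "matmul" in name:
            buckets["computation"] += gpu_us
        else:
            buckets["other"] += gpu_us

    total = sum(buckets.values()) or 1.0
    return {phase: 100.0 * t / total for phase, t in buckets.items()}
```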

 

Resource optimization

Since most of the inference time is spent on computation, the GPU computational speed has a tremendous impact on the overall performance. Understanding the memory requirements ensures better GPU usage. The two main factors influencing GPU memory consumption are the model weights and the key-value cache.

 

Model Weights: the memory occupied by the model weights depends on the number of parameters and the quantization of the model. The memory required can be calculated using the formula:

Memory used (in GB) = number of parameters (in billions) × precision (in bits) / 8

For example, the model weights of a LLAMA 3 model with 8B parameters and FP8 precision would require 8 GB of memory (8 billion parameters × 8 bits / 8 = 8 GB).
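
As a quick sanity check, the weight-memory formula can be written as a couple of lines of Python (a back-of-the-envelope sketch of the arithmetic above, not part of any benchmark code):

```python
def weight_memory_gb(num_params_billions: float, precision_bits: int) -> float:
    """Model weight footprint in GB: parameters (in billions) x bytes per parameter."""
    return num_params_billions * precision_bits / 8

print(weight_memory_gb(8, 8))   # Llama 3 8B at FP8  -> 8.0 GB
print(weight_memory_gb(8, 16))  # Llama 3 8B at FP16 -> 16.0 GB
```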

 

Key-Value Cache: since the attention score of each token depends only on the preceding tokens, the model stores the key and value matrices in a cache to avoid recalculating attention values for every token in the sequence. These two matrices account for the factor 2 in the equation below.

Size of KV cache (in bytes) = batch size × sequence length × 2 × number of layers × (number of heads × dimension of head) × precision (in bits) / 8

For example, the key-value cache of a LLAMA 3 model with 8B parameters, FP8 precision, input length 1024, and output length 128 would require approximately 0.3 GB of memory for a batch size of 1 (1 × (1024 + 128) sequence length × 2 × 32 layers × 4096 × 8 bits / 8 ≈ 0.3 GB).
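
The same arithmetic can be scripted for the KV cache. The sketch below implements the formula as written above, with a Llama-3-8B-like shape (32 layers, 32 heads of dimension 128, i.e. the 4096 factor in the worked example); models that use grouped-query attention cache fewer key-value heads, so real deployments may need less memory than this estimate.

```python
def kv_cache_gb(batch_size: int, seq_length: int, num_layers: int,
                num_heads: int, head_dim: int, precision_bits: int) -> float:
    """KV cache size in GB, per the formula above (the factor 2 covers keys and values)."""
    size_bytes = (batch_size * seq_length * 2 * num_layers
                  * num_heads * head_dim * precision_bits / 8)
    return size_bytes / 1e9

# Worked example: batch 1, sequence 1024 + 128, 32 layers, 32 heads x 128 dims, FP8.
print(kv_cache_gb(1, 1024 + 128, 32, 32, 128, 8))  # ~0.30 GB
```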

 

By using these two quantities, customers can accurately estimate the maximum batch size that the virtual machines can accommodate for their model, thereby optimizing resource utilization. The available GPU memory is calculated by subtracting the weight memory from the total GPU memory when the system is idle. The maximum batch size is then determined by dividing the available memory by the size of the KV cache required for a batch size of one. Table 2 provides several examples of these theoretical batch sizes. This approach not only simplifies the process but also helps customers avoid the trial-and-error method, which can lead to higher GPU consumption and increased costs.
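
Putting the two estimates together gives the theoretical maximum batch size shown in Table 2 below. The sketch assumes 16-bit weights and KV cache and the same Llama-3-8B-like shape as above; the exact result depends on the precision and attention configuration of the deployed model, so it should be read as an estimate rather than a guarantee.

```python
def max_batch_size(gpu_memory_gb: float, num_params_billions: float, seq_length: int,
                   num_layers: int, num_heads: int, head_dim: int, precision_bits: int) -> int:
    """Theoretical max batch size: (idle GPU memory - weights) / KV cache per sequence."""
    weights_gb = num_params_billions * precision_bits / 8
    kv_per_sequence_gb = (seq_length * 2 * num_layers * num_heads * head_dim
                          * precision_bits / 8) / 1e9
    return int((gpu_memory_gb - weights_gb) // kv_per_sequence_gb)

# Llama 3 8B-like shape, sequence length 1152, 16-bit precision, one 140 GB GPU.
print(max_batch_size(140, 8, 1152, 32, 32, 128, 16))  # -> 205, close to the LLAMA 3 row of Table 2
```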

 

Model | ND H200 v5 memory per GPU (in GB) | Number of parameters (in billions) | Weight memory (in GB) | Available memory (in GB) | KV cache size (in GB) | Max batch size
LLAMA 3      | 140 | 8  | 16 | 124   | 0.60 | 206
Mistral      | 140 | 7  | 14 | 126   | 0.60 | 210
Phi-3 medium | 140 | 14 | 28 | 115.8 | 0.94 | 123

Table 2: Theoretical maximum batch size for inference with various language models (LLAMA 3 8B, Mistral 7B, Phi-3 medium) on the ND H200 v5 virtual machine with sequence length 1152 and FP8.

 

These theoretical limits were confirmed empirically, with very similar results. Figure 1 below highlights the maximum batch size that fully utilizes one NVIDIA H200 Tensor Core GPU, then scaled up to all eight GPUs of the latest ND H200 v5 virtual machine, with the corresponding throughput. By optimizing the batch size, customers can extract extra performance from each GPU, fully utilizing available resources. This ensures that every virtual machine operates at its peak capacity, maximizing performance while minimizing cost.

 

Figure 1: Experimental maximum batch size as a function of tensor parallelism (TP) for inference with LLAMA 3 8B on the ND H200 v5 virtual machine with total sequence length 1152.

 

Time optimization

For some workloads, time is of the essence. While increasing the batch size can enhance throughput and maximize resource utilization, it also leads to higher latency. By measuring both the latency and throughput of the inference workload, the optimal balance can be determined. For instance, when running models like Llama 3 and Mistral on a single GPU of the latest ND H200 v5 virtual machine, a batch size of 32 delivers the highest throughput-to-latency ratio, as shown in Figure 2. The optimal batch size is specific to the customer’s workload, as highlighted by the Phi-3 model, which achieves its highest ratio at a batch size of 64 on a single GPU. When scaling to two GPUs, the optimal batch size increases to 64, as illustrated in Figure 3. Although this approach may not fully utilize the available memory, it achieves the lowest possible latency for inference, making it ideal for time-sensitive applications.

 

Figure 2: Experimental optimal throughput-to-latency balance as a function of batch size for inference with LLAMA 3, Phi-3, and Mistral on a single GPU of the ND H200 v5 virtual machine with total sequence length 1152, FP8, and TP 1.

 

Figure 3: Experimental optimal throughput-to-latency balance as a function of batch size for inference with LLAMA 3, Phi-3, and Mistral on two GPUs of the ND H200 v5 virtual machine with total sequence length 1152, FP8, and TP 2.
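
In practice, finding this optimum comes down to sweeping the batch size and keeping the point with the best throughput-to-latency ratio, as illustrated in Figures 2 and 3. The sketch below shows the selection step only; the measurements dictionary holds placeholder numbers, not values from the figures, and would be filled in from your own benchmark runs.

```python
# Placeholder sweep results: batch size -> (throughput in tokens/s, first-token latency in ms).
# These numbers are made up for illustration; replace them with your own measurements.
measurements = {
    8:  (4000.0, 60.0),
    16: (7000.0, 75.0),
    32: (11000.0, 95.0),
    64: (14000.0, 160.0),
}

def best_batch_size(results: dict[int, tuple[float, float]]) -> int:
    """Return the batch size with the highest throughput-to-latency ratio."""
    return max(results, key=lambda b: results[b][0] / results[b][1])

print(best_batch_size(measurements))  # -> 32 for these placeholder numbers
```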

Updated Nov 13, 2024
Version 2.0
  • Hello Dmonakhov, thank you for your interest in our benchmarking approach!

    The optimum values referenced in this post serve as examples to illustrate our optimization process. Variations in these values are expected, particularly as we continue updating the virtual machine configurations. Throughput performance depends on both the VM version and on how the engines are configured. For instance, hyperparameters like --max_num_tokens and --max_seq_len impact memory allocation during engine build, which in turn influences throughput. These parameters can be tailored to specific use cases, enabling optimal configurations across different engine setups.

    As you mentioned, while the exact throughput values may vary, the general shape of the throughput-to-batch-size curve remains consistent. For further customization, you can set --max_num_tokens above max_batch_size * max_seq_len here: GitHub Link. With increased values, the Azure team has successfully enabled engines to support larger batch sizes.

    Thank you again for your valuable feedback!



    • dmonakhov
      Copper Contributor

      For further customization, you can set --max_num_tokens above max_batch_size * max_seq_len here: GitHub Link. With increased values, the Azure team has successfully enabled engines to support larger batch sizes.

      The benchmark you mention above uses TensorRT-LLM for inference, which has a bug that prevents using tensors larger than 1<<31 elements. In practice, this means that for seq_len=1024,128 it is impossible to scale above batch_size=512, see https://github.com/NVIDIA/TensorRT-LLM/issues/2422. So there is no way to scale the configuration you mention (seq_len=1024,128) up to batch_size=750 as claimed in the post. The benchmark AI-benchmarking-guide/Benchmarks/LLMBenchmark.py simply crashes on assertions inside TensorRT-LLM. Probably you used a different inference benchmark that does not have this limitation, or TensorRT-LLM with a smaller sequence length; for example, seq_len=128,8 can scale up to batch_size=4k.

  • dmonakhov
    Copper Contributor

    My attempts to reproduce the results of the "batch_size as a function of tensor parallelism" experiment failed because the benchmark with batch_size >= 512 and seq_len=1024,128 crashes with the assertion "tensor volume exceeds 2147483647" (TensorRT-LLM versions tested: 0.12, 0.13, and 0.14, the latest). I've opened an issue: https://github.com/NVIDIA/TensorRT-LLM/issues/2422

    This basically blocks the scaling experiments for TP=4,8. Please explain how you worked around this issue and got results for TP=4,8.

    • dmonakhov
      Copper Contributor

      With more investigation, it seems that it is currently impossible to scale TensorRT-LLM to a batch size above 512 for seq_len=1024,128, see https://github.com/NVIDIA/TensorRT-LLM/issues/2422. So it seems you used a different inference benchmark for the TP scalability experiment. Please share more details about which benchmark was used.

  • dmonakhov
    Copper Contributor

    Hi, I'm trying to reproduce your results for the "Time optimization" case, but I get a different optimum.

    I use https://github.com/azure/AI-benchmarking-guide with the following config: config-tp1-lat.json. It uses the same parameters for input_output_sizes="1024,128", but I get a slightly different result, see batch_size_vs_perf_no_gp_ctx.png.

       

    In general the graph pattern is the same, but the batch_size optimum is different: the optimum batch_size is ~80. I suspect this is just a side effect of our environments being different. Can you please post which version of TensorRT-LLM you used and which parameters were used for the engine build and benchmark execution?