> By Shantanu Deepak Patankar, Software Engineer Intern, and Hugo Affaticati, Technical Program Manager 2
> Inefficient inference optimization can lead to skyrocketing costs for customers, making i...
Hi, I'm trying to reproduce your results for the "Time optimization" case, but I got a different optimum.
I used https://github.com/azure/AI-benchmarking-guide with the following config: https://github.com/dmonakhov/AI-benchmarking-guide/blob/paper-result-snap-v1/results/optimizing-language-model-inference-on-azure-paper/configs/config-tp1-lat.json#L53-L74. It uses the same parameters, input_output_sizes="1024,128", but I got a slightly different result; see https://github.com/dmonakhov/AI-benchmarking-guide/blob/paper-result-snap-v1/results/optimizing-language-model-inference-on-azure-paper/config-tp1-lat/batch_size_vs_perf_no_gp_ctx.png
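As a sanity check that both runs sweep the same parameters, something like the following can dump the relevant fields from a local copy of the linked config. The field name `input_output_sizes` is taken from the config referenced above; `batch_sizes` is a guess at the schema and may need adjusting to the actual file:

```python
import json

# Load a local copy of the linked config and print the fields relevant to
# this comparison. "input_output_sizes" appears in the linked file;
# "batch_sizes" is an assumed field name, adjust to the actual schema.
with open("config-tp1-lat.json") as f:
    cfg = json.load(f)

print("input_output_sizes:", cfg.get("input_output_sizes"))
print("batch_sizes:", cfg.get("batch_sizes"))
```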
In general the graph pattern is the same, but the batch_size optimum is different: my optimum batch_size is ~80. I suspect this is just a side effect of our environments being different. Could you please post which version of TensorRT-LLM you used, and which parameters were used for the engine build and the benchmark execution?
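To make the environments easier to compare, here is a minimal sketch of the version report I can match against yours. It assumes `tensorrt_llm` and `torch` are importable in the benchmarking environment; all calls below are standard library or documented package attributes:

```python
import platform

import tensorrt_llm
import torch

# Print the version info that determines whether two benchmark runs are
# actually comparable (framework versions, CUDA toolkit, GPU model).
print("Python:      ", platform.python_version())
print("TensorRT-LLM:", tensorrt_llm.__version__)
print("PyTorch:     ", torch.__version__)
print("CUDA (torch):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:         ", torch.cuda.get_device_name(0))
```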