> By Shantanu Deepak Patankar, Software Engineer Intern, and Hugo Affaticati, Technical Program Manager 2
> Inefficient inference optimization can lead to skyrocketing costs for customers, making i...
Hi, I'm trying to reproduce your results for the "Time optimization" case, but I got a different optimum.
I used https://github.com/azure/AI-benchmarking-guide with the following config: https://github.com/dmonakhov/AI-benchmarking-guide/blob/paper-result-snap-v1/results/optimizing-language-model-inference-on-azure-paper/configs/config-tp1-lat.json#L53-L74. It uses the same parameters, input_output_sizes="1024,128", but I got a slightly different result; see https://github.com/dmonakhov/AI-benchmarking-guide/blob/paper-result-snap-v1/results/optimizing-language-model-inference-on-azure-paper/config-tp1-lat/batch_size_vs_perf_no_gp_ctx.png
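As a sanity check that both runs sweep the same parameters, something like the following can dump the relevant fields from a local copy of the linked config. The field name `input_output_sizes` is taken from the config referenced above; `batch_sizes` is a guess at the schema and may need adjusting to the actual file:

```python
import json

# Load a local copy of the linked config and print the fields relevant to
# this comparison. "input_output_sizes" appears in the linked file;
# "batch_sizes" is an assumed field name, adjust to the actual schema.
with open("config-tp1-lat.json") as f:
    cfg = json.load(f)

print("input_output_sizes:", cfg.get("input_output_sizes"))
print("batch_sizes:", cfg.get("batch_sizes"))
```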
In general the graph pattern is the same, but the batch_size optimum is different: my optimum batch_size is ~80. I suspect this is just a side effect of our environments being different. Could you please post which version of TensorRT-LLM you used, and which parameters were used for the engine build and the benchmark execution?
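To make the environments easier to compare, here is a minimal sketch of the version report I can match against yours. It assumes `tensorrt_llm` and `torch` are importable in the benchmarking environment; all calls below are standard library or documented package attributes:

```python
import platform

import tensorrt_llm
import torch

# Print the version info that determines whether two benchmark runs are
# actually comparable (framework versions, CUDA toolkit, GPU model).
print("Python:      ", platform.python_version())
print("TensorRT-LLM:", tensorrt_llm.__version__)
print("PyTorch:     ", torch.__version__)
print("CUDA (torch):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:         ", torch.cuda.get_device_name(0))
```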