LLM Load Test on Azure (Serverless & Managed-Compute)

maljazaery
Sep 09, 2024


Introduction
In the ever-evolving landscape of artificial intelligence, the ability to efficiently load test large language models (LLMs) is crucial for ensuring optimal performance and scalability. llm-load-test-azure is a powerful tool designed to facilitate load testing of LLMs running in various Azure deployment settings.

 

 Why Use llm-load-test-azure? 

The ability to load test LLMs is essential for ensuring that they can handle real-world usage scenarios. By using llm-load-test-azure, developers can identify potential bottlenecks, optimize performance, and ensure that their models are ready for deployment. The tool's flexibility, comprehensive feature set, and support for various Azure AI models make it an invaluable resource for anyone working with LLMs on Azure.

Some scenarios where this tool is helpful:

  • You set up an endpoint and need to determine the number of tokens it can process per minute and the latency expectations.
  • You implemented a Large Language Model (LLM) on your own infrastructure and aim to benchmark various compute types for your application.
  • You intend to test the real token throughput and conduct a stress test on your premium PTUs. 

 

Key Features

llm-load-test-azure is packed with features that make it an indispensable tool for anyone working with LLMs on Azure. Here are some of the highlights:

  • Customizable Testing Dataset: Generate a custom testing dataset tailored to settings similar to your use case. This flexibility ensures that the load tests are as relevant and accurate as possible.
  • Load Testing Options: The tool supports customizable concurrency, duration, and warmup options, allowing users to simulate various load scenarios and measure the performance of their models under different conditions.
  • Support for Multiple Azure AI Models: Whether you're using Azure OpenAI, Azure OpenAI Embedding, Azure Model Catalog serverless (MaaS), or managed-compute (MaaP), llm-load-test-azure has you covered. The tool's modular design enables developers to integrate new endpoints with minimal effort.
  • Detailed Results: Obtain comprehensive statistics like throughput, time-to-first-token, time-between-tokens, and end-to-end latency in JSON format, providing valuable insights into the performance of your models.

 

Getting Started 

Using llm-load-test-azure is straightforward. Here’s a quick guide to get you started:

  1. Generate Dataset (Optional): Create a custom dataset using the generate_dataset.py script. Specify the input and output lengths, the number of samples, and the output file name.

python datasets/generate_dataset.py --tok_input_length 250 --tok_output_length 50 --N 100 --output_file datasets/random_text_dataset.jsonl

--tok_input_length: The length of the input in tokens (minimum 25).

--tok_output_length: The length of the output in tokens.

--N: The number of samples to generate.

--output_file: The name of the output file (default is random_text_dataset.jsonl).
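
Before running a load test, it can help to sanity-check the generated file. The short sketch below assumes only that the output is standard JSON Lines (one JSON object per line), as the .jsonl extension suggests; the exact field names may differ, so it simply counts the records and lists the keys of the first one.

# Minimal sketch: inspect the generated JSON Lines dataset.
# Assumes only that each line is a standalone JSON object; field names
# are whatever generate_dataset.py actually writes.
import json

path = "datasets/random_text_dataset.jsonl"

with open(path, "r", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(f"{len(samples)} samples in {path}")
if samples:
    print("fields per sample:", sorted(samples[0].keys()))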

  2. Run the Tool: Execute the load_test.py script with the desired configuration options. Customize the tool's behavior using a YAML configuration file, specifying parameters such as output format, storage type, and warmup options.

load_test.py [-h] [-c CONFIG] [-log {warn,warning,info,debug}]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        config YAML file name
  -log {warn,warning,info,debug}, --log_level {warn,warning,info,debug}
                        Provide logging level. Example --log_level debug, default=warning
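
If you prefer to kick off runs programmatically (for example from a notebook or a CI job), the sketch below simply shells out to load_test.py with the two flags shown above; config.yaml is a placeholder name for your own YAML configuration file.

# Minimal sketch: launch a run from Python using the CLI flags above.
# "config.yaml" is a placeholder; point it at your own YAML config.
import subprocess

completed = subprocess.run(
    ["python", "load_test.py", "-c", "config.yaml", "--log_level", "info"]
)
print("load test finished with exit code", completed.returncode)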

Results

The tool will produce comprehensive statistics like throughput, time-to-first-token, time-between-tokens, and end-to-end latency in JSON format, providing valuable insights into the performance of your Azure LLM endpoint.



Example of the JSON output:

 

"results": [ # stats on a request level
...
  ],
  "config": { # the run settings
...
    "load_options": {
      "type": "constant",
      "concurrency": 8,
      "duration": 20
...
  },
  "summary": { # overall stats 
    "output_tokens_throughput": 159.25729928295627,
    "input_tokens_throughput": 1592.5729928295625,
    "full_duration": 20.093270540237427,
    "total_requests": 16,
    "complete_request_per_sec": 0.79,  # number of competed requests / full_duration 
    "total_failures": 0,
    "failure_rate": 0.0


    # time per output_token
    "tpot": { 
      "min": 0.010512285232543946,
      "max": 0.018693844079971312,
      "median": 0.01216195583343506,
      "mean": 0.012808671338217597,
      "percentile_80": 0.012455177783966065,
      "percentile_90": 0.01592913103103638,
      "percentile_95": 0.017840550780296324,
      "percentile_99": 0.018523185420036312
    },
     #time to first token
    "ttft": {
      "min": 0.4043765068054199,
      "max": 0.5446293354034424,
      "median": 0.46433258056640625,
      "mean": 0.4660029411315918,
      "percentile_80": 0.51033935546875,
      "percentile_90": 0.5210948467254639,
      "percentile_95": 0.5295632600784301,
      "percentile_99": 0.54161612033844
    },
    # inter-token latency
    "itl": { 
      "min": 0.008117493672586566,
      "max": 0.01664590356337964,
      "median": 0.009861880810416522,
      "mean": 0.010531313198552402,
      "percentile_80": 0.010261738599844314,
      "percentile_90": 0.013813444118403915,
      "percentile_95": 0.015781731761280615,
      "percentile_99": 0.016473069202959836
    },
    #time to ack
    "tt_ack": { 
      "min": 0.404374361038208,
      "max": 0.544623851776123,
      "median": 0.464330792427063,
      "mean": 0.46600091457366943,
      "percentile_80": 0.5103373527526855,
      "percentile_90": 0.5210925340652466,
      "percentile_95": 0.5295597910881042,
      "percentile_99": 0.5416110396385193
    },
    "response_time": {
      "min": 2.102457046508789,
      "max": 3.7387688159942627,
      "median": 2.3843793869018555,
      "mean": 2.5091602653265,
      "percentile_80": 2.4795608520507812,
      "percentile_90": 2.992232322692871,
      "percentile_95": 3.541854977607727,
      "percentile_99": 3.6993860483169554
    },
    "output_tokens": {
      "min": 200,
      "max": 200,
      "median": 200.0,
      "mean": 200.0,
      "percentile_80": 200.0,
      "percentile_90": 200.0,
      "percentile_95": 200.0,
      "percentile_99": 200.0
    },
    "input_tokens": {
      "min": 2000,
      "max": 2000,
      "median": 2000.0,
      "mean": 2000.0,
      "percentile_80": 2000.0,
      "percentile_90": 2000.0,
      "percentile_95": 2000.0,
      "percentile_99": 2000.0
    },
    
  }
}
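
Since the report is plain JSON, it is easy to post-process. Below is a minimal sketch, assuming the report was saved to a file named results.json (the file name is a placeholder), that pulls out a few of the summary fields shown above and converts the output-token throughput to tokens per minute.

# Minimal sketch: read a saved report and print a few summary metrics.
# The keys mirror the example output above; "results.json" is a placeholder.
import json

with open("results.json", "r", encoding="utf-8") as f:
    report = json.load(f)

summary = report["summary"]
print(f"output tokens/sec : {summary['output_tokens_throughput']:.1f}")
print(f"output tokens/min : {summary['output_tokens_throughput'] * 60:.0f}")
print(f"median TTFT (s)   : {summary['ttft']['median']:.3f}")
print(f"median TPOT (s)   : {summary['tpot']['median']:.4f}")
print(f"failure rate      : {summary['failure_rate']:.2%}")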

Conclusion

llm-load-test-azure is a powerful and versatile tool that simplifies the process of load testing large language models on Azure. Whether you're a developer or AI enthusiast, this repository provides the tools you need to ensure that your models perform optimally under various conditions. Check out the repository on GitHub and start optimizing your LLMs today! 

Bookmark this GitHub link: maljazaery/llm-load-test-azure (github.com)

 

Acknowledgments

Special thanks to Zack Soenen for code contributions, Vlad Feigin for feedback and reviews, and Andrew Thomas, Gunjan Shah, and my manager Joel Borellis for ideation and discussions.
The llm-load-test-azure tool is derived from the original load test tool, openshift-psap/llm-load-test (github.com). Thanks to its creators.

 

Disclaimer

This tool is unofficial and not a Microsoft product. It is still under development, so feedback and bug reports are welcome.
