The LLM Latency Guidebook: Optimizing Response Times for GenAI Applications

LucaStamatescu
May 14, 2024

Co-authors: Priya Kedia, Julian Lee, Manoranjan Rajguru, Shikha Agrawal, Michael Tremeer

Contributors: Ranjani Mani, Sumit Pokhariyal, Sydnee Mayers

 

Generative AI applications are transforming how we do business today, creating new, engaging ways for customers to interact with applications. However, the underlying LLMs require massive amounts of compute, and unoptimized applications can run quite slowly, leaving users frustrated. Creating a positive user experience is critical to the adoption of these tools, so minimizing the response time of your LLM API calls is a must. The techniques shared in this article demonstrate how applications can be made up to 100x faster* through clever prompt engineering and a small amount of code!

 

Previous work has identified the core principles for reducing LLM response times. This article expands on those principles with practical examples and working code to help you accelerate your own applications and delight customers. It is primarily intended for software developers, data scientists and application developers, though any business stakeholder managing GenAI applications should read on to learn new ideas for improving their customer experience.

 

Understanding the drivers of long response times

The response time of an LLM can vary based on four primary factors:

  • the model used
  • the number of tokens in the prompt
  • the number of tokens generated
  • the overall load on the deployment and system

You can imagine the model as a person typing on a keyboard, where each token is generated one after another. The speed of the typist (the model used) and the amount they need to type (the number of generated tokens) tend to be the largest contributors to long response times.

Figure 1 - The response generation step typically dominates the overall response time. Not to scale.

 

Techniques for improving LLM response times

The table below contains a range of recommendations you can implement to improve the response times of your Generative AI application. Where applicable, sample code is linked so you can see these benefits for yourself and copy the relevant code or prompts into your application. Illustrative sketches of several techniques also follow the table.

 

| Best practice | Intuition | GitHub | Potential speed-up |
|---|---|---|---|
| 1. Generation token compression | Prompt the LLM to return the shortest response possible. A few simple phrases in your prompt can speed up your application. Few-shot prompting can also be used to ensure the response still includes all the key information. | Link | Up to 2-3x or more (20s -> 8s) |
| 2. Avoid using LLMs to output large amounts of predetermined text | Rather than rewriting documents, use the LLM to identify which parts of the text need to be edited, and use code to make the edits. For RAG, use code to simply append the retrieved documents to the LLM response. | Link | Up to 16x or more (310s -> 20s) |
| 3. Implement semantic caching | By caching responses, LLM outputs can be reused rather than calling Azure OpenAI again, saving cost and time. The input does not need to be an exact match; for example, "How can I sign up for Azure" and "I want to sign up for Azure" will return the same cached result. | Link | Up to 14x or more (19s -> 1.3s) |
| 4. Parallelize requests | Many use cases (such as document processing, classification, etc.) can be parallelized. | Link | Up to 72x or more (180s -> 2.5s) |
| 5. Use GPT-3.5 over GPT-4 where possible | GPT-3.5 has a much faster token generation speed. Certain use cases require the more advanced reasoning capabilities of GPT-4; however, few-shot prompting or fine-tuning may sometimes enable GPT-3.5 to perform the same tasks. Generally recommended only for advanced users, after attempting other optimizations first. | Link | Up to 4x (17s -> 5s) |
| 6. Leverage translation services for certain languages | Certain languages have not been optimized, leading to long response times. Generate the output in English and leverage another model or API for the translation step. | Link | Up to 3x (53s -> 16s) |
| 7. Co-locate cloud resources | Ensure the model is deployed close to your users, and that Azure AI Search and Azure OpenAI are located as close together as possible (same region, firewall, vNet, etc.). | NA | 1-2x |
| 8. Load balancing | Having an additional endpoint for handling overflow capacity (for example, a PTU deployment overflowing to a Pay-as-you-Go endpoint) can save latency by avoiding queuing when retrying requests. | Link | Up to 2x (58s -> 31s) |
| 9. Enable streaming | Streaming improves the perceived latency of the application by returning the response in chunks as soon as they are available. | Coming soon | Coming soon |
| 10. Separation of workloads | Mixing different workloads on the same endpoint can negatively impact latency: 1) short completions batched with longer ones have to wait before being sent back, and 2) mixing calls can reduce your cache hit rate, as both compete for the same space. | Coming soon | Coming soon |
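To make these recommendations more concrete, the sketches below illustrate several of the techniques. The linked repositories contain the full working examples; these snippets are minimal illustrations only, written against the openai Python SDK (v1.x) with placeholder endpoints and deployment names. For technique 1, a concise-response instruction in the system prompt, combined with a `max_tokens` cap, is often enough to cut generation time substantially:

```python
# Technique 1 sketch: generation token compression via prompt engineering.
# Assumes the openai Python SDK (v1.x) and placeholder Azure OpenAI settings.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

# A few simple phrases asking for brevity dramatically reduce the number of
# generated tokens, which usually dominates overall response time.
CONCISE_SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer as concisely as possible, in at most "
    "three sentences. Do not restate the question or add pleasantries."
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # your deployment name
    messages=[
        {"role": "system", "content": CONCISE_SYSTEM_PROMPT},
        {"role": "user", "content": "How do I reset my router?"},
    ],
    max_tokens=150,  # hard cap as a safety net
)
print(response.choices[0].message.content)
```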
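For technique 2, one possible pattern (a sketch, not the authors' exact implementation) is to ask the model for a short, structured list of edits and then apply them locally, so the model never re-generates the unchanged text:

```python
# Technique 2 sketch: return only the edits, then apply them in code.
# `client` and `deployment` are assumed to be configured as in the previous snippet.
import json

def correct_document(client, deployment: str, document: str) -> str:
    prompt = (
        "Identify the spelling and grammar errors in the text below. "
        "Return ONLY a JSON array of objects with keys 'original' and 'corrected'. "
        "Do not rewrite the document.\n\n" + document
    )
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    edits = json.loads(response.choices[0].message.content)

    # Applying the (short) list of edits locally costs no generation tokens.
    for edit in edits:
        document = document.replace(edit["original"], edit["corrected"])
    return document
```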
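For technique 3, production systems typically use a purpose-built cache (for example, Azure Cache for Redis or a vector store), but the idea can be sketched with a toy in-memory cache keyed on embedding similarity; the similarity threshold below is illustrative:

```python
# Technique 3 sketch: an in-memory semantic cache based on embedding similarity.
import numpy as np

class SemanticCache:
    def __init__(self, client, embedding_deployment: str, threshold: float = 0.95):
        self.client = client
        self.embedding_deployment = embedding_deployment
        self.threshold = threshold
        self.entries = []  # list of (unit-norm embedding, cached response)

    def _embed(self, text: str) -> np.ndarray:
        result = self.client.embeddings.create(
            model=self.embedding_deployment, input=text
        )
        vector = np.array(result.data[0].embedding)
        return vector / np.linalg.norm(vector)

    def lookup(self, query: str):
        """Return a cached response for a semantically similar query, else None."""
        query_vec = self._embed(query)
        for cached_vec, cached_response in self.entries:
            if float(np.dot(query_vec, cached_vec)) >= self.threshold:
                return cached_response  # cache hit: the LLM call is skipped entirely
        return None

    def store(self, query: str, response: str):
        self.entries.append((self._embed(query), response))
```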
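For technique 4, independent calls (for example, classifying many documents) can be issued concurrently with the SDK's async client, subject to your deployment's rate limits:

```python
# Technique 4 sketch: parallelize independent requests with the async client.
import asyncio
from openai import AsyncAzureOpenAI

async def classify(client: AsyncAzureOpenAI, deployment: str, text: str) -> str:
    response = await client.chat.completions.create(
        model=deployment,
        messages=[
            {"role": "system",
             "content": "Classify the sentiment as Positive, Negative or Neutral. Reply with one word."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

async def classify_all(client, deployment: str, documents: list[str]) -> list[str]:
    # All requests run concurrently; total time is roughly that of the slowest call.
    return await asyncio.gather(*(classify(client, deployment, doc) for doc in documents))
```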
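For technique 6, one option is to generate the answer in English and hand translation to a dedicated service such as Azure AI Translator. The sketch below assumes the Translator REST API (v3.0) with placeholder keys and region:

```python
# Technique 6 sketch: generate in English, then translate with Azure AI Translator.
import os
import requests

def translate(text: str, target_language: str) -> str:
    response = requests.post(
        "https://api.cognitive.microsofttranslator.com/translate",
        params={"api-version": "3.0", "to": target_language},
        headers={
            "Ocp-Apim-Subscription-Key": os.environ["TRANSLATOR_KEY"],
            "Ocp-Apim-Subscription-Region": os.environ["TRANSLATOR_REGION"],
            "Content-Type": "application/json",
        },
        json=[{"text": text}],
    )
    response.raise_for_status()
    return response.json()[0]["translations"][0]["text"]

# english_answer = ...                          # fast English generation from the LLM
# localized = translate(english_answer, "ja")   # dedicated, fast translation step
```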
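For technique 8, a simple form of overflow handling (distinct from a full load balancer such as an API Management policy) is to fail over to a secondary endpoint when the primary returns a rate-limit error:

```python
# Technique 8 sketch: spill over to a secondary endpoint instead of queuing retries.
from openai import AzureOpenAI, RateLimitError

def chat_with_overflow(primary: AzureOpenAI, secondary: AzureOpenAI,
                       deployment: str, messages: list) -> str:
    try:
        response = primary.chat.completions.create(model=deployment, messages=messages)
    except RateLimitError:
        # Primary (e.g. PTU) is saturated: fail over to Pay-as-you-Go immediately
        # rather than waiting out a retry-after delay.
        response = secondary.chat.completions.create(model=deployment, messages=messages)
    return response.choices[0].message.content
```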
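And for technique 9, streaming does not shorten total generation time, but the first tokens reach the user almost immediately. A minimal sketch, reusing the client from the first snippet:

```python
# Technique 9 sketch: stream the response so the user sees tokens as they arrive.
stream = client.chat.completions.create(
    model="gpt-35-turbo",  # your deployment name
    messages=[{"role": "user", "content": "Explain semantic caching briefly."}],
    stream=True,
)
for chunk in stream:
    # On Azure, the first chunk(s) may carry no content (e.g. content-filter data),
    # so guard before printing.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```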

 

Putting it into practice through case studies

This section gives an overview of two case studies that represent typical GenAI applications (perhaps one is similar to yours!). The linked code repositories show the original speed of each application and then walk you step by step through implementing different combinations of the techniques in this article. Implementing these recommendations improved response times by 6.8x to 102x!

 

| Case study | Technique applied | Cumulative speed improvement | GitHub |
|---|---|---|---|
| Document processing: rewrite a document to correct spelling errors and grammar. This example can be extended with custom logic to adapt to more specific document processing use cases. | 1. Base case | 1x (315s) | Link |
| | 2. Avoid rewriting documents | 8.3x (38s) | |
| | 3. Generation token compression | 15.8x (20s) | |
| | 4. Parallelization | 105x (3s) | |
| Retrieval Augmented Generation (RAG): help a user troubleshoot a product which is not working. | 1. Base case | 1x (23s) | Link |
| | 2. Generation token compression | 2.3x (9.8s) | |
| | 3. Avoid rewriting documents | 6.8x (3.4s) | |
| Retrieval Augmented Generation (RAG): provide general product information. | 1. Base case | 1x (17s) | Link |
| | 2. Semantic caching | 17x (1s) | |

 

Conclusion

With Generative AI transforming how people interact with applications, minimizing response times is essential. If you’re interested in improving your GenAI application’s performance, select a few of these recommendations, clone the repository, and implement them in your application’s next release!

 

*Disclaimer: The results depicted are merely illustrative, emphasizing the potential benefits of these techniques. They are not all-encompassing and are based on a single test. Response times may differ with each run, thus the main goal is to demonstrate relative improvement. The tests are performed using the powerful, but slower, GPT-4 32k model, with a focus on improving response times. The effectiveness of techniques like error correction through document rewriting varies depending on the input; a document with many errors might take longer to correct than to rewrite entirely. Therefore, these techniques should be tailored to your application.

Updated May 14, 2024
Version 1.0
  • tomshapland (Copper Contributor):

    I love the clarity of your guidebook and the concrete numbers around latency improvement. We've been working on making semantic caching work in conversational AI applications where the conversational context is important. By conversational AI, I mean more like your RAG example where someone is troubleshooting a product in an ongoing chat. The same question at one point in a conversation can have a different meaning at another point in the conversation. I'd love if you checked out how we're building context-awareness into our cache. I've been writing about it here: https://www.linkedin.com/company/canonical-ai