1- Introduction
Provisioned Throughput Units (PTUs) are reserved model processing capacity that delivers stable performance for uniform LLM workloads. Because the capacity is reserved, KV caching is more effective on PTUs than on Pay-As-You-Go (PayGo) deployments. This blog post explores the role of Key-Value (KV) caching in enhancing PTU throughput and offers practical strategies for crafting cache-friendly prompts that maximize efficiency.
2- What are Provisioned Throughput Units (PTUs)?
Provisioned Throughput Units (PTUs) in Azure represent a dedicated model processing capacity that can be reserved and deployed for handling prompts and generating completions. The key benefits of PTUs include:
- Predictable Performance: Ensures stable maximum latency and throughput for uniform workloads.
- Reserved Processing Capacity: Once deployed, the throughput is available irrespective of utilization.
- Cost Savings: High throughput workloads may lead to cost savings compared to token-based consumption models.
3- KV Caching: Enhancing Efficiency in Language Models
Key-Value (KV) caching is a technique used in generative transformer models, such as large language models (LLMs), to optimize the inference process. Key aspects of KV caching include the following (a minimal sketch follows the list):
- Reduction of Computational Cost: Minimizes the need to recompute key and value tensors for past tokens during each generation step.
- Memory-Compute Trade-off: Tensors are stored (cached) in GPU memory, balancing memory usage and compute efficiency.
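To make the mechanism concrete, here is a minimal, framework-free sketch of a single-head decode loop that caches key and value tensors instead of recomputing them for every previously generated token. It uses NumPy with made-up dimensions and random weights; it illustrates the idea only and is not how any particular model or serving stack implements it.

```python
# Minimal sketch of KV caching in a decode loop (illustrative, single head).
import numpy as np

d = 64                     # head dimension (illustrative)
k_cache, v_cache = [], []  # grow by one entry per generated token

def decode_step(x_new, W_q, W_k, W_v):
    """Attend the newest token over all cached keys/values."""
    q = x_new @ W_q                    # query for the new token only
    k_cache.append(x_new @ W_k)        # K/V computed once per token, then cached
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # past tokens come from the cache
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over all tokens so far
    return weights @ V                            # attention output for the new token

# Toy usage: five decode steps, each reusing every previously cached K/V entry.
rng = np.random.default_rng(0)
W_q, W_k, W_v = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
for _ in range(5):
    out = decode_step(rng.standard_normal(d), W_q, W_k, W_v)
```

Without the cache, every step would recompute K and V for the entire prefix; with it, the prefix contributes only a memory lookup per step, which is exactly the memory-for-compute trade-off described above.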
4- Crafting KV Cache-Friendly Prompts
To optimize your prompts for KV caching, consider the following strategies (a request sketch follows the list):
- Position Dynamic Elements Wisely: Place dynamic elements, such as grounding data, date & time, or chat history, toward the end of your prompt.
- Maintain Order for Static Elements: Keep static elements like safety instructions, examples, and tool/function definitions at the beginning and in a consistent order.
- Dedicate Your PTU Deployment: Dedicating your deployment to a few use cases can further improve cache hit rates, as the requests will be more uniform.
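As a concrete illustration, the sketch below keeps the static system prompt identical across requests and appends the dynamic pieces (chat history, grounding data, the user question) at the end. It assumes the openai Python SDK (v1.x) against Azure OpenAI; the endpoint, key, API version, deployment name, and helper function are placeholders rather than prescribed values.

```python
# Sketch: cache-friendly message ordering for an Azure OpenAI PTU deployment.
# Endpoint, key, API version, and deployment name below are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

# Identical across requests -> forms a long, cacheable prefix.
STATIC_SYSTEM_PROMPT = (
    "You are a helpful assistant.\n"
    "<safety instructions, tool/function definitions, few-shot examples>"
)

def build_messages(chat_history, grounding_data, user_question):
    """Static content first; dynamic content (history, grounding data) last."""
    return (
        [{"role": "system", "content": STATIC_SYSTEM_PROMPT}]
        + chat_history
        + [{
            "role": "user",
            "content": f"Context:\n{grounding_data}\n\nQuestion: {user_question}",
        }]
    )

response = client.chat.completions.create(
    model="my-gpt4-ptu-deployment",   # placeholder deployment name
    messages=build_messages([], "<today's retrieved documents>", "What changed?"),
)
print(response.choices[0].message.content)
```

Placing the date, retrieved documents, or chat history before the static instructions would change the prompt prefix on every request and prevent the cached keys and values from being reused.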
5- A Case Study with GPT4-T-0409
The following experiments measured the impact of the cacheable (fixed) percentage of the prompt on system performance, specifically average time-to-first-token and throughput; a simplified measurement sketch follows the settings below. The results showed a clear trend: as the fixed, cacheable part of the prompt increased, average latency decreased and request capacity increased.
General Settings:
- Model: GPT4-T-0409
- Region: UK South
- PTU: 100
- Load test duration: 5 min
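For reference, the sketch below shows how time-to-first-token and throughput can be derived from a streaming call. It is a simplified, single-threaded illustration reusing the client from the earlier example; the actual experiments used a load-testing harness issuing concurrent requests.

```python
# Simplified measurement sketch: time-to-first-token via streaming, throughput
# by counting completed requests over a fixed test window.
import time

def measure_request(client, deployment, messages, max_tokens=192):
    """Return (time_to_first_token_sec, total_request_sec) for one streamed call."""
    start = time.perf_counter()
    ttft = None
    stream = client.chat.completions.create(
        model=deployment, messages=messages, max_tokens=max_tokens, stream=True
    )
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start   # first generated token arrived
    return ttft, time.perf_counter() - start

# Aggregation over the 5-minute window:
#   throughput (requests/min) = completed_requests / 5
#   average time to first token = mean of the per-request ttft values
```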
Experiment 1:
- Input token size: 10245
- Output token size: 192
| Cacheable % of the prompt | 1% | 25% | 50% | 75% |
|---|---|---|---|---|
| Throughput (requests/min) | 7 | 9 | 12.5 | 20 |
| Time to first token (sec) | 2.4 | 2.0 | 1.77 | 1.3 |
Analysis:
- Throughput Improvement: As the cacheable percentage of the prompt increased from 1% to 75%, throughput saw a significant increase from 7 requests per minute to 20 requests per minute. This translates to nearly a threefold improvement, highlighting the efficiency gain from caching.
- Latency Reduction: The time to the first token decreased from 2.4 seconds to 1.3 seconds as the cacheable percentage increased. This reduction in latency indicates faster initial response times, which is crucial for user experience.
Experiment 2:
- Input token size: 5000
- Output token size: 100
| Cacheable % of the prompt | 1% | 25% | 50% | 75% |
|---|---|---|---|---|
| Throughput (requests/min) | 17 | 22 | 32 | 55 |
| Time to first token (sec) | 1.31 | 1.25 | 1.16 | 0.9 |
Analysis:
- Throughput Improvement: When the cacheable percentage of the prompt increased from 1% to 75%, throughput saw an impressive rise from 17 requests per minute to 55 requests per minute. This more than threefold increase demonstrates the substantial impact of cache-friendly prompts on system performance.
- Latency Reduction: The time to the first token improved from 1.31 seconds to 0.9 seconds with higher cacheable percentages. This faster response time is beneficial for applications requiring real-time or near-real-time interactions.
* The results may vary based on the model type, deployment region, and use case.
Summary of the results:
In both experiments, a larger cacheable portion of the prompt yielded significant gains in throughput and reductions in latency. The improvements were more pronounced in Experiment 2, likely due to its smaller input and output token sizes.
Throughput: Across both experiments, a higher cacheable percentage of the prompt resulted in substantial increases in throughput. From the lowest to the highest cacheable percentage, throughput increased by almost 186% in Experiment 1 ((20 − 7) / 7 ≈ 1.86) and by approximately 224% in Experiment 2 ((55 − 17) / 17 ≈ 2.24).
Latency: The time to the first token decreased consistently as the cacheable percentage of the prompt increased. This reduction in latency enhances the user experience by providing quicker initial responses.
These results underscore the importance of optimizing prompts to be cache-friendly, thereby maximizing the performance of the system in terms of both throughput and latency. By leveraging caching strategies, systems can handle more requests per minute and provide faster responses, ultimately leading to a more efficient and scalable AI deployment.
6- Conclusion
Provisioned Throughput Units (PTUs) in Azure offer significant advantages in terms of performance, capacity, and cost savings. By leveraging KV caching and creating cache-friendly prompts, you can further enhance the efficiency of your AI workloads. Optimizing prompt structure not only maximizes the benefits of PTUs but also ensures more effective and resource-efficient model processing.
7- Acknowledgments
A special thanks to Michael Tremeer for his invaluable review and feedback on this blog post. His insights have greatly enhanced the quality of this work.
8- References
- Transformers KV Caching Explained, João Lages, Medium
- Techniques for KV Cache Optimization in Large Language Models, omrimallis.com