Addressing the challenges of building AI solutions with high-volume token usage, this article explores strategic recommendations for overcoming token limits, optimizing model deployments, and applying practical techniques for maximizing token usage with Azure OpenAI.
The available models for Azure OpenAI Service, including GPT-3.5 Turbo and GPT-4, have hard maximum token limits per request. These limits ensure the models operate efficiently and produce relevant, cohesive responses. While newer models offer higher token limits, ISVs and Digital Natives still need to explore alternative approaches to overcome them for their project needs.
Different models have different capabilities and limitations. GPT-3.5 provides the most cost-effective deployment and is significantly cheaper to run, but this comes at the expense of a smaller token limit. GPT-4 offers a far more extensive data set and the ability to solve more complex queries with greater accuracy. ISVs and Digital Natives must consider appropriate techniques for utilizing LLMs for their business needs to maximize their token usage.
Achieving scalability while avoiding underutilization or overloading of model deployments is a significant hurdle. Using a shared Azure OpenAI Service instance among multiple tenants can lead to a noisy neighbor problem, resulting in service degradation for certain users of an application. Single deployments also pose a challenge as a user base grows, requiring ISVs and Digital Natives to consider how to provide efficient mechanisms for multiple deployments and for allocating costs to customers.
As ISVs and Digital Natives creating reliable AI solutions with high-volume token usage, you should:

- Choose models that match your use cases, combining GPT and text embedding models where it improves cost-efficiency.
- Use embeddings to supply condensed, semantic context to prompts.
- Apply prompt engineering to keep prompts succinct and targeted.
- Optimize model deployments with load balancing and multi-region infrastructure.
- Consider provisioned throughput units (PTUs) for predictable, high-throughput workloads.
ISVs and Digital Natives are increasingly leveraging the power of the Azure OpenAI Service in new and existing multitenant, software-as-a-service (SaaS) architectures to push the boundaries of their solutions to meet their customers’ changing expectations. In a 2023 report published by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), the share of companies adopting AI solutions has increased to 50-60%. This highlights an increase in demand for AI from consumers of solutions provided by ISVs and Digital Natives.
However, engineering teams transitioning from well-established development processes to this fast-paced, innovative technology face new challenges. Not only are they tasked with integrating with the APIs, but they also need to consider the adoption and management of services and models to provide a reliable AI service across their user base.
This leads ISVs and Digital Natives to ask, “How do we establish best practices in our AI solutions for handling high volumes of tokens?”
This article explores the key focus areas of high-volume token usage with Azure OpenAI. It highlights where ISVs and Digital Natives can make improvements to deliver reliable multitenant SaaS AI solutions.
Tokens are individual characters, words, or parts of words. Text is broken down into these tokens so that models, such as the OpenAI GPT family, can process it for text generation, translation, or summarization.
Using the Byte-Pair Encoding (BPE) tokenization method, the most frequently occurring pairs of characters are merged into a single token. The models learn the statistical relationships between these tokens and excel at predicting the next token in a sequence.
The architecture of each model determines the maximum number of tokens that can be processed in a single request. For example, GPT-3.5 Turbo has a token limit of 4,096, meaning it can process at most 4,096 tokens per request, including both the prompt and the completion.
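As a minimal sketch of working within these limits, the snippet below counts tokens with the open-source tiktoken library before sending a request. The cl100k_base encoding is the one used by the GPT-3.5 and GPT-4 families; the 4,096 limit and the completion headroom are illustrative values.

```python
import tiktoken

MAX_TOKENS = 4096              # gpt-35-turbo: prompt + completion combined
RESERVED_FOR_COMPLETION = 500  # illustrative headroom left for the response

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count BPE tokens the way GPT-3.5/GPT-4 models tokenize text."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

prompt = "Summarize the key points of the attached meeting notes."
prompt_tokens = count_tokens(prompt)

if prompt_tokens > MAX_TOKENS - RESERVED_FOR_COMPLETION:
    raise ValueError(f"Prompt uses {prompt_tokens} tokens; trim it before sending.")
```

The table below summarizes the token limits and typical tokens-per-minute quotas for the models discussed in this article.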
| Model | Token Limit | Tokens Per Minute |
| --- | --- | --- |
| gpt-35-turbo | 4,096 | 240-300K |
| gpt-35-turbo-16k | 16,384 | 240-300K |
| gpt-4 | 8,192 | 20-40K |
| gpt-4-32k | 32,768 | 60-80K |
| gpt-4-turbo | 132,096 (128K in, 4K out) | 80-150K |
| text-embedding-ada-002 | 8,191 | 240-350K |
Azure OpenAI Service applies additional rate limits on top of these model-specific limitations for each model deployment per region. Tokens-per-minute (TPM) is a configurable limit, set per model deployment per region, that represents your best estimate of expected token usage over time. The requests-per-minute (RPM) rate limit is set proportionally to the TPM, based on 6 RPM per 1,000 TPM. These additional quota limits help to manage the compute resources required by the models for processing customer requests: the more tokens a model must process, the more compute is required.
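When a deployment's TPM or RPM quota is exceeded, Azure OpenAI returns an HTTP 429 response. The following is a minimal sketch, using the openai Python SDK, of retrying with exponential backoff when that happens; the endpoint, key, API version, and deployment name are placeholders.

```python
import time
from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

def complete_with_backoff(messages, deployment="gpt-35-turbo", max_retries=5):
    """Retry with exponential backoff when the 429 rate limit is returned."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # wait longer before the next attempt
```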
It is important to consider the specific model token limits as well as the additional Azure OpenAI quota limits when architecting AI solutions.
Before choosing a specific model, define your business objectives and use cases so you understand how each model can help you achieve your goals.
The GPT family models are best used for natural language processing tasks such as chatbots, Q&A, language translation, text generation, and summarization. These models can generate high-quality content that is coherent and contextually relevant.
Text embedding models, on the other hand, perform better for tasks such as document search, sentiment analysis, content filtering, and classification. These models can represent text as a vector, a numerical representation which can be used to measure the similarity between different texts.
Conduct workshops with your engineering teams to collaboratively map out potential use cases to the various models. Identify specific uses where GPT models will support your requirements for natural language tasks, while exploring where you can optimize your cost-effectiveness using text embedding models for semantic analysis.
Avoid a one-size-fits-all approach when considering your models. Recognize that each model excels in distinct areas. Tailor your choices based on the specific requirements of your use cases to achieve significant cost-efficiency in your token usage. Consider that you may use multiple models in conjunction for your use cases to optimize your token usage further.
Embeddings are numerical representations of the text you provide to a model such as text-embedding-ada-002, capturing the contextual relationships and meaning behind it.
Embeddings serve as a powerful tool in enriching a prompt to GPT models with semantic understanding from your existing data. By locating related text using embeddings, GPT models are provided with condensed, semantic context which results in fewer tokens used. This is crucial when considering high-volume token scenarios, contributing to cost savings without compromising the quality of responses.
When generating embeddings, it is important to note that token limits apply to the amount of content processed in a single transaction. Unlike GPT models, a prompt is not required for these requests. However, appropriate strategies are needed to segment the text into chunks so that the semantic relationships in the text are captured effectively.
Consider splitting text by the most appropriate method for your use cases, such as by paragraph, section, key phrases, or applying clustering algorithms to group text into similar segments. Experiment, iterate, and refine your strategies for embeddings to optimize performance.
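As an illustration of the chunking and retrieval approach described above, the following is a minimal sketch using the openai Python SDK and NumPy: it splits a document by paragraph, embeds the chunks in a single batch against a text-embedding-ada-002 deployment, and ranks them by cosine similarity to a query. The endpoint, key, and deployment name are placeholders.

```python
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

def embed(texts: list[str], deployment: str = "text-embedding-ada-002") -> np.ndarray:
    """Embed a batch of text chunks in a single request."""
    response = client.embeddings.create(model=deployment, input=texts)
    return np.array([item.embedding for item in response.data])

def top_chunks(query: str, document: str, k: int = 3) -> list[str]:
    """Split by paragraph, then return the k chunks most similar to the query."""
    chunks = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunk_vectors = embed(chunks)
    query_vector = embed([query])[0]
    scores = chunk_vectors @ query_vector / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```

The highest-ranked chunks can then be passed to a GPT model as condensed context, which leads into the prompt engineering techniques below.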
Prompt engineering involves crafting input queries or instructions in a way that extracts the most relevant information from the model while minimizing the number of tokens used. It is a strategic approach to achieve precise and resource-efficient interactions with Azure OpenAI.
Appropriately applying prompt engineering is a crucial aspect of maximizing the efficiency and reliability of LLM solutions. Choose succinct, targeted prompts for your use cases that convey the desired output. Use your understanding of the GPT models' tokenization to truncate and segment the instructions in prompts, reducing the overall prompt size without sacrificing the quality of the response. Avoiding unnecessary verbosity keeps token usage minimal.
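As a simple illustration of these points, the sketch below contrasts a verbose prompt with a concise one using the tiktoken tokenizer and caps the completion length with max_tokens. The prompts, review text, endpoint, key, and deployment name are all illustrative placeholders.

```python
import tiktoken
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-api-key>",
    api_version="2024-02-01",
)
encoding = tiktoken.get_encoding("cl100k_base")

review_text = "The device stopped charging after two days and support never replied."

verbose_prompt = (
    "I would really like you to please read through the following customer review "
    "very carefully and then, if at all possible, provide me with a detailed "
    "explanation of whether the sentiment expressed within it is positive or negative."
)
concise_prompt = "Classify the sentiment of this customer review as Positive or Negative."

# The concise prompt conveys the same instruction with a fraction of the tokens.
print(len(encoding.encode(verbose_prompt)), len(encoding.encode(concise_prompt)))

response = client.chat.completions.create(
    model="gpt-35-turbo",  # placeholder deployment name
    messages=[
        {"role": "system", "content": "You are a sentiment classifier. Reply with one word."},
        {"role": "user", "content": f"{concise_prompt}\n\nReview: {review_text}"},
    ],
    max_tokens=5,   # cap the completion so responses stay short and predictable
    temperature=0,
)
print(response.choices[0].message.content)
```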
Test multiple prompts and context retrieval techniques for your scenarios to validate the accuracy and reliability of the generated content. Utilize tools such as Prompt Flow in Azure AI Studio to streamline the development of AI applications and evaluate the performance of your prompts.
With Azure OpenAI becoming a critical component of AI workloads, strategies for ensuring reliability and availability of this functionality are vital. With the limitations set by the service for model deployments per region, maximizing token usage can be achieved through multiple deployments across regions.
Applying load balancing techniques in front of each Azure OpenAI Service instance provides an even distribution of requests across regions, ensuring high availability for customers. Load balancing also supports resiliency, enabling a seamless failover to another region if the rate limits for one region are met.
For scenarios where global reach is a requirement, deploying the same Azure OpenAI infrastructure across multiple regions can provide a better user experience for customers. Requests from client applications can be routed to the nearest appropriate region while taking advantage of load balancing and failover to other regions to maximize token usage.
Although it is possible to deploy the same model multiple times within a single Azure OpenAI Service instance, the TPM/RPM limits on model deployments per region constrain the use of per-tenant model deployments.
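A minimal sketch of client-side round-robin load balancing with failover across two regional Azure OpenAI deployments is shown below. The endpoints, keys, and deployment name are placeholders, and production architectures typically place this logic in a gateway such as Azure API Management rather than in the client.

```python
from itertools import cycle
from openai import AzureOpenAI, RateLimitError, APIConnectionError

# One client per regional deployment; resource names and keys are placeholders.
backends = cycle([
    AzureOpenAI(azure_endpoint="https://eastus-resource.openai.azure.com",
                api_key="<key-1>", api_version="2024-02-01"),
    AzureOpenAI(azure_endpoint="https://westeurope-resource.openai.azure.com",
                api_key="<key-2>", api_version="2024-02-01"),
])

def complete_with_failover(messages, deployment="gpt-35-turbo", attempts=4):
    """Rotate through regional deployments, skipping any that are throttled or unreachable."""
    last_error = None
    for _ in range(attempts):
        client = next(backends)
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except (RateLimitError, APIConnectionError) as error:
            last_error = error  # this region is rate limited or unavailable; try the next
    raise last_error
```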
As high-volume token usage increases, consider solutions for tracking token usage across customers in multitenant scenarios by introducing monitoring tools, such as Azure Managed Grafana, to simplify the process.
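Token usage for each request is reported in the response's usage metadata, which can be attributed to a tenant and exported to your monitoring stack (for example, as metrics that Azure Managed Grafana can visualize). The record_usage helper and tenant identifier below are illustrative.

```python
from collections import defaultdict

# In-memory accumulator for illustration; a real solution would emit metrics instead.
tenant_token_usage: dict[str, int] = defaultdict(int)

def record_usage(tenant_id: str, response) -> None:
    """Accumulate prompt and completion tokens reported on each chat completion response."""
    usage = response.usage
    tenant_token_usage[tenant_id] += usage.prompt_tokens + usage.completion_tokens

# Example: after each call made on behalf of a tenant
# response = client.chat.completions.create(model="gpt-35-turbo", messages=messages)
# record_usage("contoso", response)
```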
Adopt DevOps best practices, including infrastructure as code, when deploying and managing a complex, multi-region Azure AI infrastructure. This approach simplifies the deployment process, minimizes human error, and ensures consistency across all regions.
Consider all limitations when architecting a high-volume token usage scenario including TPM and RPM per model deployment in each region.
Provisioned throughput units (PTUs) ensure predictable performance for your Azure OpenAI solutions by reserving processing capacity for prompts and completions. Unlike the TPM quota, which is based on a pay-as-you-go model, PTUs are purchased as a monthly commitment. Reserving capacity allows you to specify the throughput you require, offering stable maximum latency and throughput for your workloads. High-throughput workloads may see improved cost savings compared with token-based consumption in Azure OpenAI.
Carefully assess your AI solution's throughput requirements to prevent overprovisioning. Quota for PTUs is determined by a deployment type, model, and region triplet, and is not interchangeable between them. For example, if you have 300 PTUs provisioned for GPT-3.5 Turbo, those PTUs can only be used for GPT-3.5 Turbo deployments within a specific Azure subscription. Once deployed, the throughput is available whether you use it or not, so avoid overprovisioning PTUs to prevent unnecessary costs and underutilization of resources.
Regularly monitor your deployments and adjust PTUs as needed. Be aware that while quota represents the total throughput you can deploy, it does not guarantee underlying capacity availability. You may encounter out-of-capacity errors and need to respond to them promptly to ensure the reliability of your AI solution.
Creating reliable AI solutions with high-volume token usage on Azure OpenAI requires a strategic and multifaceted approach. ISVs and Digital Natives must navigate the constraints of model token limits, choose appropriate models for their use cases, explore combinations of prompts and context retrieval, and optimize their model deployments to maximize their token usage.
As the demand for AI solutions continues to grow, ISVs and Digital Natives are challenged to establish best practices for production. With a collaborative, systematic approach, they can push the boundaries of possibilities with Azure OpenAI to deliver reliable AI solutions that meet their evolving customer expectations.