Microsoft Developer Community Blog

Manage your LLM token spend

Chris_Noring
Mar 25, 2025

In this article, we will look at how to limit the number of tokens that can be requested from the OpenAI API. But before we get that far, let's back up a bit and talk about what token limiting is and why we need it.

This article is part of a series on API Management and Generative AI. We firmly believe that adding Azure API Management to your AI projects can help you scale your AI models, make them more secure, and make them easier to manage.

 

API management

Once you bring an LLM to production and expose it as an API endpoint, you need to consider how you "manage" such an API. There are many considerations to be made, everything from caching, scaling, error management, rate limiting, monitoring and more. In this article we will use Azure API Management and show how, by adding one of its policies to an LLM endpoint, you can control token usage.

Resources

Here's a list of resources that you might find useful:

- Docs page

- Azure Sample

- Azure Gateway

 

What is a token limit policy?

A token limit policy is something you can apply to your API to limit the number of tokens that can be requested from it. The idea is that you configure the policy to allow a certain number of tokens within a certain time frame. If the limit is exceeded, the API returns an error (HTTP 429 Too Many Requests). Typically, the policy specifies the number of tokens allowed and the time frame in which they can be requested.
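To make the mechanics concrete, here is a minimal Python sketch of the bookkeeping such a policy performs: a per-key token counter that resets every minute. This is purely illustrative (the try_consume function and the fixed one-minute window are my simplification), not how API Management implements it internally.

import time
from collections import defaultdict

TOKENS_PER_MINUTE = 60
WINDOW_SECONDS = 60

# One counter per counter key (e.g. a subscription ID): (window start, tokens used).
_windows = defaultdict(lambda: (0.0, 0))

def try_consume(counter_key: str, tokens: int) -> bool:
    """Return True if the caller may spend `tokens`, False if the limit is hit."""
    window_start, used = _windows[counter_key]
    now = time.monotonic()
    if now - window_start >= WINDOW_SECONDS:   # window expired: start a fresh one
        window_start, used = now, 0
    if used + tokens > TOKENS_PER_MINUTE:      # would exceed the per-minute budget
        _windows[counter_key] = (window_start, used)
        return False                           # the gateway would answer with 429
    _windows[counter_key] = (window_start, used + tokens)
    return True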

 

Why do we need a token limit policy?

There are a few reasons why you might want to limit the number of tokens that can be requested from your API:

 

Abuse: To prevent abuse of the API, such as a single user making too many requests in a short period of time.

Availability: To ensure that the API is available to all users, not just a few who are making too many requests. This is also known as rate limiting, and the problem being addressed is called the "noisy neighbour" problem.

Security: To prevent denial-of-service attacks, where an attacker tries to overwhelm the API with too many requests.

Cost: To prevent excessive usage of the API, which could result in higher costs.

As you can see, there are many good reasons to limit the number of tokens.

 

How it works

Here's an example of what a token limit policy might look like:

<policies>
  <inbound>
    <base />
    <azure-openai-token-limit
      counter-key="@(context.Subscription.Id)"
      tokens-per-minute="60"
      estimate-prompt-tokens="false"
      retry-after-variable-name="token-limit-retry-after"
    />
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>

 

The azure-openai-token-limit element is the policy that limits the number of tokens that can be requested from the Azure OpenAI API.

The counter-key attribute specifies the key used to track the number of tokens requested. In this case, we are using the subscription ID as the key, via the policy expression @(context.Subscription.Id). You can also use other keys, such as a user ID or the caller's IP address (for example, @(context.Request.IpAddress)). Which key you choose depends on your use case, for example whether you want to limit the number of tokens requested by a single user or by a single subscription.

tokens-per-minute is the number of tokens that can be requested within a minute. In this case, we are allowing 60 tokens per minute, a deliberately low value that makes the policy easy to test.

estimate-prompt-tokens is a boolean value that specifies whether the policy should estimate the number of tokens in the incoming prompt. If set to true, the policy estimates the prompt's token count up front and can reject a request before it ever reaches the model. If set to false, the policy counts tokens based on the actual usage reported in the model's response.

retry-after-variable-name is the name of a context variable that will hold the recommended number of seconds to wait before making another request once the limit is exceeded. There's also the option to use the retry-after-header-name attribute to return that interval in a response header (Retry-After by default). The reason to use a variable is that it lets you customize or post-process the value rather than returning it as-is.
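From the client's side, the practical consequence is a 429 response carrying the retry interval. Here is a minimal Python sketch of a client that honours it; the endpoint URL, subscription key value, and the call_with_backoff helper are placeholders of my own, and the sketch assumes the gateway returns the interval in a Retry-After header.

import time
import requests

# Placeholders: substitute your own API Management endpoint and key.
URL = "https://<your-apim-instance>.azure-api.net/my-llm-api/chat"
HEADERS = {"Ocp-Apim-Subscription-Key": "<your-subscription-key>"}

def call_with_backoff(payload: dict, max_attempts: int = 3) -> requests.Response:
    """POST to the gateway, waiting out 429s using the Retry-After header."""
    for _ in range(max_attempts):
        response = requests.post(URL, json=payload, headers=HEADERS)
        if response.status_code != 429:            # not rate limited: done
            return response
        # Token limit exceeded: sleep for the interval the gateway suggested.
        time.sleep(int(response.headers.get("Retry-After", "1")))
    return response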

 

Is there a way to test this policy?


Yes. The policy above allows only 60 tokens per minute, so calling the API more than a handful of times within 60 seconds (say, 5 or more calls) should exhaust the budget, and the API will start returning HTTP 429 error responses.
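If you prefer to script the test, a small loop like the sketch below makes the limit easy to observe; the endpoint, key, and payload are placeholders for your own deployment.

import requests

# Placeholders: substitute your own API Management endpoint, key, and payload.
URL = "https://<your-apim-instance>.azure-api.net/my-llm-api/chat"
HEADERS = {"Ocp-Apim-Subscription-Key": "<your-subscription-key>"}
PAYLOAD = {"messages": [{"role": "user", "content": "Hello"}]}

for i in range(10):
    response = requests.post(URL, json=PAYLOAD, headers=HEADERS)
    # Once the 60 tokens-per-minute budget is spent, expect HTTP 429
    # (with a Retry-After header) instead of HTTP 200.
    print(f"call {i + 1}: HTTP {response.status_code}")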

Great, is there a repo I can check out?

Yes, check out the Gen AI + APIM repo.


Updated Mar 25, 2025
Version 1.0