Microsoft Developer Community Blog

Announcing a new workshop experience for AI Gateway for Azure API Management

Chris_Noring
Jun 13, 2025

TLDR; We’re proud to announce 3 new workshops that will help you add some great Generative AI features to Azure API Management.

https://azure-samples.github.io/AI-Gateway/

Problems with Gen AI in production 

So, what’s really the problem with bringing Generative AI to production? Well, there’s more than one problem. You have to consider, for example, how many tokens are spent, so you can ensure throughput and fair usage. You also need to keep track of it all through monitoring. Scaling is another challenge, your solution needs to be resilient, and any errors need to be dealt with gracefully, among much else. The good news is that Azure API Management addresses many of these problems. In fact, we’re glad to announce that we’re releasing a set of workshops on these themes, built for the Azure portal. This gives you ample opportunity to try out the features, and when you’re ready for production, we’ve got Bicep code for that as well.

What are you waiting for? Let’s explore! 

https://azure-samples.github.io/AI-Gateway/ 

 

Here’s what you’re getting 

  • 3 workshops on Generative AI features for Azure API Management. 
  • Azure portal experience. All workshops take you through setting things up via the Azure portal, which is a great way to try out new features for the first time. There’s also a link at the end of each workshop showing you how to deploy the same setup using Bicep templates. 
  • A vastly improved AI setup, as you’re leveraging features that help you control spending, monitor usage, scale, and more. 

 

Workshops in detail 

Here are the workshops in detail: 

-1- Control cost and performance with token quotas and limits 

 

This policy helps you limit how many Gen AI tokens are allowed to pass through, which helps you control throughput, spending, and more. It manages and enforces limits per API consumer based on their usage of Azure OpenAI Service tokens. With this policy you can set a rate limit, expressed in tokens per minute (TPM). You can also set a token quota over a specified period, such as hourly, daily, weekly, monthly, or yearly. 

The image above shows which requests are let through and which ones are rate limited, based on the policy below. 

<azure-openai-token-limit 
    counter-key="@(context.Subscription.Id)" 
    tokens-per-minute="500" 
    estimate-prompt-tokens="false" 
    remaining-tokens-variable-name="remainingTokens"> 
</azure-openai-token-limit> 

Here’s how you configure the policy: the example above limits the number of tokens per minute to 500, counted per subscription ID (the counter-key). 
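The policy can also enforce the longer-term quota mentioned above, on top of the per-minute rate limit. Here’s a minimal sketch, assuming the token-quota and token-quota-period attributes from the policy reference; the values are placeholders, and attribute support may vary with your API Management service version: 

<azure-openai-token-limit 
    counter-key="@(context.Subscription.Id)" 
    tokens-per-minute="500" 
    <!-- assumed attributes: a weekly cap of 100,000 tokens 
         in addition to the 500 TPM rate limit --> 
    token-quota="100000" 
    token-quota-period="Weekly" 
    estimate-prompt-tokens="false" 
    remaining-tokens-variable-name="remainingTokens"> 
</azure-openai-token-limit> 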

-2- Keep visibility into AI consumption with model monitoring 

 

By keeping track of your prompts, you can much more easily control the costs of your Gen AI solution. The Azure OpenAI emit token metric policy sends metrics to Application Insights about the consumption of LLM tokens through Azure OpenAI Service APIs. The policy helps provide an overview of the utilization of Azure OpenAI Service models across multiple applications or API consumers, which is useful for chargeback scenarios, monitoring, and capacity planning. 

The image above shows how a set of metrics, or “dimensions”, is logged for each request; these can then be viewed on a dashboard. Below, we specify which dimensions are logged. 

<azure-openai-emit-token-metric namespace="openai"> 
    <dimension name="Client IP" value="@(context.Request.IpAddress)" /> 
    <dimension name="API ID" value="@(context.Api.Id)" /> 
    <dimension name="User ID" value="@(context.Request.Headers.GetValueOrDefault("x-user-id", "N/A"))" /> 
</azure-openai-emit-token-metric> 

Here’s how you configure this policy. The example above tracks dimensions like client IP, API ID, and user ID, which later lets you drill down into which usage comes from where. 
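To show how these pieces fit together, here’s a minimal sketch of a full policy document that applies both the token limit and the token metric policies in the inbound section; the values are placeholders taken from the examples above: 

<policies> 
    <inbound> 
        <base /> 
        <!-- rate limit: 500 tokens per minute, counted per subscription --> 
        <azure-openai-token-limit 
            counter-key="@(context.Subscription.Id)" 
            tokens-per-minute="500" 
            estimate-prompt-tokens="false" 
            remaining-tokens-variable-name="remainingTokens" /> 
        <!-- emit token consumption metrics to Application Insights --> 
        <azure-openai-emit-token-metric namespace="openai"> 
            <dimension name="API ID" value="@(context.Api.Id)" /> 
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" /> 
        </azure-openai-emit-token-metric> 
    </inbound> 
    <backend> 
        <base /> 
    </backend> 
    <outbound> 
        <base /> 
    </outbound> 
</policies> 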

 

-3- Load balancing and circuit breaker 

Load balancing helps you scale, and the circuit breaker is a pattern that ensures errors are dealt with gracefully and don’t affect the customer’s experience. In Azure API Management, you group multiple backends into a backend pool and define circuit breaker rules on them; your API policy then routes requests to the pool, as shown in the sketch below. 
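The backend pool and its circuit breaker rules are defined on the API Management backend resource (the Bicep templates in the workshop cover that part), so the policy side stays small. Here’s a minimal sketch, assuming a backend pool named openai-backend-pool has already been created; the pool name and retry values are placeholders: 

<policies> 
    <inbound> 
        <base /> 
        <!-- route requests to a load-balanced pool of Azure OpenAI backends; 
             'openai-backend-pool' must match a backend pool defined on the 
             API Management instance --> 
        <set-backend-service backend-id="openai-backend-pool" /> 
    </inbound> 
    <backend> 
        <!-- retry on throttling or server errors so a busy or tripped backend 
             doesn't surface as an error to the caller --> 
        <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)" 
               count="2" 
               interval="1" 
               first-fast-retry="true"> 
            <forward-request buffer-request-body="true" /> 
        </retry> 
    </backend> 
    <outbound> 
        <base /> 
    </outbound> 
</policies> 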


Learn more: https://azure-samples.github.io/AI-Gateway/ 

Published Jun 13, 2025
Version 1.0