As enterprises increasingly adopt AI-driven solutions, the need for scalable, secure, and intelligent API architectures becomes critical. In this blog, we’ll explore how to integrate Azure API Management (APIM) with Azure OpenAI endpoints, leverage Azure OpenAI semantic caching to optimize performance, manage Token Per Minute (TPM) limits, and deploy self-hosted gateways for hybrid environments.
🔐Azure API Management (APIM): The Gateway to Modern APIs
Azure API Management provides a robust platform to expose, manage, and secure APIs. It acts as a facade between your backend services and consumers, offering:
- Rate limiting and throttling
- Authentication and authorization
- Analytics and monitoring
- Policy-based transformations
🤖 Integrating Azure OpenAI with APIM
Azure OpenAI brings the power of GPT models to your enterprise applications. By exposing Azure OpenAI endpoints through APIM, you can:
- Apply rate limits and quotas to control usage
- Add authentication layers (e.g., OAuth2, subscription keys)
- Monitor usage and performance via Azure Monitor
⚡ Understanding Token Per Minute (TPM) Limits
Azure OpenAI enforces TPM limits to manage model usage. Each model (e.g., GPT-4, GPT-3.5) has a quota for how many tokens can be processed per minute.
📌 Best Practices
- Distribute load across multiple deployments
- Use APIM policies to throttle requests before hitting TPM limits
- Monitor usage with Azure Monitor and alerts
🧠 Azure OpenAI Semantic Caching: Optimize LLM Performance
Enable semantic caching of responses to Azure OpenAI API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and for prompts that are similar in meaning, even if the text isn't the same.
🛠️ How It Works
- Generate embeddings for incoming prompts
- Compare with cached embeddings using similarity score threshold based on vector proximity of the prompt to previous requests.
- Return cached response if similarity exceeds threshold
📈 Benefits
- Reduces costs and latency
- Improves scalability
- Enhances user experience
📷 Semantic Cache Flow
🌐 APIM Self-Hosted Gateway: Hybrid API Management
The self-hosted gateway allows you to run APIM in your own infrastructure—ideal for on-prem or hybrid cloud scenarios.
🔍 Key Features
- Same policies and configuration as cloud APIM
- Works in Kubernetes, Docker, or VMs
- Enables local traffic routing and compliance
📷 Self-Hosted Gateway Architecture
🧩 Bringing It All Together
By combining these technologies, you can build a secure, scalable, and intelligent API platform:
- Use APIM to expose and secure OpenAI endpoints
- Enforce TPM limits and throttle requests
- Implement semantic caching to reduce LLM costs
- Deploy self-hosted gateways for hybrid environments
📚 Resources
- Azure API Management Documentation
- Azure OpenAI Service
- Azure Open AI Token Per Limit
- Azure Open AI Semantic Caching
- Self-Hosted Gateway Setup
🧪 Proof of Concept (POC): Bringing the Architecture to Life
To validate the integration of Azure APIM, Azure OpenAI, semantic caching, and self-hosted gateways, here are some POC steps you can implement:
List of Azure Resources to be deployed for this POC-
- 1 API Management service
- 1 Azure Managed Redis
- OpenAI model deployments
- gpt-4o
- gpt-4o-mini
- text-embedding-3-small
 
- 2 Azure Virtual Machines (APIM Self-Hosted gateway configured on docker container)
- 1 Log Analytics Workspace
- 1 Application Insights
✅ 1. Expose Azure OpenAI via APIM
- Create an Azure OpenAI resource and deploy a model (e.g., gpt-4o/gpt-4o-mini).
- Create an API in Azure API Management that proxy's requests to the Azure OpenAI endpoint.
✅ 2. Implement Token Per Minute (TPM) Throttling
- Use APIM policies to enforce TPM limits.
- Monitor usage via Azure Monitor and set alerts for threshold breaches.
Refer below APIM policy for TPM setup, where dynamic values are set during the API call for TPM attribute:
<set-variable name="tokensPerMinute" value="@(context.Request.Headers.GetValueOrDefault("X-Tokens-Per-Minute", "1000"))" />
<azure-openai-token-limit tokens-per-minute="@(Convert.ToInt32(context.Variables["tokensPerMinute"]))" counter-key="@(context.Request.IpAddress)" estimate-prompt-tokens="true" tokens-consumed-header-name="consumed-tokens" remaining-tokens-header-name="remaining-tokens" />
✅ 3. Integrate Azure OpenAI Semantic Caching
Pre-requisites
- An Azure Cache for Redis Enterprise or Azure Managed Redis instance. The RediSearch module must be enabled on the Redis cache.
- Azure Open AI embeddings model example - text-embedding-3-small
- Create a "backend" resource in the APIM instance which points to embeddings model URL.
To enable semantic caching for Azure OpenAI APIs in Azure API Management, apply the following policies: one to check the cache before sending requests (lookup) and another to store responses for future reuse (store):
In the Inbound processing section for the API, add the azure-openai-semantic-cache-lookup policy. In the embeddings-backend-id attribute, specify the Embeddings API backend you created.
<azure-openai-semantic-cache-lookup
score-threshold="0.8"
embeddings-backend-id="embeddings-backend"
embeddings-backend-auth="system-assigned"
ignore-system-messages="true"
max-message-count="10">
<vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
In the Outbound processing section for the API, add the azure-openai-semantic-cache-store policy.
<azure-openai-semantic-cache-store duration="60" />
✅ 4. Deploy Self-Hosted Gateway
- Provision a gateway resource in API Management instance.
- Deploy the APIM self-hosted gateway in a Docker container or Kubernetes cluster.
- Connect it to your APIM instance using a gateway token.
- Route internal traffic through the self-hosted gateway for compliance and reduced latency.
Reference - Deploy self-hosted gateway to Docker | Microsoft Learn
✅ 5. End-to-End Testing
- Simulate user queries via Postman or a frontend app.
- Validate:
- Response times with and without caching
- TPM enforcement
- Gateway routing and failover
- Logging and analytics in Azure Monitor
🧾 Conclusion
Integrating Azure API Management with Azure OpenAI endpoints and leveraging Azure OpenAI semantic caching unlocks a powerful architecture for building intelligent, scalable, and secure APIs. By thoughtfully managing Token Per Minute (TPM) limits and deploying self-hosted gateways, organizations can ensure high performance, cost efficiency, and compliance across hybrid environments.
This architecture not only supports modern AI-driven applications but also provides the flexibility and control needed for enterprise-grade deployments. Whether you're building a chatbot, a knowledge assistant, or an internal AI tool, this approach offers a robust foundation to scale responsibly and intelligently.