Most startup teams start with the simplest thing that can work. One or two apps call Microsoft Foundry model endpoints directly, traffic is predictable, and “routing” is just a config value in the app.
The gateway pattern becomes necessary when Foundry stops being “an integration” and becomes “a shared platform”. That shift shows up in a few reliable signals:
- You do not fully control client code, or updating client configuration is riskier than updating a central routing configuration.
- You need blue-green rollouts for model versions or fine-tuned variants without forcing every client to redeploy.
- You need server-side retry and circuit breaking semantics to handle throttling and availability without duplicating logic across every app.
- You need consistent token governance and usage visibility across multiple apps and consumers.
On Azure, this is commonly implemented with Azure API Management (APIM) using GenAI-aware “AI Gateway” capabilities, and it can be configured from the Foundry portal and applied per project.
What problems a gateway solves
A production gateway in front of Foundry is not about adding a hop. It is about centralizing cross-cutting concerns that otherwise get reimplemented inconsistently:
- Stable API surface while deployments and backends evolve.
- Consistent auth termination at the gateway, with trust reestablished from the gateway to the model backend (for example with Azure RBAC).
- Token-based throttling and quotas for fairness and cost control across consumers.
- Operational resiliency via backend pools, priority and weight routing, retry, and circuit breaker behavior that honors throttling signals like Retry-After.
- Unified telemetry at the choke point, even when you have multiple underlying instances.
Decoupling clients from backend topology
One secondary but important effect of introducing a gateway is that it shifts backend-specific details out of application code. Clients call a stable API owned by your platform team, while routing, credentials, and failover semantics live behind that boundary. This does not make models interchangeable, and it does not eliminate platform dependencies. What it does is contain them. As backend topology evolves, whether that means new deployments, additional subscriptions, or additional regions, those changes become operational updates rather than coordinated application rewrites.
In practice, this means your platform team owns the API contract and operational semantics, while backend providers remain an implementation detail behind that contract.
One simple mental model
The gateway is the line where your API contract ends and backend topology begins. Everything in front of that line is a promise to clients; everything behind it is an operational detail you can change without coordinating application releases.
Concrete gateway patterns
Choosing the right gateway pattern
The table below summarizes when each pattern is most appropriate, and what trade-offs it introduces.
| Pattern | Primary goal | Isolation level | Throughput scaling | Resiliency impact | Operational complexity |
|---|---|---|---|---|---|
| Single Foundry, multi-deployment routing | Decouple clients from models and enable safe rollouts | Logical only (same resource boundary) | Limited to single resource quotas | Low to moderate (deployment-level) | Low |
| Multi-resource, same region, same subscription | Security segmentation, reliability, backend pooling | Resource-level | Not increased for standard tier | Moderate (backend failover) | Medium |
| Prioritized failover, spillover (PTU → standard) | Cost control and burst protection | Resource-level | Controlled spillover | High (explicit failover semantics) | Medium to high |
| Multi-subscription, same region | Quota expansion, org boundaries, central AI service | Subscription-level | Scales with number of subscriptions | High | High |
| Multi-region | Regional resilience, data residency, global access | Region-level | Region-bounded | Very high | High |
How to read this table:
- If your problem is model lifecycle and client decoupling, start with Pattern 1.
- If your problem is reliability and segmentation, Patterns 2 and 3 are the usual next step.
- If your problem is quota ceilings or organizational boundaries, Pattern 4 comes into play.
- If your problem is regional resilience or global scale, Pattern 5 becomes unavoidable.
Below are the most common patterns that show up as startups move from “one app calling one deployment” to “multiple products, multiple teams, and production SLOs”.
Pattern 1: Single Foundry resource with multi-deployment routing
When you use it
- You run multiple model deployments under one Foundry resource and want to control routing centrally.
- You want safer rollouts (blue-green) without forcing client updates.
What it solves
- Routing decisions move from clients to a single place.
- You can gradually shift traffic between deployments, though you still need safe deployment practices: changing which model serves a request can be a breaking change from the client’s perspective.
Key operational detail
- Strongly consider credential termination and reestablishment. Clients authenticate to the gateway. The gateway authenticates to the model backend via Azure RBAC.
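At its core, the routing decision behind a blue-green split reduces to a weighted choice between deployments. The following is a minimal Python sketch of that idea, not an APIM policy; the deployment names and weights are hypothetical:

```python
import random

# Hypothetical deployment names; weights are percentages of traffic.
ROUTES = [
    {"deployment": "gpt-4o-prod", "weight": 95},    # current model
    {"deployment": "gpt-4o-canary", "weight": 5},   # new version under evaluation
]

def pick_deployment(routes):
    """Weighted random choice: the core of a blue-green traffic split."""
    total = sum(r["weight"] for r in routes)
    roll = random.uniform(0, total)
    cumulative = 0
    for r in routes:
        cumulative += r["weight"]
        if roll <= cumulative:
            return r["deployment"]
    return routes[-1]["deployment"]
```

Shifting traffic during a rollout then means changing the weights in one place, at the gateway, rather than redeploying clients.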
Pattern 2: Multi-resource in the same region and same subscription
When you use it
- You need security segmentation boundaries (separate keys or Azure RBAC per client).
- You want an easier chargeback model.
- You want failover for availability issues, operational mistakes, or pairing provisioned and standard for spillover.
What it solves
- You can treat multiple backends as active-active and load balance across instances.
- You centralize retry and circuit-breaker behavior.
Critical constraint
- Standard quotas are subscription-level, not instance-level. Load balancing across standard instances in the same subscription does not create additional throughput.
Pattern 3: Prioritized failover and planned spillover (PTU first, consumption fallback)
This is the pattern you reach for when you want to maximize utilization of dedicated capacity and still survive bursts and outages.
The AI Gateway workshop describes a “Prioritized PTU with Fallback Consumption” approach using APIM backend pools with priority and weight-based routing, combined with circuit breaker rules and retries for 429 and selected 503 cases.
Concrete implementation details from the workshop that are worth copying into your playbook:
- Configure backend pool across multiple endpoints.
- Add a circuit breaker rule that can trip on throttling (429) and honor Retry-After.
- Use APIM policy to authenticate with managed identity and set the backend to the pool, then retry on 429 or specific 503 conditions.
This moves “resiliency logic” out of every client and into one place you can test and iterate.
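The priority-plus-circuit-breaker behavior can be sketched in a few lines. This is illustrative Python, not the APIM policy itself; the backend names are hypothetical, and the breaker window comes from the backend's own Retry-After hint:

```python
import time

# Hypothetical backends: lower priority number is tried first.
BACKENDS = [
    {"name": "ptu-primary", "priority": 1, "tripped_until": 0.0},
    {"name": "standard-fallback", "priority": 2, "tripped_until": 0.0},
]

def trip(backend, retry_after_seconds):
    """Circuit breaker: take a throttled backend out of rotation
    for the window the backend requested via Retry-After."""
    backend["tripped_until"] = time.time() + retry_after_seconds

def select_backend(backends, now=None):
    """Return the highest-priority backend whose breaker is closed."""
    now = time.time() if now is None else now
    for b in sorted(backends, key=lambda b: b["priority"]):
        if b["tripped_until"] <= now:
            return b["name"]
    return None  # everything is throttled; surface the 429 to the caller
```

Under normal load the PTU backend absorbs all traffic; when it returns 429, the breaker trips and requests spill to the standard backend until the Retry-After window expires.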
Pattern 4: Multi-subscription, same region (quota scaling and centralized service)
When you use it
- You need more quota in standard deployments but must constrain models to a single region.
- You are building a centralized “Microsoft Foundry as a service” model. Standard quota is subscription-bound, so capacity pooling often implies multiple subscriptions.
Implementation tips from the Azure Architecture Center guide
- Prefer subscriptions backed by the same Microsoft Entra tenant for consistency in Azure RBAC and Azure Policy.
- Deploy the gateway in the same region as the backends.
- Consider a dedicated gateway subscription.
- Ensure private endpoints are reachable across subscriptions, including cross-subscription Private Link where supported.
Pattern 5: Multi-region
When you use it
- You need a service availability failover strategy (for example cross-region pairs).
- You have data residency and compliance requirements.
- You face mixed model availability across regions.
The Azure Architecture Center guide calls out that for business-critical architectures that must survive a complete regional outage, a global unified gateway helps eliminate failover logic from client code. It also notes the trade-offs of single-region gateway deployment doing active-active load balancing across regions, including added latency and egress charges for cross-region calls.
Real-world scenarios this architecture supports
These are representative scenarios drawn from common production environments and directly supported by the gateway patterns and reference implementations.
Scenario A: Containing a runaway application
A company has five internal applications sharing the same Foundry environment. One application ships a prompt regression that suddenly increases average request size 8x.
Without a gateway:
- Token consumption spikes globally.
- Other apps experience 429s and degraded latency.
- Root cause takes time to identify because telemetry is scattered.
With an AI Gateway in front of Foundry:
- Token-based limits are enforced per application.
- The faulty app is throttled at the gateway.
- Other applications continue operating normally.
- The gateway telemetry immediately shows which consumer is exhausting the quota.
Outcome:
- Incident blast radius is limited to one consumer.
- No global outage.
- Faster root cause isolation.
Scenario B: Zero-downtime model migration
A startup is migrating from one production deployment to a newer model version.
They deploy the new model alongside the old one and configure the gateway to:
- Route 5 percent of traffic to the new deployment.
- Keep 95 percent on the old deployment.
They observe:
- Error rate.
- Latency.
- Token growth.
Over several days they progressively shift traffic to 100 percent without requiring any client changes.
Outcome:
- No forced redeployments.
- No mass client reconfiguration.
- Rollback is a gateway configuration change, not an emergency code change.
Scenario C: Cost-controlled burst handling
A product runs steady baseline traffic on provisioned capacity and experiences unpredictable spikes.
Gateway configuration:
- Priority backend pool.
- Provisioned deployment as primary.
- Standard deployment as secondary.
- Circuit breaker honors Retry-After.
Normal operation:
- Nearly all traffic hits provisioned throughput.
During spikes:
- Overflow is routed to standard tier.
- The gateway absorbs throttling behavior and retries.
Outcome:
- Provisioned capacity is fully utilized.
- Spikes are handled without hard failures.
- Clients are unaware that backend routing changed.
Scenario D: Subscription quota pooling
An organization reaches standard tier quota ceilings in a single subscription.
They deploy Foundry resources across multiple subscriptions and place a single gateway in front.
Gateway behavior:
- Distributes requests across subscriptions.
- Applies unified token governance.
- Exposes one API endpoint to all internal teams.
Outcome:
- Aggregate usable quota increases.
- Organizational boundaries are preserved.
- Clients remain unaware of backend topology.
Operational playbook
This is the part that separates “it works” from “it survives production”.
1. Authentication strategy
Recommended default
- Terminate client auth at the gateway.
- Reestablish gateway-to-backend authorization via Azure RBAC rather than passing through client secrets.
The AI Gateway workshop provides a concrete example using the `authentication-managed-identity` policy and setting the Authorization header for the backend call.
Guardrail
- If you choose pass-through client credentials, ensure clients cannot bypass the gateway or model restrictions.
2. Token throttling and fairness
You want limits that match how LLMs consume capacity and budget.
- APIM GenAI capabilities emphasize controlled token limits and monitoring for cost efficiency.
- Foundry AI Gateway governance scenarios explicitly include configuring token limits for models at the project level.
Use token throttling as your primary fairness control, then layer request-rate limits if needed.
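The shape of token-based fairness is easy to sketch. The following Python is an illustrative fixed-window, per-consumer token budget, not APIM's actual implementation; the limit and consumer names are hypothetical:

```python
import time
from collections import defaultdict

class TokenQuota:
    """Per-consumer token budget over a fixed one-minute window (sketch).

    Tracks tokens-per-minute rather than requests-per-minute, which is
    closer to how LLM backends meter capacity and cost.
    """
    def __init__(self, tokens_per_minute):
        self.limit = tokens_per_minute
        self.used = defaultdict(int)      # consumer -> tokens this window
        self.window_start = time.time()

    def allow(self, consumer, tokens, now=None):
        now = time.time() if now is None else now
        if now - self.window_start >= 60:  # reset the window
            self.used.clear()
            self.window_start = now
        if self.used[consumer] + tokens > self.limit:
            return False                   # the gateway would answer 429 here
        self.used[consumer] += tokens
        return True
```

Because the budget is keyed per consumer, one runaway application exhausts only its own allowance, which is exactly the blast-radius containment described in Scenario A.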
3. Failover semantics
Two rules that prevent most “self-inflicted outages”:
- Honor Retry-After from the backend when implementing failover and circuit breaker behavior. Do not continuously hit a throttled endpoint returning 429.
- Prefer gateway-side retry and circuit breaking to avoid repeated code and to keep one place to test.
The workshop shows a pragmatic retry condition on 429 and selected 503, combined with backend pool routing and a circuit breaker that can trip on 429 while checking Retry-After.
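The two rules above combine into a small retry loop. This Python sketch stands in for the declarative APIM policy; `send` is a placeholder for the actual backend call:

```python
import time

def call_with_retry(send, max_attempts=3):
    """Retry on 429/503, sleeping for the backend's Retry-After hint.

    `send` is any callable returning (status_code, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in (429, 503):
            return status, body
        if attempt < max_attempts - 1:
            # Honor Retry-After instead of hammering a throttled endpoint.
            delay = float(headers.get("Retry-After", 1))
            time.sleep(delay)
    return status, body
```

Keeping this loop at the gateway means it is written, tested, and tuned once, instead of being reimplemented, usually slightly differently, in every client.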
4. Observability and consumption tracking
A gateway is uniquely positioned to publish telemetry across all consumed models to a single store, which makes unified dashboarding and alerting easier.
APIM’s GenAI positioning highlights token monitoring as part of “cost efficiency”.
The workshop navigation includes model monitoring and consumption tracking as first-class steps in the AI Gateway journey.
Operationally, decide up front what you will dimension your telemetry by (project, tenant, application, environment) and enforce those identifiers at the gateway.
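Enforcing those identifiers can be as simple as validating each usage record before it is published. The dimension names below mirror the list above; the record shape is otherwise hypothetical:

```python
REQUIRED_DIMENSIONS = ("project", "tenant", "application", "environment")

def make_usage_event(dimensions, model, prompt_tokens, completion_tokens):
    """Build one telemetry record, rejecting requests that are missing
    the identifiers you decided to dimension by (sketch)."""
    missing = [d for d in REQUIRED_DIMENSIONS if d not in dimensions]
    if missing:
        raise ValueError(f"missing telemetry dimensions: {missing}")
    return {
        **{d: dimensions[d] for d in REQUIRED_DIMENSIONS},
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```

Rejecting undimensioned traffic at the gateway is what makes per-consumer dashboards and chargeback reliable later; telemetry you cannot attribute is telemetry you cannot act on.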
5. APIOps: Treat gateway configuration as code
Even if you configure the first version in the portal, production systems need repeatability:
- Use a code-driven workflow for policies and configuration so routing and governance changes are reviewed and promoted like any other production change.
- If you adopt a federated model, APIM Workspaces are positioned to help organizations manage APIs more productively and securely.
- Keep an eye on the APIM changelog and GenAI feature updates because gateway capabilities are evolving quickly.
When not to add a gateway
The Architecture Center guide is explicit: If controlling client configuration is as easy as controlling gateway routing, the added reliability, security, cost, maintenance, and performance impact might not be worth it.
Also, if you are using a single instance with multiple deployments primarily to simulate identity segmentation, you might be better served by multiple instances with distinct Azure RBAC boundaries instead of pushing that complexity into gateway logic.
Closing thought
A gateway is not a prerequisite for Foundry. It is an operational maturity step.
When Foundry usage becomes multi-tenant, SLO-driven, and quota-sensitive, the gateway stops being “extra architecture” and becomes the place you express your platform intent. Auth boundaries. Token governance. Failover semantics. Telemetry. And a repeatable APIOps process to keep it all sane as the system evolves.
References
- Use a gateway in front of multiple Azure OpenAI deployments or instances
- Configure AI Gateway in your Foundry resources
- AI gateway in Azure API Management
- Azure API Management
- Ensure resiliency and optimized resource consumption with load balancer & circuit breaker
- Control cost and performance with token quotas and limits
- Keep visibility into AI consumption with model monitoring
- GitHub - Azure-Samples/AI-Gateway
- AI-Gateway/labs/access-controlling
- AI-Gateway/labs/function-calling
- AI-Gateway/labs/model-context-protocol
- AI-Gateway/labs/openai-agents
- AI-Gateway/labs/ai-agent-service
- AI-Gateway/labs/semantic-caching
- AI-Gateway/labs/finops-framework
- AI-Gateway/labs/slm-self-hosting
- AI-Gateway/labs/ai-foundry-deepseek