A pattern for consolidating multi-model routing into a single governed deployment — with built-in failover, data-zone enforcement, and per-prompt cost/quality trade-offs across 18 underlying LLMs
The architectural problem
In any non-trivial GenAI platform, you end up managing a fleet of models. Cheap models for classification and light chat. Reasoning models for multi-step tasks. Frontier models for the hard stuff. Specialty models for code, vision, or long context.
The architectural question isn’t which model is best — it’s how do we dispatch the right model per request, at scale, with governance and observability intact?
The usual patterns each have problems:
| Pattern | Trade-off |
| --- | --- |
| Single-model deployment | Overpays on simple prompts, underperforms on complex ones |
| Application-layer router (rules/classifier) | Brittle, needs constant retuning as models evolve |
| LLM-as-router | Adds a call hop, governance complexity, and its own failure modes |
| Per-use-case deployments | Explodes deployment surface; quota and cost reporting fragment |
Model Router in Microsoft Foundry is a platform-level answer to this: a trained routing model, deployed as a single endpoint, that dispatches across up to 18 underlying LLMs per prompt.
Conceptual architecture
Design note: The routing decision is made by a trained model, not a rules engine. It analyzes the prompt itself — complexity, task type, reasoning requirements — and is updated by Microsoft as new underlying models are onboarded.
What you govern, what the platform governs
For architects, the division of responsibility is the key mental model:
Platform-owned
- Real-time prompt analysis and routing decisions
- Automatic failover across the subset
- Data-zone boundary enforcement
- Prompt-caching passthrough to supporting models
- Underlying-model versioning (via router versioning)
You own
- Routing mode — Balanced (default), Quality, or Cost
- Model subset — the allow-list of underlying models
- Deployment type — Global Standard or Data Zone Standard
- Region — East US 2 or Sweden Central (current availability)
- Observability hooks — logging response.model for per-request attribution
Routing modes as design levers
| Mode | Quality band | When to use |
| --- | --- | --- |
| Balanced (default) | Within ~1–2% of the top model's quality | General-purpose chat and agent surfaces |
| Quality | Always routes to the top model | Regulated outputs, complex reasoning, RAG over critical docs |
| Cost | Within ~5–6% of the top model's quality | High-volume classification, drafting, low-stakes chat |
Treat the routing mode as a deployment-scoped SLO lever. Different product surfaces can point at different Model Router deployments with different modes and subsets.
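A minimal sketch of that split, assuming hypothetical deployment names; the selection happens entirely in your application via the model parameter:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-key>",
    api_version="2025-11-18",
)

# Hypothetical deployments, each a Model Router with its own mode/subset.
ROUTER_BY_SURFACE = {
    "support_chat": "model-router-cost",        # Cost mode, narrow subset
    "contract_review": "model-router-quality",  # Quality mode
    "default": "model-router-balanced",         # Balanced mode
}

def deployment_for(surface: str) -> str:
    """Map a product surface to its Model Router deployment."""
    return ROUTER_BY_SURFACE.get(surface, ROUTER_BY_SURFACE["default"])

response = client.chat.completions.create(
    model=deployment_for("support_chat"),
    messages=[{"role": "user", "content": "Where is my order?"}],
)
```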
The model subset: your governance surface
This is the feature most worth deliberate design thought. The subset list governs:
- Compliance — which vendors/regions your prompts can touch
- Context window — the effective context window equals that of the smallest-context model in the subset; curate accordingly
- Cost ceiling — bound worst-case per-call cost
- Failover pool — keep at least two models in every subset
- Cache hit rate — narrower, more deterministic subsets improve the odds that consecutive overlapping prompts land on the same underlying model
New models introduced in future router versions are not auto-added to your subset. That’s a deliberate guardrail — additions require explicit deployment changes.
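One cheap guardrail to pair with the subset: validate at runtime that responses actually come from the allow-list. A minimal sketch, assuming a hypothetical subset and relying on the version suffix that response.model carries:

```python
import logging

# Hypothetical allow-list mirroring the deployment's modelSubset.
ALLOWED_MODELS = {"gpt-5-mini", "gpt-5", "claude-sonnet-4-5", "o4-mini"}

def assert_in_subset(served_model: str) -> None:
    """Log loudly if a response came from outside the governed subset.

    response.model includes a version suffix (e.g. 'gpt-5-mini-2025-08-07'),
    so match on prefix rather than strict equality.
    """
    if not any(served_model.startswith(m) for m in ALLOWED_MODELS):
        logging.warning("Served model %s is outside the governed subset", served_model)
```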
Code: deploying with a custom subset
Model Router is deployed like any Foundry model. Below is an indicative ARM/Bicep-style deployment snippet that sets Balanced mode and restricts routing to a curated subset — omit modelSubset to accept the full default pool.
```bicep
resource modelRouter 'Microsoft.CognitiveServices/accounts/deployments@2024-10-01' = {
  name: 'model-router-prod'
  parent: foundryAccount
  sku: {
    name: 'GlobalStandard'
    capacity: 250
  }
  properties: {
    model: {
      format: 'OpenAI'
      name: 'model-router'
      version: '2025-11-18'
    }
    routingConfiguration: {
      mode: 'Balanced' // Balanced | Quality | Cost
      modelSubset: [
        'gpt-5-mini'
        'gpt-5'
        'gpt-5.2'
        'claude-sonnet-4-5'
        'claude-opus-4-6'
        'o4-mini'
      ]
    }
  }
}
```
Confirm the exact schema against the current Foundry deployment API — parameter names can evolve between API versions.
Deploying via the Foundry portal
If you prefer the portal over IaC, the flow is short:
- Sign in to Microsoft Foundry and ensure the New Foundry toggle is on.
- Open the model catalog, find model-router, and select it.
- Choose Default settings for Balanced mode across all supported models, or Custom settings to pick a routing mode and a model subset.
- Apply a content filter at the model router deployment — it covers all underlying models. Don’t set per-model content filters.
- Set the TPM rate limit at the model router level — it applies to all activity to and from the router. Don’t set rate limits per underlying model.
- (Claude only) Deploy Claude models separately from the catalog before adding them to your subset. Other vendors are invoked transparently.
Propagation note: changes to routing mode or model subset can take up to five minutes to take effect. Plan rollouts and tests accordingly.
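If you automate rollouts, a canary probe can confirm propagation before shifting traffic. A sketch under two assumptions: a trivial prompt exercises the new configuration, and the old and new subsets differ enough for response.model to tell them apart:

```python
import time

def wait_for_subset(client, deployment: str, allowed: set,
                    timeout_s: int = 600, interval_s: int = 30) -> bool:
    """Poll a trivial prompt until responses come from the expected subset."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = client.chat.completions.create(
            model=deployment,
            messages=[{"role": "user", "content": "ping"}],
        )
        if any(resp.model.startswith(m) for m in allowed):
            return True
        time.sleep(interval_s)
    return False

# Usage after updating the deployment's modelSubset:
# wait_for_subset(client, "model-router-prod", {"gpt-5-mini", "o4-mini"})
```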
Code: calling the endpoint (Python)
Once deployed, Model Router is a standard chat-completions endpoint. Always capture response.model — it’s your per-request attribution for cost analysis and routing validation.
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-key>",
    api_version="2025-11-18",
)

response = client.chat.completions.create(
    model="model-router-prod",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the trade-offs of event sourcing at scale."},
    ],
)

print(response.choices[0].message.content)
print("Served by:", response.model)  # e.g. "gpt-5-mini-2025-08-07"
```
Code: streaming responses
Streaming works exactly as it does for any Azure OpenAI chat deployment. The routing decision happens before the first token; once chosen, the underlying model streams directly.
```python
stream = client.chat.completions.create(
    model="model-router-prod",
    messages=[
        {"role": "user", "content": "Walk me through CAP theorem with a concrete example."},
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Code: tool use (agentic scenarios)
The 2025-11-18 release adds tool-use support, enabling Model Router inside the Foundry Agent Service. The router picks the right model per turn — cheap for trivial turns, reasoning-grade for multi-step ones.
```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Retrieve the current status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order ID."},
            },
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="model-router-prod",
    messages=[
        {"role": "system", "content": "You help customers track orders."},
        {"role": "user", "content": "Where is order A-4571?"},
    ],
    tools=tools,
    tool_choice="auto",
)

choice = response.choices[0]
if choice.message.tool_calls:
    call = choice.message.tool_calls[0]
    print("Tool requested:", call.function.name, call.function.arguments)
print("Served by:", response.model)
```
Agent Service caveat: if your agent flow uses Foundry Agent Service tools, routing is restricted to OpenAI models only. Plan your subset accordingly when the router sits behind agent flows that depend on those tools.
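If you keep one subset definition in code, you can derive the agent-safe variant from it. A sketch; the vendor prefixes are assumptions, so adjust them to the catalog names you actually deploy:

```python
# Assumed OpenAI naming prefixes -- verify against your catalog.
OPENAI_PREFIXES = ("gpt-", "o3", "o4")

def openai_only(subset: list) -> list:
    """Filter a subset to OpenAI models for agent-tool-backed deployments."""
    return [m for m in subset if m.startswith(OPENAI_PREFIXES)]

subset = ["gpt-5-mini", "gpt-5", "claude-sonnet-4-5", "o4-mini"]
print(openai_only(subset))  # ['gpt-5-mini', 'gpt-5', 'o4-mini']
```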
Code: alternative — Foundry Responses SDK
If you’re standardizing on the Microsoft Foundry SDK rather than the OpenAI Python SDK, the Responses API offers an equivalent path. Install: pip install azure-ai-projects>=2.0.0 azure-identity.
```python
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

with (
    DefaultAzureCredential() as credential,
    AIProjectClient(endpoint=project_endpoint, credential=credential) as project_client,
    project_client.get_openai_client() as openai_client,
):
    response = openai_client.responses.create(
        model="model-router-prod",
        input="In one sentence, name the most popular tourist destination in Seattle.",
    )
    print(response.output_text)
```
Parameter handling when a reasoning model is selected
Because Model Router can dispatch to either chat or reasoning (o-series) models, parameter behavior shifts with the model that actually serves the request. Build your application to tolerate both behaviors.
- temperature, top_p — ignored when an o-series reasoning model is selected; honored otherwise.
- stop, presence_penalty, frequency_penalty, logit_bias, logprobs — dropped for o-series; honored otherwise.
- reasoning_effort — supported starting in the 2025-11-18 router release. When a reasoning model is selected, the router passes your value through to the underlying model.
Practical rule: don’t rely on temperature/top-p for determinism in a router-fronted deployment, and treat reasoning_effort as the only knob with consistent meaning across reasoning vs. non-reasoning paths.
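A small helper makes that rule mechanical: build request kwargs without temperature/top_p and carry reasoning_effort through. This sketch reuses the client from the earlier example; verify how your router version treats reasoning_effort when a non-reasoning model is selected:

```python
def router_kwargs(reasoning_effort: str = "medium") -> dict:
    """Request parameters that mean the same thing on both routing paths.

    temperature/top_p are deliberately omitted: they are silently ignored
    when an o-series model is selected, so depending on them makes behavior
    path-dependent. reasoning_effort is passed through to reasoning models
    per the 2025-11-18 release notes.
    """
    return {"reasoning_effort": reasoning_effort}

response = client.chat.completions.create(
    model="model-router-prod",
    messages=[{"role": "user", "content": "Plan a three-step data migration."}],
    **router_kwargs("high"),
)
```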
Anatomy of the response
The JSON shape is identical to a standard chat completion. The model field is the key signal — it tells you which underlying model actually served the request. The usage block also reveals cached_tokens (prompt-cache hits) and reasoning_tokens (when an o-series model handled the prompt).
```json
{
  "id": "xxxx-yyyy-zzzz",
  "object": "chat.completion",
  "model": "gpt-5-mini-2025-08-07",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Charismatic and bold—combining brash showmanship..."
      },
      "content_filter_results": {
        "hate": { "filtered": false, "severity": "safe" },
        ...
      }
    }
  ],
  "usage": {
    "prompt_tokens": 3254,
    "completion_tokens": 163,
    "total_tokens": 3417,
    "prompt_tokens_details": { "cached_tokens": 3200, "audio_tokens": 0 },
    "completion_tokens_details": { "reasoning_tokens": 128, "audio_tokens": 0 }
  }
}
```
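Pulling those signals out in code is straightforward. A sketch using the field names shown above; the details objects can be absent for some models, hence the defensive getattr:

```python
usage = response.usage
cached = getattr(usage.prompt_tokens_details, "cached_tokens", 0) or 0
reasoning = getattr(usage.completion_tokens_details, "reasoning_tokens", 0) or 0

print(f"served_by={response.model}")
print(f"prompt_tokens={usage.prompt_tokens} (cached={cached})")
print(f"completion_tokens={usage.completion_tokens} (reasoning={reasoning})")
```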
Monitoring in the Azure portal
Performance metrics
- Open the Azure portal and navigate to Monitoring → Metrics for your Azure OpenAI / Foundry resource.
- Filter by your model router deployment name.
- Split the metrics by underlying model to see how traffic is actually being distributed across the routed models.
Cost attribution
- Open Resource Management → Cost analysis in the Azure portal.
- Filter by Tag, set the tag type to Deployment, and select your model router deployment name.
- Total cost = sum of the underlying-model charges for requests that hit this deployment.
Three practical recommendations
- Log response.model on every call. This is your primary application-side signal for routing distribution and per-request attribution (see the sketch after this list).
- Expect mixed-model billing. Model Router charges at the rate of the underlying model that served each request. Cross-check Azure Cost analysis against your application logs.
- Watch cache hit rates per underlying model. Caching benefits apply only when consecutive overlapping prompts land on the same model. A too-permissive subset can silently degrade cache efficiency.
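For the first recommendation, a counter is often all you need to see the routing distribution. A minimal sketch:

```python
from collections import Counter

routing_distribution = Counter()

def record(response) -> None:
    """Tally which underlying model served each request."""
    routing_distribution[response.model] += 1

# After a representative traffic window:
total = sum(routing_distribution.values())
for model, count in routing_distribution.most_common():
    print(f"{model}: {count} ({count / total:.1%})")
```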
Failure modes to design around
- Context-window overrun. The effective context window is that of the smallest-context model in the subset. A prompt that exceeds it fails whenever routing lands on that model. Defend against this by curating the subset or by summarizing/truncating upstream (see the guard sketch after this list).
- Claude model not routing. Claude requires a separate catalog deployment first. Surface a deployment health check.
- Region/deployment-type mismatch. Currently East US 2 and Sweden Central; Global Standard and Data Zone Standard only. Plan DR accordingly.
- Rate limits. 250 RPM / 250K TPM on Global Standard by default; higher on Enterprise/MCA-E. Build backpressure early.
- Audio unsupported. Images are accepted but routing decisions are text-only.
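For the context-window overrun, a pre-flight token check catches oversized prompts before they hit the endpoint. A sketch that uses tiktoken's cl100k_base as a cross-model approximation; the context limit here is hypothetical and should be derived from your actual subset:

```python
import tiktoken

# Hypothetical: context window of the smallest-context model in your subset.
SMALLEST_CONTEXT_TOKENS = 128_000
ENC = tiktoken.get_encoding("cl100k_base")  # rough cross-model approximation

def fits_router_context(messages: list, reserve_for_output: int = 4_000) -> bool:
    """Rough pre-flight check before sending a prompt to the router."""
    prompt_tokens = sum(len(ENC.encode(m["content"])) for m in messages)
    return prompt_tokens + reserve_for_output <= SMALLEST_CONTEXT_TOKENS
```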
Common issues — quick reference
| Issue | Likely cause | Resolution |
| --- | --- | --- |
| Rate limit exceeded | Too many requests to the router deployment | Increase TPM quota or implement retry with exponential backoff (see the sketch after this table) |
| Unexpected model selection | Routing logic picked a different model than expected | Review routing mode; constrain via model subset |
| High latency | Router overhead plus underlying-model processing | Use Cost mode for latency-sensitive workloads; smaller models respond faster |
| Claude model not routing | Claude requires a separate catalog deployment | Deploy Claude models from the catalog before adding to subset |
| Context exceeded | Effective context = smallest model in subset | Curate subset to larger-context models, or summarize/truncate upstream |
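For the rate-limit row, jittered exponential backoff is the standard client-side defense. A minimal sketch using the OpenAI SDK's RateLimitError:

```python
import random
import time

from openai import RateLimitError

def create_with_backoff(client, max_retries: int = 5, **kwargs):
    """Retry chat completions on 429s with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter to avoid thundering herds
            time.sleep(2 ** attempt + random.random())
```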
When Model Router is the right architectural choice
Strong fit:
- Heterogeneous traffic — wide variance in prompt complexity
- Multi-vendor LLM strategy (OpenAI + Anthropic + open models) that you want to consolidate behind a single governed endpoint
- Agent platforms where tasks span trivial to complex reasoning
Weaker fit:
- Uniform workloads where a single well-chosen model is simpler
- Workloads dominated by large-context prompts (unless the subset is curated for it)
- Scenarios requiring deterministic, reproducible model selection per request — the router is intentionally adaptive
Recommended rollout path
- Phase 1 — Baseline. Deploy Model Router with Balanced mode and the full pool. Log response.model across representative traffic.
- Phase 2 — Govern. Introduce a model subset based on your compliance, context, and cost requirements. Ensure at least two models for failover.
- Phase 3 — Tune. Only after the baseline distribution tells you which way to lean, switch to Cost or Quality mode, or split product surfaces across two Model Router deployments with different profiles.
- Phase 4 — Integrate. Wire the router into Foundry Agent Service for agentic surfaces.
Closing thought
Model Router turns multi-model dispatch from an application concern into a platform concern, with governance levers (mode, subset, region) that map cleanly to the trade-offs architects actually negotiate: cost, quality, compliance, and resilience. That’s a meaningful simplification of an otherwise accidentally-complex part of production GenAI architecture.
Sample repositories
Microsoft publishes several open-source samples in the foundry-samples GitHub organization that are useful for hands-on evaluation:
- Model Router Capabilities Interactive Demo (Python). Compare Balanced, Cost, and Quality routing modes against your own prompt sets; see live benchmark data for cost savings, latency, and routing distribution.
- Routed Models Distribution Analysis (Python). Run prompt batches across routing profiles and model subsets to inspect which models the router selects and in what proportions — useful before committing to a routing policy.
- Multi-team Quality & Cost Benchmarking (Python workshop). Deploy Model Router, benchmark against fixed-model deployments, and analyze cost/latency trade-offs in a multi-team enterprise scenario.
- On-Call Copilot Multi-Agent Demo (Python). See per-step model selection inside an agent flow — fast/cheap models for classification, reasoning models for root-cause analysis.
These samples are for learning and experimentation. Review against your organization’s security, compliance, and Responsible AI policies before adapting any of it for production.
Learn more
- Model router for Microsoft Foundry concepts
- How to use model router
- Azure OpenAI in Microsoft Foundry models
If you’re piloting Model Router, what subset and mode did you land on — and what surprised you in the routing distribution? Share in the comments.