Model Router (preview) in Azure AI Foundry gives you a single chat deployment that intelligently selects an underlying model (small, large, or reasoning) for each prompt. Instead of hard-coding "which model?" decisions, you send one request and let the router optimize for capability and cost. In this blog, we will explore what the Model Router is, why to use it, how to implement it in TypeScript, and strategies for versioning and monitoring.
What Is the Model Router?
Model Router is a deployable chat model in Azure AI Foundry that routes each prompt to the most suitable underlying chat or reasoning model (for example: gpt-4.1-mini, gpt-4.1-nano, o4-mini, and gpt-5 variants). You interact with it exactly like a standard Chat Completions deployment: the endpoint shape and response schema are nearly identical, and the `model` field in the response reveals which underlying model actually produced the answer.
Key architectural simplifications:
- Single deployment: one set of content filtering and rate limits covers all underlying models.
- Dynamic model choice: cheaper models for simple prompts, reasoning models when complexity warrants.
- Future flexibility: newer underlying models become available automatically when auto-update is enabled.
Why use the Model Router
- Cost Efficiency: Avoid overpaying by defaulting every prompt to a large reasoning model.
- Operational Simplicity: One deployment, one name in configuration, unified logging.
- Performance Balance: Get higher reasoning capacity only when needed.
- Version Agility: A less manual update path when auto-update is enabled on the deployment.
- Observability: The response JSON's `model` field lets you split metrics by underlying model and build routing distribution dashboards.
Take a scenario where a global SaaS platform supports three workloads:
- Customer Support Triage: Receives short classification prompts → router picks nano/mini models → low cost.
- Developer Knowledge Assistant: Occasionally involves complex code reasoning → router escalates to o4-mini / gpt-5 reasoning models only when needed.
- Strategic Analytics Q&A: For deep analytical queries → router selects higher reasoning tier; fewer but more expensive calls.
Expected Impact: By not defaulting workload #1 (Customer Support Triage) to a fixed large reasoning model, monthly LLM spend drops, while workloads #2 (Developer Knowledge Assistant) and #3 (Strategic Analytics Q&A) still get high-quality answers whenever complexity triggers routing up to the reasoning models.
How to use it (TypeScript)
Pre-requisites
- An Azure AI Foundry project
- A Model Router deployment
Set up your endpoint and key. In this example, we use the Azure Inference SDK (`@azure-rest/ai-inference`) and authenticate via API key for simplicity.
In production, use the recommended Microsoft Entra ID (managed identity) authentication instead; a minimal sketch of that path follows below.
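For reference, a minimal keyless sketch with `DefaultAzureCredential` from `@azure/identity` might look like the following; the environment variable name and the token scope are assumptions, so verify them for your resource:

```typescript
import ModelClient from "@azure-rest/ai-inference";
import { DefaultAzureCredential } from "@azure/identity";

// Example environment variable for your Model Router endpoint.
const endpoint = process.env["AZURE_INFERENCE_ENDPOINT"] ?? "<your-endpoint>";

// DefaultAzureCredential resolves to a managed identity in Azure, or to your
// developer credentials (Azure CLI, VS Code, etc.) when running locally.
const client = ModelClient(endpoint, new DefaultAzureCredential(), {
  // Assumed token audience for Azure OpenAI / Azure AI Foundry resources.
  credentials: { scopes: ["https://cognitiveservices.azure.com/.default"] }
});
```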
A link to the full sample repo will be provided at the end of this blog, but here’s a minimal explanation of the core logic.
```typescript
import ModelClient, { isUnexpected } from "@azure-rest/ai-inference";
import { AzureKeyCredential } from "@azure/core-auth";

// Example environment variables for the resource hosting the Model Router deployment.
const endpoint = process.env["AZURE_INFERENCE_ENDPOINT"] ?? "<your-endpoint>";
const key = process.env["AZURE_INFERENCE_KEY"] ?? "<your-api-key>";
const client = ModelClient(endpoint, new AzureKeyCredential(key));

const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Give me a concise 5-bullet travel safety list for solo backpacking in South America." }
];

const response = await client.path("/chat/completions").post({
  body: {
    model: "model-router", // the name of your Model Router deployment
    messages,
    max_tokens: 512,
    temperature: 0.7,
    top_p: 0.95
  }
});

if (isUnexpected(response)) {
  throw response.body.error; // surface API errors instead of reading an error body as a result
}

console.log("Model chosen by the router:", response.body.model ?? "(unknown)");
console.log("Model response:", response.body.choices?.[0]?.message?.content);
console.log("Usage:", response.body.usage);
```
In the above code:
- We create a client with the Model Router endpoint and key.
- We prepare a chat message array containing a system prompt and a single user prompt.
- We send a POST request to `/chat/completions` with a body that includes the generation parameters.
- After checking that the call succeeded, we read the `model` field to see which underlying model the router selected, along with the generated content and usage statistics.
Expect an output similar to:
```
Model chosen by the router: gpt-5-mini-2025-08-07
```

In a multi-turn conversation, you would see the router potentially selecting different underlying models for each turn, based on the evolving context and complexity of the conversation.
For multi-turn conversations, a recommended practice is to trim the chat history to control token growth (both for cost and to stay within context window limits), for example:
```typescript
// ChatMessage matches the { role, content } message shape used above.
type ChatMessage = { role: string; content: string };

function trimHistory(messages: ChatMessage[], maxHistoryTurns: number): ChatMessage[] {
  if (messages.length <= 1) return messages;
  const head = messages.slice(0, 1);  // preserve the system message
  const body = messages.slice(1);
  const keep = maxHistoryTurns * 2;   // one user message + one assistant response per turn
  return [...head, ...body.slice(-keep)];
}
```
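To illustrate how this fits into a conversation loop, here is a hypothetical sketch that reuses the `client` and `isUnexpected` import from the earlier snippet; the follow-up prompts are placeholders:

```typescript
let history: ChatMessage[] = [{ role: "system", content: "You are a helpful assistant." }];

// Placeholder prompts; in a real application these come from the user.
const userTurns = [
  "Summarize the travel safety list in one sentence.",
  "Now draft a detailed, step-by-step emergency plan for a lost passport."
];

for (const prompt of userTurns) {
  history.push({ role: "user", content: prompt });
  history = trimHistory(history, 4); // keep the system message plus the last 4 turns

  const turn = await client.path("/chat/completions").post({
    body: { model: "model-router", messages: history, max_tokens: 512 }
  });
  if (isUnexpected(turn)) throw turn.body.error;

  const reply = turn.body.choices?.[0]?.message?.content ?? "";
  history.push({ role: "assistant", content: reply });
  console.log(`Router picked ${turn.body.model} for: "${prompt}"`);
}
```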
Expected output: different models selected by the Model Router based on prompt complexity.

Versioning & Monitoring
Each router version maps to a fixed set of underlying models and versions. Upgrading the router can therefore shift cost profiles, latency, and context window behavior.
Upgrade policy options (for standard deployments):
- OnceNewDefaultVersionAvailable (auto upgrade soon after default changes)
- OnceCurrentVersionExpired (upgrade at retirement boundary; default if policy isn't set)
- NoAutoUpgrade (never auto; you must manually update or usage stops at retirement)
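For orientation only, here is a rough sketch of where this policy lives on the deployment resource. It assumes the `@azure/arm-cognitiveservices` management SDK, uses placeholder names and versions, and should be checked against the current API surface before use:

```typescript
import { DefaultAzureCredential } from "@azure/identity";
import { CognitiveServicesManagementClient } from "@azure/arm-cognitiveservices";

// Placeholder identifiers; substitute your own subscription, resource group,
// Azure AI Foundry (Azure OpenAI) resource name, and router deployment name.
const armClient = new CognitiveServicesManagementClient(
  new DefaultAzureCredential(),
  "<subscription-id>"
);

await armClient.deployments.beginCreateOrUpdateAndWait(
  "<resource-group>",
  "<account-name>",
  "model-router",
  {
    sku: { name: "GlobalStandard", capacity: 1 }, // adjust to your deployment type and quota
    properties: {
      model: { format: "OpenAI", name: "model-router", version: "<router-version>" },
      // Pin the router version and opt out of automatic upgrades.
      versionUpgradeOption: "NoAutoUpgrade"
    }
  }
);
```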
A recommended approach:
- Dev/Test: Enable auto-update to quickly assess improvements.
- Staging: Compare token usage & routing distribution (e.g., percentage of reasoning models) before scaling.
- Prod: Scale after regression checks (e.g., latency & cost KPIs).
Use Azure Monitor and Application Insights to track routing distribution and performance by filtering on your router deployment name and splitting metrics by underlying model. In Cost Analysis, use the 'Deployment' resource tag to isolate router consumption, and track KPIs such as the percentage of reasoning vs. non-reasoning model selections and the average tokens per turn.
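If you also want a quick in-application view of the routing distribution, a minimal sketch could tally the `model` field of each successful response before exporting it to your metrics pipeline (the function names here are illustrative):

```typescript
// Illustrative in-memory tally of which underlying model handled each request.
const routingCounts = new Map<string, number>();

function recordRouting(underlyingModel: string | undefined): void {
  const key = underlyingModel ?? "(unknown)";
  routingCounts.set(key, (routingCounts.get(key) ?? 0) + 1);
}

// Call recordRouting(response.body.model) after each successful request, then
// periodically log or export the distribution.
function logRoutingDistribution(): void {
  const total = [...routingCounts.values()].reduce((sum, n) => sum + n, 0);
  if (total === 0) return;
  for (const [model, count] of routingCounts) {
    console.log(`${model}: ${count} (${((count / total) * 100).toFixed(1)}%)`);
  }
}
```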
Known Limitations
- Context Window Variability. API calls with large prompts may fail if they are routed to a model with a smaller context window. To mitigate this, summarize or truncate prompts before sending them to reduce input size.
- Parameter Dropping for Reasoning Models. Parameters such as temperature and top_p don't apply to reasoning models, because these models prioritize internal, deterministic multi-step reasoning over sampling randomness. To mitigate this, express critical output constraints through prompt engineering rather than relying on these parameters.
- Unsupported Modalities. Model Router currently supports only text/image input and text output. If your application requires audio or other modalities, you will need to process those requests separately.
- Latency Spikes. Expect larger reasoning models to incur higher processing time than the smaller alternatives. Monitor latency metrics and implement fallback mechanisms for time-sensitive workloads; for example, you could set up a circuit breaker that routes requests to a fixed cheaper, smaller model if latency exceeds a certain threshold, as sketched below.
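A minimal sketch of such a fallback is shown below. It reuses the `client` and `isUnexpected` import from earlier; the deployment names and latency budget are placeholders, and note that Promise.race does not cancel the in-flight router call:

```typescript
// Hypothetical latency guard: race the router call against a time budget and,
// if it is too slow or fails, retry against a fixed smaller deployment.
async function completeWithFallback(prompt: string, budgetMs = 5000): Promise<string> {
  const call = async (deployment: string) => {
    const res = await client.path("/chat/completions").post({
      body: { model: deployment, messages: [{ role: "user", content: prompt }], max_tokens: 512 }
    });
    if (isUnexpected(res)) throw res.body.error;
    return res.body.choices?.[0]?.message?.content ?? "";
  };

  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("Router latency budget exceeded")), budgetMs)
  );

  try {
    return await Promise.race([call("model-router"), timeout]);
  } catch {
    return await call("gpt-4.1-nano"); // placeholder name of a fixed, cheaper deployment
  }
}
```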
Resources
- Code examples (Python & TypeScript): https://github.com/Azure-Samples/insideAIF/tree/main/Samples/Model-Router
- Model Router How-To: https://learn.microsoft.com/azure/ai-foundry/openai/how-to/model-router
- Model Router Concepts: https://learn.microsoft.com/azure/ai-foundry/openai/concepts/model-router
- Working With Models (Version Upgrade Options): https://learn.microsoft.com/azure/ai-foundry/openai/how-to/working-with-models
- Quotas & Limits (Router rate limits, context guidance): https://learn.microsoft.com/azure/ai-foundry/openai/quotas-limits
- Models Catalog (Router region availability & capabilities): https://learn.microsoft.com/azure/ai-foundry/openai/concepts/models#model-router
- Reasoning Models (parameter support differences): https://learn.microsoft.com/azure/ai-foundry/openai/how-to/reasoning