A hands-on demo of intelligent model routing with real benchmark data
Microsoft Foundry Model Router analyses each prompt in real time and forwards it to the most appropriate LLM from a pool of underlying models. Simple requests go to fast, cheap models; complex requests go to premium ones, all automatically.
I built an interactive demo app so you can see the routing decisions, measure latencies, and compare costs yourself. This post walks through how it works, what we measured, and when it makes sense to use.
The Problem: One Model for Everything Is Wasteful
Traditional deployments force a single choice:
| Strategy | Upside | Downside |
|---|---|---|
| Use a small model | Fast, cheap | Struggles with complex tasks |
| Use a large model | Handles everything | Overpay for simple tasks |
| Build your own router | Full control | Maintenance burden; hard to optimise |
Most production workloads are mixed-complexity. Classification, FAQ look-ups, and data extraction sit alongside code analysis, multi-constraint planning, and long-document summarisation. Paying premium-model prices for the simple 40% is money left on the table.
The Solution: Model Router
Model Router is a trained language model deployed as a single Azure endpoint. For each incoming request it:
- Analyses the prompt — complexity, task type, context length
- Selects an underlying model from the routing pool
- Forwards the request and returns the response
- Exposes the choice via the response.model field
You interact with one deployment. No if/else routing logic in your code.
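For contrast, here's roughly the kind of hand-rolled routing logic you no longer have to write and maintain. This is purely illustrative; the heuristics, thresholds, and mapping to model names are made up for the example:

```typescript
// Illustrative only: a hand-rolled router you would otherwise have to write,
// tune, and maintain yourself. Model Router replaces all of this with a
// single deployment name.
function pickDeployment(prompt: string): string {
  const words = prompt.trim().split(/\s+/).length;
  const looksLikeCode = /\b(function|class|def|SELECT)\b/.test(prompt);

  if (looksLikeCode) return 'gpt-4.1-mini'; // heavier model for code analysis
  if (words > 800) return 'gpt-oss-120b';   // long-context documents
  if (words > 100) return 'gpt-5-mini';     // medium-complexity tasks
  return 'gpt-5-nano';                      // short, simple prompts
}
```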
Routing Modes
| Mode | Goal | Trade-off |
|---|---|---|
| Balanced (default) | Best cost-quality ratio | General-purpose |
| Cost | Minimise spend | May use smaller models more aggressively |
| Quality | Maximise accuracy | Higher cost for complex tasks |
Modes are configured in the Foundry Portal; no code change is needed to switch.
Building the Demo
To make routing decisions tangible, we built a React + TypeScript app that sends the same prompt through both Model Router and a fixed standard deployment (e.g. GPT-5-nano), then compares:
- Which model the router selected
- Latency (ms)
- Token usage (prompt + completion)
- Estimated cost (based on per-model pricing)
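To make those comparisons, each run just needs to be timed and its token usage recorded. Here is a minimal sketch of one timed run, assuming a callChat helper that wraps the chat-completions request shown in the API Integration section below (the names and shapes are illustrative, not the demo's actual code):

```typescript
// Sketch: time one deployment call and capture the fields the demo compares.
// `callChat` is assumed to resolve with the parsed chat-completions response.
type ChatCall = (prompt: string) => Promise<{
  model: string;
  usage: { prompt_tokens: number; completion_tokens: number };
}>;

async function timedRun(callChat: ChatCall, prompt: string) {
  const start = performance.now();
  const data = await callChat(prompt);
  return {
    selectedModel: data.model,            // which underlying model handled the request
    latencyMs: performance.now() - start, // wall-clock latency
    promptTokens: data.usage.prompt_tokens,
    completionTokens: data.usage.completion_tokens,
  };
}
```

Run this once against the model-router deployment and once against the fixed deployment, and you have everything the comparison view needs.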
What You Can Do
- 10 pre-built prompts spanning simple classification to complex multi-constraint planning
- Custom prompt input: enter any text and benchmarks run automatically
- Three routing modes: switch and re-run to see how distribution changes
- Batch mode: run all 10 prompts in one click to gather aggregate stats
API Integration
The integration is a standard Azure OpenAI chat completion call. The only difference is the deployment name (model-router instead of a specific model):
const response = await fetch(
`${endpoint}/openai/deployments/model-router/chat/completions?api-version=2024-10-21`,
{
method: 'POST',
headers: {
'Content-Type': 'application/json',
'api-key': apiKey,
},
body: JSON.stringify({
messages: [{ role: 'user', content: prompt }],
max_completion_tokens: 1024,
}),
}
);
const data = await response.json();
// The key insight: response.model reveals the underlying model
const selectedModel = data.model; // e.g. "gpt-5-nano-2025-08-07"
That data.model field is what makes cost tracking and distribution analysis possible.
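Cost estimation hangs off the same two fields, data.model and data.usage. Here is a sketch of how a per-request estimate could work; the per-1K-token prices below are placeholders, not real Azure rates:

```typescript
// Placeholder prices per 1K tokens (input/output), keyed by model name prefix.
// Illustrative numbers only; substitute the current Azure pricing for each model.
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-5-nano':   { input: 0.00005, output: 0.0004 },
  'gpt-5-mini':   { input: 0.00025, output: 0.0020 },
  'gpt-oss-120b': { input: 0.00015, output: 0.0006 },
  'gpt-4.1-mini': { input: 0.00040, output: 0.0016 },
};

function estimateCost(model: string, promptTokens: number, completionTokens: number): number {
  // response.model carries a version suffix (e.g. "gpt-5-nano-2025-08-07"),
  // so match against a known prefix rather than the full string.
  const key = Object.keys(PRICES).find((p) => model.startsWith(p));
  if (!key) return 0; // unknown model: no price data available
  const { input, output } = PRICES[key];
  return (promptTokens / 1000) * input + (completionTokens / 1000) * output;
}

// Using the `data` object from the call above:
const cost = estimateCost(data.model, data.usage.prompt_tokens, data.usage.completion_tokens);
```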
Results: What the Data Shows
We ran all 10 prompts through both Model Router (Balanced mode) and a fixed standard deployment.
Note: Results vary by run, region, model versions, and Azure load. These numbers are from a representative sample run.
Summary
| Metric | Router (Balanced) | Standard (GPT-5-nano) |
|---|---|---|
| Avg Latency | ~7,800 ms | ~7,700 ms |
| Total Cost (10 prompts) | ~$0.029 | ~$0.030 |
| Cost Savings | ~4.5% | — |
| Models Used | 4 | 1 |
Model Distribution
The router used 4 different models across 10 prompts:
| Model | Requests | Share | Typical Use |
|---|---|---|---|
| gpt-5-nano | 5 | 50% | Classification, summarisation, planning |
| gpt-5-mini | 2 | 20% | FAQ answers, data extraction |
| gpt-oss-120b | 2 | 20% | Long-context analysis, creative tasks |
| gpt-4.1-mini | 1 | 10% | Complex debugging & reasoning |
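That distribution falls straight out of the response.model values collected per run. A small sketch of the tally, assuming each benchmark result records the selected model:

```typescript
// Sketch: count how often each underlying model was selected and its share of
// all requests. `results` holds one entry per benchmark run.
function modelDistribution(
  results: { selectedModel: string }[]
): Record<string, { count: number; share: number }> {
  const counts = new Map<string, number>();
  for (const r of results) {
    counts.set(r.selectedModel, (counts.get(r.selectedModel) ?? 0) + 1);
  }
  const distribution: Record<string, { count: number; share: number }> = {};
  for (const [model, count] of counts) {
    distribution[model] = { count, share: count / results.length };
  }
  return distribution;
}
```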
Across All Three Modes
| Metric | Balanced | Cost-Optimised | Quality-Optimised |
|---|---|---|---|
| Cost Savings | ~4.5% | ~4.7% | ~14.2% |
| Avg Latency (Router) | ~7,800 ms | ~7,800 ms | ~6,800 ms |
| Avg Latency (Standard) | ~7,700 ms | ~7,300 ms | ~8,300 ms |
| Primary Goal | Balance cost + quality | Minimise spend | Maximise accuracy |
| Model Selection | Mixed (4 models) | Prefers cheaper | Prefers premium |
Analysis
What Worked Well
Intelligent distribution: the router didn't just default to one model. It used 4 different models and mapped prompt complexity to model capability: simple classification → nano, FAQ answers → mini, long-context documents → oss-120b, complex debugging → 4.1-mini.
Measurable cost savings across all modes: 4.5% in Balanced, 4.7% in Cost, and 14.2% in Quality mode. Quality mode was the surprise winner: by choosing faster, cheaper models for simple prompts, it actually saved the most while still routing complex requests to capable models.
Zero routing logic in application code: one endpoint, one deployment name. The complexity lives in Azure's infrastructure, not yours.
Operational flexibility: switch between Balanced, Cost, and Quality modes in the Foundry Portal without redeploying your app. Need to cut costs for a high-traffic period? Switch to Cost mode. Need accuracy for a compliance run? Switch to Quality.
Future-proofing: as Azure adds new models to the routing pool, your deployment benefits automatically. No code changes needed.
Trade-offs to Consider
Latency is comparable, not always faster: in Balanced mode, the Router averaged ~7,800 ms vs the Standard deployment's ~7,700 ms, nearly identical. In Quality mode, the Router was actually faster (~6,800 ms vs ~8,300 ms) because it chose more efficient models for simple prompts. The delta depends on which models the router selects.
Savings scale with workload diversity: our 10-prompt test set showed 4.5–14.2% savings. Production workloads with a wider spread of simple vs complex prompts should see larger savings, since the router has more opportunity to route simple requests to cheaper models.
Opaque routing decisions: you can see which model was picked via response.model, but you can't see why. For most applications this is fine; for debugging edge cases you may want to test specific prompts in the demo first.
Custom Prompt Testing
One of the most practical features of the demo is testing your own prompts before committing to Model Router in production.
Workflow:
- Click ✏️ Custom in the prompt selector
- Enter your production-representative prompt
- Click ✓ Use This Prompt — Router and Standard run automatically
- Compare results — repeat with different routing modes
- Use the data to inform your deployment strategy
This lets you predict costs and validate routing behaviour with your actual workload before going to production.
When to Use Model Router
Great Fit
- Mixed-complexity workloads — chatbots, customer service, content pipelines
- Cost-sensitive deployments — where even single-digit percentage savings matter at scale
- Teams wanting simplicity — one endpoint beats managing multi-model routing logic
- Rapid experimentation — try new models without changing application code
Consider Carefully
- Ultra-low-latency requirements — if you need sub-second responses, the routing overhead matters
- Single-task, single-model workloads — if one model is clearly optimal for 100% of your traffic, a router adds complexity without benefit
- Full control over model selection — if you need deterministic model choice per request
Mode Selection Guide
Is accuracy critical (compliance, legal, medical)?
├─ YES → Quality-Optimised
└─ NO → Strict budget constraints?
   ├─ YES → Cost-Optimised
   └─ NO → Balanced (recommended)
Best Practices
- Start with Balanced mode — measure actual results, then optimise
- Test with your real prompts — use the Custom Prompt feature to validate routing before production
- Monitor model distribution — track which models handle your traffic over time
- Compare against a baseline — always keep a standard deployment to measure savings
- Review regularly — as new models enter the routing pool, distributions shift
Technical Stack
| Technology | Purpose |
|---|---|
| React 19 + TypeScript 5.9 | UI and type safety |
| Vite 7 | Dev server and build tool |
| Tailwind CSS 4 | Styling |
| Recharts 3 | Distribution and comparison charts |
| Azure OpenAI API (2024-10-21) | Model Router and standard completions |
Security measures include an ErrorBoundary for crash resilience, sanitised API error messages, AbortController request timeouts, input length validation, and restrictive security headers. API keys are loaded from environment variables and gitignored.
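As an example of the timeout handling, here is a minimal sketch of a fetch guarded by an AbortController; the 30-second budget is an arbitrary illustration, not the demo's actual value:

```typescript
// Sketch: abort the request once it exceeds a time budget and surface a
// sanitised error instead of echoing the raw API response to the UI.
async function fetchWithTimeout(
  url: string,
  init: RequestInit,
  timeoutMs = 30_000 // illustrative budget
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { ...init, signal: controller.signal });
    if (!response.ok) {
      // Keep error messages generic so request details are not leaked.
      throw new Error(`Request failed with status ${response.status}`);
    }
    return response;
  } finally {
    clearTimeout(timer);
  }
}
```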
Source: leestott/router-demo-app: An interactive web application demonstrating the power of Microsoft Foundry Model Router - an intelligent routing system that automatically selects the optimal language model for each request based on complexity, reasoning requirements, and task type.
⚠️ This demo calls Azure OpenAI directly from the browser. This is fine for local development. For production, proxy through a backend and use Managed Identity.
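One way that backend could look, sketched with @azure/identity on a Node server; this is an assumed wiring for illustration, not code from the demo:

```typescript
// Server-side sketch: exchange Managed Identity (or local developer credentials)
// for a token and call Azure OpenAI with a Bearer header instead of an API key.
import { DefaultAzureCredential } from '@azure/identity';

const credential = new DefaultAzureCredential();

async function proxyChat(endpoint: string, prompt: string) {
  // Token scope for Azure OpenAI / Cognitive Services.
  const token = await credential.getToken('https://cognitiveservices.azure.com/.default');

  const response = await fetch(
    `${endpoint}/openai/deployments/model-router/chat/completions?api-version=2024-10-21`,
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${token.token}`,
      },
      body: JSON.stringify({
        messages: [{ role: 'user', content: prompt }],
        max_completion_tokens: 1024,
      }),
    }
  );
  return response.json();
}
```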
Try It Yourself
Quick Start
git clone https://github.com/leestott/router-demo-app/
cd router-demo-app
# Option A: Use the setup script (recommended)
# Windows:
.\setup.ps1 -StartDev
# macOS/Linux:
chmod +x setup.sh && ./setup.sh --start-dev
# Option B: Manual
npm install
cp .env.example .env.local
# Edit .env.local with your Azure credentials
npm run dev
Open http://localhost:5173, select a prompt, and click ⚡ Run Both.
Get Your Credentials
- Go to ai.azure.com → open your project
- Copy the Project connection string (endpoint URL)
- Navigate to Deployments → confirm model-router is deployed
- Get your API key from Project Settings → Keys
Configuration
Edit .env.local:
VITE_ROUTER_ENDPOINT=https://your-resource.cognitiveservices.azure.com
VITE_ROUTER_API_KEY=your-api-key
VITE_ROUTER_DEPLOYMENT=model-router
VITE_STANDARD_ENDPOINT=https://your-resource.cognitiveservices.azure.com
VITE_STANDARD_API_KEY=your-api-key
VITE_STANDARD_DEPLOYMENT=gpt-5-nano
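In a Vite app these values come through import.meta.env. A small sketch of loading and validating them at startup; the variable names match the file above, the helper itself is illustrative:

```typescript
// Sketch: read the Vite-exposed environment variables and fail fast if any
// are missing, rather than sending requests to an undefined endpoint.
interface DeploymentConfig {
  endpoint: string;
  apiKey: string;
  deployment: string;
}

function requireEnv(name: string): string {
  const value = import.meta.env[name];
  if (!value) throw new Error(`Missing environment variable: ${name}`);
  return value;
}

export const routerConfig: DeploymentConfig = {
  endpoint: requireEnv('VITE_ROUTER_ENDPOINT'),
  apiKey: requireEnv('VITE_ROUTER_API_KEY'),
  deployment: requireEnv('VITE_ROUTER_DEPLOYMENT'),
};

export const standardConfig: DeploymentConfig = {
  endpoint: requireEnv('VITE_STANDARD_ENDPOINT'),
  apiKey: requireEnv('VITE_STANDARD_API_KEY'),
  deployment: requireEnv('VITE_STANDARD_DEPLOYMENT'),
};
```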
Ideas for Enhancement
- Historical analysis — persist results to track routing trends over time
- Cost projections — estimate monthly spend based on prompt patterns and volume
- A/B testing framework — compare modes with statistical significance
- Streaming support — show model selection for streaming responses
- Export reports — download benchmark data as CSV/JSON for further analysis
Conclusion
Model Router addresses a real problem: most AI workloads have mixed complexity, but most deployments use a single model. By routing each request to the right model automatically, you get:
- Cost savings (~4.5–14.2% measured across modes, scaling with volume)
- Intelligent distribution (4 models used, zero routing code)
- Operational simplicity (one endpoint, mode changes via portal)
- Future-proofing (new models added to the pool automatically)
The latency trade-off is minimal — in Quality mode, the Router was actually faster than the standard deployment. The real value is flexibility: tune for cost, quality, or balance without touching your code.
Ready to try it? Clone the demo repository, plug in your Azure credentials, and test with your own prompts.
Resources
- Model Router Benchmark Sample: sample app
- Model Router Concepts: official documentation
- Model Router How-To: deployment guide
- Microsoft Foundry Portal: deploy and manage
- Model Router in the Catalog: model listing
- Azure OpenAI Managed Identity: production auth