How to smooth traffic and avoid gridlocks
In my previous blog post (Azure OpenAI offering models - Explain it Like I'm 5 | Microsoft Community Hub), I compared the Azure OpenAI service to a highway. In the standard offering, all users (cars) share common capacity (lanes) as they use the service. In times of congestion or high usage, traffic can slow down because so many cars are on the road at once. Provisioned Throughput (PTU) deployments are like private express lanes on that highway, designed for consistency and low variability. These private lanes guarantee that a certain amount of traffic can move through the service at a steady speed. And because the lane is yours, you control how much traffic it carries and are not impeded by other cars (other users) on the service.
But what happens when too many cars try to get on your private lane at once, and the lane reaches its full capacity before you can purchase another lane? In the past, a barrier separated the private lanes from the public lanes, so in this case you were forced to slow down. Recently, Azure OpenAI released a new feature called PTU Spillover to alleviate this problem. Think of Spillover as converting that barrier into exit lanes that let the extra cars safely switch over to a regular lane (standard deployment) so that everyone can keep moving without a jam.
What is Spillover?
Spillover is a clever way to handle traffic surges in your Azure OpenAI deployments. When your provisioned deployment (your express lane) is fully busy, any extra requests are automatically rerouted to a standard deployment (your regular lane) for processing. This means that even if there’s a sudden burst of traffic, your service keeps running smoothly without all requests coming to a halt.
Prerequisites: Setting Up Your Highway
Before you can enjoy the smooth flow of traffic management, here’s what you need:
- A primary (provisioned) deployment—this is like your reserved, express highway lane.
- A standard deployment—this will be your spillover lane for extra traffic.
- Both deployments need to be part of the same Azure OpenAI Service resource (imagine they’re on the same highway system!).
- Both deployments must have the same data processing type (for example, a global provisioned lane can only spill over to a global standard lane).
More info here: Manage traffic with spillover for Provisioned deployments - Azure AI services | Microsoft Learn
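The checklist above can be sketched as a small pre-flight check. This is only an illustration of the rules listed in this post: the `Deployment` class, its fields, and the `spillover_problems` helper are hypothetical names, not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of a deployment, for illustration only."""
    name: str
    resource: str         # Azure OpenAI resource the deployment belongs to
    kind: str             # "provisioned" or "standard"
    data_processing: str  # e.g. "global"


def spillover_problems(primary: Deployment, spillover: Deployment) -> list[str]:
    """Return the prerequisites from the list above that this pair violates."""
    problems = []
    if primary.kind != "provisioned":
        problems.append("primary deployment must be provisioned")
    if spillover.kind != "standard":
        problems.append("spillover deployment must be standard")
    if primary.resource != spillover.resource:
        problems.append("both deployments must be in the same resource")
    if primary.data_processing != spillover.data_processing:
        problems.append("both deployments must share a data processing type")
    return problems
```

An empty list means the pair satisfies every prerequisite named above; the service itself remains the final authority on whether a configuration is accepted.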
How to Enable Spillover
- By Default (recommended): To keep traffic moving during rush hours, you can configure your PTU deployment so that spillover applies to every request by default. If there's a sudden spike or burst in usage (more cars than the express lane can handle), spillover automatically sends the extra requests to the standard lane.
- By Request: Alternatively, if you want more control, you can enable spillover on a per-request basis, much like choosing when to take an exit on the highway.
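Here is a rough sketch of what the two options could look like from client code. It assumes, based on my reading of the Microsoft Learn page linked above, that the deployment-level setting is a `spilloverDeploymentName` property and that the per-request switch is an `x-ms-spillover-deployment` header; please verify both names against the current documentation. The helper functions themselves are hypothetical and only build dictionaries, they do not call the service.

```python
from typing import Optional

# Assumed header name; confirm against the Microsoft Learn spillover page.
SPILLOVER_HEADER = "x-ms-spillover-deployment"


def deployment_properties_with_spillover(properties: dict, standard_deployment: str) -> dict:
    """By default: copy the deployment properties and name the spillover target."""
    updated = dict(properties)
    # Assumed property name; confirm against the Microsoft Learn spillover page.
    updated["spilloverDeploymentName"] = standard_deployment
    return updated


def request_headers(api_key: str, spillover_to: Optional[str] = None) -> dict:
    """By request: add the spillover header only when a target is given."""
    headers = {"Content-Type": "application/json", "api-key": api_key}
    if spillover_to is not None:
        headers[SPILLOVER_HEADER] = spillover_to
    return headers
```

With the default option you set the target once on the provisioned deployment; with the per-request option each caller decides whether a given request is allowed to take the exit lane.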
When Does Spillover Kick In?
Picture this: you’re driving down the provisioned deployment lane, and suddenly your car (request) hits a snag (a non-200 response)—it’s like encountering heavy congestion. At that moment, spillover is activated, and your car is automatically guided to the spillover (standard) lane where there’s free space and the journey can continue smoothly. Importantly, Azure OpenAI always prioritizes sending your car through the express lane until it’s absolutely necessary to switch.
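To make the routing rule concrete, here is a toy simulation of it. Spillover happens inside the service, so you do not write this logic yourself; the `route` function and the fake lane callables are hypothetical and only mirror the behavior described above (express lane first, standard lane on a non-200 response).

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)
Lane = Callable[[str], Response]


def route(request: str, provisioned: Lane, standard: Lane,
          spillover_enabled: bool = True) -> Tuple[str, Response]:
    """Try the express lane first; switch lanes only on a non-200 response."""
    response = provisioned(request)
    if response[0] == 200 or not spillover_enabled:
        return "provisioned", response
    # The express lane could not serve the request, so take the exit lane.
    return "standard", standard(request)
```

Note that the standard lane is only ever called after the provisioned lane has returned a non-200 response, which matches the "express lane first" priority.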
How Does Spillover Affect Costs?
There are two billing scenarios on our smart highway:
- Cars staying in the express lane (provisioned deployment) are billed using a fixed hourly cost—no extra fees.
- Cars that have taken the exit lane (routed to the standard deployment) are billed based on the tokens processed (input, cached, output tokens).
In simple terms, while you always pay the reserved fee for using your express lane (PTU charges), any cars using the overflow exit lane incur costs based on what they process along the journey. There is no surcharge for having the spillover turned on in your deployment.
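The two billing scenarios add up as in the following sketch. All prices and token counts are made-up placeholders chosen for easy arithmetic, not actual Azure OpenAI rates, and `total_cost` is a hypothetical helper.

```python
def total_cost(provisioned_hourly_cost: float, hours: float,
               spilled_tokens: dict, price_per_million: dict) -> float:
    """Fixed express-lane fee plus token-based charges for spilled requests.

    spilled_tokens and price_per_million use the keys "input", "cached", "output".
    """
    fixed = provisioned_hourly_cost * hours  # paid whether or not traffic spills
    variable = sum(
        spilled_tokens[kind] / 1_000_000 * price_per_million[kind]
        for kind in ("input", "cached", "output")
    )
    return fixed + variable


# Placeholder numbers only: 10 hours at 100.0 per hour, plus some spilled tokens.
example = total_cost(
    100.0, 10,
    {"input": 2_000_000, "cached": 1_000_000, "output": 500_000},
    {"input": 2.0, "cached": 1.0, "output": 8.0},
)
# 1000.0 fixed + 4.0 input + 1.0 cached + 4.0 output = 1009.0
```

If nothing spills, the variable part is zero and you pay only the reserved fee, which reflects the point that turning spillover on carries no surcharge by itself.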
What are the Risks?
While the benefits of free-flowing traffic are obvious, there are still risks in enabling spillover:
- The public lanes may also be congested; having spillover does not guarantee your traffic will move at the same speed as your uncongested private lanes. Traffic in public lanes can still be slow in periods of high congestion.
- You may degrade other internal apps that share your standard quota. If the surge of traffic is big enough, the spilled requests may use up your allotment of cars on the road (quota), leaving less room for those apps.
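A quick back-of-the-envelope check for the second risk might look like this. It assumes your standard quota is expressed in tokens per minute (TPM) and that you have rough estimates of your other apps' usage and the size of a surge; the function name and all numbers are hypothetical.

```python
def quota_headroom(quota_tpm: int, other_apps_tpm: int, spilled_tpm: int) -> int:
    """Tokens per minute left after other apps and spilled traffic.

    A negative result means the surge would overrun the standard quota.
    """
    return quota_tpm - other_apps_tpm - spilled_tpm


# Placeholder numbers: a 300k TPM quota, 200k already used by other apps.
small_surge = quota_headroom(300_000, 200_000, 50_000)   # 50_000 left
large_surge = quota_headroom(300_000, 200_000, 150_000)  # -50_000, overrun
```

If the headroom goes negative for a surge you consider realistic, it may be worth reviewing your standard quota before relying on spillover.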
Taking It Home
Think of Azure OpenAI's spillover feature as an extra tool for easing traffic restrictions and keeping travel smooth on our highway. Even if you have implemented your own traffic load balancing, spillover enabled by default can serve as an additional safeguard, helping requests reach their destination without causing a massive traffic jam. With simple setup options and transparent cost management, spillover offers a smoother and more resilient journey for your production applications, even during times of unexpected traffic. Just keep in mind that the public lanes can be congested too.
Learn more: Manage traffic with spillover for Provisioned deployments - Azure AI services | Microsoft Learn
Updated Mar 20, 2025
Version 1.0
jakeatmsft
Microsoft
AI - Azure AI services Blog