Azure AI services Blog

Managing Traffic Jams with Azure OpenAI PTU Spillover

jakeatmsft (Microsoft)
Mar 20, 2025

How to smooth traffic and avoid gridlocks

In my previous blog (Azure OpenAI offering models - Explain it Like I'm 5 | Microsoft Community Hub), I compared our Azure OpenAI service to a highway. In the standard offering, all users (cars) share common capacity (lanes) as they use the service. In times of congestion or high usage, traffic can slow down because of all the cars on the road at once. Provisioned Throughput (PTU) deployments can be seen as private express lanes designed for consistency and low variability on the highway of our service. These private lanes guarantee that a certain amount of traffic can move through the service at a steady speed. And because you have your own lane, you control the amount of traffic and are not impeded by other cars (other users) using the service.

But what happens when too many cars try to get onto your private lane at once, and the lane reaches full capacity before you can purchase another lane? In the past, a barrier separated the private lanes from the public lanes, so in this case you were forced to slow down. Recently, however, Azure OpenAI released a new feature to alleviate this problem: PTU Spillover. Think of spillover as converting that barrier into exit lanes that let the extra cars safely switch over to a regular lane (a standard deployment) so that everyone can keep moving without a jam.

What is Spillover?

Spillover is a clever way to handle traffic surges in your Azure OpenAI deployments. When your provisioned deployment (your express lane) is fully busy, any extra requests are automatically rerouted to a standard deployment (your regular lane) for processing. This means that even if there’s a sudden burst of traffic, your service keeps running smoothly without all requests coming to a halt.

Prerequisites: Setting Up Your Highway

Before you can enjoy the smooth flow of traffic management, here’s what you need:

  • A primary (provisioned) deployment—this is like your reserved, express highway lane.
  • A standard deployment—this will be your spillover lane for extra traffic.
  • Both deployments need to be part of the same Azure OpenAI Service resource (imagine they’re on the same highway system!).
  • Both deployments must have the same data processing type (for example, a global provisioned lane can only spill over to a global standard lane).

More info here: Manage traffic with spillover for Provisioned deployments - Azure AI services | Microsoft Learn
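
If you want to sanity-check these prerequisites programmatically, here is a minimal sketch (all names below are hypothetical placeholders, and it assumes you already have an Azure Resource Manager bearer token) that lists the deployments on a resource, so you can confirm both lanes live on the same resource and compare their SKUs (for example, GlobalProvisionedManaged alongside GlobalStandard):

import requests

# Hypothetical placeholders -- substitute your own subscription, group, and resource.
SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
ACCOUNT = "<azure-openai-resource>"
ARM_TOKEN = "<bearer-token-for-management.azure.com>"

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.CognitiveServices"
    f"/accounts/{ACCOUNT}/deployments?api-version=2023-05-01"
)
resp = requests.get(url, headers={"Authorization": f"Bearer {ARM_TOKEN}"})
resp.raise_for_status()

# Each deployment reports its SKU (e.g., ProvisionedManaged, GlobalProvisionedManaged,
# GlobalStandard), which is how you can check that the data processing types line up.
for d in resp.json().get("value", []):
    print(d["name"], "->", d.get("sku", {}).get("name"))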

How to Enable Spillover

  • By Default (recommended) - To keep traffic moving during rush hour, you can enable spillover on PTU deployments by default. That way, if there's a sudden spike or burst in usage (more cars than the express lane can handle), spillover automatically sends the extra requests to the standard lane.
  • By Request - Alternatively, if you want more control, you can enable spillover on a per-request basis, much like choosing when to take an exit on the highway (see the sketch after this list).
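
For the per-request route, the mechanism (also shown in the comments below and in the linked documentation) is the x-ms-spillover-deployment header, which names the standard deployment that should catch the overflow. A minimal sketch with placeholder endpoint and deployment names:

import requests

# Placeholders throughout: substitute your resource endpoint, your PTU deployment,
# and the standard deployment that should receive any overflow traffic.
url = "$AZURE_OPENAI_ENDPOINT/openai/deployments/{ptu-deployment}/chat/completions?api-version=2025-02-01-preview"
headers = {
    "Content-Type": "application/json",
    # Opts this single request into spillover to the named standard deployment.
    "x-ms-spillover-deployment": "{spillover-standard-deployment}",
    "Authorization": "Bearer YOUR_AUTH_TOKEN",
}
data = {
    "messages": [
        {"role": "user", "content": "Does Azure OpenAI support customer managed keys?"}
    ]
}

response = requests.post(url, headers=headers, json=data)
print(response.status_code, response.json())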

When Does Spillover Kick In?

Picture this: you’re driving down the provisioned deployment lane, and suddenly your car (request) hits a snag (a non-200 response)—it’s like encountering heavy congestion. At that moment, spillover is activated, and your car is automatically guided to the spillover (standard) lane where there’s free space and the journey can continue smoothly. Importantly, Azure OpenAI always prioritizes sending your car through the express lane until it’s absolutely necessary to switch.

How Does Spillover Affect Costs?

There are two billing scenarios on our smart highway:

  • Cars staying in the express lane (provisioned deployment) are billed using a fixed hourly cost—no extra fees.
  • Cars that have taken the exit lane (routed to the standard deployment) are billed based on the tokens processed (input, cached, output tokens).

In simple terms, while you always pay the reserved fee for your express lane (PTU charges), any cars using the overflow exit lane incur costs based on what they process along the journey. There is no surcharge for having spillover turned on in your deployment.
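
As a toy illustration (all prices here are made up; check the Azure OpenAI pricing page for the real rates for your model and region), the two billing components combine like this:

# Toy cost model with hypothetical prices -- substitute real rates for your model/region.
PTU_HOURLY_RATE = 2.00        # $/PTU/hour (hypothetical)
PTUS = 100                    # reserved capacity (the express lane)
HOURS = 730                   # roughly one month

INPUT_PRICE = 2.50 / 1_000_000    # $/input token (hypothetical)
OUTPUT_PRICE = 10.00 / 1_000_000  # $/output token (hypothetical)

# Only the traffic that actually took the exit lane is billed per token.
spilled_input_tokens = 30_000_000
spilled_output_tokens = 6_000_000

ptu_cost = PTU_HOURLY_RATE * PTUS * HOURS   # fixed, whether or not spillover fires
spillover_cost = (spilled_input_tokens * INPUT_PRICE
                  + spilled_output_tokens * OUTPUT_PRICE)

print(f"Express lane (PTU):    ${ptu_cost:,.2f}")
print(f"Exit lane (spillover): ${spillover_cost:,.2f}")
print(f"Total:                 ${ptu_cost + spillover_cost:,.2f}")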

What are the Risks?

While the benefits of free-flowing traffic are obvious, there are still risks in enabling spillover:

  • The public lanes may also be congested; having spillover does not guarantee your traffic will move at the same speed as your uncongested private lanes. Traffic in public lanes can still be slow in periods of high congestion.
  • You may degrade other internal apps using your standard quota. If the surge of traffic is big enough, you may overrun your allotment of cars on the road (quota).

Taking It Home

Think of Azure OpenAI's spillover feature as an extra tool for removing traffic restrictions and keeping travel smooth on our highway. Even in cases where you have implemented your own traffic load balancing, spillover can serve as an additional safeguard when enabled by default, ensuring that every request reaches its destination promptly without causing a massive traffic jam. With simple setup options and transparent cost management, spillover delivers a smoother and more resilient journey for your production applications, even during times of unexpected traffic.

Learn more: Manage traffic with spillover for Provisioned deployments - Azure AI services | Microsoft Learn

 

  • kkwei (Copper Contributor)

    I tried using the PTU Spillover feature by adding the x-ms-spillover-deployment header ("by request" mode). However, it did not work correctly: I still received a 429 response and was not automatically redirected to the standard deployment.

    import requests
    import json
    
    # Placeholder endpoint and deployment names.
    url = "$AZURE_OPENAI_ENDPOINT/openai/deployments/{ptu-deployment}/chat/completions?api-version=2025-02-01-preview"
    
    payload = json.dumps({
      "messages": [
        {
          "content": "Does Azure OpenAI support customer managed keys?",
          "role": "user"
        }
      ],
      "model": "gpt-4o",
      "stream": False
    })
    headers = {
        "Content-Type": "application/json",
        # Per-request spillover target (the standard deployment).
        "x-ms-spillover-deployment": "{spillover-standard-deployment}",
        "Authorization": "Bearer YOUR_AUTH_TOKEN"
    }
    
    response = requests.post(url, headers=headers, data=payload)
    
    print(response.text)

     

    I believe I have met the following prerequisites:
    1. A primary (provisioned) deployment (Model: GPT-4o, Provisioned Managed)
    2. A standard deployment (Model: GPT-4o, Global Standard)
    3. Both deployments are part of the same Azure OpenAI Service resource (deployed in the same region)

    The only part I'm unsure about: can a provisioned deployment spill over to a global standard deployment?
    According to the documentation, both deployments must have the same data processing type (e.g., a global provisioned lane can only spill over to a global standard lane).

    By the way, is there a minimum API version requirement for PTU Spillover?

    • jakeatmsft (Microsoft)

      In the current preview state, PTU spillover only supports Azure OpenAI resources with network access set to "Allow All". The feature will support all configurations when it reaches general availability (GA).

  • MarekPaulik (Copper Contributor)

    Hello, does this currently work in any other way than the curl requests described in the documentation?
    I am trying to make it work by passing extra_headers to `chat.completions.create` from Python, but it doesn't seem to spill over.

    • jakeatmsft (Microsoft)

      Are you able to test with the requests library in Python? I'm not sure how the SDK forwards extra_headers to the backend.

      • jakeatmsft (Microsoft)

        Something like this:

        import requests
        
        # Substitute your resource endpoint, PTU deployment, and the standard
        # deployment that should receive any spillover traffic.
        url = "$AZURE_OPENAI_ENDPOINT/openai/deployments/{ptu-deployment}/chat/completions?api-version=2025-02-01-preview"
        headers = {
            "Content-Type": "application/json",
            # Names the standard deployment to spill over to for this request.
            "x-ms-spillover-deployment": "{spillover-standard-deployment}",
            "Authorization": "Bearer YOUR_AUTH_TOKEN"
        }
        data = {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Does Azure OpenAI support customer managed keys?"},
                {"role": "assistant", "content": "Yes, customer managed keys are supported by Azure OpenAI."},
                {"role": "user", "content": "Do other Azure AI services support this too?"}
            ]
        }
        
        response = requests.post(url, headers=headers, json=data)
        print(response.json())