Unveiling Azure OpenAI Service Provisioned Reservations and hourly pricing
Introduction
Microsoft recently launched Azure Essentials to provide guidance for managing your cloud investments efficiently by choosing the pricing offers that best meet your needs, paying only for what you use, and managing your cloud spend as your business evolves, whether you’re migrating your first workload or optimizing complex deployments. In alignment with the launch of Azure Essentials, the Microsoft team is thrilled to announce meaningful changes to provisioned deployments for Azure OpenAI Service. My name is Roman. I am part of the AI Global Black Belt team, and it’s an absolute pleasure to share some of these changes with you today.
Starting today, we are rolling out several updates that are going to change the way you procure and deploy Provisioned Throughput Units. These changes are designed to help you be more agile, faster to market and more cost effective. The changes unveiled today only pertain to the provisioned deployments purchasing process. The technical value proposition is still the same and provisioned deployments remain the best option for real-time and high throughput applications.
Today we are announcing:
- Self-service provisioning and model independent quota requests
- Visibility to service capacity and availability
- Provisioned hourly pricing and provisioned reservations
This blog post will focus on the last point and will delve into Azure Reservations for Provisioned Deployments. To learn more about all these changes, visit this link.
In late summer 2023, Microsoft launched Provisioned Throughput Units for Azure OpenAI Service. This was, and still is, a way for customers to request a specified amount of computational power in the Azure OpenAI Service and solve the challenges related to the “noisy neighbor” problem that comes with AI computing in a public cloud. In contrast with the regional standard and global standard deployments, provisioned deployments allow customers to create a deployment with a guaranteed measure of capacity; as a result, customers can build GenAI applications with predictable latency and throughput.
Until today, if you wanted to create a provisioned deployment, you had to plan carefully and work with your account team; quota meant actual capacity carved out from the pool and pre-allocated to your subscription temporarily until the purchase was completed. The capacity and model family that you committed to using for 30 days was strongly coupled with what you could deploy. In addition, you had to tie your commitment to a specific resource, which could create an administrative burden for multi-region or even multi-subscription architectures.
Introducing hourly no-commitment purchasing
At Microsoft, we want to empower our customers to build world class, real-time, high throughput applications using generative AI. To do that, we want to make our provisioned deployments more easily accessible; we want to provide as much flexibility as possible alongside high quality of service. We no longer require a 30-day minimum commitment to purchase provisioned throughput. You can now create a provisioned deployment in a self-service manner just to run a benchmarking script this afternoon if you so desire. And you can also tear down the deployment when done. No strings attached. To do this, we came up with an easy-to-understand, flat, hourly price of $2 (subject to change, check this link for up-to-date pricing) per unit per hour. It does not matter if you deploy GPT-3.5-Turbo or GPT-4o, the price per unit is the same and the construct of a provisioned throughput unit now becomes entirely model independent. That said, different models still have different minimum increment sizes. Taking GPT-4o as an example, you can deploy any multiple of 50 PTUs.
We did not stop there. If even one hour is more than you need, you can stop early and we will prorate the cost for partial hours. If you create a provisioned deployment of 100 units and only use it for 15 minutes, you will be billed the equivalent of 25 unit-hours (100 units × 0.25 hours). We want to make provisioned deployments accessible to all.
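The proration arithmetic above can be sketched in a few lines of Python. This is an illustration only: it assumes the $2 per unit per hour figure quoted in this post (subject to change) and assumes proration is linear in the minutes used. The function name `prorated_cost` is hypothetical and not part of any Azure SDK.

```python
# Illustrative rate taken from this post; check current Azure pricing before relying on it.
HOURLY_RATE_USD = 2.00  # per PTU per hour


def prorated_cost(ptus: int, minutes_used: float,
                  hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Estimate the cost of an hourly-billed deployment, assuming linear proration."""
    unit_hours = ptus * (minutes_used / 60)  # e.g. 100 PTUs for 15 minutes = 25 unit-hours
    return unit_hours * hourly_rate


# The example from the text: 100 units used for 15 minutes.
print(prorated_cost(100, 15))  # 50.0
```

Under these assumptions, the 100-unit, 15-minute example comes to 25 unit-hours, or about $50.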
This option is great for all testing scenarios as well as transition periods where customers might be moving a deployment from one region to another and want to do so without PTU downtime.
Introducing Azure OpenAI Service provisioned reservations
Hourly no-commitment purchasing gives our customers more flexibility. But in the spirit of making provisioned deployments even more accessible, we also wanted to provide a mechanism for cost optimization.
Say you built an application and hopefully you enjoyed the hourly pricing; during development, it gave you all the flexibility to run your tests against several models with various parameters. But you have now deployed that application in production, and it will be sending completion requests steadily. Not just for the next few hours, but also weeks and months to come. In that case, Azure OpenAI Service provisioned reservations will be immensely beneficial. Azure reservations do not change your provisioned deployment at all from a technical standpoint. Rather, they overlay a predictable and cost-effective billing mechanism on top of it.
The way it works is simple. In Azure OpenAI Studio, we provide a simple capacity calculator. You input the characteristics of your applications, and the calculator estimates the number of units you would need to provision to cover the entirety of the requests you expect to process with your deployment.
The capacity calculator in Azure OpenAI Studio
As a customer, using the calculator you determine that across your applications, you will need a certain number of Provisioned Throughput Units. From there, you can purchase this number of PTU reservations for one month or one year in the Azure portal. By making a monthly reservation, you can save up to 82%* over the hourly rate and for a one-year reservation you can save up to 85%**. Keep in mind that although you must select a region, you no longer need to commit to a specific model or model family. Say you reserve 500 units for the year: you get a sizeable discount, and you can switch the models to which the units are allocated, mix and match models across multiple deployments, and create and tear down deployments at will. Also keep in mind that reservations and deployments are now decoupled and can be changed entirely independently. Purchasing a PTU reservation does not create a deployment or guarantee availability of capacity. So, our recommendation is to create the deployment first and only then proceed with the reservation. This order ensures that you do not purchase a reservation that cannot be fulfilled due to a temporary capacity shortage.
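The savings percentages can be reproduced from the approximate rates quoted in the footnotes of this post. The sketch below is only a sanity check of that arithmetic: the rates are those quoted as of July 29, 2024, are subject to change, and the constant and function names are hypothetical.

```python
# Approximate per-PTU hourly rates quoted in this post's footnotes (subject to change).
HOURLY_RATE = 2.00
MONTHLY_RESERVATION_RATE = 0.3562  # effective hourly rate, 1-month reservation
YEARLY_RESERVATION_RATE = 0.3028   # effective hourly rate, 1-year reservation


def savings_pct(reserved_rate: float, hourly_rate: float = HOURLY_RATE) -> float:
    """Percentage saved by a reservation relative to the hourly rate."""
    return (1 - reserved_rate / hourly_rate) * 100


print(round(savings_pct(MONTHLY_RESERVATION_RATE)))  # 82
print(round(savings_pct(YEARLY_RESERVATION_RATE)))   # 85
```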
Reservations can be flexibly scoped to cover deployments in a single resource group, a subscription, a list of subscriptions in a management group, or all subscriptions in the same billing context. If you are a large enterprise with, say, one subscription per business unit and, within those subscriptions, one resource per application, you can now purchase a single centralized reservation to cover them all.
If the number of units you have reserved matches the units you have deployed, you are in good shape and you are optimizing cost as much as possible. If the number of units you have reserved is greater than the units you have deployed, then you are leaving value on the table. And if you deployed more units than reserved, then the difference is billed at the hourly rate described in the hourly pricing section above. In any case, it is good practice to reassess your coverage periodically.
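The three coverage cases above can be expressed as a small check. This is a simplified sketch, not how Azure billing is implemented: it assumes a single reservation scope, uses the $2 per unit per hour figure quoted in this post for any overage, and the function name `coverage_report` is hypothetical.

```python
def coverage_report(deployed_ptus: int, reserved_ptus: int,
                    hourly_rate: float = 2.00) -> dict:
    """Compare deployed PTUs with reserved PTUs for one hour of usage."""
    # Units deployed beyond the reservation are billed at the hourly rate.
    overage = max(deployed_ptus - reserved_ptus, 0)
    # Reserved units with no matching deployment are paid for but unused.
    unused = max(reserved_ptus - deployed_ptus, 0)
    return {
        "overage_ptus": overage,
        "unused_reserved_ptus": unused,
        "overage_cost_per_hour": overage * hourly_rate,
    }


print(coverage_report(500, 500))  # fully covered, nothing unused
print(coverage_report(300, 500))  # 200 reserved units left unused
print(coverage_report(600, 500))  # 100 units billed hourly
```

Running a check like this periodically against your actual deployments is one way to act on the advice to reassess coverage.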
Conclusion
This new iteration does not change the technical characteristics of provisioned deployments. Those still provide best-in-class service with respect to low and predictable latency. In terms of procurement, however, it is a brand-new world. With self-service provisioning, model-independent units, hourly on-demand deployment for maximum flexibility, and reservations for maximum cost savings, provisioned deployments for Azure OpenAI are more attractive than ever, and barriers to entry have been lowered.
Learn more about how to elevate reliability, security, and ongoing performance of your cloud and AI investments with Azure Essentials.
Additional Resources:
- What are Azure Reservations? - Microsoft Cost Management | Microsoft Learn
- Reservations | Microsoft Azure
- Azure OpenAI Service provisioned throughput - Azure AI services | Microsoft Learn
- Azure OpenAI Service Provisioned Throughput Units (PTU) onboarding - Azure AI services | Microsoft Learn
*The 82% savings are based on the Provisioned Throughput Hourly rate of approximately $2/hour, compared to the reduced rate of a 1-month reservation at approximately $0.3562/hour. Azure pricing as of July 29, 2024 (prices subject to change). Actual savings may vary depending on the specific Azure OpenAI model and region availability.
**The 85% savings are based on the Provisioned Throughput Hourly rate of approximately $2/hour, compared to the reduced rate of a 1-year reservation at approximately $0.3028/hour. Azure pricing as of July 29, 2024 (prices subject to change). Actual savings may vary depending on the specific Azure OpenAI model and region availability.
My sincere appreciation to contributors and reviewers: Tierney Morgan, Mary Newcomer Williams, Kailyn Sylvester, Kyle Ikeda, Andy Beatman, David Huntley and Noah Aldonas.