When deploying large language models in Azure AI Foundry, does selecting PTUs (Provisioned Throughput Units) save you money?
This is the kind of article that might get its humble writer in hot water, but what's life without a little controversy?
Introduction
What are PTUs? Provisioned Throughput Units are your own dedicated share of the Azure OpenAI service, offering high availability and consistent latency backed by a service level agreement. They provide an express lane to the service, reducing concerns about noisy neighbors. This was the primary reason for introducing this deployment type: to ensure consistently low latency for real-time use cases compared to standard deployments (often called PayGo or pay-as-you-go). Since their release, PTUs have become the de facto choice for production environments thanks to their consistent, predictable performance.
Standard deployments are priced per token: we simply meter the number of tokens processed and multiply by the price per token. This is an easy way to deploy without any usage commitment. Provisioned deployments, by contrast, are billed at an hourly rate, which is convenient for experimentation; for sustained use, a provisioned reservation offers a significant discount in exchange for committing to your allocated processing capacity for one month or one year.
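To make the two billing models concrete, here is a minimal sketch in Python. The function names and signatures are illustrative placeholders of my own, not part of any Azure SDK, and the prices you pass in should come from the official pricing page:

```python
# Illustrative cost models for the two deployment types
# (placeholder function names; plug in current list prices).

def paygo_cost(input_tokens: int, output_tokens: int,
               price_in_per_m: float, price_out_per_m: float) -> float:
    """Standard (PayGo): pay only for the tokens you actually process."""
    return (input_tokens / 1e6) * price_in_per_m \
         + (output_tokens / 1e6) * price_out_per_m

def provisioned_cost(ptus: int, price_per_ptu_per_month: float,
                     months: int = 1) -> float:
    """Provisioned reservation: pay for reserved capacity, used or not."""
    return ptus * price_per_ptu_per_month * months
```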
It has been proven over time with many clients that provisioned deployments provide the best possible latency experience. However, some literature suggests that this deployment type may also offer cost benefits. Is this true? If so, under what circumstances? Are there sweet spots, tipping points, or inflection points? How can you determine if this applies to your application? This article explores these questions. It’s a non-trivial issue requiring either many assumptions or a perfect understanding of your application profile. Fortunately, some simple algebra can help us approximate the answer.
Before we dive in, let me state this clearly: provisioned deployments are not primarily a cost-saving mechanism. PTUs are the production-grade deployment option, designed to ensure a consistent experience for clients and end users. They are a superior product compared to standard (PayGo) deployments, which is why PTUs are priced differently. Provisioned deployments are engineered to provide consistently low latency, and that is their main advantage; it makes them ideal for a successful AI project launch with consistent latency performance.
Example Use Case
I recently had a use case requiring GPT-4o. To provide best-in-class latency to end users, I wanted to make the case for Provisioned Throughput Units.
The natural next question is, "How many PTUs do I need, and what TPM (tokens per minute) will my application get on this provisioned deployment?" One method is to use the capacity calculator in Azure AI Foundry. An even better method is to create a small deployment and benchmark your actual application against it. Azure AI Foundry lets you create these deployments on an hourly basis without any commitment, which is a flexible way to run benchmarks. That said, once testing and benchmarking are finished, remember to either delete the hourly deployment or purchase a monthly or yearly provisioned reservation.
We created a small deployment of 25 PTUs for GPT-4o and started benchmarking with our actual application prompts (a minimal benchmarking sketch follows the list below). The data showed that we could saturate the deployment by reaching:
- 62,500 input TPM
- 50,000 output TPM
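Here is a minimal sketch of such a benchmark using the `openai` Python package (v1+) against Azure OpenAI. The endpoint, key, deployment name, prompt, and concurrency level are all placeholders; a real benchmark should replay your application's actual prompt mix and back off on 429 (throttling) responses:

```python
# A minimal throughput benchmark against an hourly provisioned deployment.
# Assumes the `openai` Python package (v1+); all names below are placeholders.
import os
import time
from concurrent.futures import ThreadPoolExecutor

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

DEPLOYMENT = "gpt-4o-ptu"   # name of your small PTU deployment (placeholder)
PROMPT = "..."              # replace with your real application prompts
DURATION_S = 60             # measure over at least one minute
CONCURRENCY = 16            # raise until throughput stops increasing

def one_request() -> tuple[int, int]:
    resp = client.chat.completions.create(
        model=DEPLOYMENT,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.usage.prompt_tokens, resp.usage.completion_tokens

in_tokens = out_tokens = 0
start = time.monotonic()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    while time.monotonic() - start < DURATION_S:
        futures = [pool.submit(one_request) for _ in range(CONCURRENCY)]
        for f in futures:
            i, o = f.result()
            in_tokens += i
            out_tokens += o
elapsed = time.monotonic() - start

print(f"input TPM:  {in_tokens / elapsed * 60:,.0f}")
print(f"output TPM: {out_tokens / elapsed * 60:,.0f}")
```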
For more information about provisioned throughput and its associated costs, see the link below:
Understanding costs associated with provisioned throughput units (PTU) | Microsoft Learn
At that point, we knew what 25 PTUs could deliver for this application profile. Since PTUs scale almost linearly with throughput, it’s easy to determine how many you’ll need for production: just estimate your requests per minute, average input tokens per request, and average output tokens per request for your full user base.
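Under that linear-scaling assumption, a back-of-the-envelope sizing helper might look like the sketch below. The baseline figures are the 25-PTU measurements above; the function name and the example workload are illustrative:

```python
# Back-of-the-envelope PTU sizing, assuming near-linear scaling from the
# 25-PTU baseline measured above (62,500 input TPM / 50,000 output TPM).
import math

BASELINE_PTUS = 25
BASELINE_INPUT_TPM = 62_500
BASELINE_OUTPUT_TPM = 50_000

def ptus_needed(requests_per_minute: float,
                avg_input_tokens: float,
                avg_output_tokens: float) -> int:
    """Scale by whichever dimension (input or output) is the bottleneck."""
    input_tpm = requests_per_minute * avg_input_tokens
    output_tpm = requests_per_minute * avg_output_tokens
    scale = max(input_tpm / BASELINE_INPUT_TPM,
                output_tpm / BASELINE_OUTPUT_TPM)
    # Real deployments come in fixed size increments; round up to the
    # nearest valid size shown in the Azure AI Foundry capacity calculator.
    return math.ceil(BASELINE_PTUS * scale)

# Example: 50 requests/minute, 1,500 input and 400 output tokens per request.
print(ptus_needed(50, 1_500, 400))  # -> 30
```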
Cost Calculations and Considerations
PTU pricing is straightforward. At the time of writing, a monthly reservation for a GPT-4o PTU costs $260 USD per unit for global deployments, so 25 PTUs for a month cost 25 x $260 = $6,500 USD. This buys you 62,500 input TPM and 50,000 output TPM at the best possible latency. Prices change over time and vary by model, region, and deployment type, so always check the latest prices here: Azure OpenAI Service - Pricing | Microsoft Azure
Now, let’s compare this to the cost of a standard/PayGo deployment to see whether provisioned throughput comes at a high premium, is on par, or saves money. Here, we need to make some major assumptions about the application profile.
With standard deployments, we measure usage as a stock (total tokens processed); with provisioned deployments, we measure usage as a flow (capacity over time). Provisioned deployments are always up, and you pay for uptime whether you use it or not. Let's assume that our application's profile matches the 62,500/50,000 input/output TPM measured above and runs 24/7 for a month, with no caching.
PayGo pricing for GPT-4o per 1M tokens (as of this writing):
- Input: $2.50
- Output: $10.00
Cost per minute:
- Input: 0.0625 (62,500 tokens, in millions) x $2.50 = $0.15625
- Output: 0.05 (50,000 tokens, in millions) x $10.00 = $0.50
- Total: approximately $0.656 per minute
There are 43,800 minutes in a month, so the standard (PayGo) cost is about $28,744, more than four times the cost of the provisioned deployment. This shows that, under certain conditions, PTUs can indeed save you money. Of course, these assumptions are extreme; few applications are as active at night and on weekends as they are during business hours.
As an additional exercise, let's find the break-even point. At $0.656 per minute, you reach the $6,500 tipping point after roughly 9,900 minutes, or about 7 days. In other words, if your application needed steady processing at the 62,500/50,000 input/output TPM profile for at least 7 days per month, a provisioned deployment could be more cost-effective. And regardless of cost, latency is better with a provisioned deployment.
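To put the whole comparison in one place, here is a short script reproducing the numbers above. The prices are the point-in-time figures quoted earlier and should be refreshed from the pricing page before you rely on them:

```python
# Reproducing the PayGo vs. provisioned comparison above
# (GPT-4o prices as of this writing; check the pricing page for updates).

PRICE_IN_PER_M = 2.50      # $ per 1M input tokens (PayGo)
PRICE_OUT_PER_M = 10.00    # $ per 1M output tokens (PayGo)
INPUT_TPM = 62_500         # measured on the 25-PTU deployment
OUTPUT_TPM = 50_000
PTU_MONTHLY = 25 * 260     # 25 PTUs x $260 = $6,500
MINUTES_PER_MONTH = 43_800

cost_per_minute = ((INPUT_TPM / 1e6) * PRICE_IN_PER_M
                   + (OUTPUT_TPM / 1e6) * PRICE_OUT_PER_M)
paygo_monthly = cost_per_minute * MINUTES_PER_MONTH
break_even_minutes = PTU_MONTHLY / cost_per_minute

print(f"PayGo cost: ${cost_per_minute:.3f}/min")            # $0.656/min
print(f"PayGo month (24/7): ${paygo_monthly:,.0f}")         # ~$28,744
print(f"Break-even: {break_even_minutes:,.0f} min "
      f"(~{break_even_minutes / 1440:.1f} days)")           # ~9,905 min, ~6.9 days
```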
Conclusion
This analysis shows there can be a tipping point where PTUs start saving you money, at least with this model and application profile. There are many variables, and the assumptions here sit at one extreme. Use this article as inspiration: plug in your own numbers, prices, and application profile to determine whether PTUs can save you money. Most importantly, remember that provisioned deployments are the best option for real-time production applications, providing consistently low latency and, in some cases, cost efficiency compared to PayGo. Your stakeholders will enjoy the end-user experience that PTUs provide, which will set you up for success with your AI project.
Acknowledgement: Many thanks to my co-author and tech reviewer Julie Morin!