Forum Discussion

sachins
Copper Contributor
Mar 28, 2025
Solved

Understanding Azure OpenAI Service Provisioned Reservations

Hello Team,

 

We are building a fine-tuned model on Azure OpenAI, based on GPT-4o mini, for long-term use.

We want to understand the costs, and we have the following questions about PTU units under the Azure OpenAI Service Provisioned Reservations plan.

 

We need to understand how it works:

  1. Is there a token quota limit for a Provisioned fine-tuned model deployment?
  2. How many fine-tuned models with Provisioned capacity can be deployed under the plan, and how is pricing affected if we deploy multiple fine-tuned models?

 

Model Deployment - GPT-4o mini (fine-tuned)

Region - North Central US

 

We are doing this for an enterprise customer; kindly help us resolve this.

 


5 Replies

  • sachins
    Copper Contributor

    If we are doing this on a test server under a different subscription for the client, is it possible to transfer or share the deployment to the main subscription, within the same region?

  • ml4u
    Brass Contributor

    Provisioned Throughput Units (PTUs) in Azure OpenAI Service provide predictable costs and performance by allocating a specific throughput capacity. The token quota limit for fine-tuned models is typically measured in Tokens Per Minute (TPM) rather than a fixed number of tokens. As for deploying multiple fine-tuned models, the pricing is based on the total throughput capacity allocated. You can deploy multiple models as long as the combined throughput remains within the reserved capacity. It's advisable to monitor usage and adjust the capacity as needed to optimize costs.


  • Abdulrhman
    Copper Contributor

    Hi sachins,

    Azure OpenAI Service Provisioned Throughput Units (PTUs) are designed to provide predictable costs and performance by allocating a specific throughput capacity. There isn't a specific token quota limit documented for fine-tuned model deployments under PTUs, but throughput is measured in Tokens Per Minute (TPM).

    You can use the Azure capacity planner to estimate the required PTUs based on your expected TPM usage.

    Azure OpenAI Service - Pricing | Microsoft Azure
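    To make the capacity-planner step concrete, here is a minimal sketch of the arithmetic: estimate PTUs from an expected TPM figure and round up to a valid deployment size. The throughput rate, minimum deployment size, and increment below are placeholder assumptions, not published Azure figures; take the real values for your model and region from the Azure capacity calculator.

    ```python
    import math

    # Illustrative figures only -- real PTU-to-TPM rates, minimums, and
    # increments vary by model/region; see the Azure capacity calculator.
    ASSUMED_TPM_PER_PTU = 2_500   # hypothetical tokens/minute served per PTU
    MIN_PTU_DEPLOYMENT = 25       # hypothetical minimum deployment size
    PTU_INCREMENT = 25            # hypothetical purchase increment

    def estimate_ptus(expected_tpm: int) -> int:
        """Round the raw PTU estimate up to a valid deployment size."""
        raw_ptus = expected_tpm / ASSUMED_TPM_PER_PTU
        increments = math.ceil(raw_ptus / PTU_INCREMENT)
        return max(MIN_PTU_DEPLOYMENT, increments * PTU_INCREMENT)

    print(estimate_ptus(180_000))  # -> 75 under the assumed rates above
    ```

    The same shape of calculation is what the capacity planner does for you; the sketch is just to show why an expected TPM number is the input you need to have ready.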

     

    You can deploy multiple fine-tuned models under the Provisioned capacity plan. The pricing will be affected by the total throughput required for all models combined. Each model's deployment will consume a portion of the allocated PTUs, and you need to ensure that the total TPM usage across all models does not exceed the provisioned capacity.

    Could you share more about your expected usage patterns, and how many models you plan to deploy?


    • ml4u
      Brass Contributor

      Great explanation, Abdulrhman. To add, when deploying multiple fine-tuned models under the Provisioned capacity plan, it's important to monitor the total throughput to ensure it aligns with the allocated PTUs. This helps avoid exceeding the provisioned capacity and ensures optimal performance. Keep in mind that each model's deployment will contribute to the overall throughput usage, so planning and monitoring are key.
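      One way to act on that monitoring advice is to track peak utilization of the reservation and alert before you hit the ceiling. This is a minimal sketch assuming you already collect per-minute token counts (e.g. from Azure Monitor metrics); the sample values and the 80% threshold are illustrative assumptions:

      ```python
      def peak_utilization(samples_tpm: list[int], capacity_tpm: int) -> float:
          """Fraction of provisioned capacity used at the busiest observed minute."""
          return max(samples_tpm) / capacity_tpm

      samples = [40_000, 95_000, 120_000]   # hypothetical per-minute token counts
      util = peak_utilization(samples, capacity_tpm=150_000)
      if util >= 0.8:                        # illustrative alert threshold
          print(f"Warning: peak utilization {util:.0%} -- consider adding PTUs")
      ```

      Alerting on peak rather than average matters here because provisioned throughput is a hard per-minute ceiling: a single busy minute above capacity gets throttled even if the daily average looks fine.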
