Customers often ask how to compute GPU capacity for GPT models. This post captures the calculation for future reference.
When deploying large language models like GPT‑4o, capacity planning is no longer about picking a GPU SKU. Instead, Azure abstracts GPU compute behind Provisioned Throughput Units (PTUs)—a model‑centric way to reason about GPU usage, throughput, and latency.
This post explains how GPU capacity is computed for GPT‑4o‑class models, and how to translate your workload into the right number of PTUs.
From GPUs to Tokens: The Mental Shift
With GPT‑4o and newer models, Azure does not expose GPUs directly. Instead:
- GPU compute is consumed as token throughput
- Throughput is measured in tokens per minute (TPM)
- Capacity is provisioned using PTUs, which represent a fixed slice of GPU processing capacity
A PTU is not “one GPU.” It is a guaranteed amount of model‑processing capacity, backed by GPUs under the hood and optimized by Azure for that specific model. [learn.microsoft.com]
The Key Change with GPT‑4o
For GPT‑4o and later models, input and output tokens are metered separately.
That matters because:
- Input tokens (prompt processing) can be processed largely in parallel
- Output tokens (generation) are produced one at a time and are more GPU‑intensive
Azure therefore assigns separate TPM budgets per PTU for input and output tokens.
GPT‑4o Throughput per PTU
For gpt‑4o, the effective per‑PTU capacities are:
| Metric | Value |
| --- | --- |
| Input TPM per PTU | ~2,500 |
| Output TPM per PTU | ~625 |
| Input : Output ratio | 4 : 1 |
These ratios are baked into Azure’s PTU calculators and provisioning logic.
The Core Formula
To compute required GPU capacity (PTUs):
- Input TPM = input tokens per request × requests per minute
- Output TPM = output tokens per request × requests per minute
- Input PTUs = Input TPM ÷ 2,500
- Output PTUs = Output TPM ÷ 625
- Required PTUs = max(Input PTUs, Output PTUs)
Then:
- Round up
- Apply minimum deployment constraints (e.g., 15 PTUs for Global / Data Zone)
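The formula above can be sketched in a few lines of Python. This is a minimal sketch, not an Azure API: the per‑PTU rates (~2,500 input TPM, ~625 output TPM) and the 15‑PTU minimum are the approximate values quoted in this post.

```python
import math

# Approximate per-PTU throughput for gpt-4o, as quoted in this post
INPUT_TPM_PER_PTU = 2_500
OUTPUT_TPM_PER_PTU = 625
MIN_DEPLOYMENT_PTUS = 15  # Global / Data Zone minimum deployment size

def required_ptus(input_tpm: float, output_tpm: float) -> int:
    """PTUs needed to serve the given input/output token throughput."""
    input_ptus = input_tpm / INPUT_TPM_PER_PTU
    output_ptus = output_tpm / OUTPUT_TPM_PER_PTU
    bottleneck = max(input_ptus, output_ptus)  # size to the binding side
    return max(math.ceil(bottleneck), MIN_DEPLOYMENT_PTUS)

print(required_ptus(24_000, 4_500))  # → 15
```

Note that the minimum deployment size, not the raw token math, often sets the floor for small workloads.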
Step‑by‑Step Example
Assume this workload:
- 800 input tokens
- 150 output tokens
- 30 requests per minute
- Compute TPM
  - Input TPM = 800 × 30 = 24,000
  - Output TPM = 150 × 30 = 4,500
- Convert to PTUs
  - Input side: 24,000 ÷ 2,500 = 9.6 PTUs
  - Output side: 4,500 ÷ 625 = 7.2 PTUs
- Take the bottleneck: max(9.6, 7.2) = 9.6 → round up to 10 PTUs
Apply Azure’s minimum deployment size → 15 PTUs required.
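The arithmetic in this example can be verified with a short Python snippet. It is illustrative only, using the ~2,500 / ~625 per‑PTU rates and the 15‑PTU minimum described in this post:

```python
import math

# Workload assumptions from the example above
input_tokens_per_request = 800
output_tokens_per_request = 150
requests_per_minute = 30

# Step 1: compute tokens per minute on each side
input_tpm = input_tokens_per_request * requests_per_minute    # 24,000
output_tpm = output_tokens_per_request * requests_per_minute  # 4,500

# Step 2: convert each side to PTUs using the per-PTU rates
input_ptus = input_tpm / 2_500   # 9.6
output_ptus = output_tpm / 625   # 7.2

# Step 3: take the bottleneck and apply the minimum deployment size
needed = math.ceil(max(input_ptus, output_ptus))  # 10
deployed = max(needed, 15)                        # 15
print(deployed)  # → 15
```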
This is why tables often show PTUs higher than a simple TPM ÷ constant calculation.
Why Output Tokens Matter More
Output tokens:
- Are generated sequentially
- Consume GPU compute longer per token
- Drive latency and tail performance
That’s why GPT‑4o uses a 4:1 input‑to‑output ratio, and why output TPM often becomes the bottleneck in chatty or agentic workloads.
Practical Guidance
- Short prompts, long answers → output‑bound → output TPM sets the PTU count
- Large prompts, short answers → input‑bound → input TPM sets the PTU count
- Stable traffic → PTUs give predictable latency
- Spiky traffic → consider Standard + spillover
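To illustrate the first two bullets, a tiny helper (hypothetical, not an Azure API) can report which token stream is binding, assuming the ~2,500 / ~625 per‑PTU rates from this post:

```python
def bound_side(input_tpm: float, output_tpm: float,
               input_rate: float = 2_500, output_rate: float = 625) -> str:
    """Report which side of the workload sets the PTU count (illustrative)."""
    input_ptus = input_tpm / input_rate
    output_ptus = output_tpm / output_rate
    return "input-bound" if input_ptus >= output_ptus else "output-bound"

print(bound_side(24_000, 4_500))  # large prompts, short answers → input-bound
print(bound_side(3_000, 6_000))   # short prompts, long answers → output-bound
```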
Azure recommends validating sizing with the PTU Calculator and real traffic benchmarks before committing long‑term reservations.
Final Takeaway
For GPT‑4o and newer models, GPU sizing is token‑driven, not hardware‑driven.
PTUs abstract GPUs, and the required capacity is simply the maximum of input‑bound and output‑bound throughput needs.
Once you understand that, GPT‑4o capacity planning becomes predictable, explainable, and much easier to operate at scale.