Healthcare and Life Sciences Blog
2 MIN READ

How to Compute GPU Capacity for GPT Models (GPT‑4o and Later)

Yan_Liang
Microsoft
Mar 30, 2026

Some customers have asked how to compute GPU capacity for GPT models, so I'm writing this up for future reference.

When deploying large language models like GPT‑4o, capacity planning is no longer about picking a GPU SKU. Instead, Azure abstracts GPU compute behind Provisioned Throughput Units (PTUs)—a model‑centric way to reason about GPU usage, throughput, and latency.

This post explains how GPU capacity is computed for GPT‑4o‑class models, and how to translate your workload into the right number of PTUs.

From GPUs to Tokens: The Mental Shift

With GPT‑4o and newer models, Azure does not expose GPUs directly. Instead:

  • GPU compute is consumed as token throughput
  • Throughput is measured in tokens per minute (TPM)
  • Capacity is provisioned using PTUs, which represent a fixed slice of GPU processing capacity

A PTU is not “one GPU.” It is a guaranteed amount of model‑processing capacity, backed by GPUs under the hood and optimized by Azure for that specific model. [learn.microsoft.com]

The Key Change with GPT‑4o

For GPT‑4o and later models, input and output tokens are metered separately.

That matters because:

  • Input tokens (prompt processing) can be ingested in parallel and stress the model relatively lightly
  • Output tokens (generation) are produced one at a time and are more GPU‑intensive

Azure therefore assigns separate TPM budgets per PTU for input and output tokens.

GPT‑4o Throughput per PTU

For gpt‑4o, the effective per‑PTU capacities are:

| Metric | Value |
| --- | --- |
| Input TPM per PTU | ~2,500 |
| Output TPM per PTU | ~625 |
| Input : Output ratio | 4 : 1 |

These ratios are baked into Azure’s PTU calculators and provisioning logic.

The Core Formula

To compute required GPU capacity (PTUs):

PTUs = max( Input TPM ÷ 2,500 , Output TPM ÷ 625 )

Then:

  • Round up
  • Apply minimum deployment constraints (e.g., 15 PTUs for Global / Data Zone)
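The formula and the two post-processing steps above can be sketched in a few lines of Python. The per-PTU figures and the 15-PTU minimum come from this post; treat them as approximate planning numbers, not service guarantees, and check the current values for your model and deployment type.

```python
import math

# Per-PTU throughput for gpt-4o, from the table above.
# These are approximate planning figures, not guarantees.
INPUT_TPM_PER_PTU = 2500
OUTPUT_TPM_PER_PTU = 625
MIN_PTUS = 15  # Global / Data Zone minimum deployment size

def required_ptus(input_tpm: float, output_tpm: float) -> int:
    """PTUs needed for a workload: the larger of the input-bound
    and output-bound requirements, rounded up, then raised to the
    minimum deployment size."""
    input_bound = input_tpm / INPUT_TPM_PER_PTU
    output_bound = output_tpm / OUTPUT_TPM_PER_PTU
    return max(math.ceil(max(input_bound, output_bound)), MIN_PTUS)
```

Note that the `max` over both sides is what makes the estimate bottleneck-driven: whichever token stream demands more PTUs sets the deployment size.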

Step‑by‑Step Example

Assume this workload:

  • 800 input tokens per request
  • 150 output tokens per request
  • 30 requests per minute

1. Compute TPM

Input TPM = 800 × 30 = 24,000
Output TPM = 150 × 30 = 4,500

2. Convert to PTUs

Input side: 24,000 ÷ 2,500 = 9.6 PTUs
Output side: 4,500 ÷ 625 = 7.2 PTUs

3. Take the bottleneck

max(9.6, 7.2) = 9.6 → round up to 10 PTUs.

Apply Azure’s minimum deployment size → 15 PTUs required.

This is why tables often show PTUs higher than a simple TPM ÷ constant calculation.
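The worked example can be verified with a short script, using the per-PTU figures quoted earlier in this post (2,500 input TPM and 625 output TPM per PTU, 15-PTU minimum):

```python
import math

# Workload from the example: 30 requests/min,
# 800 input and 150 output tokens per request.
requests_per_minute = 30
input_tpm = 800 * requests_per_minute    # 24,000
output_tpm = 150 * requests_per_minute   # 4,500

input_ptus = input_tpm / 2500            # input-bound requirement
output_ptus = output_tpm / 625           # output-bound requirement

# Bottleneck, rounded up, then the 15-PTU minimum applies.
ptus = max(math.ceil(max(input_ptus, output_ptus)), 15)
print(ptus)
```

The rounded-up bottleneck is 10 PTUs, but the minimum deployment size lifts the answer to 15.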

Why Output Tokens Matter More

Output tokens:

  • Are generated sequentially
  • Consume GPU compute longer per token
  • Drive latency and tail performance

That’s why GPT‑4o uses a 4:1 input‑to‑output ratio, and why output TPM often becomes the bottleneck in chatty or agentic workloads. [modelavail...bility.com]

Practical Guidance

  • Short prompts, long answers → output‑bound → more PTUs
  • Large prompts, short answers → input‑bound → more PTUs
  • Stable traffic → PTUs give predictable latency
  • Spiky traffic → consider Standard + spillover
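The first two bullets can be made concrete with a small (hypothetical) helper that checks which side of the 4:1 ratio a workload falls on. Per PTU, an output token "costs" four times as much as an input token (1/625 vs. 1/2,500), so the shape of a typical request tells you which budget you will exhaust first:

```python
def bound_side(input_tokens: int, output_tokens: int) -> str:
    """Classify a typical request as input- or output-bound by
    comparing per-PTU demand on each side (2,500 input TPM vs.
    625 output TPM per PTU)."""
    if input_tokens / 2500 >= output_tokens / 625:
        return "input-bound"
    return "output-bound"

print(bound_side(800, 150))   # large prompt, short answer
print(bound_side(200, 1200))  # short prompt, long answer
```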

Azure recommends validating sizing with the PTU Calculator and real traffic benchmarks before committing long‑term reservations. 

Final Takeaway

For GPT‑4o and newer models, GPU sizing is token‑driven, not hardware‑driven.
PTUs abstract GPUs, and the required capacity is simply the maximum of input‑bound and output‑bound throughput needs.

Once you understand that, GPT‑4o capacity planning becomes predictable, explainable, and much easier to operate at scale.

 

Published Mar 30, 2026
Version 1.0