
Monitoring Azure OpenAI without switching from your existing observability platform

rmmartins
Oct 03, 2025

Recently, one of my customers asked me a simple but powerful question:

“We already use Datadog for observability. We’re not going to move to Azure Monitor and Log Analytics, but can we still monitor our Azure OpenAI usage directly in the portal?”


That made me realize something: this isn’t just one customer’s challenge. Many customers already have an observability stack they like, but they still want quick, free visibility into Azure OpenAI usage and alerts without switching tools. This post shares how you can do exactly that.

The two monitoring paths

When it comes to monitoring Azure OpenAI, there are two main options:

1. The full flow (most powerful, requires Log Analytics): This unlocks correlation, deep queries, and exporting metrics/logs to external tools.

Azure OpenAI Service → Azure Monitor → Log Analytics → KQL, Workbooks, Alerts → integrations like Datadog, Grafana.

2. The lightweight flow (fast, free, no Log Analytics): This is the path we’ll explore, using simple dashboards and quota-based alerts right in the Azure Portal.

Azure OpenAI Service → Azure Monitor (Metrics) → Portal Workbooks + Alerts.

Metrics available in Azure OpenAI

Azure OpenAI publishes several key metrics natively (no ingestion required). According to the official documentation:

  • Processed Inference Tokens → tokens consumed (prompt + completion).
  • Azure OpenAI Requests → total API calls.
  • Request Errors → failed requests (429s, 5xx).
  • Availability Rate → percentage of successful calls.
  • Latency metrics → TTFT (time to first token), TTLB (time to last byte).

You can view these under: AOAI Resource → Monitoring → Metrics.

Azure OpenAI exposes native metrics like tokens, requests, errors, and latency directly in the Azure Portal.
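
If you prefer to pull these numbers from a script instead of the portal blade, the azure-monitor-query Python package can read the same platform metrics. Note that the portal display names (such as Processed Inference Tokens) map to internal metric names, so it is worth listing the metric definitions first. A minimal sketch, assuming `pip install azure-monitor-query azure-identity`, Monitoring Reader access on the resource, and your own resource ID in place of the placeholder:

```python
# Sketch: discover the metric names behind the portal display names
# (e.g. "Processed Inference Tokens") using azure-monitor-query.
# RESOURCE_ID is a placeholder; replace it with your Azure OpenAI resource ID.
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.CognitiveServices/accounts/<aoai-account>"
)

client = MetricsQueryClient(DefaultAzureCredential())
for definition in client.list_metric_definitions(RESOURCE_ID):
    print(definition.name, "-", definition.unit)
```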

Quotas: The other half of the picture

Metrics tell you usage. Quotas tell you capacity. Every deployment has fixed Tokens per Minute (TPM) and Requests per Minute (RPM) limits. You can find these under: Azure AI Foundry portal → Deployments → Select Deployment → Rate Limits.

Example:

  • GPT-4.1-mini deployment → 250,000 TPM / 250 RPM

These are the values you’ll compare against metrics and use in alerts.

Each deployment has fixed TPM/RPM quotas. Here, GPT-4.1-mini is capped at 250,000 TPM and 250 RPM.
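
The alert thresholds used later in this post are just fixed percentages of these limits, so it can help to write them down once. A trivial sketch using the example GPT-4.1-mini quota above (swap in your own deployment’s Rate Limits):

```python
# Derive warning/critical alert thresholds from a deployment's fixed quota.
# The quota values are the example GPT-4.1-mini limits from this post.
TPM_QUOTA = 250_000    # tokens per minute
RPM_QUOTA = 250        # requests per minute

WARNING_RATIO = 0.80   # alert before throttling starts
CRITICAL_RATIO = 1.00  # quota saturated

print("TPM warning: ", int(TPM_QUOTA * WARNING_RATIO))   # 200000
print("TPM critical:", int(TPM_QUOTA * CRITICAL_RATIO))  # 250000
print("RPM warning: ", int(RPM_QUOTA * WARNING_RATIO))   # 200
print("RPM critical:", int(RPM_QUOTA * CRITICAL_RATIO))  # 250
```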

Building a lightweight workbook

Even without Log Analytics, you can build a simple workbook to track usage vs quota:

  1. Go to Azure Monitor → Workbooks → + New.
  2. Add a metric visualization for Processed Inference Tokens (Sum).
    • Resource Type: Azure AI Foundry
    • Azure AI Foundry: Select your instance
    • Click Add metric
      • Metric: Processed Inference Tokens
      • Aggregation: Sum
      • Display name: Token Usage vs Quota.
  3. Add another metric for Azure OpenAI Requests (Count).
    • Metric: Azure OpenAI Requests
    • Aggregation: Count
    • Display name: Requests per Minute vs Quota.
  4. Click Run Metrics.
  5. Save as AOAI Usage vs Capacity.

Workbooks let you visualize token and request usage against your deployment’s fixed quotas.
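
If you also want this usage-vs-quota view outside the portal, for example in a scheduled script or a CI check, the same comparison can be reproduced in a few lines. This is a sketch under the same assumptions as the earlier query example (azure-monitor-query and azure-identity installed, Monitoring Reader access, placeholder resource ID), and the metric names shown are assumptions you should confirm against your resource’s metric definitions:

```python
# Sketch: reproduce the workbook's "usage vs quota" comparison in a script.
# Assumptions: placeholder RESOURCE_ID replaced, and the metric names below
# confirmed via list_metric_definitions (they may differ on your resource).
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.CognitiveServices/accounts/<aoai-account>"
)
QUOTAS = {
    "TokenTransaction": 250_000,   # assumed name for Processed Inference Tokens
    "AzureOpenAIRequests": 250,    # assumed name for Azure OpenAI Requests
}

client = MetricsQueryClient(DefaultAzureCredential())
result = client.query_resource(
    RESOURCE_ID,
    metric_names=list(QUOTAS),
    timespan=timedelta(minutes=30),
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.TOTAL],
)

for metric in result.metrics:
    quota = QUOTAS[metric.name]
    for series in metric.timeseries:
        for point in series.data:
            used = point.total or 0
            print(f"{point.timestamp:%H:%M}  {metric.name:<22}"
                  f"{used:>12,.0f}  ({100 * used / quota:5.1f}% of quota)")
```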

Creating alerts (proactive notification)

From the portal you can also configure alerts directly on metrics:

  1. Go to Azure Monitor → Alerts → + Create → Alert rule.
  2. Scope = your AOAI resource.
  3. Condition step:
    • Signal name = Processed Inference Tokens.
    • Threshold type: Static
    • Value is: Greater than
    • Unit: Count
    • Threshold = 200,000 (warning) or 250,000 (critical).
  4. Actions step:
    • Use Quick Actions → add your email (or Azure mobile push).
    • Or create an Action Group for Teams/webhook integration.
  5. Details step:
    • Name = AOAI-TPM-Warning / AOAI-TPM-Critical.
    • Severity = 2 (Warning) or 0 (Critical).
  6. Review + Create.
  7. Repeat for Azure OpenAI Requests with thresholds of 200 (warning) and 250 (critical).

Alert conditions:

Configure alert conditions directly on metrics. Here, we trigger at 200,000 tokens per minute (80% of quota).

Quick Actions:

Quick Actions let you add email or mobile notifications without creating a full Action Group.

Overview from the Alert:

Give your alert a descriptive name and severity. Here, AOAI-TPM-Warning at Severity 2.
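
If you manage resources with scripts or pipelines, the same warning rule can be defined as code instead of clicking through the portal. Here is a sketch using the azure-mgmt-monitor package, assuming `pip install azure-mgmt-monitor azure-identity`, permission to create alert rules, placeholder IDs replaced with your own, and the metric name confirmed for your resource (TokenTransaction is shown as an assumed name for Processed Inference Tokens):

```python
# Sketch: create the 80% TPM warning alert rule as code.
# Assumptions: placeholder subscription/resource names replaced, permission to
# create alert rules, and the metric_name confirmed for your resource.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
)

SUBSCRIPTION_ID = "<sub-id>"
RESOURCE_GROUP = "<rg>"
AOAI_RESOURCE_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/Microsoft.CognitiveServices/accounts/<aoai-account>"
)

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

alert = MetricAlertResource(
    location="global",
    description="Warn at 80% of the 250k TPM quota",
    severity=2,                          # 2 = Warning, 0 = Critical
    enabled=True,
    scopes=[AOAI_RESOURCE_ID],
    evaluation_frequency="PT1M",
    window_size="PT1M",
    criteria=MetricAlertSingleResourceMultipleMetricCriteria(
        all_of=[
            MetricCriteria(
                name="tpm-warning",
                metric_name="TokenTransaction",  # assumed name; confirm first
                time_aggregation="Total",
                operator="GreaterThan",
                threshold=200_000,
            )
        ]
    ),
    # actions=[...]  # attach an Action Group here for email/Teams/webhooks
)

client.metric_alerts.create_or_update(RESOURCE_GROUP, "AOAI-TPM-Warning", alert)
```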

How this helps with 429 errors

One of the most common issues Azure OpenAI customers face is the dreaded “Too Many Requests” (429) error.

Why it happens:

  • Each deployment enforces hard TPM/RPM quotas.
  • If you send more tokens or requests than allowed in a minute, the service rejects them with a 429.
  • You may see headers like x-ms-retry-after-ms telling you how long to wait.

How monitoring helps:

  • Metrics as early warning: Watching token/request metrics shows when you’re approaching the cap.
  • Alerts before throttling: Warning alerts at 80% (200k TPM / 200 RPM) give you time to react before 429s hit.
  • Critical alerts at 100%: Confirm you’ve saturated the quota and need to adjust.

Important note:

  • Monitoring doesn’t prevent 429s; your app should still implement retry with backoff and consider batching or queuing requests (a minimal sketch follows after this list).
  • But with this setup, you’ll know before the error storm begins and can respond faster.
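
On the application side, here is a minimal sketch of retry with backoff around an Azure OpenAI call, using the openai Python package (v1.x). The endpoint, key, API version, and deployment name are placeholders; also note that the SDK already retries rate-limit errors a couple of times by default (its max_retries setting), so an explicit loop like this mainly matters when you want to honor the service’s retry hint or add your own queuing:

```python
# Sketch: exponential backoff with Retry-After handling for 429s.
# Placeholders: endpoint, api_key, api_version, and deployment name.
import time

from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-06-01",
)

def chat_with_backoff(messages, max_attempts=5):
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            return client.chat.completions.create(
                model="gpt-4.1-mini",   # your deployment name
                messages=messages,
            )
        except RateLimitError as err:
            if attempt == max_attempts:
                raise
            # Prefer the service's hint (x-ms-retry-after-ms), otherwise
            # fall back to exponential backoff.
            headers = getattr(err.response, "headers", {}) or {}
            retry_ms = headers.get("x-ms-retry-after-ms")
            wait = int(retry_ms) / 1000 if retry_ms else delay
            print(f"429 received, retrying in {wait:.1f}s (attempt {attempt})")
            time.sleep(wait)
            delay *= 2

reply = chat_with_backoff([{"role": "user", "content": "Hello!"}])
print(reply.choices[0].message.content)
```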

Why this matters

For many companies, time-to-value is more important than building a new monitoring stack.

This approach means:

  • No Log Analytics ingestion.
  • No need to replace Datadog or Splunk.
  • Free visibility into usage vs quota.
  • Proactive notifications on approaching limits.
  • Fewer surprises with 429 errors.

And if later you want deeper insights, you can still enable Log Analytics and export into your existing observability platform.

Closing thoughts

This article was inspired by a customer request, but I believe many others will benefit from the same approach. In just a few minutes, you can build a dashboard, set alerts, and gain confidence in your Azure OpenAI usage, all without leaving the Azure Portal.
I’d love to hear from you: how is your team monitoring Azure OpenAI today? Share in the comments; your feedback will help shape what we build next.

Updated Oct 03, 2025
Version 2.0