Answering the question: “Which Azure region can support all the requirements for deploying my AI application?”
By Cedric Vidal, Principal AI Advocate, Microsoft.
It is essential for every AI developer to determine the appropriate Azure region for deploying an AI application. With Azure AI Foundry, deploying a model is straightforward; one simply selects a model from the catalog, clicks deploy, and chooses a region. However, enterprise applications typically require the deployment of multiple models.
For instance, a standard Retrieval-Augmented Generation (RAG) application may require at least three distinct models: a chat model for interaction (e.g. GPT-3.5 Turbo or GPT-4), an embedding model (e.g. text-embedding-ada-002 or text-embedding-3) for encoding questions and answers as vectors, and an evaluation model (e.g. GPT-4) to periodically assess the quality of the responses.
Consider the Contoso Creative Writer sample Azure AI application. It is representative of what many AI developers are building today: a typical multi-agent application that employs several agents to accomplish complex tasks effectively.
Here’s a table showing its requirements:
| Requirement | Model | Type | TPM |
|---|---|---|---|
| editor | gpt-35-turbo or gpt-35-turbo-16k | Standard or Global Standard | 10k |
| writer | gpt-4o | Standard or Global Standard | 15k |
| evaluation | gpt-4 or gpt-4-32k | Standard or Global Standard | 20k |
| embeddings | text-embedding-3-small or text-embedding-ada-002 | Standard or Global Standard | 30k |
As your application's requirements and model choices expand, automation can reduce the decision fatigue of choosing deployment regions. The challenge lies in automating the discovery of models, regions, and available quotas so that a developer knows from the start which regions will meet all their requirements.
Why automate the discovery of models, regions, and available quotas?
Besides reducing decision fatigue, automating the discovery of models and quotas can help organizations meet enterprise-wide model deployment standards, enforced, for example, through Azure Policy across various deployments and AI applications. For instance, an organization might prefer deploying all models within a particular region for compliance or performance reasons, or alternatively prioritize cost-effectiveness by distributing models across multiple regions.
This article discusses the available APIs on Azure that can help automate deployment decisions and how to integrate them into your model discovery process. Specifically, we will explore the following question: How can we quickly identify which Azure region can support all the requirements for deploying our AI application effectively?
Note: The terms location and region are used interchangeably throughout this discussion.
A tale of three REST APIs
The Models API
The Azure AI Foundry platform offers the Models API, which lets you query the models available to a specific subscription within a chosen location.
You can find detailed documentation on the Models API here. The API URL format is shown below, requiring the subscriptionId and location as inputs:
https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.CognitiveServices/locations/{location}/models?api-version=2024-10-01
Sample response:
{
  "kind": "OpenAI",
  "skuName": "S0",
  "model": {
    "format": "OpenAI",
    "name": "whisper",
    "version": "001",
    "skus": [
      {
        "name": "Standard",
        "usageName": "OpenAI.Standard.whisper",
        "capacity": {
          "maximum": 9999,
          "default": 3
        },
        "deprecationDate": "2099-01-01T00:00:00Z",
        "rateLimits": [
          {
            "key": "request",
            "renewalPeriod": 60,
            "count": 1
          }
        ]
      }
    ]
  }
}
Note: Some parts of the responses have been stripped for improved readability.
The capacity and rate limits included there are not very useful to us: they don't reflect the quota actually available in that location. Another field, usageName, will prove more useful when combined with the Usages API.
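To make this concrete, here is a minimal sketch of querying the Models API in Python, assuming the azure-identity and requests packages and an authenticated Azure session (e.g. via `az login`); the subscription ID is a placeholder, and pagination via nextLink is omitted for brevity:

```python
import requests
from azure.identity import DefaultAzureCredential

# Acquire an ARM bearer token
credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

subscription_id = "<your-subscription-id>"  # placeholder, substitute your own
location = "westus"

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/providers/Microsoft.CognitiveServices/locations/{location}"
    "/models?api-version=2024-10-01"
)
response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()

# ARM list endpoints wrap results in a "value" array
for item in response.json()["value"]:
    model = item["model"]
    skus = ", ".join(sku["name"] for sku in model.get("skus", []))
    print(location, item["kind"], model["name"], model["version"], skus)
```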
First 10 results in table format:
| region | kind | modelName | modelVersion | skus (comma separated) |
|---|---|---|---|---|
| westus | OpenAI | gpt-35-turbo | 0613 | GlobalBatch |
| westus | OpenAI | gpt-35-turbo | 1106 | Standard, GlobalBatch, ProvisionedManaged |
| westus | OpenAI | gpt-35-turbo | 0125 | Standard, GlobalBatch, ProvisionedManaged |
| westus | OpenAI | gpt-4 | 0125-Preview | ProvisionedManaged |
| westus | OpenAI | gpt-4 | 1106-Preview | Standard, ProvisionedManaged |
| westus | OpenAI | gpt-4 | 0613 | GlobalBatch, ProvisionedManaged |
| westus | OpenAI | gpt-4-32k | 0613 | ProvisionedManaged |
| westus | OpenAI | gpt-4 | vision-preview | Standard |
| westus | OpenAI | gpt-4 | turbo-2024-04-09 | Standard, GlobalStandard, GlobalBatch, ProvisionedManaged |
| westus | OpenAI | gpt-4o | 2024-05-13 | Standard, GlobalStandard, ProvisionedManaged, GlobalBatch, DataZoneStandard |
The Usages API
The Models API does not offer subscription-specific limits or remaining quotas. To determine the suitable deployment region for our application, we need to assess current usages and limits, and then infer the available quotas. This requires making a request to the Usages API.
The Usages API is documented here, and the following URL is used for queries. It requires both subscriptionId and location as parameters.
https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.CognitiveServices/locations/{location}/usages?api-version=2023-05-01
Sample response:
{
  "name": {
    "value": "OpenAI.ProvisionedManaged",
    "localizedValue": "Provisioned Managed Throughput Unit"
  },
  "currentValue": 0,
  "limit": 0,
  "unit": "Count"
}
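Querying the Usages API looks very similar. Here is a sketch under the same assumptions (azure-identity, requests, placeholder subscription ID) that also computes the remaining quota for each usage entry:

```python
import requests
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

subscription_id = "<your-subscription-id>"  # placeholder, substitute your own
location = "westus"

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/providers/Microsoft.CognitiveServices/locations/{location}"
    "/usages?api-version=2023-05-01"
)
response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()

for usage in response.json()["value"]:
    # Remaining quota is the limit minus what is already consumed
    remaining = usage["limit"] - usage["currentValue"]
    print(usage["name"]["value"], usage["currentValue"], usage["limit"],
          remaining, usage["unit"])
```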
First 10 results in table format:
| usageName | currentUsage | usageLimit | remainingQuota | usageUnit |
|---|---|---|---|---|
| OpenAI.GlobalStandard.gpt-4o | 11 | 30 | 19 | Count |
| OpenAI.DataZoneStandard.gpt-4o | 10 | 30 | 20 | Count |
| AccountCount | 4 | 200 | 196 | Count |
| AIServices.S0.AccountCount | 4 | 50 | 46 | Count |
| OpenAI.Standard.gpt-4o | 1 | 8 | 7 | Count |
| OpenAI.Standard.o1 | 0 | 0 | 0 | Count |
| OpenAI.GlobalStandard.gpt-4o-mini | 0 | 30 | 30 | Count |
| OpenAI.GlobalStandard.gpt-4o-realtime-preview | 0 | 1 | 1 | Count |
| OpenAI.DataZoneStandard.gpt-4o-mini | 0 | 30 | 30 | Count |
| OpenAI.Standard.o1-mini | 0 | 0 | 0 | Count |
The Locations API
Both the Models API and the Usages API require a region-specific query. Therefore, to begin, you need to obtain a list of all available regions using the Locations API.
The Locations API is fully documented here. To retrieve this data, use the following URL format, substituting in the necessary subscriptionId:
https://management.azure.com/subscriptions/{subscriptionId}/locations?api-version=2021-04-01
Sample response:
{
  "id": "/subscriptions/6415ebd4-1dd7-430f-bd4d-2f5e9419c1cd/locations/eastus",
  "name": "eastus",
  "type": "Region",
  "displayName": "East US",
  "regionalDisplayName": "(US) East US",
  "metadata": {
    "regionType": "Physical",
    "regionCategory": "Recommended",
    "geographyGroup": "US",
    "longitude": "-79.8164",
    "latitude": "37.3719",
    "physicalLocation": "Virginia"
  }
}
Note: Content has been streamlined for improved readability.
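Here is a minimal sketch of retrieving the region list under the same assumptions as the previous snippets; filtering to physical regions is a choice made here for illustration, since logical regions aren't deployment targets:

```python
import requests
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

subscription_id = "<your-subscription-id>"  # placeholder, substitute your own

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    "/locations?api-version=2021-04-01"
)
response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()

# Keep physical regions only; logical regions can't host model deployments
regions = [
    loc["name"]
    for loc in response.json()["value"]
    if loc.get("metadata", {}).get("regionType") == "Physical"
]
print(len(regions), regions[:10])
```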
First 10 locations (out of approximately 95):

| name | displayName | geographyGroup |
|---|---|---|
| eastus | East US | US |
| southcentralus | South Central US | US |
| westus2 | West US 2 | US |
| westus3 | West US 3 | US |
| australiaeast | Australia East | Asia Pacific |
| southeastasia | Southeast Asia | Asia Pacific |
| northeurope | North Europe | Europe |
| swedencentral | Sweden Central | Europe |
| uksouth | UK South | Europe |
| westeurope | West Europe | Europe |
Data model
The data model may appear complex at first. Here's a UML class diagram highlighting the relevant sections of the schemas from the three API responses; it clarifies how these APIs and their data models relate to one another.
Note: This diagram highlights only the relevant parts of the schema for our discussion, thus simplifying the complex interconnections among APIs.
The process
To answer our original question effectively, we need to follow these steps:
- Retrieve a list of all regions using the Locations API.
- For each region, obtain all models and their associated SKUs using the Models API.
- For each region, query usage data to get current usage and limits using the Usages API.
- Combine region, model, SKU, and usage data by matching them on usage names.
- Calculate the available quota by subtracting current usage from the limit for each model.
- Represent model requirements as a series of predicates for evaluation.
- Filter and retain models that satisfy the specified requirement predicates.
- Identify regions that support models for all specified requirements.
A comprehensive Jupyter notebook containing the implementation of these steps is available in the following repository:
Note: Steps 1 to 5 involve substantial code that can be challenging to condense effectively for a blog post. If you are interested in seeing the detailed implementation of how these three APIs are integrated, please visit the accompanying GitHub repository. Your support by starring or forking the repository would be greatly appreciated and would help us continue providing valuable resources.
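That said, to give a flavor of the join in steps 4 and 5, here is a deliberately simplified sketch. The names models_by_region and usages_by_region are hypothetical dictionaries, keyed by region name, holding the raw "value" arrays collected in steps 2 and 3; the notebook handles more edge cases:

```python
def join_models_sku_usages(models_by_region, usages_by_region):
    """Match each model SKU to its usage record via the usageName field."""
    joined = []
    for region, models in models_by_region.items():
        # Index this region's usages by name for fast lookup
        usages = {u["name"]["value"]: u for u in usages_by_region.get(region, [])}
        for model in models:
            for sku in model["model"].get("skus", []):
                usage = usages.get(sku.get("usageName"))
                if usage is None:
                    continue  # no quota information for this SKU in this region
                joined.append({
                    "region": region,
                    "model": model,
                    "sku": sku,
                    "usage": usage,
                })
    return joined

joined_model_sku_usages = join_models_sku_usages(models_by_region, usages_by_region)
```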
So let’s jump straight to step 6, where we’ll encode requirements.
Encoding requirements
The function model_matches creates a predicate that evaluates if the inputs (model, sku, usage) satisfy specific requirements:
def model_matches(model_names, tpm, sku_names):
    def criteria(model, sku, usage, **kwargs):
        # The predicate matches unless proven otherwise
        matches = True
        # The Usages API provided the subscription limits and current usage
        limit = usage['limit']
        current = usage['currentValue']
        # We can therefore compute the remaining quota available
        remaining = limit - current
        # The predicate should return True if the model's name is any of the passed model names
        model_name = model['model']['name']
        if model_names:
            matches = matches and model_name in model_names
        # AND if the SKU's name is any of the passed SKU names
        sku_name = sku['name']
        if sku_names:
            matches = matches and sku_name in sku_names
        # AND if the requested tpm is below the remaining quota available
        if tpm:
            matches = matches and remaining >= tpm
        return matches
    return criteria
Now, let’s define each of our application requirements using the model_matches function:
editor_req = model_matches(model_names = ['gpt-35-turbo', 'gpt-35-turbo-16k'], tpm = 10, sku_names = ['Standard', 'GlobalStandard'])
eval_req = model_matches(model_names = ['gpt-4', 'gpt-4-32k'], tpm = 5, sku_names = ['Standard', 'GlobalStandard'])
writer_req = model_matches(model_names = ['gpt-4o'], tpm = 15, sku_names = ['Standard', 'GlobalStandard'])
embedding_req = model_matches(model_names = ['text-embedding-3-small', 'text-embedding-ada-002'], tpm = 30, sku_names = ['Standard', 'GlobalStandard'])
We then collect them into a list:
requirements = [editor_req, eval_req, writer_req, embedding_req]
Find the regions that have models for all requirements
def unique_regions(model_sku_usages):
    return set([model_sku_usage['region'] for model_sku_usage in model_sku_usages])

def filter_models_sku_usages(criteria, models_sku_usages):
    return list(filter(lambda msu: criteria(**msu), models_sku_usages))

def regions_matching_all(requirements, joined_model_sku_usages):
    # For each requirement, get the list of matching models
    filter_msu_sets = [filter_models_sku_usages(req, joined_model_sku_usages) for req in requirements]
    # For each set, extract unique regions
    regions_sets = [unique_regions(model_sku_usages) for model_sku_usages in filter_msu_sets]
    # Compute the intersection of all region sets, resulting in regions matching all requirements
    return set.intersection(*regions_sets)

final_regions = regions_matching_all(requirements, joined_model_sku_usages)
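As a usage example (the resulting region names will vary with your subscription and quotas):

```python
# Pick any region that satisfies every requirement,
# e.g. the first one alphabetically
if final_regions:
    target_region = sorted(final_regions)[0]
    print(f"Deploying all models to {target_region}")
else:
    print("No single region satisfies all requirements")
```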
We end up with a list of regions that meet all our requirements. Consequently, we can deploy our application to one of these regions, ensuring access to all necessary models.
Note: The code presented here is oversimplified and has limitations. In particular, it doesn't account for situations where multiple requirements rely on the same model: the shared remaining quota is counted once per requirement, so a region can be selected even though it lacks enough quota for all requirements combined.
For example, if two requirements each need an instance of the same model, and that model has enough remaining quota in a given region for one requirement but not both, the region will still be selected when it should not be.
Can you provide a better implementation? Feel free to submit your ideas in the comments below or by opening a PR on the repo ;)
Conclusion
Automating the discovery of regions, models, and available quotas can be challenging, but Azure provides all the APIs required to do so, and this article should help you get started.
Once you have your script in place, it's easy to re-run it for each subsequent model deployment. This reduces friction for GenAIOps engineers who need to deploy multiple models while optimizing quota usage.
The approach discussed here is available as a Jupyter notebook in the following GitHub repository:
If you found this article helpful, please consider starring and cloning the repository. This feedback allows us to gauge whether our work resonates with our audience and helps us understand its impact and relevance.