Answering the question: “Which Azure region can support all the requirements for deploying my AI application?”
By Cedric Vidal, Principal AI Advocate, Microsoft.
It is essential for every AI developer to determine the appropriate Azure region for deploying an AI application. With Azure AI Foundry, deploying a model is straightforward; one simply selects a model from the catalog, clicks deploy, and chooses a region. However, enterprise applications typically require the deployment of multiple models.
For instance, a standard Retrieval-Augmented Generation (RAG) application may require at least three distinct models: a chat model for interaction (e.g. GPT-3.5 Turbo or GPT-4), an embedding model (e.g. text-embedding-ada-002 or text-embedding-3) for encoding questions and answers as vectors, and an evaluation model (e.g. GPT-4) to periodically assess the quality of the responses.
Consider the Contoso Creative Writer sample Azure AI application. It is representative of what many AI developers are building today: a typical multi-agent application that employs several agents to accomplish complex tasks effectively.
Here’s a table showing its requirements:
| Requirement | Model | Type | TPM |
|---|---|---|---|
| editor | gpt-35-turbo or gpt-35-turbo-16k | Standard or Global Standard | 10k |
| writer | gpt-4o | Standard or Global Standard | 15k |
| evaluation | gpt-4 or gpt-4-32k | Standard or Global Standard | 20k |
| embeddings | text-embedding-3-small or text-embedding-ada-002 | Standard or Global Standard | 30k |
As your application's requirements and model choices expand, automation can reduce the decision fatigue of choosing deployment regions. The challenge lies in automating the discovery of models, regions, and available quotas so that a developer knows from the start which regions will meet all their requirements.
Why automate the discovery of models, regions, and available quotas?
Besides reducing decision fatigue, automating the discovery of models and quotas can help organizations meet enterprise-wide model deployment standards, enforced, for example, through Azure Policy across various deployments and AI applications. For instance, an organization might prefer deploying all models within a particular region for compliance or performance reasons, or alternatively prioritize cost-effectiveness by distributing models across multiple regions.
This article discusses the available APIs on Azure that can help automate deployment decisions and how to integrate them into your model discovery process. Specifically, we will explore the following question: How can we quickly identify which Azure region can support all the requirements for deploying our AI application effectively?
Note: The terms location and region are used interchangeably throughout this discussion.
A tale of three REST APIs
The Models API
The Azure AI Foundry platform offers the Models API, which lets you query the models available to a specific subscription within a chosen location.
You can find detailed documentation on the Models API here. The API URL format is shown below, requiring the subscriptionId and location as inputs:
https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.CognitiveServices/locations/{location}/models?api-version=2024-10-01
Sample response:
{
  "kind": "OpenAI",
  "skuName": "S0",
  "model": {
    "format": "OpenAI",
    "name": "whisper",
    "version": "001",
    "skus": [
      {
        "name": "Standard",
        "usageName": "OpenAI.Standard.whisper",
        "capacity": {
          "maximum": 9999,
          "default": 3
        },
        "deprecationDate": "2099-01-01T00:00:00Z",
        "rateLimits": [
          {
            "key": "request",
            "renewalPeriod": 60,
            "count": 1
          }
        ]
      }
    ]
  }
}
Note: Some parts of the responses have been stripped for improved readability.
The capacity and rate limits included there are not very useful to us: they don't reflect the quota actually available in that location. Another field, usageName, will prove more useful when combined with the Usages API.
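To make this concrete, here is a minimal sketch of querying the Models API in Python, assuming the azure-identity and requests packages and an authenticated Azure session (e.g. via `az login`); the subscription ID is a placeholder, and pagination via nextLink is omitted for brevity:

```python
import requests
from azure.identity import DefaultAzureCredential

# Acquire an ARM bearer token
credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

subscription_id = "<your-subscription-id>"  # placeholder, substitute your own
location = "westus"

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/providers/Microsoft.CognitiveServices/locations/{location}"
    "/models?api-version=2024-10-01"
)
response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()

# ARM list endpoints wrap results in a "value" array
for item in response.json()["value"]:
    model = item["model"]
    skus = ", ".join(sku["name"] for sku in model.get("skus", []))
    print(location, item["kind"], model["name"], model["version"], skus)
```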
First 10 results in table format:
| region | kind | modelName | modelVersion | skus (comma separated) |
|---|---|---|---|---|
| westus | OpenAI | gpt-35-turbo | 0613 | GlobalBatch |
| westus | OpenAI | gpt-35-turbo | 1106 | Standard, GlobalBatch, ProvisionedManaged |
| westus | OpenAI | gpt-35-turbo | 0125 | Standard, GlobalBatch, ProvisionedManaged |
| westus | OpenAI | gpt-4 | 0125-Preview | ProvisionedManaged |
| westus | OpenAI | gpt-4 | 1106-Preview | Standard, ProvisionedManaged |
| westus | OpenAI | gpt-4 | 0613 | GlobalBatch, ProvisionedManaged |
| westus | OpenAI | gpt-4-32k | 0613 | ProvisionedManaged |
| westus | OpenAI | gpt-4 | vision-preview | Standard |
| westus | OpenAI | gpt-4 | turbo-2024-04-09 | Standard, GlobalStandard, GlobalBatch, ProvisionedManaged |
| westus | OpenAI | gpt-4o | 2024-05-13 | Standard, GlobalStandard, ProvisionedManaged, GlobalBatch, DataZoneStandard |
The Usages API
The Models API does not offer subscription-specific limits or remaining quotas. To determine the suitable deployment region for our application, we need to assess current usages and limits, and then infer the available quotas. This requires making a request to the Usages API.
The Usages API is documented here, and the following URL is used for queries. It requires both subscriptionId and location as parameters.
https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.CognitiveServices/locations/{location}/usages?api-version=2023-05-01
Sample response:
{
  "name": {
    "value": "OpenAI.ProvisionedManaged",
    "localizedValue": "Provisioned Managed Throughput Unit"
  },
  "currentValue": 0,
  "limit": 0,
  "unit": "Count"
}
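Querying the Usages API looks very similar. Here is a sketch under the same assumptions (azure-identity, requests, placeholder subscription ID) that also computes the remaining quota for each usage entry:

```python
import requests
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

subscription_id = "<your-subscription-id>"  # placeholder, substitute your own
location = "westus"

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/providers/Microsoft.CognitiveServices/locations/{location}"
    "/usages?api-version=2023-05-01"
)
response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()

for usage in response.json()["value"]:
    # Remaining quota is the limit minus what is already consumed
    remaining = usage["limit"] - usage["currentValue"]
    print(usage["name"]["value"], usage["currentValue"], usage["limit"],
          remaining, usage["unit"])
```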
First 10 results in table format:
| usageName | currentUsage | usageLimit | remainingQuota | usageUnit |
|---|---|---|---|---|
| OpenAI.GlobalStandard.gpt-4o | 11 | 30 | 19 | Count |
| OpenAI.DataZoneStandard.gpt-4o | 10 | 30 | 20 | Count |
| AccountCount | 4 | 200 | 196 | Count |
| AIServices.S0.AccountCount | 4 | 50 | 46 | Count |
| OpenAI.Standard.gpt-4o | 1 | 8 | 7 | Count |
| OpenAI.Standard.o1 | 0 | 0 | 0 | Count |
| OpenAI.GlobalStandard.gpt-4o-mini | 0 | 30 | 30 | Count |
| OpenAI.GlobalStandard.gpt-4o-realtime-preview | 0 | 1 | 1 | Count |
| OpenAI.DataZoneStandard.gpt-4o-mini | 0 | 30 | 30 | Count |
| OpenAI.Standard.o1-mini | 0 | 0 | 0 | Count |
The Locations API
Both the Models API and the Usages API require a region-specific query. Therefore, to begin, you need to obtain a list of all available regions using the Locations API.
The Locations API is fully documented here. To retrieve this data, use the following URL format, substituting in the necessary subscriptionId:
https://management.azure.com/subscriptions/{subscriptionId}/locations?api-version=2021-04-01
Sample response:
{
  "id": "/subscriptions/6415ebd4-1dd7-430f-bd4d-2f5e9419c1cd/locations/eastus",
  "name": "eastus",
  "type": "Region",
  "displayName": "East US",
  "regionalDisplayName": "(US) East US",
  "metadata": {
    "regionType": "Physical",
    "regionCategory": "Recommended",
    "geographyGroup": "US",
    "longitude": "-79.8164",
    "latitude": "37.3719",
    "physicalLocation": "Virginia"
  }
}
Note: Content has been streamlined for improved readability.
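Here is a minimal sketch of retrieving the region list under the same assumptions as the previous snippets; filtering to physical regions is a choice made here for illustration, since logical regions aren't deployment targets:

```python
import requests
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

subscription_id = "<your-subscription-id>"  # placeholder, substitute your own

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    "/locations?api-version=2021-04-01"
)
response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()

# Keep physical regions only; logical regions can't host model deployments
regions = [
    loc["name"]
    for loc in response.json()["value"]
    if loc.get("metadata", {}).get("regionType") == "Physical"
]
print(len(regions), regions[:10])
```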
First 10 locations (out of approximately 95):

| name | displayName | geographyGroup |
|---|---|---|
| eastus | East US | US |
| southcentralus | South Central US | US |
| westus2 | West US 2 | US |
| westus3 | West US 3 | US |
| australiaeast | Australia East | Asia Pacific |
| southeastasia | Southeast Asia | Asia Pacific |
| northeurope | North Europe | Europe |
| swedencentral | Sweden Central | Europe |
| uksouth | UK South | Europe |
| westeurope | West Europe | Europe |
Data model
The data model may appear complex at first. Here's a UML class diagram highlighting the relevant sections of the schemas from the three API responses; it clarifies how these APIs and their data models relate to one another.
Note: This diagram highlights only the relevant parts of the schema for our discussion, thus simplifying the complex interconnections among APIs.
The process
To answer our original question effectively, we need to follow these steps:
- Retrieve a list of all regions using the Locations API.
- For each region, obtain all models and their associated SKUs using the Models API.
- For each region, query usage data to get current usage and limits using the Usages API.
- Combine region, model, SKU, and usage data by matching them on usage names.
- Calculate the available quota by subtracting current usage from the limit for each model.
- Represent model requirements as a series of predicates for evaluation.
- Filter and retain models that satisfy the specified requirement predicates.
- Identify regions that support models for all specified requirements.
A comprehensive Jupyter notebook containing the implementation of these steps is available in the following repository:
Note: Steps 1 to 5 involve substantial code that can be challenging to condense effectively for a blog post. If you are interested in seeing the detailed implementation of how these three APIs are integrated, please visit the accompanying GitHub repository. Your support by starring or forking the repository would be greatly appreciated and would help us continue providing valuable resources.
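That said, to give a flavor of the join in steps 4 and 5, here is a deliberately simplified sketch. The names models_by_region and usages_by_region are hypothetical dictionaries, keyed by region name, holding the raw "value" arrays collected in steps 2 and 3; the notebook handles more edge cases:

```python
def join_models_sku_usages(models_by_region, usages_by_region):
    """Match each model SKU to its usage record via the usageName field."""
    joined = []
    for region, models in models_by_region.items():
        # Index this region's usages by name for fast lookup
        usages = {u["name"]["value"]: u for u in usages_by_region.get(region, [])}
        for model in models:
            for sku in model["model"].get("skus", []):
                usage = usages.get(sku.get("usageName"))
                if usage is None:
                    continue  # no quota information for this SKU in this region
                joined.append({
                    "region": region,
                    "model": model,
                    "sku": sku,
                    "usage": usage,
                })
    return joined

joined_model_sku_usages = join_models_sku_usages(models_by_region, usages_by_region)
```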
So let’s jump straight to step 6, where we’ll encode requirements.
Encoding requirements
The function model_matches creates a predicate that evaluates if the inputs (model, sku, usage) satisfy specific requirements:
def model_matches(model_names, tpm, sku_names):
    def criteria(model, sku, usage, **kwargs):
        # The predicate matches unless proven otherwise
        matches = True
        # The Usages API provided the subscription limits and current usage
        limit = usage['limit']
        current = usage['currentValue']
        # We can therefore compute the remaining quota available
        remaining = limit - current
        # The predicate should return True if the model's name is any of the passed model names
        model_name = model['model']['name']
        if model_names:
            matches = matches and model_name in model_names
        # AND if the SKU's name is any of the passed SKU names
        sku_name = sku['name']
        if sku_names:
            matches = matches and sku_name in sku_names
        # AND if the requested tpm is below the remaining quota available
        if tpm:
            matches = matches and remaining >= tpm
        return matches
    return criteria
Now, let’s define each of our application requirements using the model_matches function:
editor_req = model_matches(model_names = ['gpt-35-turbo', 'gpt-35-turbo-16k'], tpm = 10, sku_names = ['Standard', 'GlobalStandard'])
eval_req = model_matches(model_names = ['gpt-4', 'gpt-4-32k'], tpm = 5, sku_names = ['Standard', 'GlobalStandard'])
writer_req = model_matches(model_names = ['gpt-4o'], tpm = 15, sku_names = ['Standard', 'GlobalStandard'])
embedding_req = model_matches(model_names = ['text-embedding-3-small', 'text-embedding-ada-002'], tpm = 30, sku_names = ['Standard', 'GlobalStandard'])
We then collect them into a list:
requirements = [editor_req, eval_req, writer_req, embedding_req]
Find the regions that have models for all requirements
def unique_regions(model_sku_usages):
    return set([model_sku_usage['region'] for model_sku_usage in model_sku_usages])

def filter_models_sku_usages(criteria, models_sku_usages):
    return list(filter(lambda msu: criteria(**msu), models_sku_usages))

def regions_matching_all(requirements, joined_model_sku_usages):
    # For each requirement, get the list of matching models
    filter_msu_sets = [filter_models_sku_usages(req, joined_model_sku_usages) for req in requirements]
    # For each set, extract unique regions
    regions_sets = [unique_regions(model_sku_usages) for model_sku_usages in filter_msu_sets]
    # Compute the intersection of all region sets, resulting in regions matching all requirements
    return set.intersection(*regions_sets)

final_regions = regions_matching_all(requirements, joined_model_sku_usages)
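As a usage example (the resulting region names will vary with your subscription and quotas):

```python
# Pick any region that satisfies every requirement,
# e.g. the first one alphabetically
if final_regions:
    target_region = sorted(final_regions)[0]
    print(f"Deploying all models to {target_region}")
else:
    print("No single region satisfies all requirements")
```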
We end up with a list of regions that meet all our requirements. Consequently, we can deploy our application to one of these regions, ensuring access to all necessary models.
Note: The code presented here is oversimplified and has limitations. In particular, it doesn't account for situations where multiple requirements rely on the same model: the shared remaining quota is counted once per requirement, so a region can be selected even though it lacks enough quota for all requirements combined.
For example, if two requirements each need an instance of the same model, and that model has enough remaining quota in a given region for one requirement but not both, the region will still be selected when it should not be.
Can you provide a better implementation? Feel free to submit your ideas in the comments below or by opening a PR on the repo ;)
Conclusion
Automating the discovery of regions, models, and available quotas can be challenging, but Azure provides all the APIs required to do so, and this article should help you get started.
Once you have your script in place, it's easy to re-run it for each subsequent model deployment. This reduces friction for GenAIOps engineers who need to deploy multiple models while optimizing quota usage.
The approach discussed here is available as a Jupyter notebook in the following GitHub repository:
If you found this article helpful, please consider starring and cloning the repository. This feedback allows us to gauge whether our work resonates with our audience and helps us understand its impact and relevance.