Azure OpenAI Architecture Patterns and implementation steps

ManasaDevadas
Nov 13, 2023

Introduction:

This article provides a comprehensive overview of the architecture patterns most frequently used and discussed by our customers across various domains, along with implementation steps for each.

1) AOAI with Azure Front Door for load balancing

  • Use Azure Front Door (AFD) for cross-region global load balancing of requests across multiple Azure OpenAI endpoints.
  • In the architecture below, Azure Front Door routes requests to multiple instances of Azure OpenAI hosted in multiple regions.
    AFD uses a health probe on the path /status-0123456789abcdef to determine the health and proximity of each Azure OpenAI endpoint.
  • The deployment name must be the same across all load-balanced instances, since the deployment name is part of the request URL path (for example, /openai/deployments/{deployment-name}/chat/completions).
  • Use Azure AD authentication for AOAI: register an app in Azure AD and assign it the Cognitive Services OpenAI Contributor or Cognitive Services OpenAI User role on each Azure OpenAI resource. A minimal client call is sketched below.
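To see what a call through the load-balanced endpoint looks like, here is a minimal Python sketch using Azure AD authentication. It assumes the azure-identity and requests packages; the Front Door hostname aoai.contoso.com and the deployment name gpt-35-turbo are placeholders.

# Minimal sketch: call Azure OpenAI through a Front Door endpoint with an
# Azure AD token. Hostname and deployment name below are placeholders.
import requests
from azure.identity import DefaultAzureCredential

# Token audience for Azure OpenAI (Cognitive Services)
token = DefaultAzureCredential().get_token(
    "https://cognitiveservices.azure.com/.default").token

url = (
    "https://aoai.contoso.com"          # Front Door endpoint / custom domain
    "/openai/deployments/gpt-35-turbo"  # deployment name must match in every region
    "/chat/completions?api-version=2023-05-15"
)
response = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}"},
    json={"messages": [{"role": "user", "content": "Hello"}]},
)
print(response.status_code, response.json()["choices"][0]["message"]["content"])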


Architecture diagram:

Key Highlights:

  • Global load balancing across multiple Azure OpenAI endpoints in multiple regions, with intelligent health probe monitoring.
  • AFD scales out and improves the performance of your AOAI endpoints using Microsoft's global cloud CDN and WAN.
  • Unified static and dynamic delivery in a single AFD tier accelerates and scales traffic through caching, SSL offload, and layer 3-4 DDoS protection.
  • Protection against OWASP Top 10 attacks, Common Vulnerabilities and Exposures (CVEs), and malicious bots through the AFD WAF. Read more here: https://learn.microsoft.com/en-us/azure/web-application-firewall/afds/protect-azure-open-ai
  • Define your own custom domain with AFD; AFD auto-rotates the managed SSL certificates.
  • As of today, AFD cannot connect to your AOAI origin using Private Link.

If you set equal weights for all origins and a sufficiently high latency sensitivity in Azure Front Door, AFD treats every origin whose measured latency falls within that range of the fastest origin as eligible for traffic. All such origins should therefore receive approximately equal amounts of traffic.

However, this does not guarantee a perfect round-robin distribution. The actual split can vary with network conditions and changes in measured latency. If you need strict round-robin load balancing, consider other services or features that specifically support it.

Use Postman for testing: send the same request twice (Request 1 and Request 2) and verify from the responses that different regional endpoints served them.
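If you prefer a scripted check over Postman, a small Python loop can tally which region serves each response. This sketch reuses the placeholder endpoint from the earlier example and assumes the x-ms-region response header is returned (Azure OpenAI generally returns it; if your deployment does not, log another distinguishing header instead).

# Send repeated requests through Front Door and tally the serving region.
import collections
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token(
    "https://cognitiveservices.azure.com/.default").token
url = ("https://aoai.contoso.com/openai/deployments/gpt-35-turbo"
       "/chat/completions?api-version=2023-05-15")

hits = collections.Counter()
for _ in range(10):
    r = requests.post(url,
                      headers={"Authorization": f"Bearer {token}"},
                      json={"messages": [{"role": "user", "content": "ping"}]})
    hits[r.headers.get("x-ms-region", "unknown")] += 1
print(hits)  # e.g. Counter({'West Europe': 6, 'Japan East': 4})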

For perfect round-robin distribution, you can use Azure Application Gateway with the same health check endpoints.

2) AOAI with APIM

Architecture diagram:

Key highlights:

  • You can use APIM to manage the access, usage, and billing of your Azure OpenAI APIs, and apply policies such as authentication, caching, rate limiting, and transformation (a minimal consumer call is sketched after this list).
  • You can monitor and analyze the performance and health of your Azure OpenAI APIs, and troubleshoot any issues using APIM’s built-in tools and integrations with Azure Monitor and Application Insights.
  • You can publish your Azure OpenAI APIs to a developer portal, where you can provide documentation, samples, and interactive testing for your consumers.
  • You can use APIM to create composite APIs that can orchestrate multiple Azure OpenAI models or integrate with other Azure services and external APIs.
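As a reference point for the patterns below, here is a minimal sketch of a consumer calling an Azure OpenAI API published through APIM. The gateway hostname my-apim.azure-api.net, the API path, and the subscription key are placeholders; how APIM authenticates to the backend is covered in pattern (b).

# Minimal sketch: call an Azure OpenAI API published through an APIM gateway.
import requests

url = ("https://my-apim.azure-api.net/openai/deployments/gpt-35-turbo"
       "/chat/completions?api-version=2023-05-15")
response = requests.post(
    url,
    headers={"Ocp-Apim-Subscription-Key": "<your-apim-subscription-key>"},
    json={"messages": [{"role": "user", "content": "Hello"}]},
)
print(response.status_code)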

a) Round Robin load balancing with Retry logic

The policy below alternates requests between two AOAI backend instances using a counter kept in the APIM internal cache, and retries against the alternate backend when a request fails with a 4xx/5xx response:

<policies>
    <inbound>
        <base />
        <!-- Look up the round-robin counter kept in the APIM internal cache -->
        <cache-lookup-value key="backend-counter" variable-name="backend-counter" />
        <choose>
            <when condition="@(!context.Variables.ContainsKey("backend-counter"))">
                <!-- First request (or cache entry expired): initialize the counter -->
                <set-variable name="backend-counter" value="0" />
                <cache-store-value key="backend-counter" value="0" duration="100" />
            </when>
        </choose>
        <!-- Alternate between the two AOAI backends and flip the counter -->
        <choose>
            <when condition="@(int.Parse((string)context.Variables["backend-counter"]) == 0)">
                <set-backend-service base-url="https://aoaipoc.openai.azure.com/" />
                <set-variable name="backend-counter" value="1" />
                <cache-store-value key="backend-counter" value="1" duration="100" />
            </when>
            <when condition="@(int.Parse((string)context.Variables["backend-counter"]) == 1)">
                <set-backend-service base-url="https://aoaipoc2.openai.azure.com/" />
                <set-variable name="backend-counter" value="0" />
                <cache-store-value key="backend-counter" value="0" duration="100" />
            </when>
        </choose>
    </inbound>
    <backend>
        <!-- Retry on any 4xx/5xx response (up to 6 attempts, 10 s apart, first
             retry immediate), switching to the other backend each time.
             The null check avoids evaluating the status code before the first
             response arrives. -->
        <retry condition="@(context.Response != null && context.Response.StatusCode >= 400)" count="6" interval="10" first-fast-retry="true">
            <choose>
                <when condition="@(context.Response != null && context.Response.StatusCode >= 400 && int.Parse((string)context.Variables["backend-counter"]) == 0)">
                    <set-backend-service base-url="https://aoaipoc.openai.azure.com/" />
                    <set-variable name="backend-counter" value="1" />
                    <cache-store-value key="backend-counter" value="1" duration="100" />
                </when>
                <when condition="@(context.Response != null && context.Response.StatusCode >= 400 && int.Parse((string)context.Variables["backend-counter"]) == 1)">
                    <set-backend-service base-url="https://aoaipoc2.openai.azure.com/" />
                    <set-variable name="backend-counter" value="0" />
                    <cache-store-value key="backend-counter" value="0" duration="100" />
                </when>
            </choose>
            <forward-request buffer-request-body="true" />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>


Testing round-robin load balancing using APIM:
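One way to verify the alternation is to call the APIM gateway repeatedly and record which backend region answered. This sketch reuses the placeholder gateway URL and subscription key from above, and assumes the x-ms-region header from Azure OpenAI is passed through to the caller.

# Verify alternation between the two AOAI backends behind APIM.
import collections
import requests

url = ("https://my-apim.azure-api.net/openai/deployments/gpt-35-turbo"
       "/chat/completions?api-version=2023-05-15")
headers = {"Ocp-Apim-Subscription-Key": "<your-apim-subscription-key>"}
body = {"messages": [{"role": "user", "content": "ping"}]}

hits = collections.Counter()
for _ in range(6):
    r = requests.post(url, headers=headers, json=body)
    hits[r.headers.get("x-ms-region", "unknown")] += 1
print(hits)  # with two healthy backends this should split roughly 50/50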

b) AAD authentication from APIM to Azure OpenAI

Step 1 – Enable Managed Identity in APIM

Step 2 – Provide necessary RBAC:

In the IAM blade of the Azure OpenAI resource, add the Cognitive Services OpenAI User role assignment for the APIM managed identity (the managed identity has the same name as the APIM instance).

Step 3 – Add the Managed Identity policy in APIM:


<policies>
    <inbound>
        <base />
        <!-- Acquire an Azure AD token for Azure OpenAI using APIM's managed
             identity; the token is attached as a Bearer Authorization header -->
        <authentication-managed-identity resource="https://cognitiveservices.azure.com" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>


Testing for Managed Identity Policy:

 
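With the managed identity policy in place, the client no longer sends an Azure OpenAI api-key or bearer token; APIM acquires the token itself. A minimal check, using the same placeholder gateway URL and subscription key as above:

# A 200 response confirms that APIM authenticated to Azure OpenAI with its
# managed identity; the client supplied only the APIM subscription key.
import requests

r = requests.post(
    "https://my-apim.azure-api.net/openai/deployments/gpt-35-turbo"
    "/chat/completions?api-version=2023-05-15",
    headers={"Ocp-Apim-Subscription-Key": "<your-apim-subscription-key>"},
    json={"messages": [{"role": "user", "content": "Hello"}]},
)
print(r.status_code)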

c) Policy to extract the callerID (JWT subject) in APIM

For extracting other details from the JWT, refer to Azure API Management policy expressions | Microsoft Learn. The snippet below validates the incoming token and copies its subject claim into a request header; note that the <openid-config> element and the {{tenant-id}} named value are additions for completeness here, so substitute your own tenant's OpenID metadata endpoint.

<validate-jwt header-name="Authorization"
              failed-validation-httpcode="401"
              failed-validation-error-message="Token is invalid"
              output-token-variable-name="jwt-token">
    <!-- {{tenant-id}} and {{myIssuer}} are APIM named values you define; the
         openid-config endpoint supplies the signing keys used for validation -->
    <openid-config url="https://login.microsoftonline.com/{{tenant-id}}/v2.0/.well-known/openid-configuration" />
    <issuers>
        <issuer>{{myIssuer}}</issuer>
    </issuers>
</validate-jwt>
<!-- Extract the subject claim and pass it to the backend as a header -->
<set-header name="caller-objectid" exists-action="override">
    <value>@(((Jwt)context.Variables["jwt-token"]).Subject)</value>
</set-header>


d) Logging and Monitoring using APIM:

Use Azure Monitor and APIM to enable enhanced logging and monitoring of the published AOAI APIs. Learn more - Tutorial - Monitor published APIs in Azure API Management | Microsoft Learn


Sample log query for prompt and completion details:


// Parse the model, token usage, prompt, and completion out of the logged bodies
ApiManagementGatewayLogs
| extend model = tostring(parse_json(BackendResponseBody)['model'])
| extend prompttokens = parse_json(BackendResponseBody)['usage']['prompt_tokens']
| extend completiontokens = parse_json(BackendResponseBody)['usage']['completion_tokens']
| extend responsetext = parse_json(BackendResponseBody)['choices'][0]['message']
| extend prompttext = parse_json(RequestBody)['messages']
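To run such queries programmatically, the azure-monitor-query SDK can execute them against the Log Analytics workspace APIM logs to. A minimal sketch, where <workspace-id> is a placeholder and the identity used needs read access to the workspace:

# Query token usage from APIM gateway logs via the Log Analytics API.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())
query = """
ApiManagementGatewayLogs
| extend usage = parse_json(BackendResponseBody)['usage']
| project TimeGenerated, prompttokens = usage['prompt_tokens'],
          completiontokens = usage['completion_tokens']
"""
result = client.query_workspace("<workspace-id>", query,
                                timespan=timedelta(hours=24))
for table in result.tables:
    for row in table.rows:
        print(row)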


For more queries, refer to the documentation here: Implement logging and monitoring for Azure OpenAI large language models - Azure Architecture Center | Microsoft Learn

 

e) For advanced logging of payloads larger than 8192 bytes, refer to the documentation here: openai-python-enterprise-logging/advanced-logging at main · Azure-Samples/openai-python-enterprise-logging · GitHub

 

f) For budgets and cost management using APIM, refer to this blog - Azure Budgets and Azure OpenAI Cost Management - Microsoft Community Hub

 

 

3) AOAI with Front Door and APIM multi-region deployment for full-fledged multi-region availability

Refer to the DR documentation - Deploy Azure API Management instance to multiple Azure regions - Azure API Management | Microsoft Learn

 

a. In Front Door, add both APIM regional gateway URLs as backend origins, for example:

https://apimname-westeurope-01.regional.azure-api.net & https://apimname-japaneast-01.regional.azure-api.net

b. Configure the API Management regional status endpoints as the health probe paths, e.g. https://apimname-westeurope-01.regional.azure-api.net/status-0123456789abcdef (a quick probe check is sketched after the policy below).

c. Use a policy like the sample below to make each regional gateway route to its respective backend:


<policies>
    <inbound>
        <base />
        <!-- Route based on which APIM region is serving the request; the
             backend URLs below are placeholders for your regional AOAI endpoints -->
        <choose>
            <when condition="@("West Europe".Equals(context.Deployment.Region, StringComparison.OrdinalIgnoreCase))">
                <set-backend-service base-url="https://aoai-backend-westeurope.com/" />
            </when>
            <when condition="@("Japan East".Equals(context.Deployment.Region, StringComparison.OrdinalIgnoreCase))">
                <set-backend-service base-url="https://aoai-backend-japaneast.com/" />
            </when>
            <otherwise>
                <set-backend-service base-url="https://aoai-backend-other.com/" />
            </otherwise>
        </choose>
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
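Before relying on failover, you can quickly confirm that both regional gateways report healthy on the status endpoint Front Door probes. A minimal sketch, using the example gateway hostnames from step (a):

# Check the APIM regional status endpoints that Front Door probes.
import requests

regional_gateways = [
    "https://apimname-westeurope-01.regional.azure-api.net",
    "https://apimname-japaneast-01.regional.azure-api.net",
]
for gw in regional_gateways:
    r = requests.get(f"{gw}/status-0123456789abcdef", timeout=10)
    print(gw, "->", r.status_code)  # 200 keeps the origin in rotation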



In conclusion, this article is a starting point for implementing scalable architecture patterns that combine Azure OpenAI models with other Azure services. As we continue to explore the potential of AI, we will keep updating these patterns and documents to guide you toward smarter and more efficient systems.
