Artificial Intelligence (AI) is a rapidly evolving field that offers many possibilities for enhancing the capabilities and functionality of applications and systems. Developing and deploying your own AI models, however, can be challenging, costly, and time-consuming, especially for development teams and individual developers. That's why many companies offer AI as a service (AIaaS), which lets users access AI models and services through remote, RESTful APIs. In this blog post, we will explore how you can use the Azure OpenAI (AOAI) API, one of the most advanced and versatile AIaaS platforms, to access state-of-the-art AI models from your app or system. We will then show how to increase your application's resiliency by pairing Azure OpenAI with Azure API Management (APIM).
Azure OpenAI is in very high demand right now, which means it might not be available in the region you desire. Microsoft is working to make the Azure OpenAI service available as quickly as possible, but the datacenters where capacity is immediately available for specific models do not always align with the datacenters where you have chosen to deploy the apps and systems that will access the LLM endpoints. Additionally, for some customers, the application or system that needs to access the LLM service is still on-premises. The architecture pattern later in this article helps address these issues and shows a multi-region AOAI deployment for increased resiliency.
The Azure OpenAI RESTful API
The Azure OpenAI (AOAI) service is a RESTful API offering access to a suite of powerful and general-purpose AI models developed by OpenAI, a leading research organization in the field of artificial intelligence. The AOAI API allows users to integrate AI capabilities into their apps or systems without requiring any prior knowledge or expertise in AI development or deployment.
The Azure OpenAI API supports a variety of use cases and domains. It is based on the GPT-3 family of models, among the largest and most advanced language models in the world, capable of generating coherent and relevant text on almost any topic given some input or prompt. The API also provides access to other specialized models, such as DALL-E, which can generate images from text descriptions, or CLIP, which can classify images based on natural language queries.
The Azure OpenAI API is a powerful and general-purpose service that can produce high-quality and diverse data, but it can also produce harmful or inappropriate data, depending on the input, the parameters, and the model that you use. We encourage everyone to be mindful of these implications and review Microsoft's responsible AI principles, which were created to help AI developers and users successfully navigate ethical and social considerations.
If you’re just getting started, check out this article on deploying the AOAI service. For more detailed information about AOAI API resources, please refer to the API documentation and AOAI service models articles.
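To make this concrete, here is a minimal sketch of a chat completion call using the openai Python package (v1.x). The endpoint, deployment name, and API version below are placeholders, not values from this article; substitute your own.

```python
# A minimal sketch of calling an Azure OpenAI chat deployment with the
# openai Python package (v1.x). Endpoint, deployment name, and API
# version are hypothetical placeholders.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # use the version your resource supports
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # the name of *your* model deployment
    messages=[{"role": "user", "content": "Summarize what AIaaS means."}],
)
print(response.choices[0].message.content)
```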
Design Considerations
Designing Azure OpenAI solutions requires awareness of a few constraints that are specific to the AOAI API. By default, a quota is allocated to an AOAI service subscription per region for each respective model. The quota is measured in Tokens-per-Minute (TPM) and is decremented by each model deployment. Please review this article for more information on quotas and planning your consumption.
Key Points:
- Dynamic quota is a preview feature that may lessen the burden of managing quota.
- Quota doesn’t guarantee that the underlying capacity is available.
- Throttling can be experienced if you exceed your quota (see the backoff sketch after this list).
- Response payload sizes might vary depending on prompts, request parameters, and the selected model.
- Provisioned throughput units (PTU) can provide more predictable performance and cost savings.
- Depending on available capacity, your AOAI subscription may be in a different region than where your application and data reside.
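Because throttling surfaces as HTTP 429 responses, a common client-side mitigation is retry with exponential backoff. Below is a minimal sketch assuming the openai Python package (v1.x) and a client configured as in the earlier example; the retry count and delays are illustrative, not prescriptive.

```python
# A minimal sketch of exponential backoff for 429 (throttling) responses.
# Thresholds are illustrative; tune them against your own quota and SLOs.
import time

from openai import AzureOpenAI, RateLimitError

def chat_with_backoff(client: AzureOpenAI, messages, deployment: str,
                      max_retries: int = 5):
    delay = 1.0
    for _ in range(max_retries):
        try:
            return client.chat.completions.create(model=deployment,
                                                  messages=messages)
        except RateLimitError:
            # Quota exceeded for this interval: wait, then retry with an
            # exponentially increasing delay.
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("Still throttled after retries; consider requesting "
                       "more TPM quota or adding capacity in another region.")
```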
How can I ensure application performance?
APIs are foundational to modern application architectures. Latency and availability are key factors that impact performance, and whether an API endpoint is 5 meters away or 1,500 kilometers away, these factors remain critical and require vigilant testing to ensure performance.
Thankfully, Azure provides native tools to measure performance. Application Insights can measure the latency of calls made to AOAI along with other critical metrics. Additionally, JMeter can be used with Azure Load Testing to automate testing, generate reports, and integrate with your release management pipelines.
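For quick, ad-hoc spot checks from your application's region, a short script can also measure time-to-first-token and total response time with a streaming call. This is a minimal sketch assuming the client from the earlier example; it complements, rather than replaces, Application Insights and Azure Load Testing.

```python
# A minimal sketch that measures time-to-first-token and total response
# time for a streaming chat completion. The deployment name is assumed
# to be passed in by the caller.
import time

def measure_latency(client, deployment: str, prompt: str):
    start = time.perf_counter()
    first_token = None
    stream = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks (e.g., content-filter results) carry no choices.
        if first_token is None and chunk.choices and chunk.choices[0].delta.content:
            first_token = time.perf_counter() - start
    total = time.perf_counter() - start
    return first_token, total
```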
Combined with effective prompt engineering techniques, these tools will help ensure your application's performance meets business requirements. If you are using provisioned throughput, the Azure OpenAI benchmarking tool is designed for experimentation with variable traffic patterns to help you optimize your solution.
For a more in-depth review, I encourage you to read the performance and latency article that explains these factors in relation to Azure OpenAI.
How can I ensure business continuity?
Resiliency planning is always crucial, but if you are using multiple Azure OpenAI subscriptions, or your subscription resides in a different region than your workloads, you will want a robust solution built on an established pattern. Azure API Management's Circuit Breaker is a public preview feature that can help you improve the resiliency of your app or system in several ways. In the following section, we'll discuss how you can use it with the Azure OpenAI API.
The Circuit Breaker approach
This approach involves monitoring the response time and error rate of the Azure OpenAI API and temporarily disabling calls to an endpoint when either exceeds a certain threshold.
This way, you can prevent your app or system from being overwhelmed by slow or faulty responses and instead fall back to alternative data sources or behaviors. You can use Azure API Management to implement this approach by configuring the policies and rules for the Circuit Breaker pattern, as described here: Using Azure API Management Circuit Breaker and Load balancing with Azure OpenAI Service - Microsoft Community Hub
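APIM implements this pattern natively, as the linked article describes. To illustrate the behavior conceptually, here is a minimal client-side sketch of a circuit breaker; the thresholds and the fallback function are hypothetical placeholders, not the APIM feature itself.

```python
# A conceptual, client-side sketch of the circuit breaker pattern.
# Thresholds and fallback behavior are illustrative assumptions.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While the circuit is open, short-circuit to the fallback.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
            self.failures = 0  # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
```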
The Smart Load Balancing approach
This approach involves distributing the load of the Azure OpenAI API calls among different instances or endpoints, based on their availability and performance.
This way, you can optimize the usage and cost of the Azure OpenAI API and avoid overloading or underutilizing any instance or endpoint. You can use Azure API Management to implement this approach by configuring the policies and rules for the Load Balancing pattern, as described here: 🚀 Smart load balancing for OpenAI endpoints and Azure API Management - Microsoft Community Hub
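Again, APIM's policies handle this for you as the linked article describes; the sketch below only illustrates the idea client-side. The endpoint URLs and the priority ordering (for example, a PTU-backed deployment first, with pay-as-you-go spillover in another region) are assumptions for illustration.

```python
# A conceptual sketch of prioritized load balancing across two AOAI
# endpoints. URLs, keys, and ordering are hypothetical placeholders.
from openai import AzureOpenAI, RateLimitError

# Ordered by priority: e.g., a PTU-backed deployment first, then a
# pay-as-you-go deployment in another region as spillover.
backends = [
    AzureOpenAI(azure_endpoint="https://aoai-eastus.openai.azure.com",
                api_key="<key-1>", api_version="2024-02-01"),
    AzureOpenAI(azure_endpoint="https://aoai-westeurope.openai.azure.com",
                api_key="<key-2>", api_version="2024-02-01"),
]

def chat_with_spillover(messages, deployment: str):
    last_error = None
    for client in backends:
        try:
            return client.chat.completions.create(model=deployment,
                                                  messages=messages)
        except RateLimitError as err:
            last_error = err  # this backend is throttled; try the next one
    if last_error is not None:
        raise last_error
    raise RuntimeError("No backends configured")
```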
Conclusion
We hope you enjoyed this blog post and learned something new about how to use Azure OpenAI with Azure API Management to effectively access powerful AI services from your app or system. The Azure OpenAI API is a game-changing platform that can enable you to create amazing and innovative solutions for a variety of domains and scenarios, but it also requires careful planning and design to ensure optimal performance, scalability, security, and cost-efficiency. In this blog post, we discussed some of the key aspects and challenges of integrating the Azure OpenAI API into your architecture, and we presented some best practices and patterns to help you overcome them. If you want to learn more about the Azure OpenAI API and how to use it effectively, you can check out the following resources:
- Azure OpenAI API overview | Microsoft Learn
- Azure OpenAI API documentation | Microsoft Learn
- Azure/aoai-apim: Scaling AOAI using APIM, PTUs and TPMs (github.com)
- Azure/azure-openai-benchmark: Azure OpenAI benchmarking tool (github.com)
Thank you for reading and happy problem solving!
Contributors: Tim Ferro, James Gibbings, Matt Anderson, Shelly Avery, Rob McKenna, Randy Nale