If you are a frequent user of Azure CLI or Azure PowerShell, chances are you have experienced one of the AI-powered features our service provides for the Azure command-line tools. These features include generating up-to-date command usage examples with each new Azure CLI and Azure PowerShell release so the documentation stays current (“--help” or “-h” in the command line), enabling natural language search for commands in the Azure CLI (az find), assisting with failure recovery, and the recently released AZ Predictor (for Azure PowerShell). Our AI service for Azure tools serves thousands of user requests per second. To handle such high-scale workloads, we use an Azure App Service tier called App Service Environments (ASE). While making our Azure App Service scalable and performant, we had to address several key challenges: stress testing the service with a high volume of traffic, running the service in fully isolated environments without affecting other Azure subscriptions, and throttling bad actors that exceed reasonable request levels. We will share the journey of discovery, surprises, and learnings that led us to an enterprise-grade, cloud-scale AI service that powers our own client tools.
When a user interacts with one of our integrations in Azure CLI or Azure PowerShell, a request is made to our App Service. To fulfill the request, the App Service gets the result from the relevant AI model. Note that we use different resources for different types of models. For example, the model that generates examples when a user tries to find a command is powered by Azure Search with natural language search features. The failure recovery and AZ Predictor models are better suited to being stored in Azure Storage and loaded into memory to achieve high performance.
Considering the geo-diversity of our users and the high volume of traffic, the service runs in multiple regions to achieve quick response times and high availability. Each request is routed to the nearest region by Azure Traffic Manager (1). In each region, the REST API might talk to several backing resources: Azure Search for natural language command search, Azure Storage for model files, Azure Redis Cache for caching, and Azure Key Vault for secrets.
Figure 1: App Service Architecture
As we continue to enable more features in the service, it is important to design the App Service in a scalable way. We develop common libraries for accessing Azure resources like Azure Key Vault, Azure Search, and Azure Redis Cache; those libraries can be shared with other services as well.
Scalability and performance are critical for the service. A user typing on the command line typically expects a response in less than a second. This requires the service to respond to requests quickly, on the order of milliseconds. We receive thousands of requests per second (RPS), so we need to make sure the service scales fast enough to meet changing demand. For example, we might see a usage peak when a new version of the Azure CLI comes out. To configure scaling properly, we started by analyzing Azure CLI usage patterns to understand the expected traffic trends. Figure 2 shows the hourly traffic for the service, on the order of thousands of requests.
Figure 2: Service Hourly Traffic
Based on the extremes in the usage data, the RPS estimate we used for stress testing was on the order of several thousand requests per second. Note that this estimate is higher than the traffic shown in the figure, partly because not all Azure CLI and Azure PowerShell users have upgraded to versions with our service integrations. We decided to play it safe and prepare for the maximum number of requests we might receive, plus some additional margin.
As it turned out, being able to stress test the service with thousands of requests per second was a huge challenge.
We explored a variety of stress testing tools, including loader.io and JMeter. In the end we chose JMeter because it offers a few key capabilities: fine-grained control of requests per second, full support in Azure DevOps, and active maintenance. However, using JMeter at our target scale involved its own challenges, chiefly provisioning and coordinating enough load-generating machines to reach thousands of requests per second.
We then considered Azure Container Instances (ACI) to overcome these limitations. ACI is Azure CLI-friendly, and we found it quite easy to prepare and run a container in Azure. We were able to quickly launch our images in multiple ACIs to reach our target RPS without the pain of individually configuring and maintaining multiple VMs.
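Launching a fleet of load generators along these lines can be sketched with the Azure CLI. The resource group, container image, test plan, and JMeter parameters below are illustrative placeholders, not the ones we actually used:

```shell
# Sketch: launch N Azure Container Instances, each running a JMeter image
# in non-GUI mode, to generate load in parallel against the target service.
RG=loadtest-rg                                   # hypothetical resource group
IMAGE=myregistry.azurecr.io/jmeter-load:latest   # hypothetical image

for i in $(seq 1 10); do
  az container create \
    --resource-group "$RG" \
    --name "jmeter-worker-$i" \
    --image "$IMAGE" \
    --cpu 2 --memory 4 \
    --restart-policy Never \
    --command-line "jmeter -n -t /tests/plan.jmx"
done
```

Because each container runs the same test plan, scaling the total RPS is just a matter of changing the loop count.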
This led us to an unexpected challenge: hitting the service with thousands of RPS brought it down, because the traffic overloaded the load-balancing front ends for that stamp. A stamp is a cluster of servers; you can think of it as groups of racks with shared resources, like traffic routing, within an Azure data center. To avoid this issue, we moved our infrastructure to App Service Environments (ASE). ASE provides a fully isolated and dedicated environment for securely running App Services at high scale, including isolated load-balancing front ends for network routing.
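Provisioning an ASE and hosting an App Service plan inside it can be sketched as follows. All names, the virtual network, and the SKU are hypothetical, and the exact flags can vary between ASE versions and CLI releases:

```shell
# Sketch: create an App Service Environment inside an existing VNet subnet,
# then create an Isolated-tier plan hosted in that environment.
az appservice ase create \
  --name my-ase --resource-group my-rg \
  --vnet-name my-vnet --subnet ase-subnet

az appservice plan create \
  --name my-plan --resource-group my-rg \
  --app-service-environment my-ase \
  --sku I2 --number-of-workers 10
```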
With the right architecture in place, we were able to scale both horizontally and vertically. Horizontal scaling is done by increasing the number of routing front ends and App instances in proportion to actual traffic. Vertical scaling is done by choosing the right tiers for Azure Redis Cache, Azure Search, and the App Service plan.
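The horizontal part can be automated with an autoscale setting on the App Service plan. The names, instance counts, and CPU thresholds below are illustrative assumptions:

```shell
# Sketch: autoscale the App Service plan between 4 and 20 instances.
az monitor autoscale create \
  --resource-group my-rg \
  --resource my-plan --resource-type Microsoft.Web/serverfarms \
  --name plan-autoscale \
  --min-count 4 --max-count 20 --count 4

# Scale out by 2 instances when average CPU exceeds 70% over 5 minutes,
# and back in by 2 when it drops below 30%.
az monitor autoscale rule create \
  --resource-group my-rg --autoscale-name plan-autoscale \
  --condition "CpuPercentage > 70 avg 5m" --scale out 2
az monitor autoscale rule create \
  --resource-group my-rg --autoscale-name plan-autoscale \
  --condition "CpuPercentage < 30 avg 5m" --scale in 2
```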
Bad actors and broken scripts can bombard the service with unreasonable volumes of requests. Instead of allowing these clients to potentially bring the service down, we would ideally block individual bad actors. The service implements both client- and IP-based rate limiting. Having both options allows us to be flexible, minimize slowdowns, and keep the service running even when hit by bad actors.
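The rate limiting itself lives in the service code, but as a complementary platform-level mitigation, a known bad actor can also be denied outright with an App Service access restriction. The app name and CIDR below are placeholders:

```shell
# Sketch: deny all traffic from an abusive IP range before it reaches the app.
az webapp config access-restriction add \
  --resource-group my-rg --name my-app \
  --rule-name block-bad-actor --action Deny \
  --ip-address 203.0.113.0/24 --priority 100
```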
As shown in the architecture diagram, we deploy the service to multiple geographic regions, aligned with the distribution of Azure CLI users, so that every user’s request can be served from a deployment with low latency. This helps us meet our service health targets. Our reliability goal is that 99.99% of requests in the preceding 5 minutes are successful. Our latency goal is that the 99.9th percentile of requests in the preceding 15 minutes is served in under 700 milliseconds.
All traffic for the services is routed by Azure Traffic Manager. When Traffic Manager detects that a region is unhealthy (which is checked via a monitor API defined in the service) the traffic is then routed to the next closest region.
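A setup along these lines can be sketched as a performance-routed Traffic Manager profile whose health probe targets the service’s monitor API. The DNS label, probe path, and resource IDs are assumptions, and flag names may differ slightly across CLI versions:

```shell
# Sketch: Performance routing sends each user to the closest healthy region;
# endpoints failing the HTTPS health probe are dropped from DNS answers.
az network traffic-manager profile create \
  --resource-group my-rg --name my-tm-profile \
  --routing-method Performance --unique-dns-name my-svc-example \
  --protocol HTTPS --port 443 --path "/api/health"

# Register one endpoint per regional App Service deployment.
az network traffic-manager endpoint create \
  --resource-group my-rg --profile-name my-tm-profile \
  --name westus2 --type azureEndpoints \
  --target-resource-id "<regional-app-resource-id>"
```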
We use Azure DevOps and Azure Resource Manager (ARM) templates to deploy the service. Deployments are first tested in Dev and pre-production (PPE) environments. An interesting learning: the first ASE deployment takes 1.5–2 hours, though subsequent deployments are much shorter.
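The template deployment step in such a pipeline can be sketched with a single CLI call; the resource group, template, and parameter file names are placeholders:

```shell
# Sketch: deploy the service's ARM template to a pre-production resource
# group before promoting the same template through later rings.
az deployment group create \
  --resource-group ppe-rg \
  --template-file azuredeploy.json \
  --parameters @azuredeploy.ppe.parameters.json
```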
Our release pipeline follows a ring-based deployment process, with the goal of limiting any failure to the minimum number of customers.
Monitoring is important so we know right away when something goes wrong; it is essential for meeting our reliability and availability requirements. The service logs various metrics, which feed a variety of alerts to the team: when an API call fails, when App Service performance is degraded, when a resource in a region fails, when a region goes down, and so on. The metric data is processed into near-real-time alerts (about 60 seconds of ingestion latency).
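One such alert can be sketched as a metric alert on server errors, evaluated every minute over a five-minute window. The resource IDs, metric threshold, and action group are illustrative assumptions:

```shell
# Sketch: fire an alert (via an action group) when the app returns more
# than 10 HTTP 5xx responses in any 5-minute window.
az monitor metrics alert create \
  --resource-group my-rg --name http-5xx-alert \
  --scopes "<app-service-resource-id>" \
  --condition "total Http5xx > 10" \
  --window-size 5m --evaluation-frequency 1m \
  --action "<action-group-resource-id>"
```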