Azure Container Apps Serverless GPUs are now in public preview. Serverless GPUs enable you to seamlessly run your AI workloads on-demand with automatic scaling, optimized cold start, per-second billing, and reduced operational overhead.
We're excited to announce the public preview of Azure Container Apps Serverless GPUs accelerated by NVIDIA. This feature provides customers with NVIDIA A100 GPUs and NVIDIA T4 GPUs in a serverless environment, enabling effortless scaling and flexibility for real-time custom model inferencing and other machine learning tasks.
Serverless GPUs accelerate your AI development team by letting you focus on your core AI code rather than on managing infrastructure when using NVIDIA accelerated computing. They provide an excellent middle-layer option between Azure AI Model Catalog's serverless APIs and hosting models on managed compute, and they offer full data governance because your data never leaves the boundaries of your container while you still get a managed, serverless platform from which to build your applications. Serverless GPUs are designed to meet the growing demands of modern applications by providing powerful NVIDIA accelerated computing resources without the need for dedicated infrastructure management.
"Azure Container Apps' serverless GPU offering is a leap forward for AI workloads. Serverless NVIDIA GPUs are well suited for a wide array of AI workloads from real-time inferencing scenarios with custom models to fine-tuning. NVIDIA is also working with Microsoft to bring NVIDIA NIM microservices to Azure Container Apps to optimize AI inference performance.” - Dave Salvator, Director, Accelerated Computing Products, NVIDIA
Key benefits of serverless GPUs
- Scale-to-zero GPUs: Support for serverless scaling of NVIDIA A100 and T4 GPUs.
- Per-second billing: Pay only for the GPU compute you use.
- Built-in data governance: Your data never leaves the container boundary.
- Flexible compute options: Choose between NVIDIA A100 and T4 GPUs.
- Middle-layer for AI development: Bring your own model on a managed, serverless compute platform.
Scenarios
Whether you choose NVIDIA A100 or T4 GPUs depends on the types of apps you're creating. The following are a couple of example scenarios. In each scenario with serverless GPUs, you pay only for the compute you use with per-second billing, and your apps automatically scale in and out from zero to meet demand.
NVIDIA T4
- Real-time and batch inferencing: With fast startup times, automatic scaling, and a per-second billing model, serverless GPUs are ideal for dynamic applications that run custom open-source models which don't already have a serverless API in the model catalog.
NVIDIA A100
- Compute intensive machine learning scenarios: Significantly speed up applications that implement fine-tuned custom generative AI models, deep learning, or neural networks.
- High performance computing (HPC) and data analytics: Applications that require complex calculations or simulations, such as scientific computing and financial modeling as well as accelerated data processing and analysis among massive datasets.
Get started with serverless GPUs
Serverless GPUs are now available for workload profile environments in the West US 3 and Australia East regions, with more regions to come. You need quota enabled on your subscription to use serverless GPUs. By default, all Microsoft Enterprise Agreement customers have one quota. If additional quota is needed, please request it here.
Note: To achieve the best performance with serverless GPUs, use an Azure Container Registry (ACR) with artifact streaming enabled for your image tag. Follow the steps here to enable artifact streaming on your ACR.
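As an illustration, the preview `az acr artifact-streaming` command group can enable streaming on an existing image. This is a minimal sketch, assuming placeholder registry and image names; the commands are in preview, so confirm them against the current Azure CLI documentation:

```bash
# Enable artifact streaming for an existing image tag (preview commands;
# the registry and image names below are placeholders).
az acr artifact-streaming create \
  --name myregistry \
  --image gpu-quickstart:latest

# Turn on streaming for all future pushes to the repository.
az acr artifact-streaming update \
  --name myregistry \
  --repository gpu-quickstart \
  --enable-streaming true
```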
From the portal, you can enable GPUs for your Consumption app in the Container tab when creating your Container App or your Container App Job.
You can also add a new Consumption GPU workload profile to your existing Container App environment through the workload profiles UX in the portal or through the CLI commands for managing workload profiles, as sketched below.
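Here is a hedged sketch of the CLI path. The resource names are placeholders, and the GPU profile type string is an assumption; run `az containerapp env workload-profile list-supported` first to confirm the exact profile type names available in your region:

```bash
# List the workload profile types supported in a region.
az containerapp env workload-profile list-supported --location westus3

# Add a serverless (Consumption) GPU workload profile to an existing
# environment. Resource names and the profile type are placeholders.
az containerapp env workload-profile add \
  --resource-group my-resource-group \
  --name my-environment \
  --workload-profile-name gpu-serverless \
  --workload-profile-type Consumption-GPU-NC24-A100
```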
Deploy a sample Stable Diffusion app
To try out serverless GPUs, you can use the Stable Diffusion image that is provided as a quickstart during the container app create experience:
- In the Container tab, select the Use quickstart image box.
- In the Quickstart image dropdown, select GPU hello world container.
If you wish to pull the GPU container image into your own ACR to enable artifact streaming for improved performance, or if you wish to enter the image manually, you can find it at mcr.microsoft.com/k8se/gpu-quickstart:latest. For full steps on using your own image with serverless GPUs, see the tutorial on using serverless GPUs in Azure Container Apps.
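If you prefer the CLI, the sketch below imports the quickstart image into your own ACR and deploys it onto the GPU workload profile added earlier. The resource names, workload profile name, and target port are assumptions to adjust for your environment, and a private registry may additionally require `--registry-server` and credentials or a managed identity:

```bash
# Optionally import the public quickstart image into your own ACR
# (so artifact streaming can be enabled on it).
az acr import \
  --name myregistry \
  --source mcr.microsoft.com/k8se/gpu-quickstart:latest \
  --image gpu-quickstart:latest

# Deploy the image to a container app on the GPU workload profile.
# Names and the target port below are placeholders.
az containerapp create \
  --name gpu-hello-world \
  --resource-group my-resource-group \
  --environment my-environment \
  --image myregistry.azurecr.io/gpu-quickstart:latest \
  --workload-profile-name gpu-serverless \
  --ingress external \
  --target-port 80
```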
Learn more about serverless GPUs
With serverless GPUs, Azure Container Apps now simplifies the development of your AI applications by providing scale-to-zero compute, pay-as-you-go pricing, reduced infrastructure management, and more. To learn more, visit: