Azure AI Confidential Inferencing Preview

Microsoft

Sep 24, 2024

Customers who need to protect sensitive and regulated data are looking for end-to-end, verifiable data privacy, even from service providers and cloud operators. Azure’s industry-leading confidential computing (ACC) support extends existing data protection beyond encryption at rest and in transit, ensuring that data stays private while in use, such as when it is being processed by an AI model. Customers in highly regulated industries, including the multinational banking corporation RBC, have integrated Azure confidential computing into their own platforms to garner insights while preserving customer privacy.

With today's preview of confidential inferencing for the Azure OpenAI Service Whisper speech-to-text model, Microsoft is the first cloud provider offering confidential AI. Confidential Whisper offers end-to-end privacy for prompts containing audio and for transcribed text responses by ensuring that prompts are decrypted only within Trusted Execution Environments (TEEs) on Azure Confidential GPU virtual machines (VMs).

These VMs offer enhanced protection of the inferencing application, prompts, responses, and models, both within the VM memory and when code and data are transferred to and from the GPU. Confidential AI also allows application developers to anonymize users who access cloud models, protecting their identity and shielding them from attacks that target individual users.

If you are interested in discussing Confidential AI use cases with us and trying out confidential inferencing with the Azure OpenAI Service Whisper model, please visit this preview sign-up page. Read on for more details on how confidential inferencing works, what developers need to do, and our confidential computing portfolio.

Who Confidential Inferencing Is For

Confidential inferencing is designed for enterprise and cloud-native developers building AI applications that process sensitive or regulated data in the cloud, where that data must remain encrypted even while it is being processed. These developers also need to remotely measure and audit the code that processes the data, to ensure it performs only its expected function and nothing else. Together, these capabilities let them build AI applications that preserve the privacy of their users and their data.

How Confidential Inferencing Works

Confidential inferencing uses Azure confidential virtual machines with NVIDIA H100 Tensor Core GPUs, which are now generally available. These VMs use a combination of SEV-SNP technology in AMD CPUs and confidential computing support in H100 GPUs to ensure the integrity and privacy of all code and data loaded within the VM and the protected area of GPU memory.

For example, SEV-SNP encrypts and integrity-protects the entire address space of the VM using hardware-managed keys. This means that any data processed within the TEE is protected from unauthorized access or modification by any code outside the environment, including privileged Microsoft code such as our virtualization host operating system and the Hyper-V hypervisor. When a VM is paired with an H100 GPU in confidential computing mode, all traffic between the VM and the GPU is encrypted and integrity-protected against advanced attackers.

Confidential inferencing supports Oblivious HTTP (OHTTP) with Hybrid Public Key Encryption (HPKE) to protect user privacy and to encrypt and decrypt inferencing requests and responses. Enterprises and application providers can use an OHTTP proxy to encrypt prompts, which are routed through Azure Front Door and the Azure OpenAI Service load balancer to OHTTP gateways hosted within Confidential GPU VMs in a Kubernetes cluster managed by Azure Machine Learning’s Project Forge. Front Door and the load balancer act as relays: they see only the ciphertext and the identities of the client and gateway, while the gateway sees only the relay identity and the plaintext of the request. The private data remains encrypted everywhere outside the TEE.
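The split in visibility between relay and gateway can be illustrated with a small simulation. This is only a sketch under stated assumptions: `SealedBox` is a hypothetical opaque stand-in for an HPKE ciphertext and performs no cryptography, the names `Relay`, `Gateway`, `front-door`, and `hpke-key-1` are invented for illustration, and a real OHTTP response is protected with a key derived from the request's HPKE context rather than the same key identifier.

```python
from dataclasses import dataclass, field


class SealedBox:
    """Opaque stand-in for an HPKE ciphertext (no real cryptography)."""

    def __init__(self, key_id: str, payload: bytes) -> None:
        self._key_id = key_id
        self._payload = payload

    def open(self, key_id: str) -> bytes:
        # Only a holder of the matching key can read the payload.
        if key_id != self._key_id:
            raise PermissionError("wrong key: cannot read sealed payload")
        return self._payload


@dataclass
class Gateway:
    name: str
    key_id: str
    observed: list = field(default_factory=list)

    def handle(self, sender: str, box: SealedBox) -> SealedBox:
        request = box.open(self.key_id)
        # The gateway sees the plaintext, but only the relay as the sender.
        self.observed.append({"sender": sender, "request": request})
        transcript = b"transcript for " + request  # placeholder for the model call
        return SealedBox(self.key_id, transcript)


@dataclass
class Relay:
    name: str
    observed: list = field(default_factory=list)

    def forward(self, client_id: str, box: SealedBox, gateway: Gateway) -> SealedBox:
        # The relay sees who talks to which gateway, but only a sealed payload.
        self.observed.append({"client": client_id, "gateway": gateway.name})
        return gateway.handle(self.name, box)


def run_demo() -> dict:
    gateway = Gateway(name="ohttp-gateway", key_id="hpke-key-1")
    relay = Relay(name="front-door")
    request = SealedBox("hpke-key-1", b"audio-bytes")
    response = relay.forward("client-42", request, gateway)
    return {
        "relay_saw": relay.observed,
        "gateway_saw": gateway.observed,
        "client_got": response.open("hpke-key-1"),
    }
```

In this model the relay's log contains identities but no request content, and the gateway's log contains content but no client identity, which is the property the architecture relies on.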

OHTTP gateways obtain private HPKE keys from the key management service (KMS) by presenting attestation evidence in the form of a token obtained from the Microsoft Azure Attestation service. The token proves that all software running within the VM, including the Whisper container, has been attested.
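The idea of releasing a key only against valid attestation evidence can be sketched as follows. This is a minimal illustration, not the actual KMS or Azure Attestation API: `issue_token`, `ToyKms`, and the `measurement` claim are hypothetical names, and an HMAC over a shared secret stands in for the signed token that a real deployment would verify against the attestation service's public keys.

```python
import hashlib
import hmac
import json
import secrets

# Stand-in for the attestation service's signing key (assumption for this sketch).
ATTESTATION_KEY = secrets.token_bytes(32)


def issue_token(claims: dict) -> tuple:
    """Hypothetical attestation service: sign the claims reported by a VM."""
    payload = json.dumps(claims, sort_keys=True).encode()
    signature = hmac.new(ATTESTATION_KEY, payload, hashlib.sha256).digest()
    return payload, signature


class ToyKms:
    """Releases the private key only to callers presenting an acceptable token."""

    def __init__(self, expected_measurement: str, private_key: bytes) -> None:
        self._expected = expected_measurement
        self._private_key = private_key

    def release_key(self, payload: bytes, signature: bytes) -> bytes:
        # Step 1: the token must really come from the attestation service.
        expected_sig = hmac.new(ATTESTATION_KEY, payload, hashlib.sha256).digest()
        if not hmac.compare_digest(signature, expected_sig):
            raise PermissionError("token signature is invalid")
        # Step 2: the attested software must match the key release policy.
        claims = json.loads(payload)
        if claims.get("measurement") != self._expected:
            raise PermissionError("software measurement does not match policy")
        return self._private_key
```

A gateway whose measured software differs from the policy, or whose token has been altered, is refused the key and therefore cannot decrypt any request.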

After obtaining the private key, the gateway decrypts encrypted HTTP requests, and relays them to the Whisper API containers for processing. When a response is generated, the OHTTP gateway encrypts the response and sends it back to the client.

Image: Confidential inference architecture

Confidential inferencing uses VM images and containers that are built securely from trusted sources. A software bill of materials (SBOM) is generated at build time and signed, so that it can be used to attest the software running in the TEE.

How to Integrate Confidential Inferencing

You can integrate with confidential inferencing by hosting an application or enterprise OHTTP proxy that obtains HPKE keys from the KMS and uses them to encrypt your inference data before it leaves your network and to decrypt the transcription that is returned. We are providing a reference implementation of such a proxy. The Whisper REST API and payload are unchanged.
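The control flow of such a proxy can be sketched roughly as below. This is not the reference implementation: `ConfidentialProxy` and its injected callables (`fetch_key`, `seal`, `open_`, `send`) are hypothetical, and `toy_xor` is a placeholder that only makes the sketch runnable. It is not encryption and must be replaced by a real HPKE library in practice.

```python
from typing import Callable, Optional


class ConfidentialProxy:
    """Wraps an unchanged transcription call with seal/open steps (sketch)."""

    def __init__(
        self,
        fetch_key: Callable[[], bytes],
        seal: Callable[[bytes, bytes], bytes],
        open_: Callable[[bytes, bytes], bytes],
        send: Callable[[bytes], bytes],
    ) -> None:
        self._fetch_key = fetch_key
        self._seal = seal
        self._open = open_
        self._send = send
        self._key: Optional[bytes] = None

    def _get_key(self) -> bytes:
        # Fetch the key from the KMS once and reuse it for later requests.
        if self._key is None:
            self._key = self._fetch_key()
        return self._key

    def transcribe(self, audio: bytes) -> bytes:
        key = self._get_key()
        sealed_request = self._seal(key, audio)       # before leaving the network
        sealed_response = self._send(sealed_request)  # only sealed bytes on the wire
        return self._open(key, sealed_response)       # recover the transcription


def toy_xor(key: bytes, data: bytes) -> bytes:
    """Placeholder transform, NOT encryption; stands in for HPKE seal/open."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def demo() -> tuple:
    key = b"demo-key"
    fetches, wire = [], []

    def fetch_key() -> bytes:
        fetches.append(1)
        return key

    def fake_service(sealed: bytes) -> bytes:
        wire.append(sealed)
        audio = toy_xor(key, sealed)
        return toy_xor(key, b"text:" + audio)

    proxy = ConfidentialProxy(fetch_key, toy_xor, toy_xor, fake_service)
    first = proxy.transcribe(b"hello audio")
    second = proxy.transcribe(b"more audio")
    return first, second, len(fetches), wire
```

Because the sealing steps sit entirely inside the proxy, the calling application keeps sending the same request body and receiving the same response body as with standard Whisper.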

Confidential computing adds overhead, so a transcription request will take longer to complete than with standard Whisper. We are working with NVIDIA to reduce this overhead in future hardware and software releases.

Our Confidential Computing Portfolio

Azure OpenAI Service Whisper is the first Azure AI Model-as-a-Service from Microsoft with confidential computing protection. As part of our long-term investment in confidential computing, we’ll continue to engage with our privacy-sensitive customers to best support their unique AI scenarios. We really want to hear from you about your use cases, application design patterns, AI scenarios, and what other models you want to see.

If you are interested in discussing Confidential AI use cases with us and trying out confidential inferencing with the Azure OpenAI Service Whisper model, please visit this preview sign-up page.
