Azure Confidential Computing Blog

BigDL Privacy Preserving Machine Learning with Occlum OSS on Azure Confidential Computing

Nov 14, 2022

Introduction

Typical security measures may protect data at rest and in transit but can fall short of fully protecting data while it is actively used in memory. Intel® Software Guard Extensions (Intel® SGX) provides a protected hardware environment to secure data used in memory. For confidential computing, users can create Virtual Machines (VMs) with Intel® SGX to secure their applications during computation. However, building an end-to-end confidential computing application is knowledge intensive and requires a sound understanding of the application, Intel® SGX, and other security components.

This blog introduces a confidential computing solution for Privacy-Preserving Machine Learning (PPML) made available by Occlum and BigDL on the Azure cloud. It demonstrates the solution using a sample analytics application built for the NYTaxi dataset. The sample application uses Azure confidential computing (ACC) components such as SGX nodes for Azure Kubernetes Service (AKS), Microsoft Azure Attestation, and Azure Key Vault (AKV), as well as Occlum LibOS and BigDL PPML.

Solution Architecture and Sample Application

Let’s first review a typical PPML workflow on a Kubernetes cluster, as illustrated below. The Azure solution is built by applying the same workflow with ACC components.

Users can follow the steps in the diagram above to walk through the PPML flow:

User submits job to K8s (Kubernetes) and creates the driver node
Client attests the Attestation Service and submits the policy
Driver initiates additional Executor nodes
Driver and Executor nodes attest with Attestation Service
Driver and Executors request keys from KMS (Key Management Service)
Executors read and decrypt input data
Executors run distributed Big Data, ML and DL programs
Executors encrypt and write output data
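
The ordering of the attestation and key-request steps above can be sketched as a toy model. This is a minimal sketch, not BigDL PPML or Azure code: every class and function name below is hypothetical, and the rule that the KMS hands out keys only to nodes that have already passed attestation is an assumption about how those two steps are linked.

```python
from dataclasses import dataclass, field


@dataclass
class AttestationService:
    """Hypothetical stand-in for an attestation service."""
    attested: set = field(default_factory=set)

    def attest(self, node: str, quote_ok: bool) -> bool:
        # A real service verifies an SGX quote against a policy;
        # a boolean stands in for that check here.
        if quote_ok:
            self.attested.add(node)
        return quote_ok


@dataclass
class KeyManagementService:
    """Hypothetical KMS that, by assumption, serves only attested nodes."""
    attestation: AttestationService

    def request_key(self, node: str) -> bytes:
        if node not in self.attestation.attested:
            raise PermissionError(f"{node} has not passed attestation")
        return b"placeholder-key"  # not a real key


def run_workflow(executors, trusted):
    """Return the names of the nodes that obtained a key."""
    service = AttestationService()
    kms = KeyManagementService(service)
    nodes = ["driver", *executors]
    for node in nodes:  # driver and executors attest
        service.attest(node, quote_ok=node in trusted)
    keyed = []
    for node in nodes:  # driver and executors request keys
        try:
            kms.request_key(node)
            keyed.append(node)
        except PermissionError:
            pass  # an unattested node never receives the key
    return keyed


if __name__ == "__main__":
    print(run_workflow(["executor-1", "executor-2"],
                       trusted={"driver", "executor-1"}))
```

In this toy run, `executor-2` fails attestation and therefore never obtains a key, so it could not decrypt the input data in the later steps.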

Now, let’s apply this same workflow on the Azure cloud. The diagram below illustrates the Azure PPML solution built with ACC components, Occlum LibOS and BigDL PPML.

In this Azure PPML solution, the User Application is a Spark application that can be written in Scala or Java. For our sample, it is a simple Spark application that queries the NYTaxi dataset. The Spark Driver takes the Spark jobs submitted by the Spark Client, distributes and schedules the work across the Spark Executors, and responds to the User Application. The Spark Executors execute the code assigned to them and report the state of the computation back to the Spark Driver.

Intel® BigDL, which is the core enabler behind the end-to-end and distributed AI processing, works together with the Occlum LibOS to enable the User Application, Spark Driver, and Spark Executors to run on an SGX-enabled AKS cluster. Azure Attestation handles the attestation process, Azure Data Lake Storage hosts the data to be processed, and Azure Key Vault can serve as the KMS in the end-to-end workflow.

Deployment

We use a sample NYTaxi dataset analytics application to demonstrate the PPML deployment procedure on an ACC cluster. The steps to deploy the solution are as follows:

Step 0: Deploy the Azure cloud services

Create the AKS cluster with Intel® SGX
Create an Azure storage account and upload data to the account. The NYTaxi dataset in this example is a 50 GB dataset containing 1.5 billion records. It is already hosted on a public Microsoft Azure Storage account.
Set up Microsoft Azure Attestation Service. In this example, we use the default Microsoft Azure Attestation Service Provider and the default policies.
Create an Azure Linux VM: download, extract, and install Spark Client 3.1.2 on the VM. It doesn’t need to be an Azure confidential computing VM.
Then install OpenJDK 8 and export SPARK_HOME=${Spark_Binary_dir}.

Note that the NYTaxi data is not encrypted on storage, so secret key provisioning is not included in this demo. In a real-world deployment, it is recommended to encrypt data on storage (encryption at rest), then set up a key management service (e.g., Azure Key Vault) and secret key provisioning as part of the deployment.

Step 1: Build the sample application

Create the NYTaxi query sample application using standard Spark SQL with an Azure Storage data source.

Step 2: Submit job to the AKS Cluster

On the Azure VM, submit the NYTaxi query to AKS as follows:

Clone the repository

git clone https://github.com/intel-analytics/BigDL-PPML-Azure-Occlum-Example.git

Configure the environment variables in the run_nytaxi_k8s.sh, driver.yaml, and executor.yaml files.
Configure the AKS address and submit the NYTaxi query task by running run_nytaxi_k8s.sh.

bash run_nytaxi_k8s.sh

Step 3: Execute the job on the AKS Cluster

The job is executed on the AKS cluster: in each Spark driver/executor pod, Microsoft Azure Attestation Service runs the attestation process to verify the trustworthiness of the platform and the integrity of the binaries running inside it. Once attestation completes, the Spark executors run the data analysis with the Spark SQL query.

Step 4: Review the results

Upon successful completion, you should see the NYTaxi dataframe count and the aggregation duration.

Performance data

To evaluate the performance of this solution, we built a simple benchmark based on our sample application. The benchmark runs in both an SGX environment and a non-SGX environment to give a direct performance comparison.

Scenarios:

Scenario	Description
No Intel SGX	The driver and the executors run without SGX support using the regular Spark image (vanilla Spark).
Occlum	The driver and the executors are encrypted and run on Intel CPUs with SGX support using the BigDL Occlum image.

Cluster info:

The cluster consists of 4 Standard_DC8ds_v3 nodes, the same for both scenarios.
There is one Spark driver plus a varying number of executors (1, 2, and 3) for benchmarking.
Each Spark driver/executor runs on a different cluster node.
All Pods (driver or executor) have the same CPU (4 cores) and memory requests/limits (8 GB for No SGX, 8 GB EPC for Occlum).

Results:

We ran the benchmark multiple times with 1, 2, and 3 executors and plotted the average duration in the chart above.

The run time of the sample application consists of Initialization Time and Execution Time. For this specific sample application, the Execution Time of BigDL PPML on Occlum is 130% of vanilla Spark when running on 1 executor, and drops to 116% of vanilla Spark when running on 3 executors. This indicates that BigDL PPML on Occlum adds a limited execution overhead (at most 30% in this benchmark) to existing Spark applications, and that the overhead shrinks as more executors are added.
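
To make the relationship between "130% of vanilla Spark" and "30% overhead" explicit, here is a small worked example. It is only a sketch: the function names are illustrative, and the absolute durations are hypothetical values chosen to reproduce the 130% and 116% ratios reported above, since the raw timings are not given here.

```python
def relative_execution_time(occlum_seconds: float, vanilla_seconds: float) -> float:
    """Occlum execution time as a percentage of vanilla Spark execution time."""
    if vanilla_seconds <= 0:
        raise ValueError("vanilla_seconds must be positive")
    return 100.0 * occlum_seconds / vanilla_seconds


def overhead_percent(occlum_seconds: float, vanilla_seconds: float) -> float:
    """Extra execution time over vanilla Spark, in percent."""
    return relative_execution_time(occlum_seconds, vanilla_seconds) - 100.0


if __name__ == "__main__":
    # Hypothetical durations (seconds), chosen only to match the reported ratios.
    samples = {1: (130.0, 100.0), 3: (116.0, 100.0)}
    for executors, (occlum, vanilla) in samples.items():
        print(executors,
              relative_execution_time(occlum, vanilla),
              overhead_percent(occlum, vanilla))
```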

The Initialization Time is effectively fixed for this SGX environment: it takes around 50 seconds regardless of whether the job runs on 1 executor or 3 executors. The Initialization Time is related to the SGX enclave size: the larger the enclave, the longer it takes to initialize. In the near future, SGX Enclave Dynamic Memory Management (EDMM) is expected to greatly reduce the Initialization Time. For real-world Big Data or AI applications with longer execution times, the relative performance impact of the Initialization Time is smaller.
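
The amortization argument can be checked with simple arithmetic. The sketch below uses the roughly 50-second Initialization Time and the 130% Execution Time ratio from this benchmark. The vanilla execution times are hypothetical, and ignoring vanilla Spark's own startup time is a simplifying assumption.

```python
INIT_SECONDS = 50.0      # approximate Initialization Time reported above
EXECUTION_RATIO = 1.30   # Execution Time ratio with 1 executor reported above


def end_to_end_ratio(vanilla_execution_seconds: float,
                     init_seconds: float = INIT_SECONDS,
                     execution_ratio: float = EXECUTION_RATIO) -> float:
    """Total Occlum run time divided by vanilla execution time.

    Assumes vanilla Spark has no comparable startup cost (a simplification).
    """
    if vanilla_execution_seconds <= 0:
        raise ValueError("vanilla_execution_seconds must be positive")
    occlum_total = init_seconds + execution_ratio * vanilla_execution_seconds
    return occlum_total / vanilla_execution_seconds


if __name__ == "__main__":
    # Hypothetical job lengths: one minute, ten minutes, one hundred minutes.
    for seconds in (60.0, 600.0, 6000.0):
        print(f"{seconds:>7.0f} s job -> {end_to_end_ratio(seconds):.3f}x vanilla")
```

As the job gets longer, the ratio approaches the Execution Time ratio alone, because the fixed initialization cost is spread over more work.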

Solution Components

These key components have been leveraged to build the end-to-end confidential computing workflow.

Azure Cloud Services:

Azure Data Lake Storage: A secure cloud storage platform that provides scalable, cost-effective storage for big data analytics.
Key Vault: Safeguards cryptographic keys and other secrets used by cloud apps and services. Although this solution works with all Azure Key Vault types, Azure Key Vault Managed HSM (FIPS 140-2 Level 3) is recommended for stronger protection.
Microsoft Azure Attestation Service: A unified solution for remotely verifying the trustworthiness of a platform and the integrity of the binaries running inside it.

Intel® SGX

Intel® SGX helps protect data in use via application isolation technology. By protecting selected code and data from modification, developers can partition their applications into hardened enclaves or trusted execution modules to help increase application security.

Occlum

Occlum is a memory-safe, multi-process library OS (LibOS) for Intel SGX. As a LibOS, it enables legacy applications to run on Intel® SGX with little to no modifications of source code, thus protecting the confidentiality and integrity of user workloads transparently.

Here is the high-level overview of Occlum.

Occlum also has a unique “Occlum -> init -> application” boot flow. Generally, all operations that are required but are not part of the application, such as remote attestation, can be put into the “init” process. This feature makes Occlum compatible with any remote attestation solution without requiring application changes. For example, to support Azure Attestation, Occlum provides the boot flow below.

This design offloads the remote attestation burden from the application. For more details, please refer to the Occlum MAA init demo and the Occlum GitHub repo.

BigDL PPML

BigDL PPML provides a distributed platform for securing and protecting the end-to-end Big Data AI pipeline, including data ingestion, data analysis, machine learning, and deep learning. In addition, it extends the single-node Trusted Execution Environment (TEE) to a Trusted Cluster Environment and allows unmodified Big Data analysis and ML/DL programs to run securely on a private or public cloud. The diagram and tasks below show the work behind BigDL PPML:

Computation and memory protected by Intel® SGX Enclaves
Network communication protected by remote attestation and Transport Layer Security (TLS)
Storage (e.g., data and model) protected by encryption
Optional Federated Learning support