NVIDIA GPU Acceleration for Apache Spark™ in Azure Synapse Analytics

May 25, 2021

Azure recently announced support for NVIDIA’s T4 Tensor Core Graphics Processing Units (GPUs), which are well suited to deploying machine learning inferencing or analytical workloads cost-effectively. With Apache Spark™ deployments tuned for NVIDIA GPUs, plus pre-installed libraries, Azure Synapse Analytics offers a simple way to use GPUs to power a variety of data processing and machine learning tasks. With built-in support for NVIDIA’s RAPIDS acceleration, the Azure Synapse version of GPU-accelerated Spark offers gains of up to 2x on standard analytical benchmarks compared to running on CPUs, all without any code changes. Additionally, for machine learning workloads, Azure Synapse offers Microsoft's Hummingbird out of the box, which can use these GPUs to significantly accelerate traditional ML workloads.

Beginning today, this GPU acceleration feature in Azure Synapse is available for private preview by request.

The benefits of GPU Acceleration

GPUs offer an extraordinarily low price-to-performance ratio and high compute performance by accelerating multi-core servers for parallel processing. While a CPU consists of a few cores optimized for sequential processing, a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed to handle multiple tasks simultaneously. Considering that data scientists spend up to 80% of their time on data pre-processing, GPUs are an asset in data processing pipelines compared to pipelines that rely on CPUs alone.

The benefits of GPU acceleration in Apache Spark™ include:

Data processing, queries, and model training complete faster, shortening time to insight.
The same GPU-accelerated infrastructure can be used for both Spark and ML/DL frameworks, eliminating the need for complex decision making and tuning.
Fewer compute nodes are required, reducing infrastructure cost and potentially helping avoid scale-related problems.

Collaboration with NVIDIA

NVIDIA and Azure Synapse have teamed up to bring GPU acceleration to data scientists and data engineers. This collaboration is primarily focused on integrating the RAPIDS Accelerator for Apache Spark™ into Azure Synapse. The integration will allow customers to use NVIDIA GPUs for Apache Spark™ applications with no code changes and with an experience identical to a CPU cluster. In addition, the collaboration will continue to add support for the latest NVIDIA GPUs and networking products and provide continuous enhancements for big data customers who are looking to improve productivity and save costs with a single pipeline for data engineering, data preparation, and machine learning.

When asked about the collaboration and the importance of having GPUs in Azure Synapse, Scott McClellan, Senior Director, Data Science at NVIDIA said, “The synergy between Azure Synapse and NVIDIA is critical to democratize AI for citizen data scientists on Azure as businesses look to gain competitive advantage with advanced analytics, artificial intelligence (AI), and machine learning (ML). Azure Synapse is transforming siloed enterprise analytics into an integrated platform to accelerate time to insights across data warehouses and big data systems. The on-going collaboration will seamlessly integrate RAPIDS Accelerator for Apache Spark, accelerate the Azure Synapse platform, and fast track new feature development for Accelerated Data Engineering and Data Science applications.”

To learn more about this collaboration, check out our presentation at NVIDIA’s GTC 2021 Conference.

Apache Spark™ 3.0 GPU Acceleration in Azure Synapse

While Apache Spark™ provides GPU support out of the box, configuring all the required hardware and installing all the low-level libraries can take significant effort. When you use GPU-enabled Apache Spark™ pools in Azure Synapse, you will immediately notice a surprisingly simple user experience:

Behind-the-scenes heavy lifting: Running GPU libraries requires low-level libraries such as NVIDIA CUDA to communicate with the graphics card on the host machine. Downloading and installing these libraries takes both time and effort. Through integration with Azure, Azure Synapse pre-installs these libraries and sets up the complex networking among compute nodes, offering you GPU Apache Spark™ pools within just a few minutes, so you can stop worrying about setup and focus on solving your business problems.

Optimized Spark configuration: By collaborating with NVIDIA, we have arrived at tuned configurations for your GPU-enabled Apache Spark™ pools so that your workloads run efficiently, saving you both time and operational costs.

Packed with Data Prep and ML Libraries: The GPU-enabled Apache Spark™ pools in Azure Synapse come with two popular libraries built in, with support for more on the way:

RAPIDS for Data Prep: RAPIDS is a suite of open-source software libraries and APIs for executing end-to-end data science and analytics pipelines entirely on GPUs, allowing for a substantial speedup, particularly on large data sets. Built on top of NVIDIA CUDA and UCX, the RAPIDS Accelerator for Apache Spark™ enables GPU-accelerated SQL and DataFrame operations and Spark shuffles. Since no code changes are required to use these accelerations, you can also accelerate data pipelines that rely on Linux Foundation's Delta Lake or Microsoft's Hyperspace indexing (both of which are available on Synapse out of the box).
Hummingbird for Scoring and Inference: Hummingbird is a library for converting traditional ML operators to tensors, with the goal of accelerating inference (scoring/prediction) for traditional machine learning models.
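
To make "converting traditional ML operators to tensors" concrete, here is a toy sketch in plain Python. It rewrites one small, hand-written decision tree as three matrix products plus two element-wise comparisons, loosely following the GEMM-style approach described in the Hummingbird paper. The tree, its thresholds, the matrix names A through E, and the helper functions are all made up for illustration; this is not Hummingbird's code or API, and a real conversion would run on a tensor runtime rather than on Python lists.

```python
# Toy tree (hypothetical, for illustration only):
#   if x[0] < 0.5:
#       if x[1] < 2.0: leaf 0 -> 10.0
#       else:          leaf 1 -> 20.0
#   else:              leaf 2 -> 30.0

# A[f][n] = 1 if internal node n tests feature f.
A = [[1, 0],
     [0, 1]]
# B[n] = threshold of internal node n.
B = [0.5, 2.0]
# C[n][l] = +1 if leaf l lies in the left subtree of node n,
#           -1 if it lies in the right subtree, 0 if it is not below n.
C = [[1, 1, -1],
     [1, -1, 0]]
# D[l] = number of ancestors of leaf l whose left subtree contains l.
D = [2, 1, 0]
# E[l] = output value stored in leaf l (one output column).
E = [[10.0], [20.0], [30.0]]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def predict_tensor(X):
    """Evaluate the tree for a whole batch using only matrix ops."""
    t = matmul(X, A)                       # pick the feature each node tests
    t = [[1 if v < B[j] else 0 for j, v in enumerate(row)] for row in t]
    t = matmul(t, C)                       # score each leaf against its path
    t = [[1 if v == D[j] else 0 for j, v in enumerate(row)] for row in t]
    return [row[0] for row in matmul(t, E)]  # one-hot leaf -> leaf value


def predict_branching(x):
    """The same tree written as ordinary if/else, for comparison."""
    if x[0] < 0.5:
        return 10.0 if x[1] < 2.0 else 20.0
    return 30.0


if __name__ == "__main__":
    batch = [[0.2, 1.0], [0.2, 3.0], [0.9, 1.0], [0.9, 3.0]]
    print(predict_tensor(batch))
    print([predict_branching(x) for x in batch])
```

Because the tensor version contains no data-dependent branching, every row in a batch goes through the same arithmetic, which is the shape of work GPUs handle well. In practice the library performs this kind of rewrite for you; its documentation describes a `convert` function in `hummingbird.ml` that takes a trained model and a target backend such as PyTorch.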

When running NVIDIA Decision Support (NDS) test queries, derived from industry-known benchmarks, over 1 TB of Parquet data, our early results indicate that GPUs can deliver up to 2x acceleration in overall query performance without any code changes.
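
The "without any code changes" property follows from how the RAPIDS Accelerator is enabled: it is loaded as a Spark plugin through configuration, so the SQL or DataFrame code stays as it is. The sketch below shows what that configuration might look like on a self-managed Spark 3 cluster. The property names come from the Spark and RAPIDS Accelerator documentation; the values, the helper functions, and the script name `my_job.py` are illustrative assumptions, and the Synapse GPU pools described above are meant to arrive pre-configured, so none of this should be needed there.

```python
# Illustrative values only (assumptions); tune them for a real cluster.
# The RAPIDS Accelerator jar must also be on the classpath; omitted here.
RAPIDS_CONF = {
    # Load the RAPIDS Accelerator as a Spark plugin.
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    # Master switch for GPU execution of SQL and DataFrame operations.
    "spark.rapids.sql.enabled": "true",
    # One GPU per executor, shared by up to four concurrent tasks.
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "0.25",
    # Log which operations could not be placed on the GPU.
    "spark.rapids.sql.explain": "NOT_ON_GPU",
}


def to_submit_args(conf):
    """Render a config dict as spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def with_gpu(enabled, conf=RAPIDS_CONF):
    """Return a copy of conf with GPU SQL execution switched on or off."""
    toggled = dict(conf)
    toggled["spark.rapids.sql.enabled"] = "true" if enabled else "false"
    return toggled


if __name__ == "__main__":
    # my_job.py is a hypothetical, unmodified Spark application.
    command = ["spark-submit"] + to_submit_args(RAPIDS_CONF) + ["my_job.py"]
    print(" ".join(command))
```

The same key/value pairs can be passed to `SparkSession.builder.config(key, value)` in a notebook. Flipping `spark.rapids.sql.enabled` to `false` is a convenient way to compare a CPU run and a GPU run of an unchanged query.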