The ability to run Spark on a GPU-enabled cluster represents a unique convergence of big data and high-performance computing (HPC) technologies.
Many institutions now want to pair high-performance hardware such as GPUs with big data engines such as Spark, allowing academics, data scientists, and data engineering students to pursue scenarios that would otherwise be difficult to achieve without the power of the cloud.
More importantly, with on-demand cloud resources and opex cost models, this can now be done with minimal capital expenditure.
Microsoft has made significant investments in NVIDIA's latest GPU SKUs, and we now support running Spark on a GPU-enabled cluster using the Azure Distributed Data Engineering Toolkit (AZTK).
With a single command, AZTK lets you provision on-demand GPU-enabled Spark clusters on top of Azure Batch's infrastructure, helping you take high-performance implementations that are usually single-node only and distribute them across your Spark cluster.
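The workflow looks roughly like the following. This is a minimal sketch: the cluster id, node count, and VM size are illustrative, and exact flags may differ by AZTK version, so check the AZTK docs for the supported GPU SKUs.

```sh
# Install the toolkit and generate a configuration directory
pip install aztk
aztk spark init

# After filling in .aztk/secrets.yaml with your Azure credentials,
# provision a GPU-enabled Spark cluster in a single command.
# An N-series VM size (e.g. standard_nc6) gives each node an NVIDIA GPU.
aztk spark cluster create --id gpu-cluster --size 2 --vm-size standard_nc6

# Submit a PySpark application to the cluster
aztk spark cluster submit --id gpu-cluster --name my-job ./my_app.py
```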
The Microsoft team has created GPU-enabled Docker images for AZTK, including a Python image that comes packaged with Anaconda, Jupyter and PySpark, and an R image that comes packaged with Tidyverse, RStudio-Server and SparklyR.
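Once a cluster is up, a routine that would normally be GPU-accelerated on a single machine can be spread across executors with plain PySpark. The sketch below assumes nothing AZTK-specific: `run_on_gpu` is a stand-in for your own GPU code (for example, a kernel launched with Numba or CuPy), not part of the toolkit.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gpu-example").getOrCreate()

def run_on_gpu(partition):
    # Each executor handles its partition on the GPU attached to its node.
    # Replace this loop with a real batched GPU computation.
    for record in partition:
        yield record * 2  # placeholder for the GPU-accelerated transform

# Split the input into partitions so the work lands on multiple nodes.
rdd = spark.sparkContext.parallelize(range(1000000), numSlices=8)
print(rdd.mapPartitions(run_on_gpu).sum())

spark.stop()
```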
Getting Started
- Download and get started with the Azure Distributed Data Engineering Toolkit (AZTK) - https://github.com/Azure/aztk
- Please feel free to submit issues via GitHub
- See Azure Batch, the underlying Azure service used by the Azure Distributed Data Engineering Toolkit
- More on general-purpose HPC on Azure
- A gentle intro to Spark - https://go.databricks.com/gentle-intro-spark
- The Data Engineer's Guide to Spark - https://go.databricks.com/data-engineer-spark-guide - which covers topics such as Spark batch processing and Structured Streaming
- The Data Scientist's Guide to Spark - https://go.databricks.com/data-scientist-spark-guide - which focuses on pre-processing, cleaning, structuring, modelling, and tuning