Educator Developer Blog

Running Spark on a GPU enabled cluster with AZTK

Lee_Stott
Microsoft
Mar 21, 2019
First published on MSDN on Feb 07, 2018

The ability to run Spark on a GPU-enabled cluster brings together big data and high-performance computing (HPC) technologies.

Many institutions now want to pair high-performance hardware such as GPUs with big data engines such as Spark, giving academics, data scientists and data engineering students access to scenarios that would be difficult to achieve without the power of the cloud.

More importantly, with on-demand cloud resources and opex cost models, this can now be done with minimal capital expenditure.

Microsoft has made significant investments in NVIDIA's latest GPU SKUs, and we now support running Spark on a GPU-enabled cluster using the Azure Distributed Data Engineering Toolkit (AZTK).

With a single command, AZTK allows you to provision on-demand GPU-enabled Spark clusters on top of Azure Batch's infrastructure, helping you take high-performance implementations that are usually single-node only and distribute them across your Spark cluster.
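To make the "single command" idea concrete, here is a minimal Python sketch that assembles an `aztk spark cluster create` invocation. The helper name `build_cluster_create_command`, the cluster id and the Docker image name are hypothetical placeholders. The flags shown (`--id`, `--size`, `--vm-size`, `--docker-repo`) are written from memory of the AZTK CLI, so treat them as assumptions and check them against `aztk spark cluster create --help` for your installed version. `standard_nc6` is used as an example of an Azure GPU VM size.

```python
import shlex


def build_cluster_create_command(cluster_id, node_count, vm_size, docker_repo=None):
    """Assemble an `aztk spark cluster create` call as an argument list.

    Hypothetical helper: the flag names are assumed from the AZTK CLI and
    should be verified against your installed version before use.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")

    args = [
        "aztk", "spark", "cluster", "create",
        "--id", cluster_id,
        "--size", str(node_count),
        "--vm-size", vm_size,
    ]
    # Optionally point the cluster at a GPU-enabled Docker image.
    if docker_repo is not None:
        args += ["--docker-repo", docker_repo]
    return args


# Placeholder values; replace the image name with a real GPU-enabled image.
command = build_cluster_create_command(
    "gpu-demo", 4, "standard_nc6", docker_repo="example/gpu-image:tag"
)
print(shlex.join(command))
```

Building the command as a list rather than a single string means it can be passed directly to `subprocess.run(command, check=True)` without shell quoting issues, once the values have been checked against a real AZTK setup.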

The Microsoft team has created GPU-enabled Docker images for AZTK, including a Python image that comes packaged with Anaconda, Jupyter and PySpark, and an R image that comes packaged with Tidyverse, RStudio-Server and SparklyR.

Getting Started

Additional resources
Updated Mar 21, 2019
Version 2.0