The ability to run Spark on a GPU-enabled cluster represents a unique convergence of big data and high-performance computing (HPC) technologies.
Many institutions now want to pair high-performance hardware such as GPUs with big data engines such as Spark, allowing academics, data scientists, and data engineering students to pursue scenarios that would otherwise be difficult to achieve without the power of the cloud.
More importantly, with on-demand cloud resources and opex cost models, this can now be done with minimal capital expenditure.
Microsoft has made significant investments in NVIDIA's latest GPU SKUs, and we now support running Spark on a GPU-enabled cluster using the Azure Distributed Data Engineering Toolkit (AZTK).
With a single command, AZTK lets you provision on-demand GPU-enabled Spark clusters on top of Azure Batch's infrastructure, helping you take high-performance implementations that are usually single-node only and distribute them across your Spark cluster.
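The workflow looks roughly like the following. This is a minimal sketch: the cluster id, node count, and VM size are illustrative, and exact flags may differ by AZTK version, so check the AZTK docs for the supported GPU SKUs.

```sh
# Install the toolkit and generate a configuration directory
pip install aztk
aztk spark init

# After filling in .aztk/secrets.yaml with your Azure credentials,
# provision a GPU-enabled Spark cluster in a single command.
# An N-series VM size (e.g. standard_nc6) gives each node an NVIDIA GPU.
aztk spark cluster create --id gpu-cluster --size 2 --vm-size standard_nc6

# Submit a PySpark application to the cluster
aztk spark cluster submit --id gpu-cluster --name my-job ./my_app.py
```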
The Microsoft team has created GPU-enabled Docker images for AZTK, including a Python image that comes packaged with Anaconda, Jupyter and PySpark, and an R image that comes packaged with Tidyverse, RStudio-Server and SparklyR.
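Once a cluster is up, a routine that would normally be GPU-accelerated on a single machine can be spread across executors with plain PySpark. The sketch below assumes nothing AZTK-specific: `run_on_gpu` is a stand-in for your own GPU code (for example, a kernel launched with Numba or CuPy), not part of the toolkit.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gpu-example").getOrCreate()

def run_on_gpu(partition):
    # Each executor handles its partition on the GPU attached to its node.
    # Replace this loop with a real batched GPU computation.
    for record in partition:
        yield record * 2  # placeholder for the GPU-accelerated transform

# Split the input into partitions so the work lands on multiple nodes.
rdd = spark.sparkContext.parallelize(range(1000000), numSlices=8)
print(rdd.mapPartitions(run_on_gpu).sum())

spark.stop()
```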
Getting Started
- Download and get started with the Azure Distributed Data Engineering Toolkit (AZTK) - https://github.com/Azure/aztk
- Please feel free to submit issues via GitHub
- See Azure Batch, the underlying Azure service used by the Azure Distributed Data Engineering Toolkit
- More on general-purpose HPC on Azure
- A gentle intro to Spark - https://go.databricks.com/gentle-intro-spark
- The Data Engineer's Guide to Spark - https://go.databricks.com/data-engineer-spark-guide - which covers topics such as Spark batch processing and Structured Streaming
- The Data Scientist's Guide to Spark - https://go.databricks.com/data-scientist-spark-guide - which focuses on pre-processing, cleaning, structuring, modelling, and tuning