Provision on-demand Spark clusters on Docker using Azure Batch's infrastructure

Community Manager

Since its release three years ago, Apache Spark has soared in popularity among big data users and is increasingly common in the HPC space as well. However, spinning up a Spark cluster on demand can be complicated and slow, so Spark developers often share pre-existing clusters managed by their company's IT team. These static clusters are in constant flux between under-utilization and insufficient capacity: you're either out of capacity, or you're burning dollars on idle nodes.


I'm excited to announce the beta release of the Azure Distributed Data Engineering Toolkit - an open-source Python CLI tool that allows you to provision on-demand Spark clusters and submit Spark jobs directly from the command line.
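As a rough sketch of the workflow, the commands below follow the pattern in the toolkit's GitHub README; the package name, install method, and exact flags are assumptions here and may differ in the current release, so check the project documentation:

```shell
# Install the toolkit (assumed to be available as the "aztk" package;
# earlier builds were installed by cloning the GitHub repo instead)
pip install aztk

# Initialize a local configuration directory, then fill in your
# Azure Batch and Azure Storage credentials in .aztk/secrets.yaml
aztk spark init

# Provision an on-demand 5-node Spark cluster on Azure Batch
aztk spark cluster create --id my-spark-cluster --size 5

# Submit a Spark application (here, Spark's bundled Pi example) to the cluster
aztk spark cluster submit --id my-spark-cluster --name pi-job \
    ./examples/src/main/python/pi.py 100

# Delete the cluster when the job is done, so you stop paying for idle nodes
aztk spark cluster delete --id my-spark-cluster
```

Because the cluster exists only for the lifetime of your work, you sidestep the static-cluster trade-off described above: capacity is sized per job and released when the job finishes.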


Read more about it on the Azure blog.
