Single Node Data Exploration and ML on Azure Databricks

Oct 19, 2020

Data scientists and other users working with smaller data sets in Azure Databricks often explore data and build machine learning (ML) models using single-machine Python and R libraries. This exploration and modeling doesn’t always require the distributed computing power of the Delta Engine and Apache Spark offered in Azure Databricks. Doing this type of work on a traditional multi-node cluster often leaves the worker machines underutilized, which results in unnecessary cost.

Single Node clusters are a new cluster mode that lets users work with their favorite libraries like Pandas, Scikit-learn, PyTorch, etc. without paying for the unused compute of a traditional multi-node cluster. Single Node clusters also support running Spark operations if needed: the single node hosts both the driver and the executors, which are spread across the available cores on the node. This makes it possible to load and save data using the efficient Spark APIs (with security features such as User Credential Passthrough) and to do efficient exploration and ML using the most popular single-machine libraries.
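To make the cluster mode concrete, here is a minimal sketch of what a Single Node cluster definition could look like as a request body for the Databricks Clusters API. The `spark_conf` and `custom_tags` entries follow the Single Node settings as we understand them from the Clusters API documentation. The helper name `single_node_cluster_spec`, the cluster name, the runtime version string and the VM size are illustrative assumptions, so substitute values that exist in your workspace.

```python
import json


def single_node_cluster_spec(name, spark_version, node_type_id):
    """Build a request body describing a Single Node cluster.

    Hypothetical helper for illustration; it only assembles a dict and
    does not call any Databricks endpoint.
    """
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # No worker machines: the driver node does all the work.
        "num_workers": 0,
        "spark_conf": {
            # Marks the cluster as Single Node.
            "spark.databricks.cluster.profile": "singleNode",
            # Spark runs locally, using all cores available on the node.
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }


# Placeholder values (assumptions): pick a runtime and VM size
# that are actually offered in your workspace.
spec = single_node_cluster_spec(
    name="single-node-exploration",
    spark_version="7.3.x-cpu-ml-scala2.12",
    node_type_id="Standard_DS3_v2",
)
print(json.dumps(spec, indent=2))
```

The resulting JSON could be submitted through the Clusters API or the Databricks CLI, or you can simply select the Single Node cluster mode in the cluster creation UI and skip the JSON entirely.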

When a data scientist wants to use distributed compute for things like hyperparameter tuning and AutoML, or to work with larger datasets, they can simply switch over to a standard cluster with more nodes.
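As a rough sketch of what that switch amounts to in configuration terms, the snippet below takes a Single Node cluster definition (in the Clusters API request format, as we understand it) and derives a multi-node variant from it. The function name `to_standard_cluster`, the sample cluster name, runtime version and VM size are hypothetical, and in practice you might just create or attach to a separate standard cluster from the UI instead.

```python
import copy


def to_standard_cluster(single_node_spec, num_workers):
    """Return a copy of the spec reconfigured as a standard multi-node cluster.

    Hypothetical helper for illustration; the input dict is left unchanged.
    """
    if num_workers < 1:
        raise ValueError("a standard cluster needs at least one worker")

    spec = copy.deepcopy(single_node_spec)
    spec["num_workers"] = num_workers

    # Drop the settings that pin Spark to the single local node.
    spark_conf = spec.get("spark_conf", {})
    spark_conf.pop("spark.databricks.cluster.profile", None)
    spark_conf.pop("spark.master", None)
    spec.get("custom_tags", {}).pop("ResourceClass", None)
    return spec


# Sample Single Node definition; name, runtime and VM size are placeholders.
single_node = {
    "cluster_name": "single-node-exploration",
    "spark_version": "7.3.x-cpu-ml-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

standard = to_standard_cluster(single_node, num_workers=4)
print(standard["num_workers"], standard["spark_conf"], standard["custom_tags"])
```

Because the notebook code itself does not change, the same exploration and modeling work can move between the two cluster types as the workload grows.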

When the Single Node capability is combined with other capabilities like:

ETL and data transformation with the Delta Engine, Spark and Delta Lake
machine learning lifecycle management and model tracking, deployment and serving with MLflow
a specialized ML runtime with pre-installed and optimized data science and ML libraries like Scikit-learn, TensorFlow, PyTorch, and XGBoost
a collaborative workspace environment supporting notebooks and/or your favorite IDE