Analytics on Azure Blog

Single Node Data Exploration and ML on Azure Databricks

MikeCornell-Databricks
Oct 19, 2020

Data scientists and other users working with smaller data sets in Azure Databricks often explore data and build machine learning (ML) models using single-machine Python and R libraries. This exploration and modeling doesn’t always require the distributed computing power of the Delta Engine and Apache Spark offered in Azure Databricks. Doing this type of work on a traditional multi-node cluster often leaves the worker machines idle or underutilized, which results in unnecessary cost.

Single Node clusters are a new cluster mode that lets users work with their favorite libraries, such as Pandas, Scikit-learn, and PyTorch, without paying for the unused compute of a traditional multi-node cluster. Single Node clusters also support running Spark operations when needed: the single node hosts both the driver and the executors, which are spread across the available cores on the node. This makes it possible to load and save data using the efficient Spark APIs (with security features such as User Credential Passthrough) and to do efficient exploration and ML using the most popular single-machine libraries.
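
As a rough illustration of what such a cluster looks like when created programmatically, the sketch below builds a request body for the Databricks Clusters API (`POST /api/2.0/clusters/create`). It assumes the documented Single Node settings: zero workers, the `singleNode` cluster profile, a `local[*]` Spark master so that executors share the node's cores with the driver, and the `SingleNode` resource tag. The helper name, cluster name, runtime version, and VM size are illustrative placeholders, not values from this post.

```python
import json


def single_node_cluster_spec(name, spark_version, node_type_id):
    """Build a Clusters API request body for a Single Node cluster.

    Hypothetical helper for illustration; only the dictionary it returns
    would be sent to the API.
    """
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # No worker machines: the driver node does all the work.
        "num_workers": 0,
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            # Run Spark locally, using all available cores on the node.
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }


# Placeholder values; substitute a runtime and VM size available in your workspace.
spec = single_node_cluster_spec(
    "single-node-exploration", "7.3.x-scala2.12", "Standard_DS3_v2"
)
print(json.dumps(spec, indent=2))
```

The same mode can be selected in the cluster creation UI; the JSON form is mainly useful for automation.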

When a data scientist wants to use distributed compute for tasks such as hyperparameter tuning and AutoML, or to work with larger datasets, they can simply switch over to a standard cluster with more nodes.
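
One way to picture that switch, for teams that manage clusters through the Clusters API, is to derive a standard multi-node specification from a single-node one. The sketch below is a minimal example under assumptions: it treats the `singleNode` profile, the `local[*]` master, and the `SingleNode` resource tag as the only single-node-specific settings, and the function name `to_standard_cluster` and all sample values are hypothetical.

```python
import copy

# Spark settings assumed to be specific to Single Node mode.
SINGLE_NODE_CONF_KEYS = ("spark.databricks.cluster.profile", "spark.master")


def to_standard_cluster(spec, num_workers):
    """Return a copy of `spec` reshaped as a standard multi-node cluster."""
    if num_workers < 1:
        raise ValueError("a standard cluster needs at least one worker")
    standard = copy.deepcopy(spec)  # leave the caller's spec untouched
    standard["num_workers"] = num_workers
    conf = standard.get("spark_conf", {})
    for key in SINGLE_NODE_CONF_KEYS:
        conf.pop(key, None)
    tags = standard.get("custom_tags", {})
    if tags.get("ResourceClass") == "SingleNode":
        del tags["ResourceClass"]
    return standard


# Illustrative single-node spec; the version and VM size are placeholders.
single = {
    "cluster_name": "exploration",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
multi = to_standard_cluster(single, num_workers=4)
print(multi["num_workers"], multi["spark_conf"], multi["custom_tags"])
```

In practice, attaching the notebook to an existing standard cluster achieves the same result without any API calls.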

When the Single Node capability is combined with its other capabilities, Azure Databricks provides a truly unified experience intended to make data scientists and other analysts more efficient and effective.

Updated Oct 16, 2020
Version 1.0

1 Comment

  • rgupta129
    Copper Contributor

    Love this new flexibility with significant cost savings. I believe that, with this feature, there won't be a need to install a copy of Spark on one's personal computer. With Spark's 3-month release cycle, it is really hard to keep up with new versions and explore the changes/improvements. This new cost-effective solution is certainly well thought through.