Big data preparation in Azure Machine Learning – powered by Azure Synapse Analytics
Published Apr 20 2021 09:00 AM

Many customers who embark on a machine learning journey deal with big data and need the power of distributed data processing engines to prepare their data for ML. By offering Apache Spark® (powered by Azure Synapse Analytics) in Azure Machine Learning (Azure ML), we are empowering customers to work on their end-to-end ML lifecycle, including large-scale data preparation, featurization, model training, and deployment, within the Azure ML workspace, without needing to switch between multiple tools for data preparation and model training. Building the full ML lifecycle within Azure ML reduces the time customers need to iterate on a machine learning project, which typically includes multiple rounds of data preparation and training.


With the preview of managed Apache Spark in Azure ML, customers can use Azure ML notebooks to connect to Spark pools in Azure Synapse Analytics and do interactive data preparation using PySpark. Customers have the option to configure Spark sessions to quickly experiment and iterate on the data. Once ready, they can leverage Azure ML pipelines to automate their end-to-end ML workflow, from data preparation to model deployment, all in one environment, while maintaining their data and model lineage. Customers who prefer to train in the Spark environment can install relevant libraries such as Spark MLlib and MMLSpark to complete their training on Spark pools.


Customers in preview will be able to benefit from the following key capabilities:

Reuse Spark pools from Azure Synapse workspace in Azure ML

Customers can leverage existing Spark pools from Azure Synapse Analytics (Azure Synapse) in Azure ML by linking their Azure ML and Azure Synapse workspaces via the Azure ML Studio, the Python SDK, or an ARM template. Customers only need to follow the widget in the UI or write a few lines of code, as described in the documentation.


Once the workspaces are linked, customers can attach existing Spark pools to the Azure ML workspace and register the supported linked services (data store sources).
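
As a rough illustration of the "few lines of code" path with the Python SDK, the sketch below links a Synapse workspace and then attaches one of its Spark pools. It assumes the `azureml-core` package and an already loaded `Workspace` object `ws`. The function name, the default names `synapse-link` and `synapse-spark`, and the assumption that both workspaces share a subscription and resource group are ours for illustration only; please check the documentation for the options that apply to your setup.

```python
def link_and_attach_spark_pool(ws, synapse_workspace_name, spark_pool_name,
                               link_name="synapse-link",
                               compute_name="synapse-spark"):
    """Link a Synapse workspace to the Azure ML workspace `ws`, then attach a Spark pool."""
    # Imported lazily so this sketch can be loaded without azureml-core installed.
    from azureml.core import LinkedService, SynapseWorkspaceLinkedServiceConfiguration
    from azureml.core.compute import ComputeTarget, SynapseCompute

    # Assumption: the Synapse workspace lives in the same subscription and
    # resource group as the Azure ML workspace.
    link_config = SynapseWorkspaceLinkedServiceConfiguration(
        subscription_id=ws.subscription_id,
        resource_group=ws.resource_group,
        name=synapse_workspace_name,
    )
    linked_service = LinkedService.register(
        workspace=ws, name=link_name, linked_service_config=link_config
    )

    # Attach an existing Spark pool from the linked Synapse workspace.
    attach_config = SynapseCompute.attach_configuration(
        linked_service, type="SynapseSpark", pool_name=spark_pool_name
    )
    compute = ComputeTarget.attach(
        workspace=ws, name=compute_name, attach_configuration=attach_config
    )
    compute.wait_for_completion()
    return compute
```

The name passed as `compute_name` is the alias under which the pool appears in Azure ML, so it is the name later steps would refer to.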




Perform interactive data preparation via Spark magic from Azure ML notebooks

Customers can use Azure ML notebooks to start PySpark sessions on attached Spark pools via Spark magic. Customers can register Azure ML datasets to load data from their storage of choice. For data in Azure Data Lake Storage Gen1 and Gen2, customers can use their own identities to authenticate data access through Azure ML datasets. The attached Spark pools can also be used as usual in Azure ML experiments, pipelines, and the designer. More information on using Spark magic for data preparation in Azure ML notebooks is available in the documentation.
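
To give a feel for what an interactive session might contain, here is a small sketch of a cleaning function you could define and call in a notebook cell running on the attached pool. The column names `Age` and `Survived` are hypothetical, and the magic commands appear only as comments because they are notebook syntax rather than plain Python; their exact form is our recollection of the preview documentation and should be checked there.

```python
# In an Azure ML notebook, a session would first be started on the attached
# pool with the Spark magic, roughly (assumed form, see the documentation):
#   %synapse start -c <attached Spark pool name>
# and each cell meant to run on the pool would begin with:
#   %%synapse

def prepare(df):
    """Basic cleaning of a Spark DataFrame using only DataFrame methods.

    The column names below are hypothetical placeholders.
    """
    return (
        df.dropDuplicates()            # remove exact duplicate rows
          .dropna(subset=["Age"])      # drop rows with a missing Age
          .filter("Age >= 0")          # keep only plausible ages (SQL expression)
          .select("Age", "Survived")   # keep the columns needed downstream
    )
```

Because the function only chains standard DataFrame methods, the same code could later be moved unchanged into a script that runs as a pipeline step.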




Productionize via Azure ML pipelines to orchestrate E2E ML steps including data preparation

After completing the interactive data preparation, customers can leverage Azure ML pipelines to automate data preparation on the Apache Spark runtime as a step in the overall machine learning workflow. Customers can use the SynapseSparkStep for data preparation and choose either a TabularDataset or a FileDataset as input. Customers can also set up an HDFSOutputDatasetConfig to generate the SynapseSparkStep output as a FileDataset, to be consumed by the following Azure ML pipeline step. More details are available in the documentation on how to use Apache Spark (powered by Azure Synapse) in your machine learning pipeline.
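
The pipeline step described above could be wired up roughly as follows. This is a minimal sketch assuming the `azureml-core` and `azureml-pipeline-steps` packages. The script name `dataprep.py`, the `./code` folder, the output path `prepared`, the registered dataset name, and the memory and core settings are illustrative placeholders rather than recommended values.

```python
def build_dataprep_step(tabular_ds, datastore, compute_name,
                        script="dataprep.py", source_dir="./code"):
    """Build a SynapseSparkStep that reads a TabularDataset and writes a FileDataset."""
    # Imported lazily so this sketch can be loaded without the Azure ML SDK installed.
    from azureml.data import HDFSOutputDatasetConfig
    from azureml.pipeline.steps import SynapseSparkStep

    # For a FileDataset input, the named input would additionally call .as_hdfs().
    prep_input = tabular_ds.as_named_input("tabular_input")

    # The step output is written to the datastore and registered as a dataset
    # once the step completes (names here are hypothetical).
    prep_output = HDFSOutputDatasetConfig(
        destination=(datastore, "prepared")
    ).register_on_complete(name="prepared_data")

    step = SynapseSparkStep(
        name="synapse-dataprep",
        file=script,
        source_directory=source_dir,
        inputs=[prep_input],
        outputs=[prep_output],
        arguments=["--tabular_input", prep_input, "--output_dir", prep_output],
        compute_target=compute_name,  # alias of the attached Spark pool
        driver_memory="7g",
        driver_cores=4,
        executor_memory="7g",
        executor_cores=2,
        num_executors=1,
    )
    return step, prep_output
```

The returned step could then be placed in a `Pipeline` from `azureml.pipeline.core` together with a training step, and the returned output passed to that step as its input.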


Get started with big data preparation in Azure ML via Apache Spark powered by Azure Synapse

Get started by visiting our documentation and let us know your thoughts. We are committed to making the data preparation experience in Azure ML better for you!

Learn more about the Azure Machine Learning service and get started with a free trial.

Last update: Apr 19 2021 10:58 PM