Many customers who embark on a machine learning journey deal with big data, and need the power of distributed data processing engines to prepare their data for ML. By offering Apache Spark® (powered by Azure Synapse Analytics) in Azure Machine Learning (Azure ML), we are empowering customers to work on their end-to-end ML lifecycle including large-scale data preparation, featurization, model training, and deployment within Azure ML workspace without the need to switching between multiple tools for data preparation and model training. The ability to build the full ML lifecycle within Azure ML will reduce the time required for customers to iterate on a machine learning project which typically includes multiple rounds of data preparation and training.
With the preview of managed Apache Spark in Azure ML, customers can use Azure ML notebooks to connect to Spark pools in Azure Synapse Analytics, to do interactive data preparation using PySpark. Customers have the option to configure Spark sessions to quickly experiment and iterate on the data. Once ready, they can leverage Azure ML pipelines to automate their end-to-end ML workflow from data preparation to model deployment all in one environment, while maintaining their data and model lineage. Customers who prefer to train in the Spark environment can choose to install relevant libraries such as Spark MLlib, MMLSpark, etc. to complete their training on Spark pools.
Customers in preview will be able to benefit from the following key capabilities:
Reuse Spark pools from Azure Synapse workspace in Azure ML
Customers can leverage existing Spark pools from Azure Synapse Analytics (Azure Synapse) in Azure ML by just linking their Azure ML and Synapse workspaces via the Azure ML Studio, the Python SDK, or the ARM template. Customers just need to follow the widget in UI or leverage a few lines of code as described in the documentation here.
Perform interactive data preparation via Spark magic from Azure ML notebooks
Customers can use Azure ML notebooks to start Spark sessions in PySpark via Spark Magic on attached Spark pools. Customers can register Azure ML datasets to load data from storage of choice. For data in Gen1 and Gen2, customers can use their own identities to authenticate access to data by leveraging AML datasets. The attached Spark pools can be used normally in Azure ML experiments, pipelines, and designer. More information on leveraging Spark Magic for data preparation on AML notebooks here
Productionize via Azure ML pipelines to orchestrate E2E ML steps including data preparation
After completing the interactive data preparation, customers can leverage Azure ML pipelines to automate data preparation on Apache Spark runtime as a step in the overall machine learning workflow. Customers can use the SynapseSparkStep for data preparation and choose either TabularDataset or FileDataset as input. Customers can also set up HDFSOutputDatasetConfig to generate the sparkstep output as a FileDataset, to be consumed by the following AzureML pipeline step. More details on How to use Apache Spark (powered by Azure Synapse) in your machine learning pipeline here.
Get started with big data preparation in Azure ML via Apache Spark powered by Azure Synapse
Get started by visiting our documentation and let us know your thoughts. We are committed to making the data preparation experience in Azure ML better for you!