Best way to orchestrate PySpark scripts?

Hey there,

I have 12 PySpark scripts that run in sequence to create a data pipeline for data prep and modeling.

What is the best tool for running the end-to-end pipeline multiple times and passing parameters to each run?

Azure Jobs? Data Factory? Azure Pipelines?

My experience has been with AWS, where I would use Glue jobs and Airflow for this. Any ideas on the best way to create what is essentially a DAG to manage these PySpark scripts on Azure?

Thank you,

B

1 Reply
Good evening B,

I do not have much AWS experience; however, on Azure you have a number of options for orchestrating PySpark scripts in a pipeline.

Azure Data Factory: This is a cloud-based data integration service that lets you create, schedule, and orchestrate data workflows. Data Factory does not run PySpark itself, but it can execute your scripts through activities such as the HDInsight Spark activity or the Databricks Notebook activity, chain them in sequence (or as a DAG via activity dependencies), and pass pipeline parameters down to each step. Its integration with Azure Databricks also gives you a fully managed PySpark environment.
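
If you go the Data Factory route, a single end-to-end run can also be triggered programmatically with parameters. Here is a minimal sketch using the Python management SDK, assuming the factory and pipeline already exist; the subscription id, resource group, factory, and pipeline names are placeholders:

```python
# Trigger one parameterized run of an existing Data Factory pipeline.
# All resource names below are placeholders for your own environment.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Kick off one end-to-end run; the parameters flow into the pipeline's activities
run_response = adf_client.pipelines.create_run(
    resource_group_name="rg-data",          # placeholder resource group
    factory_name="my-adf",                  # placeholder factory name
    pipeline_name="pyspark-prep-pipeline",  # placeholder pipeline name
    parameters={"run_date": "2023-04-01", "model_version": "v2"},
)
print(f"Started pipeline run: {run_response.run_id}")
```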

Azure Databricks: This is a fully managed Apache Spark-based analytics platform that provides an integrated environment for PySpark development and execution. Databricks Jobs support multi-task workflows with dependencies between tasks (effectively a DAG), scheduling, and per-run parameters, and the service integrates with other Azure services.
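
Databricks is probably the closest match to your Glue + Airflow pattern: a single multi-task job can express the 12 scripts as a DAG and accept per-run parameters. A hedged sketch using the Jobs REST API 2.1; the workspace URL, token, cluster id, and notebook paths are placeholders:

```python
# Define two of the twelve steps as a multi-task Databricks job (a DAG).
# Workspace URL, token, cluster id, and notebook paths are placeholders.
import requests

workspace = "https://<your-workspace>.azuredatabricks.net"
headers = {"Authorization": "Bearer <personal-access-token>"}

job_spec = {
    "name": "pyspark-data-prep-and-modeling",
    "tasks": [
        {
            "task_key": "prep_step_1",
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {
                "notebook_path": "/Repos/pipeline/01_prep",
                "base_parameters": {"run_date": "2023-04-01"},  # per-run parameter
            },
        },
        {
            "task_key": "prep_step_2",
            "depends_on": [{"task_key": "prep_step_1"}],  # DAG edge
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {"notebook_path": "/Repos/pipeline/02_prep"},
        },
        # ...the remaining ten steps follow the same pattern
    ],
}

resp = requests.post(f"{workspace}/api/2.1/jobs/create", headers=headers, json=job_spec)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```

Inside each notebook, a value passed via base_parameters can be read with dbutils.widgets.get("run_date").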

Azure HDInsight: This is a fully managed cloud service that makes it easy to process big data using popular open-source frameworks, including Spark. HDInsight supports PySpark execution and scheduling, and provides a range of deployment options including cluster and edge node deployments.
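
With HDInsight, each script can be submitted to the Spark cluster through the Livy batch endpoint, which is essentially what an orchestrator does for every step. A minimal sketch; the cluster name, storage path, and credentials are placeholders:

```python
# Submit one PySpark script to an HDInsight Spark cluster via Livy.
# Cluster name, storage path, and credentials are placeholders.
import requests

livy_url = "https://<cluster-name>.azurehdinsight.net/livy/batches"
payload = {
    "file": "wasbs://container@account.blob.core.windows.net/scripts/01_prep.py",
    "args": ["--run-date", "2023-04-01"],  # parameters passed to the script
}

resp = requests.post(
    livy_url,
    json=payload,
    auth=("admin", "<cluster-login-password>"),  # HDInsight cluster login
    headers={"X-Requested-By": "admin"},         # header Livy expects on POSTs
)
resp.raise_for_status()
print("Livy batch id:", resp.json()["id"])
```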

Azure Batch: This is a fully managed service for running large-scale parallel and batch compute jobs in the cloud. It can be used to schedule and run PySpark scripts on a cluster of VMs, and provides a range of customization options for your compute environment.
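
Azure Batch also supports task dependencies, so the 12 steps can be chained as one job. A rough sketch, assuming a pool whose VMs already have Spark/Python and the scripts staged on them; the account, key, endpoint, pool id, and command lines are placeholders:

```python
# Chain PySpark steps as Azure Batch tasks with explicit dependencies.
# Account name, key, endpoint, pool id, and command lines are placeholders.
from azure.batch import BatchServiceClient
from azure.batch import models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("<batch-account>", "<batch-account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://<batch-account>.<region>.batch.azure.com"
)

# The job must opt in to task dependencies to express the DAG
job = batchmodels.JobAddParameter(
    id="pyspark-pipeline-run-001",
    pool_info=batchmodels.PoolInformation(pool_id="<spark-pool>"),
    uses_task_dependencies=True,
)
client.job.add(job)

step1 = batchmodels.TaskAddParameter(
    id="step1",
    command_line="/bin/bash -c 'python 01_prep.py --run-date 2023-04-01'",
)
step2 = batchmodels.TaskAddParameter(
    id="step2",
    command_line="/bin/bash -c 'python 02_prep.py --run-date 2023-04-01'",
    depends_on=batchmodels.TaskDependencies(task_ids=["step1"]),  # runs after step1
)
client.task.add_collection(job_id=job.id, value=[step1, step2])
```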

If you are already using Azure services for other parts of your data pipeline, such as storage and analytics, then Data Factory or Databricks may give you more seamless integration. If you need more control over your compute environment, or need to run your PySpark scripts on your own cluster of VMs, then Azure Batch may be a better option.

Hope that helps,

Luke