Synapse genie visual image @Rajeshdadi , @KranthiMedam, @ravitejabhallamudi @InnovatorsClub
Synapse Genie Framework improves Spark pool utilization by executing multiple Synapse notebooks on the same Spark pool instance. It considers the sequence and dependencies between notebook activities in an ETL pipeline which results in higher usage of a full cluster for resources available in a Spark pool.
Synapse Genie is helpful when you have a set of related notebooks that are part of a Synapse pipeline. In any group of notebooks, the workload varies from small to large. By executing such varied workloads on a single Spark pool instance, you can improve resource utilization and reduce Spark pool costs. It's ideal to execute notebooks through the Genie framework when the workloads are a mix of CPU and IO intensive operations.
During development of a data application in Synapse, we tend to develop multiple notebooks for a pipeline. This is likely due to multiple developers working on tasks independently or due to different transformation logic required to process different datasets.
In this case, the workload per notebook is of varied size. Some notebooks have a small workload and others have large workloads. When these notebooks are invoked from a pipeline, each notebook spins up a separate Spark pool and reserves these nodes till the execution is complete. This results in additional cost, execution time and inefficient usage of pool resources. Even a small notebook with a few lines of code reserves at least two nodes and incurs cost for spin up, execution and deallocation time. Genie utility tries to avoid this by attempting to run all notebooks within a single Spark context.
The Genie framework is a metadata driven utility written in Python. It is implemented using threading (ThreadPoolExecutor module) and directed acyclic graph (Networkx library). It consists of a wrapper notebook, that reads metadata of notebooks and executes them within a single Spark session. Each notebook is invoked on a thread with MSSparkutils.run() command based on the available resources in the Spark pool. The dependencies between notebooks are understood and tracked through a directed acyclic graph.
Here is an example of metadata:
A sample graph screenshot that Genie builds for its execution:
Early observations have shown a reduction of execution time for a pipeline by 30% to 50%. Of course, this will vary per workload. In a few cases, we reduced the size of our Spark pool cluster rather than opting for a reduction in just pipeline execution duration after onboarding to Genie framework. Over time, both choices translate to significant cost savings.
Genie utility consists of a Python Wheel package and a Synapse notebook. Notebook information is stored as metadata for the framework and the wrapper notebook is to be invoked through a Synapse pipeline.
For detailed instructions about onboarding and installation of the framework to your Synapse workspace, check out the onboarding process for Genie framework.
The Genie framework is not a direct release of Azure Synapse Analytics products and will not be officially supported by Synapse. It is a utility developed by the HR Data Insights team within the Microsoft Digital Employee Experience organization. This framework was developed as a way in which notebooks can be run on the same instance of Spark pool and can help higher resource utilization, thus reducing execution time and cost.
This framework is not ideal for notebooks which use parallel threading heavily and run-on exceptionally large datasets (in TB’s). It is better suited when there is a mix of small and large workloads as mentioned in the “When to use Genie framework?” section of this blog.
In the Genie framework, all notebooks run with the same Spark context and Spark session by default. This can cause any temporary views defined in one notebook to be available in another. At times, this may lead to a name collision if two notebooks have different definitions, but one definition is overwritten in the Spark session.
To avoid this, use a naming convention for views that denotes the notebook in which it is run. Alternatively, to isolate the Spark session on a notebook from other notebooks in the Genie framework, you can set the Spark session on the first cell of the notebook in the below manner:
Using the Synapse Genie utility can reduce execution time of your pipeline, thereby reducing the overall costs. One can try and reduce the Spark pool node sizes to verify if the workload can be run on a smaller cluster as all Spark pool resources are available for the master genie notebook.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.