I've been reviewing customer questions centered around "Have you tried using MSSparkUtils to solve the problem?"
One of the questions was how to share results between notebooks. Every time you hit "run" in a notebook, it starts a new Spark session, which means each notebook execution uses a different session, making it impossible to share results between notebook executions. MSSparkUtils offers a solution to handle exactly this scenario.
What is MSSparkUtils?
MSSparkUtils (Microsoft Spark Utilities) is a built-in package that helps you easily perform common tasks. It is like a Swiss Army knife inside the Synapse Spark environment.
Some example scenarios where it can be used are working with file systems, getting environment variables, chaining notebooks together, and working with secrets.
These scenarios are covered in more detail in the doc: Introduction to Microsoft Spark utilities - Azure Synapse Analytics
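If you want a quick look at what each utility offers, you can call the built-in help from a notebook cell; a minimal sketch (the output lists the available methods and their descriptions):
%%pyspark
# Print the methods available for notebook orchestration (run, exit, ...)
mssparkutils.notebook.help()
# The file system utilities follow the same pattern
mssparkutils.fs.help()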
This blog will focus on chained execution and sharing results between notebooks.
You can execute more than one notebook from a root notebook using the run method, and a chained notebook can return a value to the caller with the exit method.
When exit is called, Azure Synapse returns the exit value, completes the pipeline run, and stops the Spark session.
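As a minimal sketch of how run and exit work together (the notebook path /child_notebook is a placeholder, not one of the notebooks used below):
%%pyspark
# In the root notebook: run the referenced notebook and wait up to 300 seconds;
# run() returns whatever value the referenced notebook passes to exit()
exit_value = mssparkutils.notebook.run("/child_notebook", 300)
print(exit_value)
# In the referenced notebook, end execution and hand a value back to the caller:
# mssparkutils.notebook.exit("child finished")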
Follow an example in two steps:
1. First, I created a notebook called Simple_read_ that reads a parquet file into a data frame.
2. Then, I created a second notebook called Sharing_results which executes the first notebook, Simple_read_, created in step 1. Fig. 2 - Sharing_results shows the results of the chained execution:
How do you share a data frame between notebooks?
This is quite a simple enhancement of the notebook-chaining logic. If we put the results from the Simple_read_ notebook into a temporary view and run that notebook from the main notebook, the execution happens in the same session. Therefore, the Sharing_results notebook is able to see the results from the Simple_read_ notebook.
For more information about what running in the same session means, review the docs:
Code example for the notebook Simple_read_ in PySpark:
%%pyspark
# Read the parquet file from storage into a data frame
df = spark.read.load('abfss://parquet@contianername.dfs.core.windows.net/test.parquet', format='parquet')
#display(df.limit(10))
# Register the data frame as a temporary view so other notebooks in the same session can query it
df.createOrReplaceTempView("pysparkdftemptable")
The following Fig. 3 - Enhanced Simple_read_ shows this idea:
Code example for the notebook Sharing_results in PySpark:
from pyspark.sql.functions import col, when
from pyspark.sql import SparkSession
# Run the Simple_read_ notebook in the same session, waiting up to 1000 seconds for it to finish
mssparkutils.notebook.run("/Simple_read_", 1000)
# Query the temporary view created by Simple_read_; it is visible here because both notebooks share the same session
dfread = spark.sql("select * from pysparkdftemptable")
display(dfread.limit(10))
Fig. 4 - Enhanced Sharing_results shows the results for the notebook that was chained and kept in the same session:
The following image shows how the process works:
MSSparkUtils is like a Swiss Army knife inside the Synapse Spark environment. It lets you get more out of Synapse Spark, including sharing the same session between notebooks, which is also useful in other scenarios, for example when you want to reuse parameters between notebooks in the same session.
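As a rough sketch of that parameter idea, run() also accepts a map of parameters as a third argument (the notebook path /notebook_with_params and the input_path parameter below are placeholders, not part of the example above):
%%pyspark
# Pass a parameter map as the third argument of run(); the chained notebook runs in the same session
mssparkutils.notebook.run("/notebook_with_params", 300, {"input_path": "test.parquet"})
# In the chained notebook, declare the parameter in a parameters cell so the passed value is injected:
# input_path = ""   # parameters cell
# print(input_path)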
That's it for this blog. I hope it helps you on your learning journey with Synapse!