MSSparkUtils is the Swiss Army knife inside Synapse Spark
Published Dec 15 2022

I've been reviewing customer questions that often come down to: "Have you tried using MSSparkUtils to solve the problem?"

 

One of the questions asked was how to share results between notebooks. Every time you hit "Run" in a notebook, it starts a new Spark session, which means that each notebook uses a different session. That makes it impossible to share results between notebook executions. MSSparkUtils offers a solution to handle this exact scenario.

 

What is MSSparkUtils?

MSSparkUtils (Microsoft Spark Utilities) is a built-in package that helps you easily perform common tasks. It is like a Swiss Army knife inside the Synapse Spark environment.

 

Some scenarios where it can be used include:

  • Work with file systems
  • Get environment variables
  • Chain notebooks together
  • Share a data frame between notebooks executed in the same session

These scenarios are covered in more detail in the doc: Introduction to Microsoft Spark utilities - Azure Synapse Analytics
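
For a quick taste of the file system and environment scenarios, here is a minimal sketch (the directory path is a placeholder; point it at your own storage):

%%pyspark

# List the files in a directory of the storage attached to the workspace
files = mssparkutils.fs.ls('/')
for file in files:
    print(file.name, file.isDir, file.isFile, file.path, file.size)

# Read an environment detail, for example the current user name
print(mssparkutils.env.getUserName())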

 

This blog will focus on the chained execution and sharing results between notebooks.

 

Chain notebooks together

You can execute more than one notebook from a root notebook using the run or exit methods:

  • run executes another referenced notebook within the same session as the main notebook.
  • exit exits a notebook with a value. When a notebook that calls exit() runs in a Synapse pipeline, Azure Synapse will return the exit value, complete the pipeline run, and stop the Spark session.
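
As a minimal sketch, a referenced notebook can hand a value back to its caller like this (the string is just a placeholder):

%%pyspark

# Exit the current notebook with a value; a caller using
# mssparkutils.notebook.run receives this string as the result
mssparkutils.notebook.exit("finished")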

 

Here is an example in two steps:

  1. A notebook called Simple_read_ simply reads data from test.parquet into a data frame called df, as shown in Fig. 1 - simple_read:


Fig. 1 - simple_read
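
In code, that first cell amounts to roughly the following (the container and storage account names are placeholders for your own):

%%pyspark

# Read test.parquet from ADLS Gen2 into a data frame named df
df = spark.read.load('abfss://parquet@containername.dfs.core.windows.net/test.parquet', format='parquet')
display(df.limit(10))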

 

  2. Next, I created a second notebook called sharing_results, which executes the first notebook, Simple_read_, created in step 1. Fig. 2 - sharing_results shows the results of the chained execution:


Fig. 2 - sharing_results
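
The chained execution itself is a single call; the second argument is a timeout in seconds:

%%pyspark

# Run the Simple_read_ notebook within the current session,
# waiting up to 1000 seconds for it to complete
mssparkutils.notebook.run("/Simple_read_", 1000)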

 

How to share a data frame between notebooks?

That is a quite simple enhancement of the chained-notebook logic. If we write the results from the Simple_read_ notebook into a temporary view and execute that notebook from the main notebook, the execution happens in the same session. Therefore, the sharing_results notebook will be able to see the results from the Simple_read_ notebook.

 

For more information about what running in the same session means, review the docs.

 

Code example for the notebook Simple_read_ in PySpark:

 

 

%%pyspark

# Read test.parquet from ADLS Gen2 into a data frame
df = spark.read.load('abfss://parquet@containername.dfs.core.windows.net/test.parquet', format='parquet')

# display(df.limit(10))

# Register the data frame as a temporary view so that other notebooks
# running in the same session can query it
df.createOrReplaceTempView("pysparkdftemptable")

The following Fig. 3 - Enhanced Simple_read_ shows this idea:


Fig. 3 - Enhanced Simple_read_

 

Code example for the notebook Sharing_results in PySpark:

 

 


%%pyspark

# Run the Simple_read_ notebook in the current session (timeout: 1000 seconds);
# it registers the pysparkdftemptable temporary view
mssparkutils.notebook.run("/Simple_read_", 1000)

# Query the temporary view created by the chained notebook
dfread = spark.sql("select * from pysparkdftemptable")

display(dfread.limit(10))

 

 

 

Fig. 4 - Enhance shows the results for the notebook that was chained and kept in the same session:

 


Fig. 4 - Enhance

 

The following image shows how the process works:

 


Fig. 5 - Flow

 

Summary

MSSparkUtils is like a Swiss Army knife inside the Synapse Spark environment. It allows you to achieve more with Synapse Spark, including sharing the same session between notebooks, which is also useful in other scenarios, for example when you want to reuse parameters between notebooks in the same session.
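
For instance, here is a minimal sketch of handing parameters to a referenced notebook; the parameter name and value are hypothetical, and the third argument of run is a map made available inside that notebook:

%%pyspark

# Pass a parameter map to the referenced notebook; run returns whatever
# value that notebook hands back via mssparkutils.notebook.exit
result = mssparkutils.notebook.run("/Simple_read_", 1000, {"file_name": "test.parquet"})
print(result)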

 

That's it for this blog. I hope it helps you on your learning journey with Synapse!
