MSSparkUtils is the Swiss Army knife inside Synapse Spark
Published Dec 15 2022 08:00 AM
Microsoft

I've been reviewing customer questions where my first response is: "Have you tried using MSSparkUtils to solve the problem?"

 

One of the questions was how to share results between notebooks. Each notebook you run starts its own Spark session, so notebooks run in different sessions and cannot see each other's results. MSSparkUtils offers a solution for exactly this scenario.

 

What is MSSparkUtils?

Microsoft Spark Utilities (MSSparkUtils) is a built-in package that helps you easily perform common tasks. It is like a Swiss Army knife inside the Synapse Spark environment.

 

Some scenarios where it could be used are for example:

  • Work with file systems
  • Get environment variables
  • Chain notebooks together
  • Share a data frame between notebooks

These scenarios are covered in more detail in the doc: Introduction to Microsoft Spark utilities - Azure Synapse Analytics
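
As a rough illustration of the first two scenarios, the sketch below lists a folder and reads a few environment values. It assumes the `mssparkutils` object that Synapse notebooks provide; the helper name `describe_environment` is made up for illustration. The utilities object is passed in as an argument so the helper does not depend on a global name.

```python
def describe_environment(utils, folder):
    """Return the entry names under `folder` plus some basic environment details.

    `utils` is expected to be the mssparkutils object available in a Synapse notebook.
    """
    # fs.ls returns one entry per file or sub-folder; each entry exposes a `name`.
    file_names = [entry.name for entry in utils.fs.ls(folder)]
    return {
        "files": file_names,
        "user": utils.env.getUserName(),
        "workspace": utils.env.getWorkspaceName(),
        "pool": utils.env.getPoolName(),
    }
```

In a Synapse notebook this could be called as `describe_environment(mssparkutils, folder)`, where `folder` is a path in your own storage account.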

 

This blog will focus on the chained execution and sharing results between notebooks.

 

Chain notebooks together

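You can execute more than one notebook from a root notebook using the run and exit methods.

  • Run executes another notebook by reference, on the same session as the main notebook.
  • Exit ends the current notebook and returns a value. In a referenced notebook, exit stops further execution of that notebook and returns the value to the notebook that called run, which then continues with its next cells.

When a notebook that calls exit is orchestrated in a Synapse pipeline, Azure Synapse will return an exit value, complete the pipeline run, and stop the Spark session.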
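
A minimal sketch of how the two methods fit together, assuming the `mssparkutils` object available in Synapse notebooks. The helper name `run_child` and its default timeout are hypothetical. `notebook.run` takes the notebook path, a timeout in seconds, and an optional dictionary of parameters, and it returns the value that the referenced notebook passed to `notebook.exit`.

```python
def run_child(utils, path, timeout_seconds=90, params=None):
    """Run a referenced notebook on the current session and return its exit value.

    `utils` is expected to be the mssparkutils object available in a Synapse notebook.
    """
    if params is None:
        # Two-argument form: notebook path and timeout in seconds.
        return utils.notebook.run(path, timeout_seconds)
    # Three-argument form: the dictionary is passed to the referenced notebook as parameters.
    return utils.notebook.run(path, timeout_seconds, params)
```

In the root notebook this might look like `status = run_child(mssparkutils, "/Simple_read_", 1000)`, while the last cell of the referenced notebook calls `mssparkutils.notebook.exit("done")`. The exit value `"done"` is only an example.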

 

Here is an example in two steps:

  1. A notebook called Simple_read_ reads data from test.parquet into a data frame called df, as shown in Fig. 1:


Fig. 1 - simple_read

 

  2. Next, I created a second notebook called Sharing_results, which executes the first notebook, Simple_read_, created in step 1. Fig. 2 shows the results of the chained execution:


Fig. 2 - sharing_results

 

How to share a data frame execution between notebooks?

This is a simple enhancement of the notebook-chaining logic. If the Simple_read_ notebook registers its results as a temporary view, and we execute that notebook from the main notebook, both run on the same session. Therefore, the Sharing_results notebook can see the results from the Simple_read_ notebook.

 

For more information about what "same session" means, review the docs:

 

Code example for the notebook Simple_read_ in Pyspark:

 

 

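%%pyspark
# Read the Parquet file into a data frame (replace the path with your own storage location).
df = spark.read.load('abfss://parquet@contianername.dfs.core.windows.net/test.parquet', format='parquet')
#display(df.limit(10))
# Register the data frame as a temporary view so that it can be queried later on the same session.
df.createOrReplaceTempView("pysparkdftemptable")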

 

 

 

Fig. 3 shows this idea:


Fig. 3 - Enhanc. Simple_read_

 

Code example for the notebook Sharing_results in Pyspark:

 

 


# Run Simple_read_ on this notebook's session, with a timeout of 1000 seconds.
mssparkutils.notebook.run("/Simple_read_", 1000)

# The temporary view registered by Simple_read_ is visible here because the session is shared.
dfread = spark.sql("select * from pysparkdftemptable")

display(dfread.limit(10))
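
If the referenced notebook finishes without registering the view (for example, because that cell was skipped), the `spark.sql` call raises an error. One possible guard, sketched here with a hypothetical helper named `view_exists`, checks the session catalog first. It relies on `spark.catalog.listTables()`, the PySpark call that lists the tables and temporary views visible in the current session.

```python
def view_exists(spark, view_name):
    """Return True if a table or temporary view named `view_name` is visible in this session."""
    # Compare case-insensitively, since Spark resolves names case-insensitively by default.
    wanted = view_name.lower()
    return any(table.name.lower() == wanted for table in spark.catalog.listTables())
```

In the Sharing_results notebook, the read could then be wrapped as `if view_exists(spark, "pysparkdftemptable"): dfread = spark.sql("select * from pysparkdftemptable")`.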

 

 

 

Fig. 4 shows the results for the notebook that was chained and kept on the same session:

 


Fig 4 - Enhance

 

The following image shows the process of how it works:

 


Fig. 5 - Flow

 

Summary

MSSparkUtils is like a Swiss Army knife inside the Synapse Spark environment. It lets you get more out of Synapse Spark and even share the same session between notebooks. That is also useful in other scenarios, for example when you want to reuse parameters between notebooks on the same session.

 

That's it for this blog. I hope it helps you on your learning journey with Synapse!

11 Comments
Microsoft

Note 1, for clarification: if you want to pass a variable to a referenced notebook, you can use the magic command %run.

doc: How to use Synapse notebooks - Azure Synapse Analytics | Microsoft Learn

Note 2: If you want to execute notebooks in parallel, you can create one notebook as the root and execute the others through MSSparkUtils, keeping the same session as the post proposes. This speeds up the execution because no new cluster has to start for each notebook.

Iron Contributor

@Liliam_C_Leme 
on "Note 2": if you want to process multiple notebooks in parallel within the same session, it is a little bit more complicated.

If you just have a plain for-each loop in Python calling `mssparkutils.notebook.run`, this will run the notebooks sequentially.

For parallel execution you have to make use of threading in Python:

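from concurrent.futures import ThreadPoolExecutor

timeout = 3600  # 3600 seconds = 1 hour

notebooks = [
    {"path": "notebook1", "params": {"param1": "value1"}},
    {"path": "notebook2", "params": {"param2": "value2"}},
    {"path": "notebook3", "params": {"param3": "value3"}},
]

# Submit one notebook run per thread; the executor waits for all of them on exit.
with ThreadPoolExecutor() as ec:
    futures = [
        ec.submit(mssparkutils.notebook.run, notebook["path"], timeout, notebook["params"])
        for notebook in notebooks
    ]

# result() returns each notebook's exit value and re-raises any exception from its run.
exit_values = [future.result() for future in futures]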

 

Microsoft
Copper Contributor

@Liliam_C_Leme currently, if Azure Synapse is connected to Git, the %run command doesn't recognize unpublished notebooks when notebooks are triggered through pipelines in a feature branch. Secondly, the %run command doesn't accept variables as parameters. Is there any workaround? I'm also curious to know whether the Azure Synapse team plans a fix.

Iron Contributor

@Riyaz1979 concerning running unpublished notebooks via %run / mssparkutils.notebook.run within a debug pipeline (from a feature branch):
we also ran into this issue and opened a support case, since in my eyes this is clearly a bug / feature gap. However, after a long discussion Microsoft stated that from their point of view this is a feature request and that they are not going to fix it... We are disappointed.
So basically proper isolated development on a feature branch is currently not possible; you have to publish your changes so you can run a pipeline-test.

Microsoft

thanks @_MartinB and @Riyaz1979 I was off due to the holidays. I will check this, and if it is the case I will also add it as a feature request.

Microsoft

It is in the docs as well (How to use Synapse notebooks - Azure Synapse Analytics | Microsoft Learn). I added it as feedback, as it could be an improvement. The docs state:

  • %run command currently only supports to pass a absolute path or notebook name only as parameter, relative path is not supported.
  • %run command currently only supports to 4 parameter value types: int, float, bool, string, variable replacement operation is not supported.
  • The referenced notebooks are required to be published. You need to publish the notebooks to reference them unless Reference unpublished notebook is enabled. Synapse Studio does not recognize the unpublished notebooks from the Git repo.

Iron Contributor

Hi @Liliam_C_Leme ,
the limitation concerning running unpublished notebooks in a debug pipeline is still missing 

Copper Contributor

Hi @Liliam_C_Leme 

 

It would be great to see support for notebook utils in C#.

 

Implementing a task factory to execute Notebook.Run(); seems like a match made in heaven, what are the plans for C# support?

Microsoft

hi, @mzivtins I will provide this feedback to the PG. thank you!

Microsoft

@mzivtins thank you, please do! I also filed feedback, but feel free to provide yours to them as well.

Last update: Dec 14 2022 09:19 AM