How to set Spark / Pyspark custom configs in Synapse Workspace spark pool
Published Feb 05 2021

In Azure Synapse, the system configuration of a Spark pool looks like the one below, where the number of executors, vCores, and memory are defined by default.

[Screenshot: default Spark pool configuration showing the number of executors, vCores, and memory]

 

Some users may need to change the number of executors or the memory assigned to a Spark session at execution time.

 

Usually, we can reconfigure them by navigating to the Spark pool in the Azure portal and setting the configurations on the Spark pool by uploading a text file that looks like this:

 

[Screenshot: Apache Spark configuration upload for the Spark pool in the Azure portal]

 

[Screenshot: uploaded configuration text file with Spark properties]
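For reference, the uploaded file is a plain text file of Spark properties, typically one "property value" pair per line (the same style as spark-defaults.conf). The properties and values below are only an illustrative sketch, not the exact contents of the file shown above:

spark.executor.instances 4
spark.executor.cores 4
spark.executor.memory 8g
spark.driver.memory 8g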

 

However, in the Synapse Spark pool, a few of these user-defined configurations get overridden by the Spark pool's default values.

 

So what is the next step to persist these configurations at the Spark session level?

 

For notebooks

If we want to configure a session with more executors than are defined at the system level (in this case 2 executors, as we saw above), we can write the sample code below to provision the session with 4 executors. This sample code lets a session logically get more executors.

[Screenshot: notebook cell that configures the session with 4 executors]
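As a minimal sketch of this step (the exact code in the screenshot may differ), one way to request session-level resources in a Synapse notebook is the %%configure magic, which passes a Livy-style JSON body. It has to run at the start of the session, or with -f to force the session to restart with the new settings; the values here are illustrative:

%%configure -f
{
    "numExecutors": 4,
    "executorCores": 4
}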

Execute the code below to confirm that the number of executors matches the value defined for the session, which is 4:

[Screenshot: notebook cell confirming the executor count for the session]
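As a minimal sketch of such a check (the screenshot may use a different approach), the executor settings can be read back from the active Spark context's configuration; spark here is the SparkSession that the Synapse notebook provides by default:

# "spark" is the SparkSession provided by the Synapse notebook
conf = spark.sparkContext.getConf()
# read back the executor settings that were applied to this session
print(conf.get("spark.executor.instances", "not set"))
print(conf.get("spark.executor.cores", "not set"))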

You can also see these executors in the Spark UI if you want to cross-verify:

[Screenshot: Spark UI executors page showing the 4 executors for the session]

A list of many session-level configurations is described here.

 

We can also set up the desired session-level configuration in an Apache Spark job definition:

 

For Apache Spark jobs:

 

If we want to add those configurations to our job, we have to set them when we initialize the Spark session or Spark context; for example, for a PySpark job:

 

Spark Session:

 

from pyspark.sql import SparkSession

if __name__ == "__main__":

    # create a Spark session with the necessary configuration
    spark = SparkSession \
        .builder \
        .appName("testApp") \
        .config("spark.executor.instances", "4") \
        .config("spark.executor.cores", "4") \
        .getOrCreate()

 

Spark Context:

 

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create a Spark context with the necessary configuration
    conf = SparkConf() \
        .setAppName("testApp") \
        .set("spark.hadoop.validateOutputSpecs", "false") \
        .set("spark.executor.cores", "4") \
        .set("spark.executor.instances", "4")
    sc = SparkContext(conf=conf)

 

 

Hope this helps you configure the number of executors for a job or notebook as per your needs.

 

 
