Azure Synapse Spark Notebook – Unit Testing
Published Mar 02 2023 08:00 AM
Microsoft


 

Author(s): Arun Sethia is a Program Manager in Azure Synapse Customer Success Engineering (CSE) team.

 

Introduction

In this blog post, we will cover how to create and run unit test cases for Spark jobs developed using a Synapse Notebook. This is an extension of my previous blog, Synapse - Choosing Between Spark Notebook vs Spark Job Definition, where we discussed how to choose between a Spark Notebook and a Spark Job Definition. Unit testing is an automated approach that developers use to test individual, self-contained units of code. By verifying code behavior early, it helps streamline the development of larger systems.

 

For a Spark Job Definition, developers usually write the code in their preferred IDE and deploy the compiled package binaries through the Spark Job Definition. In addition, developers can use the unit test framework of their choice (ScalaTest, pytest, etc.) to create test cases as part of their project codebase.

 

This blog focuses on writing unit test cases for a Notebook so that you can test it before you roll it out to higher environments. The programming languages most commonly used in Synapse Notebooks are Python and Scala. Both languages support the functional and object-oriented programming paradigms.

 

I will not go into which programming paradigm is best for Spark programming here; we may pick up that topic another day.

 

Code organization

Enterprise systems should be modular, maintainable, configurable, and easy to test, in addition to being scalable and performant. In this blog, our focus is on creating unit test cases for the Synapse Notebook in a modular and maintainable way.

 

Using Azure Synapse, we can organize the Notebook code in multiple ways using various configurations provided by Synapse.  

 

  • External Library - Libraries provide reusability and modularity to your application. They also help to share business functions and enterprise code across multiple applications. Azure Synapse allows you to configure dependencies using library management, and a Notebook can use the installed packages within its jobs. We should avoid writing unit test cases for such an installed library inside the Notebook. Plenty of test frameworks are available for creating unit tests for those libraries within the library source code (or outside it). The Notebook then uses APIs from the installed libraries to orchestrate the business process.

asethia_1-1674684108592.png

Pros

 

  • It helps ensure that developers follow enterprise business guidelines (for example, how to compute the net amount of a retail order or how to validate data such as a phone number) and software engineering best practices (code coverage, styling, etc.).
  • It is easy to integrate a unit test framework, either as part of the library or outside the Notebook.
  • You can use the same library outside of a Notebook as well (for example, in a Spark Job Definition).
  • It is easy to integrate quality plugins such as code coverage, linters, and code style checks into the IDE and the build process.

Cons

  • This approach requires ongoing library version management and enterprise governance.
  • Additional build tools (maven/sbt/gradle/setuptools) are required.
  • A local development environment has to be set up.

 

  • Functions and unit test in different Notebook – Azure Synapse allows you to run/load a Notebook from another Notebook. Given that, you can put the reusable code in one Notebook and write the test cases in another Notebook. Using continuous integration and source control, you can control versions and releases.

asethia_2-1674684273967.png

 

Pros

 

  • Easy to develop using Notebook without any additional build tools.
  • Quick and easy to integrate with other Notebooks.
  • You don’t need a desktop IDE (integrated development environment); Synapse notebooks are integrated with the Monaco editor, which brings IDE-style IntelliSense to the cell editor.

Cons

  • It restricts code reuse to Notebooks; you can’t use code written in a Notebook outside of a Notebook (for example, in a Spark Job Definition).
  • It becomes difficult to maintain as the number of Notebooks grows, because an additional test case Notebook is needed for each business function Notebook.
  • You are restricted to testing via Notebook only.
  • No direct support for linter, code coverage, styling, etc.

 

  • Functions and unit test in same Notebook – The only difference from the earlier approach is that the functions and their test cases are part of the same Notebook. You can still use continuous integration and source control for versioning and releases.

asethia_3-1674684476775.png

 

Pros

 

  • Easy to develop using Notebook without any additional build tools.
  • Easier to maintain than the earlier approach (Functions and unit test in different Notebook), because there are fewer Notebooks.
  • Easy to cross-reference test cases and business functions within a single Notebook.
  • You don’t need a desktop IDE (integrated development environment); Synapse notebooks are integrated with the Monaco editor, which brings IDE-style IntelliSense to the cell editor.

Cons

  • It restricts code reuse to Notebooks; you can’t use code written in a Notebook outside of a Notebook (for example, in a Spark Job Definition).
  • You are restricted to testing via Notebook only.
  • Additional code is needed to skip the test cases in production (for example, commenting them out or using a custom annotation).
  • No direct support for linter, code coverage, styling, etc.
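
One way to handle the "skip test cases in production" con is a simple guard flag. The following is a minimal sketch, not the author's implementation: the flag name `runUnitTests`, the function `isValidPhoneNumber`, and its validation rule are all hypothetical, and the checks use plain `assert` so that the cells run without any test framework.

```scala
// Hypothetical flag. In a real Notebook this could be set from a parameters
// cell instead of being hard-coded.
val runUnitTests: Boolean = true

// Illustrative business function (assumed rule: optional "+" followed by 10 to 15 digits).
def isValidPhoneNumber(phone: String): Boolean =
  phone.matches("""\+?[0-9]{10,15}""")

// Test cell: returns the number of executed cases and fails on the first mismatch.
def runPhoneNumberTests(): Int = {
  val cases = Seq(
    "+14255550100" -> true,
    "4255550100"   -> true,
    "425-555"      -> false,
    ""             -> false
  )
  cases.foreach { case (input, expected) =>
    assert(isValidPhoneNumber(input) == expected, s"unexpected result for '$input'")
  }
  cases.size
}

// The tests only run when the flag is set, so a production run can skip them.
val executed: Int = if (runUnitTests) runPhoneNumberTests() else 0
println(s"Executed $executed test case(s)")
```

With the flag set to `false`, the business function is still defined but none of the checks execute, which keeps the production run free of test overhead.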

 

Unit test examples

This blog covers example code for the first approach (external library) and the second approach (functions and unit tests in different Notebooks). The example code is written in Scala; in the future, we will also add code for PySpark.

 

The example project code is available on GitHub; you can clone the repository to your local computer.

  • The businessfunctions folder has the code for the approach that uses an external library (package). The source code of the business function APIs and the unit test cases for these functions are part of the same module.
  • The Notebook folder has the Synapse Notebooks used for these examples.

 

Unit test with external library

As we described earlier in this blog, this approach does not require us to write any unit test cases as part of the Notebook. Instead, the library source code and its unit test cases coexist outside the Notebook.
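
As a rough illustration of what "library code and tests living together" can look like, here is a minimal sketch. The object `OrderFunctions`, the `netAmount` signature, and the discount-then-tax rule are assumptions made for this example and are not taken from the example repository. The checks are plain assertions so the snippet runs without dependencies; in a library project they would normally be written with a framework such as ScalaTest and placed in the test source folder.

```scala
import scala.math.BigDecimal.RoundingMode
import scala.util.Try

// Hypothetical library code: a pure function with no Spark dependency,
// which makes it straightforward to unit test.
object OrderFunctions {
  // Assumed rule: apply the discount first, then tax, then round to 2 decimals.
  def netAmount(quantity: Int,
                unitPrice: BigDecimal,
                discountRate: BigDecimal,
                taxRate: BigDecimal): BigDecimal = {
    require(quantity >= 0, "quantity must not be negative")
    val gross      = unitPrice * BigDecimal(quantity)
    val discounted = gross * (BigDecimal(1) - discountRate)
    (discounted * (BigDecimal(1) + taxRate)).setScale(2, RoundingMode.HALF_UP)
  }
}

// Hypothetical checks that would sit next to the library source.
object OrderFunctionsChecks {
  def run(): Unit = {
    // 2 x 50.00 = 100.00, minus 10% = 90.00, plus 8% tax = 97.20
    assert(OrderFunctions.netAmount(2, BigDecimal("50.00"), BigDecimal("0.10"), BigDecimal("0.08")) == BigDecimal("97.20"))
    // No discount and no tax leaves the gross amount unchanged.
    assert(OrderFunctions.netAmount(3, BigDecimal("19.99"), BigDecimal(0), BigDecimal(0)) == BigDecimal("59.97"))
    // Invalid input is rejected rather than silently computed.
    assert(Try(OrderFunctions.netAmount(-1, BigDecimal(1), BigDecimal(0), BigDecimal(0))).isFailure)
  }
}

OrderFunctionsChecks.run()
println("All OrderFunctions checks passed")
```

Keeping the function free of Spark types is a deliberate choice in this sketch: the same logic can then be tested in a normal build and reused from either a Notebook or a Spark Job Definition.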

 

You can execute the test cases and build the library package using Maven. The target folder will contain the business function JAR, and the command console will show the executed test cases (alternatively, you can download the pre-built JAR).

 

asethia_4-1674684598082.png

 

Using Synapse Studio for your workspace, you can add the business function library.

 

asethia_5-1674684615946.png

 

Within Azure Synapse, an Apache Spark pool can leverage custom libraries that are uploaded as Workspace Packages.

 

asethia_6-1674684645979.png

 

Running on that Spark pool, the Notebook can use the business library functions to build business processes. The source code of this Notebook is available on GitHub.
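
A Notebook that orchestrates a business process with library functions might look roughly like the sketch below. It assumes it runs inside a Synapse Notebook (or a Spark shell) where a Spark session already exists. The object `RetailFunctions`, its `lineTotal` function, and the column names are hypothetical stand-ins: in the real Notebook the function would be imported from the uploaded JAR rather than defined inline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Stand-in for the installed library (hypothetical). It is defined inline only
// so that this sketch is complete on its own.
object RetailFunctions extends Serializable {
  def lineTotal(quantity: Int, unitPrice: Double): Double = quantity * unitPrice
}

// In a Notebook a session already exists, so getOrCreate() returns it.
val session = SparkSession.builder().getOrCreate()
import session.implicits._

// Small in-memory sample standing in for the real order data.
val orders = Seq(
  (1001, 2, 50.0),
  (1002, 1, 19.5)
).toDF("orderId", "quantity", "unitPrice")

// Wrap the library function in a UDF so it can be applied column-wise.
val lineTotalUdf = udf((quantity: Int, unitPrice: Double) =>
  RetailFunctions.lineTotal(quantity, unitPrice))

val enriched = orders.withColumn("lineTotal", lineTotalUdf(col("quantity"), col("unitPrice")))
enriched.show()
```

The Notebook itself contains only orchestration here; the logic that needs unit tests stays in the library, where it is already covered by the library's own test cases.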

 

asethia_7-1674684664855.png

 

 

Unit test - functions and unit test Notebook

As we described earlier in this blog, this approach requires a minimum of two Notebooks: one for the business functions and another for the unit test cases. The example code is available inside the notebook folder of the cloned repository.

 

The business functions are available inside the BusinessFunctionsLibrary Notebook, and the respective test cases are in the UnitTestBusinessFunctionsLibrary Notebook.
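
The test Notebook could be organized along the following lines. This is a sketch under assumptions, not the content of the actual UnitTestBusinessFunctionsLibrary Notebook: the function `shippingBand` and the helper `runChecks` are hypothetical, and the function is defined inline only so that the snippet is complete. In Synapse, the first cell would instead load the functions Notebook with the `%run` magic (the exact path depends on your workspace).

```scala
// Cell 1 (Synapse magic, shown as a comment so this snippet stays plain Scala):
// %run BusinessFunctionsLibrary

import scala.util.{Failure, Success, Try}

// Stand-in for a function that the referenced Notebook would define (hypothetical).
def shippingBand(weightKg: Double): String =
  if (weightKg <= 0) "invalid"
  else if (weightKg <= 2) "small"
  else if (weightKg <= 20) "medium"
  else "large"

// Runs every named check and records the outcome instead of stopping at the first failure.
def runChecks(checks: Seq[(String, () => Unit)]): Seq[(String, Boolean)] =
  checks.map { case (name, body) =>
    Try(body()) match {
      case Success(_) => (name, true)
      case Failure(e) =>
        println(s"FAILED $name: ${e.getMessage}")
        (name, false)
    }
  }

val results = runChecks(Seq(
  ("zero weight is invalid", () => assert(shippingBand(0) == "invalid")),
  ("2 kg is still small",    () => assert(shippingBand(2) == "small")),
  ("20 kg is still medium",  () => assert(shippingBand(20) == "medium")),
  ("above 20 kg is large",   () => assert(shippingBand(20.5) == "large"))
))

val failed = results.count { case (_, ok) => !ok }
println(s"${results.size - failed} passed, $failed failed")

// Failing the cell makes the Notebook run fail, which a pipeline can detect.
assert(failed == 0, "unit tests failed")
```

Collecting all results before failing is a design choice: one run then reports every broken check, rather than only the first one.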

asethia_8-1674684705027.png

Summary

Whether you choose the multiple-Notebook approach or the library approach depends on your enterprise guidelines, individual preference, and timelines.

 

My next blog will explore code and data quality in Azure Synapse in more depth.

 

 

4 Comments
Iron Contributor

Looking forward to the example how to do unit testing in Python/PySpark

Microsoft

It has been a long time since I was looking for this information! thanks for this post! looking forward for the next

Brass Contributor

Very good!

Microsoft

Was wondering if anyone explored getting these tests to run in the CI/CD pipeline? If so, what were the outcomes? Any success stories?

Last update: Mar 01 2023 03:42 PM