Author(s): Arun Sethia is a Program Manager on the Azure Synapse Customer Success Engineering (CSE) team.
In this blog post, we will cover how to create unit test cases for Spark jobs developed with Synapse Notebooks. This is an extension of my previous blog, Synapse - Choosing Between Spark Notebook vs Spark Job Definition, where we discussed selecting between a Spark Notebook and a Spark Job Definition. Unit testing is an automated approach developers use to test individual, self-contained units of code. By verifying code behavior early, it helps streamline coding practices for larger systems.
For a Spark job definition, developers usually write the code in their preferred IDE and deploy the compiled package binaries using a Spark job definition. In addition, they can use the unit test framework of their choice (ScalaTest, pytest, etc.) to create test cases as part of the project codebase.
This blog focuses on writing unit test cases for a Notebook so that you can test the code before rolling it out to higher environments. The common programming languages used in Synapse Notebooks are Python and Scala; both support functional and object-oriented programming paradigms.
I will refrain from a deep dive into choosing the best programming paradigm for Spark programming; perhaps we will pick up that topic another day.
Enterprise systems should be modular, maintainable, configurable, and easy to test, in addition to being scalable and performant. In this blog, our focus is on creating unit test cases for Synapse Notebooks in a modular and maintainable way.
Using Azure Synapse, we can organize the Notebook code in multiple ways, each with its own pros and cons:

- Approach #1: Keep the business functions and their unit test cases in a library project outside the Notebook, build a package, and attach it to the Spark pool.
- Approach #2: Keep the business functions in one Notebook and the unit test cases in a separate Notebook.
- Approach #3: Keep the business functions and unit test cases together in a single Notebook.
This blog covers example code for Approach #1 and Approach #2. The examples are written in Scala; in the future, we will also add code for PySpark.
An example project is available on GitHub; you can clone the repository to your local computer.
Approach #1: As described earlier, this approach does not require writing any unit test cases inside the Notebook. Instead, the library source code and unit test cases coexist outside the Notebook.
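To make this concrete, here is a minimal sketch of what such a library function and its ScalaTest case might look like. The object, function, and column names below are hypothetical placeholders, not the ones from the linked repository:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical business function kept in the library project, outside the Notebook.
object OrderFunctions {
  // Keep only orders at or above a minimum amount.
  def filterLargeOrders(df: DataFrame, minAmount: Double): DataFrame =
    df.filter(col("amount") >= minAmount)
}

// Unit test case that lives in the same project and runs at build time.
class OrderFunctionsTest extends AnyFunSuite {
  private lazy val spark = SparkSession.builder()
    .master("local[*]") // local Spark session, no cluster needed for unit tests
    .appName("business-function-tests")
    .getOrCreate()

  test("filterLargeOrders keeps only orders at or above the minimum") {
    import spark.implicits._
    val input  = Seq(("o1", 50.0), ("o2", 150.0)).toDF("orderId", "amount")
    val result = OrderFunctions.filterLargeOrders(input, 100.0)
    assert(result.as[(String, Double)].collect().toSeq == Seq(("o2", 150.0)))
  }
}
```

Because the test spins up a local Spark session, it runs on the build machine without touching the Synapse workspace at all.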
You can execute the test cases and build the library package using a Maven command such as `mvn clean package`. The target folder will contain the business function jar, and the console output will show the executed test cases (alternatively, you can download the pre-built jar).
Using Synapse Studio, you can add the business function library to your workspace.
Within Azure Synapse, an Apache Spark pool can leverage custom libraries that are uploaded as Workspace Packages.
Once the library is available on the Spark pool, a Notebook can use the business functions to build business processes. The source code of this Notebook is available on GitHub.
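Assuming the package has been attached to the Spark pool, a Notebook cell can then call the library function directly. The import path, storage URL, and function name below are hypothetical:

```scala
// Notebook cell: call the business function from the attached workspace package.
// Package, path, and function names here are hypothetical placeholders.
import com.contoso.business.OrderFunctions

val orders = spark.read.parquet(
  "abfss://data@contosostorage.dfs.core.windows.net/orders")

val largeOrders = OrderFunctions.filterLargeOrders(orders, 100.0)
display(largeOrders) // render the resulting DataFrame in the Notebook
```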
Approach #2: As described earlier, this approach requires a minimum of two Notebooks: one for the business functions and another for the unit test cases. The example code is available inside the notebook folder of the cloned repository.
The business functions live in the BusinessFunctionsLibrary Notebook, and the corresponding test cases are in the UnitTestBusinessFunctionsLibrary Notebook.
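Synapse Notebooks can reference one another with the %run magic command, which is one way the test Notebook can pick up the functions defined in BusinessFunctionsLibrary. The sketch below assumes a hypothetical top-level function named filterLargeOrders and uses plain assertions; the actual repository may structure its tests differently:

```scala
// Cell 1 of UnitTestBusinessFunctionsLibrary (a magic command needs its own cell):
// %run BusinessFunctionsLibrary

// Cell 2: a simple assertion-style test against the shared function.
import spark.implicits._

val input  = Seq(("o1", 50.0), ("o2", 150.0)).toDF("orderId", "amount")
val result = filterLargeOrders(input, 100.0) // hypothetical function from the referenced Notebook

assert(result.count() == 1, "expected exactly one order at or above the minimum")
println("filterLargeOrders: test passed")
```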
Whether you use the multiple-Notebook approach or the library approach depends on your enterprise guidelines, personal preference, and timelines.
My next blog will explore code and data quality in Azure Synapse in more detail.