Query serverless SQL pool from an Apache Spark Scala notebook
Published Apr 16 2021

Azure Synapse Analytics provides multiple query runtimes that you can use to query in-database or external data. You can choose between T-SQL queries on a serverless Synapse SQL pool and notebooks in Apache Spark for Azure Synapse to analyze your data.

You can also connect these runtimes and run queries from Spark notebooks on a dedicated SQL pool.

In this post, you will see how to create Scala code in a Spark notebook that executes a T-SQL query on a serverless SQL pool.

 

Configuring a connection to the serverless SQL pool endpoint

Azure Synapse Analytics enables you to run your queries on an external SQL query engine (Azure SQL, SQL Server, a dedicated SQL pool in Azure Synapse) using a standard JDBC connection. With the Apache Spark runtime in Azure Synapse, you also get a pre-installed driver that enables you to send a query to any T-SQL endpoint. This means that you can use this driver to run a query on a serverless SQL pool.

First, you need to initialize the connection with the following steps:

  • Define the connection string to your remote T-SQL endpoint (a serverless SQL pool in this case),
  • Specify connection properties (for example, user name and password),
  • Set the driver for the connection.

The following Scala code initializes the connection to the serverless SQL pool endpoint:

// Define connection:
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")

val hostname = "<WORKSPACE NAME>-ondemand.sql.azuresynapse.net"
val port = 1433
val database = "master" // If needed, change the database 
val jdbcUrl = s"jdbc:sqlserver://${hostname}:${port};database=${database}"

// Define connection properties:
import java.util.Properties

val props = new Properties()
props.put("user", "<sql login name>")
props.put("password", "<sql login password>")

// Assign driver to connection:
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
props.setProperty("Driver", driverClass)

Place this code in a cell of your notebook; you can then use the connection to query external T-SQL endpoints. In the following sections you will see how to read data from a SQL table or view and how to run an ad-hoc query using this connection.
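
As a quick sanity check, a minimal sketch along the following lines (assuming the placeholders above have been replaced with your workspace name and SQL login) runs a trivial query through the connection and prints the engine version. The sub-query alias v is only an illustration; spark.read.jdbc() reads a parenthesized query with an alias just like a table, which is the same pattern the ad-hoc query section below uses:

// A quick connectivity test (sketch): wrap a trivial T-SQL statement as a
// sub-query so that spark.read.jdbc() can read it like a table.
val versionCheck = spark.read.jdbc(jdbcUrl, "(select @@version as server_version) v", props)
versionCheck.show(false)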

 

Reading the content of a SQL table

The serverless SQL pool in Azure Synapse enables you to create views and external tables over data stored in your Azure Data Lake Storage account or in the Azure Cosmos DB analytical store. With the connection initialized in the previous step, you can easily read the content of such a view or external table.

In the following simplified example, the Scala code reads data from a system view that exists on the serverless SQL pool endpoint:

val objects = spark.read.jdbc(jdbcUrl, "sys.objects", props)
objects.show(10)

If you create a view or an external table, you can just as easily read data from that object instead of the system view, as the sketch below shows.
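
For example, a minimal sketch like the following reads from a user-created view; dbo.populationView is a hypothetical name used only for illustration and would have to exist in the database that jdbcUrl points to:

// Hypothetical example: read a user-created view or external table.
// Replace dbo.populationView with an object that exists in your database.
val population = spark.read.jdbc(jdbcUrl, "dbo.populationView", props)
population.show(10)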

You can also specify which columns should be returned and add filter conditions:

val objects = spark.read.jdbc(jdbcUrl, "sys.objects", props).
                            select("object_id", "name", "type").
                            where("type <> 'S'")
objects.show(10)

Executing a remote ad-hoc query

You can easily define a T-SQL query that should be executed on the remote serverless SQL pool endpoint and retrieve the results. The Scala sample that can be added to the initial code is shown in the following listing:

val tsqlQuery =
"""
select top 10 *
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    format = 'parquet') as rows
"""

val cases = spark.read.jdbc(jdbcUrl, s"(${tsqlQuery}) res", props)
cases.show(10)

The text of the T-SQL query is defined in the variable tsqlQuery. The Spark notebook executes this T-SQL query on the remote serverless Synapse SQL pool using the spark.read.jdbc() function, with the query wrapped as a sub-query named res. The results of this query are loaded into a local DataFrame and displayed in the output.
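
Because the result is an ordinary Spark DataFrame, you can continue the analysis with regular Spark transformations. The following sketch is illustrative only; the column names countries_and_territories and cases are assumptions about the ECDC data set and may differ in the file you query:

import org.apache.spark.sql.functions._

// Illustrative follow-up: aggregate the rows returned by the remote T-SQL query.
// The column names below are assumptions about the ECDC COVID-19 data set.
val casesByCountry = cases.
                       groupBy("countries_and_territories").
                       agg(sum("cases").alias("total_cases")).
                       orderBy(desc("total_cases"))
casesByCountry.show(10)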

 

Conclusion

Azure Synapse Analytics enables you to easily integrate analytic runtimes and run a query from the Apache Spark runtime on a Synapse SQL pool. Although Apache Spark has built-in functionality for accessing data on Azure Storage, there is some additional Synapse SQL functionality that you can leverage in Spark jobs:

  • Accessing storage using SAS tokens or the workspace managed identity. This way you can use the serverless SQL pool to access Azure Data Lake storage protected with private endpoints or time-limited keys.
  • Using custom language rules for text processing. Synapse SQL contains text comparison and sorting rules (collations) for most of the world's languages. If you need case- or accent-insensitive searches, or need to filter text using Japanese, French, German, or other language-specific rules, Synapse SQL provides native support for this kind of text processing (see the sketch after this list).
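
As an illustration, a remote query can apply a T-SQL collation to get a case- and accent-insensitive filter. The sketch below makes a few assumptions: Latin1_General_100_CI_AI is one of the standard SQL Server collations, and sys.objects is used only because it always exists on the endpoint; substitute your own view and collation as needed:

// Sketch: push a case- and accent-insensitive comparison down to the
// serverless SQL pool by applying a T-SQL collation in the remote query.
val collatedQuery =
"""
select name, type_desc
from sys.objects
where name collate Latin1_General_100_CI_AI like N'%queue%'
"""

val filtered = spark.read.jdbc(jdbcUrl, s"(${collatedQuery}) res", props)
filtered.show(10)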

The samples described in this article might help you reuse the functionality that is available in the serverless Synapse SQL pool to load data from Azure Data Lake storage or the Azure Cosmos DB analytical store directly into your Spark DataFrames. Once you load your data, Apache Spark enables you to analyze the data sets using the advanced data transformation and machine learning functionality available in Spark libraries.
