Query serverless SQL pool from an Apache Spark Scala notebook
Published Apr 16 2021

Azure Synapse Analytics provides multiple query runtimes that you can use to query in-database or external data. You can choose between T-SQL queries on a serverless Synapse SQL pool and notebooks in Apache Spark for Azure Synapse to analyze your data.

You can also connect these runtimes and run queries from Spark notebooks on a dedicated SQL pool.

In this post, you will see how to create Scala code in a Spark notebook that executes a T-SQL query on a serverless SQL pool.

 

Configuring a connection to the serverless SQL pool endpoint

Azure Synapse Analytics enables you to run your queries on an external SQL query engine (Azure SQL, SQL Server, a dedicated SQL pool in Azure Synapse) using a standard JDBC connection. With the Apache Spark runtime in Azure Synapse, you also get a pre-installed driver that enables you to send a query to any T-SQL endpoint. This means that you can use this driver to run a query on a serverless SQL pool.

First, you need to initialize the connection with the following steps:

  • Define the connection string to your remote T-SQL endpoint (a serverless SQL pool in this case),
  • Specify connection properties (for example, user name and password),
  • Set the driver for the connection.

The following Scala code initializes the connection to the serverless SQL pool endpoint:

// Define connection:
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")

val hostname = "<WORKSPACE NAME>-ondemand.sql.azuresynapse.net"
val port = 1433
val database = "master" // If needed, change the database 
val jdbcUrl = s"jdbc:sqlserver://${hostname}:${port};database=${database}"

// Define connection properties:
import java.util.Properties

val props = new Properties()
props.put("user", "<sql login name>")
props.put("password", "<sql login password>")

// Assign driver to connection:
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
props.setProperty("Driver", driverClass)

Place this code in a cell of your notebook; you can then use the connection to query external T-SQL endpoints. In the following sections you will see how to read data from a SQL table or view and how to run an ad-hoc query using this connection.
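
As a quick sanity check, a minimal sketch along the following lines (assuming the placeholders above have been replaced with your workspace name and SQL login) runs a trivial query through the connection and prints the engine version. The sub-query alias v is only an illustration; spark.read.jdbc() reads a parenthesized query with an alias just like a table, which is the same pattern the ad-hoc query section below uses:

// A quick connectivity test (sketch): wrap a trivial T-SQL statement as a
// sub-query so that spark.read.jdbc() can read it like a table.
val versionCheck = spark.read.jdbc(jdbcUrl, "(select @@version as server_version) v", props)
versionCheck.show(false)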

 

Reading the content of a SQL table

The serverless SQL pool in Azure Synapse enables you to create views and external tables over data stored in your Azure Data Lake Storage account or in the Azure Cosmos DB analytical store. With the connection initialized in the previous step, you can easily read the content of such a view or external table.

In the following simplified example, the Scala code reads data from a system view that exists on the serverless SQL pool endpoint:

val objects = spark.read.jdbc(jdbcUrl, "sys.objects", props)
objects.show(10)

If you create a view or an external table, you can just as easily read data from that object instead of the system view, as the sketch below shows.
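
For example, a minimal sketch like the following reads from a user-created view; dbo.populationView is a hypothetical name used only for illustration and would have to exist in the database that jdbcUrl points to:

// Hypothetical example: read a user-created view or external table.
// Replace dbo.populationView with an object that exists in your database.
val population = spark.read.jdbc(jdbcUrl, "dbo.populationView", props)
population.show(10)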

You can also specify which columns should be returned and add filter conditions:

val objects = spark.read.jdbc(jdbcUrl, "sys.objects", props).
                            select("object_id", "name", "type").
                            where("type <> 'S'")
objects.show(10)

Executing a remote ad-hoc query

You can easily define a T-SQL query that should be executed on the remote serverless SQL pool endpoint and retrieve the results. The Scala sample that can be added to the initial code is shown in the following listing:

val tsqlQuery =
"""
select top 10 *
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    format = 'parquet') as rows
"""

val cases = spark.read.jdbc(jdbcUrl, s"(${tsqlQuery}) res", props)
cases.show(10)

The text of the T-SQL query is defined in the variable tsqlQuery. The Spark notebook executes this T-SQL query on the remote serverless Synapse SQL pool using the spark.read.jdbc() function, with the query wrapped as a sub-query named res. The results of this query are loaded into a local DataFrame and displayed in the output.
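
Because the result is an ordinary Spark DataFrame, you can continue the analysis with regular Spark transformations. The following sketch is illustrative only; the column names countries_and_territories and cases are assumptions about the ECDC data set and may differ in the file you query:

import org.apache.spark.sql.functions._

// Illustrative follow-up: aggregate the rows returned by the remote T-SQL query.
// The column names below are assumptions about the ECDC COVID-19 data set.
val casesByCountry = cases.
                       groupBy("countries_and_territories").
                       agg(sum("cases").alias("total_cases")).
                       orderBy(desc("total_cases"))
casesByCountry.show(10)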

 

Conclusion

Azure Synapse Analytics enables you to easily integrate analytic runtimes and run a query from the Apache Spark runtime on a Synapse SQL pool. Although Apache Spark has built-in functionality for accessing data on Azure Storage, there is some additional Synapse SQL functionality that you can leverage in Spark jobs:

  • Accessing storage using SAS tokens or the workspace managed identity. This way you can use the serverless SQL pool to access Azure Data Lake storage protected with private endpoints or time-limited keys.
  • Using custom language rules for text processing. Synapse SQL contains text comparison and sorting rules (collations) for most of the world's languages. If you need case- or accent-insensitive searches, or need to filter text using Japanese, French, German, or other language-specific rules, Synapse SQL provides native support for this kind of text processing (see the sketch after this list).
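
As an illustration, a remote query can apply a T-SQL collation to get a case- and accent-insensitive filter. The sketch below makes a few assumptions: Latin1_General_100_CI_AI is one of the standard SQL Server collations, and sys.objects is used only because it always exists on the endpoint; substitute your own view and collation as needed:

// Sketch: push a case- and accent-insensitive comparison down to the
// serverless SQL pool by applying a T-SQL collation in the remote query.
val collatedQuery =
"""
select name, type_desc
from sys.objects
where name collate Latin1_General_100_CI_AI like N'%queue%'
"""

val filtered = spark.read.jdbc(jdbcUrl, s"(${collatedQuery}) res", props)
filtered.show(10)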

The samples described in this article might help you reuse the functionality that is available in the serverless Synapse SQL pool to load data from Azure Data Lake storage or the Azure Cosmos DB analytical store directly into your Spark DataFrames. Once you load your data, Apache Spark enables you to analyze the data sets using the advanced data transformation and machine learning functionality available in Spark libraries.
