
Step by step: transfer data from an Azure Synapse instance using a Spark data source with Databricks

Liliam_C_Leme
May 06, 2020

It has been a long time since I made a post about SQL DW. I was trying to remember when, and I found it: it was in 2017:

https://docs.microsoft.com/en-gb/archive/blogs/dataplatform/sql-azure-dw-what-is-it-how-it-works

Anyway, this one is not really about the DW (it is only indirectly about it). I am writing just to provide a simple guide based on this documentation: https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html#language-sql. I want to create a lab on a Databricks cluster and transfer data from an Azure Synapse instance using a Spark data source implementation.

 

Again, this is a very beginner guide on how to do it step by step. To successfully follow this guide you will need:

 

  1. Resource group
  2. Storage account
  3. Databricks cluster
  4. DW database

 

Please refer to each link for more detail.

 

  1. Resource Group: https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal
  2. Storage Account: https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create?tabs=azure-portal
  3. Databricks Cluster: https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal
  4. DW Database: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-overview-what-is

 

Once you have all the prerequisites in place, open Databricks and create a new notebook (figure 1):

Figure 1: New notebook

Define a name for the notebook, select the cluster that you created, and choose a language. Our example will use SQL, as you can see in figure 2:

 

Figure 2: Define notebook configuration

 

As I already mentioned, I based this post on the Databricks documentation, so most of the code is available here: https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html#language-sql

 

However, our code will be slightly different. The next step involves working on the notebook, so you will need some information about your storage account to add to the notebook code later.

 

This is the piece of code where we will set up the storage account:

SET fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net=<your-storage-account-access-key>;

To copy the required storage key and replace the placeholders in the syntax above, open the portal and select the Storage accounts service -> select the storage account that you previously created for this lab -> select the Access keys option of your storage account, as shown in figure 3.

 

For example, my storage account is: acctnameadf

Figure 3: Storage account access key

 

Once you have the key, you can change the code as in the example below:

 

-- Set up the Blob storage account access key in the notebook session conf.

SET fs.azure.account.key.acctnameadf.blob.core.windows.net=3333333333+121222222222222/N1eZ6rLvNn2RF/111111==;

  1. You will need to define a table in your SQL DW to be the source of the data. I defined mine as my_table_in_dw (the DDL I used is at the end of this post).

This is the code that will be used in the notebook to create the table in Spark with the source on SQL DW:

-- Read data using SQL.
CREATE TABLE my_table_in_spark_read
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw', -- Source table on the DW. Replace with your table name
  tempDir 'wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>'
);

In other words, applying the code to our example: 

Note: I am using a SQL Server user to connect. To keep it simple, I am not using AAD, just a SQL user on the DW: 'jdbc:sqlserver://YOURServerNAME.database.windows.net:1433;databaseName=YOURDATABASENAME;user=USERtoconnectonSQL;password=PASSWORDofthatUser'

-- Read data using SQL.
CREATE TABLE IF NOT EXISTS my_table_in_spark_read
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://YOURServerNAME.database.windows.net:1433;databaseName=YOURDATABASENAME;user=USERtoconnectonSQL;password=PASSWORDofthatUser',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw',
  tempDir 'wasbs://adftutorial@acctnameadf.blob.core.windows.net/' -- Replace with your container and storage account (mine are adftutorial and acctnameadf respectively)
);
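
To quickly confirm the data came through from the DW, a simple check in the same notebook could look like the query below. This is not part of the original walkthrough, just an optional sanity test.

-- Optional sanity check: preview a few rows of the Spark table backed by the DW source
SELECT * FROM my_table_in_spark_read LIMIT 10;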

  2. The next step will be to create a table on SQL DW using the table created previously in Spark:

-- Write data using SQL.
-- Create a new table, throwing an error if a table with the same name already exists:
CREATE TABLE my_table_in_spark_write
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw_copy',
  tempDir 'wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>'
)
AS SELECT * FROM table_to_save_in_spark;

Our example will be:

Note: I am using the table created in the last step as the source, and I also added the IF NOT EXISTS clause to the table creation.

The table to be created on SQL DW is my_table_in_dw_copy.

CREATE TABLE IF NOT EXISTS my_table_in_spark_write -- Spark table
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://YOURSERVERNAME.database.windows.net:1433;databaseName=YOURDATABASENAME;user=SQLUSERNAME;password=SQLUSERPASSWORD',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw_copy', -- Table to be created on SQL DW
  tempDir 'wasbs://adftutorial@acctnameadf.blob.core.windows.net/' -- Replace with your container and storage account (mine are adftutorial and acctnameadf respectively)
)
AS SELECT * FROM my_table_in_spark_read;
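
After the job finishes, you can compare the two Spark tables to confirm the copy wrote the same number of rows. Again, this is just an optional check I am adding, not part of the original steps; if your runtime only accepts one statement per cell, run each SELECT in its own cell.

-- Optional: compare row counts between the source-backed table and the copy
SELECT COUNT(*) AS rows_read    FROM my_table_in_spark_read;
SELECT COUNT(*) AS rows_written FROM my_table_in_spark_write;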

Now hit the Run All option. The results we expect are shown in figure 4:

 

Figure 4: Jobs

 

If you are wondering where the Databricks tables that we just created are, check the Data option, as per figure 5 below:

Figure 5: Spark Tables
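
If you prefer to check from the notebook itself instead of the Data pane, standard Spark SQL commands such as the ones below will also list and describe the tables (these commands are not part of the original walkthrough):

-- List the tables registered in the current database and inspect one of them
SHOW TABLES;
DESCRIBE TABLE my_table_in_spark_read;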

 

If you check the SQL DW you will also find a new object, my_table_in_dw_copy, as figure 6 shows:

Figure 6: DW objects
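
To double-check from the DW side, a quick query against the copied table should return the same data. This is plain T-SQL run on the DW, added here only as a verification step, and it assumes the connector created the table in the dbo schema:

-- Run against the SQL DW to confirm the copied table has the expected rows
SELECT TOP 10 * FROM [dbo].[my_table_in_dw_copy];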

 

This is the table I created for the example on the DW (in case you want to use my example table as the source):

SET ANSI_NULLS ON
GO
 
SET QUOTED_IDENTIFIER ON
GO
 
CREATE TABLE [dbo].[my_table_in_dw]
(
[NAME_] [varchar](50) NULL,
[sURNAME_] [varchar](50) NULL
)
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
GO
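
If you want the copy step to actually move some rows, you could seed the source table with a couple of sample values first. The values below are just placeholders I made up; the column names come from the table definition above.

-- Optional: seed the source table on the DW with a few sample rows (placeholder values)
INSERT INTO [dbo].[my_table_in_dw] ([NAME_], [sURNAME_]) VALUES ('Ana', 'Silva');
INSERT INTO [dbo].[my_table_in_dw] ([NAME_], [sURNAME_]) VALUES ('John', 'Smith');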

So maybe you are asking yourself: why do I need a storage account?

Note that the storage account is used for temporary data management. As you can see in the documentation: "

Azure Synapse instead creates a subdirectory of the form: <tempDir>/<yyyy-MM-dd>/<HH-mm-ss-SSS>/<randomUUID>/. You can set up periodic jobs (using the Databricks jobs feature or otherwise) to recursively delete any subdirectories that are older than a given threshold (for example, 2 days), with the assumption that there cannot be Spark jobs running longer than that threshold.

The Azure Synapse connector automates data transfer between a Databricks cluster and an Azure Synapse instance. For reading data from an Azure Synapse table or query or writing data to an Azure Synapse table, the Azure Synapse connector creates temporary objects, including DATABASE SCOPED CREDENTIAL, EXTERNAL DATA SOURCE, EXTERNAL FILE FORMAT, and EXTERNAL TABLE behind the scenes. These objects live only throughout the duration of the corresponding Spark job and should automatically be dropped thereafter.”

 

You will be able to check those temporary files for yourself under your storage account, as figure 7 demonstrates:

Figure 7: Temporary files

 

Or you can also query the temporary objects on the DW by running these:

  • SELECT * FROM sys.database_scoped_credentials WHERE name LIKE 'tmp_databricks_%'
  • SELECT * FROM sys.external_data_sources WHERE name LIKE 'tmp_databricks_%'
  • SELECT * FROM sys.external_file_formats WHERE name LIKE 'tmp_databricks_%'
  • SELECT * FROM sys.external_tables WHERE name LIKE 'tmp_databricks_%'

 

Note: if the Spark driver crashes or is forcefully restarted, or if the cluster is forcefully terminated or restarted, temporary objects might not be dropped. So it is important to know how to track those objects.
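
As a hedged sketch of the cleanup, once the queries above return a leftover object you can drop it explicitly on the DW. The object names below are placeholders; use the names the catalog views return, and drop the external table before its data source and file format.

-- Example cleanup, run against the DW after identifying leftovers with the queries above.
-- Replace the placeholder names with the ones returned by the catalog views.
DROP EXTERNAL TABLE [tmp_databricks_leftover_table];
DROP EXTERNAL DATA SOURCE [tmp_databricks_leftover_data_source];
DROP EXTERNAL FILE FORMAT [tmp_databricks_leftover_file_format];
DROP DATABASE SCOPED CREDENTIAL [tmp_databricks_leftover_credential];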

 

Liliam C Leme

Updated May 07, 2020
Version 3.0