Step by step: transferring data from an Azure Synapse instance using a Spark data source with Databricks
Published May 06 2020 08:57 AM

It has been a long time since I made a post about SQL DW. I was trying to remember when, and I found it was in 2017:

Anyway, this one is not really about the DW (it is only indirectly about it). I am writing to provide a simple guide, based on the Databricks documentation, for creating a lab on a Databricks cluster and transferring data from an Azure Synapse instance using a Spark data source implementation.


Again, this is a very beginner-level, step-by-step guide. For you to successfully follow it you will need:


  1. Resource group
  2. Storage account
  3. Databricks cluster
  4. DW database


Please refer to the documentation for each prerequisite for more detail:


  1. Resource Group
  2. Storage Account
  3. Databricks Cluster
  4. DW Database


Once you have all the prerequisites in place, open Databricks and create a new notebook (figure 1):




Figure 1: New notebook




Define a name for the notebook, select the cluster that you created, and choose a language. Our example will use SQL, as you can see in figure 2:



Figure 2: Define notebook configuration


As I already mentioned, I based this post on the Databricks documentation, so most of the code is available there.


However, our code will be slightly different. The next step involves working on the notebook, so you will need information about your storage account to add to the notebook code later.


This is the piece of code where we will set up the storage account:
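Based on the Databricks documentation this post follows, setting the key in a SQL notebook looks something like this (the placeholders are to be replaced with your own values):

```sql
-- Set up the Blob storage account access key in the notebook session conf.
SET fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net=<your-storage-account-access-key>;
```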


To copy the required storage key and replace it in the syntax above: open the portal, select the Storage accounts service -> select the storage account that you previously created for this lab -> select the Access keys option, as shown in figure 3.


For example, my storage account is: acctnameadf




Figure 3: Storage account access key


Once you have the key you can change the code as in the example below:


-- Set up the Blob storage account access key in the notebook session conf.
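With my storage account (acctnameadf), the session-conf line would become something like the following; the key itself is the one copied from the portal in figure 3:

```sql
SET fs.azure.account.key.acctnameadf.blob.core.windows.net=<your-storage-account-access-key>;
```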




  1. You will need to define a table in your SQL DW to be the source of the data. I defined mine as my_table_in_dw.






This is the code that will be used on the notebook to create the table in Spark with the source on SQL DW:






-- Read data using SQL.
CREATE TABLE my_table_in_spark_read
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw', -- Source table on the DW. Replace with your table name
  tempDir 'wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>'
);





In other words, applying the code to our example: 

Note: I am using a SQL Server user to connect. To keep it simple, I am not using AAD, just a SQL user on the DW: 'jdbc:sqlserver://;databaseName=YOURDATABASENAME;user=USERtoconnectonSQL;password=PASSWORDofthatUser',





-- Read data using SQL.
CREATE TABLE IF NOT EXISTS my_table_in_spark_read
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://;databaseName=YOURDATABASENAME;user=USERtoconnectonSQL;password=PASSWORDofthatUser',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw',
  tempDir 'wasbs://' -- replace with your container and storage account (mine are adftutorial and acctnameadf respectively)
);
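For reference, with my container (adftutorial) and storage account (acctnameadf), and assuming a directory named tempdirs (the directory name is my choice), the tempDir option would read:

```sql
tempDir 'wasbs://adftutorial@acctnameadf.blob.core.windows.net/tempdirs'
```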






  2. The next step is to create a table on SQL DW using the table previously created in Spark:




-- Write data using SQL.
-- Create a new table, throwing an error if a table with the same name already exists:
CREATE TABLE my_table_in_spark_write
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw_copy',
  tempDir 'wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>'
)
AS SELECT * FROM table_to_save_in_spark;





Our example will be:

Note: I am using the same table created in the last step, and I also added the IF NOT EXISTS clause to the table creation.

The table to be created on SQL DW is my_table_in_dw_copy





CREATE TABLE IF NOT EXISTS my_table_in_spark_write -- Spark table
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://;databaseName=YOURDATABASENAME;user=Test;password=SQLUSERPASSWORD',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw_copy', -- table to be created on SQL DW
  tempDir 'wasbs://' -- replace with your container and storage account (mine are adftutorial and acctnameadf respectively)
)
AS SELECT * FROM my_table_in_spark_read;






Now hit the Run All option. The expected results are shown in figure 4:



Figure 4: Jobs


If you are wondering where the Databricks tables that we just created are, check the Data option, as per figure 5 below:




Figure 5: Spark Tables


If you check the SQL DW you will also find a new object, my_table_in_dw_copy, as figure 6 shows:


Figure 6: DW objects


Table created for the example on the DW (if you want to use my example table as the source):






CREATE TABLE [dbo].[my_table_in_dw]
(
  [NAME_] [varchar](50) NULL,
  [sURNAME_] [varchar](50) NULL
)
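If you want a couple of sample rows to watch flowing through the copy, inserts such as these would do (the values are mine):

```sql
INSERT INTO [dbo].[my_table_in_dw] ([NAME_], [sURNAME_]) VALUES ('Liliam', 'Leme');
INSERT INTO [dbo].[my_table_in_dw] ([NAME_], [sURNAME_]) VALUES ('John', 'Smith');
```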






So maybe you are asking yourself: why do I need a storage account?

Note: the storage account is used for temporary data staging. As you can see in the documentation: "

Azure Synapse instead creates a subdirectory of the form: <tempDir>/<yyyy-MM-dd>/<HH-mm-ss-SSS>/<randomUUID>/. You can set up periodic jobs (using the Databricks jobs feature or otherwise) to recursively delete any subdirectories that are older than a given threshold (for example, 2 days), with the assumption that there cannot be Spark jobs running longer than that threshold.

The Azure Synapse connector automates data transfer between a Databricks cluster and an Azure Synapse instance. For reading data from an Azure Synapse table or query or writing data to an Azure Synapse table, the Azure Synapse connector creates temporary objects, including DATABASE SCOPED CREDENTIAL, EXTERNAL DATA SOURCE, EXTERNAL FILE FORMAT, and EXTERNAL TABLE behind the scenes. These objects live only throughout the duration of the corresponding Spark job and should automatically be dropped thereafter.”
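The periodic cleanup job the documentation suggests could be sketched in pure Python; the helper name and path handling below are mine, and it assumes the <tempDir>/<yyyy-MM-dd>/<HH-mm-ss-SSS>/<randomUUID>/ layout and the two-day threshold from the quote above:

```python
from datetime import datetime, timedelta

def stale_temp_subdirs(subdirs, now, max_age_days=2):
    """Return the connector temp subdirectories older than max_age_days.

    Paths are expected in the documented layout:
    <tempDir>/<yyyy-MM-dd>/<HH-mm-ss-SSS>/<randomUUID>/
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for path in subdirs:
        parts = path.rstrip("/").split("/")
        # date and time-of-day are the 3rd- and 2nd-to-last path components
        day, time_part = parts[-3], parts[-2]
        hh, mm, ss, _ms = time_part.split("-")
        stamp = datetime.strptime(day, "%Y-%m-%d").replace(
            hour=int(hh), minute=int(mm), second=int(ss))
        if stamp < cutoff:
            stale.append(path)
    return stale
```

In a Databricks job you would list the subdirectories under your tempDir (for example with dbutils.fs.ls), pass them through this check, and delete the stale ones.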


You will be able to check those temporary files for yourself under your storage account, as figure 7 demonstrates:


Figure 7: Temporary files


Or you can also query them by running this:

  • SELECT * FROM sys.database_scoped_credentials WHERE name LIKE 'tmp_databricks_%'
  • SELECT * FROM sys.external_data_sources WHERE name LIKE 'tmp_databricks_%'
  • SELECT * FROM sys.external_file_formats WHERE name LIKE 'tmp_databricks_%'
  • SELECT * FROM sys.external_tables WHERE name LIKE 'tmp_databricks_%'


Note: if the Spark driver crashes or is forcefully restarted, or if the cluster is forcefully terminated or restarted, temporary objects might not be dropped, so it is important to know how to track those files.
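If you do find leaked objects with the queries above, they can be removed with the standard T-SQL DROP statements; the names below are placeholders, to be taken from your query results:

```sql
DROP EXTERNAL TABLE <leaked_external_table_name>;
DROP EXTERNAL DATA SOURCE <leaked_data_source_name>;
DROP EXTERNAL FILE FORMAT <leaked_file_format_name>;
DROP DATABASE SCOPED CREDENTIAL <leaked_credential_name>;
```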


Liliam C Leme

Version history: last updated May 07 2020 01:30 AM