Once you have all the prerequisites in place, open Databricks and create a new notebook (figure 1):
Figure 1: New Notebook
Define a name for the notebook, select the cluster you created, and choose a language. Our example uses SQL, as you can see in figure 2:
Figure 2: Define notebook configuration
As I already mentioned, this post is based on the Databricks documentation, so most of the code is available here: https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html#language-sql
However, our code will be slightly different. The next step involves working on the notebook, so you will need some information about your storage account to add to the notebook code.
This is the piece of code where we will set up the storage account:
To copy the required storage key and replace it in the syntax above, open the Azure portal and select the Storage accounts service -> select the name of the storage account that was previously created for this lab -> select the Access keys option of your storage account, as in figure 3.
For example, my storage account is: acctnameadf
Figure 3: Storage account access key
Once you have the key, you can change the code as in the example below:
-- Set up the Blob storage account access key in the notebook session conf.
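Following the SQL section of the Databricks documentation linked above, that setup cell is a single SET command on the notebook session configuration; the placeholders below are for you to fill in (my storage account, for example, is acctnameadf):

```sql
-- Replace <your-storage-account-name> and <your-storage-account-access-key>
-- with the account name and the key copied in figure 3.
SET fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net=<your-storage-account-access-key>;
```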
This is the code that will be used in the notebook to create the table in Spark with its source in the SQL DW:
-- Read data using SQL.
CREATE TABLE my_table_in_spark_read
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw', -- Source table on the DW. Replace with your table name.
  tempDir 'wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>'
);
In other words, applying the code to our example:
Note: I am using a SQL user to connect. To keep things simple, I am not using AAD, just a SQL user on the DW:
'jdbc:sqlserver://YOURSERVERNAME.database.windows.net:1433;databaseName=YOURDATABASENAME;user=USERtoconnectonSQL;password=PASSWORDofthatUser',
-- Read data using SQL.
CREATE TABLE IF NOT EXISTS my_table_in_spark_read
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://YOURSERVERNAME.database.windows.net:1433;databaseName=YOURDATABASENAME;user=USERtoconnectonSQL;password=PASSWORDofthatUser',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw',
  -- Replace with your container and storage account (mine are adftutorial and acctnameadf respectively).
  tempDir 'wasbs://adftutorial@acctnameadf.blob.core.windows.net/'
);
-- Write data using SQL.
-- Create a new table, throwing an error if a table with the same name already exists:
CREATE TABLE my_table_in_spark_write
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw_copy',
  tempDir 'wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>'
)
AS SELECT * FROM table_to_save_in_spark;
Our example will be:
Note: I am using the same table created in the last step, and I also added the IF NOT EXISTS clause to the table creation.
The table to be created on SQL DW is my_table_in_dw_copy
CREATE TABLE IF NOT EXISTS my_table_in_spark_write -- Spark table
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://YOURSERVERNAME.database.windows.net:1433;databaseName=YOURDATABASENAME;user=Test;password=SQLUSERPASSWORD',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw_copy', -- Table to be created on the SQL DW.
  -- Replace with your container and storage account (mine are adftutorial and acctnameadf respectively).
  tempDir 'wasbs://adftutorial@acctnameadf.blob.core.windows.net/'
)
AS SELECT * FROM my_table_in_spark_read;
Now hit the Run All option. The expected results are shown in figure 4:
Figure 4: Jobs
If you are wondering where the Databricks tables we just created are, check the Data option, as per figure 5 below:
Figure 5: Spark Tables
If you check our SQL DW, you will also find a new object, my_table_in_dw_copy, as figure 6 shows:
Figure 6: DW Objects
This is the table created for the example on the DW (if you want to use my example table as the source):
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[my_table_in_dw]
(
    [NAME_] [varchar](50) NULL,
    [sURNAME_] [varchar](50) NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
GO
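The post does not include sample rows, so if you want my_table_in_spark_read to return data, you could seed the source table first. The values below are made up purely for illustration:

```sql
-- Hypothetical sample rows so the read table has something to show.
INSERT INTO [dbo].[my_table_in_dw] VALUES ('Ana', 'Silva');
INSERT INTO [dbo].[my_table_in_dw] VALUES ('John', 'Smith');
```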
So maybe you are asking yourself: why do I need a storage account?
Note that the storage account is used for temporary data staging. As you can see in the documentation: "
Azure Synapse instead creates a subdirectory of the form: <tempDir>/<yyyy-MM-dd>/<HH-mm-ss-SSS>/<randomUUID>/. You can set up periodic jobs (using the Databricks jobs feature or otherwise) to recursively delete any subdirectories that are older than a given threshold (for example, 2 days), with the assumption that there cannot be Spark jobs running longer than that threshold.
The Azure Synapse connector automates data transfer between a Databricks cluster and an Azure Synapse instance. For reading data from an Azure Synapse table or query or writing data to an Azure Synapse table, the Azure Synapse connector creates temporary objects, including DATABASE SCOPED CREDENTIAL, EXTERNAL DATA SOURCE, EXTERNAL FILE FORMAT, and EXTERNAL TABLE behind the scenes. These objects live only throughout the duration of the corresponding Spark job and should automatically be dropped thereafter.”
You will be able to check those temporary files for yourself under your storage account, as figure 7 demonstrates:
Figure 7: Temporary files
Or you can also query them by running this:
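One way to track the temporary objects the connector leaves behind is to query the DW catalog views. This is a sketch based on the Databricks documentation quoted above; the `tmp_databricks_%` name prefix is an assumption taken from that documentation, so adjust it if your connector version tags objects differently:

```sql
-- Look for leaked temporary objects created by the Azure Synapse connector.
SELECT * FROM sys.database_scoped_credentials WHERE name LIKE 'tmp_databricks_%';
SELECT * FROM sys.external_data_sources     WHERE name LIKE 'tmp_databricks_%';
SELECT * FROM sys.external_file_formats     WHERE name LIKE 'tmp_databricks_%';
SELECT * FROM sys.external_tables           WHERE name LIKE 'tmp_databricks_%';
```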
Note: if Spark crashes or is forcefully restarted, or if the cluster is forcefully terminated or restarted, temporary objects might not be dropped, so it is important to know how to track those files.
Liliam C Leme