A common data engineering task is to explore, transform, and load data into a data warehouse using Azure Synapse Apache Spark. The Azure Synapse Dedicated SQL Pool Connector for Apache Spark is the way to read and write large volumes of data efficiently between Apache Spark and a Dedicated SQL Pool in Synapse Analytics. The connector supports the Scala and Python languages on Synapse Notebooks to perform these operations.
The intention of this guide is not to explain every connector feature; if you require a deeper understanding of how this connector works, start here.
Usually, customers use Synapse Apache Spark to load data into a Dedicated SQL Pool within an Azure Synapse workspace, but today I would like to reproduce a different scenario that I worked on in one of my support cases. Consider a scenario where you are trying to load data from Synapse Spark to a Dedicated SQL Pool (formerly SQL DW) using Synapse Pipelines, and additionally your Synapse workspace is deployed with a Managed Virtual Network.
The intention of this guide is to help you with the configuration required to load data from Azure Synapse Apache Spark to a Dedicated SQL Pool (formerly SQL DW). If you prefer to take advantage of the new feature-rich capabilities now available via the Synapse workspace and Studio, and to load data directly from Azure Apache Spark to a Dedicated SQL Pool in an Azure Synapse workspace, it is recommended that you enable Synapse workspace features on your existing dedicated SQL pool (formerly SQL DW).
Before we start, here are some initial considerations.
To write data to internal tables, the connector now uses the COPY statement instead of the CETAS/CTAS approach. It also requires storage permissions.
There are three deployed resources needed to make this configuration possible: an Azure Synapse workspace (with a Managed Virtual Network), a Dedicated SQL Pool (formerly SQL DW), and an Azure Data Lake Storage Gen2 account.
Fig 1 - Resources
Fig 2 - Creating an Azure Synapse Notebook
Here is the Spark script used within my sample notebook to generate data:
%%sql
CREATE DATABASE IF NOT EXISTS SampleDB
%%sql
USE SampleDB
%%sql
CREATE TABLE IF NOT EXISTS SampleDB.SampleTable (id int, name string, age int) USING PARQUET
%%sql
INSERT INTO SampleDB.SampleTable VALUES (1, 'Your Name', 18)
%%pyspark
df = spark.sql("SELECT * FROM `SampleDB`.`SampleTable`")
display(df)
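As a side note, the same sample table can also be produced with the DataFrame API instead of Spark SQL. The cell below is a hypothetical equivalent, not part of the original notebook, that generates the same test data:
%%pyspark
# Hypothetical alternative: build the same sample row with the DataFrame API
sample_df = spark.createDataFrame([(1, 'Your Name', 18)], schema='id int, name string, age int')
# Save it as the managed Parquet table used throughout this guide
sample_df.write.mode('overwrite').saveAsTable('SampleDB.SampleTable')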
Below is a sample that writes to an internal table using AAD authentication with the Python connector:
# Write using AAD Auth to internal table
# Add required imports
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants
# Configure and submit the request to write to Synapse Dedicated SQL Pool
# Sample below is using AAD-based authentication approach; See further examples to leverage SQL Basic auth.
(df.write
# If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
# to `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
.option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
# Likewise, if `Constants.TEMP_FOLDER` is not provided, the connector will use the runtime staging directory config (see the section on Configuration Options for details).
.option(Constants.TEMP_FOLDER, "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<some_base_path_for_temporary_staging_folders>")
# Choose a save mode that is apt for your use case.
# Options for save modes are "error" or "errorifexists" (default), "overwrite", "append", "ignore".
# refer to https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes
.mode("overwrite")
# Required parameter - Three-part table name to which data will be written
.synapsesql("sqlpool01.dbo.SampleTable"))
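The comment in the sample above mentions SQL basic authentication. For reference, here is a minimal sketch of that variant: `Constants.USER` and `Constants.PASSWORD` are the connector's options for SQL credentials, the placeholder values are assumptions, and in practice you would read the credentials from Azure Key Vault rather than hard-coding them. Depending on your storage setup, the connector may also require a pre-created external data source passed via `Constants.DATA_SOURCE`.
# Write using SQL basic auth to an internal table (sketch; replace placeholders before running)
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants
(df.write
# SQL basic auth credentials instead of the AAD pass-through used above
.option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
.option(Constants.USER, "<username>")
.option(Constants.PASSWORD, "<password>")
.option(Constants.TEMP_FOLDER, "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<some_base_path_for_temporary_staging_folders>")
.mode("overwrite")
.synapsesql("sqlpool01.dbo.SampleTable"))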
Here is the sample Synapse notebook with the code:
Fig 3 - Synapse Notebook with sample code
After creating the Synapse notebook, click Publish all to save your modifications.
Fig 4 - Publish all
Go to the storage account's Access control (IAM) -> click +Add -> Add role assignment -> select the Storage Blob Data Contributor role -> Assign access to -> select your Synapse workspace managed identity and also your logged-in account -> Review + assign.
Fig 5 - Storage account permissions
If you are trying to read from or write to the dedicated pool using a Synapse notebook and you are not the SQL Active Directory admin, the following SQL authorization is necessary on the Dedicated SQL Pool for the logged-in user account:
Read:
EXEC sp_addrolemember 'db_exporter', [<your_domain_user>@<your_domain_name>.com];
Write:
--Make sure your user has the permissions to CREATE tables in the [dbo] schema
GRANT CREATE TABLE TO [<your_domain_user>@<your_domain_name>.com];
--Make sure your user has ALTER permissions
GRANT ALTER ON SCHEMA::<target_database_schema_name> TO [<your_domain_user>@<your_domain_name>.com];
--Make sure your user has ADMINISTER DATABASE BULK OPERATIONS permissions
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [<your_domain_user>@<your_domain_name>.com];
--Make sure your user has SELECT permissions on the target table
GRANT SELECT ON <your_table> TO [<your_domain_user>@<your_domain_name>.com];
--Make sure your user has INSERT permissions on the target table
GRANT INSERT ON <your_table> TO [<your_domain_user>@<your_domain_name>.com];
--Make sure your user has DELETE permissions on the target table
GRANT DELETE ON <your_table> TO [<your_domain_user>@<your_domain_name>.com];
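Once these permissions are in place, a quick way to confirm the read path (the db_exporter role) is a small read cell. The sketch below assumes the same server and table placeholders used in the write sample:
# Read from the Dedicated SQL Pool using AAD auth (minimal permission check)
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants
df_read = (spark.read
# Dedicated SQL endpoint of the server hosting the pool
.option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
# Three-part table name to read from
.synapsesql("sqlpool01.dbo.SampleTable"))
display(df_read)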
Now, let's run the Synapse notebook to test whether the logged-in user account has the right permissions to perform all these actions.
When the notebook was executed, the error below was thrown:
Fig 6 - Error when notebook tries to access SQL Server
This error is pretty clear: the IP used by the Synapse workspace is not allowed to access the SQL server.
We need to allow Azure services and resources to access the SQL server.
To allow this configuration: go to your SQL Server -> Security -> Networking -> Exceptions -> check Allow Azure services and resources to access this server -> Save.
Fig 7 - Allowing Azure services and resources to access SQL Server
Let's re-run the code that loads data into Dedicated SQL Pool (formerly SQL DW).
Fig 8 - Error related to Storage permissions
Now we can see a different error thrown:
ERROR: 'Not able to validate external location because The remote server returned an error: (403) Forbidden.
This generic error means that we can now reach the SQL server, but the system-assigned managed identity that will be used to perform this operation may not be enabled.
To allow this configuration: go to your SQL Server -> Security -> Identity -> set Status to On -> Save.
Fig 9 - Allowing System assigned managed identity from SQL Server
Now when we run the write cell again, it works, because the logged-in user account has the Storage Blob Data Contributor role needed to perform all actions.
Fig 10 - Write on Dedicated SQL Pool succeeded
To add the notebook to a pipeline: go to your Synapse notebook -> on the upper right side, click Add to pipeline -> New pipeline.
Fig 11 - Adding Synapse Notebook to Pipeline
On the next screen:
1 - In Properties, choose a name for your pipeline.
2 - Publish your pipeline.
Fig 12 - Naming and publishing.
Now let's execute our pipeline to verify that all the configuration is complete.
Go to Integrate -> select the pipeline created in the last step -> click +Add trigger -> Trigger now -> OK.
Fig 13 - Triggering Pipeline
Go to Monitor -> Integration -> Pipeline runs.
As we can see in the screenshot below, we now get an error when executing from a Synapse pipeline.
Fig 14 - Monitoring Pipeline
Click on the error to see the complete message:
: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: Internal error - Attempt to run SQL Transaction failed with {ErrorNumber - 105215, ErrorMessage - Not able to validate external location because The remote server returned an error: (403) Forbidden.}
This error says that the system-assigned managed identity doesn't have permission to perform this operation.
It means that the system-assigned managed identity should be assigned the Storage Blob Data Contributor role on the Azure Storage account.
Usually, the name of the system-assigned managed identity is the same as the SQL Server name, as shown in the screenshot below.
Fig 15 - Checking system-assigned managed identity name
Go back to Step 2 and grant the same Storage permissions to the system-assigned managed identity.
Fig 16 - Checking system assigned managed identity Storage permissions
Let's trigger the pipeline again.
Now it works.
Fig 17 - Pipeline succeeded.
Fig 18 - Checking data on Dedicated SQL Pool table.
When using the Azure Synapse Dedicated SQL Pool Connector for Apache Spark, users can read and write large volumes of data efficiently between Apache Spark and a Dedicated SQL Pool in Synapse Analytics. The connector supports the Scala and Python languages on Synapse Notebooks to perform these operations.
If you are unable to complete the steps in this article, please reach out to us by logging a support request.
SIDNEY CIRQUEIRA
Synapse Analytics Support Engineer