Azure Synapse Analytics Blog

4 MIN READ

Spark Table to Serverless Pool Metadata Sync

Former Employee

Feb 14, 2021

Synapse provides an exciting feature which allows you to sync Spark database objects to Serverless pools and to query these objects without the Spark pool being active or running. Synapse workspaces are accessed exclusively through an Azure AD Account and objects are created within this context in the Spark pool. In some scenarios I would like to share the data which I've created in my Spark database with other users for reporting or analysis purposes. This is possible with Serverless and in this article I will show you how to complete the required steps from creation of the object to successful execution.

In attempt to reproduce a real-world scenario where one Domain user creates a database in a Spark Pool with objects which are synced to Serverless Pool I made use of an admin user and Test user.

I started by creating a database and a table with parquet storage format in Spark pool Notebook, as per the example found here Shared metadata tables - Azure Synapse Analytics | Microsoft Docs

(Only Parquet Format tables will be synced to Serverless, no other format will be synchronized)

%%sql

Create database CharlSync

CREATE TABLE CharlSync.myParquetTable(id int, name string, birthdate date) USING Parquet

%%csharp

using Microsoft.Spark.Sql.Types;

var data = new List<GenericRow>();

data.Add(new GenericRow(new object[] { 1, "Charl", new Date(2010, 1, 1)}));

data.Add(new GenericRow(new object[] { 2, "Bob", new Date(2020, 1, 1)}));

data.Add(new GenericRow(new object[] { 3, "Bob", new Date(2021, 1, 1)}));

data.Add(new GenericRow(new object[] { 4, "Bob", new Date(2019, 1, 1)}));

var schema = new StructType

(new List<StructField>()

{

new StructField("id", new IntegerType()),

new StructField("name", new StringType()),

new StructField("birthdate", new DateType())

}

);

var df = spark.CreateDataFrame(data, schema);

df.Write().Mode(SaveMode.Append).InsertInto("CharlSync.myParquetTable");

The data has been written to the table and I can perform a SQL select from the table in my Sparkpool

The table is visible in the and the object which I created is listed in the Spark database.

From SQL Serverless I can select from the metadata database without any problem as I am the owner and creator of the database and object, from a storage account perspective my admin account has been granted storage blob contributor access and therefore I am able to write to the storage account and access the data without failure.

Now let’s move on to SSMS or Azure Data Studio

In SSMS I open up an Azure AD connection and connect with my domain account , from the metadata database I run a select statement on the table

It returns invalid object name; this is not supposed to be the case.

I verify the object name in sys.objects and as you can see it is lowercase, the object is case sensitive in Serverless.

I change my object to the name as per the sys.objects table and it returns my data as expected from the metadata sync table.

I will now create an Azure AD User in my Serverless Pool and grant the user access in order to execute statements.

As per our documentation the access is based on the service principal permissions at a storage level, and if your user has access to the service it can execute statements, with server less all access is governed by storage account level permissions and through database or servers scoped credentials.

I therefore do not have to assign any additional permissions or role access within the server itself.

The test user has been created in the sqlodtest database and I connect to the database and attempt to access the table, which fails. The reason for the failure is that the user account has not been granted access to the default storage account and subsequent container and folders.

The Domain user account requires access to the folder which has been created in the default storage account, a folder per table is created as follows: /<containername>/synapse/workspaces/<workspacename>/warehouse/<sparkdatabasename>.db/<tablename>

As the user execution context is being used to connect to the storage account, I grant the user account Storage blob reader RBAC Permissions on the container and underlying folders.

As per the published documentation on Workspace permissions I only have to grant the User Permissions in my On-Demand Database.

Understand the roles required to perform common tasks in Synapse - Azure Synapse Analytics | Microsoft Docs

Within the workspace under Access Control I validate that my test account does not have any workspace level permissions, the highlighted account below is my admin account.

Once you have completed all of the above-mentioned actions and created the User account in the Serverless database and granted the account folder level RBAC permissions, It is important that you disconnect from the Serverless session and re-connect.

I connect to my Serverless database which I created earlier and to which I granted my Test user access, I am then able to execute the statement against the “Spark replicated database and table and return data.