In this blogpost, we will work with the below Unity catalog features in Azure Databricks.
Unity catalog provides bundle of features that help the administrator to manage the access policies on the data and helps data owner to manage/view the data schema, lineage. Unity catalog is a metastore that uses storage accounts to keep the metadata. The same catalog can be shared across different workspacsecure everywhere). By sharing the metadata across different Azure Databricks workspace, you can collaborate between different teams who work in their individual workspaces and store the metadata of the tables and schema in the centralized metastore.
What is stored inside a metastore?
Metastore is the top level container of objects that basically stores the metadata of the objects and the object level permissions for the Azure Databricks environment. Once the metastore is configured, we generally access the object in the form of a three-level naming convention (catalog.schema.table).
Each metastore requires a root storage location ( generally a file system in ADLS g2) to store the metadata and the actual data for the managed table.
We can have one-to-many mapping from top to bottom in the above diagram. For example, in a metastore, we can have multiple catalogs. In one catalog, we can have multiple schemas/databases. In one database/schema, we can have multiple tables.
External Table stores the physical data in external location (blob storage or ADLS g2). Not in the same root storage account that has been used to store the Unity Catalog metastore. We need to take some additional steps like creating storage credential and external location to register the external table in the metastore. When we apply the DROP statement, only the logical schema gets deleted, the physical data remain as-is. We might need to apply dbutils.fs.rm command to remove the physical files from the external storage location.
Managed Table stores the physical data in the same location where the unity catalog stores the metastore. If we don’t specify the location path while creating the table, it gets stored in the Unity catalog metastore location. When we apply the DROP SQL statement, the physical data also gets deleted from the storage account.
Here are some prerequisites that we need to consider using the Unity catalog in Azure Databricks.
Here are the steps that we need to perform to create our first Unity catalog metastore in the EAST US region.
From the Azure Portal create a ADLS g2 account, make sure that hierarchical namespace option is enabled.
Note down the URI of the storage account.
In our case the URI is abfss://email@example.com/FirstMetastore
Important to note that AAD Global admin has to login to the account console for the first time. Once AAD Global admin login, they will add any user or group as the account administrator, so that going forward the account console can be managed by the added user. The added user will become the Azure Databricks Account Administrator.
Let’s create our first catalog, and managed table.
spark.sql('''create catalog if not exists myfirstcatalog ''')
spark.sql('''create database if not exists myfirstcatalog.mytestDB''')
#read the sample data into dataframe
df_flight_data = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True)
create table if not exists myfirstcatalog.mytestDB.myFirstManagedTable
Select * from temp_tbl
As we can see from the below screenshot, the managed table got created in the metastore location.
Let’s consider that we have the data present in some external location like Azure Blob Storage or ADLS g2 account in hive or delta format. We would like to attach that table into our Unity Catalog metastore.
Important to note that Unity Catalog metastore might not have direct access to the external location. To establish that access, we need to create the storage credential and External Locations by providing the actual path. Unity Catalog will use the access connector managed identity (Access Connector for Azure Databricks) to access the external location. We need to provide the “Storage Blob Data Contributor” access to the access connector URI to the external storage location.
Creation of the storage credential
We are using the access connector URI (managed identity) to create the storage credential. We can also provide the service principal details as well.
Creation of the External Location
Before we create the external table, we need to map the physical data location with the storage credential that we created earlier.
Let’s create an external table.
spark.sql('''create catalog if not exists myfirstcatalog ''') spark.sql('''create database if not exists myfirstcatalog.mytestDB''') #read the sample data into dataframe df_flight_data = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True) #create the delta table to the mount point that we have created earlier dbutils.fs.rm("abfss://firstname.lastname@example.org/mytestDB/MyFirstExternalTable", recurse=True) df_flight_data.write.format("delta").mode("overwrite").save("abfss://email@example.com/mytestDB/MyFirstExternalTable") %sql create table if not exists myfirstcatalog.mytestDB.MyFirstExternalTable USING DELTA LOCATION 'abfss://firstname.lastname@example.org/mytestDB/MyFirstExternalTable'
Now we have created a new workspace called dbkworkspace2 in the same region eastus. From the Admin console we will attach the metastore with the new Azure Databricks workspace
We can see the tables & metadata even before creating clusters. This is possible because metastore data is present in the ADLS g2 which can be accessible without creating cluster. However, to see the physical data we need to create a cluster.
Here are the securable objects in Unity Catalog. The access level is inherited from the parent-level objects to the child-level objects. For example, if we provide read access to the catalog, it gets inherited till the last child in the hierarchy (i.e. Schema/ Managed Table/ View... etc.). We can manage privileges for metastore objects by using SQL commands, Unity Catalog CLI, or in Data Explorer.
IMPORTANT: Privileges that are given in the Unity Catalog metastore are not inherited. Privileges can be granted by the metastore admin.
We have a test user. We would like to give access to a specific table (myfirstcatalog.mytestdb.mytestflightdata).
While we can easily provide access using the Unity Catalog portal, here is example of providing access using the databricks sql.
Step 1: Lets create a Azure Databricks group that will contain all the users that is going to have the readonly permission to the table (myfirstcatalog.mytestdb.mytestflightdata). For that we need to navigate to the databricks account console group section. We need to then add the user to the group.
Step 2: Run the GRANT command in Azure Databricks. This should be run the by the metastore admin.
If we login using the Test User we will be able to see only the mytestflightdata table.
You can find more details about the different privileges type here.
We can capture the runtime data lineage across queries in any language executed on an Azure Databricks cluster. The lineage is being captured from the table level and column level. Lineage data includes notebooks, workflows, and dashboards related to the query.
Lineage graphs share the same permission model as Unity Catalog that was being discussed in the previous section. The user who does not have access to a specific table that won’t be shown in the Lineage graph.
Post executing the script, we can access the lineage from the Unity Catalog UI.
In Unity Catalog we can use dynamic views to configure the fine-grained access control in rows and columns. Also, we can mask the data depending on the permission.
The new in-built function is_account_group_member() returns TRUE if the current user is a member of a specific account-level group. Recommended for use in dynamic views against Unity Catalog data.
Column Level Permission
Example 1: The normal test user should not be able to see the delay column value, however the flight admin would be able to see it.
In Azure Databricks Account console, we have created a group flight_admin, we have added the user in this group.
The user is part of flight_admin group, hence able to see the delay data.
Now a non flight_admin user is not able to see the data.
Row Level Permission
Now we are going to extend our previous example. We have several origin specific groups, for example ABE_admin would only be able to see the data where Origin =’ABE’, and similarly ATL_admin group should only be able to see the data where origin=’ATL’. If any user is not part of any group, they won’t be able to see the data. The user from flight_admin group would be able to see all the data.
Flight_admin group members can see all the records.
ATL_admin group members can only see ATL related records.
That’s all for this blogpost. Hope this helps!
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.