Contents
Important steps for the configuration
Hive external metastore database
Every Databricks deployment comes with a managed, built-in Hive metastore. (If you aren't familiar, a Hive metastore is a database that holds metadata about our data, such as the paths to the data in the data lake and the format of the data: Parquet, Delta, CSV, and so on.) Instead of using the out-of-the-box Hive metastore, wouldn't it be great to have one central metastore and have multiple workspaces point to it? Some scenarios call for an external metastore that is shareable and independent of the workspace's existence. This article covers the end-to-end process of setting up an external metastore using Azure SQL DB and connecting multiple workspaces to it.
Scenario 1: Consider the scenario where we have several developer groups, and each wants to use its own Databricks Workspace for development. However, they also have a requirement to collaborate with each other. By default, Databricks uses the local built-in metastore in the DBFS file system to keep the logical schema of all the Delta and Hive tables. If each group starts registering databases and tables in its own workspace, the metadata ends up spread across multiple workspaces, which makes it harder for the teams to collaborate.
A solution to this is to create a Hive external metastore that different Databricks Workspaces can share, so that each workspace registers its databases and tables in the commonly shared metastore. We will detail the end-to-end process required to set this up in the following steps.
Scenario 2: Now let's paint the picture for Disaster Recovery. Imagine we have a requirement to create a Disaster Recovery site for the primary Databricks Workspace, with minimal downtime while failing over from the primary site to the Disaster Recovery site.
We could use an external Hive metastore database to store the metadata and make our lives easier. The moment disaster strikes, we can simply attach the replicated Hive metastore database in the secondary region to the secondary Databricks Workspace (the DR site). A huge advantage is that we won't have the overhead of running the DDL scripts again to recreate the databases and tables in the secondary site, because the external Hive metastore already holds the logical schema.
(Note: We will discuss the Disaster Recovery scenario above, and how to use an external Hive metastore with Azure Databricks for DR, in a later blogpost in this series.)
Reference link(s):
Data Redundancy in Azure Storage
Disaster Recovery for Azure SQL DB
Here are the steps we can follow to create the external Hive metastore. Remember that the config values depend on the Hive version we are using, and the Hive version depends on the Databricks runtime version.
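As a quick sanity check, you can print the Hive metastore client version a running cluster is configured with from a notebook. This is a minimal sketch; the property below is the standard Spark SQL setting:
# Print the Hive metastore client version this cluster uses; on recent
# Databricks runtimes with "builtin" jars this is typically 2.3.x.
print(spark.conf.get("spark.sql.hive.metastore.version", "not set"))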
Below we are going to create a Hive external metastore that will be shared by two Databricks Workspaces (Scenario 1 from the background section above). The following settings go into the cluster's Spark config:
spark.hadoop.javax.jdo.option.ConnectionUserName <sql user name>
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<azure sql server name>.database.windows.net:1433;database=<azure sql db name>;encrypt=true;trustServerCertificate=false;loginTimeout=30;
spark.hadoop.javax.jdo.option.ConnectionPassword <azure sql password>
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 2.3.7
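The password is shown inline above only to keep the walkthrough simple; in practice it is better to keep it in a Databricks secret scope and reference it from the cluster configuration rather than pasting it in plain text. Once a cluster starts with this configuration, a quick way to confirm that it is talking to the external metastore is to list the databases it can see (on a freshly initialized metastore this typically returns only the default database):
# List the databases visible through the external Hive metastore.
spark.sql("show databases").show()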
Note: Before creating any tables, we first mount the ADLS Gen2 storage account where the Delta files will live, using a service principal for authentication. Replace the placeholder values below with your own.
application_ID = "<<provide-your-application-ID>>"
application_secret = "<<provide-your-app-secret-key>>"
directory_id = "<<provide-your-directory-ID>>"
storage_account_name = "<<storage-account-name>>"
file_system_name = "<<file-system-name>>"
# OAuth configuration for the service principal that has access to the ADLS Gen2 account
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": application_ID,
           "fs.azure.account.oauth2.client.secret": application_secret,
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + directory_id + "/oauth2/token"}

# mount the ADLS Gen2 file system under /mnt/deltalake
dbutils.fs.mount(
    source = "abfss://" + file_system_name + "@" + storage_account_name + ".dfs.core.windows.net/",
    mount_point = "/mnt/deltalake",
    extra_configs = configs)
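If the mount succeeds, the container is browsable under /mnt/deltalake. As an optional check (and note that in a real deployment the client secret would normally come from dbutils.secrets.get rather than being hard-coded in the notebook):
# List the root of the newly mounted container to confirm the mount worked.
display(dbutils.fs.ls("/mnt/deltalake"))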
spark.sql('''create database if not exists mytestDB''')
#read the sample data into dataframe
df_flight_data = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True)
#create the delta table to the mount point that we have created earlier
dbutils.fs.rm("/mnt/deltalake/mytestDB/flight_data", recurse=True)
df_flight_data.write.format("delta").mode("overwrite").save("/mnt/deltalake/mytestDB/flight_data")
spark.sql('''drop table if exists mytestDB.flight_data''')
spark.sql('''create table if not exists mytestDB.flight_data using delta location "/mnt/deltalake/mytestDB/flight_data"''')
In dbkworkspace1, we can see the newly registered table.
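Because dbkworkspace2 points at the same external metastore, the table is visible there as well, provided that workspace can also reach the underlying storage (for example, via the same mount). A minimal check, run from a notebook in dbkworkspace2:
# The table registered from dbkworkspace1 is listed here because both
# workspaces share the same external Hive metastore.
spark.sql("show tables in mytestDB").show()
print(spark.table("mytestDB.flight_data").count())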
Let's make a change to the table from dbkworkspace2; a small example follows.
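As an illustration only (the column names assume the departuredelays.csv header: date, delay, distance, origin, destination), the change could be as simple as deleting a slice of rows from the shared Delta table:
# Run in dbkworkspace2: delete a subset of rows from the shared Delta table.
spark.sql("delete from mytestDB.flight_data where origin = 'SEA'")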
In dbkworkspace1, we can see the changes.
Running the query below against the Azure SQL metastore database gives us the information about the tables that are registered in the Hive external metastore for a given database:
select
databases.name
,tables.tbl_name
,tables.tbl_type
,sds.location
from [dbo].[DBS] as databases
inner join [dbo].[TBLS] as tables on tables.db_id=databases.db_id
inner join [dbo].[SDS] as sds on sds.sd_id=tables.sd_id
where databases.name = 'mytestdb'
The Hive external metastore contains the pointer to the external storage.
Here is the pointer for the database: dbfs:/user/hive/warehouse/mytestdb.db
Now when we look at the table, we see the actual path in the data lake mapped to the table.
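The same locations can also be cross-checked from a Databricks notebook, without querying the Azure SQL database directly; a minimal sketch:
# Show the database location recorded in the metastore.
spark.sql("describe database mytestDB").show(truncate=False)
# Show the storage location of the Delta table (should be the /mnt/deltalake path).
spark.sql("describe detail mytestDB.flight_data").select("location").show(truncate=False)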
Whether you have a simple Databricks setup or a complicated one, every Azure Databricks workspace has a Hive metastore that is critical to your jobs, your collaboration, and your disaster recovery. Leveraging the setup above allows you to share tables across the business, let different workspaces register their data into one commonly shared metastore, and simplify your disaster recovery setup. We hope you picked up some new insights and that they come in handy. The only thing left is to get your hands dirty and have fun!