Disaster Recovery Strategy in Azure Databricks using the Hive External Metastore (SQL DB)


 

In this article, we will discuss the disaster recovery approach to take with an Azure Databricks environment that uses an external Hive metastore. We will cover the background, what needs to be replicated, the recovery solution strategy, and the implementation steps.

 

 

Background 

A proper DR strategy and recovery pattern is critical for your Databricks environment. You need to weigh the appropriate tools, the costs, the replication of the data, the RPO (Recovery Point Objective), and the RTO (Recovery Time Objective). You have to make sure you don't bake too much complexity into your recovery process, and you need to ask yourself what you can live with and what you can live without. Unfortunately, there is no out-of-the-box or one-size-fits-all implementation that will solve everything for you, but there are ways of simplifying disaster recovery for Databricks. One of those ways is leveraging the simplicity of an external Hive metastore. In our previous article in the series, we focused on how to set up an external Hive metastore so you can share metadata across different Databricks workspaces. In this article, we will focus on how to apply the external Hive metastore in a DR scenario.

 

databricksdrteaser.png

 

There are two important industry terms that are related to the Disaster Recovery (DR) planning process:

  • RPO (Recovery Point Objective) is the maximum targeted period in which data or transactions might be lost from an IT service due to an outage.
  • RTO (Recovery Time Objective) is the targeted duration of time and a service level within which a business process must be restored after a disaster.

 

The goal of a good DR strategy is to minimize both RPO and RTO.

 

SamPanda_0-1669474625880.png

 

Also note:

  • An Azure Databricks environment does not store the actual data. The actual data is stored in a storage account (Azure Blob Storage or Azure Data Lake Gen2).
  • The Azure Databricks control plane stores notebooks and jobs.

 

What needs to be replicated?

 

In a DR environment, we should mainly focus on replicating the following:

  • Notebooks and Jobs from the control plane.
  • The actual data.
  • Logical schema of the Hive and Delta Tables (metastore).
  • Network configurations.

 

SamPanda_1-1669474625901.png

Reference

 

Recovery Solution Strategy

 

In general, a DR solution involves two Databricks workspaces. An active-passive solution is the most common and the easiest to implement. In the active-passive solution, the primary Databricks workspace is in read/write mode, while the secondary Databricks workspace is in passive mode (optionally, it can be in read-only mode as well). During a disaster recovery event, the passive deployment in the secondary region becomes your active deployment. In this blog post, we will concentrate on an active-passive deployment where one Databricks workspace is in the primary region (East US) and the other is in the secondary region (West US).

 

Implementation Steps

 

Let's identify the tooling required to keep the primary and secondary regions in sync. Referring to the section above on what needs to be replicated, we will discuss each item in detail.

(Note: We are not going to cover the CI/CD approaches in this blog post; we will focus on the sync of data and metadata across regions. The standard CI/CD approach with Azure Databricks can be found here.)

 

Notebook and Job synchronization from the Control Plane:  

 

We need to use CI/CD pipelines to sync the notebooks between the primary region and the secondary region. The best way is to have two tasks for the same operations in the deployment pipeline: one for the primary region and one for the secondary region.

(More information regarding the CI/CD process in DR can be found here.)

 

Let’s discuss a few of the important steps that are required in the deployment pipeline:

 

  1. Source code/notebook deployment - the databricks-cli workspace command can be used to deploy the notebooks to both workspaces.
  2. Job configurations - save the job definition as JSON in git. Using the databricks-cli jobs command, we can deploy it to both regions. For the secondary deployment, deploy the job with the concurrency set to zero in the job definition ("max_concurrent_runs": 0). This is required because the secondary region is in passive mode and should not run any jobs unless a disaster recovery event happens. If you are using another orchestration tool such as Azure Data Factory or an Azure Synapse pipeline, this step might not be needed; note, however, that the trigger must be disabled in the secondary orchestration tool. (A deployment sketch follows this list.)
  3. Cluster configuration - save the cluster configuration JSON template in git and deploy it using the databricks-cli clusters command. It must be deployed in both the primary and secondary regions. Remember to terminate the cluster in the secondary region.
  4. Databricks secrets - create an Azure Key Vault-backed Databricks secret scope in both the primary and secondary regions, containing the secrets the Databricks notebooks need to execute. The databricks-cli secrets command can be used to automate this from the CI/CD pipeline.
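As a hedged sketch of step 2, here is one way to deploy the same job definition to both workspaces while keeping the secondary copy passive. The post uses the databricks-cli jobs command; this example uses the equivalent Jobs 2.1 REST API from Python, and the workspace URLs, tokens, and job JSON path are placeholders, not values from the post.

import json
import requests

# Workspace URLs and personal access tokens are placeholders.
WORKSPACES = {
    "primary":   {"url": "https://<primary-workspace>.azuredatabricks.net",   "token": "<primary-pat>"},
    "secondary": {"url": "https://<secondary-workspace>.azuredatabricks.net", "token": "<secondary-pat>"},
}

# Job definition stored as JSON in git (hypothetical path).
with open("jobs/etl_job.json") as f:
    job_settings = json.load(f)

for region, ws in WORKSPACES.items():
    settings = dict(job_settings)
    if region == "secondary":
        settings["max_concurrent_runs"] = 0  # passive region: do not run jobs (step 2 above)
    resp = requests.post(
        f"{ws['url']}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {ws['token']}"},
        json=settings,
    )
    resp.raise_for_status()
    print(region, resp.json())  # returns the job_id created in each workspace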

Here are some other steps you might consider automating depending on the environment you have.

                        

SamPanda_2-1669474625911.png

 

Data Replication

 

The actual data in Databricks is stored in either Azure Blob Storage or Azure Data Lake. In Databricks, if we save the data in Delta format or as a Hive table, the physical schema of the data is also stored along with the actual data. We can replicate the data to a different region or geography by choosing the right redundancy option. For example:

 

  • Geo-Redundant Storage (GRS) or Geo-Zone-Redundant Storage (GZRS) copies the data asynchronously across two geographic regions. If the primary region suffers an outage, we can fail over to the secondary region. The storage endpoint remains the same, so we do not need to change any endpoint URLs from Databricks.
  • Read-Access Geo-Redundant Storage (RA-GRS) or Read-Access Geo-Zone-Redundant Storage (RA-GZRS) provides geo-redundant storage with the additional benefit of read access to the secondary endpoint. If an outage occurs at the primary endpoint, applications configured for read access to the secondary endpoint, and designed for high availability, can continue to read from the secondary endpoint. Microsoft recommends RA-GZRS for maximum availability and durability for your applications. (A sketch for checking the replication lag follows this list.)

(More information related to how the failover works can be found here.)
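Because geo-replication is asynchronous, the secondary copy can lag behind the primary, and the Last Sync Time tells you how much data you could lose (your effective RPO). A minimal sketch, assuming the azure-storage-blob v12 SDK, RA-GRS/RA-GZRS enabled, and placeholder account details; the call must target the secondary endpoint:

from azure.storage.blob import BlobServiceClient

account = "<storage-account-name>"  # placeholder
client = BlobServiceClient(
    account_url=f"https://{account}-secondary.blob.core.windows.net",
    credential="<account-key>",     # placeholder
)

stats = client.get_service_stats()               # replication stats, served by the secondary endpoint
geo = stats["geo_replication"]
print("Replication status:", geo["status"])      # e.g. 'live'
print("Last sync time:", geo["last_sync_time"])  # writes before this time are safe on the secondary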

 

Metastore Replication

 

Databricks has several metastore options. By default, Azure Databricks provides a managed Hive metastore for each workspace. We can use Unity Catalog to store the metadata, or we can use an external Hive metastore. An external Hive metastore allows for additional replication by leveraging cloud-based replication technology. For example, we can use Azure SQL Database geo-replication to replicate the metastore database from the primary region to the secondary region. Once the metastore and the data are both replicated to the secondary region, we can attach them to a Databricks workspace there to create a DR site.

 

Implementation

 

Let's implement a basic DR strategy to replicate the data and metadata from the primary region to the secondary region. We will then walk through the steps required when a disaster recovery event occurs.

 

High-level architecture

 

 

SamPanda_3-1669474625919.png

 

databricksdr.png

 

Resources to be provisioned

 

In the primary region – East US

  • Azure Databricks workspace. (Standard or Premium Pricing tier).
  • Azure Storage account/ Azure Data Lake Gen2.
    1. Hierarchical namespace is enabled.
    2. RA-GRS replication is enabled.

We are going to use it to store our actual data.

  • Azure SQL Server and a blank database.

In the secondary region – West US

  • Azure Databricks workspace.
  • Azure SQL Server without a database.

 

Setting up the external Hive metastore

 

Please follow this blog post to find the steps for setting up the external Hive metastore. We only need to set it up for one workspace, which will be treated as the primary here. We will set up the secondary workspace later in this blog post.

Once the external Hive metastore is configured as per the blog post, we will have a Delta table created just for our testing. We can find the logical schema in the Azure SQL database, and the physical schema and actual data in the storage account.

 

Azure Databricks Workspace

 

SamPanda_4-1669474625925.png

 

External Hive metastore (Azure SQL DB.)

 

SamPanda_5-1669474625928.png

 

Azure Data Lake Gen2 (Storage Account)

 

SamPanda_6-1669474625932.png

 

Setting up the data replication

 

Since we selected the RA-GRS replication type while creating the storage account, the data is already being replicated to the secondary region. The secondary region's data can be accessed by changing the connection string: we append "-secondary" to the storage account name ("<storage-account-name>-secondary"). The secondary region's data is read-only.
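As an illustration, the replicated data can be read directly from the secondary endpoint before any failover. This is a minimal sketch assuming account-key authentication, placeholder container and path names, and a secret scope name that is not from the post; later in the demo the same data is exposed through a mount point instead.

account = "<storage-account-name>"  # placeholder

# Account key for the secondary endpoint, pulled from a (hypothetical) secret scope.
spark.conf.set(
    f"fs.azure.account.key.{account}-secondary.dfs.core.windows.net",
    dbutils.secrets.get(scope="dr-scope", key="storage-account-key"),
)

# Read the Delta files from the read-only secondary endpoint.
df = spark.read.format("delta").load(
    f"abfss://<container>@{account}-secondary.dfs.core.windows.net/<path-to-delta-table>"
)
display(df)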

 

In case of a disaster recovery event where the primary storage is unavailable, a failover happens (Microsoft-managed or customer-managed failover). Post failover, the secondary region becomes the read/write endpoint for the application. There is no change in the storage endpoint URL post failover, hence no configuration changes are needed on the Databricks side.

(More information related to storage disaster recovery can be found here.)

 

SamPanda_7-1669474625941.png

 

Setting up metastore replication

 

We are going to use active geo-replication in Azure SQL Database to create a read-only secondary database. Active geo-replication is a feature that lets you create a continuously synchronized, readable secondary database for a primary database.
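Active geo-replication can be configured from the Azure portal, with ARM/Bicep, or programmatically. The following is a hedged sketch using the azure-mgmt-sql Python SDK; the subscription, resource group, server, and database names are placeholders, and the exact SDK surface may differ between package versions. Creating a database on the secondary logical server with create mode "Secondary" establishes the geo-replication link to the primary metastore database.

from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import Database

sub_id = "<subscription-id>"
client = SqlManagementClient(DefaultAzureCredential(), sub_id)

# Resource ID of the primary Hive metastore database (placeholders).
primary_db_id = (
    f"/subscriptions/{sub_id}/resourceGroups/<primary-rg>"
    "/providers/Microsoft.Sql/servers/<primary-sql-server>/databases/<metastore-db>"
)

poller = client.databases.begin_create_or_update(
    resource_group_name="<secondary-rg>",
    server_name="<secondary-sql-server>",
    database_name="<metastore-db>",
    parameters=Database(
        location="westus",            # secondary region
        create_mode="Secondary",      # readable geo-secondary of the primary database
        source_database_id=primary_db_id,
    ),
)
print(poller.result().status)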

 

SamPanda_8-1669474625947.png

 

Setting up the secondary Databricks workspace

 

After creating the secondary workspace from the Azure portal, we create a cluster with the same configuration as the primary region's workspace. We then edit the configuration so that it points to our secondary region's storage and to the secondary Azure SQL server for the Hive external metastore.

 

It is important to note that the storage endpoint is the same for the primary and secondary regions, since the storage endpoint URL remains the same post failover. Just for demo purposes, we will point to the read-only replica of the storage to check the DR site's connectivity.

Also, if you are using an auto-failover group for the Azure SQL DB replication, there is no need to change the connection string; it remains the same post failover. In our demo we are using active geo-replication, hence we need to provide the secondary region's Azure SQL server/database connection string.

 

Spark configuration
  • spark.hadoop.javax.jdo.option.ConnectionUserName <sql user name>
  • spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<secondary region azure sql server name>.database.windows.net:1433;database=< azure sql db name>;encrypt=true;trustServerCertificate=false;loginTimeout=30;
  • spark.hadoop.javax.jdo.option.ConnectionPassword <azure sql password>
  • spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
  • spark.sql.hive.metastore.jars builtin
  • spark.sql.hive.metastore.version 2.3.7
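Once a cluster starts with this configuration, the secondary workspace reads the logical schema from the geo-replicated metastore. A quick sanity check from a notebook (the database name is an assumption; the demo table may live elsewhere):

# The databases and tables defined from the primary workspace should already be
# visible here, because the metadata comes from the replicated Azure SQL DB.
display(spark.sql("SHOW DATABASES"))
display(spark.sql("SHOW TABLES IN default"))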
 
Creating the Mount Point

 

Since we are going to connect to the read-only replica of the storage account, we will create the mount point against the read-only replica, using the same mount point name that we used in the primary Databricks workspace.
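A minimal mount sketch, assuming service principal (OAuth) access, a hypothetical secret scope, and placeholder container, account, and mount names; the only difference from the primary workspace is the "-secondary" suffix in the source URL:

# Service principal credentials; the secret scope and key names are hypothetical.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="dr-scope", key="sp-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Same mount point name as the primary workspace, but pointing at the read-only replica.
dbutils.fs.mount(
    source="abfss://<container>@<storage-account-name>-secondary.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)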

 


 

SamPanda_9-1669474625953.png

 

Testing

From the Data section of the secondary workspace, we can read the data successfully.
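For example, a simple read against the replicated table succeeds from the secondary workspace (the database and table names here are hypothetical stand-ins for the Delta table created in the earlier post):

# Read through the replicated metastore; metadata comes from the geo-replicated
# Azure SQL DB, and the data comes from the read-only storage replica.
df = spark.sql("SELECT * FROM default.demo_delta_table")
display(df)

# The same files are also visible through the mount point created earlier.
display(dbutils.fs.ls("/mnt/datalake"))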

SamPanda_10-1669474625958.png

 

 

What happens if we try to change the schema of the table from the secondary workspace?

Since we are connected to the read-only storage replica, we get a request forbidden error (status code 403).
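For example, a schema change writes to the Delta transaction log in storage, so it is rejected by the read-only replica (table name hypothetical):

# Any write to the Delta log fails against the read-only secondary endpoint
# with a request forbidden error (status code 403).
spark.sql("ALTER TABLE default.demo_delta_table ADD COLUMNS (load_date DATE)")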

 

SamPanda_11-1669474625962.png

 

Final Thoughts

The purpose of a disaster recovery plan is to reduce damage or disruption and to recover as quickly as possible in the event of a disaster that leads to system failure. Using an external Hive metastore in Azure Databricks simplifies the disaster recovery process, with less downtime and maintenance overhead.

 

 
