In this article we will discuss the disaster recovery approach to take with an Azure Databricks environment that uses an external Hive metastore. We will be covering the following items:
A proper DR strategy and recovery pattern is critical for your Databricks environment. You need to weigh the appropriate tools, the costs, the replication of the data, the RPO (Recovery Point Objective), and the RTO (Recovery Time Objective). Make sure you don’t bake too much complexity into your recovery process, and ask yourself what you can live with and what you can live without. Unfortunately, there is no out-of-the-box, one-size-fits-all implementation that will solve everything for you, but there are ways of simplifying disaster recovery for Databricks. One of those ways is to use an external Hive metastore. In our previous article in the series, we focused on how to set up an external Hive metastore so you can share metadata across different Databricks workspaces. In this article, we will focus on how to apply the external Hive metastore in a DR scenario.
There are two important industry terms that are related to the Disaster Recovery (DR) planning process:
The goal of a good DR strategy is to keep both the RPO and the RTO as low as possible.
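To make the two terms concrete, here is a minimal sketch that measures the data-loss window and the downtime of a single incident and compares them with the targets. The function names, timestamps, and target values are hypothetical and only serve as an illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_at: datetime, outage_started_at: datetime) -> timedelta:
    """Data-loss window: anything written after the last successful replication is lost."""
    return max(outage_started_at - last_replicated_at, timedelta(0))


def achieved_rto(outage_started_at: datetime, service_restored_at: datetime) -> timedelta:
    """Downtime: how long the workload was unavailable."""
    return service_restored_at - outage_started_at


def meets_objectives(data_loss: timedelta, downtime: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only when both the RPO and the RTO targets are met."""
    return data_loss <= rpo_target and downtime <= rto_target


# Hypothetical incident: last replication at 09:50, outage at 10:00, DR site live at 10:45.
last_sync = datetime(2023, 1, 1, 9, 50)
outage = datetime(2023, 1, 1, 10, 0)
restored = datetime(2023, 1, 1, 10, 45)

data_loss = achieved_rpo(last_sync, outage)
downtime = achieved_rto(outage, restored)
ok = meets_objectives(data_loss, downtime,
                      rpo_target=timedelta(minutes=15),
                      rto_target=timedelta(hours=1))
```

Tracking these two numbers during DR drills is one way to check whether the chosen replication options are good enough for the workload.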
In a DR environment, we should mainly focus on replicating the following:
In general, a DR solution involves two Databricks workspaces. An active-passive solution is the most common and the easiest to implement. In the active-passive solution, the primary Databricks workspace is in read/write mode, while the secondary Databricks workspace is in passive mode (optionally, it can be in read-only mode). During a disaster recovery event, the passive deployment in the secondary region becomes your active deployment. In this blog post, we will concentrate on an active-passive deployment where one Databricks workspace is in the primary region (East US) and the other Databricks workspace is in the secondary region (West US).
Let’s identify the tooling that is required to synchronize the primary and secondary regions. We will go back to the previous section, which listed what needs to be replicated, and discuss each option in detail.
(Note: We are not going to cover CI/CD approaches in this blog post; we will focus on syncing data and metadata across regions. The standard CI/CD approach with Azure Databricks can be found here.)
We need to use CI/CD pipelines to sync the notebooks between the primary region and the secondary region. The best way is to have two tasks for the same operation in the deployment pipeline: one for the primary region and one for the secondary region.
(More information regarding the CI/CD process in DR can be found here.)
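The "two tasks for the same operation" idea can be sketched as follows. This is only an illustration, not a full pipeline: the workspace URLs and notebook path are hypothetical, and the code only builds the request payloads for the Databricks Workspace import endpoint (/api/2.0/workspace/import) without sending them, so authentication and error handling are left out.

```python
import base64
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Workspace:
    region: str
    host: str  # hypothetical workspace URL


def build_import_request(ws: Workspace, notebook_path: str, source: str) -> dict:
    """Build (but do not send) one notebook import request for one workspace."""
    return {
        "region": ws.region,
        "url": f"{ws.host}/api/2.0/workspace/import",
        "json": {
            "path": notebook_path,
            "format": "SOURCE",
            "language": "PYTHON",
            # The import endpoint expects base64-encoded notebook content.
            "content": base64.b64encode(source.encode("utf-8")).decode("ascii"),
            "overwrite": True,
        },
    }


def plan_deployment(workspaces: List[Workspace], notebooks: Dict[str, str]) -> List[dict]:
    """Same operation, one task per region: every notebook goes to every workspace."""
    return [
        build_import_request(ws, path, source)
        for ws in workspaces
        for path, source in notebooks.items()
    ]


# Hypothetical workspaces and notebook.
workspaces = [
    Workspace("eastus", "https://adb-primary.example.azuredatabricks.net"),
    Workspace("westus", "https://adb-secondary.example.azuredatabricks.net"),
]
notebooks = {"/Shared/etl/load_sales": "print('hello from the DR demo')"}
plan = plan_deployment(workspaces, notebooks)
```

Because both regions are fed from the same plan, the secondary workspace does not drift from the primary between releases.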
Let’s discuss a few of the important steps that are required in the deployment pipeline:
Here are some other steps you might consider automating depending on the environment you have.
The actual data in Databricks is stored in either Azure Blob Storage or Azure Data Lake Storage. In Databricks, if we save the data in Delta format or as a Hive table, the physical schema of the data is stored along with the actual data. We can replicate the data into different regions or geographies by choosing the right redundancy option. For example:
(More information related to how the failover works can be found here.)
Databricks has several metastore options. By default, each Databricks workspace uses a built-in Hive metastore that is managed by Databricks. We can use Unity Catalog to store the metadata. We can also use an external Hive metastore to store the metadata. External Hive metastores allow for additional replication by leveraging cloud-based replication technology. For example, we can use Azure SQL DB geo-replication to replicate the database from the primary region to the secondary region. Once the metastore is available in the secondary region, and the data is replicated to the secondary region, we can attach both to a Databricks workspace to create a DR site.
Let’s implement a basic DR strategy to replicate the data and metadata from the primary region to the secondary region. We will also walk through the steps to perform when a disaster recovery event occurs.
In the primary region – East US
We are going to use it to store our actual data.
In the secondary region – West US
Please follow this blog post to find the steps for setting up the external Hive metastore. It only needs to be set up for one workspace, which is treated as the primary here. We are going to set up the secondary workspace in this blog post.
Once the external Hive metastore is configured as per the blog post, we create a Delta table just for testing. The logical schema is stored in the Azure SQL database, and the physical schema and the actual data are stored in the storage account.
Azure Databricks Workspace
External Hive metastore (Azure SQL DB)
Azure Data Lake Gen2 (Storage Account)
Since we selected the RA-GRS replication type while creating the storage account, the data is already being replicated to the secondary region. The secondary region data can be accessed by changing the connection string: we need to append “-secondary” to the storage account name, as in “<storage-account-name>-secondary”. Secondary region data is read-only.
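As a small illustration of the “-secondary” suffix, the sketch below derives the primary and secondary endpoint URLs from a storage account name. The account name is hypothetical, and the helper is only a convenience for this post, not part of any Azure SDK.

```python
def storage_endpoint(account: str, service: str = "dfs", secondary: bool = False) -> str:
    """Return the primary or the read-only secondary endpoint of a storage account.

    service is "dfs" for Data Lake Gen2 or "blob" for Blob Storage.
    """
    if service not in {"blob", "dfs"}:
        raise ValueError(f"unsupported service: {service}")
    # RA-GRS exposes the read-only replica under "<account>-secondary".
    host_account = f"{account}-secondary" if secondary else account
    return f"https://{host_account}.{service}.core.windows.net"


# Hypothetical storage account name.
primary_url = storage_endpoint("mydrstorage")
secondary_url = storage_endpoint("mydrstorage", secondary=True)
```

Only the host name changes between the two URLs; the containers and paths under them are the same.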
In case of a disaster recovery event when the primary storage is unavailable, failover happens (Microsoft-managed failover/customer-managed failover). Post failover the secondary region becomes the read/write endpoint for the application. There is no change in the storage endpoint URL post failover, hence no configuration changes are needed from the Databricks side.
(More information related to storage disaster recovery can be found here.)
We are going to use active geo-replication in Azure SQL Database to maintain a read-only secondary database. Active geo-replication is a feature that lets you create a continuously synchronized readable secondary database for a primary database.
After creating the secondary workspace from the Azure portal, we create a cluster with the same configuration as the cluster in the primary region’s workspace. We then edit the configuration so that it points to the secondary region storage and the secondary Azure SQL server for the external Hive metastore.
It is important to note that the storage endpoint should be the same for the primary and secondary regions, because the storage endpoint URL remains the same after failover. For demo purposes only, we will point to the read-only replica of the storage to check the DR site’s connectivity.
Also, if you are using an auto-failover group for the Azure SQL DB replication, there is no need to change the connection string, because it remains the same after failover. In our demo, we are using active geo-replication, so we need to provide the connection string of the Azure SQL server/database in the secondary region.
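The cluster-level difference between the two workspaces can be sketched as a pair of Spark configurations that differ only in the metastore JDBC URL. This is a sketch under assumptions: the server names, database name, user, and secret scope are hypothetical, and the Hive version and jars settings are placeholders that must match whatever the primary cluster already uses.

```python
from typing import Dict

JDBC_DRIVER = "com.microsoft.sqlserver.jdbc.SQLServerDriver"


def metastore_spark_conf(sql_host: str, database: str, user: str, password_ref: str,
                         hive_version: str, hive_jars: str) -> Dict[str, str]:
    """Spark config entries that point a cluster at an external Hive metastore."""
    jdbc_url = f"jdbc:sqlserver://{sql_host}:1433;database={database};encrypt=true"
    return {
        "spark.hadoop.javax.jdo.option.ConnectionURL": jdbc_url,
        "spark.hadoop.javax.jdo.option.ConnectionUserName": user,
        "spark.hadoop.javax.jdo.option.ConnectionPassword": password_ref,
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": JDBC_DRIVER,
        "spark.sql.hive.metastore.version": hive_version,
        "spark.sql.hive.metastore.jars": hive_jars,
    }


# Hypothetical values shared by both regions.
common = dict(
    database="hivemetastore",
    user="metastore_user",
    password_ref="{{secrets/dr-scope/metastore-password}}",  # secret reference, not a literal password
    hive_version="2.3.9",  # assumption: use the version configured on the primary cluster
    hive_jars="builtin",   # assumption: use the jars setting configured on the primary cluster
)

primary_conf = metastore_spark_conf("sql-primary-eastus.database.windows.net", **common)
secondary_conf = metastore_spark_conf("sql-secondary-westus.database.windows.net", **common)

# With active geo-replication, only the JDBC URL differs between the two clusters.
changed_keys = sorted(k for k in primary_conf if primary_conf[k] != secondary_conf[k])
```

With an auto-failover group, both clusters would instead use the same listener host name, and `changed_keys` would be empty.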
Since we are going to connect to the read-only replica of the storage account, we will create a mount point that points to the read-only replica, using the same mount point name that we used in our primary Databricks workspace.
In the data section, we can read the data successfully on the secondary site.
What happens if we try to change the schema of the table from the secondary workspace?
Since we are connected to the read-only storage replica, we will get a request forbidden error (status code: 403).
The purpose of a disaster recovery plan is to reduce damage or disruption and recover as quickly as possible in the event of a disaster that leads to system failure. Using an external Hive metastore in Azure Databricks simplifies the whole disaster recovery process, with less downtime and less maintenance overhead.