Authors: Amudha Palani, Eric Kwashie, Peter Lo, and Rafia Aqil
Disaster recovery (DR) is a critical component of any cloud-native data analytics platform, ensuring business continuity even during rare regional outages caused by natural disasters, infrastructure failures, or other disruptions.
Identify Business Critical Workloads
Before designing any disaster recovery strategy, organizations must first identify which workloads are truly business‑critical and require regional redundancy. Not all Databricks or Fabric processes need full DR protection; instead, customers should evaluate the operational impact of downtime, data freshness requirements, regulatory obligations, SLAs, and dependencies across upstream and downstream systems. By classifying workloads into tiers and aligning DR investments accordingly, customers ensure they protect what matters most without over‑engineering the platform.
Azure Databricks
Azure Databricks requires a customer‑driven approach to disaster recovery, where organizations are responsible for replicating workspaces, data, infrastructure components, and security configurations across regions.
Full System Failover (Active-Passive) Strategy
This comprehensive approach replicates all dependent services to the secondary region. Implementation requirements include:
Infrastructure Components:
- Replicate Azure services (ADLS, Key Vault, SQL databases) using Terraform
- Deploy network infrastructure (subnets) in the secondary region
- Establish data synchronization mechanisms
Data Replication Strategy:
- Use Deep Clone for Delta tables rather than geo-redundant storage
- Implement periodic synchronization jobs using Delta's incremental replication
- Measure and validate data transfer results using Delta time travel syntax (see the sketch after this list)
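To make this pattern concrete, here is a minimal sketch of a single sync step run from a Python notebook cell. It deep-clones one Delta table to a secondary-region path, then inspects the clone's history and time travel versions to measure what the last sync transferred. The table name and abfss path are illustrative placeholders, not part of any reference implementation.

```python
# Minimal sketch of a Deep Clone sync step, run from a Databricks notebook
# that can reach both the source table and the secondary-region storage path.
# All names below (table, abfss path) are illustrative placeholders.

SOURCE_TABLE = "main.sales.orders"  # hypothetical source Delta table
TARGET_PATH = "abfss://dr@drstorageaccount.dfs.core.windows.net/backups/orders"  # hypothetical DR path

# Incremental replication: re-running the clone only copies files that changed
# since the previous run.
spark.sql(f"""
    CREATE OR REPLACE TABLE delta.`{TARGET_PATH}`
    DEEP CLONE {SOURCE_TABLE}
""")

# Inspect the latest CLONE operation on the target; its operationMetrics report
# how much data the sync actually copied.
last_op = (spark.sql(f"DESCRIBE HISTORY delta.`{TARGET_PATH}` LIMIT 1")
           .select("timestamp", "operation", "operationMetrics")
           .first())
print(last_op)

# Time travel lets you compare the clone against an earlier version.
prev_count = spark.sql(f"SELECT COUNT(*) FROM delta.`{TARGET_PATH}` VERSION AS OF 0").first()[0]
curr_count = spark.sql(f"SELECT COUNT(*) FROM delta.`{TARGET_PATH}`").first()[0]
print(f"rows at version 0: {prev_count}, rows now: {curr_count}")
```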
Workspace Asset Synchronization:
- Co-deploy cluster configurations, notebooks, jobs, and permissions using CI/CD
- Use Terraform and SCIM for identity and access management
- Keep job concurrency at zero in the secondary region to prevent execution (a sketch follows this list)
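The "concurrency at zero" step can be automated after each deployment to the secondary workspace. The sketch below assumes the databricks-sdk Python package and a CLI profile named "secondary" (both assumptions for illustration) and sets max_concurrent_runs to 0 on every job, a value the Jobs service treats as "skip new runs".

```python
# Sketch: keep the secondary workspace cold by setting max_concurrent_runs to 0
# on all deployed jobs. Assumes `pip install databricks-sdk` and a CLI profile
# named "secondary" that points at the DR workspace (illustrative assumption).
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobSettings

w = WorkspaceClient(profile="secondary")

for job in w.jobs.list():
    # With max_concurrent_runs set to 0, jobs stay deployed and versioned in the
    # secondary region but never start runs until failover.
    w.jobs.update(
        job_id=job.job_id,
        new_settings=JobSettings(max_concurrent_runs=0),
    )
    print(f"Job {job.job_id}: max_concurrent_runs set to 0")
```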
Fully Redundant (Active-Active) Strategy
This is the most sophisticated approach: all transactions are processed in multiple regions simultaneously. While providing maximum resilience, this strategy:
- Requires complex data synchronization between regions
- Incurs the highest operational costs due to duplicate processing
- Is typically needed only for mission-critical workloads with zero tolerance for downtime
- Can be implemented as partial active-active, processing most workloads in the primary region with a subset in the secondary
Enabling Disaster Recovery
- Create a secondary workspace in a paired region.
- Use CI/CD to keep workspace assets synchronized continuously, as summarized in the table below.
| Requirement | Approach | Tools |
| --- | --- | --- |
| Cluster Configurations | Co-deploy to both regions as code | Terraform |
| Code (Notebooks, Libraries, SQL) | Co-deploy with CI/CD pipelines | Git, Azure DevOps, GitHub Actions |
| Jobs | Co-deploy with CI/CD, set concurrency to zero in secondary | Databricks Asset Bundles, Terraform |
| Permissions (Users, Groups, ACLs) | Use IdP/SCIM and infrastructure as code | Terraform, SCIM |
| Secrets | Co-deploy using secret management | Terraform, Azure Key Vault |
| Table Metadata | Co-deploy with CI/CD workflows | Git, Terraform |
| Cloud Services (ADLS, Network) | Co-deploy infrastructure | Terraform |
- Update your orchestrator (ADF, Fabric pipelines, etc.) to include a simple region toggle that reroutes job execution (see the sketch after this list).
- Replicate all dependent services (Key Vault, storage accounts, SQL Database).
- Implement Delta "Deep Clone" synchronization jobs to keep datasets continuously aligned between regions.
- Introduce an application-level "Sync Tool" that redirects:
  - data ingestion
  - compute execution
- Enable parallel processing in both regions for selected or all workloads.
- Use bi-directional synchronization for Delta data to maintain consistency across regions.
- For performance and cost control, run most workloads in the primary region and only a subset in the secondary to keep it warm.
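One lightweight way to implement the region toggle mentioned above is a small configuration-driven helper that the orchestrator (or any submission script) consults before it calls a workspace. Everything below (the JSON config file, region keys, and workspace URLs) is an illustrative assumption rather than a product feature; in ADF or Fabric pipelines the same idea can be expressed as a pipeline or global parameter holding the active workspace URL.

```python
# Sketch of a simple region toggle for rerouting job execution.
# The config file name, region keys, and workspace URLs are placeholders.
import json

CONFIG_PATH = "dr_config.json"  # e.g. {"active_region": "primary"}

WORKSPACES = {
    "primary":   "https://adb-1111111111111111.11.azuredatabricks.net",
    "secondary": "https://adb-2222222222222222.22.azuredatabricks.net",
}

def active_workspace_url() -> str:
    """Return the Databricks workspace URL for the currently active region."""
    with open(CONFIG_PATH) as f:
        active = json.load(f).get("active_region", "primary")
    return WORKSPACES[active]

def failover_to(region: str) -> None:
    """Flip the toggle; the orchestrator picks up the new region on its next run."""
    if region not in WORKSPACES:
        raise ValueError(f"Unknown region: {region}")
    with open(CONFIG_PATH, "w") as f:
        json.dump({"active_region": region}, f)

if __name__ == "__main__":
    print("Jobs will be routed to:", active_workspace_url())
```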
Implement Three-Pillar DR Design
- Primary Workspace: Your production Databricks environment running normal operations
- Secondary Workspace: A standby Databricks workspace in a different (paired) Azure region that remains ready to take over if the primary fails. This architecture ensures business continuity while optimizing costs by keeping the secondary workspace dormant until needed.
The DR solution is built on three fundamental pillars that work together to provide comprehensive protection:
1. Infrastructure Provisioning (Terraform)
The infrastructure layer creates and manages all Azure resources required for disaster recovery using Infrastructure as Code (Terraform).
What It Creates:
- Secondary Resource Group: A dedicated resource group in your paired DR region (e.g., if the primary is in East US, the secondary might be in West US 2)
- Secondary Databricks Workspace: A standby Databricks workspace with the same SKU as your primary, ready to receive failover traffic
- DR Storage Account: An ADLS Gen2 storage account that serves as the backup destination for your critical data
- Monitoring Infrastructure: An Azure Monitor Log Analytics workspace and alert action groups to track DR health
- Protection Locks: Management locks to prevent accidental deletion of critical DR resources

Key Design Principle: The Terraform configuration references your existing primary workspace without modifying it. It only creates new resources in the secondary region, ensuring your production environment remains untouched during setup.
2. Data Synchronization (Delta Notebooks)
The data synchronization layer ensures your critical data is continuously backed up to the secondary region.
How It Works: The solution uses a Databricks notebook that runs in your primary workspace on a scheduled basis. This notebook:
- Connects to Backup Storage: Uses Unity Catalog with Azure Managed Identity for secure, credential-free authentication to the secondary storage account
- Identifies Critical Tables: Reads from a configuration list you define (sales data, customer data, inventory, financial transactions, etc.)
- Performs Deep Clone: Uses Delta Lake's native CLONE functionality to create exact copies of your tables in the backup storage
- Tracks Sync Status: Logs each synchronization operation, tracks row counts, and reports on data freshness
Authentication Flow: The synchronization process leverages Unity Catalog's managed identity capabilities:
- An existing Access Connector for Unity Catalog is granted "Storage Blob Data Contributor" permissions on the backup storage.
- Storage credentials are created in Databricks that reference this Access Connector.
- The notebook uses these credentials transparently; no storage keys or secrets are required.
What Gets Synced: You define which tables are critical to your business operations. The notebook creates backup copies that include (see the notebook sketch after this list):
- Full table data and schema
- Table partitioning structure
- Delta transaction logs for point-in-time recovery
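Putting these pieces together, a minimal version of such a sync notebook might look like the sketch below. The critical-table list, backup root path, and status table are assumptions for illustration, not the solution's actual notebook.

```python
# Minimal sketch of a scheduled sync notebook, assuming Unity Catalog storage
# credentials (via the Access Connector) already grant access to the backup
# storage path. Table names, the backup root, and the status table below are
# illustrative placeholders.
from datetime import datetime, timezone

CRITICAL_TABLES = [              # tables you classify as business-critical
    "main.sales.orders",
    "main.crm.customers",
    "main.finance.transactions",
]
BACKUP_ROOT = "abfss://dr-backup@drstorage.dfs.core.windows.net/delta"  # hypothetical

sync_log = []
for table in CRITICAL_TABLES:
    target = f"{BACKUP_ROOT}/{table.replace('.', '/')}"

    # Deep Clone copies data, schema, and partitioning, and the clone keeps its
    # own Delta transaction log for point-in-time recovery.
    spark.sql(f"CREATE OR REPLACE TABLE delta.`{target}` DEEP CLONE {table}")

    rows = spark.sql(f"SELECT COUNT(*) AS c FROM delta.`{target}`").first()["c"]
    sync_log.append((table, target, rows, datetime.now(timezone.utc).isoformat()))

# Persist a simple status record so row counts and data freshness can be reported.
log_df = spark.createDataFrame(
    sync_log, ["source_table", "backup_path", "row_count", "synced_at"]
)
log_df.write.mode("append").saveAsTable("main.dr.sync_status")  # hypothetical status table
```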
3. Failover Automation (Python Scripts)
The failover automation layer orchestrates the switch from primary to secondary workspace when disaster strikes.
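The solution's failover scripts are not reproduced here, but their overall flow can be sketched as follows: confirm the primary workspace is unreachable, re-enable job execution in the secondary, and flip whatever routing toggle the orchestrator uses. The health probe, profile names, and concurrency value below are illustrative assumptions, not the actual scripts.

```python
# Sketch of a failover orchestration script. Assumes the databricks-sdk package,
# CLI profiles "primary" and "secondary", and that secondary jobs were deployed
# with max_concurrent_runs=0 (all of these are illustrative assumptions).
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobSettings

def primary_is_healthy() -> bool:
    """Best-effort health probe: can we still list jobs in the primary workspace?"""
    try:
        next(iter(WorkspaceClient(profile="primary").jobs.list()), None)
        return True
    except Exception:
        return False

def activate_secondary() -> None:
    """Allow jobs in the secondary workspace to run again."""
    w = WorkspaceClient(profile="secondary")
    for job in w.jobs.list():
        # Restore normal concurrency so scheduled and triggered runs execute.
        w.jobs.update(job_id=job.job_id, new_settings=JobSettings(max_concurrent_runs=1))

if __name__ == "__main__":
    if not primary_is_healthy():
        activate_secondary()
        print("Secondary activated; remember to flip the orchestrator's region toggle.")
    else:
        print("Primary workspace responded; no failover performed.")
```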
Microsoft Fabric
Microsoft Fabric provides built‑in disaster recovery capabilities designed to keep analytics and Power BI experiences available during regional outages. Fabric simplifies continuity for reporting workloads, while still requiring customer planning for deeper data and workload replication.
Power BI Business Continuity
Power BI, now integrated into Fabric, provides automatic disaster recovery as a default offering:
- No opt-in required: DR capabilities are automatically included.
- Azure geo-redundant storage replication: Ensures backup instances exist in other regions.
- Read-only access during disasters: Semantic models, reports, and dashboards remain accessible.
- Always supported: BCDR for Power BI remains active regardless of the OneLake DR setting.
Fabric Cross-Region Disaster Recovery
Fabric's cross-region DR uses a shared responsibility model between Microsoft and customers:
Microsoft's Responsibilities:
- Ensure baseline infrastructure and platform services availability.
- Maintain Azure regional pairings for geo-redundancy.
- Provide DR capabilities for Power BI by default.
Customer Responsibilities:
- Enable disaster recovery settings for capacities
- Set up secondary capacity and workspaces in paired regions
- Replicate data and configurations
Enabling Disaster Recovery
Organizations can enable BCDR through the Admin portal under Capacity settings:
- Navigate to Admin portal → Capacity settings
- Select the appropriate Fabric Capacity
- Access Disaster Recovery configuration
- Enable the disaster recovery toggle
Critical Timing Considerations:
- 30-day minimum activation period: Once enabled, the setting remains active for at least 30 days and cannot be reverted.
- 72-hour activation window: Initial enablement can take up to 72 hours to become fully effective.
Azure Databricks & Microsoft Fabric DR Considerations
Building a resilient analytics platform requires understanding how disaster recovery responsibilities differ between Azure Databricks and Microsoft Fabric. While both platforms operate within Azure’s regional architecture, their DR models, failover behaviors, and customer responsibilities are fundamentally different.
Recovery Procedures
| Procedure | Databricks | Fabric |
| --- | --- | --- |
| Failover | Stop workloads, update routing, resume in secondary region. | Microsoft initiates failover; customers restore services in DR capacity. |
| Restore to Primary | Stop secondary workloads, replicate data/code back, test, resume production. | Recreate workspaces and items in new capacity; restore Lakehouse and Warehouse data. |
| Asset Syncing | Use CI/CD and Terraform to sync clusters, jobs, notebooks, permissions. | Use Git integration and pipelines to sync notebooks and pipelines; manually restore Lakehouses. |
Business Considerations
| Consideration | Databricks | Fabric |
| --- | --- | --- |
| Control | Customers manage DR strategy, failover timing, and asset replication. | Microsoft manages failover; customers restore services post-failover. |
| Regional Dependencies | Must ensure secondary region has sufficient capacity and services. | DR only available in Azure regions with Fabric support and paired regions. |
| Power BI Continuity | Not applicable. | Power BI offers built-in BCDR with read-only access to semantic models and reports. |
| Activation Timeline | Immediate upon configuration. | DR setting takes up to 72 hours to activate; 30-day wait before changes allowed. |