Disaster recovery (DR) is a critical component of any cloud-native data analytics platform, ensuring business continuity even during rare regional outages caused by natural disasters, infrastructure failures, or other disruptions.
Identify Business-Critical Workloads
Before designing any disaster recovery strategy, organizations must first identify which workloads are truly business‑critical and require regional redundancy. Not all Databricks or Fabric processes need full DR protection; instead, customers should evaluate the operational impact of downtime, data freshness requirements, regulatory obligations, SLAs, and dependencies across upstream and downstream systems. By classifying workloads into tiers and aligning DR investments accordingly, customers ensure they protect what matters most without over‑engineering the platform.
Azure Databricks
Azure Databricks requires a customer‑driven approach to disaster recovery, where organizations are responsible for replicating workspaces, data, infrastructure components, and security configurations across regions.
Full System Failover (Active-Passive) Strategy
A comprehensive approach that replicates all dependent services to the secondary region. Implementation requirements include:
Infrastructure Components:
- Replicate Azure services (ADLS, Key Vault, SQL databases) using Terraform
- Deploy network infrastructure (subnets) in the secondary region
- Establish data synchronization mechanisms
Data Replication Strategy:
- Use Deep Clone for Delta tables rather than geo-redundant storage
- Implement periodic synchronization jobs using Delta's incremental replication
- Verify replication results using Delta time travel syntax
Workspace Asset Synchronization:
- Co-deploy cluster configurations, notebooks, jobs, and permissions using CI/CD
- Utilize Terraform and SCIM for identity and access management
- Keep job concurrency at zero in the secondary region to prevent execution
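The "concurrency at zero" guard can be applied in the deployment pipeline itself; a minimal sketch, assuming Jobs-API-style job settings dicts (field names `max_concurrent_runs` and `pause_status` follow the Databricks Jobs API; the job name is a placeholder):

```python
def for_secondary_region(job_settings: dict) -> dict:
    """Return a copy of a Databricks job definition adjusted for the passive
    DR workspace: concurrency forced to zero so the job cannot execute, and
    any schedule paused until failover."""
    settings = dict(job_settings)
    settings["max_concurrent_runs"] = 0
    if "schedule" in settings:
        schedule = dict(settings["schedule"])
        schedule["pause_status"] = "PAUSED"
        settings["schedule"] = schedule
    return settings
```

A CI/CD pipeline would deploy `job_settings` unchanged to the primary workspace and `for_secondary_region(job_settings)` to the secondary one.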
Fully Redundant (Active-Active) Strategy
The most sophisticated approach, in which all transactions are processed in multiple regions simultaneously. While providing maximum resilience, this strategy:
- Requires complex data synchronization between regions
- Incurs the highest operational costs due to duplicate processing
- Is typically needed only for mission-critical workloads with zero tolerance for downtime
- Can be implemented as partial active-active, processing most workloads in the primary region with a subset in the secondary
Enabling Disaster Recovery
- Create a secondary workspace in a paired region.
- Use CI/CD to keep workspace assets synchronized continuously.
| Requirement | Approach | Tools |
| --- | --- | --- |
| Cluster Configurations | Co-deploy to both regions as code | Terraform |
| Code (Notebooks, Libraries, SQL) | Co-deploy with CI/CD pipelines | Git, Azure DevOps, GitHub Actions |
| Jobs | Co-deploy with CI/CD, set concurrency to zero in secondary | Databricks Asset Bundles, Terraform |
| Permissions (Users, Groups, ACLs) | Use IdP/SCIM and infrastructure as code | Terraform, SCIM |
| Secrets | Co-deploy using secret management | Terraform, Azure Key Vault |
| Table Metadata | Co-deploy with CI/CD workflows | Git, Terraform |
| Cloud Services (ADLS, Network) | Co-deploy infrastructure | Terraform |
- Update your orchestrator (ADF, Fabric pipelines, etc.) to include a simple region toggle to reroute job execution.
- Replicate all dependent services (Key Vault, storage accounts, SQL DB).
- Implement Delta “Deep Clone” synchronization jobs to keep datasets continuously aligned between regions.
- Introduce an application‑level “Sync Tool” that redirects:
  - data ingestion
  - compute execution
- Enable parallel processing in both regions for selected or all workloads.
- Use bi‑directional synchronization for Delta data to maintain consistency across regions.
- For performance and cost control, run most workloads in the primary region and only a subset in the secondary to keep it warm.
Microsoft Fabric
Microsoft Fabric provides built‑in disaster recovery capabilities designed to keep analytics and Power BI experiences available during regional outages. Fabric simplifies continuity for reporting workloads, while still requiring customer planning for deeper data and workload replication.
Power BI Business Continuity
Power BI, now integrated into Fabric, provides automatic disaster recovery as a default offering:
- No opt-in required: DR capabilities are automatically included.
- Azure storage geo-redundant replication: ensures backup instances exist in other regions.
- Read-only access during disasters: semantic models, reports, and dashboards remain accessible.
- Always supported: BCDR for Power BI remains active regardless of the OneLake DR setting.
Fabric Cross-Region Disaster Recovery
Fabric's cross-region DR uses a shared responsibility model between Microsoft and customers:
Microsoft's Responsibilities:
- Ensure baseline infrastructure and platform services availability.
- Maintain Azure regional pairings for geo-redundancy.
- Provide DR capabilities for Power BI by default.
Customer Responsibilities:
- Enable disaster recovery settings for capacities.
- Set up secondary capacity and workspaces in paired regions.
- Replicate data and configurations.
Enabling Disaster Recovery
Organizations can enable BCDR through the Admin portal under Capacity settings:
- Navigate to Admin portal → Capacity settings
- Select the appropriate Fabric capacity
- Access the Disaster Recovery configuration
- Enable the disaster recovery toggle
Critical Timing Considerations:
- 30-day minimum activation period: once enabled, the setting remains active for at least 30 days and cannot be reverted.
- 72-hour activation window: initial enablement can take up to 72 hours to become fully effective.
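For teams that automate capacity settings, the two timing rules can be encoded as simple guards; a sketch, with both windows taken directly from the constraints above:

```python
from datetime import datetime, timedelta

MIN_ENABLED_PERIOD = timedelta(days=30)  # setting cannot be reverted before this
ACTIVATION_WINDOW = timedelta(hours=72)  # DR may not be fully effective before this

def can_revert(enabled_at: datetime, now: datetime) -> bool:
    """True once the 30-day minimum activation period has elapsed."""
    return now - enabled_at >= MIN_ENABLED_PERIOD

def past_activation_window(enabled_at: datetime, now: datetime) -> bool:
    """True once the 72-hour window has passed, i.e. when DR is
    guaranteed to be fully effective."""
    return now - enabled_at >= ACTIVATION_WINDOW
```

Automation that toggles the setting should refuse a disable request while `can_revert` is false, and DR drills should not be scheduled until `past_activation_window` is true.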
Azure Databricks & Microsoft Fabric DR Considerations
Building a resilient analytics platform requires understanding how disaster recovery responsibilities differ between Azure Databricks and Microsoft Fabric. While both platforms operate within Azure’s regional architecture, their DR models, failover behaviors, and customer responsibilities are fundamentally different.
Recovery Procedures
| Procedure | Databricks | Fabric |
| --- | --- | --- |
| Failover | Stop workloads, update routing, resume in secondary region. | Microsoft initiates failover; customers restore services in DR capacity. |
| Restore to Primary | Stop secondary workloads, replicate data/code back, test, resume production. | Recreate workspaces and items in new capacity; restore Lakehouse and Warehouse data. |
| Asset Syncing | Use CI/CD and Terraform to sync clusters, jobs, notebooks, permissions. | Use Git integration and pipelines to sync notebooks and pipelines; manually restore Lakehouses. |
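The Databricks failover procedure amounts to a three-step runbook: stop workloads, update routing, resume in the secondary region. Sketched here against a stand-in orchestrator object (the class, methods, and region labels are illustrative, not a real API):

```python
from dataclasses import dataclass, field

@dataclass
class Orchestrator:
    """Stand-in for ADF/Fabric pipeline control; records actions for clarity."""
    active_region: str = "primary"
    actions: list = field(default_factory=list)

    def stop_jobs(self, region: str) -> None:
        self.actions.append(f"stop:{region}")

    def resume_jobs(self, region: str) -> None:
        self.actions.append(f"resume:{region}")

def databricks_failover(orc: Orchestrator) -> None:
    """Stop workloads in the primary, flip routing, resume in the secondary."""
    orc.stop_jobs(orc.active_region)
    orc.active_region = "secondary"
    orc.resume_jobs(orc.active_region)
```

Restoring to primary is the mirror image of this sequence, with data and code replicated back and tested before production resumes.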
Business Considerations
| Consideration | Databricks | Fabric |
| --- | --- | --- |
| Control | Customers manage DR strategy, failover timing, and asset replication. | Microsoft manages failover; customers restore services post-failover. |
| Regional Dependencies | Must ensure secondary region has sufficient capacity and services. | DR only available in Azure regions with Fabric support and paired regions. |
| Power BI Continuity | Not applicable. | Power BI offers built-in BCDR with read-only access to semantic models and reports. |
| Activation Timeline | Immediate upon configuration. | DR setting takes up to 72 hours to activate; 30-day wait before changes allowed. |