Disaster recovery (DR) is a critical component of any cloud-native data analytics platform, ensuring business continuity even during rare regional outages caused by natural disasters, infrastructure failures, or other disruptions.
Identify Business-Critical Workloads
Before designing any disaster recovery strategy, organizations must first identify which workloads are truly business‑critical and require regional redundancy. Not all Databricks or Fabric processes need full DR protection; instead, customers should evaluate the operational impact of downtime, data freshness requirements, regulatory obligations, SLAs, and dependencies across upstream and downstream systems. By classifying workloads into tiers and aligning DR investments accordingly, customers ensure they protect what matters most without over‑engineering the platform.
Azure Databricks
Azure Databricks requires a customer‑driven approach to disaster recovery, where organizations are responsible for replicating workspaces, data, infrastructure components, and security configurations across regions.
Full System Failover (Active-Passive) Strategy
A comprehensive approach that replicates all dependent services to the secondary region. Implementation requirements include:
Infrastructure Components:
- Replicate Azure services (ADLS, Key Vault, SQL databases) using Terraform
- Deploy network infrastructure (subnets) in the secondary region
- Establish data synchronization mechanisms
Data Replication Strategy:
- Use Deep Clone for Delta tables rather than geo-redundant storage
- Implement periodic synchronization jobs using Delta's incremental replication
- Verify replication results using Delta time travel syntax
Workspace Asset Synchronization:
- Co-deploy cluster configurations, notebooks, jobs, and permissions using CI/CD
- Utilize Terraform and SCIM for identity and access management
- Keep job concurrency at zero in the secondary region to prevent execution
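The "concurrency at zero" guard can be applied in the deployment pipeline itself; a minimal sketch, assuming Jobs-API-style job settings dicts (field names `max_concurrent_runs` and `pause_status` follow the Databricks Jobs API; the job name is a placeholder):

```python
def for_secondary_region(job_settings: dict) -> dict:
    """Return a copy of a Databricks job definition adjusted for the passive
    DR workspace: concurrency forced to zero so the job cannot execute, and
    any schedule paused until failover."""
    settings = dict(job_settings)
    settings["max_concurrent_runs"] = 0
    if "schedule" in settings:
        schedule = dict(settings["schedule"])
        schedule["pause_status"] = "PAUSED"
        settings["schedule"] = schedule
    return settings
```

A CI/CD pipeline would deploy `job_settings` unchanged to the primary workspace and `for_secondary_region(job_settings)` to the secondary one.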
Fully Redundant (Active-Active) Strategy
The most sophisticated approach, in which all transactions are processed in multiple regions simultaneously. While providing maximum resilience, this strategy:
- Requires complex data synchronization between regions
- Incurs the highest operational costs due to duplicate processing
- Is typically needed only for mission-critical workloads with zero tolerance for downtime
- Can be implemented as partial active-active, processing most workloads in the primary region with a subset in the secondary
Enabling Disaster Recovery
- Create a secondary workspace in a paired region.
- Use CI/CD to keep workspace assets synchronized continuously.
| Requirement | Approach | Tools |
| --- | --- | --- |
| Cluster Configurations | Co-deploy to both regions as code | Terraform |
| Code (Notebooks, Libraries, SQL) | Co-deploy with CI/CD pipelines | Git, Azure DevOps, GitHub Actions |
| Jobs | Co-deploy with CI/CD, set concurrency to zero in secondary | Databricks Asset Bundles, Terraform |
| Permissions (Users, Groups, ACLs) | Use IdP/SCIM and infrastructure as code | Terraform, SCIM |
| Secrets | Co-deploy using secret management | Terraform, Azure Key Vault |
| Table Metadata | Co-deploy with CI/CD workflows | Git, Terraform |
| Cloud Services (ADLS, Network) | Co-deploy infrastructure | Terraform |
- Update your orchestrator (ADF, Fabric pipelines, etc.) to include a simple region toggle to reroute job execution.
- Replicate all dependent services (Key Vault, storage accounts, SQL DB).
- Implement Delta “Deep Clone” synchronization jobs to keep datasets continuously aligned between regions.
- Introduce an application‑level “Sync Tool” that redirects:
  - data ingestion
  - compute execution
- Enable parallel processing in both regions for selected or all workloads.
- Use bi‑directional synchronization for Delta data to maintain consistency across regions.
- For performance and cost control, run most workloads in the primary region and only a subset in the secondary to keep it warm.
Microsoft Fabric
Microsoft Fabric provides built‑in disaster recovery capabilities designed to keep analytics and Power BI experiences available during regional outages. Fabric simplifies continuity for reporting workloads, while still requiring customer planning for deeper data and workload replication.
Power BI Business Continuity
Power BI, now integrated into Fabric, provides automatic disaster recovery as a default offering:
- No opt-in required: DR capabilities are automatically included.
- Azure storage geo-redundant replication: ensures backup instances exist in other regions.
- Read-only access during disasters: semantic models, reports, and dashboards remain accessible.
- Always supported: BCDR for Power BI remains active regardless of the OneLake DR setting.
Fabric Cross-Region Disaster Recovery
Fabric's cross-region DR uses a shared responsibility model between Microsoft and customers:
Microsoft's Responsibilities:
- Ensure baseline infrastructure and platform services availability.
- Maintain Azure regional pairings for geo-redundancy.
- Provide DR capabilities for Power BI by default.
Customer Responsibilities:
- Enable disaster recovery settings for capacities.
- Set up secondary capacity and workspaces in paired regions.
- Replicate data and configurations.
Enabling Disaster Recovery
Organizations can enable BCDR through the Admin portal under Capacity settings:
- Navigate to Admin portal → Capacity settings
- Select the appropriate Fabric capacity
- Access the Disaster Recovery configuration
- Enable the disaster recovery toggle
Critical Timing Considerations:
- 30-day minimum activation period: once enabled, the setting remains active for at least 30 days and cannot be reverted.
- 72-hour activation window: initial enablement can take up to 72 hours to become fully effective.
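For teams that automate capacity settings, the two timing rules can be encoded as simple guards; a sketch, with both windows taken directly from the constraints above:

```python
from datetime import datetime, timedelta

MIN_ENABLED_PERIOD = timedelta(days=30)  # setting cannot be reverted before this
ACTIVATION_WINDOW = timedelta(hours=72)  # DR may not be fully effective before this

def can_revert(enabled_at: datetime, now: datetime) -> bool:
    """True once the 30-day minimum activation period has elapsed."""
    return now - enabled_at >= MIN_ENABLED_PERIOD

def past_activation_window(enabled_at: datetime, now: datetime) -> bool:
    """True once the 72-hour window has passed, i.e. when DR is
    guaranteed to be fully effective."""
    return now - enabled_at >= ACTIVATION_WINDOW
```

Automation that toggles the setting should refuse a disable request while `can_revert` is false, and DR drills should not be scheduled until `past_activation_window` is true.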
Azure Databricks & Microsoft Fabric DR Considerations
Building a resilient analytics platform requires understanding how disaster recovery responsibilities differ between Azure Databricks and Microsoft Fabric. While both platforms operate within Azure’s regional architecture, their DR models, failover behaviors, and customer responsibilities are fundamentally different.
Recovery Procedures
| Procedure | Databricks | Fabric |
| --- | --- | --- |
| Failover | Stop workloads, update routing, resume in secondary region. | Microsoft initiates failover; customers restore services in DR capacity. |
| Restore to Primary | Stop secondary workloads, replicate data/code back, test, resume production. | Recreate workspaces and items in new capacity; restore Lakehouse and Warehouse data. |
| Asset Syncing | Use CI/CD and Terraform to sync clusters, jobs, notebooks, permissions. | Use Git integration and pipelines to sync notebooks and pipelines; manually restore Lakehouses. |
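The Databricks failover procedure amounts to a three-step runbook: stop workloads, update routing, resume in the secondary region. Sketched here against a stand-in orchestrator object (the class, methods, and region labels are illustrative, not a real API):

```python
from dataclasses import dataclass, field

@dataclass
class Orchestrator:
    """Stand-in for ADF/Fabric pipeline control; records actions for clarity."""
    active_region: str = "primary"
    actions: list = field(default_factory=list)

    def stop_jobs(self, region: str) -> None:
        self.actions.append(f"stop:{region}")

    def resume_jobs(self, region: str) -> None:
        self.actions.append(f"resume:{region}")

def databricks_failover(orc: Orchestrator) -> None:
    """Stop workloads in the primary, flip routing, resume in the secondary."""
    orc.stop_jobs(orc.active_region)
    orc.active_region = "secondary"
    orc.resume_jobs(orc.active_region)
```

Restoring to primary is the mirror image of this sequence, with data and code replicated back and tested before production resumes.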
Business Considerations
| Consideration | Databricks | Fabric |
| --- | --- | --- |
| Control | Customers manage DR strategy, failover timing, and asset replication. | Microsoft manages failover; customers restore services post-failover. |
| Regional Dependencies | Must ensure secondary region has sufficient capacity and services. | DR only available in Azure regions with Fabric support and paired regions. |
| Power BI Continuity | Not applicable. | Power BI offers built-in BCDR with read-only access to semantic models and reports. |
| Activation Timeline | Immediate upon configuration. | DR setting takes up to 72 hours to activate; 30-day wait before changes allowed. |