Analytics on Azure Blog

Azure Databricks & Fabric Disaster Recovery: The Better Together Story

Rafia_Aqil (Microsoft)
Dec 27, 2025

Disaster recovery (DR) is a critical component of any cloud-native data analytics platform, ensuring business continuity even during rare regional outages caused by natural disasters, infrastructure failures, or other disruptions. 

Identify Business Critical Workloads 

Before designing any disaster recovery strategy, organizations must first identify which workloads are truly business‑critical and require regional redundancy. Not all Databricks or Fabric processes need full DR protection; instead, customers should evaluate the operational impact of downtime, data freshness requirements, regulatory obligations, SLAs, and dependencies across upstream and downstream systems. By classifying workloads into tiers and aligning DR investments accordingly, customers ensure they protect what matters most without over‑engineering the platform. 

Azure Databricks 

Azure Databricks requires a customer‑driven approach to disaster recovery, where organizations are responsible for replicating workspaces, data, infrastructure components, and security configurations across regions. 

Full System Failover (Active-Passive) Strategy  

A comprehensive approach that replicates all dependent services to the secondary region. Implementation requirements include:  

Infrastructure Components: 

      • Replicate Azure services (ADLS, Key Vault, SQL databases) using Terraform 
      • Deploy network infrastructure (subnets) in the secondary region 
      • Establish data synchronization mechanisms 

Data Replication Strategy: 

      • Use Deep Clone for Delta tables rather than geo-redundant storage 
      • Implement periodic synchronization jobs using Delta's incremental replication 
      • Validate replicated data in the secondary region using Delta time travel syntax 
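As a sketch of the synchronization step above (catalog and table names here are hypothetical, not from the post), a periodic job can issue Delta `DEEP CLONE` statements, which copy data and metadata incrementally on each re-run:

```python
# Sketch of a periodic Deep Clone sync job (table names are hypothetical).
# Re-running DEEP CLONE against the same target transfers only changed
# files, which is what makes it suitable for scheduled replication.

def deep_clone_sql(source_table: str, target_table: str) -> str:
    """Build the Spark SQL statement that (re)creates target_table
    as an incremental deep clone of source_table."""
    return f"CREATE OR REPLACE TABLE {target_table} DEEP CLONE {source_table}"

def sync_tables(spark, table_pairs):
    """Run one clone statement per (source, target) pair.
    `spark` is the SparkSession provided by the Databricks runtime."""
    for source, target in table_pairs:
        spark.sql(deep_clone_sql(source, target))

# Example pairing: primary-region catalog -> secondary-region catalog.
TABLES = [
    ("primary_cat.sales.orders", "dr_cat.sales.orders"),
]
```

Scheduling this as a Databricks job in the primary region keeps the secondary copies aligned at the chosen cadence.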

Workspace Asset Synchronization: 

      • Co-deploy cluster configurations, notebooks, jobs, and permissions using CI/CD 
      • Utilize Terraform and SCIM for identity and access management 
      • Keep job concurrency at zero in the secondary region to prevent execution 
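One way to keep secondary-region jobs deployed but dormant, per the list above, is to override their settings at deploy time. A minimal sketch (the helper name and example payload are assumptions; `max_concurrent_runs` and `schedule.pause_status` are Databricks Jobs API fields):

```python
# Sketch: adjust a Databricks Jobs API settings payload so a job
# deployed to the secondary (passive) region cannot run on its own.

def make_passive(job_settings: dict) -> dict:
    """Return a copy of the settings with concurrency set to zero and
    any schedule paused, for deployment to the DR region."""
    passive = dict(job_settings)
    # No runs may start in the passive region.
    passive["max_concurrent_runs"] = 0
    # Pause the trigger as well, so the scheduler never fires there.
    if "schedule" in passive:
        passive["schedule"] = {**passive["schedule"], "pause_status": "PAUSED"}
    return passive
```

The CI/CD pipeline would apply `make_passive` only to the secondary-region deployment, leaving the primary definition untouched.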

 

Fully Redundant (Active-Active) Strategy 

The most sophisticated approach where all transactions are processed in multiple regions simultaneously. While providing maximum resilience, this strategy:  

    • Requires complex data synchronization between regions 
    • Incurs highest operational costs due to duplicate processing 
    • Typically needed only for mission-critical workloads with zero-tolerance for downtime 
    • Can be implemented as partial active-active, processing most workloads in the primary region with a subset in the secondary 

 

Enabling Disaster Recovery 

    • Create a secondary workspace in a paired region. 
    • Use CI/CD to keep workspace assets synchronized continuously. 

 

| Requirement | Approach | Tools |
| --- | --- | --- |
| Cluster Configurations | Co-deploy to both regions as code | Terraform |
| Code (Notebooks, Libraries, SQL) | Co-deploy with CI/CD pipelines | Git, Azure DevOps, GitHub Actions |
| Jobs | Co-deploy with CI/CD, set concurrency to zero in secondary | Databricks Asset Bundles, Terraform |
| Permissions (Users, Groups, ACLs) | Use IdP/SCIM and infrastructure as code | Terraform, SCIM |
| Secrets | Co-deploy using secret management | Terraform, Azure Key Vault |
| Table Metadata | Co-deploy with CI/CD workflows | Git, Terraform |
| Cloud Services (ADLS, Network) | Co-deploy infrastructure | Terraform |

 

    • Update your orchestrator (ADF, Fabric pipelines, etc.) to include a simple region toggle that reroutes job execution. 
    • Replicate all dependent services (Key Vault, storage accounts, SQL DB). 
    • Implement Delta Deep Clone synchronization jobs to keep datasets continuously aligned between regions. 
    • Introduce an application-level "Sync Tool" that redirects data ingestion and compute execution to the active region. 
    • Enable parallel processing in both regions for selected or all workloads. 
    • Use bi-directional synchronization for Delta data to maintain consistency across regions. 
    • For performance and cost control, run most workloads in the primary region and only a subset in the secondary to keep it warm. 
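The region toggle described above can be sketched as a small orchestrator-side routing helper. Workspace URLs, storage paths, and region names here are hypothetical placeholders:

```python
# Sketch of an orchestrator-side region toggle (all endpoints are
# hypothetical). The orchestrator reads the active region from config
# and routes every job submission to the matching endpoints.

REGIONS = {
    "primary": {
        "workspace_url": "https://adb-111.azuredatabricks.net",
        "storage": "abfss://data@primarylake.dfs.core.windows.net",
    },
    "secondary": {
        "workspace_url": "https://adb-222.azuredatabricks.net",
        "storage": "abfss://data@drlake.dfs.core.windows.net",
    },
}

def resolve_endpoints(active_region: str) -> dict:
    """Return the endpoints jobs should target for the active region."""
    if active_region not in REGIONS:
        raise ValueError(f"unknown region: {active_region}")
    return REGIONS[active_region]

def failover() -> dict:
    """Flip routing to the secondary region during a DR event."""
    return resolve_endpoints("secondary")
```

Because every pipeline resolves its targets through one function, failover becomes a single configuration change rather than edits scattered across jobs.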

 

Microsoft Fabric  

Microsoft Fabric provides built‑in disaster recovery capabilities designed to keep analytics and Power BI experiences available during regional outages. Fabric simplifies continuity for reporting workloads, while still requiring customer planning for deeper data and workload replication. 

Power BI Business Continuity 

Power BI, now integrated into Fabric, provides automatic disaster recovery as a default offering: 

    • No opt-in required: DR capabilities are automatically included. 
    • Azure storage geo-redundant replication: Ensures backup instances exist in other regions. 
    • Read-only access during disasters: Semantic models, reports, and dashboards remain accessible. 
    • Always supported: BCDR for Power BI remains active regardless of OneLake DR setting. 

Microsoft Fabric 

Fabric's cross-region DR uses a shared responsibility model between Microsoft and customers:  

Microsoft's Responsibilities: 

    • Ensure baseline infrastructure and platform services availability 
    • Maintain Azure regional pairings for geo-redundancy 
    • Provide DR capabilities for Power BI by default 

Customer Responsibilities: 

    • Enable disaster recovery settings for capacities 
    • Set up secondary capacity and workspaces in paired regions 
    • Replicate data and configurations 

Enabling Disaster Recovery 

Organizations can enable BCDR through the Admin portal under Capacity settings:  

    1. Navigate to Admin portal → Capacity settings 
    2. Select the appropriate Fabric Capacity 
    3. Access Disaster Recovery configuration 
    4. Enable the disaster recovery toggle 

Critical Timing Considerations: 

    • 30-day minimum activation period: Once enabled, the setting remains active for at least 30 days and cannot be reverted. 
    • 72-hour activation window: Initial enablement can take up to 72 hours to become fully effective. 

 

Azure Databricks & Microsoft Fabric DR Considerations  

 

Building a resilient analytics platform requires understanding how disaster recovery responsibilities differ between Azure Databricks and Microsoft Fabric. While both platforms operate within Azure’s regional architecture, their DR models, failover behaviors, and customer responsibilities are fundamentally different. 

 
Recovery Procedures 

| Procedure | Databricks | Fabric |
| --- | --- | --- |
| Failover | Stop workloads, update routing, resume in secondary region. | Microsoft initiates failover; customers restore services in DR capacity. |
| Restore to Primary | Stop secondary workloads, replicate data/code back, test, resume production. | Recreate workspaces and items in new capacity; restore Lakehouse and Warehouse data. |
| Asset Syncing | Use CI/CD and Terraform to sync clusters, jobs, notebooks, permissions. | Use Git integration and pipelines to sync notebooks and pipelines; manually restore Lakehouses. |

 

Business Considerations 

| Consideration | Databricks | Fabric |
| --- | --- | --- |
| Control | Customers manage DR strategy, failover timing, and asset replication. | Microsoft manages failover; customers restore services post-failover. |
| Regional Dependencies | Must ensure secondary region has sufficient capacity and services. | DR only available in Azure regions with Fabric support and paired regions. |
| Power BI Continuity | Not applicable. | Power BI offers built-in BCDR with read-only access to semantic models and reports. |
| Activation Timeline | Immediate upon configuration. | DR setting takes up to 72 hours to activate; 30-day wait before changes allowed. |

 

 

Updated Dec 27, 2025
Version 2.0
