Azure Databricks at Databricks Data + AI Summit 2026: updates and new announcements
Databricks Data + AI Summit brings together the global data and AI community in San Francisco to share product news, technical breakthroughs, and customer stories. This year, as usual, we have a lot of Azure Databricks announcements, a strong presence across the event, and a continued focus on helping customers put their data to work across analytics, AI, and business productivity.

Find us at Data and AI Summit
As a Legend Sponsor and Databricks’ long-standing strategic partner, Microsoft is taking part in Databricks Data + AI Summit through the keynote, multiple breakout sessions, and the Expo booth. We're also meeting with customers 1:1 to hear from you. Satya Nadella will join Ali Ghodsi, CEO of Databricks, in a pre-recorded keynote conversation on the importance of data in AI implementation and the deep integrations we co-engineer. We encourage you to visit us at the Microsoft Booth (Booth #103) on the Expo floor to chat with the Azure Databricks team, see demos, and learn more about the recent announcements.

Azure Databricks Breakout Sessions
Unlocking the Microsoft Data & AI Ecosystem with Azure Databricks: From Insight to Impact
Wednesday, June 17 | 1:50 PM – 2:30 PM PDT | Speaker: Anavi Nahar, Head of Product, Azure Data Lake Storage & Azure Databricks, Microsoft
In today’s data-driven landscape, organizations need more than analytics: they need a unified platform that turns raw data into actionable intelligence across the Microsoft ecosystem. This session explores how Azure Databricks serves as the backbone of modern data architecture, integrating with core Microsoft cloud services and platforms to accelerate innovation. Learn how to use Azure Databricks for scalable data engineering, advanced analytics, and AI-driven solutions while enabling real-time collaboration and governance. Through practical examples and architectural patterns, we’ll show how to eliminate data silos, optimize performance, and empower teams to deliver insights faster.
Zero-Copy Federated Energy Analytics: ADME + Databricks in Action
Wednesday, June 17 | 12:40 PM - 1:20 PM PDT | Speaker: Andy Corran, Principal Product Manager, Azure Databricks, Microsoft
Oil and gas companies have standardized on Azure Data Manager for Energy (ADME) as their subsurface system of record, but running analytics and AI on that data has meant copying massive datasets into downstream platforms, breaking governance and slowing every workflow that follows. In this jointly developed Microsoft and Databricks session, we introduce a new zero‑copy, federated path that brings Databricks compute directly to data, with native governance and serverless scale. We walk through the architecture, show the solution in action against live ADME, and share how operators across the industry are accelerating subsurface analytics while keeping ADME as the single source of truth.
Unity Catalog External Locations: Extending Governance to OneLake and Beyond
Wednesday, June 17 | 5:20 PM - 5:40 PM PDT | Speaker: Ljubica Vujovic Boskovic, Senior Product Manager, Databricks
In this session, we'll show how External Locations provide a consistent, extensible pattern for connecting Databricks to any storage platform, and walk through what it takes to create an External Location for Microsoft OneLake. You'll see the architecture, the end-to-end setup, and a demo of reading and writing UC-governed assets directly in OneLake storage without needing to set up any ETL pipelines.

Latest announcements
We recently announced new ways to build AI apps and agents with Azure Databricks, Copilot Studio, and GitHub Copilot, including authoring Copilot Studio agents that reason over an entire Azure Databricks workspace through one MCP connection. At Microsoft Build, PepsiCo also shared its blueprint for agentic AI, illustrating how Azure Databricks can provide the data foundation for agentic apps.
This week’s announcements make it easier to use Azure Databricks with the Microsoft tools your teams rely on every day, including Microsoft Teams, M365 Copilot, Excel, SharePoint, Power BI, and OneLake:
- Genie for Microsoft Teams and M365 Copilot (Beta): You can tag Genie in a Teams thread and get a context-aware answer from your Azure Databricks lakehouse without leaving the conversation. Responses are governed by Unity Catalog, so each answer is scoped to what the user is permitted to see. It’s part of the broader Genie One experience for report generation, reusable agents, low-code apps, and natural-language pipeline design. See it in action in the Databricks + Microsoft co-authored training in AI Skills Navigator.
- Genie in Copilot Cowork (Beta): Available today, Databricks Genie works seamlessly with M365 Copilot Cowork. This integration lets teams anchor Cowork’s tasks with the Genie Ontology, bringing trusted data intelligence straight into their workflows.
- Azure Databricks Excel Add-in (Public Preview): This brings governed lakehouse data into Excel without SQL or per-user ODBC setup. Unity Catalog metric views let business logic be defined once and stay consistent across tools, and the add-in supports write-back, so permitted users can push updates from Excel into Databricks. Learn how to set it up.
- SharePoint Connector (Beta) via Lakeflow Connect: A fully managed connector for point-and-click ingestion pipelines that bring SharePoint content (structured sheets and unstructured PDFs, Word docs, and PowerPoints) into Delta tables, keeping downstream analytics, Genie spaces, and Excel workbooks supplied with current data. Read the documentation here.
- Azure Databricks OneLake Catalog Federation (Generally Available): The ability to query OneLake data directly from Azure Databricks without pipelines, duplication, or data movement is generally available.
This announcement, coupled with the Azure Databricks Mirrored Catalog item, enables bidirectional reads between Azure Databricks and OneLake. Learn more here.
- Storing Unity Catalog Managed Tables in OneLake (Beta): Customers can now use OneLake as a storage location for Unity Catalog tables, in addition to Azure Data Lake Storage (ADLS). Read more on how to do this here.

CustomerLake: a customer data platform inside the lakehouse
Introducing CustomerLake, a Customer Data Platform (CDP) built directly within the lakehouse rather than as a separate application. CustomerLake is now available in Azure Databricks. Two kinds of agents do much of the work:
- Profile Agents help assemble business-ready Customer 360 profiles from fragmented sources, reducing the manual effort of stitching customer data together.
- Campaign Agents give marketing teams a workspace to segment audiences, recommend next-best actions, activate across channels, and continuously optimize personalized experiences.
Because CustomerLake runs inside your governed storage boundary, customer data, AI models, and governance stay together, avoiding much of the data movement and duplication that come with connecting separate marketing tools. For Azure customers, that means building customer engagement on the same governed lakehouse foundation they already use for analytics and AI, rather than maintaining a parallel stack.
“What excites us most about the CustomerLake and the new CDP capability is the ability to bring customer data together in a way that is actionable, timely, and scalable. By creating a more complete view of each customer, we can better understand behaviors, preferences, and needs across channels, which will help us deliver more personalized experiences and more relevant offers.
Ultimately, we see this as a powerful step toward stronger engagement, deeper loyalty, and better outcomes for both our business and our customers.”
Jay Malepati, Global Director of Data Science, Circle K
All of these announcements benefit from built-in governance through Azure Databricks Unity Catalog. By connecting governed lakehouse data to the Microsoft tools your teams already use (Teams, M365 Copilot, Excel, SharePoint, OneLake, and Power BI), these updates make it easier to put trusted AI to work on Azure. To learn more, explore the Azure Databricks documentation and try these capabilities in your own workspace.

What to Do When You Hit Capacity in Azure Databricks: Engage, Mitigate, Plan!
Microsoft's Cloud Architects: Paul Singh, Eduardo Dos Santos, Chris Walk, Peter Lo, Tim Orentlikher, Ajmal Hossain, Chris Haynes, and Rafia Aqil

Start Here: Engage Microsoft
Capacity constraints in Azure Databricks are not an Azure Databricks product issue. Azure Databricks does not own or reserve compute; it dynamically provisions VMs from Azure when clusters are created or scaled. This means cluster creation, autoscaling, or job execution can stall when the underlying VM SKUs are constrained at the regional level. The fastest path to resolution is a structured conversation with your Microsoft account team, who can engage the Azure capacity intake process on your behalf. Create a Quota Support Ticket via Microsoft Support and bring the following to your account team, together with your Support Ticket Number. Each field maps directly to what capacity intake teams will ask for; missing fields slow the request.

What to Prepare Before You Reach Out to Your Account Team
Field | What Capacity Intake Needs | Example
Subscription IDs | The exact Azure subscriptions that will host the workspaces and clusters | 7ebee83d-7923-426c-8449-59fd4dff25ab
Region(s) | Primary region, plus any acceptable alternates | East US 2
VM family / SKU | Specific series and version requested | Eadsv5, ESv4, DSv4, DSv2
Core count / new limit | Total vCPU or core count per SKU | 10,000 cores for Eadsv5
Workload characteristic | CPU-bound vs. memory/shuffle-heavy vs. IO-heavy; batch vs. streaming vs. SQL | “Memory-intensive ETL with large joins and shuffles”
Scale and timing | When you need it, ramp profile, peak vs. steady state | “Need by month-end; ramp from 2,000 to 9,650 cores over Q3”
Business context | Business use case | “Migration off AWS”

What “Capacity” Really Means: A Layered Mental Model
Before diving into fixes, it is important to understand what is actually happening behind the scenes.
Capacity constraints can occur at three distinct layers, and solving them requires addressing each one.

Layer 1: Azure Infrastructure
This is the layer most teams underestimate. Capacity here is governed by:
- VM SKU availability in the region. D-series and E-series, the two most common Databricks worker families, have repeatedly hit capacity constraints across multiple Azure regions, causing cluster creation failures, autoscale stalls, and provisioning delays.
- Regional supply constraints, which are dynamic and shared across all Azure tenants.
- vCPU quotas and limits per subscription, which are separate from regional supply. Quota is your subscription’s limit to deploy resources (like a credit card limit); regional capacity is the underlying infrastructure available. Both must be sufficient.

Layer 2: Azure Databricks Platform
The Azure Databricks control plane has its own published ceilings that your architecture must proactively respect. Key limits from the official Azure Databricks resource limits documentation:
Resource | Limit | Scope
Jobs created per hour | 10,000 | Workspace
Tasks running simultaneously | 2,000 | Workspace (Run Job and For Each parent tasks excluded)
Parent tasks running simultaneously (Run Job / For Each) | 750 | Workspace
SQL warehouses | 1,000 | Workspace
Attached notebooks or execution contexts | 145 | Cluster
Virtual machines | 25,000 | Per subscription per region
Note: For limits marked as non-fixed in the official documentation, you can request an increase through your Azure Databricks account team.
Reference: https://learn.microsoft.com/en-us/azure/databricks/resources/limits

Layer 3: Workload (Spark Execution)
Even when both lower layers cooperate, Spark’s own execution model can produce capacity-like symptoms:
- Parallelism and task distribution, which dictate how many cores a job can usefully consume.
- Memory pressure from joins, shuffles, and skewed keys.
- IO demand and caching behavior, including Delta cache effectiveness and Spark cache misuse.
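One way to "proactively respect" the Layer 2 ceilings is a small pre-flight check that compares a planned peak against the published limits. The sketch below is illustrative only: the limit values are copied from the table above, while the function name, the metric keys, and the 80% headroom threshold are assumptions made for this example, not part of any Databricks API.

```python
# Workspace-level ceilings copied from the limits table above.
# The dictionary keys are hypothetical names chosen for this sketch.
WORKSPACE_LIMITS = {
    "jobs_created_per_hour": 10_000,
    "tasks_running_simultaneously": 2_000,
    "parent_tasks_running_simultaneously": 750,
    "sql_warehouses": 1_000,
}


def find_limit_risks(planned, limits=WORKSPACE_LIMITS, headroom=0.8):
    """Return metrics whose planned peak exceeds headroom * limit.

    `headroom` is an assumed safety margin (80% by default), not a
    documented Databricks recommendation.
    """
    risks = {}
    for metric, peak in planned.items():
        limit = limits[metric]
        if peak > headroom * limit:
            risks[metric] = {"planned": peak, "limit": limit}
    return risks


# Hypothetical planned peaks for one workspace.
planned_peak = {
    "jobs_created_per_hour": 4_000,
    "tasks_running_simultaneously": 1_900,
}
risks = find_limit_risks(planned_peak)
print(risks)
```

With these made-up numbers, only the simultaneous task count is flagged, which would be a cue to split the workload across workspaces or request an increase for a non-fixed limit.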
Understanding these layers is critical. Retries sometimes succeed because capacity is dynamic: as other workloads complete, nodes are released back to Azure and briefly become available.

Recognizing When You’ve Hit Capacity
Capacity issues rarely present as a single clean error. Instead, they appear as inconsistent behaviors:
- Clusters stuck in Pending state
- Autoscaling fails or never reaches the desired size
- Jobs intermittently fail to start
- Retry attempts sometimes succeed
These inconsistencies occur because capacity is shared across Azure tenants and fluctuates throughout the day. Running workloads outside peak business hours in the impacted region’s time zone is one of the most effective short-term mitigations.

Immediate Actions: How to Unblock Your Workloads
When you are actively hitting capacity constraints, speed matters. Reach out to your Microsoft account team and try the following mitigations, ordered from quickest to most involved.

Retry and Run During Off-Peak Hours
Capacity availability changes throughout the day as workloads complete and release VMs. Running outside peak business hours for the impacted region significantly improves success rates.

Switch VM SKU or Family
If a specific VM SKU is constrained, switching to another can immediately unblock provisioning:
- Move within the same family (for example, DSv4 → DSv5)
- Or switch families entirely (for example, D-series → F-series or L-series)
This is one of the most effective but often underused approaches.

Choosing the Right VM Family
Most Databricks environments default to D-series (general purpose) and E-series (memory optimized). These are also the most heavily used and most capacity-constrained VM families.
Consider alternatives based on your workload:
VM Family | Best For | When to Use | Trade-off
D-series | General workloads | Default choice | Often constrained in high-demand regions
E-series | Memory-heavy Spark jobs | Joins, shuffles, analytics | High demand; higher cost
F-series | CPU-intensive jobs | Parsing, transformations | Lower memory per core
L-series | IO-heavy workloads | Delta caching, large datasets | Higher cost; large local NVMe
Practical decision framework:
- Memory-bound workloads (joins, shuffles): Move from E-series to L-series. Similar memory per core, plus large local NVMe for Delta caching.
- CPU-bound workloads: Move from D-series to F-series. Higher CPU performance at lower cost.
- IO-heavy or cache-sensitive workloads: L-series can significantly improve performance and reduce shuffle pressure.
Designing around a single VM family is one of the biggest production risks in Azure Databricks environments.

Implement Regional Diversity in Your Databricks Workload
Because Azure capacity constraints are region- and SKU-specific, it is important to build architectural flexibility into your Databricks deployments. For critical or large-scale workloads, consider deploying multiple Databricks workspaces across different Azure regions to reduce dependency on any single region’s capacity. This approach enables:
- improved resilience to regional capacity constraints
- greater flexibility in workload placement
Important: Multi-region deployment requires deliberate architecture, including deploying separate workspaces and replicating data and configurations across regions; it is not automatic.

Why Adding More Nodes Is Not Always the Answer
When jobs slow down, the instinct is to scale compute. With Spark, more nodes do not always solve the problem.
Common workload issues that masquerade as capacity problems:
- Data skew
- Excessive shuffle operations
- Inefficient partitioning
- Overuse of UDFs
In real workloads, shuffle operations can grow significantly larger than input data, placing heavy pressure on both compute and memory that more nodes cannot relieve.
Smarter optimization strategies:
- Reduce shuffle through repartitioning and query optimization
- Enable Photon for faster execution
- Optimize Delta tables using Z-ordering and compaction
- Leverage caching strategically (not just the Spark cache; use the Delta/disk cache)
These optimizations can reduce your dependency on scarce VM capacity altogether.

What to Do When Your Capacity Is Approved
Once Azure approves your capacity request, retaining it requires active steps. Because Azure capacity is dynamic and shared, approved capacity is held only while compute remains actively deployed and running. This is especially important in highly constrained regions. Microsoft recommends the following:

Configure an Instance Pool
For workloads that cannot yet use serverless compute, configure an Azure Databricks Instance Pool with a minimum number of idle nodes aligned to your production requirements. An instance pool pre-allocates and maintains a set of idle, ready-to-use VM instances. When a cluster is created from the pool, it draws from these warm nodes, eliminating the need to request new VMs from the regional Azure capacity pool between job runs.
Key behaviors:
- The pool holds a minimum number of nodes continuously, keeping them warm and immediately available.
- Clusters attached to the pool pull from warm nodes, avoiding re-acquisition from Azure between runs.
- No DBU charges apply while nodes are idle in the pool.
- Azure VM infrastructure costs do apply for all minimum idle instances.
- Size the pool conservatively, aligned to production need only, to balance capacity retention against ongoing cost.
Important: Instance pools hold idle nodes on a best-effort basis.
Periodic platform events can recycle pool nodes, briefly causing the pool to fall below its configured minimum idle count while Azure re-acquires replacement nodes. Pools significantly improve availability and startup latency, but they do not change the fact that the underlying VMs are still requested from Azure on demand. They are not a hard reservation.
Reference: https://learn.microsoft.com/en-us/azure/databricks/compute/pools

Designing for Resilience: Long-Term Best Practices
To avoid repeated capacity issues, your architecture needs to evolve beyond reactive mitigations.

Plan for Capacity Early
Understand VM quotas and limits before you need them, not after a constraint occurs. Avoid designing around a single SKU. Build flexibility into cluster configurations so you can switch families without re-engineering jobs.

Standardize Compute Configurations
Consistent, policy-driven environments make it easier to adapt when capacity constraints occur. Use Databricks Cluster Policies to constrain cluster creation to approved, available VM families; this prevents teams from inadvertently requesting constrained SKUs.

Move Toward Serverless Where Possible
Serverless compute abstracts capacity management away from the customer. As the Databricks platform expands serverless support, migrating eligible workloads is the most durable long-term strategy. Azure continues to expand infrastructure capacity, but there are no guaranteed timelines for relief in constrained regions.
Note: If your workload supports serverless compute, Databricks recommends using serverless compute instead of pools or classic VM-backed clusters. Serverless removes dependency on specific VM SKUs and regional capacity: scaling is managed by the platform with significantly improved availability.
Reference: https://learn.microsoft.com/en-us/azure/databricks/serverless-compute.
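For workloads that stay on classic compute, the instance pool and cluster policy guidance above can be sketched as two small configuration documents. This is a minimal sketch under stated assumptions: the field names follow the Databricks Instance Pools API and cluster policy definition format as I understand them, but the pool name, node types, and every numeric value are hypothetical placeholders to be replaced with figures from your own capacity approval.

```python
import json

# Hypothetical list of VM sizes your capacity request approved.
APPROVED_NODE_TYPES = ["Standard_E8ds_v5", "Standard_L8s_v3", "Standard_F8s_v2"]

# Cluster policy definition: restrict node types to the approved list
# so teams cannot request a constrained SKU by accident.
cluster_policy_definition = {
    "node_type_id": {
        "type": "allowlist",
        "values": APPROVED_NODE_TYPES,
        "defaultValue": APPROVED_NODE_TYPES[0],
    },
    # Illustrative cap on autoscaling; the value is an assumption.
    "autoscale.max_workers": {"type": "range", "maxValue": 20},
}

# Instance pool request body: min_idle_instances keeps warm nodes
# (Azure VM cost applies to them; no DBU charge while idle).
instance_pool = {
    "instance_pool_name": "prod-etl-pool",  # hypothetical name
    "node_type_id": "Standard_E8ds_v5",
    "min_idle_instances": 4,   # sized to production need only
    "max_capacity": 40,
    "idle_instance_autotermination_minutes": 30,
}


def validate_pool(pool, approved=APPROVED_NODE_TYPES):
    """Basic sanity checks before submitting the pool definition."""
    problems = []
    if pool["node_type_id"] not in approved:
        problems.append("node type is not in the approved list")
    if pool["min_idle_instances"] > pool["max_capacity"]:
        problems.append("min_idle_instances exceeds max_capacity")
    return problems


problems = validate_pool(instance_pool)
print(json.dumps(cluster_policy_definition, indent=2))
print(json.dumps(instance_pool, indent=2))
print("problems:", problems)
```

Keeping the approved SKU list in one place means a SKU swap during a constraint is a one-line change that flows into both the policy and the pool.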
For eligible workloads, including Databricks Jobs (automated workflows), Databricks SQL Warehouses, and Delta Live Tables, serverless compute eliminates VM SKU dependency entirely. Configuration guidance is available in the Azure Databricks deployment guide, Development Section, Step 9.

Multi-Region Strategy for Critical Workloads
For the most critical workloads, evaluate a multi-region deployment as part of your business continuity planning. This is a significant architectural investment (see the FAQ for the full scope), but it is the only approach that provides true regional redundancy. Coordinate this with your Microsoft account team.
Reference: Azure Databricks & Microsoft Fabric Disaster Recovery: The Complete Better‑Together Strategy for Cloud Architects

Final Takeaways
- Capacity issues are infrastructure-level constraints, not Databricks product failures
- VM family selection is critical: do not rely solely on D-series and E-series
- Workload optimization can reduce dependency on scarce resources before requesting more capacity
- Serverless compute is Microsoft’s preferred long-term recommendation for eligible workloads
- Azure On-Demand Capacity Reservations provide guaranteed capacity for mission-critical scenarios, distinct from instance pools (best-effort) and Reserved Instances (billing discount only)
- Architectural flexibility (multi-SKU, multi-region awareness) is your best defense against future constraints

FAQ
Why do retries work?
Capacity in Azure regions is shared across all tenants and fluctuates throughout the day as workloads complete and release VMs. A retry succeeds when capacity temporarily frees up. Retrying during off-peak hours improves success rates significantly.
Why does capacity fluctuate during the day?
Capacity is a function of regional supply and concurrent demand. As workloads complete, nodes are released back to Azure. Peak business hours in the impacted region’s time zone tend to be the tightest windows.
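The retry and SKU-switch mitigations can be combined in a small wrapper: retry each candidate SKU with exponential backoff, then fall back to the next one. The sketch below is a generic pattern, not a Databricks API: `create_cluster` is a hypothetical callable that you would implement against your own provisioning tooling, `CapacityError` is a made-up exception type, and the delays are arbitrary example values.

```python
import time


class CapacityError(Exception):
    """Raised by the (hypothetical) provisioning call when a SKU is constrained."""


def provision_with_fallback(create_cluster, skus, attempts_per_sku=3,
                            base_delay=30.0, sleep=time.sleep):
    """Try each SKU in order, retrying with exponential backoff.

    Returns (sku, result) for the first successful attempt and raises
    CapacityError if every SKU is exhausted.
    """
    for sku in skus:
        for attempt in range(attempts_per_sku):
            try:
                return sku, create_cluster(sku)
            except CapacityError:
                # Back off before retrying the same SKU; move on to the
                # next SKU immediately after the final attempt.
                if attempt < attempts_per_sku - 1:
                    sleep(base_delay * 2 ** attempt)
    raise CapacityError(f"no capacity for any of: {', '.join(skus)}")


# Demo with a fake provisioning call: the v4 SKU is always constrained.
def fake_create_cluster(sku):
    if sku == "Standard_D8s_v4":
        raise CapacityError(sku)
    return f"cluster-on-{sku}"


delays = []  # record the backoff delays instead of actually sleeping
result = provision_with_fallback(
    fake_create_cluster,
    ["Standard_D8s_v4", "Standard_D8s_v5"],
    sleep=delays.append,
)
print(result, delays)
```

Injecting `sleep` keeps the function testable; in production you would leave the default and order `skus` from most preferred to most available.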
Why are instance pools not a hard reservation?
Pools hold a minimum number of nodes on a best-effort basis. Periodic platform events recycle pool nodes, so a pool can briefly fall below its configured minimum idle count while Azure re-acquires replacement nodes. Setting minimum idle to 0 avoids paying for idle VMs at the cost of slower acquisition time. Pools significantly improve availability and startup latency but do not guarantee capacity at the Azure infrastructure level.
Why does serverless behave differently from classic clusters?
Serverless compute removes customer control over individual VM SKUs. Databricks manages the underlying capacity across a shared pool. SKU-swap and pool-based mitigations do not apply. Customer-side levers reduce to retry and off-peak scheduling. The trade-off is that serverless is the simplest and most reliable option when the workload supports it.
Why is changing regions a last resort?
Region changes require redeployment of the Azure Databricks workspace and migration of all dependent artifacts: jobs, clusters, libraries, networking (private endpoints, VNet injection), Unity Catalog assignments, identities, and source data. The destination region must be validated for the same SKU and zonal configuration. For these reasons, region change should always be coordinated with the Microsoft account team and attempted only after preferred mitigations have been exhausted.
Why does VM family selection matter so much for capacity?
Different VM families have different supply curves. D-series and E-series are the most requested Databricks worker families and the ones most frequently constrained. Choosing a SKU based on whether the workload is memory/shuffle-heavy, CPU-bound, or IO-heavy improves both performance and the probability that capacity is available. The capacity team often steers customers toward newer-generation alternatives when supply differs by generation version.
What does the Microsoft account team actually do?
They route the request into the Azure capacity intake process, advise on alternate SKUs and regions, surface zonal vs. regional considerations, and provide forward visibility into known constraints. The customer’s job is to bring a complete, accurate workload profile so the account team can advocate effectively. It is also recommended to open an Azure Support ticket. This saves time later, because the capacity planning teams track issues and requests through support tickets. Once an Azure Support ticket is opened, share the ticket number with the Microsoft account team, or at a minimum with the Customer Success Account Manager (CSAM), if one is assigned to your organization.

Microsoft Fabric to Azure Databricks: Another Better Together Story!
Authors: Oscar Alvarado and Rafia Aqil
Note: This article describes a solution idea. Your cloud architect can use this guidance to help visualize the major components for a typical implementation. Use this article as a starting point to design a well-architected solution that aligns with your workload’s specific requirements.
As organizations adopt Microsoft Fabric as their unified analytics platform, it has become a leading path for ingesting both streaming and batch data into Azure Databricks. This article covers integration approaches through Microsoft Fabric and details the five Fabric-specific paths that connect OneLake/ADLS and Databricks for end-to-end data processing.

Medallion Architecture
The following data flow corresponds to the architecture diagram: data ingested through Microsoft Fabric (via Mirroring, RTI, or Data Factory) lands in OneLake/ADLS. With the medallion pattern, consisting of Bronze, Silver, and Gold storage layers, organizations have flexible access and extendable data processing:
- Bronze – Raw data entry point. Data arrives in its source format and is converted to the open, transactional Delta Lake format.
- Silver – Optimized for BI and data science. ETL and stream processing tasks filter, clean, transform, join, and aggregate Bronze data into curated datasets using SQL, Python, R, or Scala.
- Gold – Enriched data ready for analytics and reporting. Analysts use Power BI, PySpark, SQL, or Excel for insights and queries.

Fabric Integration Paths
Note: This architecture establishes a complete loop-back between Microsoft Fabric and Azure Databricks, enabling Gold layer tables to be seamlessly mirrored back to Microsoft Fabric for dashboarding through Azure Databricks Mirroring.
The following five paths connect Microsoft Fabric to Azure Databricks:
- Fabric Mirroring to OneLake – A low-cost, low-latency turnkey solution that creates a replica of data from operational sources (SQL Server, Azure Cosmos DB, Oracle) in OneLake. Handles the initial load and ongoing CDC changes automatically, keeping data continuously up to date.
- Fabric RTI to OneLake – Fabric Real-Time Intelligence ingests streaming event data into OneLake with sub-second latency, enabling real-time analytics on live event streams.
- Fabric Data Factory to OneLake – Orchestrates ingestion from diverse sources not covered by Mirroring (such as Sybase or REST APIs) and lands data in OneLake, ensuring complete source coverage.
- OneLake to Azure Databricks – Unity Catalog connections to OneLake, secured via Managed Identities from Microsoft Entra ID, allow Databricks to query OneLake data items as a native catalog without data duplication.
- Fabric Data Factory to Azure Databricks (direct) – Orchestrates ingestion from diverse sources directly into Azure Data Lake Storage (ADLS), where Azure Databricks picks up the data for medallion architecture processing.

Design Considerations
Area | Updated guidance
Direct RTI-to-Databricks integration | There is still no broad GA direct integration where Fabric RTI and Databricks operate as one native real-time runtime. Integration should be positioned through open protocols, Event Hubs/Kafka-style patterns, OneLake, Delta, and federation.
OneLake federation in Azure Databricks | OneLake federation in Azure Databricks is now the key integration story. It allows Databricks Unity Catalog to query Fabric Lakehouse and Warehouse data in OneLake without copying it. Access is read-only and depends on Fabric tenant settings, workspace permissions, and Databricks Unity Catalog setup.
RTI data availability to Databricks | Data ingested through Fabric RTI can be made available to Databricks by landing or exposing the data into OneLake-backed items, especially Lakehouse/Warehouse patterns. Eventhouse data can be made available in OneLake in Delta format through OneLake availability, but Databricks OneLake federation should be validated against the specific Fabric item type and access path.
Existing Databricks customers | Existing Databricks customers do not need to abandon Databricks. They can use Fabric RTI as the event ingestion, real-time detection, operational alerting, and business action layer, while continuing to use Databricks for engineering, ML, advanced analytics, and Unity Catalog-governed access.
Activator and business action | Fabric Activator is the cleanest business-user action layer. It can monitor streaming events and trigger Teams messages, email, Power Automate flows, Fabric pipelines, notebooks, Spark jobs, Dataflows, UDFs, and other downstream actions. This is a strong differentiator because it lets business users act on events without waiting for batch analytics.
Operations Agents | Operations Agents are in preview and should be positioned carefully. They monitor real-time data from Eventhouse or ontology sources, surface insights, recommend actions, and can connect to Activator/Power Automate action paths. They are not simply a pre-ingestion decision engine before data lands anywhere; they work from configured Fabric knowledge/data sources.
Before landing in Lakehouse | For decisioning before Lakehouse persistence, use Eventstream processing and Activator rules on streams. For AI-assisted operational recommendations, use Operations Agents once the relevant data is available in Eventhouse or ontology.

Requirement-Specific Notes
Data Ingestion
Microsoft Fabric Mirroring currently supports SQL Server, Azure Cosmos DB, and Oracle as source systems.
For sources not yet supported by Mirroring (such as Sybase or REST APIs), use Fabric Data Factory pipelines to ensure full coverage across all data systems. Once data is in the landing zone with the correct format, Mirroring’s CDC replication starts automatically and manages the complexity of merging changes (updates, inserts, and deletes) into Delta tables, keeping data in Fabric continuously up to date. Learn more about open mirroring
Storage Format and Time Travel
OneLake supports Delta tables, enabling schema evolution and time travel across all data stored in the lakehouse. Learn more about OneLake and Delta tables
Security
- Encryption at rest: OneLake automatically encrypts all data at rest using Microsoft-managed keys, compliant with FIPS 140-2 standards. Learn more
- Encryption in transit: All data in transit is encrypted using TLS 1.2 or higher, securing data movement between Fabric, OneLake, and Azure Databricks. Learn more
Data Governance
OneLake can be registered and scanned by Microsoft Purview, enabling cataloging of stored metadata and data quality profiling. This protects sensitive information, including PHI and PII, across ingestion and analytics workflows. Learn more about Purview with Fabric Lakehouse
Operations and Monitoring
Use the Fabric monitor hub to track pipeline health, Spark application performance, and ingestion job status across all Fabric workloads. Learn more about the Fabric monitor hub

Scenario Details
This architecture applies to any organization that needs to unify streaming and batch data at scale.
Common characteristics include:
- Multiple operational data sources (databases, SaaS applications, event streams)
- A requirement to process both real-time and historical data in the same platform
- Governance and compliance requirements for sensitive data (PHI, PII, financial records)
- Analytics consumers spanning BI (Power BI), data science (Databricks notebooks), and ML workloads

Potential Use Cases
- Healthcare and life sciences – PHI/PII protection via Purview; real-time patient telemetry + batch EHR analytics
- Financial services – Real-time fraud detection streams + batch regulatory reporting
- Retail and e-commerce – Streaming clickstream analytics + batch inventory and supply chain processing
- Energy and utilities – IoT sensor telemetry streaming + batch consumption analytics

Next Steps
- Get started with Microsoft Fabric Mirroring
- Build an ETL pipeline with Lakeflow Declarative Pipelines
- Configure Unity Catalog with OneLake shortcuts
- Monitor Fabric pipelines with the Fabric monitor hub

Designing Reliable Data Platforms: Centralized Failure Logging Framework with Azure Monitor
Introduction
Modern data platforms are no longer just about moving and transforming data. In production, what really matters is reliability and how quickly you can understand and react when something breaks. If you’re using Azure Synapse, ADF, or Microsoft Fabric, you already have built-in monitoring: you can see pipeline runs and error messages. But it doesn’t show you activity-level errors. Pipeline-level error views work well when you’re debugging a single failure, but they don’t scale. Once you have dozens of pipelines running across multiple environments, failures become harder to track. You find yourself jumping between pipeline runs, scanning activity outputs, and trying to piece together what actually happened. Suddenly, simple questions become difficult to answer:
- Which datasets are failing most often?
- Are failures concentrated in Bronze, Silver, or Gold?
- Is this a one-off issue or a recurring pattern?
- What changed between yesterday and today?
At that point, pipeline-level monitoring is no longer enough. You need something more structured.
P.S. The framework can be implemented across both Synapse and Microsoft Fabric environments with minimal changes.

Why we need a custom logging framework
The core issue is that pipeline failures are treated as runtime events, not as data. They live inside pipeline output and are tied to a specific run. This makes them hard to query across time, aggregate across pipelines, correlate across environments, trace to the specific activity that failed inside the pipeline, or integrate into alerting and dashboards in a consistent way. Pipeline failures are visible but activity failures are not, and neither is operationalized. What’s missing is a central place where all failures, including activity-level logs, are captured in a consistent, structured format, regardless of which pipeline or dataset produced them.
That’s where a custom logging framework comes in. Instead of relying only on built-in monitoring, we introduce a layer that captures failures as structured events, standardizes the payload across pipelines, and sends it to Log Analytics, where it can be queried using KQL. This shifts the model from checking a pipeline when it fails to treating failures as a dataset that can be analyzed, monitored, and improved over time. Once you make that shift, you can build alerts based on patterns instead of reacting to single failures, track reliability across datasets or domains, and identify recurring issues instead of dealing with incidents one by one. It also changes who can use the data: visibility is no longer limited to engineers digging into pipeline runs, and becomes accessible at the platform level for leads and stakeholders. This framework doesn’t replace Synapse monitoring. It complements it by adding a proper observability layer on top.

Architecture
When a pipeline fails in Synapse, the failure is intercepted through a dedicated failure path. At this stage, we don’t just log the error as-is. We pass it through a custom logging framework that transforms the failure into a structured payload. This payload includes key context such as pipeline name, activity, environment, dataset, layer (Bronze/Silver/Gold), error details, and correlation identifiers. The important part here is consistency: every pipeline emits the same schema, regardless of its logic. Once the payload is constructed, it is sent to Azure Monitor using the Logs Ingestion API. This API acts as the entry point into the monitoring system and decouples the pipelines from the underlying storage implementation. A Data Collection Rule (DCR) sits behind the ingestion layer and defines how incoming data is handled. It acts as a contract for the payload schema and optionally applies transformations before the data is persisted.
Finally, the logs are stored in a custom Log Analytics table, where they become fully queryable using KQL. At this point, failures are no longer tied to a single pipeline run. They are part of a centralized dataset that can be analyzed across time, environments, and domains.

Setting up Log Analytics
Before integrating the logging framework with Synapse, we first need to set up the destination for our logs. This includes creating a Log Analytics workspace, defining a custom table, and configuring the ingestion path using a Data Collection Rule (DCR). The goal is to create a pipeline where structured failure events can be received, validated, and stored in a consistent format.
P.S. All steps mentioned in this blog can be automated with ARM templates.

1. Create a Log Analytics workspace
Start by creating a Log Analytics workspace. This will act as the central store for all failure logs across your data platform. In the Azure Portal:
- Navigate to Azure Monitor → Log Analytics workspaces
- Create a new workspace in your target subscription and region
- Choose a meaningful name (for example: log-analytics-data-domain)
This workspace becomes the single place where all pipeline failures will be collected and queried.

2. Create a custom table for pipeline failures
Instead of relying on generic tables, we define a dedicated custom table to store pipeline failure events. From the Log Analytics workspace:
- Go to Tables → Create → Custom table (DCR-based)
- Define a table name such as DataDomain_SynapsePipelineErrors_CL (the name must end with the _CL suffix)
At this stage, you’ll define the schema that represents your logging payload. Typical fields include: TimeGenerated, PipelineName, PipelineRunId, ActivityName, ActivityType, Status, ErrorCode, ErrorMessage, Severity, Environment, Layer, DatasetName, PartitionDate, WorkspaceName, CorrelationId.
The key here is consistency. This schema will be reused across all pipelines, so take the time to define it properly.

3. Create a Data Collection Rule (DCR)
The Data Collection Rule defines how incoming data is ingested into Log Analytics. It acts as both a schema contract and a routing mechanism. In Azure Portal:
- Go to Azure Monitor → Data Collection Rules
- Create a new DCR and associate it with your Log Analytics workspace
Within the DCR:
- Define a custom stream (for example: Custom-DataDomain_SynapsePipelineErrors_CL; custom stream names start with the Custom- prefix)
- Map this stream to your custom table
- Optionally define transformations using KQL (for example, renaming fields or enforcing types)
This step is critical because it decouples your pipelines from the storage layer. If the schema evolves later, you can adjust it here without changing pipeline logic.

4. Configure the Logs Ingestion endpoint
Once the DCR is created, Azure generates an ingestion endpoint that will be used by your pipelines. The endpoint follows this pattern, where <dcrId> is the immutable ID of the DCR:
https://<dce>.<region>.ingest.monitor.azure.com/dataCollectionRules/<dcrId>/streams/<streamName>?api-version=2023-01-01
This endpoint is what your Synapse pipeline will call using a Web Activity. At this point, you should also:
- Enable Managed Identity authentication
- Grant the Synapse workspace permission to send data to the DCR
This ensures secure ingestion without using secrets.

5. RBAC for Managed Identity
The Managed Identity used by Synapse or Microsoft Fabric must have the following Azure RBAC role: Monitoring Metrics Publisher. This role allows the identity to send data through the Azure Monitor Logs Ingestion API. The role should be assigned on the Data Collection Rule (DCR) resource. In Azure Portal:
Data Collection Rule (DCR) → Access Control (IAM) → Add Role Assignment → Monitoring Metrics Publisher → Select Synapse/Fabric Managed Identity
Without this role assignment, requests to the Logs Ingestion API will fail with authorization errors such as HTTP 403.

6. Validate the setup
Before integrating with pipelines, it’s a good idea to validate that ingestion works.
You can send a test payload (via Postman or a simple script) and then query your table in Log Analytics:
DataDomain_SynapsePipelineErrors_CL | take 10
If everything is configured correctly, you should see your test records appear.

Integrating with Synapse pipelines
Now that the ingestion layer is ready, the next step is connecting Synapse pipelines so that failures are logged automatically instead of through manual test payloads. The idea is simple: whenever a pipeline activity fails, we capture the failure details, transform them into a structured payload, and send them directly to the Logs Ingestion API. This turns pipeline failures into centralized operational events.

1. Add a failure handling path
Inside your Synapse pipeline, add an On Failure dependency from the activities you want to monitor. Typically, this includes critical activities such as:
- Copy Activities
- Notebook executions
- Stored Procedures
- Data Flows
- Web Activities
Instead of allowing the pipeline to fail silently, the failure path redirects execution into a dedicated logging step. In most production environments, this is implemented as a reusable child pipeline, for example one named Customized Logs API. This keeps logging logic centralized and avoids duplicating the same implementation across dozens of pipelines.

2. Pass failure metadata as parameters
The logging pipeline should receive operational context from the parent pipeline. Typical parameters include: pipeline name, pipeline run ID, activity name, activity type, error code, error message, environment, layer (Bronze/Silver/Gold), dataset name, severity, and correlation ID. This metadata becomes the foundation of the structured logging payload. The more operational context you capture here, the easier troubleshooting becomes later.

3. Construct the logging payload
Inside the logging pipeline [Customized Logs API], use a dynamic content expression to construct a JSON payload matching the Log Analytics schema.
Example payload:
concat(
'[{"TimeGenerated":"', utcNow(),
'","PipelineName":"POC_Test"',
',"PipelineRunId":"', pipeline().RunId,
'","PipelineStatus":"Failed"',
',"ActivityName":"TestActivity"',
',"ActivityType":"Web"',
',"ActivityStatus":"Failed"',
',"ErrorCode":"TEST"',
',"ErrorMessage":"POC test"',
',"Severity":"Warning"',
',"Environment":"Test"',
',"Layer":"Bronze"',
',"ExecutionStage":"POC"',
',"DatasetName":"TestDataset"',
',"PartitionDate":"', utcNow(),
'","WorkspaceName":"', pipeline().DataFactory,
'","TriggerName":"Manual"',
',"TriggerTimeUtc":"', utcNow(),
'","DurationMs":1000',
',"RetryCount":0',
',"Compute":"Synapse"',
',"CorrelationId":"', pipeline().RunId,
'","Payload":{"source":"test","target":"loganalytics"}}]'
)
The important part is schema consistency. Every pipeline should emit the same payload structure regardless of which activity failed. This makes downstream querying and dashboarding significantly easier.

4. Send logs using a Web Activity
After constructing the payload, use a Web Activity to send the data to the Logs Ingestion API endpoint configured earlier. Typical configuration:
URL: https://<data-collection-endpoint>.<region>.ingest.monitor.azure.com/dataCollectionRules/<dcr-id>/streams/<stream-name>?api-version=2023-01-01
Method: POST
Authentication: Managed Identity
Resource: https://monitor.azure.com
Headers: { "Content-Type": "application/json" }
Body: Dynamic JSON payload generated in the previous step.
I highly recommend using Managed Identity, because it avoids storing secrets or credentials inside Synapse pipelines and keeps authentication fully managed by Azure.

5. Validate end-to-end ingestion
Once the pipeline is connected, trigger a controlled failure and verify that the event appears in Log Analytics. Run:
DataDomain_SynapsePipelineErrors_CL | sort by TimeGenerated desc | take 20
You should now see real pipeline failures arriving automatically from Synapse. At this point, the framework becomes fully operational.
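For a quick test from a script rather than from Synapse, the same record shape can be built with a JSON serializer instead of string concatenation, which avoids a malformed body when an error message contains quotes. The following is a minimal Python sketch under stated assumptions: the helper names build_failure_record and build_request_body are hypothetical, the field list is a reduced subset of the schema described above, and the code only builds the request body (it does not call the Logs Ingestion API).

```python
import json
from datetime import datetime, timezone


def build_failure_record(pipeline_name, run_id, activity_name, error_code,
                         error_message, environment, layer, dataset_name):
    """Build one failure event as a dict (hypothetical helper for illustration)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "TimeGenerated": now,
        "PipelineName": pipeline_name,
        "PipelineRunId": run_id,
        "ActivityName": activity_name,
        "ErrorCode": error_code,
        "ErrorMessage": error_message,
        "Severity": "Warning",
        "Environment": environment,
        "Layer": layer,
        "DatasetName": dataset_name,
        # Reuse the run ID as the correlation ID, as the example payload does.
        "CorrelationId": run_id,
    }


def build_request_body(records):
    """Serialize records as a JSON array, the shape used by the example payload."""
    return json.dumps(records)


if __name__ == "__main__":
    record = build_failure_record(
        "POC_Test", "run-123", "TestActivity", "TEST",
        'Copy failed: column "Id" not found', "Test", "Bronze", "TestDataset",
    )
    print(build_request_body([record]))
```

Because json.dumps handles escaping, quotes or backslashes in ErrorMessage cannot break the payload structure, which is harder to guarantee with a hand-built concat expression.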
Failures are no longer isolated runtime events buried inside activity outputs. They are centralized, queryable operational records that can be analyzed across the entire platform.

Future Steps
Now that we have a centralized logging framework in place, we can take observability one step further by building operational dashboards in Power BI or Microsoft Fabric to analyze reliability trends across the entire data platform. Instead of reacting to isolated pipeline failures, we can aggregate logs across pipelines, datasets, environments, and medallion layers to identify what is actually causing instability over time. This allows engineering teams to detect recurring error patterns, identify unstable datasets, measure platform reliability, analyze failure spikes after deployments, and understand where operational bottlenecks are concentrated. By transforming pipeline failures into structured operational telemetry, the framework evolves beyond simple logging into an observability platform that supports proactive reliability engineering. It helps teams move from reactive firefighting to data-driven operational improvements based on measurable reliability KPIs such as failure trends, MTTR, SLA compliance, severity distribution, and pipeline health scoring.

Links
- Tutorial: Send data to Azure Monitor Logs with Logs ingestion API (Azure portal) - Azure Monitor | Microsoft Learn
- Medallion Architecture Understanding with Azure Synapse Analytics Example | by Satyam Gawade | Medium
Feedback: Sally Dabbah | LinkedIn

Building AI apps and agents with Azure Databricks, Copilot Studio, and GitHub Copilot
A workspace-wide Genie MCP endpoint for Copilot Studio Genie is Azure Databricks’ AI agent that lets any employee chat with their data and get trusted answers instantly. Genie Spaces are curated, business‑domain workspaces for teams to find strategic insights for their targeted use cases. Until now, connecting Azure Databricks Genie to Microsoft Copilot Studio meant adding each Genie Space as a separate tool. This works and adds value for customers wanting to integrate a specific Genie Space with Copilot Studio, but the per-space MCP server added overhead when trying to connect multiple Genie spaces to one Copilot Studio agent. The workspace-wide MCP endpoint changes that. One endpoint per workspace gives a Copilot Studio agent access to every connected Genie space and Unity Catalog dataset, and the curated context inside each Genie space stays in place. Key capabilities: Natural-language access across the workspace. Copilot Studio agents can route questions across every connected Genie Space and Unity Catalog dataset without losing the curation that keeps answers accurate. Unity Catalog governance. Access controls are enforced at query time, so existing data permissions extend to every agent built in Copilot Studio. Beyond a single domain. Move from a finance agent or a supply chain agent to a workspace-aware agent that follows users wherever the data lives. Lakebase branching with GitHub Copilot agent mode Production AI agents fail on real-data edge cases that synthetic or mocked environments do not catch. But giving a developer direct production access to investigate is not a realistic option in most enterprises. Lakebase branching, now integrated with GitHub Copilot agent mode, gives you a way to debug against real data without ever connecting to the production database. Key capabilities: Copy-on-write branching. Create a full-fidelity branch of a Lakebase production database in seconds. No data is moved and no production records are altered. 
Native GitHub Copilot agent debugging. Point GitHub Copilot agent mode at the branch endpoint to reproduce, root-cause, and resolve data-dependent issues with AI assistance.
Azure-native end-to-end workflow. The full loop runs across GitHub, Azure Databricks, and Lakebase. No third-party tools or custom infrastructure required.
Compliance built in. Fixes ship through the standard Git-based deployment and compliance workflows already in place, so debug cycles compress from hours to minutes.

What this unlocks for AI agent teams
Together, the two capabilities cover both halves of the agent lifecycle on Azure:
- Author Copilot Studio agents that reason over an entire Azure Databricks workspace through one MCP connection.
- Debug production AI agents against real Lakebase data using GitHub Copilot agent mode, reducing production data risk.
- Keep Unity Catalog governance and existing compliance controls in place from authoring through deployment.
- Standardize the data, agent, and developer toolchain on GitHub, Azure Databricks, and Microsoft 365.

Get started
Both features are available in public preview on June 2, 2026, directly in Azure Databricks workspaces.
- Azure Databricks and Power Platform integration to set up Genie workspace-wide MCP for Copilot Studio
- Connect your GitHub to Azure Databricks to take advantage of Lakebase branching with GitHub Copilot agent mode

From Manual Backfills to Autonomous Pipelines: Building an LLM-Powered Backfill Agent on Azure
Introduction
Data backfills are a common operational requirement in modern data platforms. Missing partitions, upstream delays, or failed pipeline runs often require engineers to manually identify gaps, determine the appropriate recovery window, and trigger reprocessing. This approach does not scale well: it introduces operational overhead, increases the risk of human error, and requires deep knowledge of data dependencies and pipeline behavior. In this post, I describe how to build a backfill agent using Azure AI Foundry, Model Context Protocol (MCP), Azure Functions, Synapse, and ADX. The goal is to automate the decision-making process while keeping execution controlled, observable, and governed. The design separates responsibilities across three layers:
- Decision layer: an LLM-based agent determines whether a backfill is required and defines the recovery scope (e.g., which dates, datasets, or layers)
- Execution layer: an MCP server hosted on Azure Functions exposes controlled operations such as triggering pipelines and querying system state
- State layer: ADX tables maintain backfill control metadata, data availability signals, and execution history
This separation keeps the system flexible while ensuring that all actions are traceable, auditable, and policy-driven. Importantly, this pattern is not limited to a single dataset or pipeline. It can be applied across all datasets and across all layers of a medallion architecture (Bronze, Silver, and Gold), with layer-specific validation rules and backfill strategies. For example, Bronze may focus on completeness of ingestion, while Silver and Gold can enforce data quality and business logic constraints before initiating recovery. The key benefit of a backfill agent is that it shifts backfilling from a manual, reactive process to an automated, intelligent, and consistent workflow.
Instead of engineers investigating incidents and triggering reruns, the agent continuously evaluates data state, identifies gaps, and initiates controlled recovery actions. This reduces operational burden, improves reliability, and ensures faster recovery from data issues while maintaining governance, observability, and strict control over execution.

Architecture Overview
The solution is designed as a controlled orchestration pattern that separates decision-making, execution, and state management. This allows backfill operations to be automated without compromising governance or observability. The architecture consists of the following components.

Logic Apps Trigger
The workflow is initiated using a Logic App. The trigger can be scheduled or invoked on demand, depending on operational requirements. It provides the input context required for the backfill evaluation, such as dataset name, processing layer, and scope constraints (for example, maximum number of dates to process).

Azure AI Foundry Agent (Decision Layer)
The Azure AI Foundry agent acts as the decision layer. It evaluates the request and determines whether a backfill is required, and if so, what scope should be applied. The agent does not interact directly with data systems. Instead, it invokes predefined tools exposed through the MCP server. This ensures that decision logic is flexible, while execution remains controlled.

Azure Function App – MCP Server (Execution Layer)
The Azure Function App hosts the MCP server and exposes a set of operations to the agent. These operations include querying missing partitions, triggering Synapse pipelines, retrieving execution status, and updating control tables. All interactions with external systems (Synapse and ADX) are handled within this layer. It is responsible for input validation, authorization, and enforcing execution rules. This abstraction ensures that infrastructure actions are not directly performed by the agent.
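The input validation mentioned for the execution layer can be sketched as a small guard that runs before any tool touches Synapse or ADX. This is a minimal Python illustration, not the implementation used in this post: the function names validate_backfill_request and clamp_max_dates, the dataset-name pattern, and the cap of 30 dates are all assumptions chosen for the example.

```python
import re
from datetime import datetime

# Assumed rules for illustration only; adjust to your own platform.
ALLOWED_LAYERS = {"Bronze", "Silver", "Gold"}
MAX_DATES_LIMIT = 30
DATASET_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,63}$")


def validate_backfill_request(dataset_name: str, layer: str, partition_date: str) -> dict:
    """Reject malformed tool inputs before they reach a query or pipeline call."""
    if not DATASET_PATTERN.fullmatch(dataset_name or ""):
        raise ValueError(f"Invalid dataset name: {dataset_name!r}")
    if layer not in ALLOWED_LAYERS:
        raise ValueError(f"Invalid layer: {layer!r}")
    try:
        parsed = datetime.strptime(partition_date, "%Y-%m-%d")
    except (TypeError, ValueError):
        raise ValueError(f"Invalid partition date: {partition_date!r}") from None
    return {
        "dataset_name": dataset_name,
        "layer": layer,
        # Re-format so downstream code always sees a normalized yyyy-MM-dd string.
        "partition_date": parsed.strftime("%Y-%m-%d"),
    }


def clamp_max_dates(max_dates: int) -> int:
    """Keep the requested scope between 1 and the assumed upper limit."""
    return max(1, min(int(max_dates), MAX_DATES_LIMIT))
```

Validating on an allow-list basis means the agent can only ask for well-formed dataset names, known layers, and real dates, which matters whenever those values are later interpolated into a query string or passed as pipeline parameters.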
Synapse Pipelines (Processing Layer)
Backfill execution is handled by a parameterized Synapse pipeline. The pipeline follows a consistent pattern:
- Data is first written to a staging table
- Validation is performed
- Data is promoted to the main table only if validation succeeds
This approach ensures data quality and prevents partial or invalid data from being published.

Azure Data Explorer (State and Observability Layer)
ADX is used as the central state store. It maintains control and execution tables that track expected partitions, missing data, pipeline runs, and execution outcomes. This enables:
- Detection of missing partitions
- Idempotent execution (avoiding duplicate processing)
- Full traceability of backfill operations
The agent relies on this state, accessed via the MCP server, to make decisions.

End-to-End Flow
- The Logic App triggers the workflow and passes the request context.
- The Foundry agent evaluates the request.
- The agent invokes an MCP tool to retrieve missing partitions from ADX.
- Based on the result, the agent determines whether a backfill is required.
- If required, the agent invokes an MCP tool to trigger the Synapse pipeline.
- The pipeline executes the backfill using a staging and validation pattern.
- Execution details are written to ADX.
- The agent returns a summary of the operation.

Analytics Layer
Azure Synapse Analytics: In the Synapse workspace, I created a generic parameterized pipeline that has three steps:
1. Copy data from upstream and ingest it into the ADX staging table
2. Run data validation
3. Ingest the staging data into the main dataset table
The pipeline takes the dataset name, partition date, IsBackfill flag, and layer as parameters and ingests the dataset into a Kusto table. Valid values for layer are Bronze, Silver, or Gold.
Kusto: In Kusto, the solution relies on the following tables:
Dataset tables: for example, the Customers table in this demo (the same pattern can be extended to support multiple datasets).
BackfillControl: the central configuration and decision input for the backfill process. It defines which dataset partitions require backfill and provides the metadata needed for the agent to make execution decisions. Each row in this table represents a specific dataset partition (for example, a given date in a specific layer) and its current backfill state.
BackfillExecutionLog: this table is used to track the execution of backfill operations. It provides a complete record of when backfills were triggered, their outcome, and the associated pipeline runs. While the BackfillControl table defines what should be processed, the BackfillExecutionLog captures what actually happened.

Code for creating the tables:
.create table BackfillExecutionLog (
    ExecutionId: string,
    DatasetName: string,
    Layer: string,
    PartitionDate: datetime,
    PipelineName: string,
    PipelineRunId: string,
    TriggeredAt: datetime,
    TriggeredBy: string,
    ExecutionStatus: string,
    Reason: string
)

.create table BackfillControl (
    DatasetName: string,
    Layer: string,
    PartitionDate: datetime,
    BackfillRequired: bool,
    Status: string,
    DQStatus: string,
    RetryCount: int,
    MaxRetryCount: int,
    Reason: string
)

Output examples:

In this demo, Logic Apps, Synapse, and Kusto are treated as existing systems; the focus is how to expose controlled MCP tools from an Azure Function App and connect them to an Azure AI Foundry agent. Microsoft’s Azure Functions MCP extension lets a Function App expose functions as MCP tools, and Foundry can connect to the deployed MCP endpoint.

Steps:
Step1: Create the local Function App project. In VS Code, run the commands:
mkdir backfill-kusto-mcp
cd backfill-kusto-mcp
func init . --worker-runtime python --python

Step2: Implement the MCP tools. Add the requirements to the requirements.txt file:
azure-functions>=1.24.0
azure-identity
azure-kusto-data
requests
python-dotenv

The host.json file defines runtime-level behavior for the Azure Function App.
In this implementation, it is used to configure the MCP extension, logging, and extension bundles.
{
  "version": "2.0",
  "extensions": {
    "mcp": {
      "system": {
        "webhookAuthorizationLevel": "Anonymous"
      }
    }
  },
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      },
      "enableLiveMetricsFilters": true
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle.Experimental",
    "version": "[4.*, 5.0.0)"
  }
}

The local.settings.json file is used to define environment-specific configuration for the Azure Function App during local development. It contains application settings (environment variables) that are read by the Function App at runtime. These settings are not checked into source control and are replaced by App Settings in Azure after deployment. For example:
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "KUSTO_CLUSTER": "https://<ClusterName>.<Region>.kusto.windows.net",
    "KUSTO_DATABASE": "<DatabaseName>",
    "BACKFILL_CONTROL_TABLE": "BackfillControl",
    "BACKFILL_EXECUTION_LOG_TABLE": "BackfillExecutionLog",
    "SYNAPSE_WORKSPACE": "<SynapseWorkspaceName>",
    "SYNAPSE_PIPELINE": "<PipelineName>",
    "AUTH_MODE": "az_login",
    "AZURE_CLIENT_ID": "",
    "DEFAULT_DATASET_NAME": "Customers",
    "DEFAULT_LAYER": "Bronze",
    "MAX_DATES_DEFAULT": "5"
  }
}

For local development, AUTH_MODE is set to az_login. Before deploying to Azure Functions, change AUTH_MODE to MANAGED_IDENTITY in the Function App application settings.
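Since a missing or misspelled setting only shows up when a tool is first invoked, it can help to check the configuration once at startup. The following is a minimal Python sketch of that idea, assuming a hypothetical load_settings helper that is not part of the original code; the setting names are the ones listed in local.settings.json above, and the choice of which ones to treat as required is an assumption.

```python
import os

# Settings this sketch treats as mandatory (an assumption for illustration).
REQUIRED_SETTINGS = ("KUSTO_CLUSTER", "KUSTO_DATABASE", "SYNAPSE_WORKSPACE", "SYNAPSE_PIPELINE")
VALID_AUTH_MODES = {"az_login", "managed_identity"}


def load_settings(env=None) -> dict:
    """Read and validate app settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required app settings: {', '.join(missing)}")

    # Normalize case so AZ_LOGIN and az_login are treated the same way.
    auth_mode = env.get("AUTH_MODE", "MANAGED_IDENTITY").lower()
    if auth_mode not in VALID_AUTH_MODES:
        raise ValueError(f"Unsupported AUTH_MODE: {auth_mode!r}")

    return {
        "kusto_cluster": env["KUSTO_CLUSTER"],
        "kusto_database": env["KUSTO_DATABASE"],
        "synapse_workspace": env["SYNAPSE_WORKSPACE"],
        "synapse_pipeline": env["SYNAPSE_PIPELINE"],
        "auth_mode": auth_mode,
        "max_dates_default": int(env.get("MAX_DATES_DEFAULT", "5")),
    }
```

Passing the environment in as a mapping keeps the helper easy to exercise in a unit test with a plain dict, without touching real environment variables.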
The function_app.py defines the main implementation of MCP server ir: Exposes MCP tools (find_backfill_candidates, trigger_backfill, run_backfill_agent, get_backfill_execution_log) Reads configuration from environment variables Authenticates using Azure CLI (local) or Managed Identity (Azure) Queries BackfillControl in Kusto to identify missing partitions Triggers Synapse pipelines for backfill Writes execution results to BackfillExecutionLog Enforces idempotency by checking if a partition was already triggered Code: import os import uuid import json import logging from datetime import datetime, timezone from urllib.parse import quote import requests import azure.functions as func from azure.identity import ManagedIdentityCredential, AzureCliCredential from azure.kusto.data import KustoClient, KustoConnectionStringBuilder app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS) logging.basicConfig(level=logging.INFO) AUTH_MODE = os.getenv("AUTH_MODE", "MANAGED_IDENTITY").lower() KUSTO_CLUSTER = os.getenv("KUSTO_CLUSTER") KUSTO_DATABASE = os.getenv("KUSTO_DATABASE") CONTROL_TABLE = os.getenv("BACKFILL_CONTROL_TABLE", "BackfillControl") EXECUTION_LOG_TABLE = os.getenv("BACKFILL_EXECUTION_LOG_TABLE", "BackfillExecutionLog") SYNAPSE_WORKSPACE = os.getenv("SYNAPSE_WORKSPACE") SYNAPSE_PIPELINE = os.getenv("SYNAPSE_PIPELINE", "Customer Dataset") DEFAULT_DATASET_NAME = os.getenv("DEFAULT_DATASET_NAME", "Customers") DEFAULT_LAYER = os.getenv("DEFAULT_LAYER", "Bronze") def utc_now() -> str: return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") def log_event(event: str, **properties): logging.info( "MCP_BACKFILL %s", json.dumps( { "event": event, "timestamp_utc": utc_now(), **properties, }, default=str, ), ) def require_setting(name: str, value: str | None): if not value: raise ValueError(f"Missing required app setting: {name}") def escape_kusto_string(value: str | None) -> str: if value is None: return "" return str(value).replace("\\", 
"\\\\").replace('"', '\\"') def get_credential(): if AUTH_MODE == "az_login": return AzureCliCredential() managed_identity_client_id = os.getenv("AZURE_CLIENT_ID") if managed_identity_client_id: log_event( "using_user_assigned_managed_identity", client_id=managed_identity_client_id, ) return ManagedIdentityCredential(client_id=managed_identity_client_id) log_event("using_system_assigned_managed_identity") return ManagedIdentityCredential() def get_kusto_client() -> KustoClient: require_setting("KUSTO_CLUSTER", KUSTO_CLUSTER) if AUTH_MODE == "az_login": kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(KUSTO_CLUSTER) else: managed_identity_client_id = os.getenv("AZURE_CLIENT_ID") if managed_identity_client_id: kcsb = KustoConnectionStringBuilder.with_aad_managed_service_identity_authentication( KUSTO_CLUSTER, client_id=managed_identity_client_id, ) else: kcsb = KustoConnectionStringBuilder.with_aad_managed_service_identity_authentication( KUSTO_CLUSTER ) return KustoClient(kcsb) def execute_kusto_query(query: str): require_setting("KUSTO_DATABASE", KUSTO_DATABASE) client = get_kusto_client() response = client.execute(KUSTO_DATABASE, query) return response.primary_results[0] def execute_kusto_command(command: str): require_setting("KUSTO_DATABASE", KUSTO_DATABASE) client = get_kusto_client() return client.execute_mgmt(KUSTO_DATABASE, command) def find_backfill_candidates_core( dataset_name: str, layer: str, max_dates: int, ) -> list[dict]: dataset = escape_kusto_string(dataset_name) layer_name = escape_kusto_string(layer) query = f""" {CONTROL_TABLE} | where DatasetName == "{dataset}" | where Layer == "{layer_name}" | where BackfillRequired == true | where RetryCount < MaxRetryCount | where Status in ("Missing", "Failed") or DQStatus == "Failed" | top {int(max_dates)} by PartitionDate asc | project DatasetName, Layer, PartitionDate, Status, DQStatus, RetryCount, MaxRetryCount, Reason """ rows = execute_kusto_query(query) return [ { "DatasetName": 
row["DatasetName"], "Layer": row["Layer"], "PartitionDate": str(row["PartitionDate"])[:10], "Status": row["Status"], "DQStatus": row["DQStatus"], "RetryCount": row["RetryCount"], "MaxRetryCount": row["MaxRetryCount"], "Reason": row["Reason"], } for row in rows ] def was_backfill_already_triggered( dataset_name: str, layer: str, partition_date: str, ) -> bool: dataset = escape_kusto_string(dataset_name) layer_name = escape_kusto_string(layer) query = f""" {EXECUTION_LOG_TABLE} | where DatasetName == "{dataset}" | where Layer == "{layer_name}" | where PartitionDate == datetime({partition_date}) | where ExecutionStatus == "Triggered" | summarize Count = count() """ rows = list(execute_kusto_query(query)) return bool(rows and rows[0]["Count"] > 0) def write_execution_log( execution_id: str, dataset_name: str, layer: str, partition_date: str, pipeline_name: str, pipeline_run_id: str, execution_status: str, reason: str, ): command = f""" .set-or-append {EXECUTION_LOG_TABLE} <| print ExecutionId = "{escape_kusto_string(execution_id)}", DatasetName = "{escape_kusto_string(dataset_name)}", Layer = "{escape_kusto_string(layer)}", PartitionDate = datetime({partition_date}), PipelineName = "{escape_kusto_string(pipeline_name)}", PipelineRunId = "{escape_kusto_string(pipeline_run_id)}", TriggeredAt = datetime({utc_now()}), TriggeredBy = "FoundryMCPBackfillAgent", ExecutionStatus = "{escape_kusto_string(execution_status)}", Reason = "{escape_kusto_string(reason)}" """ execute_kusto_command(command) def trigger_synapse_pipeline( dataset_name: str, layer: str, partition_date: str, ) -> str: require_setting("SYNAPSE_WORKSPACE", SYNAPSE_WORKSPACE) require_setting("SYNAPSE_PIPELINE", SYNAPSE_PIPELINE) credential = get_credential() token = credential.get_token("https://dev.azuresynapse.net/.default").token encoded_pipeline_name = quote(SYNAPSE_PIPELINE, safe="") url = ( f"https://{SYNAPSE_WORKSPACE}.dev.azuresynapse.net" f"/pipelines/{encoded_pipeline_name}/createRun" 
f"?api-version=2020-12-01" ) payload = { "DatasetName": dataset_name, "Layer": layer, "PartitionDate": partition_date, "IsBackfill": True, } response = requests.post( url, headers={ "Authorization": f"Bearer {token}", "Content-Type": "application/json", }, json=payload, timeout=30, ) log_event( "synapse_create_run_response", status_code=response.status_code, body=response.text[:2000], ) response_json = {} try: response_json = response.json() except Exception: pass if "runId" in response_json: return response_json["runId"] raise Exception( f"Synapse trigger failed. " f"StatusCode={response.status_code}. " f"Body={response.text}" ) def trigger_backfill_core( dataset_name: str, layer: str, partition_date: str, ) -> dict: execution_id = str(uuid.uuid4()) log_event( "trigger_backfill_started", execution_id=execution_id, dataset_name=dataset_name, layer=layer, partition_date=partition_date, ) try: if was_backfill_already_triggered(dataset_name, layer, partition_date): return { "ExecutionId": execution_id, "DatasetName": dataset_name, "Layer": layer, "PartitionDate": partition_date, "Status": "Skipped", "Reason": "Backfill was already triggered for this partition.", } pipeline_run_id = trigger_synapse_pipeline( dataset_name=dataset_name, layer=layer, partition_date=partition_date, ) write_execution_log( execution_id=execution_id, dataset_name=dataset_name, layer=layer, partition_date=partition_date, pipeline_name=SYNAPSE_PIPELINE, pipeline_run_id=pipeline_run_id, execution_status="Triggered", reason="Triggered by Foundry MCP backfill agent", ) return { "ExecutionId": execution_id, "DatasetName": dataset_name, "Layer": layer, "PartitionDate": partition_date, "PipelineName": SYNAPSE_PIPELINE, "PipelineRunId": pipeline_run_id, "Status": "Triggered", } except Exception as ex: error_message = str(ex) log_event( "trigger_backfill_failed", execution_id=execution_id, dataset_name=dataset_name, layer=layer, partition_date=partition_date, error=error_message, ) try: 
            write_execution_log(
                execution_id=execution_id,
                dataset_name=dataset_name,
                layer=layer,
                partition_date=partition_date,
                pipeline_name=SYNAPSE_PIPELINE or "",
                pipeline_run_id="",
                execution_status="FailedToTrigger",
                reason=error_message,
            )
        except Exception as log_ex:
            log_event(
                "failed_to_write_execution_log",
                execution_id=execution_id,
                original_error=error_message,
                log_error=str(log_ex),
            )

        return {
            "ExecutionId": execution_id,
            "DatasetName": dataset_name,
            "Layer": layer,
            "PartitionDate": partition_date,
            "Status": "FailedToTrigger",
            "Error": error_message,
        }


def run_backfill_agent_core(
    dataset_name: str,
    layer: str,
    max_dates: int,
) -> list[dict]:
    log_event(
        "run_backfill_agent_started",
        dataset_name=dataset_name,
        layer=layer,
        max_dates=max_dates,
    )

    candidates = find_backfill_candidates_core(
        dataset_name=dataset_name,
        layer=layer,
        max_dates=max_dates,
    )

    results = []

    for candidate in candidates:
        result = trigger_backfill_core(
            dataset_name=candidate["DatasetName"],
            layer=candidate["Layer"],
            partition_date=candidate["PartitionDate"],
        )
        results.append(result)

    log_event(
        "run_backfill_agent_completed",
        dataset_name=dataset_name,
        layer=layer
    )

    return results


def get_execution_log_core(
    dataset_name: str,
    limit: int,
) -> list[dict]:
    dataset = escape_kusto_string(dataset_name)

    query = f"""
    {EXECUTION_LOG_TABLE}
    | where DatasetName == "{dataset}"
    | top {int(limit)} by TriggeredAt desc
    | project ExecutionId, DatasetName, Layer, PartitionDate, PipelineName, PipelineRunId, TriggeredAt, TriggeredBy, ExecutionStatus, Reason
    """

    rows = execute_kusto_query(query)

    return [
        {
            "ExecutionId": row["ExecutionId"],
            "DatasetName": row["DatasetName"],
            "Layer": row["Layer"],
            "PartitionDate": str(row["PartitionDate"])[:10],
            "PipelineName": row["PipelineName"],
            "PipelineRunId": row["PipelineRunId"],
            "TriggeredAt": str(row["TriggeredAt"]),
            "TriggeredBy": row["TriggeredBy"],
            "ExecutionStatus": row["ExecutionStatus"],
            "Reason": row["Reason"],
        }
        for row in rows
    ]


@app.mcp_tool()
@app.mcp_tool_property(arg_name="dataset_name", description="Dataset name, for example Customers.")
@app.mcp_tool_property(arg_name="layer", description="Layer name, for example Bronze.")
@app.mcp_tool_property(arg_name="max_dates", description="Maximum number of dates to return.")
def find_backfill_candidates(
    dataset_name: str = DEFAULT_DATASET_NAME,
    layer: str = DEFAULT_LAYER,
    max_dates: int = 5,
) -> list[dict]:
    return find_backfill_candidates_core(dataset_name, layer, max_dates)


@app.mcp_tool()
@app.mcp_tool_property(arg_name="dataset_name", description="Dataset name, for example Customers.")
@app.mcp_tool_property(arg_name="layer", description="Layer name, for example Bronze.")
@app.mcp_tool_property(arg_name="partition_date", description="Partition date in yyyy-MM-dd format.")
def trigger_backfill(
    dataset_name: str,
    layer: str,
    partition_date: str,
) -> dict:
    return trigger_backfill_core(dataset_name, layer, partition_date)


@app.mcp_tool()
@app.mcp_tool_property(arg_name="dataset_name", description="Dataset name, for example Customers.")
@app.mcp_tool_property(arg_name="layer", description="Layer name, for example Bronze.")
@app.mcp_tool_property(arg_name="max_dates", description="Maximum number of dates to trigger.")
def run_backfill_agent(
    dataset_name: str = DEFAULT_DATASET_NAME,
    layer: str = DEFAULT_LAYER,
    max_dates: int = 5,
) -> list[dict]:
    return run_backfill_agent_core(dataset_name, layer, max_dates)


@app.mcp_tool()
@app.mcp_tool_property(arg_name="dataset_name", description="Dataset name, for example Customers.")
@app.mcp_tool_property(arg_name="limit", description="Maximum number of execution log rows to return.")
def get_backfill_execution_log(
    dataset_name: str = DEFAULT_DATASET_NAME,
    limit: int = 10,
) -> list[dict]:
    return get_execution_log_core(dataset_name, limit)

Step3: Run locally
1. Create and activate a virtual environment:
python -m venv .sally-env
.\.sally-env\Scripts\activate
2.
Install dependencies:
pip install -r requirements.txt
npm install -g azurite
3. Open two terminals. In the first terminal, run:
azurite
4. In the second terminal, log in to Azure:
az login
Then start the function app:
func start
P.S. Make sure to set "AUTH_MODE": "az_login" in the local.settings.json file.

Step4: Create Azure resources and deploy
# LOGIN
az login

# VARIABLES
$RG="rg-backfill-kusto-mcp-demo"
$LOCATION="westeurope"
$STORAGE="stbackfillmcp$((Get-Random -Minimum 10000 -Maximum 99999))"
$FUNCAPP="func-backfill-kusto-mcp-$((Get-Random -Minimum 10000 -Maximum 99999))"

# CREATE RESOURCE GROUP
az group create --name $RG --location $LOCATION

# CREATE STORAGE ACCOUNT
az storage account create `
  --name $STORAGE `
  --resource-group $RG `
  --location $LOCATION `
  --sku Standard_LRS

# CREATE FUNCTION APP
az functionapp create `
  --resource-group $RG `
  --consumption-plan-location $LOCATION `
  --runtime python `
  --runtime-version 3.11 `
  --functions-version 4 `
  --name $FUNCAPP `
  --storage-account $STORAGE `
  --os-type Linux

# ENABLE MANAGED IDENTITY
az functionapp identity assign `
  --resource-group $RG `
  --name $FUNCAPP

# GET PRINCIPAL ID
$FUNC_PRINCIPAL_ID = az functionapp identity show `
  --resource-group $RG `
  --name $FUNCAPP `
  --query principalId `
  --output tsv
Write-Host "Function App Principal ID: $FUNC_PRINCIPAL_ID"

# CONFIGURE APP SETTINGS
az functionapp config appsettings set `
  --resource-group $RG `
  --name $FUNCAPP `
  --settings `
    AUTH_MODE=MANAGED_IDENTITY `
    KUSTO_CLUSTER="https://<ClusterName>.<Region>.kusto.windows.net" `
    KUSTO_DATABASE="<DatabaseName>" `
    BACKFILL_CONTROL_TABLE="BackfillControl" `
    BACKFILL_EXECUTION_LOG_TABLE="BackfillExecutionLog" `
    SYNAPSE_WORKSPACE="<SynapseWorkspaceName>" `
    SYNAPSE_PIPELINE="<PipelineName>" `
    DEFAULT_DATASET_NAME="Customers" `
    DEFAULT_LAYER="Bronze" `
    MAX_DATES_DEFAULT="5"

# DEPLOY FUNCTION APP (RUN FROM PROJECT FOLDER)
func azure functionapp publish $FUNCAPP

# GET MCP ENDPOINT
$MCP_ENDPOINT="https://$FUNCAPP.azurewebsites.net/runtime/webhooks/mcp"
Write-Host "MCP Endpoint: $MCP_ENDPOINT"

# GET MCP KEY
$MCP_KEY = az functionapp keys list `
  --resource-group $RG `
  --name $FUNCAPP `
  --query "systemKeys.mcp_extension" `
  --output tsv
Write-Host "MCP Key: $MCP_KEY"

# TEST MCP TOOL
$body = @{
  jsonrpc = "2.0"
  id = "1"
  method = "tools/call"
  params = @{
    name = "find_backfill_candidates"
    arguments = @{
      dataset_name = "Customers"
      layer = "Bronze"
      max_dates = 1
    }
  }
} | ConvertTo-Json -Depth 10

Invoke-RestMethod `
  -Uri $MCP_ENDPOINT `
  -Method POST `
  -Headers @{
    Accept = "application/json, text/event-stream"
    "x-functions-key" = $MCP_KEY
  } `
  -ContentType "application/json" `
  -Body $body

After deployment, grant the Function App's managed identity the appropriate permissions in both Kusto and Synapse, using the Function App principal ID printed earlier. This allows the Function App to query Kusto tables and trigger Synapse pipelines.

Step5: Connect the MCP server to Azure AI Foundry
- Go to the Azure AI Foundry portal
- Navigate to your Project
- Open your Agent
- Add MCP as a tool: go to Tools, click Add Tool, and select Custom → Model Context Protocol (MCP)
- Configure the custom MCP and click Save
MCP endpoint: https://<function-app-name>.azurewebsites.net/runtime/webhooks/mcp

Step6: Define Agent instructions
You are a Backfill Reliability Agent.
You MUST use the backfill_agent MCP tool.
Do NOT ask the user for candidate dates.
When asked to run backfill:
Find the dataset name.
1. Call find_backfill_candidates with dataset_name, layer, max_dates
2. Then call run_backfill_agent with dataset_name, layer, max_dates
Return the PipelineRunId.

Note: These instructions are very generic; modify them to fit your business scenario.
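Before handing the endpoint to an agent, the Step4 smoke test can also be scripted in Python. The following is a minimal sketch that mirrors the PowerShell request using only the standard library. The helper name `build_tools_call_request` is hypothetical, the endpoint and key are placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_tools_call_request(
    endpoint: str, key: str, tool: str, arguments: dict
) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC tools/call request for the MCP endpoint."""
    body = {
        "jsonrpc": "2.0",
        "id": "1",
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/json, text/event-stream",
            "Content-Type": "application/json",
            "x-functions-key": key,
        },
    )


# Placeholder values: substitute the endpoint and key printed by the deployment script.
request = build_tools_call_request(
    endpoint="https://<function-app-name>.azurewebsites.net/runtime/webhooks/mcp",
    key="<mcp-key>",
    tool="find_backfill_candidates",
    arguments={"dataset_name": "Customers", "layer": "Bronze", "max_dates": 1},
)
# To send it once real values are in place:
# urllib.request.urlopen(request, timeout=30)
```

Keeping construction separate from sending makes it easy to inspect the payload before any call reaches the Function App.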
Step7: Test prompt
Now, in Synapse Monitor, search for PipelineRunId: df1b1920-09dd-415b-bbe9-d810d8505f58

Future Enhancements:
The backfill agent automates recovery by detecting missing or failed data and triggering controlled reprocessing via MCP. It can scale across all datasets and medallion layers (Bronze, Silver, Gold) with layer-specific rules. The design can evolve into a multi-agent workflow. For example, if a backfill fails multiple times, a notification agent can automatically send emails or create incidents for upstream teams. Overall, this shifts backfilling from a manual, reactive task to an automated, governed, and intelligent data operations process.

Links:
- Tutorial: Host an MCP server on Azure Functions | Microsoft Learn
- Quickstart: Set up Microsoft Foundry resources - Microsoft Foundry | Microsoft Learn
- Quickstart: Connect Azure Data Explorer to an Azure Synapse Analytics workspace - Azure Synapse Analytics | Microsoft Learn

Would love to hear your feedback: Sally Dabbah | LinkedIn

Secure Medallion Architecture Pattern on Azure Databricks (Part II)
Disclaimer: The views in this article are my own and do not represent Microsoft or Databricks.

This article is part of a series focused on deploying a secure Medallion Architecture. The series follows a top-down approach, beginning with a high-level architectural perspective and gradually drilling down into implementation details using repeatable code. In this part we will discuss the implementation of the pattern using GitHub Copilot.

If you missed it, please read the first part of this blog series first. It can be found at: Secure Medallion Architecture Pattern on Azure Databricks (Part I).

I waited a while before publishing this article. Partly due to other priorities, but also because I wanted to experiment with deploying infrastructure and data pipelines using agents. At that point, I was looking to leverage agents with a spec-driven approach, and through using GitHub Copilot, I learned what skills are and how I can use them to achieve my goals.

In this blog I'll share what I learned using GitHub Copilot for spec-driven development. I'll use the content from my previous article, Secure Medallion Architecture Pattern on Azure Databricks (Part I), as a technical specification to extract implementation details and generate two outputs:
- Terraform code for infrastructure, platform configuration, and deployment
- Databricks Declarative Automation Bundles for jobs, pipelines, and other deployment-ready workload resources

I've tried not to overfit the prompts within the skills I've developed, so they remain portable to other technical articles, not just the one mentioned in this blog.

Separate the platform from the workload
When I started the design, I decided to modularise the automation scripts by separating the platform from the actual data platform workloads.
I assigned networking, storage, identities, secret scopes, and workspace configuration to Terraform, while Databricks notebook runs, job clusters, pipelines, and environment-specific deployments were developed within Databricks Declarative Automation Bundles (formerly known as Databricks Asset Bundles).

That may sound obvious, but it's exactly where generated code often goes wrong. Without explicit instructions, AI tools tend to blur these boundaries and produce one oversized block of configuration. That's why my Copilot skill needs to enforce a clear contract:
- Infer the architecture from the article
- Identify what is explicit and what is assumed
- Emit Terraform only for infrastructure concerns
- Emit bundle files only for workload concerns
- Leave placeholders for anything the article does not specify

That last point is critical. A blog post or low-level technical specification is not a source of truth for account IDs, hostnames, catalog names, secret values, or subnet IDs. Good automation should never fabricate those values. Instead, I decided to produce a starter implementation with TODO markers wherever environment-specific values are required.

Skills are a great way to get more consistent, repeatable output across runs, so I decided to use them for this project. I could have used one of the tools listed in the table below, but I chose to go my own way and develop a Spec-Driven Development (SDD) framework, which I hope will continue to improve over time.

| Tool | Creator | Type | Link | Description |
| --- | --- | --- | --- | --- |
| GitHub Spec Kit | GitHub | Open source | github/spec-kit | Turns feature ideas into specs, plans, and task lists before any code is written. Works with multiple AI coding agents. Specification first, code as generated output. |
| BMAD Method | BMad Code LLC | Open source | bmad-code-org/BMAD-METHOD | An AI-driven agile framework with specialised agents covering the full lifecycle from ideation to deployment. Scale-adaptive: adjusts planning depth from a bug fix to an enterprise system. |
| OpenSpec | Fission AI | Open source | Fission-AI/OpenSpec | Lightweight spec layer that sits above your existing AI tools. Each change gets a proposal, specs, design, and task list. No rigid phase gates, no IDE lock-in. |

What are skills, and why are they a good fit?
Skills are essentially reusable prompt modules that aim to make LLMs produce repeatable answers. Within a skill, I define the behavior and then attach supporting resources or scripts so Copilot can perform the task consistently. That means a skill can do more than just "write some code." A skill can define a repeatable workflow like this:
- Fetch the blog URL
- Extract headings, paragraphs, and code snippets
- Normalize the article into a lightweight implementation spec
- Decide what belongs in Terraform
- Decide what belongs in the Databricks bundle
- Generate files in a predictable project structure
- Produce a TODO.md file for unresolved values

This approach turns Copilot from a generic assistant into a specialized code-conversion tool. However, there are some constraints I had to be mindful of when developing skills:
- Context window limits. The model has limited space to read instructions, process input, and generate output. Long prompts can cause files to be cut off or steps to be skipped.
- Non-determinism. Output may vary between runs, even with strict instructions. I always lint, validate, and review the diff before committing.
- Boundary leakage. Models may invent plausible but incorrect values. The TODO.md pattern must be enforced as a rule, not a suggestion.
- Model and tool drift. Copilot's model and tool surface change over time. I use example inputs and outputs as repeatable sanity checks.
- Maintainability. A skill is code-as-prompt and will age with the platforms it targets. I keep skills narrowly scoped so they stay easy to update.

I'll explain the TODO.md file in more detail later in this post.
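The boundary-leakage constraint lends itself to a mechanical check. The sketch below is one hypothetical way to flag GUID-looking literals in generated text so a reviewer can confirm that none were invented by the model. The function name `find_guid_literals` and the sample Terraform lines are assumptions for illustration and are not taken from the repository.

```python
import re

# Matches the canonical 8-4-4-4-12 hexadecimal GUID layout.
GUID_PATTERN = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def find_guid_literals(text: str) -> list[tuple[int, str]]:
    """Return (line_number, guid) pairs for every GUID-like literal in the text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in GUID_PATTERN.finditer(line):
            findings.append((line_number, match.group(0)))
    return findings


# Illustrative input only: one value is deferred via a TODO, one is hardcoded.
generated = '''
variable "tenant_id" {
  type = string  # TODO: supply via TF_VAR_tenant_id
}
subscription_id = "11111111-2222-3333-4444-555555555555"
'''

leaks = find_guid_literals(generated)
for line_number, guid in leaks:
    print(f"line {line_number}: hardcoded identifier {guid}")
```

A check like this cannot tell a fabricated ID from a legitimate one; it only surfaces candidates so the TODO rule can be enforced during review.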
The GitHub repo
The repository can be found at MarcoScagliola/CopilotBlogToCode.

Below is a script I added that, when invoked, deletes all the files produced by the skills, so you can test the repo from a clean state.
python .github/skills/blog-to-databricks-iac/scripts/reset_generated.py --force;

If you want to try it out, please clone the repo and run it on your own copy. In GitHub Copilot, I usually keep:
- Model set to Auto
- For the Configure Tools option, just the built-in tools selected

Below you can find the prompt that I use to run the skills and have the blog analysed.
Use the blog-to-databricks-iac skill on this article: https://techcommunity.microsoft.com/blog/analyticsonazure/secure-medallion-architecture-pattern-on-azure-databricks-part-i/4459268
Inputs:
workload: blg
environment: dev
azure_region: uksouth
github_environment:

To make this more repeatable and less manual, I've added a prompt file at run-blogToDatabricksIac-selected-tools.prompt.md, which can be run directly from VS Code by opening the file and clicking the run button at the top. Feel free to experiment with it and let me know what you think. Further instructions on how to use the repo are available in READ_FIRST.md.

Below you will find the exact repository setup I used for this workflow, starting with my initial configuration and ending with the final directory structure and files.

1. Create a new GitHub repository and clone it locally
I started by creating a new repository on GitHub, then cloned it to my local machine so I could add the Copilot skill, Terraform scaffolding, and Databricks bundle files in a centralized location.
git clone https://github.com/YOUR-ORG/blog-to-databricks-iac.git
cd blog-to-databricks-iac
This approach keeps the workflow organised from the start: the repository exists on GitHub first, and the local clone becomes the working directory for all subsequent setup steps.

2.
Create the GitHub skill folder structure (first iteration)
GitHub Copilot skills are file-based and centered on a SKILL.md file inside a skill folder. GitHub's current pattern places these under .github/skills/. I used the script below to create the folder hierarchy for my initial integration.
mkdir -p .github/skills/blog-to-databricks-iac/scripts
mkdir -p .github/skills/blog-to-databricks-iac/templates
mkdir -p infra/terraform
mkdir -p databricks-bundle/resources
mkdir -p databricks-bundle/src
This script generates the structure depicted below.

3. Add the main skill definition
Next, I created the SKILL.md file at .github/skills/blog-to-databricks-iac/. The orchestrator decides what happens and in what order, while each specialist decides what its own file should contain (for example, the Terraform specialist owns the Terraform, the bundle specialist owns the bundle, and so on). In practice, SKILL.md turns Copilot from a general assistant into a domain-specific generator for this repo. GitHub documents this SKILL.md-based structure as the foundation of agent skills. My first iteration of .github/skills/blog-to-databricks-iac/SKILL.md was very simple and can be found here.

4. Add a script to fetch and normalize the blog article
Next, I created a Python script that the main orchestrator SKILL.md invokes to read the blog article. This script is stored at .github/skills/blog-to-databricks-iac/scripts/ and named fetch_blog.py. Within SKILL.md, the script is invoked as shown below.

### 1. Fetch article
```bash
python .github/skills/blog-to-databricks-iac/scripts/fetch_blog.py "<url>"
```
If fetch fails, stop and return the fetch error output. Do not retry; surface the error to the user and wait for guidance.

The script validates the URL, fetches the HTML with a 30-second timeout, and uses a spoofed Mozilla User-Agent to avoid being blocked by CDNs (Content Delivery Networks).
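The validate-and-fetch behaviour just described could look roughly like the sketch below. It uses only the standard library and is not the code from fetch_blog.py: the function names and the exact User-Agent string are assumptions, since the article does not show them.

```python
import urllib.request
from urllib.parse import urlparse

TIMEOUT_SECONDS = 30
# Browser-like User-Agent; the exact string used by the real script is not shown.
USER_AGENT = "Mozilla/5.0 (compatible; blog-fetch-sketch)"


def validate_url(url: str) -> str:
    """Accept only absolute http(s) URLs; raise ValueError otherwise."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Unsupported or malformed URL: {url!r}")
    return url


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying the browser-like User-Agent."""
    return urllib.request.Request(validate_url(url), headers={"User-Agent": USER_AGENT})


def fetch_html(url: str) -> str:
    """Fetch the page body as text, giving up after TIMEOUT_SECONDS."""
    with urllib.request.urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Validating before any network call means a malformed URL fails fast with a clear error instead of surfacing later as a confusing connection problem.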
It reads through the HTML one tag at a time, flagging when it enters relevant sections like paragraphs, headings, or code blocks, and buffering text until the tag closes. Before storing anything, it cleans the text by decoding HTML entities, collapsing whitespace, and trimming edges.

As it parses, the script also scans for cloud platform keywords: AWS, S3, Azure, ADLS, GCP, Google Cloud. The first match wins; if none are found, it returns unknown. This is a quick heuristic, not authoritative.

Finally, it outputs clean JSON with the extracted data: title, headings, paragraphs, code blocks, and cloud hint, capped at reasonable sizes to keep the output manageable. If anything goes wrong, such as a network error, timeout, bad HTML, or empty content, the script exits cleanly with a structured error message, making it easy to integrate into larger workflows without surprises. The Python script can be found here.

5. The output and output contract
Now I needed to think about the output I wanted GitHub Copilot to deliver through the skills. To reiterate, I needed the following:

| File Name | Description |
| --- | --- |
| README.md | This is the operator-facing runbook that turns the generated artifacts into a working deployment. It contains no unresolved placeholders and no embedded credentials. The header summarizes the architecture and links back to the source blog. A prerequisites section lists required Azure access, Entra permissions, GitHub Environment setup, and local CLI versions. It includes tables of always-required GitHub secrets and variables, plus conditional ones based on deployment mode. Step-by-step numbered sections walk through bootstrapping the deployment principal and populating the GitHub Environment. Workflow blocks describe each Terraform validation, infrastructure deployment, and DAB deployment step, including file paths, triggers, and outputs. A commands section lists the exact Terraform and Databricks bundle sequences to run. Finally, assumption notes point the operator to TODO.md and SPEC.md for context. |
| TODO.md | The operator's checklist of remaining tasks. Every entry uses a strict format, a heading followed by five sections (What this is, Why deferred, Source, Resolution, Done looks like), with no commands or code, only concepts and decisions. The entries are grouped by the layer of operator work they belong to: pre-deployment tasks like RBAC roles and GitHub secrets, deployment-time inputs like region and environment, post-infrastructure setup like Key Vault secrets and external locations, post-DAB work like Unity Catalog grants and job schedules, and architectural choices the orchestrator couldn't make (network posture, schemas, partitioning). Every entry comes from something the article left unstated, plus the universal post-deploy work for any Databricks deployment. The operator works through TODO.md sequentially, resolving each item before the system is production-ready. |
| SPEC.md | The structured, source-faithful read of the blog article, organized by checklist. Every item is marked as a stated value, inferred from code or diagrams, or "not stated in article." It includes architecture details, Azure services configuration, Databricks setup, data model, security and identity requirements, and observations. SPEC.md is the single source of truth that the Terraform and DAB generators read from, TODO.md is populated from every "not stated" entry, and README.md references it for assumptions. This ensures the deployment is built on documented decisions, not hidden assumptions. |

Together, these files create a clear boundary: SPEC.md answers what the blog says, TODO.md captures what's missing or must be decided, and README.md tells you exactly how to deploy. This split is enforced by validation rules that fail if any content duplicates across the three files.
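A duplication rule of this kind could be approximated as in the sketch below. This is a hedged illustration rather than the repository's validator: the function names, the line-level comparison, and the `min_length` threshold (used to ignore short headings and labels) are all assumptions.

```python
from itertools import combinations


def normalise(line: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not hide a duplicate."""
    return " ".join(line.lower().split())


def find_duplicates(documents: dict[str, str], min_length: int = 40) -> list[tuple[str, str, str]]:
    """Return (file_a, file_b, line) for every substantial line shared by two documents."""
    lines_by_file = {
        name: {
            normalise(line)
            for line in text.splitlines()
            if len(normalise(line)) >= min_length
        }
        for name, text in documents.items()
    }
    duplicates = []
    for file_a, file_b in combinations(sorted(lines_by_file), 2):
        for line in sorted(lines_by_file[file_a] & lines_by_file[file_b]):
            duplicates.append((file_a, file_b, line))
    return duplicates


# Illustrative content only.
docs = {
    "SPEC.md": "Region: not stated in article\n"
               "The workspace uses a premium SKU with secure cluster connectivity.",
    "TODO.md": "Decide the Azure region before the first deployment run.",
    "README.md": "the workspace uses a premium  SKU with secure cluster connectivity.",
}

violations = find_duplicates(docs)
```

In a validation step, a non-empty `violations` list would be treated as a failure, pushing the duplicated sentence back into the single file that owns it.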
To make these files as repeatable as possible, I needed two things:
- Two templates, one for README.md and one for TODO.md, that the orchestrator fills in from SPEC.md at generation time.
- A broader delivery contract, output-contract.md, which lists the five files the orchestrator must produce. README.md and TODO.md are two of those five, and the templates are how they get produced.

The output-contract.md file defines a strict, ordered format that the agent must follow when transforming a blog article about Databricks-on-Azure architecture into a runnable repository. The first commit was deliberately minimal, as you can see from the file available here. No leaf-skill routing, no repo-context.md, no GitHub Actions workflows, no validation rules, no entry-field templates for TODO.md. That commit's single job was to lock down the shape of the output: what gets produced and in what order. Every commit since has refined how to produce that shape without changing what gets produced.

Putting the contract in the very first commit gave every later change a fixed reference point. Every leaf skill, generator script, and validation rule I've added since has fit into one of its five sections. The pipeline has changed; the deliverables haven't. The structure of the GitHub repo at commit 17ab443 can be seen in the picture below.

6. The README.md and TODO.md templates
After iteratively working on the orchestrator, a clear pattern emerged: the code-generation paths were fairly stable, but the documentation outputs weren't. Every run produced README.md and TODO.md from scratch in free-form Markdown. Across runs, the same content kept drifting. Section ordering changed between runs, and the explanation of GitHub Environments was rewritten with subtle wording differences. RBAC roles appeared sometimes as lists, sometimes in prose, sometimes split across sections.
Universal post-deploy actions (create the secret scope, populate the vault, set up Unity Catalog grants) were re-derived every time, occasionally with steps missing. The root cause was that the orchestrator was treating durable, universal content as if it were per-run content.

So I decided to add two templates: README.md.template and TODO.md.template. The templates keep universal content (RBAC, TODO sections, GitHub setup) in the template itself and substitute per-workload content (catalog names, credentials) from SPEC.md. This delivers consistency across runs. The README and TODO are structurally identical from run to run, so readers can navigate them intuitively. Universal content is correct by construction; I write it once, review it carefully, and every run inherits that quality. Validation also becomes more precise, and the agent's job shrinks from open-ended writing to mechanical substitution, which is easier to validate and maintain.

Templates introduce clear vocabulary: {placeholder} is filled by the orchestrator at generation time, by the deployer at run time. Finally, templates enforce traceability: every "not stated in article" entry in SPEC.md automatically becomes a TODO entry via the from SPEC.md slot, making this an automatically-enforced rule.

I'm invoking the templates in the orchestrator as shown below. The Git commit with this code can be found at this link.

### 3.1 Generate README from template
Load the template: `.github/skills/blog-to-databricks-iac/templates/README.md.template`
### 3.2 Generate TODO from template
Load the template: `.github/skills/blog-to-databricks-iac/templates/TODO.md.template`

7. The output of the fetch_blog.py file and the interaction with the orchestrator
When the orchestrator invokes fetch_blog.py, the script produces a JSON output and passes it back to the orchestrator. The orchestrator then reads the JSON document into its working context and maps each field onto an analysis checklist.
The title and meta description establish the article identity and scope. Headings with their levels reveal the structure, helping the agent locate sections about architecture, security, data flow, and naming. Paragraphs provide evidence for stated values like regions, resource types, and RBAC models. Code blocks become the source of inferred values. As an example, a Terraform snippet might reveal SKU choices or naming patterns not mentioned in the text. These inferred values get tagged "inferred from code snippet" when recorded. The cloud hint acts as a sanity check that the article actually describes an Azure architecture. For every checklist item, the agent records either an extracted value or the literal string "not stated in article". This becomes SPEC.md , the single source of truth for everything downstream. SPEC.md drives every subsequent step. Steps 3 through 7 (the Terraform module, workflows, and Databricks bundle generators) read architectural decisions from it. Step 8 then produces TODO.md by converting every "not stated in article" entry into a TODO item the operator must resolve before deployment. What I find worth pointing out is how little the output contract has actually moved since that very first commit. The implementation underneath has changed completely. Leaf skills emerged, generator scripts came in, validation rules got added, a soft-delete state machine showed up to handle Key Vault recovery. None of those existed at the start. But what the orchestrator delivers, the list of files it puts on disk, has stayed exactly the same. We have a much larger SKILL.md today that still mirrors the initial five-item output list. The contract itself has changed by exactly one line: the addition of "Design of the architecture" to section 5. 
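The "extracted value or not stated in article" rule, and the way it feeds TODO.md, can be illustrated with a small sketch. This is not the orchestrator's implementation (the orchestrator is a prompt-driven skill); the function names and checklist keys below are hypothetical and only model the bookkeeping.

```python
NOT_STATED = "not stated in article"


def build_spec(checklist: list[str], extracted: dict[str, str]) -> dict[str, str]:
    """Record an extracted value for each checklist item, or the literal NOT_STATED marker."""
    return {item: extracted.get(item, NOT_STATED) for item in checklist}


def todo_items(spec: dict[str, str]) -> list[str]:
    """Every NOT_STATED entry becomes one item the operator must resolve."""
    return [item for item, value in spec.items() if value == NOT_STATED]


# Hypothetical checklist keys and one value inferred from a code block.
checklist = ["azure_region", "workspace_sku", "catalog_name"]
extracted = {"workspace_sku": "premium (inferred from code snippet)"}

spec = build_spec(checklist, extracted)
todos = todo_items(spec)
```

Because the marker is a fixed literal rather than free text, the conversion from spec gaps to TODO entries stays purely mechanical.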
- SPEC.md: the structured, source-faithful read of the article, organised by the analysis checklist
- TODO.md: the operator's checklist of everything the article didn't specify, plus the universal post-deploy actions
- Terraform code under infra/terraform/: the platform layer with networking, storage, identities, Key Vault, workspace
- Databricks Asset Bundle under databricks-bundle/: the workload layer with jobs, entry points, environment configuration
- README.md: the operator runbook, with the architecture design diagram embedded

If the JSON contains an error, the orchestrator stops immediately. Per the skill rule "If fetch fails, stop and return the fetch error output. Do not retry," the error surfaces to the user rather than propagating downstream.

So the script's output is the raw evidence pack: title, structure, prose, code, cloud hint. The agent uses it to fill the architecture spec, which parameterises every generated artifact. At this point the fetch_blog.py output is sent to Step 2 of the orchestrator, as shown in the code snippet below.

### 2. Analyse article
Analyse the fetched article against the structured checklist in `.github/skills/blog-to-databricks-iac/references/blog-analysis-checklist.md`. The analysis covers the article text, diagrams, screenshots, and code snippets.

And, much later in the orchestrator, Step 8 closes the loop by turning everything that's been recorded into the two operator-facing documents:

### 8. Generate README and TODO from templates
Use the templates in `.github/skills/blog-to-databricks-iac/templates/`:
- `README.md.template` -> `README.md`
- `TODO.md.template` -> `TODO.md`

8. How this actually came together
What I've described so far is how the orchestrator works currently. The reality of building it was much more cumbersome, but also fun. I got from the first version to the current one by iterating.
Rerun the orchestrator, find the defect, identify the rule that would have caught it, add the rule to the skill that owns the artifact, rerun. The reason I'm calling this out now, before walking through the rest of the pipeline, is that everything from this point on is a story about a specific lesson learned that way. The leaf skills exist because a single SKILL.md got too dense. The restricted-tenant guardrails exist because the deployment failed against a tenant that couldn't read Microsoft Graph. The validation harness exists because prose rules weren't catching the regressions that mattered. The soft-delete state machine exists because the same vault name kept colliding with a previous deploy. None of these rules were present from day-one. So in the next sections I'll walk through how the pipeline actually matured: how the single skill split into a graph, what the inner regenerate-fix loop felt like in practice, the day the project pivoted to support restricted tenants, the bugs that became rules, and the Key Vault soft-delete state machine that closed the project out. 9. From a single skill to a skill graph When I started, everything lived inside a single SKILL.md . It was simpler that way, and to be honest, at that point I didn't yet know which rules would actually matter. But as I kept rerunning the orchestrator on the article, a pattern emerged. Each rerun produced something that broke in a slightly different way, and the fix always belonged to a very specific concern: Terraform authoring, bundle structure, workflow generation, or the orchestration logic itself. Stuffing the rules for all of them into one file was making the orchestrator unreadable and, worse, was silently dropping rules when the context window got tight. So I split it. The orchestrator stayed at the top, kept routing the work and validating the result, and each concern got promoted to its own leaf skill. 
The Databricks bundle skill itself ended up needing one more split a few days later. It had become too dense, so I broke it into two leaves:
- databricks-yml-authoring
- Python-entrypoints

The diagram below shows the shape the repo has today. The orchestrator now does almost no authoring. It owns the sequence of steps, the contract, and the validation gates, while everything else is delegated. This was the single biggest readability win. I wish I'd done it earlier. REPO_CONTEXT.md is one extra node in that diagram that I want to call out, but I'll come back to it later in section 12.

10. The inner loop: rerun, fail, fix the skill
If I had to describe the middle of this project in one sentence, it would be: every commit was a regeneration. I'd run the orchestrator end-to-end against the article and inspect the generated Terraform, the bundle, and the workflows. I'd find a defect, identify the rule that would have prevented it, add that rule to the skill that owns the artifact, then rerun, as shown in the image below.

This loop is what I think people miss when they treat AI-generated infrastructure code as a one-shot. The first run is never the deliverable. The deliverable is the skill that produces good runs. The generated files are disposable and can always be reproduced. The skill is what carries the knowledge forward.

I had to actively resist the temptation to fix bugs in the generated code directly. Patching infra/terraform/main.tf by hand fixes today's run but not tomorrow's, because the rule that would prevent the bug doesn't exist anywhere. So I made it a discipline: never edit the output, always edit the skill, then regenerate.

11. Restricted-tenant compatibility
The bug was simple to describe and brutal to fix: the deployment principal in the target tenant couldn't read Microsoft Graph.
Any Terraform data source that resolved an Entra name to an object ID at plan time (e.g., azuread_user , azuread_group , azuread_service_principal ) blew up at terraform plan. My first instinct was to think "I just give the principal Graph permissions". But in a lot of real environments this is not possible. The principal that runs your IaC is governed by a security team, the team has a policy, and the policy says no Graph reads. The pivot was getting the skill to produce Terraform that never reads Graph. Object IDs are inputs, not lookups. They come in as trusted secrets, the workflow exports them as TF_VAR_* , and Terraform consumes them as variables. No data " azuread_* " block is allowed in the generated code, ever. I thought this was a simple fix. It wasn't. It cascaded into about six other things: App Registration vs Service Principal object IDs. The workflow was being given the wrong one. Role assignments need the Enterprise Application (Service Principal) object ID, not the App Registration object ID. The two are different objects in Entra with different IDs. I encoded the distinction in the skill as *_SP_OBJECT_ID (the Service Principal) versus *_CLIENT_ID (the App Registration's application ID). Naming carries the meaning now, so the wrong value is hard to pass. Single-principal mapping. In some tenants you only have one principal and it has to play both deployment and runtime roles. The skill grew a layer_sp_mode = existing input so the generator stops trying to create a new Service Principal and reuses the deployment one instead. Key Vault access policies, gone. Access policies were Graph-touching, and not all tenants support them anyway. The skill switched fully to RBAC role assignments (Key Vault Secrets User, and so on). A few cascading bugs followed, but this was the right call. It took some time to harden the Terraform skill against everything the restricted tenant was throwing back. 
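The "no Graph reads" rule is the kind of prose rule that can also be checked mechanically. Below is a minimal sketch in Python, not the check used in the original project: it scans Terraform source text for any data block whose type starts with azuread_. The names find_graph_reads and AZUREAD_DATA_BLOCK are hypothetical, and the regex is a deliberately simple stand-in for a real HCL parser, so it assumes one block header per line.

```python
import re

# Hypothetical pattern: the opening of a Terraform data block whose type starts
# with "azuread_", e.g.  data "azuread_user" "owner" {
# Commented-out lines do not match because "data" must be the first token.
AZUREAD_DATA_BLOCK = re.compile(
    r'^[ \t]*data[ \t]+"(azuread_\w+)"[ \t]+"([^"]+)"',
    re.MULTILINE,
)


def find_graph_reads(terraform_source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, data_source_type, block_name) for each azuread_* data block."""
    hits = []
    for match in AZUREAD_DATA_BLOCK.finditer(terraform_source):
        # Line numbers are 1-based: count the newlines before the matched type name.
        line_number = terraform_source.count("\n", 0, match.start(1)) + 1
        hits.append((line_number, match.group(1), match.group(2)))
    return hits


if __name__ == "__main__":
    example = 'data "azuread_group" "admins" {\n  display_name = "admins"\n}\n'
    for line_number, kind, name in find_graph_reads(example):
        print(f"line {line_number}: data source {kind}.{name} reads Microsoft Graph")
```

A generation run could treat any non-empty result as a failure, which matches the rule that object IDs arrive as variables and are never looked up.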
Each iteration had the same shape: the orchestrator runs and hits a fresh provider error, I add the rule, run again, and hit the next one. The commit subjects from that run are basically a transcript of the conversation I was having with the platform.
12. The bugs that became rules
There are three bugs that I believe are worth telling the story of, because each illustrates a slightly different lesson. The HCL trim() arity bug. The generator emitted trim(var.something) in a validation block. HCL's trim() takes two arguments, not one. The function I actually wanted was trimspace(). This is the kind of bug that any human would catch in a code review in two seconds, and which the model produced confidently because the shape of the call looked right. I added the rule to the Terraform skill ("for whitespace trimming use trimspace, never trim") and the bug never came back. Lesson: even for trivial syntactic mistakes, the fix belongs in the skill. The variable shadowing bug. The deploy workflow had a job-level env: block that set TF_VAR_key_vault_recover_soft_deleted to a static value. A detection step earlier in the workflow was supposed to compute the right value at runtime and write it via $GITHUB_ENV. The problem is that GitHub Actions resolves job-level environment variables before $GITHUB_ENV writes take effect, so the static value always won and the dynamic one was silently ignored. The fix was to never set the recovery flag at job level. It must be written in the detection step, on every code path, including the trivial "no recovery needed" path. Lesson: state must be explicit, not inherited. If a flag has three possible meanings, three code paths must each write it. The hardcoded -platform suffix. The workflow had a shell-side suffix that someone (let's be honest, the model) had invented to make the resource group name "look right".
When recovery logic started running and the workflow looked for the canonical resource group, it looked for -platform instead of whatever the Terraform locals.tf actually emitted. The result was that the recovery handler was happily reaching past the real resource group and into a different one. I made it a rule in the orchestrator: workflow-invented suffixes are not permitted. Naming is owned by Terraform's locals.tf. There are seventeen more defects in the catalogue, and the pattern is the same in every case. The bug surfaces, the rule gets written, and the rule lives in the skill that owns the affected artifact. There is no implementation-learnings.md in the repo. There used to be, but I deleted it because a tracked log of past bugs, sitting next to a skill that's already supposed to encode the lessons from those bugs, is a duplication waiting to drift. I believe that if the rule is in the skill, the log is redundant. If the rule isn't in the skill, the log is evidence that I haven't finished the work. Either way, the right place for bug history is git log.
13. Splitting "the skill" from "this repo's defaults"
I then wanted the orchestrator to be portable, but every run kept needing the same handful of decisions. Which Azure region by default? Which environment names? Which catalog naming convention? These weren't part of the article. They weren't part of the Terraform skill either. They were specific to this repository's opinion about how things should be deployed. If I baked them into the orchestrator, the orchestrator stopped being portable. If I left them out, every run produced unhelpful "not stated in article" entries for the same five universal decisions. The answer was a new file called REPO_CONTEXT.md stored in the repo root. It's read by the orchestrator before generation and it carries the defaults that are owned by the repo, not by the skill.
The split looks like this in practice: SKILL.md answers the question "how do I turn an article into a runnable repo?" It is portable. REPO_CONTEXT.md answers the question "what does this repo default to when the article doesn't say?" It is local. Cloning the orchestrator into another GitHub project is now a clean operation. You take the skill, you write your own REPO_CONTEXT.md, and the same generator produces output appropriate to your environment.
14. The Validations
Most of the rules I'd written into the skills were prose. "Don't invent suffixes." "Object IDs are inputs, not lookups." "Every required Terraform variable must have a matching TF_VAR_* in the workflow." The model is good at following prose rules most of the time, but not every time. So a few of the most regression-prone rules became executable. The most important one is scripts/validate_workflow_parity.sh. Every variable declared in infra/terraform/variables.tf must appear as a TF_VAR_* export in the deploy workflow. The script greps both files, diffs the sets, and exits non-zero if they don't match. It is run at the end of generation. If it fails, the run failed, even if everything else looks fine. This caught real bugs. The most embarrassing was a variable I'd added to variables.tf and forgotten to wire through the workflow. terraform plan would prompt interactively for it on a non-interactive runner, and the run would hang. The rule of thumb I've ended up with is: prose rules are the default, but if a rule has been violated more than twice, it gets promoted to an executable check. There's a short list of those checks now, and it's the load-bearing one.
15. Key Vault soft-delete state machine
Key Vaults in Azure have soft delete on by default. When you delete a vault, it sticks around for ninety days in a "soft-deleted" state. If you try to create a vault with the same name in the same subscription during that window, the deploy fails. The right behaviour is to recover the soft-deleted vault, not create a new one.
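The recover-or-create decision, combined with the earlier lesson that the flag must be written on every code path, can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the handler from the project: it assumes the names of soft-deleted vaults have already been fetched by an earlier step, it compares names case-insensitively as a simplifying assumption, and the helpers recovery_flag_line and write_recovery_flag are hypothetical. The flag name TF_VAR_key_vault_recover_soft_deleted and the $GITHUB_ENV file come from the text above.

```python
import os

# Flag name taken from the article; everything else in this sketch is illustrative.
FLAG_NAME = "TF_VAR_key_vault_recover_soft_deleted"


def recovery_flag_line(vault_name: str, soft_deleted_names: list[str]) -> str:
    """Build the NAME=value line for the recovery flag.

    The value is always explicit: "true" when the vault is soft-deleted,
    "false" otherwise. There is no path that leaves the flag unset.
    """
    known = {name.lower() for name in soft_deleted_names}
    needs_recovery = vault_name.lower() in known
    return f"{FLAG_NAME}={'true' if needs_recovery else 'false'}"


def write_recovery_flag(vault_name, soft_deleted_names, env_file=None):
    """Append the flag to the GitHub Actions environment file and return the line."""
    # GitHub Actions exposes the environment file path in the GITHUB_ENV variable.
    env_file = env_file or os.environ["GITHUB_ENV"]
    line = recovery_flag_line(vault_name, soft_deleted_names)
    with open(env_file, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

Keeping the decision in a pure function makes the trivial "no recovery needed" path as visible and testable as the recovery path.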
The first version of my recovery handler covered exactly one case: if the vault is soft-deleted, recover it. This worked the first time I ran it. The second time, the recovered vault came back into the previous resource group, not the new one I had just created. Terraform then tried to create a new vault in the correct resource group and failed because the name was already taken globally. The handler had no concept of "the recovered vault is in the wrong resource group." So I added that case. The third time, the previous resource group itself was gone, and the handler was looking for it to verify the move. So I added that case too. By the end, the state machine had three distinct cases and two preconditions, as shown in the diagram below. The reason I keep coming back to this state machine is that it captures something that I think is generally true about agent-generated infrastructure code. The happy path is easy and meaningless, while the value is in the failure modes. The first version that worked on a clean tenant was about ten lines of bash. The version that works on a tenant that has been deployed-into and partially-torn-down five times is six times longer, and every additional line of it corresponds to a real environmental condition that I had to learn the hard way. 16. What I've learned so far I'm not going to pretend the full list of principles below was clear to me on day one. Every single one of these was learned by getting it wrong first. Looking back at the history, though, they are the ones that survived contact with reality. The contract precedes the implementation. output-contract.md was committed before any generator existed. Locking the shape of the deliverable first meant every later change had a fixed reference point. Generators, not stencils. Workflows are produced by Python scripts that take parameters and emit YAML. 
When restricted-tenant logic and the soft-delete state machine arrived, they needed conditional structure that a static template can't express. Every bug becomes a rule. Patching the generated code is a tax on tomorrow's run, while patching the skill is an investment. Each concern has a clear owner. The orchestrator routes, the leaves author, and the repo context holds the local defaults. Restricted-tenant compatibility is non-negotiable. No Microsoft Graph reads from generated Terraform. Object IDs are trusted inputs. Single-principal mapping is supported. Naming is owned by Terraform. No suffixes invented in shell. The validation harness enforces this. State must be explicit, not inherited. Every workflow run writes its own flags. No reliance on env defaults from a previous step or a previous run. Validation is executable when a rule has been violated more than twice. Prose rules are the default. Promotion to a script is earned. Operator docs describe concepts, not commands. Command syntax ages out, while conceptual descriptions don't. The TODO template enforces this rule. Add strong testing at the end of the process, once all the files are generated. Each run may produce slightly different output and introduce bugs, even if the previous run was successful. End-to-end runs against dirty tenants are the truth. The acceptance test isn't a clean-room deploy. It's a deploy into a tenant that has soft-deleted vaults, lingering RGs, and existing role assignments. Until that works, the project isn't done. From time to time, skills need to be reviewed and consolidated. The summary above of the journey is the one I find most useful to share when people ask whether this approach actually goes anywhere. From an empty repo to a generator that produces a deployable, restricted-tenant-compatible infrastructure-as-code repository from a blog URL, with executable validation and a recovery state machine that survives a previously-deployed environment.
The first commit was an empty workspace. The last commit was the one where the same orchestrator, run against the same blog, against a tenant carrying state from five previous runs, deployed cleanly with no manual intervention. That is what I was aiming to achieve when I started! Thanks for reading.
Legacy SSRS reports after upgrading Azure DevOps Server 2020 to 2022 or 25H2
We are currently planning an upgrade from Azure DevOps Server 2020 to Azure DevOps Server 2022 or 25H2, and one of our biggest concerns is reporting. We understand that Microsoft’s recommended direction is to move to Power BI based on Analytics / OData. However, for on-prem environments with a large number of existing SSRS reports, rebuilding everything from scratch would require significant time and effort. Since Warehouse and Analysis Services are no longer available in newer versions, we would like to understand how other on-prem teams are handling legacy SSRS reporting during and after the upgrade. Have you rebuilt your reports in Power BI, moved to another reporting approach, or found a practical way to keep existing SSRS reports available during the transition? Any real-world experience, lessons learned, or recommended approaches would be greatly appreciated.
Resilient by Design: Azure Databricks Disaster Recovery Strategy
Introduction: From Recovery Plans to Resilience Strategy
As organizations increasingly rely on Azure Databricks for mission-critical analytics and data engineering workloads, the need for robust disaster recovery (DR) strategies becomes paramount. These platforms are no longer just analytics engines; they power real-time decisions, AI models, and core business operations. Yet many organizations still approach DR as a reactive safeguard rather than a strategic capability. Resilience today is not about “if something fails,” but about ensuring continuity, trust, and performance under any condition. A modern DR strategy must therefore evolve beyond backup configurations and failover scripts. It must align with business priorities, regulatory requirements, risk tolerance, and operational maturity to become a core pillar of the enterprise data platform. In this context, organizations are increasingly adopting architecture patterns that enable cross-region resilience for the Azure Databricks Lakehouse. This pattern includes synchronizing Unity Catalog objects (catalogs, schemas, tables, views, functions, models, and volumes) across regions, combined with scalable data movement mechanisms and secure data access approaches such as Delta Sharing and high-performance transfer tools. To help organizations operationalize this approach today, we have defined a structured strategy for synchronizing Unity Catalog objects and associated data across regions, enabling a resilient-by-design Azure Databricks architecture. This post focuses on that approach, outlining the key architectural patterns, strategic considerations, and practical implementation steps required to design and enable cross-region resilience. In October 2025, Databricks announced a Managed Disaster Recovery solution, developed in collaboration with Capital One, which includes managed replication, customer-specified failover, and read-only secondary capabilities.
The approach outlined in this post serves as a complementary, customer-managed pattern, providing a practical and production-ready path for organizations to achieve robust disaster recovery and business continuity while Databricks continues to expand its native DR capabilities.
Why Disaster Recovery for Azure Databricks is Different
Traditional Disaster Recovery approaches do not fully apply to modern Lakehouse platforms. In Azure Databricks, resilience must account for tight coupling between data, compute, and metadata (Unity Catalog); distributed pipelines (batch, streaming, ML); and decentralized workspace ownership and rapid platform growth. This makes disaster recovery not just an infrastructure concern, but a data platform design challenge.
Figure 1. Main Disaster Recovery Considerations
Understanding the Fundamentals: RTO, RPO, and DR Trade-offs
Before defining a disaster recovery strategy, it is essential to understand the core concepts that drive design decisions. Recovery Time Objective (RTO) defines how quickly a system must be restored after a disruption, while Recovery Point Objective (RPO) defines how much data loss is acceptable. These two metrics directly influence the architecture, cost, and complexity of any DR solution. As illustrated in Figure 1, there is a clear trade-off between cost and recovery performance: Active-active (hot) architectures minimize downtime and data loss but come at a higher cost. Warm standby provides a balance between cost and recovery time. Cold DR is cost-efficient but results in longer recovery times and higher data loss risk. Understanding these trade-offs is critical to aligning DR strategy with business expectations.
Designing for Resilience: A Phased Disaster Recovery Approach
Disaster recovery has evolved beyond a one-time setup into a structured, lifecycle-driven capability. Leading organizations design resilience intentionally, implement it systematically, and continuously validate it to ensure ongoing effectiveness. The framework outlined below provides a practical and strategic approach to operationalizing disaster recovery in Azure Databricks environments, bridging the gap between architectural intent and true operational readiness.
Figure 2. Different Phases of Azure Databricks Disaster Recovery
Phase 1: Discovery & Assessment
A resilient disaster recovery strategy starts with clarity, yet in many Azure Databricks environments that clarity is often missing. As platforms evolve, clusters multiply, jobs are duplicated, and data assets grow, making it increasingly difficult to answer a simple question: what do we actually have, and how critical is it? The Discovery phase addresses this by establishing a single, authoritative view of the platform. By consolidating all assets, dependencies, and usage patterns into a structured baseline, organizations can move from fragmented visibility to informed decision-making.
This approach aligns closely with the concepts outlined in “From Chaos to Clarity: Your Databricks Workspace on a Single Pane of Glass”, where establishing a comprehensive inventory becomes the foundation for governance, optimization, and ultimately resilience. This foundation enables teams to identify what matters most, define appropriate RTO and RPO targets, and understand the dependencies that will ultimately shape their disaster recovery strategy.
Outcome: A clear, data-driven baseline of the environment, enabling confident workload prioritization and effective disaster recovery design.
Phase 2: Strategy & Design
Once visibility is established, the next step is making deliberate design choices that balance resilience, cost, and complexity. At this stage, organizations define how their platform should behave under failure. This typically starts with selecting a multi-site deployment pattern, in which two primary approaches are commonly adopted:
Active–Active, where both regions are fully operational and serve live workloads
Active–Passive (Warm Standby), where a secondary region is pre-provisioned and activated only during failover
Active–active architectures provide near-zero downtime and minimal data loss but come with increased cost and architectural complexity. Active–passive patterns offer a more cost-efficient alternative, with slightly higher recovery times depending on how failover is orchestrated. Beyond selecting the deployment pattern, a key architectural decision is how data is replicated across the Medallion architecture (Bronze, Silver, Gold). Our approach introduces a set of practical scenarios that allow organizations to tailor resilience based on both workload criticality and recovery requirements.
A common starting point is aligning the DR strategy to workload tiers, such as:
Tier 1 (Mission-critical): Active–Active with full replication
Tier 2 (Business-critical): Active–Passive with partial replication
Building on this, organizations can further refine their approach by defining how data is replicated across the Medallion layers: full replication (Bronze, Silver, Gold), which gives the fastest recovery at the highest cost; Bronze-only replication, which costs less but requires re-computation during recovery; and Gold-only replication, which is optimized for consumption-focused use cases. This combination of workload tiering and Medallion replication strategies enables a flexible, fit-for-purpose approach to disaster recovery, which balances performance, cost, and operational complexity. Below we demonstrate, as an example, two representative patterns: (a) Active–Active architecture, where data pipelines operate in continuous trigger mode across regions, enabling near real-time synchronization; and (b) Active–Passive architecture, where all layers are replicated using a clone-based approach and activated on demand during failover. These scenarios highlight how organizations can balance recovery performance and cost by adjusting both the deployment model and the depth of data replication.
Figure 3. Active - Active Scenario - Continuous Trigger Mode
Within the active–passive model, multiple variations can be applied, ranging from full replication of all medallion layers to more selective approaches (such as replicating only Bronze or Gold layers). This flexibility allows organizations to further balance recovery performance, cost, and operational complexity.
Figure 4. Active - Passive Scenario - Clone All Layers Mode
Phase 3: Disaster Recovery Implementation & Enablement
With the strategy defined, the focus shifts to translating design into a repeatable and operational solution.
At this stage, resilience is no longer conceptual; it is embedded into the platform through automation, data replication, and standardized deployment patterns.
From Strategy to Architecture
At a high level, the DR architecture spans both the primary and secondary Azure regions, ensuring that all critical components can be either replicated or recreated:
Control plane synchronization: Users, groups, and workspace assets are replicated using SCIM, Terraform, and CI/CD pipelines.
Workspace and metadata portability: Jobs, notebooks, and configurations are defined as code and deployed consistently across regions.
Data layer replication: Managed data, external data, and streaming checkpoints are synchronized using deep clone operations.
This layered approach ensures that the platform can be reconstructed end-to-end, not just partially recovered.
Unity Catalog-Driven Replication
A critical aspect of the implementation is the replication of Unity Catalog metadata and associated data assets. This includes synchronizing catalogs, schemas, tables, views, functions, and volumes; using Delta Sharing to expose datasets across regions; leveraging deep clone and storage replication to ensure data availability; and recreating external and managed locations in the target region. By combining metadata synchronization with data replication, the target environment becomes a fully functional mirror of the source.
Figure 5. Unity Catalog Focused DR Mechanisms
Operationalizing with a DR Pipeline
To make this repeatable, the architecture is supported by a DR pipeline that orchestrates the process end-to-end: synchronize schemas and Unity Catalog structures; perform deep clone of Delta tables; recreate views and dependent objects; provision volumes and copy associated data; and ensure consistency across storage layers (e.g., ADLS via AzCopy). This pipeline can operate either continuously or on demand, depending on the selected DR pattern.
Figure 6.
Azure Databricks DR Replication Workflow
Outcome: A fully implemented disaster recovery solution where data, metadata, and platform components are consistently synchronized, enabling rapid and reliable activation of workloads in a secondary region.
Phase 4: DR Drill: Validation, Operations & Continuous Improvement
A disaster recovery strategy is only valuable if it works when needed. This phase focuses on validating, operating, and continuously improving the DR solution to ensure it meets business expectations.
Failover & Failback in Practice
In a real failure scenario, the transition to the secondary region must be simple, predictable, and fast. A typical failover process includes detecting primary region unavailability; executing a final synchronization (if possible); redirecting connections to the DR workspace; and resuming operations without requiring code changes. Equally important is failback, once the primary region is restored: re-synchronizing data from DR to primary; switching pipelines and configurations back; and gradually restoring normal operations. Because infrastructure and metadata are standardized, this process becomes operational rather than reactive.
Operating DR as a Continuous Capability
Beyond failover, DR must be actively managed as part of daily operations:
Monitoring & Alerting: Track job failures, performance bottlenecks, and system health
Governance & Change Management: Maintain consistency between environments using IaC and version-controlled pipelines
Continuous Optimization: Adjust replication strategies, scaling, and performance as workloads evolve
This ensures the DR solution remains aligned with both technical and business changes over time.
Ensuring Performance, Integrity, and Security
A production-ready DR solution must also guarantee:
Performance & Scalability: Optimize compute, autoscaling, and data transfer to handle recovery scenarios efficiently
Data Integrity & Consistency: Validate schema synchronization, monitor replication jobs, and ensure parity between regions
Security & Compliance: Enforce consistent access controls, secure credentials, and enable audit logging across environments
Outcome: A validated and continuously evolving DR capability, where recovery processes are tested, monitored, and improved over time, providing confidence to both technical teams and business stakeholders.
Key Takeaways and Closing Thoughts
Resilience in modern data platforms is no longer defined by how quickly systems can recover, but by how effectively they are designed to withstand disruption in the first place. Azure Databricks, as a core engine for data, analytics, and AI, requires a disaster recovery approach that extends beyond infrastructure and treats data, metadata, pipelines, and governance as a unified system. By combining a structured discovery phase, a strategy aligned to workload criticality, and automated, repeatable implementation patterns, organizations can move from reactive recovery to resilience by design. This not only reduces risk, but also ensures that critical data workloads remain available, trusted, and performant when it matters most. The approach outlined in this post provides a practical and flexible way to enable cross-region resilience today, while also complementing the managed disaster recovery capabilities expected to be introduced by Databricks. As we anticipate the availability of these native features, this approach offers a production-ready foundation that can extend and integrate with future platform capabilities.
In a world where disruption is inevitable, the objective is no longer simply to recover, but to maintain continuity of data, decisions, and business operations with confidence. Special thanks to Vasilis Zisiadis and Dimitris Kotanis, who contributed their expertise to create this material and bring it to life. Thank you to Antony Bitar, Collin Brian, and Jason Pereira for their support in reviewing the content.
Approaches to Integrating Azure Databricks with Microsoft Fabric: The Better Together Story!
Azure Databricks and Microsoft Fabric can be combined to create a unified and scalable analytics ecosystem. This document outlines eight distinct integration approaches, each accompanied by step-by-step implementation guidance and key design considerations. These methods are not prescriptive: your cloud architecture team can choose the integration strategy that best aligns with your organization’s governance model, workload requirements and platform preferences. Whether you prioritize centralized orchestration, direct data access, or seamless reporting, the flexibility of these options allows you to tailor the solution to your specific needs.