azure databricks
What to Do When You Hit Capacity in Azure Databricks: Engage, Mitigate, Plan!
Microsoft's Cloud Architects: Paul Singh (PaulSingh), Eduardo Dos Santos (eduardomdossantos), Chris Walk (cwalk), Peter Lo (PeterLo), Tim Orentlikher (tim_orentlikher), Ajmal Hossain (ajmalhossain), Chris Haynes (Chris_Haynes), and Rafia Aqil (Rafia_Aqil)

Start Here: Engage Microsoft

Capacity constraints in Azure Databricks are not an Azure Databricks product issue. Azure Databricks does not own or reserve compute; it dynamically provisions VMs from Azure when clusters are created or scaled. This means cluster creation, autoscaling, or job execution can stall when the underlying VM SKUs are constrained at the regional level. The fastest path to resolution is a structured conversation with your Microsoft account team, who can engage the Azure capacity intake process on your behalf.

Create a Quota Support Ticket via Microsoft Support and bring the following to your account team along with your support ticket number. Each field maps directly to what capacity intake teams will ask for; missing fields slow the request.

What to Prepare Before You Reach Out to Your Account Team

Field | What Capacity Intake Needs | Example
Subscription IDs | The exact Azure subscriptions that will host the workspaces and clusters | 7ebee83d-7923-426c-8449-59fd4dff25ab
Region(s) | Primary region, plus any acceptable alternates | East US 2
VM family / SKU | Specific series and version requested | Eadsv5, ESv4, DSv4, DSv2
Core count / new limit | Total vCPU or core count per SKU | 10,000 cores for Eadsv5
Workload characteristic | CPU-bound vs. memory/shuffle-heavy vs. IO-heavy; batch vs. streaming vs. SQL | “Memory-intensive ETL with large joins and shuffles”
Scale and timing | When you need it, ramp profile, peak vs. steady state | “Need by month-end; ramp from 2,000 to 9,650 cores over Q3”
Business context | Business use case | “Migration off AWS”

What “Capacity” Really Means: A Layered Mental Model

Before diving into fixes, it is important to understand what is actually happening behind the scenes.
Capacity constraints can occur at three distinct layers, and solving them requires addressing each one.

Layer 1: Azure Infrastructure

This is the layer most teams underestimate. Capacity here is governed by:

- VM SKU availability in the region. D-series and E-series, the two most common Databricks worker families, have repeatedly hit capacity constraints across multiple Azure regions, causing cluster creation failures, autoscale stalls, and provisioning delays.
- Regional supply constraints, which are dynamic and shared across all Azure tenants.
- vCPU quotas and limits per subscription, which are separate from regional supply. Quota is your subscription’s limit to deploy resources (like a credit card limit); regional capacity is the underlying infrastructure available. Both must be sufficient.

Layer 2: Azure Databricks Platform

The Azure Databricks control plane has its own published ceilings that your architecture must proactively respect. Key limits from the official Azure Databricks resource limits documentation:

Resource | Limit | Scope
Jobs created per hour | 10,000 | Workspace
Tasks running simultaneously | 2,000 | Workspace (Run Job and For Each parent tasks excluded)
Parent tasks running simultaneously (Run Job / For Each) | 750 | Workspace
SQL warehouses | 1,000 | Workspace
Attached notebooks or execution contexts | 145 | Cluster
Virtual machines | 25,000 | Per subscription per region

Note: For limits marked as non-fixed in the official documentation, you can request an increase through your Azure Databricks account team. Reference: https://learn.microsoft.com/en-us/azure/databricks/resources/limits

Layer 3: Workload (Spark Execution)

Even when both lower layers cooperate, Spark’s own execution model can produce capacity-like symptoms:

- Parallelism and task distribution, which dictate how many cores a job can usefully consume.
- Memory pressure from joins, shuffles, and skewed keys.
- IO demand and caching behavior, including Delta cache effectiveness and Spark cache misuse.
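As a rough illustration of the Layer 2 ceilings listed above, the following sketch compares a planned workspace footprint against those published numbers. The dictionary keys, the `check_against_limits` helper, and the 80% headroom threshold are hypothetical choices made for this example; only the limit values come from the table, and they should be confirmed against the official limits page before being relied on.

```python
# Limit values copied from the table above; key names are illustrative only.
PUBLISHED_LIMITS = {
    "jobs_created_per_hour": 10_000,             # per workspace
    "tasks_running_simultaneously": 2_000,       # per workspace
    "parent_tasks_running_simultaneously": 750,  # per workspace
    "sql_warehouses": 1_000,                     # per workspace
    "attached_notebooks_per_cluster": 145,       # per cluster
    "vms_per_subscription_per_region": 25_000,
}


def check_against_limits(planned, limits=PUBLISHED_LIMITS, headroom=0.8):
    """Return warnings for planned values that exceed or approach a limit.

    `headroom` is an assumed safety margin: anything above that fraction of
    a limit is flagged even though it has not been exceeded yet.
    """
    warnings = []
    for name, value in planned.items():
        limit = limits.get(name)
        if limit is None:
            continue  # no published limit recorded for this key
        if value > limit:
            warnings.append(f"{name}: planned {value} exceeds limit {limit}")
        elif value > headroom * limit:
            warnings.append(
                f"{name}: planned {value} is above {headroom:.0%} of limit {limit}"
            )
    return warnings


if __name__ == "__main__":
    plan = {"tasks_running_simultaneously": 1_900, "sql_warehouses": 1_200}
    for warning in check_against_limits(plan):
        print(warning)
```

A check like this belongs in design reviews rather than at runtime: it surfaces a platform ceiling before a workload is scaled into it, which is a separate question from whether Azure has VMs available.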
Understanding these layers is critical. Retries sometimes succeed because capacity is dynamic: as other workloads complete, nodes are released back to Azure and briefly become available.

Recognizing When You’ve Hit Capacity

Capacity issues rarely present as a single clean error. Instead, they appear as inconsistent behaviors:

- Clusters stuck in Pending state
- Autoscaling fails or never reaches the desired size
- Jobs intermittently fail to start
- Retry attempts sometimes succeed

These inconsistencies occur because capacity is shared across Azure tenants and fluctuates throughout the day. Running workloads outside peak business hours in the impacted region’s time zone is one of the most effective short-term mitigations.

Immediate Actions: How to Unblock Your Workloads

When you are actively hitting capacity constraints, speed matters. Reach out to your Microsoft account team and try these mitigations, ordered from quickest to most involved.

Retry and Run During Off-Peak Hours

Capacity availability changes throughout the day as workloads complete and release VMs. Running outside peak business hours for the impacted region significantly improves success rates.

Switch VM SKU or Family

If a specific VM SKU is constrained, switching to another can immediately unblock provisioning.

- Move within the same family (for example, DSv4 → DSv5)
- Or switch families entirely (for example, D-series → F-series or L-series)

This is one of the most effective but often underused approaches.

Choosing the Right VM Family

Most Databricks environments default to D-series (general purpose) and E-series (memory optimized). These are also the most heavily used and most capacity-constrained VM families.
Consider alternatives based on your workload:

VM Family | Best For | When to Use | Trade-off
D-series | General workloads | Default choice | Often constrained in high-demand regions
E-series | Memory-heavy Spark jobs | Joins, shuffles, analytics | High demand; higher cost
F-series | CPU-intensive jobs | Parsing, transformations | Lower memory per core
L-series | IO-heavy workloads | Delta caching, large datasets | Higher cost; large local NVMe

Practical decision framework:

- Memory-bound workloads (joins, shuffles): Move from E-series to L-series. Similar memory per core, plus large local NVMe for Delta caching.
- CPU-bound workloads: Move from D-series to F-series. Higher CPU performance at lower cost.
- IO-heavy or cache-sensitive workloads: L-series can significantly improve performance and reduce shuffle pressure.

Designing around a single VM family is one of the biggest production risks in Azure Databricks environments.

Implement Regional Diversity in Your Databricks Workload

Because Azure capacity constraints are region- and SKU-specific, it is important to build architectural flexibility into your Databricks deployments. For critical or large-scale workloads, consider deploying multiple Databricks workspaces across different Azure regions to reduce dependency on any single region’s capacity. This approach enables:

- improved resilience to regional capacity constraints
- greater flexibility in workload placement

Important: Multi-region deployment is not automatic. It requires deliberate architecture, including deploying separate workspaces and replicating data and configurations across regions.

Why Adding More Nodes Is Not Always the Answer

When jobs slow down, the instinct is to scale compute. With Spark, more nodes do not always solve the problem.
Common workload issues that masquerade as capacity problems:

- Data skew
- Excessive shuffle operations
- Inefficient partitioning
- Overuse of UDFs

In real workloads, shuffle operations can grow significantly larger than input data, placing heavy pressure on both compute and memory that more nodes cannot relieve.

Smarter optimization strategies:

- Reduce shuffle through repartitioning and query optimization
- Enable Photon for faster execution
- Optimize Delta tables using Z-ordering and compaction
- Leverage caching strategically (not just the Spark cache; use the Delta/disk cache)

These optimizations can reduce your dependency on scarce VM capacity altogether.

What to Do When Your Capacity Is Approved

Once Azure approves your capacity request, retaining it requires active steps. Because Azure capacity is dynamic and shared, approved capacity is held only while compute remains actively deployed and running. This is especially important in highly constrained regions. Microsoft recommends the following:

Configure an Instance Pool

For workloads that cannot yet use serverless compute, configure an Azure Databricks Instance Pool with a minimum number of idle nodes aligned to your production requirements. An instance pool pre-allocates and maintains a set of idle, ready-to-use VM instances. When a cluster is created from the pool, it draws from these warm nodes, eliminating the need to request new VMs from the regional Azure capacity pool between job runs.

Key behaviors:

- The pool holds a minimum number of nodes continuously, keeping them warm and immediately available.
- Clusters attached to the pool pull from warm nodes, avoiding re-acquisition from Azure between runs.
- No DBU charges apply while nodes are idle in the pool.
- Azure VM infrastructure costs do apply for all minimum idle instances.
- Size the pool conservatively, aligned to production need only, to balance capacity retention against ongoing cost.

Important: Instance pools hold idle nodes on a best-effort basis.
Periodic platform events can recycle pool nodes, briefly causing the pool to fall below its configured minimum idle count while Azure re-acquires replacement nodes. Pools significantly improve availability and startup latency, but they do not change the fact that the underlying VMs are still requested from Azure on demand. They are not a hard reservation. Reference: https://learn.microsoft.com/en-us/azure/databricks/compute/pools

Designing for Resilience: Long-Term Best Practices

To avoid repeated capacity issues, your architecture needs to evolve beyond reactive mitigations.

Plan for Capacity Early

Understand VM quotas and limits before you need them, not after a constraint occurs. Avoid designing around a single SKU. Build flexibility into cluster configurations so you can switch families without re-engineering jobs.

Standardize Compute Configurations

Consistent, policy-driven environments make it easier to adapt when capacity constraints occur. Use Databricks Cluster Policies to constrain cluster creation to approved, available VM families; this prevents teams from inadvertently requesting constrained SKUs.

Move Toward Serverless Where Possible

Serverless compute abstracts capacity management away from the customer. As the Databricks platform expands serverless support, migrating eligible workloads is the most durable long-term strategy. Azure continues to expand infrastructure capacity, but there are no guaranteed timelines for relief in constrained regions.

Note: If your workload supports serverless compute, Databricks recommends using serverless compute instead of pools or classic VM-backed clusters. Serverless removes dependency on specific VM SKUs and regional capacity; scaling is managed by the platform with significantly improved availability. Reference: https://learn.microsoft.com/en-us/azure/databricks/serverless-compute.
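Returning to the cluster-policy recommendation above, here is a minimal sketch of how an allowlist of approved VM sizes might be expressed. It assumes the cluster policy definition format in which an attribute such as `node_type_id` is constrained with `"type": "allowlist"`; the helper name `build_node_type_policy` and the specific VM sizes are illustrative, not a recommendation for any particular region.

```python
import json


def build_node_type_policy(allowed_node_types, default):
    """Build a policy definition restricting worker and driver VM sizes.

    The allowlist should contain only sizes confirmed as available for
    the target region and subscription (an assumption of this sketch).
    """
    if default not in allowed_node_types:
        raise ValueError("default node type must be in the allowlist")
    rule = {
        "type": "allowlist",
        "values": list(allowed_node_types),
        "defaultValue": default,
    }
    # Apply the same constraint to workers and the driver.
    return {"node_type_id": dict(rule), "driver_node_type_id": dict(rule)}


if __name__ == "__main__":
    # Hypothetical allowlist mixing families so one constrained SKU
    # does not block every cluster created under the policy.
    policy = build_node_type_policy(
        ["Standard_F8s_v2", "Standard_L8s_v3", "Standard_E8ds_v5"],
        default="Standard_F8s_v2",
    )
    print(json.dumps(policy, indent=2))
```

Keeping more than one family in the allowlist is the point of the exercise: the policy then encodes the multi-SKU flexibility described under "Plan for Capacity Early" instead of working against it.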
For eligible workloads, including Databricks Jobs (automated workflows), Databricks SQL Warehouses, and Delta Live Tables, serverless compute eliminates VM SKU dependency entirely. Configuration guidance is available in the Azure Databricks deployment guide, Development Section, Step 9.

Multi-Region Strategy for Critical Workloads

For the most critical workloads, evaluate a multi-region deployment as part of your business continuity planning. This is a significant architectural investment (see the FAQ for the full scope), but it is the only approach that provides true regional redundancy. Coordinate this with your Microsoft account team. Reference: Azure Databricks & Microsoft Fabric Disaster Recovery: The Complete Better‑Together Strategy for Cloud Architects

Final Takeaways

- Capacity issues are infrastructure-level constraints, not Databricks product failures
- VM family selection is critical: do not rely solely on D-series and E-series
- Workload optimization can reduce dependency on scarce resources before requesting more capacity
- Serverless compute is Microsoft’s preferred long-term recommendation for eligible workloads
- Azure On-Demand Capacity Reservations provide guaranteed capacity for mission-critical scenarios, distinct from instance pools (best-effort) and Reserved Instances (billing discount only)
- Architectural flexibility (multi-SKU, multi-region awareness) is your best defense against future constraints

FAQ

Why do retries work?
Capacity in Azure regions is shared across all tenants and fluctuates throughout the day as workloads complete and release VMs. A retry succeeds when capacity temporarily frees up. Retrying during off-peak hours improves success rates significantly.

Why does capacity fluctuate during the day?
Capacity is a function of regional supply and concurrent demand. As workloads complete, nodes are released back to Azure. Peak business hours in the impacted region’s time zone tend to be the tightest windows.
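The retry behavior described in the two answers above, combined with the SKU-switch mitigation from earlier in the article, can be sketched as a small retry loop. Everything here is hypothetical scaffolding: `CapacityError`, `provision_with_fallback`, and the `create_cluster` callable stand in for whatever client code actually creates clusters in your environment, and the delay values are placeholders.

```python
import time


class CapacityError(Exception):
    """Hypothetical error raised when provisioning fails for lack of capacity."""


def provision_with_fallback(create_cluster, sku_candidates,
                            attempts_per_sku=3, base_delay=30.0,
                            sleep=time.sleep):
    """Try each SKU in order, retrying with exponential backoff.

    `create_cluster` is any callable that takes a SKU name and either
    returns a cluster handle or raises CapacityError.
    """
    for sku in sku_candidates:
        for attempt in range(attempts_per_sku):
            try:
                return create_cluster(sku)
            except CapacityError:
                # Wait before the next attempt on the same SKU; after the
                # last attempt, move straight on to the next candidate.
                if attempt < attempts_per_sku - 1:
                    sleep(base_delay * 2 ** attempt)
    raise CapacityError(f"no capacity for any of: {', '.join(sku_candidates)}")
```

Ordering `sku_candidates` by preference keeps the workload on its intended VM family when capacity allows, and only falls back to alternates after the retries for the preferred SKU are exhausted.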
Why are instance pools not a hard reservation?
Pools hold a minimum number of nodes on a best-effort basis. Periodic platform events recycle pool nodes, so a pool can briefly fall below its configured minimum idle count while Azure re-acquires replacement nodes. Setting minimum idle to 0 avoids paying for idle VMs at the cost of slower acquisition time. Pools significantly improve availability and startup latency but do not guarantee capacity at the Azure infrastructure level.

Why does serverless behave differently from classic clusters?
Serverless compute removes customer control over individual VM SKUs. Databricks manages the underlying capacity across a shared pool. SKU-swap and pool-based mitigations do not apply. Customer-side levers reduce to retry and off-peak scheduling. The trade-off is that serverless is the simplest and most reliable option when the workload supports it.

Why is changing regions a last resort?
Region changes require redeployment of the Azure Databricks workspace and migration of all dependent artifacts: jobs, clusters, libraries, networking (private endpoints, VNet injection), Unity Catalog assignments, identities, and source data. The destination region must be validated for the same SKU and zonal configuration. For these reasons, region change should always be coordinated with the Microsoft account team and attempted only after preferred mitigations have been exhausted.

Why does VM family selection matter so much for capacity?
Different VM families have different supply curves. D-series and E-series are the most requested Databricks worker families and the ones most frequently constrained. Choosing a SKU based on whether the workload is memory/shuffle-heavy, CPU-bound, or IO-heavy improves both performance and the probability that capacity is available. The capacity team often steers customers toward newer-generation alternatives when supply differs by generation version.

What does the Microsoft account team actually do?
They route the request into the Azure capacity intake process, advise on alternate SKUs and regions, surface zonal vs. regional considerations, and provide forward visibility into known constraints. The customer’s job is to bring a complete, accurate workload profile so the account team can advocate effectively. It is also recommended to open an Azure Support ticket. This saves time later, because the capacity planning teams track issues and requests via support tickets. Once an Azure Support ticket is opened, share the ticket number with the Microsoft account team, at a minimum with the Customer Success Account Manager (CSAM), if one is assigned to your organization.

Streaming and Batch Data Architectures with Microsoft Fabric to Azure Databricks
Authors: Oscar Alvarado (oscaralvarado) and Rafia Aqil (Rafia_Aqil)

Note: This article describes a solution idea. Your cloud architect can use this guidance to help visualize the major components for a typical implementation. Use this article as a starting point to design a well-architected solution that aligns with your workload’s specific requirements.

As organizations adopt Microsoft Fabric as their unified analytics platform, it has become a leading path for ingesting both streaming and batch data into Azure Databricks. This article covers integration approaches via Microsoft Fabric and details the five Fabric-specific paths that connect OneLake/ADLS and Databricks for end-to-end data processing.

Medallion Architecture

The following data flow corresponds to the architecture diagram: data ingested through Microsoft Fabric (via Mirroring, RTI, or Data Factory) lands in OneLake/ADLS. With the medallion pattern, consisting of Bronze, Silver, and Gold storage layers, organizations have flexible access and extendable data processing:

- Bronze – Raw data entry point. Data arrives in its source format and is converted to the open, transactional Delta Lake format.
- Silver – Optimized for BI and data science. ETL and stream processing tasks filter, clean, transform, join, and aggregate Bronze data into curated datasets using SQL, Python, R, or Scala.
- Gold – Enriched data ready for analytics and reporting. Analysts use Power BI, PySpark, SQL, or Excel for insights and queries.

Fabric Integration Paths

Note: This architecture establishes a complete loop-back between Microsoft Fabric and Azure Databricks, enabling Gold layer tables to be seamlessly mirrored back to Microsoft Fabric for dashboarding through Azure Databricks Mirroring.
The following five paths connect Microsoft Fabric to Azure Databricks:

1. Fabric Mirroring to OneLake – A low-cost, low-latency turnkey solution that creates a replica of data from operational sources (SQL Server, Azure Cosmos DB, Oracle) in OneLake. Handles the initial load and ongoing CDC changes automatically, keeping data continuously up to date.
2. Fabric RTI to OneLake – Fabric Real-Time Intelligence ingests streaming event data into OneLake with sub-second latency, enabling real-time analytics on live event streams.
3. Fabric Data Factory to OneLake – Orchestrates ingestion from diverse sources not covered by Mirroring (such as Sybase or REST APIs) and lands data in OneLake, ensuring complete source coverage.
4. OneLake to Azure Databricks – Unity Catalog connections to OneLake, secured via Managed Identities from Microsoft Entra ID, allow Databricks to query OneLake data items as a native catalog without data duplication.
5. Fabric Data Factory to Azure Databricks (direct) – Orchestrates ingestion from diverse sources directly into Azure Data Lake Storage (ADLS), where Azure Databricks picks up the data for medallion architecture processing.

Design Considerations

Area | Updated guidance
Direct RTI-to-Databricks integration | There is still no broad GA direct integration where Fabric RTI and Databricks operate as one native real-time runtime. Integration should be positioned through open protocols, Event Hubs/Kafka-style patterns, OneLake, Delta, and federation.
OneLake federation in Azure Databricks | OneLake federation in Azure Databricks is now the key integration story. It allows Databricks Unity Catalog to query Fabric Lakehouse and Warehouse data in OneLake without copying it. Access is read-only and depends on Fabric tenant settings, workspace permissions, and Databricks Unity Catalog setup.
RTI data availability to Databricks | Data ingested through Fabric RTI can be made available to Databricks by landing or exposing the data into OneLake-backed items, especially Lakehouse/Warehouse patterns. Eventhouse data can be made available in OneLake in Delta format through OneLake availability, but Databricks OneLake federation should be validated against the specific Fabric item type and access path.
Existing Databricks customers | Existing Databricks customers do not need to abandon Databricks. They can use Fabric RTI as the event ingestion, real-time detection, operational alerting, and business action layer, while continuing to use Databricks for engineering, ML, advanced analytics, and Unity Catalog-governed access.
Activator and business action | Fabric Activator is the cleanest business-user action layer. It can monitor streaming events and trigger Teams messages, email, Power Automate flows, Fabric pipelines, notebooks, Spark jobs, Dataflows, UDFs, and other downstream actions. This is a strong differentiator because it lets business users act on events without waiting for batch analytics.
Operations Agents | Operations Agents are in preview and should be positioned carefully. They monitor real-time data from Eventhouse or ontology sources, surface insights, recommend actions, and can connect to Activator/Power Automate action paths. They are not simply a pre-ingestion decision engine before data lands anywhere; they work from configured Fabric knowledge/data sources.
Before landing in Lakehouse | For decisioning before Lakehouse persistence, use Eventstream processing and Activator rules on streams. For AI-assisted operational recommendations, use Operations Agents once the relevant data is available in Eventhouse or ontology.

Requirement-Specific Notes

Data Ingestion

Microsoft Fabric Mirroring currently supports SQL Server, Azure Cosmos DB, and Oracle as source systems.
For sources not yet supported by Mirroring, such as Sybase or REST APIs, use Fabric Data Factory pipelines to ensure full coverage across all data systems. Once data is in the landing zone with the correct format, Mirroring’s CDC replication starts automatically and manages the complexity of merging changes (updates, inserts, and deletes) into Delta tables, keeping data in Fabric continuously up to date. Learn more about open mirroring

Storage Format and Time Travel

OneLake supports Delta tables, enabling schema evolution and time travel across all data stored in the lakehouse. Learn more about OneLake and Delta tables

Security

- Encryption at rest: OneLake automatically encrypts all data at rest using Microsoft-managed keys, compliant with FIPS 140-2 standards. Learn more
- Encryption in transit: All data in transit is encrypted using TLS 1.2 or higher, securing data movement between Fabric, OneLake, and Azure Databricks. Learn more

Data Governance

OneLake can be registered and scanned by Microsoft Purview, enabling cataloging of stored metadata and data quality profiling. This protects sensitive information, including PHI and PII, across ingestion and analytics workflows. Learn more about Purview with Fabric Lakehouse

Operations and Monitoring

Use the Fabric monitor hub to track pipeline health, Spark application performance, and ingestion job status across all Fabric workloads. Learn more about the Fabric monitor hub

Scenario Details

This architecture applies to any organization that needs to unify streaming and batch data at scale.
Common characteristics include:

- Multiple operational data sources (databases, SaaS applications, event streams)
- A requirement to process both real-time and historical data in the same platform
- Governance and compliance requirements for sensitive data (PHI, PII, financial records)
- Analytics consumers spanning BI (Power BI), data science (Databricks notebooks), and ML workloads

Potential Use Cases

- Healthcare and life sciences – PHI/PII protection via Purview; real-time patient telemetry + batch EHR analytics
- Financial services – Real-time fraud detection streams + batch regulatory reporting
- Retail and e-commerce – Streaming clickstream analytics + batch inventory and supply chain processing
- Energy and utilities – IoT sensor telemetry streaming + batch consumption analytics

Next Steps

- Get started with Microsoft Fabric Mirroring
- Build an ETL pipeline with Lakeflow Declarative Pipelines
- Configure Unity Catalog with OneLake shortcuts
- Monitor Fabric pipelines with the Fabric monitor hub

Building AI apps and agents with Azure Databricks, Copilot Studio, and GitHub Copilot
A workspace-wide Genie MCP endpoint for Copilot Studio

Genie is Azure Databricks’ AI agent that lets any employee chat with their data and get trusted answers instantly. Genie Spaces are curated, business‑domain workspaces for teams to find strategic insights for their targeted use cases. Until now, connecting Azure Databricks Genie to Microsoft Copilot Studio meant adding each Genie Space as a separate tool. This works and adds value for customers wanting to integrate a specific Genie Space with Copilot Studio, but the per-space MCP server added overhead when trying to connect multiple Genie Spaces to one Copilot Studio agent. The workspace-wide MCP endpoint changes that. One endpoint per workspace gives a Copilot Studio agent access to every connected Genie Space and Unity Catalog dataset, and the curated context inside each Genie Space stays in place.

Key capabilities:

- Natural-language access across the workspace. Copilot Studio agents can route questions across every connected Genie Space and Unity Catalog dataset without losing the curation that keeps answers accurate.
- Unity Catalog governance. Access controls are enforced at query time, so existing data permissions extend to every agent built in Copilot Studio.
- Beyond a single domain. Move from a finance agent or a supply chain agent to a workspace-aware agent that follows users wherever the data lives.

Lakebase branching with GitHub Copilot agent mode

Production AI agents fail on real-data edge cases that synthetic or mocked environments do not catch. But giving a developer direct production access to investigate is not a realistic option in most enterprises. Lakebase branching, now integrated with GitHub Copilot agent mode, gives you a way to debug against real data without ever connecting to the production database.

Key capabilities:

- Copy-on-write branching. Create a full-fidelity branch of a Lakebase production database in seconds. No data is moved and no production records are altered.
- Native GitHub Copilot agent debugging. Point GitHub Copilot agent mode at the branch endpoint to reproduce, root-cause, and resolve data-dependent issues with AI assistance.
- Azure-native end-to-end workflow. The full loop runs across GitHub, Azure Databricks, and Lakebase. No third-party tools or custom infrastructure required.
- Compliance built in. Fixes ship through the standard Git-based deployment and compliance workflows already in place, so debug cycles compress from hours to minutes.

What this unlocks for AI agent teams

Together, the two capabilities cover both halves of the agent lifecycle on Azure:

- Author Copilot Studio agents that reason over an entire Azure Databricks workspace through one MCP connection.
- Debug production AI agents against real Lakebase data using GitHub Copilot agent mode, reducing production data risk.
- Keep Unity Catalog governance and existing compliance controls in place from authoring through deployment.
- Standardize the data, agent, and developer toolchain on GitHub, Azure Databricks, and Microsoft 365.

Get started

Both features are available in public preview on June 2, 2026, directly in Azure Databricks workspaces.

- Azure Databricks and Power Platform integration to set up Genie workspace-wide MCP for Copilot Studio
- Connect your GitHub to Azure Databricks to take advantage of Lakebase branching with GitHub Copilot agent mode

Secure Medallion Architecture Pattern on Azure Databricks (Part II)
Disclaimer: The views in this article are my own and do not represent Microsoft or Databricks.

This article is part of a series focused on deploying a secure Medallion Architecture. The series follows a top-down approach, beginning with a high-level architectural perspective and gradually drilling down into implementation details using repeatable code. In this part we will discuss the implementation of the pattern using GitHub Copilot. If you missed it, please read the first part of this blog series first: Secure Medallion Architecture Pattern on Azure Databricks (Part I).

I waited a while before publishing this article, partly due to other priorities, but also because I wanted to experiment with deploying infrastructure and data pipelines using agents. At that point, I was looking to leverage agents with a spec-driven approach, and through using GitHub Copilot, I learned what skills are and how I can use them to achieve my goal. In this blog I'll share what I learned using GitHub Copilot for spec-driven development. I'll use the content from my previous article, Secure Medallion Architecture Pattern on Azure Databricks (Part I), as a technical specification to extract implementation details and generate two outputs:

- Terraform code for infrastructure, platform configuration, and deployment
- Databricks Declarative Automation Bundles for jobs, pipelines, and other deployment-ready workload resources

I've tried not to overfit the prompts within the skills I've developed, so they remain portable to other technical articles, not just the one mentioned in this blog.

Separate the platform from the workload

When I started the design, I decided to modularise the automation scripts by separating the platform from the actual data platform workloads.
I assigned networking, storage, identities, secret scopes, and workspace configuration to Terraform, while Databricks notebook runs, job clusters, pipelines, and environment-specific deployments were developed within Databricks Declarative Automation Bundles (formerly known as Databricks Asset Bundles). That may sound obvious, but it's exactly where generated code often goes wrong. Without explicit instructions, AI tools tend to blur these boundaries and produce one oversized block of configuration. That's why my Copilot skill needs to enforce a clear contract:

- Infer the architecture from the article
- Identify what is explicit and what is assumed
- Emit Terraform only for infrastructure concerns
- Emit bundle files only for workload concerns
- Leave placeholders for anything the article does not specify

That last point is critical. A blog post or low-level technical specification is not a source of truth for account IDs, hostnames, catalog names, secret values, or subnet IDs. Good automation should never fabricate those values. Instead, I decided to produce a starter implementation with TODO markers wherever environment-specific values are required.

Skills are a great way to get more consistent, repeatable output across runs, so I decided to use them for this project. I could have used one of the tools listed in the table below, but I chose to go my own way and develop a Spec-Driven Development (SDD) framework, which I hope will keep improving with time.

Tool | Creator | Type | Link | Description
GitHub Spec Kit | GitHub | Open source | github/spec-kit | Turns feature ideas into specs, plans, and task lists before any code is written. Works with multiple AI coding agents. Specification first, code as generated output.
BMAD Method | BMad Code LLC | Open source | bmad-code-org/BMAD-METHOD | An AI-driven agile framework with specialised agents covering the full lifecycle from ideation to deployment. Scale-adaptive: adjusts planning depth from a bug fix to an enterprise system.
OpenSpec | Fission AI | Open source | Fission-AI/OpenSpec | Lightweight spec layer that sits above your existing AI tools. Each change gets a proposal, specs, design, and task list. No rigid phase gates, no IDE lock-in.

What are skills, and why are they a good fit?

Skills are essentially reusable prompt modules that aim to force LLMs to produce repeatable answers. Within a skill, I define the behavior and then attach supporting resources or scripts so Copilot can perform the task consistently. That means a skill can do more than just "write some code." A skill can define a repeatable workflow like this:

1. Fetch the blog URL
2. Extract headings, paragraphs, and code snippets
3. Normalize the article into a lightweight implementation spec
4. Decide what belongs in Terraform
5. Decide what belongs in the Databricks bundle
6. Generate files in a predictable project structure
7. Produce a TODO.md file for unresolved values

This approach turns Copilot from a generic assistant into a specialized code-conversion tool. However, there are some constraints I had to be mindful of when developing skills:

- Context window limits. The model has limited space to read instructions, process input, and generate output. Long prompts can cause files to be cut off or steps to be skipped.
- Non-determinism. Output may vary between runs, even with strict instructions. I always lint, validate, and review the diff before committing.
- Boundary leakage. Models may invent plausible but incorrect values. The TODO.md pattern must be enforced as a rule, not a suggestion.
- Model and tool drift. Copilot's model and tool surface change over time. I use example inputs and outputs as repeatable sanity checks.
- Maintainability. A skill is code-as-prompt and will age with the platforms it targets. I keep skills narrowly scoped so they stay easy to update.

I'll explain the TODO.md file in more detail later in this post.
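To make the "TODO.md as a rule" idea concrete, here is a minimal sketch of a post-generation check that gathers every TODO marker from generated files into a single checklist. This is not the skill's actual implementation: the function names `collect_todos` and `render_todo_md`, the marker pattern, and the sample file contents are all assumptions for illustration.

```python
import re

# Assumed marker convention: the word TODO, optionally followed by a colon
# and a short note, inside a comment in any generated file.
TODO_RE = re.compile(r"\bTODO\b[:\s]*(.*)")


def collect_todos(files):
    """Scan a {path: text} mapping and return (path, line_no, note) tuples."""
    found = []
    for path, text in sorted(files.items()):
        for line_no, line in enumerate(text.splitlines(), start=1):
            match = TODO_RE.search(line)
            if match:
                found.append((path, line_no, match.group(1).strip()))
    return found


def render_todo_md(todos):
    """Render the collected markers as a checklist suitable for TODO.md."""
    lines = ["# TODO", ""]
    for path, line_no, note in todos:
        lines.append(f"- [ ] {path}:{line_no} {note or '(no description)'}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical generated output with one unresolved value.
    generated = {
        "infra/terraform/main.tf": (
            'subscription_id = ""  # TODO: set subscription ID\n'
            'location = "uksouth"\n'
        ),
    }
    print(render_todo_md(collect_todos(generated)))
```

Running a check like this after each generation turns the placeholder rule into something that can be verified in the diff review, rather than relying on the model to have followed the instruction.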
The GitHub repo

The repository can be found at MarcoScagliola/CopilotBlogToCode. I have added a script that, when invoked, deletes all the files produced by the skills, so you can test the repo from a clean state:

    python .github/skills/blog-to-databricks-iac/scripts/reset_generated.py --force

If you want to try it out, please clone the repo and run it on your own copy. In GitHub Copilot, I usually keep:

- Model set to Auto
- For Configure Tools, only the built-in tools selected

Below is the prompt that I use to run the skills and have the blog analysed.

    Use the blog-to-databricks-iac skill on this article: https://techcommunity.microsoft.com/blog/analyticsonazure/secure-medallion-architecture-pattern-on-azure-databricks-part-i/4459268
    Inputs:
    workload: blg
    environment: dev
    azure_region: uksouth
    github_environment:

To make this more repeatable and less manual, I've added a prompt file at run-blogToDatabricksIac-selected-tools.prompt.md, which can be run directly from VS Code by opening the file and clicking the run button at the top. Feel free to experiment with it and let me know what you think. Further instructions on how to use the repo are available in READ_FIRST.md.

The following steps describe the exact repository setup I used for this workflow, starting with my initial configuration and ending with the final directory structure and files.

1. Create a new GitHub repository and clone it locally

I started by creating a new repository on GitHub, then cloned it to my local machine so I could add the Copilot skill, Terraform scaffolding, and Databricks bundle files in a centralized location.

    git clone https://github.com/YOUR-ORG/blog-to-databricks-iac.git
    cd blog-to-databricks-iac

This approach keeps the workflow organised from the start: the repository exists on GitHub first, and the local clone becomes the working directory for all subsequent setup steps.

2. Create the GitHub skill folder structure (first iteration)

GitHub Copilot skills are file-based and centered on a SKILL.md file inside a skill folder. GitHub's current pattern places these under .github/skills/. I used the commands below to create the folder hierarchy for my initial integration.

    mkdir -p .github/skills/blog-to-databricks-iac/scripts
    mkdir -p .github/skills/blog-to-databricks-iac/templates
    mkdir -p infra/terraform
    mkdir -p databricks-bundle/resources
    mkdir -p databricks-bundle/src

This script generates the structure depicted below.

3. Add the main skill definition

Next, I created the SKILL.md file at .github/skills/blog-to-databricks-iac/. The orchestrator decides what happens and in what order, while each specialist decides what its own file should contain (for example, the Terraform specialist owns the Terraform, the bundle specialist owns the bundle, and so on). In practice, SKILL.md turns Copilot from a general assistant into a domain-specific generator for this repo. GitHub documents this SKILL.md-based structure as the foundation of agent skills. My first iteration of .github/skills/blog-to-databricks-iac/SKILL.md was very simple and can be found here.

4. Add a script to fetch and normalize the blog article

Next, I created a Python script that the main orchestrator SKILL.md invokes to read the blog article. This script is stored at .github/skills/blog-to-databricks-iac/scripts/ and named fetch_blog.py. Within SKILL.md, the script is invoked as shown below.

### 1. Fetch article
```bash
python .github/skills/blog-to-databricks-iac/scripts/fetch_blog.py "<url>"
```
If fetch fails, stop and return the fetch error output. Do not retry; surface the error to the user and wait for guidance.

The script validates the URL, fetches the HTML with a 30-second timeout, and uses a spoofed Mozilla User-Agent to avoid being blocked by CDNs (Content Delivery Networks).
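For readers who want a feel for this kind of script, here is a minimal, standard-library sketch of the fetch setup and the tag-by-tag extraction the article describes. It is not the repository's fetch_blog.py: the names `build_request`, `ArticleParser`, and `extract`, the User-Agent string, the tracked tag set, and the reading of "first match wins" as "earliest keyword in the text" are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import Request

TIMEOUT_SECONDS = 30  # would be passed as urlopen(request, timeout=TIMEOUT_SECONDS)
USER_AGENT = "Mozilla/5.0 (compatible; fetch-sketch)"  # illustrative value
CLOUD_HINTS = {"aws": "aws", "s3": "aws", "azure": "azure", "adls": "azure",
               "gcp": "gcp", "google cloud": "gcp"}
CLOUD_PATTERN = re.compile(r"\b(AWS|S3|Azure|ADLS|GCP|Google Cloud)\b", re.IGNORECASE)


def build_request(url: str) -> Request:
    """Validate the URL and attach a browser-like User-Agent header."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an http(s) URL: {url!r}")
    return Request(url, headers={"User-Agent": USER_AGENT})


class ArticleParser(HTMLParser):
    """Buffer text inside headings, paragraphs, and code blocks until the tag closes."""

    HEADINGS = {"h1", "h2", "h3"}
    TRACKED = HEADINGS | {"p", "pre"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decodes HTML entities in text
        self.headings, self.paragraphs, self.code_blocks = [], [], []
        self._open_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._open_tag is None and tag in self.TRACKED:
            self._open_tag, self._buffer = tag, []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open_tag:
            return
        raw = "".join(self._buffer)
        # Code keeps its internal whitespace; prose is collapsed and trimmed.
        text = raw.strip() if tag == "pre" else " ".join(raw.split())
        if text:
            target = (self.code_blocks if tag == "pre"
                      else self.headings if tag in self.HEADINGS
                      else self.paragraphs)
            target.append(text)
        self._open_tag = None


def extract(html_text: str) -> dict:
    """Parse HTML and return the extracted pieces plus a cloud hint."""
    parser = ArticleParser()
    parser.feed(html_text)
    parser.close()
    prose = " ".join(parser.headings + parser.paragraphs)
    match = CLOUD_PATTERN.search(prose)  # earliest keyword in the text wins
    hint = CLOUD_HINTS[match.group(1).lower()] if match else "unknown"
    return {"headings": parser.headings, "paragraphs": parser.paragraphs,
            "code_blocks": parser.code_blocks, "cloud_hint": hint}
```

The sketch leaves out the network call, size caps, and structured error output; a real script would wrap `urlopen` in error handling and serialise the result with `json.dumps`.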
The script reads through the HTML one tag at a time, flagging when it enters relevant sections like paragraphs, headings, or code blocks, and buffering text until the tag closes. Before storing anything, it cleans the text by decoding HTML entities, collapsing whitespace, and trimming edges. As it parses, the script also scans for cloud platform keywords: AWS, S3, Azure, ADLS, GCP, Google Cloud. The first match wins; if none are found, it returns unknown. This is a quick heuristic, not authoritative. Finally, it outputs clean JSON with the extracted data: title, headings, paragraphs, code blocks, and cloud hint, capped at reasonable sizes to keep the output manageable. If anything goes wrong, such as a network error, timeout, bad HTML, or empty content, the script exits cleanly with a structured error message, making it easy to integrate into larger workflows without surprises. The Python script can be found here.

5. The output and output contract

Now I needed to think about the output I wanted GitHub Copilot to deliver through the skills. To reiterate, I needed the following:

| File Name | Description |
| --- | --- |
| README.md | This is the operator-facing runbook that turns the generated artifacts into a working deployment. It contains no unresolved placeholders and no embedded credentials. The header summarizes the architecture and links back to the source blog. A prerequisites section lists required Azure access, Entra permissions, GitHub Environment setup, and local CLI versions. It includes tables of always-required GitHub secrets and variables, plus conditional ones based on deployment mode. Step-by-step numbered sections walk through bootstrapping the deployment principal and populating the GitHub Environment. Workflow blocks describe each Terraform validation, infrastructure deployment, and DAB deployment step, including file paths, triggers, and outputs. A commands section lists the exact Terraform and Databricks bundle sequences to run. Finally, assumption notes point the operator to TODO.md and SPEC.md for context. |
| TODO.md | The operator's checklist of remaining tasks. It uses a strict five-section format in which every entry has the same fields (Heading, What this is, Why deferred, Source, Resolution, Done looks like), with no commands or code, only concepts and decisions. Each section captures a different layer of work: pre-deployment tasks like RBAC roles and GitHub secrets, deployment-time inputs like region and environment, post-infrastructure setup like Key Vault secrets and external locations, post-DAB work like Unity Catalog grants and job schedules, and architectural choices the orchestrator couldn't make (network posture, schemas, partitioning). Every entry comes from something the article left unstated, plus the universal post-deploy work for any Databricks deployment. The operator works through TODO.md sequentially, resolving each item before the system is production-ready. |
| SPEC.md | The structured, source-faithful read of the blog article, organized by checklist. Every item is marked as a stated value, inferred from code or diagrams, or "not stated in article." It includes architecture details, Azure services configuration, Databricks setup, data model, security and identity requirements, and observations. SPEC.md is the single source of truth that Terraform and DAB generators read from, TODO.md is populated from every "not stated" entry, and README.md references it for assumptions. This ensures the deployment is built on documented decisions, not hidden assumptions. |

Together, these files create a clear boundary: SPEC.md answers what the blog says, TODO.md captures what's missing or must be decided, and README.md tells you exactly how to deploy. This split is enforced by validation rules that fail if any content duplicates across the three files.
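One way to picture a "no duplicated content" rule is the sketch below. The article does not show how the repository implements its validation, so everything here is hypothetical: the function name `find_duplicates`, the paragraph-level comparison, and the eight-word minimum that keeps short headings from counting as duplicates.

```python
import re
from collections import defaultdict


def _paragraphs(text: str, min_words: int) -> set[str]:
    """Split on blank lines, normalise case and whitespace, drop short fragments."""
    normalised = set()
    for block in re.split(r"\n\s*\n", text):
        words = block.lower().split()
        if len(words) >= min_words:
            normalised.add(" ".join(words))
    return normalised


def find_duplicates(docs: dict[str, str], min_words: int = 8) -> dict[str, list[str]]:
    """Map each paragraph that appears in more than one file to the files containing it."""
    seen = defaultdict(list)
    for name, text in docs.items():
        for paragraph in _paragraphs(text, min_words):
            seen[paragraph].append(name)
    return {p: sorted(names) for p, names in seen.items() if len(names) > 1}


# Illustrative input: the same sentence appears in README.md and TODO.md.
shared = "Grant the deployment principal the Contributor role on the resource group."
docs = {
    "README.md": "Intro text.\n\n" + shared,
    "TODO.md": shared.upper() + "\n\nOther item here.",
    "SPEC.md": "Region: not stated in article",
}

for paragraph, files in find_duplicates(docs).items():
    print(f"duplicated in {files}: {paragraph}")
```

In a pipeline, a non-empty result would typically translate into a non-zero exit code so that the generation run is marked as failed.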
To make these files as repeatable as possible, I needed two things:

- Two templates, one for README.md and one for TODO.md, that the orchestrator fills in from SPEC.md at generation time.
- A broader delivery contract, output-contract.md, which lists the five files the orchestrator must produce. README.md and TODO.md are two of those five, and the templates are how they get produced.

The output-contract.md file defines a strict, ordered format that the agent must follow when transforming a blog article about Databricks-on-Azure architecture into a runnable repository. The first commit was deliberately minimal, as you can see from the file available here: no leaf-skill routing, no repo-context.md, no GitHub Actions workflows, no validation rules, no entry-field templates for TODO.md. That commit's single job was to lock down the shape of the output: what gets produced and in what order. Every commit since has refined how to produce that shape without changing what gets produced.

Putting the contract in the very first commit gave every later change a fixed reference point. Every leaf skill, generator script, and validation rule I've added since has fit into one of its five sections. The pipeline has changed; the deliverables haven't. The structure of the GitHub repo at commit 17ab443 can be seen in the diagram below.

6. The README.md and TODO.md templates

After iteratively working on the orchestrator, a clear pattern emerged: the code-generation paths were fairly stable, but the documentation outputs weren't. Every run produced README.md and TODO.md from scratch in free-form Markdown, and the same content kept drifting across runs. Section ordering changed between runs and the explanation of GitHub Environments was rewritten with subtle wording differences. RBAC roles appeared sometimes as lists, sometimes in prose, sometimes split across sections.
Universal post-deploy actions (create the secret scope, populate the vault, set up Unity Catalog grants) were re-derived every time, occasionally with steps missing. The root cause was that the orchestrator was treating durable, universal content as if it were per-run content.

So I decided to add two templates: README.md.template and TODO.md.template. Templates separate universal content (RBAC, TODO sections, GitHub setup), which lives in the template, from per-workload content (catalog names, credentials), which is substituted from SPEC.md. This delivers consistency across runs. The README and TODO are structurally identical from run to run, so readers can navigate them intuitively. Universal content is correct by construction; I write it once, review carefully, and every run inherits that quality. Validation also becomes more precise, and the agent's job shrinks from open-ended writing to mechanical substitution, which is easier to validate and maintain. Templates introduce clear vocabulary: {placeholder} is filled by the orchestrator at generation time, by the deployer at run time. Finally, templates enforce traceability: every "not stated in article" entry in SPEC.md automatically becomes a TODO entry via the from SPEC.md slot, making this an automatically-enforced rule.

I'm invoking the templates in the orchestrator as shown below. The Git commit with this code can be found at this link.

    ### 3.1 Generate README from template
    Load the template: `.github/skills/blog-to-databricks-iac/templates/README.md.template`

    ### 3.2 Generate TODO from template
    Load the template: `.github/skills/blog-to-databricks-iac/templates/TODO.md.template`

7. The output of the fetch_blog.py file and the interaction with the orchestrator

When the orchestrator invokes fetch_blog.py, the script produces a JSON output and passes it back to the orchestrator. The orchestrator then reads the JSON document into its working context and maps each field onto an analysis checklist.
The title and meta description establish the article identity and scope. Headings with their levels reveal the structure, helping the agent locate sections about architecture, security, data flow, and naming. Paragraphs provide evidence for stated values like regions, resource types, and RBAC models. Code blocks become the source of inferred values. As an example, a Terraform snippet might reveal SKU choices or naming patterns not mentioned in the text. These inferred values get tagged "inferred from code snippet" when recorded. The cloud hint acts as a sanity check that the article actually describes an Azure architecture. For every checklist item, the agent records either an extracted value or the literal string "not stated in article". This becomes SPEC.md , the single source of truth for everything downstream. SPEC.md drives every subsequent step. Steps 3 through 7 (the Terraform module, workflows, and Databricks bundle generators) read architectural decisions from it. Step 8 then produces TODO.md by converting every "not stated in article" entry into a TODO item the operator must resolve before deployment. What I find worth pointing out is how little the output contract has actually moved since that very first commit. The implementation underneath has changed completely. Leaf skills emerged, generator scripts came in, validation rules got added, a soft-delete state machine showed up to handle Key Vault recovery. None of those existed at the start. But what the orchestrator delivers, the list of files it puts on disk, has stayed exactly the same. We have a much larger SKILL.md today that still mirrors the initial five-item output list. The contract itself has changed by exactly one line: the addition of "Design of the architecture" to section 5. 
- SPEC.md: the structured, source-faithful read of the article, organised by the analysis checklist (link)
- TODO.md: the operator's checklist of everything the article didn't specify, plus the universal post-deploy actions (link)
- Terraform code under infra/terraform/: the platform layer with networking, storage, identities, Key Vault, workspace (link)
- Databricks Asset Bundle under databricks-bundle/: the workload layer with jobs, entry points, environment configuration (link)
- README.md: the operator runbook, with the architecture design diagram embedded (link)

If the JSON contains an error, the orchestrator stops immediately. Per the skill rule "If fetch fails, stop and return the fetch error output. Do not retry," the error surfaces to the user rather than propagating downstream. So the script's output is the raw evidence pack: title, structure, prose, code, cloud hint. The agent uses it to fill the architecture spec, which parameterises every generated artifact.

At this point the fetch_blog.py output is sent to Step 2 of the orchestrator, as shown in the snippet below.

    ### 2. Analyse article
    Analyse the fetched article against the structured checklist in `.github/skills/blog-to-databricks-iac/references/blog-analysis-checklist.md`. The analysis covers the article text, diagrams, screenshots, and code snippets.

And, much later in the orchestrator, Step 8 closes the loop by turning everything that's been recorded into the two operator-facing documents:

    ### 8. Generate README and TODO from templates
    Use the templates in `.github/skills/blog-to-databricks-iac/templates/`:
    - `README.md.template` -> `README.md`
    - `TODO.md.template` -> `TODO.md`

8. How this actually came together

What I've described so far is how the orchestrator works currently. The reality of building it was much more cumbersome, but also fun. I got from the first version to the current one by iterating.
Rerun the orchestrator, find the defect, identify the rule that would have caught it, add the rule to the skill that owns the artifact, rerun.

The reason I'm calling this out now, before walking through the rest of the pipeline, is that everything from this point on is a story about a specific lesson learned that way. The leaf skills exist because a single SKILL.md got too dense. The restricted-tenant guardrails exist because the deployment failed against a tenant that couldn't read Microsoft Graph. The validation harness exists because prose rules weren't catching the regressions that mattered. The soft-delete state machine exists because the same vault name kept colliding with a previous deploy. None of these rules were present from day one.

So in the next sections I'll walk through how the pipeline actually matured: how the single skill split into a graph, what the inner regenerate-fix loop felt like in practice, the day the project pivoted to support restricted tenants, the bugs that became rules, and the Key Vault soft-delete state machine that closed the project out.

9. From a single skill to a skill graph

When I started, everything lived inside a single SKILL.md. It was simpler that way, and to be honest, at that point I didn't yet know which rules would actually matter. But as I kept rerunning the orchestrator on the article, a pattern emerged. Each rerun produced something that broke in a slightly different way, and the fix always belonged to a very specific concern: Terraform authoring, bundle structure, workflow generation, or the orchestration logic itself. Stuffing the rules for all of them into one file was making the orchestrator unreadable and, worse, was silently dropping rules when the context window got tight.

So I split it. The orchestrator stayed at the top, kept routing the work and validating the result, and each concern got promoted to its own leaf skill.
The Databricks bundle skill itself needed one more split a few days later because it had become too dense, so I broke it into two leaves:

- databricks-yml-authoring (link)
- Python-entrypoints (link)

The diagram below shows the shape the repo has today. The orchestrator now does almost no authoring. It owns the sequence of steps, the contract, and the validation gates, while everything else is delegated. This was the single biggest readability win. I wish I'd done it earlier. REPO_CONTEXT.md is one extra node in that diagram that I want to call out, but I'll come back to it later in section 13.

10. The inner loop: rerun, fail, fix the skill

If I had to describe the middle of this project in one sentence, it would be: every commit was a regeneration. I'd run the orchestrator end-to-end against the article and inspect the generated Terraform, the bundle, and the workflows. I'd find a defect, identify the rule that would have prevented it, add that rule to the skill that owns the artifact, then rerun, as shown in the image below.

This loop is what I think people miss when they treat AI-generated infrastructure code as a one-shot. The first run is never the deliverable. The deliverable is the skill that produces good runs. The generated files are disposable and can always be reproduced. The skill is what carries the knowledge forward.

I had to actively resist the temptation to fix bugs in the generated code directly. Patching infra/terraform/main.tf by hand fixes today's run but not tomorrow's, because the rule that would prevent the bug doesn't exist anywhere. So I made it a discipline: never edit the output, always edit the skill, then regenerate.

11. Restricted-tenant compatibility

The bug was simple to describe and brutal to fix: the deployment principal in the target tenant couldn't read Microsoft Graph.
Any Terraform data source that resolved an Entra name to an object ID at plan time (e.g., azuread_user, azuread_group, azuread_service_principal) blew up at terraform plan. My first instinct was to think "I'll just give the principal Graph permissions". But in a lot of real environments this is not possible. The principal that runs your IaC is governed by a security team, the team has a policy, and the policy says no Graph reads.

The pivot was getting the skill to produce Terraform that never reads Graph. Object IDs are inputs, not lookups. They come in as trusted secrets, the workflow exports them as TF_VAR_*, and Terraform consumes them as variables. No data "azuread_*" block is allowed in the generated code, ever.

I thought this was a simple fix. It wasn't. It cascaded into about six other things:

- App Registration vs Service Principal object IDs. The workflow was being given the wrong one. Role assignments need the Enterprise Application (Service Principal) object ID, not the App Registration object ID. The two are different objects in Entra with different IDs. I encoded the distinction in the skill as *_SP_OBJECT_ID (the Service Principal) versus *_CLIENT_ID (the App Registration's application ID). Naming carries the meaning now, so the wrong value is hard to pass.
- Single-principal mapping. In some tenants you only have one principal and it has to play both deployment and runtime roles. The skill grew a layer_sp_mode = existing input so the generator stops trying to create a new Service Principal and reuses the deployment one instead.
- Key Vault access policies, gone. Access policies were Graph-touching, and not all tenants support them anyway. The skill switched fully to RBAC role assignments (Key Vault Secrets User, and so on). A few cascading bugs followed, but this was the right call.

It took some time to harden the Terraform skill against everything the restricted tenant was throwing back.
Each iteration had the same shape: the orchestrator runs, hits a fresh provider error, I add the rule, run again, and hit the next one. The commit subjects from that run are basically a transcript of the conversation I was having with the platform.

12. The bugs that became rules

There are three bugs that I believe are worth telling the story of, because they each illustrate a slightly different lesson.

The HCL trim() arity bug. The generator emitted trim(var.something) in a validation block. HCL's trim() takes two arguments, not one. The function I actually wanted was trimspace(). This is the kind of bug that any human would catch in a code review in two seconds, and which the model produced confidently because the shape of the call looked right. I added the rule to the Terraform skill ("for whitespace trimming use trimspace, never trim") and the bug never came back. Lesson: even for trivial syntactic mistakes, the fix belongs in the skill.

The variable shadowing bug. The deploy workflow had a job-level env: block that set TF_VAR_key_vault_recover_soft_deleted to a static value. A detection step earlier in the workflow was supposed to compute the right value at runtime and write it via $GITHUB_ENV. The problem is that GitHub Actions resolves job-level environment variables before $GITHUB_ENV writes take effect, so the static value always won and the dynamic one was silently ignored. The fix was to never set the recovery flag at job level. It must be written in the detection step, on every code path, including the trivial "no recovery needed" path. Lesson: state must be explicit, not inherited. If a flag has three possible meanings, three code paths must each write it.

The hardcoded -platform suffix. The workflow had a shell-side suffix that someone (let's be honest, the model) had invented to make the resource group name "look right".
When recovery logic started running and the workflow looked for the canonical resource group, it looked for -platform instead of whatever the Terraform locals.tf actually emitted. The result was that the recovery handler was happily reaching past the real resource group and into a different one. I made it a rule in the orchestrator: workflow-invented suffixes are not permitted. Naming is owned by Terraform's locals.tf.

There are seventeen more defects in the catalogue, and the pattern is the same in every case. The bug surfaces, the rule gets written, and the rule lives in the skill that owns the affected artifact.

There is no implementation-learnings.md in the repo. There used to be, but I deleted it because a tracked log of past bugs, sitting next to a skill that's already supposed to encode the lessons from those bugs, is a duplication waiting to drift. I believe that if the rule is in the skill, the log is redundant. If the rule isn't in the skill, the log is evidence that I haven't finished the work. Either way, the right place for bug history is git log.

13. Splitting "the skill" from "this repo's defaults"

I then wanted the orchestrator to be portable, but every run kept needing the same handful of decisions. Which Azure region by default? Which environment names? Which catalog naming convention? These weren't part of the article. They weren't part of the Terraform skill either. They were specific to this repository's opinion about how things should be deployed. If I baked them into the orchestrator, the orchestrator stopped being portable. If I left them out, every run produced unhelpful "not stated in article" entries for the same five universal decisions.

The answer was a new file called REPO_CONTEXT.md stored in the repo root. It's read by the orchestrator before generation and it carries the defaults that are owned by the repo, not by the skill.
The split looks like this in practice:

- SKILL.md answers the question "how do I turn an article into a runnable repo?" It is portable.
- REPO_CONTEXT.md answers the question "what does this repo default to when the article doesn't say?" It is local.

Cloning the orchestrator into another GitHub project is now a clean operation. You take the skill, you write your own REPO_CONTEXT.md, and the same generator produces output appropriate to your environment.

14. The Validations

Most of the rules I'd written into the skills were prose. "Don't invent suffixes." "Object IDs are inputs, not lookups." "Every required Terraform variable must have a matching TF_VAR_* in the workflow." The model is good at following prose rules most of the time, but not all of the time. So a few of the most regression-prone rules became executable.

The most important one is scripts/validate_workflow_parity.sh. Every variable declared in infra/terraform/variables.tf must appear as a TF_VAR_* export in the deploy workflow. The script greps both files, diffs the sets, and exits non-zero if they don't match. It is run at the end of generation. If it fails, the run failed, even if everything else looks fine.

This caught real bugs. The most embarrassing was a variable I'd added to variables.tf and forgot to wire through the workflow. Terraform plan would prompt interactively for it on a non-interactive runner, and the run would hang.

The rule of thumb I've ended up with is: prose rules are the default, but if a rule has been violated more than twice, it gets promoted to an executable check. There's a short list of those checks now, and it's the load-bearing one.

15. Key Vault soft-delete state machine

Key Vaults in Azure have soft delete on by default. When you delete a vault, it sticks around in a "soft-deleted" state for the retention period, which is ninety days by default. If you try to create a vault with the same name in the same subscription during that window, the deploy fails. The right behaviour is to recover the soft-deleted vault, not create a new one.
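As a rough illustration of the simplest form of that decision, the sketch below decides whether a vault name needs recovery and writes the flag on both outcomes, in line with the "state must be explicit" lesson from section 12. It is not the repository's handler, which the article describes as bash: the function names are hypothetical, and the list of soft-deleted vault names is assumed to have been collected beforehand (for example with `az keyvault list-deleted`). Only the variable name TF_VAR_key_vault_recover_soft_deleted comes from the article.

```python
from collections.abc import Iterable

FLAG_NAME = "TF_VAR_key_vault_recover_soft_deleted"  # variable name taken from the article


def recovery_flag_line(vault_name: str, soft_deleted_names: Iterable[str]) -> str:
    """Build the NAME=value line for the recovery flag, for both outcomes."""
    # Names are compared case-insensitively as a precaution.
    recover = vault_name.lower() in {name.lower() for name in soft_deleted_names}
    return f"{FLAG_NAME}={'true' if recover else 'false'}"


def write_flag(env_file_path: str, line: str) -> None:
    """Append the line to the file that GitHub Actions exposes as $GITHUB_ENV."""
    with open(env_file_path, "a", encoding="utf-8") as env_file:
        env_file.write(line + "\n")
```

In a workflow step, `env_file_path` would come from the `GITHUB_ENV` environment variable, and the flag would be written whether or not recovery is needed, so a later step never falls back on an inherited default.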
The first version of my recovery handler covered exactly one case: if the vault is soft-deleted, recover it. This worked the first time I ran it. The second time, the recovered vault came back into the previous resource group, not the new one I had just created. Terraform then tried to create a new vault in the correct resource group and failed because the name was already taken globally. The handler had no concept of "the recovered vault is in the wrong resource group." So I added that case. The third time, the previous resource group itself was gone, and the handler was looking for it to verify the move. So I added that case too. By the end, the state machine had three distinct cases and two preconditions, as shown in the diagram below. The reason I keep coming back to this state machine is that it captures something that I think is generally true about agent-generated infrastructure code. The happy path is easy and meaningless, while the value is in the failure modes. The first version that worked on a clean tenant was about ten lines of bash. The version that works on a tenant that has been deployed-into and partially-torn-down five times is six times longer, and every additional line of it corresponds to a real environmental condition that I had to learn the hard way. 16. What I've learned so far I'm not going to pretend the full list of principles below was clear to me on day one. Every single one of these was learned by getting it wrong first. Looking back at the history, though, they are the ones that survived contact with reality. The contract precedes the implementation. output-contract.md was committed before any generator existed. Locking the shape of the deliverable first meant every later change had a fixed reference point. Generators, not stencils. Workflows are produced by Python scripts that take parameters and emit YAML. 
When restricted-tenant logic and the soft-delete state machine arrived, they needed conditional structure that a static template can't express.

Every bug becomes a rule. Patching the generated code is a tax on tomorrow's run, while patching the skill is an investment.

Each concern has a clear owner. The orchestrator routes, the leaves author, and the repo context holds the local defaults.

Restricted-tenant compatibility is non-negotiable. No Microsoft Graph reads from generated Terraform. Object IDs are trusted inputs. Single-principal mapping is supported.

Naming is owned by Terraform. No suffixes invented in shell. The validation harness enforces this.

State must be explicit, not inherited. Every workflow run writes its own flags. No reliance on env defaults from a previous step or a previous run.

Validation is executable when a rule has been violated more than twice. Prose rules are the default. Promotion to a script is earned.

Operator docs describe concepts, not commands. Command syntax ages out, while conceptual descriptions don't. The TODO template enforces this rule.

Add strong testing at the end of the process, once all the files are generated. Each run may produce slightly different output and introduce bugs, even if the previous run was successful.

End-to-end runs against dirty tenants are the truth. The acceptance test isn't a clean-room deploy. It's a deploy into a tenant that has soft-deleted vaults, lingering RGs, and existing role assignments. Until that works, the project isn't done.

From time to time, skills need to be reviewed and consolidated.

The summary above is the one I find most useful to share when people ask whether this approach actually goes anywhere. The journey went from an empty repo to a generator that produces a deployable, restricted-tenant-compatible infrastructure-as-code repository from a blog URL, with executable validation and a recovery state machine that survives a previously-deployed environment.
The first commit was an empty workspace. The last commit was the one where the same orchestrator, run against the same blog, against a tenant carrying state from five previous runs, deployed cleanly with no manual intervention. That is what I was aiming to achieve when I started! Thanks for reading.

Resilient by Design: Azure Databricks Disaster Recovery Strategy
Introduction: From Recovery Plans to Resilience Strategy

As organizations increasingly rely on Azure Databricks for mission-critical analytics and data engineering workloads, the need for robust disaster recovery (DR) strategies becomes paramount. These platforms are no longer just analytics engines; they power real-time decisions, AI models, and core business operations. Yet many organizations still approach DR as a reactive safeguard rather than a strategic capability. Resilience today is not about "if something fails," but about ensuring continuity, trust, and performance under any condition. A modern DR strategy must therefore evolve beyond backup configurations and failover scripts. It must align with business priorities, regulatory requirements, risk tolerance, and operational maturity to become a core pillar of the enterprise data platform.

In this context, organizations are increasingly adopting architecture patterns that enable cross-region resilience for the Azure Databricks Lakehouse. This pattern includes synchronizing Unity Catalog objects (catalogs, schemas, tables, views, functions, models, and volumes) across regions, combined with scalable data movement mechanisms and secure data access approaches such as Delta Sharing and high-performance transfer tools.

To help organizations operationalize this approach today, we have defined a structured strategy for synchronizing Unity Catalog objects and associated data across regions, enabling a resilient-by-design Azure Databricks architecture. This post focuses on that approach, outlining the key architectural patterns, strategic considerations, and practical implementation steps required to design and enable cross-region resilience.

In October 2025, Databricks announced a Managed Disaster Recovery solution, developed in collaboration with Capital One, which includes managed replication, customer-specified failover, and read-only secondary capabilities.
The approach outlined in this post serves as a complementary, customer-managed pattern, providing a practical and production-ready path for organizations to achieve robust disaster recovery and business continuity while Databricks continues to expand its native DR capabilities.

Why Disaster Recovery for Azure Databricks is Different

Traditional disaster recovery approaches do not fully apply to modern Lakehouse platforms. In Azure Databricks, resilience must account for:

• Tight coupling between data, compute, and metadata (Unity Catalog)
• Distributed pipelines (batch, streaming, ML)
• Decentralized workspace ownership and rapid platform growth

This makes disaster recovery not just an infrastructure concern, but a data platform design challenge.

Figure 1. Main Disaster Recovery Considerations

Understanding the Fundamentals: RTO, RPO, and DR Trade-offs

Before defining a disaster recovery strategy, it is essential to understand the core concepts that drive design decisions. Recovery Time Objective (RTO) defines how quickly a system must be restored after a disruption, while Recovery Point Objective (RPO) defines how much data loss is acceptable. These two metrics directly influence the architecture, cost, and complexity of any DR solution.

As illustrated in Figure 1, there is a clear trade-off between cost and recovery performance:

• Active-active (hot) architectures minimize downtime and data loss but come at a higher cost.
• Warm standby provides a balance between cost and recovery time.
• Cold DR is cost-efficient but results in longer recovery times and higher data loss risk.

Understanding these trade-offs is critical to aligning DR strategy with business expectations.

Designing for Resilience: A Phased Disaster Recovery Approach

Disaster recovery has evolved beyond a one-time setup into a structured, lifecycle-driven capability. Leading organizations design resilience intentionally, implement it systematically, and continuously validate it to ensure ongoing effectiveness. The framework outlined below provides a practical and strategic approach to operationalizing disaster recovery in Azure Databricks environments, bridging the gap between architectural intent and true operational readiness.

Figure 2. Different Phases of Azure Databricks Disaster Recovery

Phase 1: Discovery & Assessment

A resilient disaster recovery strategy starts with clarity, yet in many Azure Databricks environments that clarity is often missing. As platforms evolve, clusters multiply, jobs are duplicated, and data assets grow, making it increasingly difficult to answer a simple question: what do we actually have, and how critical is it?

The Discovery phase addresses this by establishing a single, authoritative view of the platform. By consolidating all assets, dependencies, and usage patterns into a structured baseline, organizations can move from fragmented visibility to informed decision-making.
This approach aligns closely with the concepts outlined in “From Chaos to Clarity: Your Databricks Workspace on a Single Pane of Glass”, where establishing a comprehensive inventory becomes the foundation for governance, optimization, and ultimately resilience. This foundation enables teams to identify what matters most, define appropriate RTO and RPO targets, and understand the dependencies that will ultimately shape their disaster recovery strategy.

Outcome: A clear, data-driven baseline of the environment, enabling confident workload prioritization and effective disaster recovery design.

Phase 2: Strategy & Design

Once visibility is established, the next step is making deliberate design choices that balance resilience, cost, and complexity. At this stage, organizations define how their platform should behave under failure. This typically starts with selecting a multi-site deployment pattern, for which two primary approaches are commonly adopted:

• Active–Active, where both regions are fully operational and serve live workloads
• Active–Passive (Warm Standby), where a secondary region is pre-provisioned and activated only during failover

Active–Active architectures provide near-zero downtime and minimal data loss but come with increased cost and architectural complexity. Active–Passive patterns offer a more cost-efficient alternative, with slightly higher recovery times depending on how failover is orchestrated.

Beyond selecting the deployment pattern, a key architectural decision is how data is replicated across the Medallion architecture (Bronze, Silver, Gold). Our approach introduces a set of practical scenarios that allow organizations to tailor resilience based on both workload criticality and recovery requirements.
A common starting point is aligning the DR strategy to workload tiers, such as:

• Tier 1 (Mission-critical): Active–Active with full replication
• Tier 2 (Business-critical): Active–Passive with partial replication

Building on this, organizations can further refine their approach by defining how data is replicated across the Medallion layers:

• Full replication (Bronze, Silver, Gold): fastest recovery at highest cost
• Bronze-only replication: lower cost, with re-computation required during recovery
• Gold-only replication: optimized for consumption-focused use cases

This combination of workload tiering and Medallion replication strategies enables a flexible, fit-for-purpose approach to disaster recovery that balances performance, cost, and operational complexity.

Below we demonstrate, as an example, two representative patterns: (a) an Active–Active architecture, where data pipelines operate in continuous trigger mode across regions, enabling near real-time synchronization; and (b) an Active–Passive architecture, where all layers are replicated using a clone-based approach and activated on demand during failover. These scenarios highlight how organizations can balance recovery performance and cost by adjusting both the deployment model and the depth of data replication.

Figure 3. Active–Active Scenario: Continuous Trigger Mode

Within the Active–Passive model, multiple variations can be applied, ranging from full replication of all Medallion layers to more selective approaches (such as replicating only Bronze or Gold layers). This flexibility allows organizations to further balance recovery performance, cost, and operational complexity.

Figure 4. Active–Passive Scenario: Clone All Layers Mode

Phase 3: Disaster Recovery Implementation & Enablement

With the strategy defined, the focus shifts to translating design into a repeatable and operational solution.
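One way to make the Phase 2 decisions repeatable is to record them as data before any automation is written. The following is a minimal sketch, not part of the original approach: the class names, the tier keys, and the choice of Bronze-only for Tier 2 "partial replication" are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Pattern(Enum):
    ACTIVE_ACTIVE = "active-active"
    ACTIVE_PASSIVE = "active-passive"


class Layer(Enum):
    # Declared in Medallion order; iteration follows this order.
    BRONZE = "bronze"
    SILVER = "silver"
    GOLD = "gold"


@dataclass(frozen=True)
class DrPolicy:
    """Hypothetical record of the DR choice made for one workload tier."""
    pattern: Pattern
    replicated_layers: frozenset  # Medallion layers copied to the secondary region

    def missing_layers(self) -> list:
        """Layers that will not exist in the secondary region at failover."""
        return [l.value for l in Layer if l not in self.replicated_layers]


# Assumed mapping: Tier 1 uses full replication; Tier 2 uses Bronze-only
# as one possible reading of "partial replication".
POLICIES = {
    "tier1": DrPolicy(Pattern.ACTIVE_ACTIVE, frozenset(Layer)),
    "tier2": DrPolicy(Pattern.ACTIVE_PASSIVE, frozenset({Layer.BRONZE})),
}

if __name__ == "__main__":
    for tier, policy in POLICIES.items():
        print(tier, policy.pattern.value, "missing:", policy.missing_layers())
```

Keeping the policy in one structure means the replication jobs and the failover runbook can both read the same source, so a tier change is made in one place.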
At this stage, resilience is no longer conceptual; it is embedded into the platform through automation, data replication, and standardized deployment patterns.

From Strategy to Architecture

At a high level, the DR architecture spans both the primary and secondary Azure regions, ensuring that all critical components can be either replicated or recreated:

• Control plane synchronization: Users, groups, and workspace assets are replicated using SCIM, Terraform, and CI/CD pipelines.
• Workspace and metadata portability: Jobs, notebooks, and configurations are defined as code and deployed consistently across regions.
• Data layer replication: Managed data, external data, and streaming checkpoints are synchronized using deep clone operations.

This layered approach ensures that the platform can be reconstructed end-to-end, not just partially recovered.

Unity Catalog-Driven Replication

A critical aspect of the implementation is the replication of Unity Catalog metadata and associated data assets. This includes:

• Synchronizing catalogs, schemas, tables, views, functions, and volumes
• Using Delta Sharing to expose datasets across regions
• Leveraging deep clone and storage replication to ensure data availability
• Recreating external and managed locations in the target region

By combining metadata synchronization with data replication, the target environment becomes a fully functional mirror of the source.

Figure 5. Unity Catalog Focused DR Mechanisms

Operationalizing with a DR Pipeline

To make this repeatable, the architecture is supported by a DR pipeline that orchestrates the process end-to-end:

• Synchronize schemas and Unity Catalog structures
• Perform deep clone of Delta tables
• Recreate views and dependent objects
• Provision volumes and copy associated data
• Ensure consistency across storage layers (e.g., ADLS via AzCopy)

This pipeline can operate either continuously or on demand, depending on the selected DR pattern.

Figure 6. Azure Databricks DR Replication Workflow

Outcome: A fully implemented disaster recovery solution where data, metadata, and platform components are consistently synchronized, enabling rapid and reliable activation of workloads in a secondary region.

Phase 4: DR Drill: Validation, Operations & Continuous Improvement

A disaster recovery strategy is only valuable if it works when needed. This phase focuses on validating, operating, and continuously improving the DR solution to ensure it meets business expectations.

Failover & Failback in Practice

In a real failure scenario, the transition to the secondary region must be simple, predictable, and fast. A typical failover process includes:

• Detecting primary region unavailability
• Executing a final synchronization (if possible)
• Redirecting connections to the DR workspace
• Resuming operations without requiring code changes

Equally important is failback, once the primary region is restored:

• Re-synchronizing data from DR to primary
• Switching pipelines and configurations back
• Gradually restoring normal operations

Because infrastructure and metadata are standardized, this process becomes operational rather than reactive.

Operating DR as a Continuous Capability

Beyond failover, DR must be actively managed as part of daily operations:

• Monitoring & Alerting: Track job failures, performance bottlenecks, and system health
• Governance & Change Management: Maintain consistency between environments using IaC and version-controlled pipelines
• Continuous Optimization: Adjust replication strategies, scaling, and performance as workloads evolve

This ensures the DR solution remains aligned with both technical and business changes over time.
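Both the routine replication in the DR pipeline and the final synchronization step of a failover come down to running a deep clone for each Delta table. The sketch below only builds the SQL text, so it can be reviewed or tested outside a cluster. The catalog, schema, and table names are hypothetical, and it assumes the source tables are visible from the secondary workspace (for example through a Delta Sharing catalog, as described above).

```python
def quote_ident(name: str) -> str:
    """Backtick-quote one identifier part, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def full_name(catalog: str, schema: str, table: str) -> str:
    """Three-level Unity Catalog name: catalog.schema.table."""
    return ".".join(quote_ident(part) for part in (catalog, schema, table))


def deep_clone_statements(tables, source_catalog: str, target_catalog: str) -> list:
    """Build one DEEP CLONE statement per (schema, table) pair.

    `tables` is an iterable of (schema, table) tuples; the same schema and
    table names are reused in the target catalog.
    """
    statements = []
    for schema, table in tables:
        src = full_name(source_catalog, schema, table)
        dst = full_name(target_catalog, schema, table)
        statements.append(f"CREATE OR REPLACE TABLE {dst} DEEP CLONE {src}")
    return statements


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for stmt in deep_clone_statements([("sales", "orders")], "prod", "prod_dr"):
        # In a Databricks notebook or job, each statement would be passed
        # to spark.sql(stmt); here it is only printed.
        print(stmt)
```

Generating statements from an inventory of tables, rather than hand-writing them, keeps the replication scope aligned with whatever the Discovery phase catalogued.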
Ensuring Performance, Integrity, and Security

A production-ready DR solution must also guarantee:

• Performance & Scalability: Optimize compute, autoscaling, and data transfer to handle recovery scenarios efficiently
• Data Integrity & Consistency: Validate schema synchronization, monitor replication jobs, and ensure parity between regions
• Security & Compliance: Enforce consistent access controls, secure credentials, and enable audit logging across environments

Outcome: A validated and continuously evolving DR capability, where recovery processes are tested, monitored, and improved over time, providing confidence to both technical teams and business stakeholders.

Key Takeaways and Closing Thoughts

Resilience in modern data platforms is no longer defined by how quickly systems can recover, but by how effectively they are designed to withstand disruption in the first place. Azure Databricks, as a core engine for data, analytics, and AI, requires a disaster recovery approach that extends beyond infrastructure: one that treats data, metadata, pipelines, and governance as a unified system.

By combining a structured discovery phase, a strategy aligned to workload criticality, and automated, repeatable implementation patterns, organizations can move from reactive recovery to resilience by design. This not only reduces risk, but also ensures that critical data workloads remain available, trusted, and performant when it matters most.

The approach outlined in this post provides a practical and flexible way to enable cross-region resilience today, while also complementing the managed disaster recovery capabilities expected to be introduced by Databricks. As we anticipate the availability of these native features, this approach offers a production-ready foundation that can extend and integrate with future platform capabilities.
In a world where disruption is inevitable, the objective is no longer simply to recover, but to maintain continuity of data, decisions, and business operations with confidence.

Special thanks to Vasilis Zisiadis and Dimitris Kotanis, who contributed their expertise to create this material and bring it to life. Thank you to Antony Bitar, Collin Brian, and Jason Pereira for their support in reviewing the content.

From Chaos to Clarity: Your Databricks Workspace on a Single Pane of Glass
The question that never stays answered, until now

As Azure Databricks workspaces evolve, complexity creeps in unnoticed. Every Azure Databricks conversation with customers eventually lands on the same question: “What do we actually have in this workspace?” Over time, clusters multiply, jobs get cloned, warehouses are spun up for one-off demos and forgotten, and Unity Catalog keeps expanding until it’s hard to reason about. In most enterprises, each business or data science team operates its own workspace, while the central platform or operations team has little to no visibility into what’s being created or why. Teams often spend days or weeks trying to piece together what exists, who owns it, and the business purpose behind it, only to realize they still don’t have the full picture. And when the same question comes up next quarter, the cycle starts all over again.

To address this, we built a utility that helps customers answer exactly that question by providing a single pane of glass for all Databricks assets through comprehensive cataloging and usage analysis. The utility works in two phases: Discovery and Analysis. This post focuses on the first step, the Discovery phase, where we establish a clear, authoritative inventory of everything that exists in the workspace.

What the Discovery Phase Delivers

Think of the Discovery phase as a workspace health assessment. Once configured against a target workspace, the utility runs in a selected mode and consolidates all discovered assets into a centralized, Delta-based repository. The result is a structured, queryable, and dashboard-ready metadata store. Behind the scenes, ten purpose-built scanners run in a tiered and parallelized architecture, enabling a fast yet comprehensive scan of the entire workspace.
Scanner | What is Cataloged
Clusters | Interactive, job, SQL: configs, policies, pools
Jobs | Workflows, schedules, tasks, run history
Warehouses | SQL endpoints, sizes, serverless settings
Pipelines | Delta Live Tables and their state
Unity Catalog | Catalogs, schemas, tables, volumes
Workspace Objects | Notebooks, repos, ML experiments, serving endpoints, alerts, Genie spaces
Security | Identity, network, data protection settings
Billing | 30–180 days of DBU usage by SKU and product
Utilization | Real CPU, memory, runtime patterns (deep scan)
Spark Job Optimizer (plugin) | Skew, spill, small files, broadcast hints (deep scan)

Design Overview

# | Block | Role | Contents / Flow
1 | Source | Starting point: the Databricks environments being discovered. | One or more Azure Databricks workspaces. Auth via OAuth. Outputs an authenticated WorkspaceClient to the Orchestrator.
2 | Orchestrator | The brain of the utility: coordinates scanning, concurrency, retries, timing. | Tiered thread-pool executor, scan config (mode, billing window, UC depth, max workers). Dispatches scanners in controlled waves.
3 | Tier 1 Scanners | Lightweight, high-concurrency scans. Run first for quick signal. | Clusters, Warehouses, Pipelines, Security. Up to 12 workers, 10-min timeout. Artifacts flow to the Centralized Repository.
4 | Tier 2 Scanners | High-volume scans. Controlled concurrency to avoid API throttling. | Jobs, Workspace Objects (notebooks, repos, experiments, serving, alerts, Genie), Unity Catalog, Billing (30–180 days DBU). 1/2 workers, 30-min timeout.
5 | Tier 3 Scanners | Sequential, analysis-grade scans (deep scan only). | Utilization (CPU, memory, SQL usage patterns) and Spark Job Optimizer plugin (skew, spill, small files, broadcast hints). Runs after Tiers 1 & 2.
6 | Centralized Repository | The catalog of truth: where all output lands, timestamped and queryable. | Unity Catalog Delta tables (dashboard-ready) plus portable JSON and CSV exports for offline sharing or downstream tools.
7 | Single Pane of Glass | The user-facing view: insight at a glance. | Pre-built Lakeview dashboard: KPI strip, inventory charts, and week-over-week trends. Refresh to see current workspace state.

Why users love the view: visualization that earns its keep

This is where the Discovery phase stops being just a scan and starts becoming a decision-making tool. Because everything is consolidated into a single, Unity Catalog–backed source of truth, the Lakeview dashboard delivers a genuine single pane of glass for the entire Databricks workspace. At a glance, you get:

• KPI strip at the top: total clusters, active jobs, UC tables, SQL warehouses, DLT pipelines, workspace objects. One glance, one number each.
• Inventory charts: clusters by type, jobs by schedule, warehouses by size, tables by catalog. The shape of your workspace becomes obvious.
• The “that doesn’t look right” moments: the idle SQL warehouse with zero queries, the cluster running the wrong runtime, the notebook floating outside any repo. These surface instantly, without hunting.
• Change over time: because every scan is timestamped, you can literally see your platform grow (or sprawl) week over week.

In the first customer walkthrough, the platform team identified an always-on SQL warehouse with zero queries and three jobs running on the wrong compute tier, all within the first 30 minutes. That single view paid for the project.

Sample Item Catalog

Closing thoughts

The Discovery phase isn’t about governance for governance’s sake; it’s about clarity. Before teams can optimize costs, improve performance, or enforce standards, they first need a reliable answer to a basic question: what actually exists today? By giving platform and operations teams a single, authoritative view of all Databricks assets, grounded in data rather than tribal knowledge, Discovery turns guesswork into informed decisions.
In the next phase, Analysis, that foundation is used to go deeper: identifying inefficiencies, risks, and opportunities to simplify and optimize the platform. But it all starts here, by finally knowing what you have.

Special thanks to Antony Bitar, Collin Brian, and Jason Pereira for their support in reviewing the content.

Azure Managed Redis & Azure Databricks: Real-time Feature Serving for Low-Latency Decisions
This blog post is a collaboration between the Azure Databricks and Azure Managed Redis Product and Product Marketing teams.

Executive summary

Modern decisioning systems (fraud scoring, payments authorization, personalization, and step-up authentication) must return answers in tens of milliseconds while still reflecting the most recent behavior. That creates a classic tension: lakehouse platforms excel at large-scale ingestion, feature engineering, governance, training, and replayable history, but they are not designed to sit directly on the synchronous request path for high-QPS, ultra-low-latency lookups. This guide shows a pattern that keeps Azure Databricks as the primary system for building and maintaining features, while using Azure Managed Redis as the online speed layer that serves those features at memory speed for real-time scoring.

The result is a shorter and more predictable critical path for your application: the Payment API (or any online service) reads features from Azure Managed Redis and calls a model endpoint; Azure Databricks continuously refreshes features from streaming and batch sources; and your authoritative systems of record (for example, account/card data) remain durable and governed. You get real-time responsiveness without giving up data correctness, lineage, or operational discipline.

What each service does

Azure Databricks is a first-party analytics and AI platform on Azure built on Apache Spark and the lakehouse architecture. It is commonly used for batch and streaming pipelines, feature engineering, model training, governance, and operationalization of ML workflows. In this architecture, Azure Databricks is the primary data and AI platform environment where features are defined, computed, validated, and published, and where governed history is retained.

Azure Managed Redis is a Microsoft-managed, in-memory data store based on Redis Enterprise, designed for low-latency, high-throughput access patterns.
It is commonly used for traditional and real-time caching, counters, and session state, and increasingly as a fast state layer for AI-driven applications. In this architecture, Azure Managed Redis serves as the online feature store and speed layer: it holds the most recent feature values and signals required for real-time scoring and can also support modern agentic patterns such as short- and long-term memory, vector lookups, and fast state access alongside model inference.

Business story: real-time fraud scoring as a running example

Consider a payment system that must decide whether to approve, decline, or require step-up authentication in tens of milliseconds, faster than the blink of an eye. The decision depends on recent behavioral signals, velocity counters, device changes, geo anomalies, and merchant patterns, combined with a fraud model. If the online service tries to compute or retrieve those features from heavy analytics systems on demand, the request path becomes slower and more variable, especially at peak load. Instead, Azure Databricks pipelines continuously compute and refresh those features, and Azure Managed Redis serves them instantly to the scoring service. Behavioral history, profiles, and outcomes are still written to durable Azure datastores such as Delta tables and Azure Cosmos DB so fraud models can be retrained with governed, reproducible data.

The pattern: online feature serving with a speed layer

The core idea is to separate responsibilities. Azure Databricks owns “building” features: ingest, join, aggregate, compute windows, and publish validated, governed results. Azure Managed Redis owns “serving” features: fast, repeated key-based access on the hot path. The model endpoint then consumes a feature payload that is already pre-shaped for inference. This division prevents the lakehouse from becoming an online dependency and lets you scale online decisioning independently from offline compute.
Pseudocode: end-to-end flow (online scoring + feature refresh)

The pseudocode below intentionally reads like application logic rather than a single SDK. It highlights what matters: key design, pipelined feature reads, conservative fallbacks, and continuous refresh from Azure Databricks.

    # ----------------------------
    # Online scoring (critical path)
    # ----------------------------
    function handleAuthorization(req):
        schemaV = "v3"
        keys = buildFeatureKeys(schemaV, req)      # card/device/merchant + windows
        feats = redis.MGET(keys)                   # single round trip (pipelined)
        feats = fillDefaults(feats)                # conservative, no blocking
        payload = toModelPayload(req, feats)
        score = modelEndpoint.predict(payload)     # Databricks Model Serving or an Azure-hosted model endpoint
        decision = policy(score, req)              # approve/decline/step-up
        emitEventHub("txn_events", summarize(req, score, decision))  # async
        emitMetrics(redisLatencyMs, modelLatencyMs, missCount(feats))
        return decision

    # -----------------------------------------
    # Feature pipeline (async): build + publish
    # -----------------------------------------
    function streamingFeaturePipeline():
        events = readEventHubs("txn_events")
        ref = readCosmos("account_card_reference")   # system of record lookups
        feats = computeFeatures(events, ref)         # windows, counters, signals
        writeDelta("fraud_feature_history", feats)   # ADLS Delta tables (lakehouse)
        publishLatestToRedis(feats, schemaV="v3")    # SET/HSET + TTL (+ jitter)

    # -----------------------------------
    # Training + deploy (async lifecycle)
    # -----------------------------------
    function trainAndDeploy():
        hist = readDelta("fraud_feature_history")
        labels = readCosmos("fraud_outcomes")        # delayed ground truth
        model = train(joinPointInTime(hist, labels))
        register(model)
        deployToDatabricksModelServing(model)

Why it works

This architecture works because each layer does the job it is best at. The lakehouse and feature pipelines handle heavy computation, validation, lineage, and re-playable history.
The online speed layer handles locality and frequency: it keeps the “hot” feature state close to the online compute so requests do not pay the cost of re-computation or large fan-out reads. You explicitly control freshness with TTLs and refresh cadence, and you keep clear correctness boundaries by treating Azure Managed Redis as a serving layer rather than the authoritative system of record, with durable, governed feature history and labels stored in Delta tables and Azure data stores such as Azure Cosmos DB.

Design choices that matter

Cost efficiency and availability start with clear separation of concerns. Serving hot features from Azure Managed Redis avoids sizing analytics infrastructure for high-QPS, low-latency SLAs, and enables predictable capacity planning with regional isolation for online services. Azure Databricks remains optimized for correctness, freshness, and re-playable history while the online tier scales independently by request rate and working set size.

Freshness and TTLs should reflect business tolerance for staleness and the meaning of each feature. Short velocity windows need TTLs slightly longer than ingestion gaps, while profiles and reference features can live longer. Adding jitter (for example ±10%) prevents synchronized expirations that create load spikes.

Key design is the control plane for safe evolution and availability. Include explicit schema version prefixes and keep keys stable by entity and window. Publish new versions alongside existing ones, switch readers, and retire old versions to enable zero-downtime rollouts.

Protect the online path from stampedes and unnecessary cost. If a hot key is missing, avoid triggering widespread re-computation in downstream systems. Use a short single-flight mechanism and conservative defaults, especially for risk-sensitive decisions.

Keep payloads compact so performance and cost remain predictable. Online feature reads are fastest when values are small and fetched in one or two round trips.
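The key-design, TTL, and fallback guidance above can be sketched in a few small helpers. This is a minimal illustration under stated assumptions: the colon-separated key layout, the function names, and the default values are hypothetical, and only the ±10% jitter figure comes from the text. No Redis client is imported so the helpers stay runnable on their own.

```python
import random


def feature_key(schema_version: str, entity: str, entity_id: str, window: str) -> str:
    """Build a stable key: schema version prefix first, then entity and window."""
    return f"{schema_version}:{entity}:{entity_id}:{window}"


def ttl_with_jitter(base_ttl_s: int, jitter: float = 0.10, rng=random) -> int:
    """Return a TTL within +/- jitter of the base, so keys written together
    do not all expire at the same moment."""
    low = base_ttl_s * (1.0 - jitter)
    high = base_ttl_s * (1.0 + jitter)
    return max(1, round(rng.uniform(low, high)))


def fill_defaults(values, defaults):
    """Replace cache misses (None) with conservative defaults, keeping key order."""
    return [d if v is None else v for v, d in zip(values, defaults)]


if __name__ == "__main__":
    key = feature_key("v3", "card", "123", "5m")
    ttl = ttl_with_jitter(600)
    # With a client such as redis-py, the write side would be roughly
    # client.set(key, value, ex=ttl) and the read side client.mget(keys).
    print(key, ttl)
    print(fill_defaults([None, "2"], ["0", "9"]))
```

Because the schema version is the leading key segment, a new version can be published under a different prefix while existing readers continue to use the old keys until they are switched over.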
Favor numeric encodings and small blobs, and use atomic writes to avoid partial or inconsistent reads during scoring.

Reference architecture notes (regional first, then global)

Start with a single-region deployment to validate end-to-end freshness and latency. Co-locate the Payment API compute, Azure Managed Redis, the model endpoint, and the primary data sources for feature pipelines to minimize round trips. Once the pattern is proven, extend to multi-region by deploying the online tier and its local speed layer per region, while keeping a clear strategy for how features are published and reconciled across regions (often via regional pipelines that consume the same event stream or replicated event hubs).

Operations and SRE considerations

Layer | What to Monitor | Why It Matters | Typical Signals / Metrics
Online service (API / scoring) | End-to-end request latency, error rate, fallback rate | Confirms the critical path meets application SLAs even under partial degradation | p50/p95/p99 latency, error %, step-up or conservative decision rate
Azure Managed Redis (speed layer) | Feature fetch latency, hit/miss ratio, memory pressure | Indicates whether the working set fits and whether TTLs align with access patterns | GET/MGET latency, miss %, evictions, memory usage
Model serving | Inference latency, throughput, saturation | Separates model execution cost from feature access cost | Inference p95 latency, QPS, concurrency utilization
Azure Databricks feature pipelines | Streaming lag, job health, data freshness | Ensures features are being refreshed on time and correctness is preserved | Event lag, job failures, watermark delay
Cross-layer boundaries | Correlation between misses, latency spikes, and pipeline lag | Helps identify whether regressions originate in serving, pipelines, or models | Redis miss spikes vs pipeline delays vs API latency

Monitor each layer independently, then correlate at the boundaries.
This makes it clear whether an SLA issue is caused by online serving pressure, model inference, or delayed feature publication, without turning the lakehouse into a synchronous dependency.

Putting it all together

Adopt the pattern incrementally. First, publish a small, high-value feature set from Azure Databricks into Azure Managed Redis and wire the online service to fetch those features during scoring. Measure end-to-end impact on latency, model quality, and operational stability. Next, extend to streaming refresh for near-real-time behavioral features, and add controlled fallbacks for partial misses. Finally, scale out to multi-region if needed, keeping each region’s online service close to its local speed layer and ensuring the feature pipelines provide consistent semantics across regions.

Sources and further reading

• Azure Databricks documentation: https://learn.microsoft.com/en-us/azure/databricks/
• Azure Managed Redis documentation (overview and architecture): https://learn.microsoft.com/azure/redis/
• Azure Architecture Center: Stream processing with Azure Databricks: https://learn.microsoft.com/azure/architecture/reference-architectures/data/stream-processing-databricks
• Databricks Feature Store / feature engineering docs (Azure Databricks): https://learn.microsoft.com/azure/databricks/

Announcing the New Home for the Azure Databricks Blog
We’re excited to share that the Azure Databricks blog has moved to a new address on Microsoft Tech Community Hub!

Azure Databricks | Microsoft Community Hub

Our new blog home is designed to make it easier than ever for you to discover the latest product updates, deep technical insights, and real-world best practices directly from the Azure Databricks product team. Whether you're a data engineer, data scientist, or analytics leader, this is your go-to destination for staying informed and inspired.

What You’ll Find on the New Blog

At our new address, you can expect:

• Latest Announcements – Stay up to date with new features, capabilities, and releases
• Best Practice Guidance – Learn proven approaches for building scalable data and AI solutions
• Technical Deep Dives – Explore detailed walkthroughs and architecture insights
• Customer Stories – See how organizations are driving impact with Azure Databricks

Why the Move?

This new blog gives us the flexibility to deliver a better reading experience, improved navigation, and richer content dedicated to Azure Databricks. It also allows us to bring you more frequent updates and more in-depth resources tailored to your needs.

Stay Connected

We encourage you to bookmark the new blog and check back regularly. Even better, follow along so you never miss an update. By staying connected, you’ll be among the first to hear about new features, performance improvements, and expert recommendations to help you get the most out of Azure Databricks.

👉 Follow the new Azure Databricks blog today and stay ahead with the latest announcements and best practices.

We’re looking forward to continuing this journey with you, now at our new home!
Check out the latest blogs if you haven’t already:
• Introducing Lakeflow Connect Free Tier, now available on Azure Databricks | Microsoft Community Hub
• Near–Real-Time CDC to Delta Lake for BI and ML with Lakeflow on Azure Databricks | Microsoft Community Hub

Azure Databricks & Fabric Disaster Recovery: The Better Together Story
Authors: Amudha Palani amudhapalani, Oscar Alvarado oscaralvarado, Eric Kwashie ekwashie, Peter Lo PeterLo and Rafia Aqil Rafia_Aqil

Disaster recovery (DR) is a critical component of any cloud-native data analytics platform, ensuring business continuity even during rare regional outages caused by natural disasters, infrastructure failures, or other disruptions.

Identify Business Critical Workloads

Before designing any disaster recovery strategy, organizations must first identify which workloads are truly business-critical and require regional redundancy. Not all Databricks or Fabric processes need full DR protection; instead, customers should evaluate the operational impact of downtime, data freshness requirements, regulatory obligations, SLAs, and dependencies across upstream and downstream systems. By classifying workloads into tiers and aligning DR investments accordingly, customers ensure they protect what matters most without over-engineering the platform.

Azure Databricks

Azure Databricks requires a customer-driven approach to disaster recovery, where organizations are responsible for replicating workspaces, data, infrastructure components, and security configurations across regions.

Full System Failover (Active-Passive) Strategy

A comprehensive approach that replicates all dependent services to the secondary region.
Implementation requirements include:

Infrastructure Components:
• Replicate Azure services (ADLS, Key Vault, SQL databases) using Terraform
• Deploy network infrastructure (subnets) in the secondary region
• Establish data synchronization mechanisms

Data Replication Strategy:
• Use Deep Clone for Delta tables rather than geo-redundant storage
• Implement periodic synchronization jobs using Delta's incremental replication
• Measure data transfer results using time travel syntax

Workspace Asset Synchronization:
• Co-deploy cluster configurations, notebooks, jobs, and permissions using CI/CD
• Utilize Terraform and SCIM for identity and access management
• Keep job concurrencies at zero in the secondary region to prevent execution

Fully Redundant (Active-Active) Strategy

The most sophisticated approach, where all transactions are processed in multiple regions simultaneously. While providing maximum resilience, this strategy:
• Requires complex data synchronization between regions
• Incurs the highest operational costs due to duplicate processing
• Is typically needed only for mission-critical workloads with zero tolerance for downtime
• Can be implemented as partial active-active, processing most of the workload in the primary region with a subset in the secondary

Enabling Disaster Recovery

• Create a secondary workspace in a paired region.
• Use CI/CD to keep workspace assets synchronized continuously.

Requirement | Approach | Tools
Cluster Configurations | Co-deploy to both regions as code | Terraform
Code (Notebooks, Libraries, SQL) | Co-deploy with CI/CD pipelines | Git, Azure DevOps, GitHub Actions
Jobs | Co-deploy with CI/CD, set concurrency to zero in secondary | Databricks Asset Bundles, Terraform
Permissions (Users, Groups, ACLs) | Use IdP/SCIM and infrastructure as code | Terraform, SCIM
Secrets | Co-deploy using secret management | Terraform, Azure Key Vault
Table Metadata | Co-deploy with CI/CD workflows | Git, Terraform
Cloud Services (ADLS, Network) | Co-deploy infrastructure | Terraform

• Update your orchestrator (ADF, Fabric pipelines, etc.) to include a simple region toggle to reroute job execution.
• Replicate all dependent services (Key Vault, Storage accounts, SQL DB).
• Implement Delta “Deep Clone” synchronization jobs to keep datasets continuously aligned between regions.
• Introduce an application-level “Sync Tool” that redirects data ingestion and compute execution.
• Enable parallel processing in both regions for selected or all workloads.
• Use bi-directional synchronization for Delta data to maintain consistency across regions.
• For performance and cost control, run most workloads in the primary region and only a subset of workloads in the secondary to keep it warm.

Implement Three-Pillar DR Design

Primary Workspace: Your production Databricks environment running normal operations.
Secondary Workspace: A standby Databricks workspace in a different (paired) Azure region that remains ready to take over if the primary fails.

This architecture ensures business continuity while optimizing costs by keeping the secondary workspace dormant until needed. The DR solution is built on three fundamental pillars that work together to provide comprehensive protection:

1. Infrastructure Provisioning (Terraform)

The infrastructure layer creates and manages all Azure resources required for disaster recovery using Infrastructure as Code (Terraform).
What It Creates:
• Secondary Resource Group: A dedicated resource group in your paired DR region (e.g., if primary is in East US, secondary might be in West US 2)
• Secondary Databricks Workspace: A standby Databricks workspace with the same SKU as your primary, ready to receive failover traffic
• DR Storage Account: An ADLS Gen2 storage account that serves as the backup destination for your critical data
• Monitoring Infrastructure: Azure Monitor Log Analytics workspace and alert action groups to track DR health
• Protection Locks: Management locks to prevent accidental deletion of critical DR resources

Key Design Principle: The Terraform configuration references your existing primary workspace without modifying it. It only creates new resources in the secondary region, ensuring your production environment remains untouched during setup.

2. Data Synchronization (Delta Notebooks)

The data synchronization layer ensures your critical data is continuously backed up to the secondary region.

How It Works: The solution uses a Databricks notebook that runs in your primary workspace on a scheduled basis. This notebook:
• Connects to Backup Storage: Uses Unity Catalog with Azure Managed Identity for secure, credential-free authentication to the secondary storage account
• Identifies Critical Tables: Reads from a configuration list you define (sales data, customer data, inventory, financial transactions, etc.)
• Performs Deep Clone: Uses Delta Lake's native CLONE functionality to create exact copies of your tables in the backup storage
• Tracks Sync Status: Logs each synchronization operation, tracks row counts, and reports on data freshness

Authentication Flow: The synchronization process leverages Unity Catalog's managed identity capabilities:
• An existing Access Connector for Unity Catalog is granted "Storage Blob Data Contributor" permissions on the backup storage.
• Storage credentials are created in Databricks that reference this Access Connector.
• The notebook uses these credentials transparently; no storage keys or secrets are required.

What Gets Synced: You define which tables are critical to your business operations. The notebook creates backup copies including:
• Full table data and schema
• Table partitioning structure
• Delta transaction logs for point-in-time recovery

3. Failover Automation (Python Scripts)

The failover automation layer orchestrates the switch from primary to secondary workspace when disaster strikes.

Microsoft Fabric

Microsoft Fabric provides built-in disaster recovery capabilities designed to keep analytics and Power BI experiences available during regional outages. Fabric simplifies continuity for reporting workloads, while still requiring customer planning for deeper data and workload replication.

Power BI Business Continuity

Power BI, now integrated into Fabric, provides automatic disaster recovery as a default offering:
• No opt-in required: DR capabilities are automatically included.
• Azure storage geo-redundant replication: Ensures backup instances exist in other regions.
• Read-only access during disasters: Semantic models, reports, and dashboards remain accessible.
• Always supported: BCDR for Power BI remains active regardless of the OneLake DR setting.

Microsoft Fabric

Fabric's cross-region DR uses a shared responsibility model between Microsoft and customers:

Microsoft's Responsibilities:
• Ensure baseline infrastructure and platform services availability.
• Maintain Azure regional pairings for geo-redundancy.
• Provide DR capabilities for Power BI as default.
Customer Responsibilities:
• Enable disaster recovery settings for capacities
• Set up secondary capacity and workspaces in paired regions
• Replicate data and configurations

Enabling Disaster Recovery

Organizations can enable BCDR through the Admin portal under Capacity settings:
• Navigate to Admin portal → Capacity settings
• Select the appropriate Fabric Capacity
• Access Disaster Recovery configuration
• Enable the disaster recovery toggle

Critical Timing Considerations:
• 30-day minimum activation period: Once enabled, the setting remains active for at least 30 days and cannot be reverted.
• 72-hour activation window: Initial enablement can take up to 72 hours to become fully effective.

Azure Databricks & Microsoft Fabric DR Considerations

Building a resilient analytics platform requires understanding how disaster recovery responsibilities differ between Azure Databricks and Microsoft Fabric. While both platforms operate within Azure’s regional architecture, their DR models, failover behaviors, and customer responsibilities are fundamentally different.

Recovery Procedures

Procedure | Databricks | Fabric
Failover | Stop workloads, update routing, resume in secondary region. | Microsoft initiates failover; customers restore services in DR capacity.
Restore to Primary | Stop secondary workloads, replicate data/code back, test, resume production. | Recreate workspaces and items in new capacity; restore Lakehouse and Warehouse data.
Asset Syncing | Use CI/CD and Terraform to sync clusters, jobs, notebooks, permissions. | Use Git integration and pipelines to sync notebooks and pipelines; manually restore Lakehouses.

Business Considerations

Consideration | Databricks | Fabric
Control | Customers manage DR strategy, failover timing, and asset replication. | Microsoft manages failover; customers restore services post-failover.
Regional Dependencies | Must ensure secondary region has sufficient capacity and services. | DR only available in Azure regions with Fabric support and paired regions.
Power BI Continuity | Not applicable. | Power BI offers built-in BCDR with read-only access to semantic models and reports.
Activation Timeline | Immediate upon configuration. | DR setting takes up to 72 hours to activate; 30-day wait before changes allowed.

End-to-End Observability for Azure Databricks: From Infrastructure to Internal Application Logging
Authors: Peter Lo PeterLo, Amudha Palani amudhapalani, Geoffrey Rathinapandi geofegeo and Rafia Aqil Rafia_Aqil

Observability in Azure Databricks is the ability to continuously monitor and troubleshoot the health, performance, and usage of data workloads by capturing metrics, logs, and traces. In a structured observability approach, we consider two broad categories of logging: Internal Databricks Logging (within the Databricks environment) and External Databricks Logging (leveraging Azure services). Each plays a distinct role in providing insights. By combining internal and external observability mechanisms, organizations can achieve a comprehensive view: internal logs enable detailed analysis of Spark jobs and data quality, while external logs ensure global visibility, auditing, and integration with broader monitoring dashboards and alerting systems.

The article is organized into two main sections:
• Infrastructure Logging for Azure Databricks (external observability)
• Internal Databricks Logging (in-platform observability)

Considerations

Addressing key questions upfront ensures your observability strategy is tailored to your organization’s unique workloads, risk profile, and operational needs. By proactively evaluating what to monitor, where to store logs, and who needs access, you can avoid blind spots, streamline incident response, and align monitoring investments with business priorities.

What types of workloads are running in Databricks?
Why it matters: Different workloads (e.g., batch ETL, streaming pipelines, ML training, interactive notebooks) have distinct performance profiles and failure modes.
Business impact: Understanding workload types helps prioritize monitoring for mission-critical processes like real-time fraud detection or daily financial reporting.

What failure scenarios need to be monitored?
Examples: Job failures, cluster provisioning errors, quota limits, authentication issues.
Business impact: Early detection of failures reduces downtime, improves SLA adherence, and prevents data pipeline disruptions that could affect reporting or customer-facing analytics.

Where should logs be stored and analyzed?
Options: Centralized Log Analytics workspace, Azure Storage for archival, Event Hub for streaming analysis.
Business impact: Centralized logging enables unified dashboards, cross-team visibility, and faster incident response across data engineering, operations, and compliance teams.

Who needs access to logs and alerts?
Stakeholders: Data engineers, platform administrators, security analysts, compliance officers.
Business impact: Role-based access ensures that the right teams can act on insights while maintaining data governance and privacy controls.

Infrastructure Logging for Azure Databricks

Approach 1: Diagnostic Settings for Azure Databricks

Diagnostic settings in Azure Monitor allow you to capture detailed logs and metrics from your Azure Databricks workspace, supporting operational monitoring, troubleshooting, and compliance. By configuring diagnostic settings at the workspace level, administrators can route Databricks logs, including cluster events, job statuses, and audit logs, to destinations such as Log Analytics, Azure Storage, or Event Hub. This enables unified analysis, alerting, and long-term retention of critical operational data.

Configuration Overview

Enable Diagnostic Settings on the Databricks workspace to route logs to a Log Analytics workspace. These logs can also be combined with the other logs described below for full Azure Databricks observability. See the guide to the Azure Databricks diagnostic settings log reference: Configure diagnostic log delivery.

Implement a tagging strategy so that your organization gains granular visibility into resource consumption and can align spending with business priorities.
Default tags: Automatically applied by Databricks to cloud-deployed resources.
Custom tags: User-defined tags that you can add to compute resources and serverless workloads.

Use Cases
• Operational Monitoring: Detect job or resource bottlenecks.
• Security & Compliance: Audit user actions and enforce governance policies.
• Incident Response: Correlate Databricks logs with infrastructure events for faster troubleshooting.

Best Practices
• Enable only relevant log categories to optimize cost and performance.
• Use role-based access control (RBAC) to secure access to logs.

Approach 2: Azure Databricks Compute Log Delivery

Compute log delivery in Azure Databricks enables you to automatically collect and archive logs from Spark driver nodes, worker nodes, and cluster events for both all-purpose and job compute resources. When you create a cluster, you can specify a log delivery location (such as DBFS, Azure Storage, or a Unity Catalog volume) where logs are delivered every five minutes and archived hourly. All logs generated up until the compute resource is terminated are guaranteed to be delivered, supporting troubleshooting, auditing, and compliance.

Configure: To configure the log delivery location:
• On the compute page, click the Advanced toggle.
• Click the Logging tab.
• Select a destination type: DBFS or Volumes (Preview).
• Enter the Log path.

To store the logs, Databricks creates a subfolder in your chosen log path named after the compute's cluster_id.

Approach 3: Azure Activity Logs

Whenever you create, update, or delete Databricks resources (such as provisioning a new workspace, scaling a cluster, or modifying network settings), these actions are captured in the Activity Log. This enables teams to track who made changes, when, and what impact those changes had on the environment. Each event in the Activity Log has a particular category that is described in the following document: Azure Activity Log event schema.
For Databricks, this is especially valuable for:
• Auditing resource deployments and configuration changes
• Investigating failed provisioning or quota errors
• Monitoring compliance with organizational policies
• Responding to incidents or unauthorized actions

Use Cases
• Auditing infrastructure-level changes outside the Databricks workspace.
• Monitoring provisioning delays or resource availability.

Best Practices
• Use Activity Logs in conjunction with other logs for full-stack visibility.
• Set up alerts for critical infrastructure events.
• Review logs regularly to ensure compliance and operational health.

Approach 4: Azure Monitor VM Insights

Azure Databricks cluster nodes run on Azure virtual machines (VMs), and their infrastructure-level performance can be monitored using Azure Monitor VM Insights (formerly OMS). This approach provides visibility into resource utilization across individual cluster VMs, helping identify bottlenecks that may affect Spark job performance or overall workload efficiency.

Configuration Overview: To enable VM performance monitoring, enable VM Insights on the VMs that back the Databricks cluster.

Monitored Metrics: Once enabled, VM Insights collects CPU usage, memory consumption, disk I/O, network throughput, and process-level statistics. These metrics help assess whether Spark workloads are constrained by infrastructure limits, such as insufficient memory or high disk latency.

Considerations
• This is a standard Azure VM monitoring technique and is not specific to Databricks.
• Use role-based access control (RBAC) to secure access to performance data.

Approach 5: Virtual Network Flow Logs

For Azure Databricks workspaces deployed in a custom Azure Virtual Network (VNet-injected mode), enabling Virtual Network Flow Logs provides deep visibility into IP traffic flowing through the virtual network. These logs help you monitor and optimize resources, and they can help large enterprises detect intrusion.
Review common use cases here: VNet flow logs common use cases. See how logging works here: Key properties of virtual network flow logs. Follow these steps to set up VNet flow logs: Create a flow log.

Configuration Overview

Virtual Network Flow Logs are a feature of Azure Network Watcher. Optionally, logs can be analyzed using Traffic Analytics for deeper insights. These logs help identify:
• Unexpected or unauthorized traffic
• Bandwidth usage patterns
• Effectiveness of NSG rules and network segmentation

Considerations
• Virtual network flow logging is only available for VNet-injected deployment modes.
• Ensure Network Watcher is enabled in the region where the Databricks workspace is deployed.
• Use Traffic Analytics to visualize trends and detect anomalies in network flows.

Approach 6: Spark Monitoring Logging & Metrics

The spark-monitoring library is a Python toolkit designed to interact with the Spark History Server REST API. Its main purpose is to help users programmatically access, analyze, and visualize Spark application metrics and job details after execution. Here’s what it offers:
• Application Listing: Retrieve a list of all Spark applications available on the History Server, including metadata such as application ID, name, start/end time, and status.
• Job and Stage Details: Access detailed information about jobs and stages within each application, including execution times, status, and resource usage.
• Task Metrics: Extract metrics for individual tasks, such as duration, input/output size, and shuffle statistics, supporting performance analysis and bottleneck identification.

Considerations
• The Spark Monitoring Library must be installed; see the Git repository here.
• Metrics can be exported to external observability platforms for long-term retention and alerting.
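As a rough illustration of the application listing and job details described above, the sketch below talks to the Spark History Server REST API directly with the Python standard library; it does not use the spark-monitoring library itself. The base URL is a placeholder, the helper names are hypothetical, and the sample record in the usage note is illustrative rather than real output. It assumes a History Server that is reachable from where the code runs.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "/api/v1"  # REST root exposed by the Spark History Server


def applications_url(base_url: str) -> str:
    """URL that lists the applications known to the History Server."""
    return f"{base_url.rstrip('/')}{API_ROOT}/applications"


def jobs_url(base_url: str, app_id: str) -> str:
    """URL that lists the jobs of one application."""
    return f"{applications_url(base_url)}/{quote(app_id, safe='')}/jobs"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the JSON body (requires network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_application(app: dict) -> dict:
    """Reduce one application record to a few reporting fields.

    Durations in the REST payload are in milliseconds; they are summed
    across attempts and converted to seconds.
    """
    attempts = app.get("attempts", [])
    return {
        "id": app.get("id"),
        "name": app.get("name"),
        "attempts": len(attempts),
        "completed": any(a.get("completed", False) for a in attempts),
        "duration_s": sum(a.get("duration", 0) for a in attempts) / 1000.0,
    }
```

A typical use would be `apps = fetch_json(applications_url("http://<history-server>:18080"))` followed by `[summarize_application(a) for a in apps]`, with the host replaced by your own endpoint. Keeping the URL builders separate from the network call makes the summarizing logic easy to test offline.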
Use cases
• Automated reporting of Spark job performance and resource usage
• Batch analysis of completed Spark applications
• Integration of Spark metrics into external dashboards or monitoring systems
• Post-execution troubleshooting and optimization

Internal Databricks Logging

Approach 7: Databricks System Tables (Unity Catalog)

Databricks System Tables are a recent addition to Azure Databricks observability, offering structured, SQL-accessible insights into workspace usage, performance, and cost. These tables reside in the Unity Catalog and are organized into schemas such as system.billing, system.lakeflow, and system.compute. You can enable System Tables through these steps: Enable system tables - Databricks.

Overview and Capabilities

When enabled by an administrator, system tables allow users to query operational metadata directly using SQL. Examples include:
• system.billing.usage: Tracks compute usage (CPU core-hours, memory) per job.
• system.compute.clusters: Captures cluster lifecycle events.
• system.lakeflow.job_run: Provides job execution details.

Use Cases
• Cost Monitoring: Aggregate usage records to identify high-cost jobs or users. Import pre-built usage dashboards to your workspaces to monitor account- and workspace-level usage: Usage dashboards and Create and monitor budgets.
• Operational Efficiency: Track job durations, cluster concurrency, and resource utilization.
• In-Platform BI: Build dashboards in Databricks SQL to visualize usage trends without relying on external billing tools.

Best Practices
• Schedule regular queries to track cost trends, job performance, and resource usage.
• Apply role-based access control to restrict sensitive usage data.
• Integrate system table insights into Databricks SQL dashboards for real-time visibility.

Approach 8: Data Quality Monitoring

Data Quality Monitoring is a native Azure Databricks feature designed to track data quality and machine learning model performance over time.
It enables automated monitoring of Delta tables and ML inference outputs, helping teams detect anomalies, data drift, and reliability issues directly within the Databricks environment. Follow these steps to enable Data Quality Monitoring.

Data Quality Monitoring supports three profile types:
• Time Series: Monitors time-partitioned data, computing metrics per time window.
• Inference: Tracks prediction drift and anomalies in model request/response logs.
• Snapshot: Performs full-table scans to compute metrics across the entire dataset.

In step 5 of enabling Lakehouse Monitoring, you can also enable data profiling to view Data Profiling Dashboards.

Use Cases
• Data Quality Monitoring: Track null values, column distributions, and schema changes.
• Model Performance Monitoring: Detect concept drift, prediction anomalies, and accuracy degradation.
• Operational Reliability: Ensure consistent data pipelines and ML inference behavior.

Approach 9: Databricks SQL Dashboards and Alerts

Databricks SQL Dashboards and Alerts provide in-platform observability for operational monitoring, enabling teams to visualize metrics and receive notifications based on SQL query results. This approach complements infrastructure-level monitoring by focusing on application-level conditions, data correctness, and workflow health.

Users can build dashboards using Databricks SQL or SQL Warehouses by querying:
• System tables (e.g., job runs, billing usage)
• Data Quality Monitoring metric tables
• Custom operational datasets

You can create alerts through these steps: Databricks SQL alerts.

Alerting Features: Databricks SQL supports alerting on query results, allowing users to define conditions that trigger notifications via:
• Email
• Slack (via webhook integration)
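To make the query-driven alerting model above more concrete, here is a minimal sketch of a helper that builds a failure-count query and evaluates a threshold. The table name `system.lakeflow.job_run_timeline` and the columns `result_state` and `period_end_time` are assumptions about the system tables schema, and the set of failure states and both function names are hypothetical; verify them against your workspace before relying on the query.

```python
# Assumption: these result states are the ones you treat as failures.
FAILURE_STATES = ("ERROR", "FAILED", "TIMED_OUT")


def build_failure_alert_query(
    lookback_hours: int = 1,
    # Assumption: adjust to the job run table available in your workspace.
    table: str = "system.lakeflow.job_run_timeline",
) -> str:
    """Return SQL that counts failed job runs in the lookback window."""
    if lookback_hours <= 0:
        raise ValueError("lookback_hours must be positive")
    states = ", ".join(f"'{state}'" for state in FAILURE_STATES)
    return (
        "SELECT COUNT(*) AS failed_runs\n"
        f"FROM {table}\n"
        f"WHERE result_state IN ({states})\n"
        f"  AND period_end_time >= current_timestamp() - INTERVAL {lookback_hours} HOUR"
    )


def should_alert(failed_runs: int, threshold: int) -> bool:
    """Mirror of the alert condition: trigger when the count exceeds the threshold."""
    return failed_runs > threshold
```

In practice the generated SQL would be saved as the alert's query and the threshold configured in the alert condition itself; `should_alert` only documents the comparison so it can be unit tested outside the workspace.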
Alerts can be configured for scenarios such as:
• Job failure counts exceeding thresholds
• Row count drops in critical tables
• Cost/Workload spikes or resource usage anomalies

Considerations
• Alerts are query-driven and run on a schedule; ensure queries are optimized for performance.
• Dashboards and alerts are workspace-specific and require appropriate permissions.

Best Practices
• Use system tables and Data Quality Monitoring metrics as data sources for dashboards.
• Schedule alerts to run at appropriate intervals (e.g., hourly for job failures).
• Combine internal alerts with external monitoring for full-stack coverage.

Approach 10: Custom Tags for Workspace-Level Assets

Custom tags allow organizations to classify and organize Databricks resources (clusters, jobs, pools, notebooks) for better governance, cost tracking, and observability. Tags are key-value pairs applied at the resource level and can be propagated to Azure for billing and monitoring.

Why Use Custom Tags?
• Cost Attribution: Assign tags like Environment=Prod, Project=HealthcareAnalytics to track costs in Azure Cost Management.
• Governance: Enforce policies based on tags (e.g., restrict high-cost clusters to Environment=Dev).
• Observability: Filter logs and metrics by tags for dashboards and alerts.

Taggable Assets
• Clusters: Apply tags during cluster creation via the Databricks UI or REST API.
• Jobs: Include tags in job configurations for workload-level tracking.
• Instance Pools: Tag pools to manage shared compute resources.
• Notebooks & Workflows: Use tags in metadata for classification and reporting.

Best Practices
• Define a standard tag taxonomy (e.g., Environment, Owner, CostCenter, Compliance).
• Validate tags regularly to ensure consistency across workspaces.
• Use tags in Log Analytics queries for cost and performance dashboards.
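The tag taxonomy and regular validation recommended above could be enforced with a small check like the following sketch. The required keys come from the example taxonomy in the text; the allowed Environment values and the function name are hypothetical and should be replaced with your organization's own standard.

```python
from typing import Dict, List

# Keys taken from the example taxonomy in the text.
REQUIRED_TAG_KEYS = ("Environment", "Owner", "CostCenter", "Compliance")
# Assumption: illustrative set of permitted Environment values.
ALLOWED_ENVIRONMENTS = {"Dev", "Test", "Prod"}


def validate_tags(tags: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a resource's custom tags.

    An empty list means the tags satisfy the taxonomy.
    """
    problems = []
    for key in REQUIRED_TAG_KEYS:
        # A key that is absent or blank counts as missing.
        if not tags.get(key, "").strip():
            problems.append(f"missing or empty tag: {key}")
    env = tags.get("Environment", "").strip()
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unexpected Environment value: {env}")
    return problems
```

Running such a check in a CI/CD pipeline before clusters or jobs are deployed catches inconsistent tags early, which keeps cost and performance dashboards that filter on those tags reliable.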