New Microsoft Certified: Azure Databricks Data Engineer Associate Certification
As a data engineer, you understand that AI performance depends directly on the quality of its data. If the data isn’t clean, well-managed, and accessible at scale, even the most sophisticated AI models won’t perform as expected.

Introducing the Microsoft Certified: Azure Databricks Data Engineer Associate Certification, designed to prove that you have the skills required to build and operate reliable data systems by using Azure Databricks. To earn the Certification, you need to pass Exam DP-750: Implementing Data Engineering Solutions Using Azure Databricks, currently in beta.

Is this Certification right for you?

This Certification offers you the opportunity to prove your skills and validate your expertise in the following areas:

Core technical skills
- Ingesting, transforming, and modeling data using SQL and Python
- Building production data pipelines on Azure Databricks
- Implementing software development lifecycle (SDLC) practices with Git-based workflows
- Integrating Azure Databricks with key Microsoft services, such as Azure Storage, Azure Data Factory, Azure Monitor, Azure Key Vault, and Microsoft Entra ID

Governance and security
- Securing and governing data with Unity Catalog and Microsoft Purview
- Applying workspace, cluster, and data-level security best practices

Performance and reliability
- Optimizing compute, caching, partitioning, and Delta Lake design patterns
- Troubleshooting and resolving issues with jobs and pipelines
- Managing workloads across development, staging, and production

For engineers already familiar with Azure Databricks, this Certification bridges the gap between general Azure Databricks skills and the Azure‑specific architecture, security, and operational patterns that employers increasingly expect.

Ready to prove your skills? The first 300 candidates can save 80%

Take advantage of the discounted beta exam offer. The first 300 people who take Exam DP-750 (beta) on or before April 2, 2026, can get 80% off.

To receive the discount, when you register for the exam and are prompted for payment, use code DP750Deltona. This is not a private access code. The seats are offered on a first-come, first-served basis. As noted, you must take the exam on or before April 2, 2026. Please note that this discount is not available in Turkey, Pakistan, India, or China.

How to prepare

Get ready to take Exam DP-750 (beta):
- Review the Exam DP-750 (beta) exam page for details.
- The Exam DP-750 study guide explores key topics covered in the exam.
- Work through the Plan on Microsoft Learn: Get Exam‑Ready for DP‑750: Azure Databricks Data Engineer Associate Certification.
- Need other preparation ideas? Check out Just How Does One Prepare for Beta Exams?
- You can take Certification exams online, from your home or office. Learn what to expect in Online proctored exams: What to expect and how to prepare.

Interested in unlocking more Azure Databricks expertise? Grow your skills and take the next step by exploring Databricks credentials and show what you can do with Azure Databricks.

Ready to get started?

Remember, only the first 300 candidates can get 80% off Exam DP-750 (beta) with code DP750Deltona on or before April 2, 2026. Beta exam rescoring begins when the exam goes live, with final results released approximately 10 days later. For more details, read Creating high-quality exams: The path from beta to live. Stay tuned for general availability of this Certification in early May 2026.

Get involved: Help shape future Microsoft Credentials

Join our Microsoft Worldwide Learning SME Group for Credentials on LinkedIn for beta exam alerts and opportunities to help shape future Microsoft learning and assessments.

Additional information

For more cloud and AI Certification updates, read our recent blog post, The AI job boom is here. Are you ready to showcase your skills?
Explore Microsoft Credentials on AI Skills Navigator.

Secure Medallion Architecture Pattern on Azure Databricks (Part II)
Disclaimer: The views in this article are my own and do not represent Microsoft or Databricks.

This article is part of a series focused on deploying a secure Medallion Architecture. The series follows a top-down approach, beginning with a high-level architectural perspective and gradually drilling down into implementation details using repeatable code. In this part we will discuss the implementation of the pattern using GitHub Copilot.

If you missed it, please read the first part of this blog series first: Secure Medallion Architecture Pattern on Azure Databricks (Part I).

I waited a while before publishing this article. Partly due to other priorities, but also because I wanted to experiment with deploying infrastructure and data pipelines using agents. At that point, I was looking to leverage agents with a spec-driven approach, and through using GitHub Copilot, I learned what skills are and how I can use them to achieve my goal.

In this blog I'll share what I learned using GitHub Copilot for spec-driven development. I'll use the content from my previous article, Secure Medallion Architecture Pattern on Azure Databricks (Part I), as a technical specification to extract implementation details and generate two outputs:

- Terraform code for infrastructure, platform configuration, and deployment
- Databricks Declarative Automation Bundles for jobs, pipelines, and other deployment-ready workload resources

I've tried not to overfit the prompts within the skills I've developed, so they remain portable to other technical articles, not just the one mentioned in this blog.

Separate the platform from the workload

When I started the design, I decided to modularise the automation scripts by separating the platform from the actual data platform workloads.
I assigned networking, storage, identities, secret scopes, and workspace configuration to Terraform, while Databricks notebook runs, job clusters, pipelines, and environment-specific deployments were developed within Databricks Declarative Automation Bundles (formerly known as Databricks Asset Bundles).

That may sound obvious, but it's exactly where generated code often goes wrong. Without explicit instructions, AI tools tend to blur these boundaries and produce one oversized block of configuration. That's why my Copilot skill needs to enforce a clear contract:

- Infer the architecture from the article
- Identify what is explicit and what is assumed
- Emit Terraform only for infrastructure concerns
- Emit bundle files only for workload concerns
- Leave placeholders for anything the article does not specify

That last point is critical. A blog post or low-level technical specification is not a source of truth for account IDs, hostnames, catalog names, secret values, or subnet IDs. Good automation should never fabricate those values. Instead, I decided to produce a starter implementation with TODO markers wherever environment-specific values are required.

Skills are a great way to get more consistent, repeatable output across runs, so I decided to use them for this project. I could have used one of the tools listed in the table below, but I chose to go my own way and develop a Spec-Driven Development (SDD) framework, which I hope will carry on improving with time.

| Tool | Creator | Type | Link | Description |
| --- | --- | --- | --- | --- |
| GitHub Spec Kit | GitHub | Open source | github/spec-kit | Turns feature ideas into specs, plans, and task lists before any code is written. Works with multiple AI coding agents. Specification first, code as generated output. |
| BMAD Method | BMad Code LLC | Open source | bmad-code-org/BMAD-METHOD | An AI-driven agile framework with specialised agents covering the full lifecycle from ideation to deployment. Scale-adaptive: adjusts planning depth from a bug fix to an enterprise system. |
| OpenSpec | Fission AI | Open source | Fission-AI/OpenSpec | Lightweight spec layer that sits above your existing AI tools. Each change gets a proposal, specs, design, and task list. No rigid phase gates, no IDE lock-in. |

What are skills, and why are they a good fit?

Skills are essentially reusable prompt modules that aim to make LLMs produce repeatable answers. Within a skill, I define the behavior and then attach supporting resources or scripts so Copilot can perform the task consistently. That means a skill can do more than just "write some code." A skill can define a repeatable workflow like this:

- Fetch the blog URL
- Extract headings, paragraphs, and code snippets
- Normalize the article into a lightweight implementation spec
- Decide what belongs in Terraform
- Decide what belongs in the Databricks bundle
- Generate files in a predictable project structure
- Produce a TODO.md file for unresolved values

This approach turns Copilot from a generic assistant into a specialized code-conversion tool. However, there are some constraints I had to be mindful of when developing skills:

- Context window limits. The model has limited space to read instructions, process input, and generate output. Long prompts can cause files to be cut off or steps to be skipped.
- Non-determinism. Output may vary between runs, even with strict instructions. I always lint, validate, and review the diff before committing.
- Boundary leakage. Models may invent plausible but incorrect values. The TODO.md pattern must be enforced as a rule, not a suggestion.
- Model and tool drift. Copilot's model and tool surface change over time. I use example inputs and outputs as repeatable sanity checks.
- Maintainability. A skill is code-as-prompt and will age with the platforms it targets. I keep skills narrowly scoped so they stay easy to update.

I'll explain the TODO.md file in more detail later in this post.
The GitHub repo

The repository can be found at MarcoScagliola/CopilotBlogToCode.

Below is a script I have added that, when invoked, deletes all the files produced by the skills, so you can test the repo from a clean state.

    python .github/skills/blog-to-databricks-iac/scripts/reset_generated.py --force

If you want to try it out, please clone the repo and run it on your own copy. In GitHub Copilot, I usually keep:

- Model set to Auto
- For Configure Tools, only the built-in tools selected

Below is the prompt that I use to run the skills and have the blog analysed.

    Use the blog-to-databricks-iac skill on this article: https://techcommunity.microsoft.com/blog/analyticsonazure/secure-medallion-architecture-pattern-on-azure-databricks-part-i/4459268
    Inputs:
    workload: blg
    environment: dev
    azure_region: uksouth
    github_environment:

To make this more repeatable and less manual, I've added a prompt file at run-blogToDatabricksIac-selected-tools.prompt.md, which can be run directly from VS Code by opening the file and clicking the run button at the top. Feel free to experiment with it and let me know what you think. Further instructions on how to use the repo are available in READ_FIRST.md.

Below you will find the exact repository setup I used for this workflow, starting with my initial configuration and ending with the final directory structure and files.

1. Create a new GitHub repository and clone it locally

I started by creating a new repository on GitHub, then cloned it to my local machine so I could add the Copilot skill, Terraform scaffolding, and Databricks bundle files in a centralized location.

    git clone https://github.com/YOUR-ORG/blog-to-databricks-iac.git
    cd blog-to-databricks-iac

This approach keeps the workflow organised from the start: the repository exists on GitHub first, and the local clone becomes the working directory for all subsequent setup steps.

2. Create the GitHub skill folder structure (first iteration)

GitHub Copilot skills are file-based and centered on a SKILL.md file inside a skill folder. GitHub's current pattern places these under .github/skills/. I used the script below to create the folder hierarchy for my initial integration.

    mkdir -p .github/skills/blog-to-databricks-iac/scripts
    mkdir -p .github/skills/blog-to-databricks-iac/templates
    mkdir -p infra/terraform
    mkdir -p databricks-bundle/resources
    mkdir -p databricks-bundle/src

This script generates the structure depicted below.

3. Add the main skill definition

Next, I created the SKILL.md file at .github/skills/blog-to-databricks-iac/. This file is the orchestrator: it decides what happens and in what order, while each specialist decides what its own file should contain (for example, the Terraform specialist owns the Terraform, the bundle specialist owns the bundle, and so on). In practice, SKILL.md turns Copilot from a general assistant into a domain-specific generator for this repo. GitHub documents this SKILL.md-based structure as the foundation of agent skills. My first iteration of .github/skills/blog-to-databricks-iac/SKILL.md was very simple and can be found here.

4. Add a script to fetch and normalize the blog article

Next, I created a Python script that the main orchestrator SKILL.md invokes to read the blog article. This script is stored at .github/skills/blog-to-databricks-iac/scripts/ and named fetch_blog.py. Within SKILL.md, the script is invoked as shown below.

### 1. Fetch article
```bash
python .github/skills/blog-to-databricks-iac/scripts/fetch_blog.py "<url>"
```
If fetch fails, stop and return the fetch error output. Do not retry; surface the error to the user and wait for guidance.

The script validates the URL, fetches the HTML with a 30-second timeout, and uses a spoofed Mozilla User-Agent to avoid being blocked by CDNs (Content Delivery Networks).
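As an illustration of that fetch stage, here is a minimal sketch using only the Python standard library. It is not the actual fetch_blog.py: the function names (`validate_url`, `build_request`, `fetch_html`), the exact User-Agent string, and the error handling are assumptions made for this example. Only the 30-second timeout and the browser-like User-Agent come from the description above.

```python
"""Hypothetical sketch of the fetch stage; not the actual fetch_blog.py."""
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# Assumed values: the article states a 30-second timeout and a spoofed
# Mozilla User-Agent; the exact header string here is illustrative.
TIMEOUT_SECONDS = 30
USER_AGENT = "Mozilla/5.0 (compatible; blog-fetch-sketch)"


def validate_url(url: str) -> str:
    """Return the URL unchanged if it is http(s) with a host, else raise."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a valid http(s) URL: {url!r}")
    return url


def build_request(url: str) -> Request:
    """Build a GET request carrying the browser-like User-Agent header."""
    return Request(validate_url(url), headers={"User-Agent": USER_AGENT})


def fetch_html(url: str) -> str:
    """Fetch the page body as text, giving up after TIMEOUT_SECONDS."""
    with urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Keeping validation and request construction separate from the network call means the first two can be checked without touching the network, which suits the "stop on fetch failure" rule in the skill.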
The script reads through the HTML one tag at a time, flagging when it enters relevant sections like paragraphs, headings, or code blocks, and buffering text until the tag closes. Before storing anything, it cleans the text by decoding HTML entities, collapsing whitespace, and trimming edges.

As it parses, the script also scans for cloud platform keywords: AWS, S3, Azure, ADLS, GCP, Google Cloud. The first match wins; if none are found, it returns unknown. This is a quick heuristic, not authoritative.

Finally, it outputs clean JSON with the extracted data: title, headings, paragraphs, code blocks, and cloud hint, capped at reasonable sizes to keep the output manageable. If anything goes wrong, such as a network error, timeout, bad HTML, or empty content, the script exits cleanly with a structured error message, making it easy to integrate into larger workflows without surprises. The Python script can be found here.

5. The output and output contract

Now I needed to think about the output I wanted GitHub Copilot to deliver through the skills. To reiterate, I needed the following:

| File Name | Description |
| --- | --- |
| README.md | This is the operator-facing runbook that turns the generated artifacts into a working deployment. It contains no unresolved placeholders and no embedded credentials. The header summarizes the architecture and links back to the source blog. A prerequisites section lists required Azure access, Entra permissions, GitHub Environment setup, and local CLI versions. It includes tables of always-required GitHub secrets and variables, plus conditional ones based on deployment mode. Step-by-step numbered sections walk through bootstrapping the deployment principal and populating the GitHub Environment. Workflow blocks describe each Terraform validation, infrastructure deployment, and DAB deployment step, including file paths, triggers, and outputs. A commands section lists the exact Terraform and Databricks bundle sequences to run. Finally, assumption notes point the operator to TODO.md and SPEC.md for context. |
| TODO.md | The operator's checklist of remaining tasks. It uses a strict five-section format, and every entry follows the same fields (Heading, What this is, Why deferred, Source, Resolution, Done looks like), with no commands or code, only concepts and decisions. Each section captures a different layer of the remaining work: pre-deployment tasks like RBAC roles and GitHub secrets, deployment-time inputs like region and environment, post-infrastructure setup like Key Vault secrets and external locations, post-DAB work like Unity Catalog grants and job schedules, and architectural choices the orchestrator couldn't make (network posture, schemas, partitioning). Every entry comes from something the article left unstated, plus the universal post-deploy work for any Databricks deployment. The operator works through TODO.md sequentially, resolving each item before the system is production-ready. |
| SPEC.md | The structured, source-faithful read of the blog article, organized by checklist. Every item is marked as a stated value, inferred from code or diagrams, or "not stated in article." It includes architecture details, Azure services configuration, Databricks setup, data model, security and identity requirements, and observations. SPEC.md is the single source of truth that Terraform and DAB generators read from, TODO.md is populated from every "not stated" entry, and README.md references it for assumptions. This ensures the deployment is built on documented decisions, not hidden assumptions. |

Together, these files create a clear boundary: SPEC.md answers what the blog says, TODO.md captures what's missing or must be decided, and README.md tells you exactly how to deploy. This split is enforced by validation rules that fail if any content duplicates across the three files.
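To make the SPEC.md to TODO.md relationship concrete, here is a minimal Python sketch. The dictionary layout, the example keys and values, and the `todo_items` helper are hypothetical and exist only for this illustration; the one thing taken from the workflow described above is the literal marker string "not stated in article".

```python
"""Hypothetical sketch: derive TODO entries from unresolved SPEC items."""

NOT_STATED = "not stated in article"  # literal marker recorded in SPEC.md


def todo_items(spec: dict[str, str]) -> list[str]:
    """Return one TODO heading per checklist item the article left unstated."""
    return [
        f"Decide: {item}"
        for item, value in spec.items()
        if value.strip().lower() == NOT_STATED
    ]


# Illustrative checklist; keys and values are made up for this example.
spec = {
    "azure_region": "uksouth",
    "catalog_name": "not stated in article",
    "network_posture": "not stated in article",
    "storage_sku": "Standard_LRS (inferred from code snippet)",
}

for line in todo_items(spec):
    print(line)
```

Because the mapping is mechanical, a missing TODO entry points to a gap in the spec rather than a judgment call by the model.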
To make these files as repeatable as possible, I needed two things:

- Two templates, one for README.md and one for TODO.md, that the orchestrator fills in from SPEC.md at generation time.
- A broader delivery contract, output-contract.md, which lists the five files the orchestrator must produce. README.md and TODO.md are two of those five, and the templates are how they get produced.

The output-contract.md file defines a strict, ordered format that the agent must follow when transforming a blog article about Databricks-on-Azure architecture into a runnable repository. The first commit was deliberately minimal, as you can see from the file available here. No leaf-skill routing, no repo-context.md, no GitHub Actions workflows, no validation rules, no entry-field templates for TODO.md. That commit's single job was to lock down the shape of the output: what gets produced and in what order. Every commit since has refined how to produce that shape without changing what gets produced.

Putting the contract in the very first commit gave every later change a fixed reference point. Every leaf skill, generator script, and validation rule I've added since has fit into one of its five sections. The pipeline has changed; the deliverables haven't. The structure of the GitHub repo at commit 17ab443 can be seen in the picture below.

6. The README.md and TODO.md templates

After iteratively working on the orchestrator, a clear pattern emerged: the code-generation paths were fairly stable, but the documentation outputs weren't. Every run produced README.md and TODO.md from scratch in free-form Markdown. Across runs, the same content kept drifting. Section ordering changed between runs and the explanation of GitHub Environments was rewritten with subtle wording differences. RBAC roles appeared sometimes as lists, sometimes in prose, sometimes split across sections.
Universal post-deploy actions (create the secret scope, populate the vault, set up Unity Catalog grants) were re-derived every time, occasionally with steps missing. The root cause was that the orchestrator was treating durable, universal content as if it were per-run content.

So I decided to add two templates: README.md.template and TODO.md.template. The templates keep universal content (RBAC, TODO sections, GitHub setup) fixed in the template and substitute per-workload content (catalog names, credentials) from SPEC.md. This delivers consistency across runs. The README and TODO are structurally identical from run to run, so readers can navigate them intuitively. Universal content is correct by construction; I write it once, review it carefully, and every run inherits that quality. Validation also becomes more precise, and the agent's job shrinks from open-ended writing to mechanical substitution, which is easier to validate and maintain. Templates introduce clear vocabulary: {placeholder} is filled by the orchestrator at generation time, by the deployer at run time. Finally, templates enforce traceability: every "not stated in article" entry in SPEC.md automatically becomes a TODO entry via the from SPEC.md slot, making this an automatically-enforced rule.

I'm invoking the templates in the orchestrator as shown below. The Git commit with this code can be found at this link.

    ### 3.1 Generate README from template
    Load the template: `.github/skills/blog-to-databricks-iac/templates/README.md.template`

    ### 3.2 Generate TODO from template
    Load the template: `.github/skills/blog-to-databricks-iac/templates/TODO.md.template`

7. The output of the fetch_blog.py file and the interaction with the orchestrator

When the orchestrator invokes fetch_blog.py, the script produces a JSON output and passes it back to the orchestrator. The orchestrator then reads the JSON document into its working context and maps each field onto an analysis checklist.
The title and meta description establish the article identity and scope. Headings with their levels reveal the structure, helping the agent locate sections about architecture, security, data flow, and naming. Paragraphs provide evidence for stated values like regions, resource types, and RBAC models. Code blocks become the source of inferred values. As an example, a Terraform snippet might reveal SKU choices or naming patterns not mentioned in the text. These inferred values get tagged "inferred from code snippet" when recorded. The cloud hint acts as a sanity check that the article actually describes an Azure architecture.

For every checklist item, the agent records either an extracted value or the literal string "not stated in article". This becomes SPEC.md, the single source of truth for everything downstream.

SPEC.md drives every subsequent step. Steps 3 through 7 (the Terraform module, workflows, and Databricks bundle generators) read architectural decisions from it. Step 8 then produces TODO.md by converting every "not stated in article" entry into a TODO item the operator must resolve before deployment.

What I find worth pointing out is how little the output contract has actually moved since that very first commit. The implementation underneath has changed completely. Leaf skills emerged, generator scripts came in, validation rules got added, a soft-delete state machine showed up to handle Key Vault recovery. None of those existed at the start. But what the orchestrator delivers, the list of files it puts on disk, has stayed exactly the same. We have a much larger SKILL.md today that still mirrors the initial five-item output list. The contract itself has changed by exactly one line: the addition of "Design of the architecture" to section 5.
- SPEC.md: the structured, source-faithful read of the article, organised by the analysis checklist (link)
- TODO.md: the operator's checklist of everything the article didn't specify, plus the universal post-deploy actions (link)
- Terraform code under infra/terraform/: the platform layer with networking, storage, identities, Key Vault, workspace (link)
- Databricks Asset Bundle under databricks-bundle/: the workload layer with jobs, entry points, environment configuration (link)
- README.md: the operator runbook, with the architecture design diagram embedded (link)

If the JSON contains an error, the orchestrator stops immediately. Per the skill rule "If fetch fails, stop and return the fetch error output. Do not retry," the error surfaces to the user rather than propagating downstream.

So the script's output is the raw evidence pack: title, structure, prose, code, cloud hint. The agent uses it to fill the architecture spec, which parameterises every generated artifact.

At this point the fetch_blog.py output is sent to Step 2 of the orchestrator, as shown in the code snippet below.

    ### 2. Analyse article
    Analyse the fetched article against the structured checklist in `.github/skills/blog-to-databricks-iac/references/blog-analysis-checklist.md`. The analysis covers the article text, diagrams, screenshots, and code snippets.

And, much later in the orchestrator, Step 8 closes the loop by turning everything that's been recorded into the two operator-facing documents:

    ### 8. Generate README and TODO from templates
    Use the templates in `.github/skills/blog-to-databricks-iac/templates/`:
    - `README.md.template` -> `README.md`
    - `TODO.md.template` -> `TODO.md`

8. How this actually came together

What I've described so far is how the orchestrator works currently. The reality of building it was much more cumbersome, but also fun. I got from the first version to the current one by iterating.
Each iteration followed the same loop: rerun the orchestrator, find the defect, identify the rule that would have caught it, add the rule to the skill that owns the artifact, rerun.

The reason I'm calling this out now, before walking through the rest of the pipeline, is that everything from this point on is a story about a specific lesson learned that way. The leaf skills exist because a single SKILL.md got too dense. The restricted-tenant guardrails exist because the deployment failed against a tenant that couldn't read Microsoft Graph. The validation harness exists because prose rules weren't catching the regressions that mattered. The soft-delete state machine exists because the same vault name kept colliding with a previous deploy. None of these rules were present from day one.

So in the next sections I'll walk through how the pipeline actually matured: how the single skill split into a graph, what the inner regenerate-fix loop felt like in practice, the day the project pivoted to support restricted tenants, the bugs that became rules, and the Key Vault soft-delete state machine that closed the project out.

9. From a single skill to a skill graph

When I started, everything lived inside a single SKILL.md. It was simpler that way, and to be honest, at that point I didn't yet know which rules would actually matter. But as I kept rerunning the orchestrator on the article, a pattern emerged. Each rerun produced something that broke in a slightly different way, and the fix always belonged to a very specific concern: Terraform authoring, bundle structure, workflow generation, or the orchestration logic itself. Stuffing the rules for all of them into one file was making the orchestrator unreadable and, worse, was silently dropping rules when the context window got tight.

So I split it. The orchestrator stayed at the top, kept routing the work and validating the result, and each concern got promoted to its own leaf skill.
The Databricks bundle skill itself needed one more split a few days later because it had become too dense, so I broke it into two leaves:

- databricks-yml-authoring (link)
- Python-entrypoints (link)

The diagram below shows the shape the repo has today. The orchestrator now does almost no authoring. It owns the sequence of steps, the contract, and the validation gates, while everything else is delegated. This was the single biggest readability win. I wish I'd done it earlier. REPO_CONTEXT.md is one extra node in that diagram that I want to call out, but I'll come back to it later in section 13.

10. The inner loop: rerun, fail, fix the skill

If I had to describe the middle of this project in one sentence, it would be: every commit was a regeneration. I'd run the orchestrator end-to-end against the article and inspect the generated Terraform, the bundle, and the workflows. I'd find a defect, identify the rule that would have prevented it, add that rule to the skill that owns the artifact, then rerun, as shown in the image below.

This loop is what I think people miss when they treat AI-generated infrastructure code as a one-shot. The first run is never the deliverable. The deliverable is the skill that produces good runs. The generated files are disposable and can always be reproduced. The skill is what carries the knowledge forward.

I had to actively resist the temptation to fix bugs in the generated code directly. Patching infra/terraform/main.tf by hand fixes today's run but not tomorrow's, because the rule that would prevent the bug doesn't exist anywhere. So I made it a discipline: never edit the output, always edit the skill, then regenerate.

11. Restricted-tenant compatibility

The bug was simple to describe and brutal to fix: the deployment principal in the target tenant couldn't read Microsoft Graph.
Any Terraform data source that resolved an Entra name to an object ID at plan time (e.g., azuread_user, azuread_group, azuread_service_principal) blew up at terraform plan. My first instinct was "I'll just give the principal Graph permissions". But in a lot of real environments this is not possible. The principal that runs your IaC is governed by a security team, the team has a policy, and the policy says no Graph reads.

The pivot was getting the skill to produce Terraform that never reads Graph. Object IDs are inputs, not lookups. They come in as trusted secrets, the workflow exports them as TF_VAR_*, and Terraform consumes them as variables. No data "azuread_*" block is allowed in the generated code, ever.

I thought this was a simple fix. It wasn't. It cascaded into about six other things:

- App Registration vs Service Principal object IDs. The workflow was being given the wrong one. Role assignments need the Enterprise Application (Service Principal) object ID, not the App Registration object ID. The two are different objects in Entra with different IDs. I encoded the distinction in the skill as *_SP_OBJECT_ID (the Service Principal) versus *_CLIENT_ID (the App Registration's application ID). Naming carries the meaning now, so the wrong value is hard to pass.
- Single-principal mapping. In some tenants you only have one principal and it has to play both deployment and runtime roles. The skill grew a layer_sp_mode = existing input so the generator stops trying to create a new Service Principal and reuses the deployment one instead.
- Key Vault access policies, gone. Access policies were Graph-touching, and not all tenants support them anyway. The skill switched fully to RBAC role assignments (Key Vault Secrets User, and so on). A few cascading bugs followed, but this was the right call.

It took some time to harden the Terraform skill against everything the restricted tenant was throwing back.
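The "no Graph reads" rule lends itself to a mechanical check. The sketch below is one hypothetical way to express it in Python; it is not taken from the repo, the `find_graph_lookups` name is invented for this example, and it uses a simple line-based regular expression rather than a real HCL parser.

```python
"""Hypothetical guard: flag Graph-reading data sources in generated Terraform."""
import re

# Matches the start of any `data "azuread_..." "name"` block.
# Simplification: this is a line-based regex, not an HCL parser, so it can
# also match text inside a multi-line comment or heredoc.
AZUREAD_DATA_BLOCK = re.compile(r'^\s*data\s+"(azuread_[A-Za-z0-9_]+)"', re.MULTILINE)


def find_graph_lookups(terraform_source: str) -> list[str]:
    """Return the azuread data source types found in the given HCL text."""
    return AZUREAD_DATA_BLOCK.findall(terraform_source)


# Object ID supplied as an input variable: allowed by the rule.
allowed = '''
variable "deployment_sp_object_id" {
  type = string
}
'''

# Object ID resolved through a Graph lookup at plan time: forbidden.
forbidden = '''
data "azuread_service_principal" "deployer" {
  display_name = "example"
}
'''

print(find_graph_lookups(allowed))    # []
print(find_graph_lookups(forbidden))  # ['azuread_service_principal']
```

A generation run could treat any non-empty result as a failure, which turns the prose rule into something that cannot be silently skipped.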
Each iteration had the same shape: the orchestrator runs, hits a fresh provider error, I add the rule, run again, hit the next one. The commit subjects from that run are basically a transcript of the conversation I was having with the platform.

12. The bugs that became rules

There are three bugs that I believe are worth telling the story of, because they each illustrate a slightly different lesson.

The HCL trim() arity bug. The generator emitted trim(var.something) in a validation block. HCL's trim() takes two arguments, not one. The function I actually wanted was trimspace(). This is the kind of bug that any human would catch in a code review in two seconds, and which the model produced confidently because the shape of the call looked right. I added the rule to the Terraform skill ("for whitespace trimming use trimspace, never trim") and the bug never came back. Lesson: even for trivial syntactic mistakes, the fix belongs in the skill.

The variable shadowing bug. The deploy workflow had a job-level env: block that set TF_VAR_key_vault_recover_soft_deleted to a static value. A detection step earlier in the workflow was supposed to compute the right value at runtime and write it via $GITHUB_ENV. The problem is that GitHub Actions resolves job-level environment variables before $GITHUB_ENV writes take effect, so the static value always won and the dynamic one was silently ignored. The fix was to never set the recovery flag at job level. It must be written in the detection step, on every code path, including the trivial "no recovery needed" path. Lesson: state must be explicit, not inherited. If a flag has three possible meanings, three code paths must each write it.

The hardcoded -platform suffix. The workflow had a shell-side suffix that someone (let's be honest, the model) had invented to make the resource group name "look right".
When recovery logic started running and the workflow looked for the canonical resource group, it looked for -platform instead of whatever the Terraform locals.tf actually emitted. The result was that the recovery handler was happily reaching past the real resource group and into a different one. I made it a rule in the orchestrator: workflow-invented suffixes are not permitted. Naming is owned by Terraform's locals.tf.

There are seventeen more defects in the catalogue, and the pattern is the same in every case. The bug surfaces, the rule gets written, the rule lives in the skill that owns the affected artifact.

There is no implementation-learnings.md in the repo. There used to be, but I deleted it, because a tracked log of past bugs sitting next to a skill that is already supposed to encode the lessons from those bugs is a duplication waiting to drift. If the rule is in the skill, the log is redundant. If the rule isn't in the skill, the log is evidence that I haven't finished the work. Either way, the right place for bug history is git log.

13. Splitting "the skill" from "this repo's defaults"

I then wanted the orchestrator to be portable, but every run kept needing the same handful of decisions. Which Azure region by default? Which environment names? Which catalog naming convention? These weren't part of the article. They weren't part of the Terraform skill either. They were specific to this repository's opinion about how things should be deployed. If I baked them into the orchestrator, the orchestrator stopped being portable. If I left them out, every run produced unhelpful "not stated in article" entries for the same five universal decisions.

The answer was a new file called REPO_CONTEXT.md stored in the repo root. It's read by the orchestrator before generation and it carries the defaults that are owned by the repo, not by the skill.
The split looks like this in practice:

SKILL.md answers the question "how do I turn an article into a runnable repo?" It is portable.
REPO_CONTEXT.md answers the question "what does this repo default to when the article doesn't say?" It is local.

Cloning the orchestrator into another GitHub project is now a clean operation. You take the skill, you write your own REPO_CONTEXT.md, and the same generator produces output appropriate to your environment.

14. The Validations

Most of the rules I'd written into the skills were prose. "Don't invent suffixes." "Object IDs are inputs, not lookups." "Every required Terraform variable must have a matching TF_VAR_* in the workflow." The model is good at following prose rules most of the time, but not every time. So a few of the most regression-prone rules became executable.

The most important one is scripts/validate_workflow_parity.sh. Every variable declared in infra/terraform/variables.tf must appear as a TF_VAR_* export in the deploy workflow. The script greps both files, diffs the sets, and exits non-zero if they don't match. It is run at the end of generation. If it fails, the run failed, even if everything else looks fine.

This caught real bugs. The most embarrassing was a variable I'd added to variables.tf and forgot to wire through the workflow. terraform plan would prompt interactively for it on a non-interactive runner, and the run would hang.

The rule of thumb I've ended up with is: prose rules are the default, but if a rule has been violated more than twice, it gets promoted to an executable check. There's a short list of those checks now, and that short list is what carries the load.

15. Key Vault soft-delete state machine

Key Vaults in Azure have soft delete on by default. When you delete a vault, it sticks around for ninety days in a "soft-deleted" state. If you try to create a vault with the same name in the same subscription during that window, the deploy fails. The right behaviour is to recover the soft-deleted vault, not create a new one.
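Before moving on to the recovery handler, here is a minimal Python sketch of the parity check from section 14. The real check is the shell script scripts/validate_workflow_parity.sh, whose contents are not reproduced here; the function names and regular expressions below are assumptions chosen for illustration, and the regexes only cover the simple case of one variable "name" block per declaration.

```python
import re
import sys

# Assumed patterns: a `variable "name"` block header and any TF_VAR_<name> token.
VARIABLE_RE = re.compile(r'^\s*variable\s+"([^"]+)"', re.MULTILINE)
TF_VAR_RE = re.compile(r"\bTF_VAR_([A-Za-z0-9_]+)")


def parity_gaps(variables_tf: str, workflow_yaml: str) -> tuple[set[str], set[str]]:
    """Return (declared but never exported, exported but never declared)."""
    declared = set(VARIABLE_RE.findall(variables_tf))
    exported = set(TF_VAR_RE.findall(workflow_yaml))
    return declared - exported, exported - declared


def main(tf_path: str, workflow_path: str) -> int:
    """Print every mismatch and return a non-zero code if the sets differ."""
    with open(tf_path, encoding="utf-8") as tf, open(workflow_path, encoding="utf-8") as wf:
        missing, extra = parity_gaps(tf.read(), wf.read())
    for name in sorted(missing):
        print(f"missing export: TF_VAR_{name}")
    for name in sorted(extra):
        print(f"export without variable: TF_VAR_{name}")
    return 1 if missing or extra else 0


if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Reporting both directions of the diff is deliberate: a missing export is what makes terraform plan hang on a prompt, while a stale export usually means a variable was renamed on one side only.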
The first version of my recovery handler covered exactly one case: if the vault is soft-deleted, recover it. This worked the first time I ran it. The second time, the recovered vault came back into the previous resource group, not the new one I had just created. Terraform then tried to create a new vault in the correct resource group and failed because the name was already taken globally. The handler had no concept of "the recovered vault is in the wrong resource group." So I added that case. The third time, the previous resource group itself was gone, and the handler was looking for it to verify the move. So I added that case too. By the end, the state machine had three distinct cases and two preconditions, as shown in the diagram below. The reason I keep coming back to this state machine is that it captures something that I think is generally true about agent-generated infrastructure code. The happy path is easy and meaningless, while the value is in the failure modes. The first version that worked on a clean tenant was about ten lines of bash. The version that works on a tenant that has been deployed-into and partially-torn-down five times is six times longer, and every additional line of it corresponds to a real environmental condition that I had to learn the hard way. 16. What I've learned so far I'm not going to pretend the full list of principles below was clear to me on day one. Every single one of these was learned by getting it wrong first. Looking back at the history, though, they are the ones that survived contact with reality. The contract precedes the implementation. output-contract.md was committed before any generator existed. Locking the shape of the deliverable first meant every later change had a fixed reference point. Generators, not stencils. Workflows are produced by Python scripts that take parameters and emit YAML. 
When restricted-tenant logic and the soft-delete state machine arrived, they needed conditional structure that a static template can't express.

Every bug becomes a rule. Patching the generated code is a tax on tomorrow's run, while patching the skill is an investment.

Each concern has a clear owner. The orchestrator routes, the leaves author, and the repo context holds the local defaults.

Restricted-tenant compatibility is non-negotiable. No Microsoft Graph reads from generated Terraform. Object IDs are trusted inputs. Single-principal mapping is supported.

Naming is owned by Terraform. No suffixes invented in shell. The validation harness enforces this.

State must be explicit, not inherited. Every workflow run writes its own flags. No reliance on env defaults from a previous step or a previous run.

Validation is executable when a rule has been violated more than twice. Prose rules are the default. Promotion to a script is earned.

Operator docs describe concepts, not commands. Command syntax ages out, while conceptual descriptions don't. The TODO template enforces this rule.

Add strong testing at the end of the process, once all the files are generated. Each run may produce slightly different output and introduce bugs, even if the previous run was successful.

End-to-end runs against dirty tenants are the truth. The acceptance test isn't a clean-room deploy. It's a deploy into a tenant that has soft-deleted vaults, lingering RGs, and existing role assignments. Until that works, the project isn't done.

From time to time, skills need to be reviewed and consolidated.

The summary of the journey above is the one I find most useful to share when people ask whether this approach actually goes anywhere: from an empty repo to a generator that produces a deployable, restricted-tenant-compatible infrastructure-as-code repository from a blog URL, with executable validation and a recovery state machine that survives a previously-deployed environment.
The first commit was an empty workspace. The last commit was the one where the same orchestrator, run against the same blog, against a tenant carrying state from five previous runs, deployed cleanly with no manual intervention. That is what I was aiming to achieve when I started! Thanks for reading.

How to Secure Azure Databricks without Public Exposure using WAF + Private Endpoints
This blog outlines a Zero Trust–aligned architecture for securing Azure Databricks using Application Gateway (WAF) and Private Endpoints within a Hub-Spoke network model. It enables a true Zero Trust model, ensuring no direct exposure of Databricks, full traffic inspection, and compliance-ready secure access for both internal and external users.

Resilient by Design: Azure Databricks Disaster Recovery Strategy
Introduction: From Recovery Plans to Resilience Strategy

As organizations increasingly rely on Azure Databricks for mission-critical analytics and data engineering workloads, the need for robust disaster recovery (DR) strategies becomes paramount. These platforms are no longer just analytics engines; they power real-time decisions, AI models, and core business operations. Yet many organizations still approach DR as a reactive safeguard rather than a strategic capability.

Resilience today is not about "if something fails," but about ensuring continuity, trust, and performance under any condition. A modern DR strategy must therefore evolve beyond backup configurations and failover scripts. It must align with business priorities, regulatory requirements, risk tolerance, and operational maturity to become a core pillar of the enterprise data platform.

In this context, organizations are increasingly adopting architecture patterns that enable cross-region resilience for the Azure Databricks Lakehouse. This pattern includes synchronizing Unity Catalog objects (catalogs, schemas, tables, views, functions, models, and volumes) across regions, combined with scalable data movement mechanisms and secure data access approaches such as Delta Sharing and high-performance transfer tools.

To help organizations operationalize this approach today, we have defined a structured strategy for synchronizing Unity Catalog objects and associated data across regions, enabling a resilient-by-design Azure Databricks architecture. This post focuses on that approach, outlining the key architectural patterns, strategic considerations, and practical implementation steps required to design and enable cross-region resilience.

In October 2025, Databricks announced a Managed Disaster Recovery solution, developed in collaboration with Capital One, which includes managed replication, customer-specified failover, and read-only secondary capabilities.
The approach outlined in this post serves as a complementary, customer-managed pattern, providing a practical and production-ready path for organizations to achieve robust disaster recovery and business continuity while Databricks continues to expand its native DR capabilities.

Why Disaster Recovery for Azure Databricks is Different

Traditional Disaster Recovery approaches do not fully apply to modern Lakehouse platforms. In Azure Databricks, resilience must account for:

Tight coupling between data, compute, and metadata (Unity Catalog)
Distributed pipelines (batch, streaming, ML)
Decentralized workspace ownership and rapid platform growth

This makes disaster recovery not just an infrastructure concern, but a data platform design challenge.

Figure 1. Main Disaster Recovery Considerations

Understanding the Fundamentals: RTO, RPO, and DR Trade-offs

Before defining a disaster recovery strategy, it is essential to understand the core concepts that drive design decisions. Recovery Time Objective (RTO) defines how quickly a system must be restored after a disruption, while Recovery Point Objective (RPO) defines how much data loss is acceptable. These two metrics directly influence the architecture, cost, and complexity of any DR solution.

As illustrated in Figure 1, there is a clear trade-off between cost and recovery performance:

Active-active (hot) architectures minimize downtime and data loss but come at a higher cost.
Warm standby provides a balance between cost and recovery time.
Cold DR is cost-efficient but results in longer recovery times and higher data loss risk.

Understanding these trade-offs is critical to aligning DR strategy with business expectations.

Designing for Resilience: A Phased Disaster Recovery Approach

Disaster recovery has evolved beyond a one-time setup into a structured, lifecycle-driven capability. Leading organizations design resilience intentionally, implement it systematically, and continuously validate it to ensure ongoing effectiveness. The framework outlined below provides a practical and strategic approach to operationalizing disaster recovery in Azure Databricks environments, bridging the gap between architectural intent and true operational readiness.

Figure 2. Different Phases of Azure Databricks Disaster Recovery

Phase 1: Discovery & Assessment

A resilient disaster recovery strategy starts with clarity, yet in many Azure Databricks environments that clarity is often missing. As platforms evolve, clusters multiply, jobs are duplicated, and data assets grow, making it increasingly difficult to answer a simple question: what do we actually have, and how critical is it? The Discovery phase addresses this by establishing a single, authoritative view of the platform. By consolidating all assets, dependencies, and usage patterns into a structured baseline, organizations can move from fragmented visibility to informed decision-making.
This approach aligns closely with the concepts outlined in “From Chaos to Clarity: Your Databricks Workspace on a Single Pane of Glass”, where establishing a comprehensive inventory becomes the foundation for governance, optimization, and ultimately resilience. This foundation enables teams to identify what matters most, define appropriate RTO and RPO targets, and understand the dependencies that will ultimately shape their disaster recovery strategy. Outcome A clear, data-driven baseline of the environment—enabling confident workload prioritization and effective disaster recovery design. Phase 2: Strategy & Design Once visibility is established, the next step is making deliberate design choices—balancing resilience, cost, and complexity. At this stage, organizations define how their platform should behave under failure. This typically starts with selecting a multi-site deployment pattern, in which two primary approaches are commonly adopted: Active–Active, where both regions are fully operational and serve live workloads Active–Passive (Warm Standby), where a secondary region is pre-provisioned and activated only during failover Active–active architectures provide near-zero downtime and minimal data loss but come with increased cost and architectural complexity. Active–passive patterns offer a more cost-efficient alternative, with slightly higher recovery times depending on how failover is orchestrated. Beyond selecting the deployment pattern, a key architectural decision is how data is replicated across the Medallion architecture (Bronze, Silver, Gold). Our approach introduces a set of practical scenarios that allow organizations to tailor resilience based on both workload criticality and recovery requirements. 
A common starting point is aligning the DR strategy to workload tiers, such as:

Tier 1 (Mission-critical): Active–Active with full replication
Tier 2 (Business-critical): Active–Passive with partial replication

Building on this, organizations can further refine their approach by defining how data is replicated across the Medallion layers:

Full replication (Bronze, Silver, Gold): fastest recovery at highest cost
Bronze-only replication: lower cost, with re-computation required during recovery
Gold-only replication: optimized for consumption-focused use cases

This combination of workload tiering and Medallion replication strategies enables a flexible, fit-for-purpose approach to disaster recovery, which balances performance, cost, and operational complexity.

Below we demonstrate, as an example, two representative patterns: (a) Active–Active architecture, where data pipelines operate in continuous trigger mode across regions, enabling near real-time synchronization; and (b) Active–Passive architecture, where all layers are replicated using a clone-based approach and activated on demand during failover. These scenarios highlight how organizations can balance recovery performance and cost by adjusting both the deployment model and the depth of data replication.

Figure 3. Active - Active Scenario - Continuous Trigger Mode

Within the active–passive model, multiple variations can be applied, ranging from full replication of all medallion layers to more selective approaches (such as replicating only Bronze or Gold layers). This flexibility allows organizations to further balance recovery performance, cost, and operational complexity.

Figure 4. Active - Passive Scenario - Clone All Layers Mode

Phase 3: Disaster Recovery Implementation & Enablement

With the strategy defined, the focus shifts to translating design into a repeatable and operational solution.
At this stage, resilience is no longer conceptual; it is embedded into the platform through automation, data replication, and standardized deployment patterns.

From Strategy to Architecture

At a high level, the DR architecture spans both the primary and secondary Azure regions, ensuring that all critical components can be either replicated or recreated:

Control plane synchronization: Users, groups, and workspace assets are replicated using SCIM, Terraform, and CI/CD pipelines.
Workspace and metadata portability: Jobs, notebooks, and configurations are defined as code and deployed consistently across regions.
Data layer replication: Managed data, external data, and streaming checkpoints are synchronized using deep clone operations.

This layered approach ensures that the platform can be reconstructed end-to-end, not just partially recovered.

Unity Catalog-Driven Replication

A critical aspect of the implementation is the replication of Unity Catalog metadata and associated data assets. This includes:

Synchronizing catalogs, schemas, tables, views, functions, and volumes
Using Delta Sharing to expose datasets across regions
Leveraging deep clone and storage replication to ensure data availability
Recreating external and managed locations in the target region

By combining metadata synchronization with data replication, the target environment becomes a fully functional mirror of the source.

Figure 5. Unity Catalog Focused DR Mechanisms

Operationalizing with a DR Pipeline

To make this repeatable, the architecture is supported by a DR pipeline that orchestrates the process end-to-end:

Synchronize schemas and Unity Catalog structures
Perform deep clone of Delta tables
Recreate views and dependent objects
Provision volumes and copy associated data
Ensure consistency across storage layers (e.g., ADLS via AzCopy)

This pipeline can operate either continuously or on demand, depending on the selected DR pattern.

Figure 6. Azure Databricks DR Replication Workflow

Outcome

A fully implemented disaster recovery solution where data, metadata, and platform components are consistently synchronized, enabling rapid and reliable activation of workloads in a secondary region.

Phase 4: DR Drill: Validation, Operations & Continuous Improvement

A disaster recovery strategy is only valuable if it works when needed. This phase focuses on validating, operating, and continuously improving the DR solution to ensure it meets business expectations.

Failover & Failback in Practice

In a real failure scenario, the transition to the secondary region must be simple, predictable, and fast. A typical failover process includes:

Detecting primary region unavailability
Executing a final synchronization (if possible)
Redirecting connections to the DR workspace
Resuming operations without requiring code changes

Equally important is failback, once the primary region is restored:

Re-synchronizing data from DR to primary
Switching pipelines and configurations back
Gradually restoring normal operations

Because infrastructure and metadata are standardized, this process becomes operational rather than reactive.

Operating DR as a Continuous Capability

Beyond failover, DR must be actively managed as part of daily operations:

Monitoring & Alerting: Track job failures, performance bottlenecks, and system health
Governance & Change Management: Maintain consistency between environments using IaC and version-controlled pipelines
Continuous Optimization: Adjust replication strategies, scaling, and performance as workloads evolve

This ensures the DR solution remains aligned with both technical and business changes over time.
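As one illustration of what monitoring a DR setup can mean in practice, the sketch below compares each table's last successful replication time with the RPO target of its workload tier and reports breaches. It is a minimal sketch under stated assumptions: the tier names, the RPO values, and the function name are hypothetical, and how the last-sync timestamps are collected (job run history, an audit table, and so on) is left out.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical RPO targets per workload tier; real values come from the business.
RPO_TARGETS = {
    "tier1": timedelta(minutes=15),
    "tier2": timedelta(hours=4),
}


def rpo_breaches(
    last_sync: dict[str, datetime],
    tiers: dict[str, str],
    now: datetime,
) -> dict[str, timedelta]:
    """Return {table: lag} for tables whose replication lag exceeds their tier's RPO."""
    breaches = {}
    for table, synced_at in last_sync.items():
        lag = now - synced_at
        if lag > RPO_TARGETS[tiers[table]]:
            breaches[table] = lag
    return breaches


if __name__ == "__main__":
    now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    last_sync = {
        "sales.gold.orders": now - timedelta(minutes=40),
        "hr.silver.headcount": now - timedelta(hours=1),
    }
    tiers = {"sales.gold.orders": "tier1", "hr.silver.headcount": "tier2"}
    for table, lag in rpo_breaches(last_sync, tiers, now).items():
        print(f"RPO breach: {table} is {lag} behind")
```

Passing the current time in as a parameter keeps the check deterministic, which makes it straightforward to exercise during a DR drill.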
Ensuring Performance, Integrity, and Security A production-ready DR solution must also guarantee: Performance & Scalability: Optimize compute, autoscaling, and data transfer to handle recovery scenarios efficiently Data Integrity & Consistency: Validate schema synchronization, monitor replication jobs, and ensure parity between regions Security & Compliance: Enforce consistent access controls, secure credentials, and enable audit logging across environments Outcome A validated and continuously evolving DR capability—where recovery processes are tested, monitored, and improved over time, providing confidence to both technical teams and business stakeholders. Key Takeaways and Closing Thoughts Resilience in modern data platforms is no longer defined by how quickly systems can recover, but by how effectively they are designed to withstand disruption in the first place. Azure Databricks, as a core engine for data, analytics, and AI, requires a disaster recovery approach that extends beyond infrastructure—one that treats data, metadata, pipelines, and governance as a unified system. By combining a structured discovery phase, a strategy aligned to workload criticality, and automated, repeatable implementation patterns, organizations can move from reactive recovery to resilience by design. This not only reduces risk, but also ensures that critical data workloads remain available, trusted, and performant when it matters most. The approach outlined in this post provides a practical and flexible way to enable cross-region resilience today, while also complementing the managed disaster recovery capabilities expected to be introduced by Databricks. As we anticipate the availability of these native features, this approach offers a production-ready foundation that can extend and integrate with future platform capabilities. 
In a world where disruption is inevitable, the objective is no longer simply to recover, but to maintain continuity of data, decisions, and business operations with confidence.

Special thanks to Vasilis Zisiadis and Dimitris Kotanis, who contributed their expertise to create this material and bring it to life. Thank you to Antony Bitar, Collin Brian and Jason Pereira for their support in reviewing the content.

Approaches to Integrating Azure Databricks with Microsoft Fabric: The Better Together Story!
Azure Databricks and Microsoft Fabric can be combined to create a unified and scalable analytics ecosystem. This document outlines eight distinct integration approaches, each accompanied by step-by-step implementation guidance and key design considerations. These methods are not prescriptive: your cloud architecture team can choose the integration strategy that best aligns with your organization’s governance model, workload requirements and platform preferences. Whether you prioritize centralized orchestration, direct data access, or seamless reporting, the flexibility of these options allows you to tailor the solution to your specific needs.

From Chaos to Clarity: Your Databricks Workspace on a Single Pane of Glass
The question that never stays answered, until now

As Azure Databricks workspaces evolve, complexity creeps in unnoticed. Every Azure Databricks conversation with customers eventually lands on the same question: “What do we actually have in this workspace?” Over time, clusters multiply, jobs get cloned, warehouses are spun up for one-off demos and forgotten, and Unity Catalog keeps expanding until it’s hard to reason about.

In most enterprises, each business or data science team operates its own workspace, while the central platform or operations team has little to no visibility into what’s being created or why. Teams often spend days, or weeks, trying to piece together what exists, who owns it, and the business purpose behind it, only to realize they still don’t have the full picture. And when the same question comes up next quarter, the cycle starts all over again.

To address this, we built a utility that helps customers answer exactly that question by providing a single pane of glass for all Databricks assets through comprehensive cataloging and usage analysis. The utility works in two phases: Discovery and Analysis. This post focuses on the first step, the Discovery phase, where we establish a clear, authoritative inventory of everything that exists in the workspace.

What the Discovery Phase delivers

Think of the Discovery phase as a workspace health assessment. Once configured against a target workspace, the utility runs in a selected mode and consolidates all discovered assets into a centralized, Delta-based repository. The result is a structured, queryable, and dashboard-ready metadata store. Behind the scenes, ten purpose-built scanners run in a tiered and parallelized architecture, enabling a fast yet comprehensive scan of the entire workspace.
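The tiered, parallelized scanning idea can be sketched with Python's standard thread pool. This is an illustration under stated assumptions, not the utility's actual code: run_tier and run_discovery are hypothetical names, the scanners are stand-in callables, and the worker counts and timeouts in the usage example are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tier(scanners, max_workers, timeout_s):
    """Run one tier's scanners in parallel and collect results by scanner name."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in scanners.items()}
        for name, future in futures.items():
            try:
                # The timeout bounds how long we wait for this result; it does not
                # kill the scanner, and the pool still waits for it when closing.
                results[name] = future.result(timeout=timeout_s)
            except Exception as exc:
                # One failed or slow scanner should not abort the whole tier.
                results[name] = {"error": repr(exc)}
    return results


def run_discovery(tiers):
    """Run tiers in order; a tier starts only after the previous one has finished."""
    inventory = {}
    for scanners, max_workers, timeout_s in tiers:
        inventory.update(run_tier(scanners, max_workers, timeout_s))
    return inventory


if __name__ == "__main__":
    tiers = [
        ({"clusters": lambda: 12, "warehouses": lambda: 3}, 4, 5),  # lightweight, wide
        ({"jobs": lambda: 240}, 1, 5),                              # heavier, narrow
    ]
    print(run_discovery(tiers))
```

Keeping the heavier tier on fewer workers is one simple way to stay under API rate limits while the lightweight scans return quickly.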
Scanner | What is Cataloged
Clusters | Interactive, job, SQL: configs, policies, pools
Jobs | Workflows, schedules, tasks, run history
Warehouses | SQL endpoints, sizes, serverless settings
Pipelines | Delta Live Tables and their state
Unity Catalog | Catalogs, schemas, tables, volumes
Workspace Objects | Notebooks, repos, ML experiments, serving endpoints, alerts, Genie spaces
Security | Identity, network, data protection settings
Billing | 30–180 days of DBU usage by SKU and product
Utilization | Real CPU, memory, runtime patterns (deep scan)
Spark Job Optimizer (plugin) | Skew, spill, small files, broadcast hints (deep scan)

Design Overview

# | Block | Role | Contents / Flow
1 | Source | Starting point: the Databricks environments being discovered. | One or more Azure Databricks workspaces. Auth via OAuth. Outputs an authenticated WorkspaceClient to the Orchestrator.
2 | Orchestrator | The brain of the utility: coordinates scanning, concurrency, retries, timing. | Tiered thread-pool executor, scan config (mode, billing window, UC depth, max workers). Dispatches scanners in controlled waves.
3 | Tier 1 Scanners | Lightweight, high-concurrency scans. Run first for quick signal. | Clusters, Warehouses, Pipelines, Security. Up to 12 workers, 10-min timeout. Artifacts flow to the Centralized Repository.
4 | Tier 2 Scanners | High-volume scans. Controlled concurrency to avoid API throttling. | Jobs, Workspace Objects (notebooks, repos, experiments, serving, alerts, Genie), Unity Catalog, Billing (30–180 days DBU). 1/2 workers, 30-min timeout.
5 | Tier 3 Scanners | Sequential, analysis-grade scans (deep scan only). | Utilization (CPU, memory, SQL usage patterns) and Spark Job Optimizer plugin (skew, spill, small files, broadcast hints). Runs after Tiers 1 & 2.
6 | Centralized Repository | The catalog of truth: where all output lands, timestamped and queryable. | Unity Catalog Delta tables (dashboard-ready) plus portable JSON and CSV exports for offline sharing or downstream tools.
7 | Single Pane of Glass | The user-facing view: insight at a glance. | Pre-built Lakeview dashboard: KPI strip, inventory charts, and week-over-week trends. Refresh to see current workspace state.

Why users love the view: visualization that earns its keep

This is where the Discovery phase stops being just a scan and starts becoming a decision-making tool. Because everything is consolidated into a single, Unity Catalog–backed source of truth, the Lakeview dashboard delivers a genuine single pane of glass for the entire Databricks workspace. At a glance, you get:

KPI strip at the top: total clusters, active jobs, UC tables, SQL warehouses, DLT pipelines, workspace objects. One glance, one number each.
Inventory charts: clusters by type, jobs by schedule, warehouses by size, tables by catalog. The shape of your workspace becomes obvious.
The “that doesn’t look right” moments: the idle SQL warehouse with zero queries, the cluster running the wrong runtime, the notebook floating outside any repo. These surface instantly, without hunting.
Change over time: because every scan is timestamped, you can literally see your platform grow (or sprawl) week over week.

In the first customer walkthrough, the platform team identified an always-on SQL warehouse with zero queries and three jobs running on the wrong compute tier, all within the first 30 minutes. That single view paid for the project.

Sample Item Catalog

Closing thoughts

The Discovery phase isn’t about governance for governance’s sake; it’s about clarity. Before teams can optimize costs, improve performance, or enforce standards, they first need a reliable answer to a basic question: what actually exists today? By giving platform and operations teams a single, authoritative view of all Databricks assets, grounded in data rather than tribal knowledge, Discovery turns guesswork into informed decisions.
In the next phase, Analysis, that foundation is used to go deeper: identifying inefficiencies, risks, and opportunities to simplify and optimize the platform. But it all starts here, by finally knowing what you have.

Special thank you to Antony Bitar, Collin Brian and Jason Pereira for their support in reviewing the content.

Guide for Architecting Azure-Databricks: Design to Deployment
Authors: Chris Walk cwalk, Dan Johnson danjohn1234, Eduardo dos Santos eduardomdossantos, Ted Kim tekim, Eric Kwashie ekwashie, Chris Haynes Chris_Haynes, Tayo Akigbogun takigbogun and Rafia Aqil Rafia_Aqil
Peer Reviewed: Mohamed Sharaf mohamedsharaf

Note: We are currently updating this article to add the Serverless Workspace option. Also, while Terraform is the recommended method for production deployments due to its automation and repeatability, for simplicity in this article we will demonstrate deployment through the Azure portal.

Introduction Video to Databricks: what is databricks | introduction - databricks for dummies

DESIGN: Architecting a Secure Azure Databricks Environment

Step 1: Plan Workspace, Subscription Organization, Analytics Architecture and Compute

Planning your Azure Databricks environment can follow various arrangements depending on your organization’s structure, governance model, and workload requirements. The following guidance outlines key considerations to help you design a well-architected foundation.

1.1 Align Workspaces with Business Units

A recommended best practice is to align each Azure Databricks workspace with a specific business unit. This approach, often referred to as the “Business Unit Subscription” design pattern, offers several operational and governance advantages.

Streamlined Access Control: Each unit manages its own workspace, simplifying permissions and reducing cross-team access risks. For example, Sales can securely access only their data and notebooks.

Cost Transparency: Mapping workspaces to business units enables accurate cost attribution and supports internal chargeback models. Each workspace can be tagged to a cost center for visibility and accountability. Even within the same workspace, costs can be controlled using system tables that provide detailed usage metrics and resource consumption insights.

Challenges to keep in mind: While per-BU workspaces have high impact, be mindful of workspace sprawl.
If every small team spins up its own workspace, you might end up with dozens or hundreds of workspaces, which introduces management overhead. Databricks recommends a reasonable upper limit (on Azure, roughly 20–50 workspaces per account/subscription) because managing “collaboration, access, and security across hundreds of workspaces can become extremely difficult, even with good automation” [1]. Each workspace will need governance (user provisioning, monitoring, compliance checks), so there is a balance to strike.

1.2 Workspace Alignment and Shared Metastore Strategy

As you align workspaces with business units, it's essential to understand how Unity Catalog and the metastore fit into your architecture. Unity Catalog is Databricks’ unified governance layer that centralizes access control, auditing, and data lineage across workspaces. Unity Catalog is backed by a metastore, which acts as the central metadata repository for tables, views, volumes, and other data assets. In Azure Databricks, you can have one metastore per region, and all workspaces within that region share it. This enables consistent governance and simplifies data sharing across teams. If your organization spans multiple regions, you’ll need to plan for cross-region sharing, which Unity Catalog supports through Delta Sharing.

By aligning workspaces with business units and connecting them to a shared metastore, you ensure that governance policies are enforced uniformly, while still allowing each team to manage its own data assets securely and independently.

1.3 Distribute Workspaces Across Subscriptions

When scaling Azure Databricks, consider not just the number of workspaces, but also how to distribute them across Azure subscriptions. Using multiple Azure subscriptions can serve both organizational needs and technical requirements:

• Environment Segmentation (Dev/Test/Prod): A common pattern is to put production workspaces in a separate Azure subscription from development or test workspaces.
This provides an extra layer of isolation. Microsoft strongly recommends separating production and development workspaces into separate subscriptions. This way, you can apply stricter Azure policies or network rules to the prod subscription and keep the dev subscription a bit more open for experimentation without risking prod resources.
• Honor Azure Resource Limits: Azure subscriptions come with certain capacity limits and Azure Databricks workspaces have their own limits (since it’s a multi-tenant PaaS). If you put all workspaces in one subscription, or all teams in one workspace, you might hit those limits.

Most enterprises naturally end up with multiple subscriptions as they grow; planning this early avoids later migration headaches. If you currently have everything in one subscription, evaluate usage and consider splitting off heavy workloads or prod workloads into a new one to adhere to best practices.

1.4 Consider Completing Azure Landing Zone Assessment

When evaluating and planning your next deployment, it’s essential to ensure that your current landing zone aligns with Microsoft best practices. This helps establish a robust Databricks architecture and minimizes the risk of avoidable issues. Additionally, customers who are early in their cloud journey can benefit from Cloud Assessments (such as an Azure Landing Zone Review and a review of the “Prepare for Cloud Adoption” documentation) to build a strong foundation.

1.5 Planning Your Azure Databricks Workspace Architecture

Your workspace architecture should reflect the operational model of your organization and support the workloads you intend to run, from exploratory notebooks to production-grade ETL pipelines. To support your planning, Microsoft provides several reference architectures that illustrate well-architected patterns for Databricks deployments.
These solution ideas can serve as starting points for designing maintainable environments:

• Simplified Architecture: Modern Data Platform Architecture
• ETL-Intensive Workload Reference Architecture: Building ETL Intensive Architecture
• End-to-End Analytics Architecture: Create a Modern Analytics Architecture

1.6 Planning for the “Right” Compute

Choosing the right compute setup in Azure Databricks is crucial for optimizing performance and controlling costs, as billing is based on Databricks Units (DBUs) using a per-second pricing model.

• Classic Compute: You can fine-tune your own compute by enabling auto-termination and autoscaling, using Photon acceleration, leveraging spot instances, selecting the right VM type and node count for your workload, and choosing SSDs for performance or HDDs for archival storage. Preferred by mature internal teams and developers who need advanced control over clusters, such as custom VM selection, tuning, and specialized configurations.
• Serverless Compute: Alternatively, managed services can simplify operations with built-in optimizations. Serverless compute removes infrastructure management and offers instant scaling without cluster warm-up, making it ideal for agility and simplicity.

Step 2: Plan the “Right” CIDR Range (Classic Compute)

Note: You can skip this step if you plan to use serverless compute for all your resources, as CIDR range planning is not required in serverless deployments.

When planning CIDR ranges for your Azure Databricks workspace, it's important to ensure your virtual network has enough IP address capacity to support cluster scaling.

Why this matters: If you choose a small VNet address space and your analytics workloads grow, you might hit a ceiling where you simply cannot launch more clusters or scale out because there are no free IPs in the subnet. The subnet sizes (and by extension, the VNet CIDR) determine how many nodes you can run.
Databricks recommends using a CIDR block between /16 and /24 for the VNet, and up to /26 for the two required subnets: the container subnet and the host subnet. Here’s a reference Microsoft provides. If your current workspace’s VNet lacks sufficient IP space for active cluster nodes, you can request a CIDR range update through your Azure Databricks account team, as noted in the Microsoft documentation.

2.1 Considerations for CIDR Range

• Workload Type & Concurrency: Consider what kinds of workloads will run (ETL pipelines, machine learning notebooks, BI dashboards, etc.) and how many jobs or clusters may need to run in parallel. High concurrency (e.g. multiple ETL jobs or many interactive clusters) means more nodes running at the same time, requiring a larger pool of IP addresses.
• Data Volume (Historical vs. Incremental): Are you doing a one-time historical data load or only processing new incremental data? A large backfill of terabytes of data may require spinning up a very large cluster (hundreds of nodes) to process in a reasonable time. Ongoing smaller loads might get by with fewer nodes. Estimate how much data needs processing.
• Transformation Complexity: The complexity of data transformations or machine learning workloads matters. Heavy transformations (joins, aggregations on big data) or complex model training can benefit from more workers. If your use cases include these, you may need larger clusters (more nodes) to meet performance SLAs, which in turn demands more IP addresses available in the subnet.
• Data Sources and Integration: Consider how your Databricks environment will connect to data. If you have multiple data sources or sinks (e.g. ingest from many event hubs, databases, or IoT streams), you might design multiple dedicated clusters or workflows, potentially all active at once. Also, if using separate job clusters per job (Databricks Jobs), multiple clusters might launch concurrently. All these scenarios increase concurrent node count.
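To make the sizing discussion above concrete, here is a minimal sketch in Python that estimates how many cluster nodes a given subnet size can hold. It rests on two assumptions that are not stated in this article: Azure reserves five IP addresses in every subnet, and each cluster node consumes one address in the host subnet and one in the container subnet, so the smaller of the two subnets sets the ceiling. The function name max_cluster_nodes is hypothetical.

```python
import ipaddress

# Assumption: Azure reserves 5 addresses in every subnet.
AZURE_RESERVED_IPS = 5


def max_cluster_nodes(subnet_cidr: str) -> int:
    """Rough upper bound on concurrently running nodes for one subnet.

    Assumes each node takes one IP in this subnet (and one more in the
    paired subnet), so the node count cannot exceed the usable addresses.
    """
    subnet = ipaddress.ip_network(subnet_cidr)
    return max(subnet.num_addresses - AZURE_RESERVED_IPS, 0)


# Compare a few candidate subnet sizes.
for cidr in ("10.28.0.0/26", "10.28.0.0/25", "10.28.0.0/24", "10.28.0.0/22"):
    print(cidr, max_cluster_nodes(cidr))
# 10.28.0.0/26 59
# 10.28.0.0/25 123
# 10.28.0.0/24 251
# 10.28.0.0/22 1019
```

Comparing the result against your expected peak concurrent node count (all clusters added together) gives a quick sanity check before the VNet is created.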
2.2 Configuring a Dedicated Network (VNet) per Workspace with Egress Control

By default, Azure Databricks deploys its classic compute resources into a Microsoft-managed virtual network (VNet) within your Azure subscription. While this simplifies setup, it limits control over network configuration. For enhanced security and flexibility, it's recommended to use VNet Injection, which allows you to deploy the compute plane into your own customer-managed VNet. This approach enables secure integration with other Azure services using service endpoints or private endpoints, supports user-defined routes for accessing on-premises data sources, allows traffic inspection via network virtual appliances or firewalls, and provides the ability to configure custom DNS and enforce egress restrictions through network security group (NSG) rules.

Within this VNet (which must reside in the same region and subscription as the Azure Databricks workspace), two subnets are required for Azure Databricks: a container subnet (referred to as private subnet) and a host subnet (referred to as public subnet). To implement front-end Private Link, back-end Private Link, or both, your workspace VNet needs a third subnet that will contain the private endpoint (PrivateLink subnet). It is recommended to also deploy an Azure Firewall for egress control.

Step 3: Plan Network Architecture for Securing Azure-Databricks

3.1 Secure Cluster Connectivity

Secure Cluster Connectivity, also known as No Public IP (NPIP), is a foundational security feature for Azure Databricks deployments. When enabled, it ensures that compute resources within the customer-managed virtual network (VNet) do not have public IP addresses, and no inbound ports are exposed. Instead, each cluster initiates a secure outbound connection to the Databricks control plane using port 443 (HTTPS), through a dedicated relay.
This tunnel is used exclusively for administrative tasks, separate from the web application and REST API traffic, significantly reducing the attack surface. For the most secure deployment, Microsoft and Databricks strongly recommend enabling Secure Cluster Connectivity, especially in environments with strict compliance or regulatory requirements. When Secure Cluster Connectivity is enabled, both workspace subnets become private, as cluster nodes don’t have public IP addresses.

3.2 Egress with VNet Injection (NVA)

For Databricks traffic, you’ll need to assign a UDR to the Databricks-managed VNet with a next hop type of Network Virtual Appliance (NVA); this could be an Azure Firewall, NAT Gateway, or another routing device. For control plane traffic, Databricks recommends using Azure service tags, which are logical groupings of IP addresses for Azure services and should be routed with the next hop type of internet. This is important because Azure IP ranges can change frequently as new resources are provisioned, and manually maintaining IP lists is not practical. Using service tags ensures that your routing rules automatically stay up to date.

3.3 Front-End Connectivity with Azure Private Link (Standard Deployment)

To further enhance security, Azure Databricks supports Private Link for front-end connections. In a standard deployment, Private Link enables users to access the Databricks web application, REST API, and JDBC/ODBC endpoints over a private VNet interface, bypassing the public internet. For organizations with no public internet access from user networks, a browser authentication private endpoint is required. This endpoint supports SSO login callbacks from Microsoft Entra ID and is shared across all workspaces in a region using the same private DNS zone. It is typically hosted in a transit VNet that bridges on-premises networks and Azure.

Note: There are two deployment types: standard and simplified.
To compare these deployment types, see Choose standard or simplified deployment.

3.4 Serverless Compute Networking

Azure Databricks offers serverless compute options that simplify infrastructure management and accelerate workload execution. These resources run in a Databricks-managed serverless compute plane, isolated from the public internet and connected to the control plane via the Microsoft backbone network. To secure outbound traffic from serverless workloads, administrators can configure Serverless Egress Control using network policies that restrict connections by location, FQDN, or Azure resource type. Additionally, Network Connectivity Configurations (NCCs) allow centralized management of private endpoints and firewall rules. NCCs can be attached to multiple workspaces and are essential for enabling secure access to Azure services like Data Lake Storage from serverless SQL warehouses.

DEPLOYMENT: Step-by-Step Implementation using Azure Portal

Step 1: Create an Azure Resource Group

For each new workspace, create a dedicated Resource Group (to contain the Databricks workspace resource and associated resources). Ensure that all resources (e.g. workspace, subnets...) are deployed in the same Region and Resource Group to optimize data movement performance and enhance security.

Step 2: Deploy Workspace Specific Virtual Network (VNET)

From your Resource Group, create a Virtual Network. Under the Security section, enable Azure Firewall. Deploying an Azure Firewall is recommended for egress control, ensuring that outbound traffic from your Databricks environment is securely managed. Define address spaces for your Virtual Network (review Step 2 from Design). As documented, you could create a VNet with these values:

• IP range: First remove the default IP range, and then add IP range 10.28.0.0/23.
• Create subnet public-subnet with range 10.28.0.0/25.
• Create subnet private-subnet with range 10.28.0.128/25.
• Create subnet private-link with range 10.28.1.0/27.
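Before creating the VNet, it can help to confirm that the planned subnets fit inside the address space and do not overlap. The following is a minimal sketch using only Python's standard ipaddress module; the function name validate_subnet_plan is hypothetical, and the values are the example ranges listed above.

```python
import ipaddress
from itertools import combinations


def validate_subnet_plan(vnet_cidr, subnets):
    """Return a list of problems; an empty list means the plan is consistent."""
    vnet = ipaddress.ip_network(vnet_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    # Every subnet must sit inside the VNet address space.
    for name, net in nets.items():
        if not net.subnet_of(vnet):
            problems.append(f"{name} ({net}) is outside {vnet}")
    # No two subnets may share addresses.
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")
    return problems


plan = {
    "public-subnet": "10.28.0.0/25",
    "private-subnet": "10.28.0.128/25",
    "private-link": "10.28.1.0/27",
}
print(validate_subnet_plan("10.28.0.0/23", plan))  # prints []
```

Substitute the ranges assigned by your own IPAM; any non-empty result points at a subnet that needs to be re-planned.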
Please note: your IP values can be different depending on your IPAM and available scopes. Review + Create your Virtual Network.

Step 3: Deploy Azure-Databricks Workspace

Now that networking is in place, create the Databricks workspace. Below are detailed steps your organization should review while creating the workspace:

• In Azure Portal, search for Azure Databricks and click Create.
• Choose the Subscription, Resource Group (RG), and Region, select Premium, enter a “Managed Resource Group name” and click Next.
• Managed Resource Group: created after your Databricks workspace is deployed; it contains infrastructure resources for the workspace, e.g. VNets, DBFS.
• Required: Enable “Secure Cluster Connectivity” (No Public IP for clusters) to ensure that Databricks clusters are deployed without public IP addresses (review Section 3.1).
• Required: Enable the option to deploy into your Virtual Network (VNet Injection), also known as “Bring Your Own VNet” (review Section 3.2).
• Select the Virtual Network created in Step 2.
• Enter the Private and Public Subnet names.
• Enable or disable “Deploying Nat Gateway”, according to your workspace requirement.
• Disable “Allow Public Network Access”.
• Select “No Azure Databricks Rules” for Required NSG Rules.
• Select “Click on add to create a private endpoint”; this will open a panel for private endpoint setup.
• Click “Add” to enter your Private Link details created in Step 2. Also, ensure that Private DNS zone integration is set to “Yes” and that a new Private DNS Zone is created, indicated by (New) privatelink.azuredatabricks.net, unless an existing DNS zone for this purpose already exists.
• (Optional) Under the Encryption tab, enable Infrastructure Encryption if you have a requirement for FIPS 140-2. It comes at a cost, because it takes time to encrypt and decrypt. By default your data is already encrypted.
• (Optional) Compliance security profile: enable if you have a standard regulatory requirement (e.g. HIPAA).
• (Optional) Automatic cluster updates, first Sunday of every month.
• Review + Create the workspace and wait for it to deploy.

Step 4: Create a private endpoint to support SSO for web browser access

Note: This step is required when front-end Private Link is enabled and client networks cannot access the public internet.

After creating your Azure Databricks workspace, if you try to launch it without the proper Private Link configuration, you will see an error like the image below. This happens because the workspace is configured to block public network access, and the necessary Private Endpoints (including the browser_authentication endpoint for SSO) are not yet in place.

Create Web-Auth Workspace

Note: Deploy a “dummy” WEB_AUTH_DO_NOT_DELETE_<region> workspace in the same region as your production workspace.

• Purpose: Host the browser_authentication private endpoint (one required per region).
• Lock the workspace (Delete lock) to prevent accidental removal.
• Follow Step 2 to create the Virtual Network (VNet).
• Follow Step 3 and create a VNet-injected “dummy” workspace.

Create Browser Authentication Private Endpoint

• In Azure Portal, go to the Databricks workspace (dummy), Networking, Private endpoint connections, + Private endpoint.
• Resource step: Target sub-resource: browser_authentication
• Virtual Network step: VNet: Transit/Hub VNet (central network for Private Link). Subnet: Private Endpoint subnet in that VNet (not the Databricks host subnets).
• DNS step: Integrate with Private DNS zone: Yes. Zone: privatelink.azuredatabricks.net. Ensure the DNS zone is linked to the Transit VNet.
• After creation: A-records for *.pl-auth.azuredatabricks.net are auto-created in the DNS zone.

Workspace Connectivity Testing

If you have VPN or ExpressRoute, Bastion is not required. However, for the purposes of this article we will be testing our workspace connectivity through Bastion. If you don’t have private connectivity and need to test from inside the VNet, Azure Bastion is a convenient option.
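When testing from a machine inside the VNet (for example, a VM reached through Bastion), one quick check is whether the workspace hostname resolves to a private IP address, which suggests the private endpoint and Private DNS zone are being used. The sketch below uses only the Python standard library; the function names are hypothetical, and the hostname in the comment is a placeholder rather than a real workspace URL.

```python
import ipaddress
import socket


def all_private(addresses):
    """True if the collection is non-empty and every address is private."""
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )


def resolve_ipv4(hostname):
    """Resolve a hostname to its set of IPv4 addresses via the local resolver."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return {info[4][0] for info in infos}


def resolves_privately(hostname):
    """Check that a hostname resolves only to private addresses."""
    return all_private(resolve_ipv4(hostname))


# Example call, to be run from inside the VNet (placeholder hostname):
# resolves_privately("adb-<workspace-id>.<n>.azuredatabricks.net")
```

A result of False from inside the VNet usually means the name is still resolving to a public address, so the Private DNS zone link is worth checking first.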
Step 5: Create Storage Account

From your Resource Group, click Create and select Storage account. On the configuration page:

• Select Preferred Storage type as: Azure Blob Storage or Azure Data Lake Storage Gen 2.
• Choose Performance and Redundancy options based on your business requirements.
• Click Next to proceed.
• Under the Advanced tab: Enable Hierarchical namespace under Data Lake Storage Gen2. This is critical for directory and file-level operations and Access Control Lists (ACLs).
• Under the Networking tab: Set Public Network Access to Disabled.
• Complete the creation process and then create container(s) inside the storage account.

Step 6: Create Private Endpoints for Workspace Storage Account

Pre-requisite: You need to create two private endpoints from the VNet used for VNet injection to your workspace storage account, for the following Target sub-resources: dfs and blob.

• Navigate to your Storage Account.
• Go to Networking, Private Endpoints tab and click + Create Private Endpoint.
• In the Create Private Endpoint wizard, Resource tab: Select your Storage Account. Set Target sub-resource to dfs for the first endpoint.
• Virtual Network tab: Choose the VNet you used for VNet injection. Select the appropriate subnet.
• Complete the creation process. The private endpoint will be auto-approved and visible under Private Endpoints.
• Repeat the process for the second private endpoint: this time set Target sub-resource to blob.

Step 7: Link Storage and Databricks Workspace: Create Access Connector

In your Resource Group, create an Access Connector for Azure Databricks. No additional configuration is required during creation.

Assign Role to Access Connector

• Navigate to your Storage Account, Access Control (IAM), Add role assignment.
• Select Role: Storage Blob Data Contributor.
• Assign access to: Managed Identity.
• Under Members: Click Select members. Find and select your newly created Access Connector for Azure Databricks.
• Save the role assignment.
Copy Resource ID

• Go to the Access Connector Overview page.
• Copy the Resource ID for later use in Databricks configuration.

Step 8: Link Storage and Databricks Workspace: Navigate to Unity Catalog

In your Databricks Workspace, go to Unity Catalog, External Data and select the “Create external Location” button.

Configure External Location

• Select ADLS as the storage type.
• Enter the ADLS storage URL in the following format: abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/
• Update these two parameters: <container_name> and <storage_account_name>

Provide Access Connector

• Select “Create new storage credential” from the Storage credential field.
• Paste the Resource ID of the Access Connector for Azure Databricks (from Step 7) into the Access Connector ID field.

Validate Connection

Click Submit. You should see a “Successful” message confirming the connection. You can now create Catalogs and link your secure storage.

Step 9: Configuring Serverless Compute Networking

If your organization plans to use Serverless SQL Warehouses or Serverless Jobs Compute, you must configure Serverless Networking.

Add Network Connectivity Configuration (NCC)

• Go to the Databricks Account Console: https://accounts.azuredatabricks.net/
• Navigate to Cloud resources, click Add Network Connectivity Configuration.
• Fill in the required fields and create a new NCC.

Associate NCC with Workspace

• In the Account Console, go to Workspaces.
• Select your workspace, click Update Workspace.
• From the Network Connectivity Configuration dropdown, select the NCC you just created.

Add Private Endpoint Rule

• In Cloud resources, select your NCC, select Private Endpoint Rules and click Add Private Endpoint Rule.
• Provide Resource ID: Enter your Storage Account Resource ID. Note: this can be found from your storage account by clicking “JSON View” at the top right.
• Azure Subresource type: dfs & blob.
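The two storage identifiers entered above (the abfss URL for the external location in Step 8 and the Storage Account Resource ID for the private endpoint rule) are easy to mistype. Here is a minimal sketch of helpers that assemble them; the function names and sample values are hypothetical, and the name check reflects the general Azure storage account naming rule (3 to 24 lowercase letters and digits), which this article does not state.

```python
import re

# Assumption: storage account names are 3-24 lowercase letters and digits.
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def abfss_url(container_name, storage_account_name, path=""):
    """Build the ADLS Gen2 URL in the format used for the external location."""
    if not _ACCOUNT_NAME.fullmatch(storage_account_name):
        raise ValueError(f"invalid storage account name: {storage_account_name!r}")
    relative = path.lstrip("/")
    return (
        f"abfss://{container_name}@{storage_account_name}"
        f".dfs.core.windows.net/{relative}"
    )


def storage_account_resource_id(subscription_id, resource_group, storage_account_name):
    """Build the Azure Resource ID shown in the storage account's JSON View."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account_name}"
    )


# Hypothetical sample values for illustration only.
print(abfss_url("raw", "mystorageacct"))
# abfss://raw@mystorageacct.dfs.core.windows.net/
```

Copying the Resource ID from the portal remains the safest option; the second helper is mainly useful for checking that a pasted value has the expected shape.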
Approve Pending Connection

• Go to your Storage Account, Networking, Private Endpoints.
• You will see a Pending connection from Databricks.
• Approve the connection and you will see the Connection status in your Account Console as ESTABLISHED.

Step 10: Test Your Workspace

Launch a small test cluster and verify the following:

• It can start (which means it can talk to the control plane).
• It can read from and write to the storage. To confirm read/write access, see: Set Spark properties to configure Azure credentials to access Azure storage.
• Check that the Private DNS Record has been created.
• (Optional) If on-prem data is needed: try connecting to an on-prem database (using the ExpressRoute path): Connect your Azure Databricks workspace to your on-premises network - Azure Databricks | Microsoft Learn.

Step 11: Account Console, Planning Workspace Access Controls and Getting Started

Once your Azure Databricks workspace is deployed, it's essential to configure access controls and begin onboarding users with the right permissions. From your account console (https://accounts.azuredatabricks.net/) you can centrally manage your environment: add users and groups, enable preview features, and view or configure all your workspaces. Azure Databricks supports fine-grained access management through Unity Catalog, cluster policies, and workspace-level roles. Start by defining who needs access to what (notebooks, tables, jobs, or clusters) and apply least-privilege principles to minimize risk.

DBFS Limitation: DBFS is automatically created upon Databricks Workspace creation. DBFS can be found in your Managed Resource Group. Databricks cannot secure DBFS (see reference image below). If there is a business need to avoid DBFS, you can disable DBFS access by following the instructions here: Disable access to DBFS root and mounts in your existing Azure Databricks workspace.
Use Unity Catalog to manage data access across catalogs, schemas, and tables, and consider implementing cluster policies to standardize compute configurations across teams. To help your teams get started, Microsoft provides a range of tutorials and best practice guides: Best practice articles - Azure Databricks | Microsoft Learn.

Step 12: Planning Data Migration

As you prepare to move data into your Azure Databricks environment, it's important to assess your migration strategy early. This includes identifying source systems, estimating data volumes, and determining the appropriate ingestion methods: batch, streaming, or hybrid. For organizations with complex migration needs or legacy systems, Microsoft offers specialized support through its internal Azure Cloud Accelerated Factory program. Reach out to your Microsoft account team to explore nomination for Azure Cloud Accelerated Factory, which provides hands-on guidance, tooling, and best practices to accelerate and streamline your data migration journey.

Summary

Regular maintenance and governance are as important as the initial design. Continuously review the environment and update configurations as needed to address evolving requirements and threats. For example, tag all resources (workspaces, VNets, clusters, etc.) with clear identifiers (workspace name, environment, department) to track costs and ownership effectively. Additionally, enforce least privilege across the platform: ensure that only necessary users are given admin privileges, and use cluster-level access control to restrict who can create or start clusters. By following the above steps, an organization will have an Azure Databricks architecture that is securely isolated, well-governed, and scalable.

References:
[1] 5 Best Practices for Databricks Workspaces
AzureDatabricksBestPractices/toc.md at master · Azure ... - GitHub
Deploy a workspace using the Azure Portal

Additional Links:
Quick Introduction to Databricks: what is databricks | introduction - databricks for dummies
Connect Purview with Azure Databricks: Integrating Microsoft Purview with Azure Databricks
Secure Databricks Delta Share between Workspaces: Secure Databricks Delta Share for Serverless Compute
Azure-Databricks Cost Optimization Guide: Databricks Cost Optimization: A Practical Guide
Integrate Azure Databricks with Microsoft Fabric: Integrating Azure Databricks with Microsoft Fabric
Databricks Solution Accelerators for Data & AI
Azure updates

Appendix

3.5 Understanding Data Transfer (ExpressRoute vs. Public Internet)

For data transfers, your organization must decide whether to use ExpressRoute or Internet Egress. There are several considerations that can help you determine your choice:

3.5.1 Connectivity Model

• ExpressRoute: Provides a private, dedicated connection between your on-premises infrastructure and Microsoft Azure. It bypasses the public internet entirely and connects through a network service provider.
• Internet Egress: Refers to outbound data traffic from Azure to the public internet. This is the default path for most Azure services unless configured otherwise.

3.6 Planning for User-Defined Routes (UDRs)

When working with Databricks deployments, especially in VNet-injected workspaces, setting up User Defined Routes (UDRs) is a best practice that helps manage and secure network traffic more effectively. By using UDRs, teams can steer traffic between Databricks components and external services in a controlled way, which not only boosts security but also supports compliance efforts.

3.6.1 UDRs and Hub and Spoke Topology

If your Databricks workspace is deployed into your own virtual network (VNet), you’ll need to configure standard user-defined routes (UDRs) to manage traffic flow.
In a typical hub-and-spoke architecture, UDRs are used to route all traffic from the spoke VNets to the hub VNet.

3.6.2 Hub and Spoke with VWAN Hub

If your Databricks workspace is deployed into your own virtual network (VNet) and is peered to a Virtual WAN (VWAN) hub as the primary connectivity hub into Azure, a user-defined route (UDR) is not required, provided that a private traffic routing policy or internet traffic routing policy is configured in the VWAN hub.

3.6.3 Use of NVAs and Service Tags

For Databricks traffic, you’ll need to assign a UDR to the Databricks-managed VNet with a next hop type of Network Virtual Appliance (NVA); this could be an Azure Firewall, NAT Gateway, or another routing device. For control plane traffic, Databricks recommends using Azure service tags, which are logical groupings of IP addresses for Azure services and should be routed with the next hop type of internet. This is important because Azure IP ranges can change frequently as new resources are provisioned, and manually maintaining IP lists is not practical. Using service tags ensures that your routing rules automatically stay up to date.

3.6.4 Default Outbound Access Retirement (Non-Serverless Compute)

Microsoft is retiring default outbound internet access for new deployments starting September 30, 2025. Going forward, outbound connectivity will require an explicit configuration using an NVA, NAT Gateway, Load Balancer, or Public IP address. Also, note that using a Public IP address in the deployment is discouraged for security purposes, and it is recommended to deploy the workspace in a Secure Cluster Connectivity configuration.

Integrate Jenkins with Azure Databricks & GitHub into VSCode
Hello Team,

Greetings of the day! Hope you have a great day ahead.

We have installed the Azure Databricks, GitHub and Jenkins extensions in VSCode. Now the configuration part comes into the picture: we have configured Azure Databricks and logged in to GitHub in VSCode. Now it is Jenkins' turn. We want to know how we can configure Jenkins with GitHub. All notebooks from Azure Databricks will be version controlled in GitHub, and we want to use Jenkins for that. There is no documentation on how to do so. Can you guide us on how to do it?

Reference Link: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/ci-cd/ci-cd-jenkins

Thank you in advance for any support or suggestion :) Looking forward to your valuable input.

Regards,
Niral Dave

Designing a Medallion Framework: A Decision Guide
Everyone draws the same picture: Bronze → Silver → Gold. Three boxes, three arrows. Done.

What that picture hides is the dozen design decisions you have to make inside each box, and the ones you make at the boundaries between them. Get those right and onboarding the 200th table feels like onboarding the 2nd. Get them wrong and you’ll be rewriting the framework in 18 months.

This post is a generic walkthrough of how to think about a medallion framework on Databricks (or any other platform): what each layer should own, where the responsibilities blur, and a few opinionated patterns I’ve found worth defending.

The classic template: Bronze → Silver → Gold

Three layers, broadly. This template is intentionally vague, and that’s the point. The same three labels can describe a framework for a 10-table marketing pipeline and a 2,000-table enterprise lakehouse. The differences are in how you tweak the template to match your project. This post walks through the questions that drive those tweaks. There isn’t a single right answer for any of them, only the answer that fits your project’s requirements.

How to read this guide

For each architectural choice, I’ll frame it as:

• The question: the requirement you need to clarify
• The options: the realistic ways to answer it
• When each option fits: what kind of project picks which option

Use this to make your tradeoffs explicit. Document the answers in your design doc. They’ll inform a hundred downstream decisions.

Question 1: Do you need a Staging layer?

A Staging (stg_*) layer is a transient zone that holds just the current run’s data before it lands in Bronze.

Options:

• No staging. Source → Bronze directly.
• Staging as a transient table per object, overwritten every run.
• Staging as a checkpointed zone (e.g., Auto Loader checkpoints + raw files in a landing path).

When to pick which: The decision usually comes down to failure isolation and incremental capture clarity.
If both are non-issues, you can skip it.

Question 2: How “raw” should Bronze be?

This is the single biggest tweak point in the medallion architecture. The textbook says “Bronze = raw bytes.” Real projects often deviate.

Options:

A. Strictly raw. Source schema preserved exactly. All columns as STRING. No casting, no trimming.
B. Lightly cleaned. Strong typing, whitespace trimmed, null normalization (“”, “N/A” → NULL), audit columns added. Schema stable.
C. Cleansed + minor enrichment. Above plus reference data lookups, basic standardization (e.g., country codes), key normalization.

When to pick which: A useful rule of thumb: the more sources and consumers you have, the cleaner Bronze should be. The cost of not cleaning compounds with every notebook downstream. If you choose B or C, you’ve shifted some traditional Silver responsibilities into Bronze. That’s fine; just be explicit about it so Silver’s contract changes accordingly.

Question 3: What does Silver actually own?

Silver is the most overloaded layer in any medallion framework. Decide upfront which of these responsibilities Silver owns vs. defers to other layers.

How to decide what Silver owns:

• If Silver is the only layer business users query, give it more, including light history and aggregations. (Common in smaller projects.)
• If you have a strong Gold layer with multiple marts, keep Silver narrow: business entities only, current state.
• If you have multiple consuming teams with different needs, push everything consumer-specific to Gold and keep Silver as the shared canonical model.

The clearest signal that Silver is overloaded: you have one Silver table per source table. Silver should be organized by business entity, not by source. If they line up 1:1, you’ve effectively built “Bronze with cleaning” and skipped Silver’s real value.

Question 4: Is Gold one zone or several?

The default picture shows Gold as one box. In real projects it often splits.

Options:

• Single Gold zone.
  Marts and history live together.
- Gold-Reporting + Gold-History. Reporting marts (denormalized, aggregated, fast) separated from historized snapshots (SCD2, point-in-time, append-mostly).
- Gold per consumer. Separate zones per business unit, dashboard family, or external API.

The cost of splitting Gold is some duplication and more pipelines. The benefit is independent SLAs: your dashboard refresh isn’t held hostage by your audit history rebuild.

Question 5 - Load patterns: FullLoad vs DeltaLoad vs CDC

Per source table, decide the load pattern. This decision drives staging design, watermark management, and merge logic.

It’s normal to mix patterns inside the same framework. The metadata-driven approach below makes this trivial: load pattern is just a column in your config table.

Question 6 - How metadata-driven should the framework be?

Options:
- Code-per-table. One notebook per ingestion. Simple, easy to reason about, scales poorly.
- Hybrid. Generic ingestion notebooks for common patterns, custom notebooks for exceptions.
- Fully metadata-driven. Generic notebooks for every layer, behavior driven entirely by metadata tables.

When to pick which:

A fully metadata-driven framework has higher upfront cost but flattens the per-table cost dramatically. The break-even point is usually around 30–50 tables.

Question 7 - Orchestration shape

How do you fan out work across tables?

Options:
- Sequential. One table at a time. Simple, slow.
- Parallel pool. ThreadPoolExecutor or Databricks Workflows fan-out. Tables run concurrently, no inter-table dependencies.
- DAG. Dependency-aware execution. Required when tables depend on each other.

Per-layer guidance:

The decision driver is whether tables in that layer depend on each other. If they don’t, don’t pay the DAG complexity tax.

Question 8 - Failure handling and retries

Options to decide on:
- Retry scope. Per statement, per child notebook, per master run, none.
- Retry counts. Per layer? Per table? Per environment?
- Backoff.
  Fixed, linear, exponential.
- Failure semantics. Fail-fast (stop on first failure) or best-effort (continue and report at the end).

When to pick which:

A good default for most projects: process-level retry (master retries the failed child), exponential backoff, per-layer max retry count, fail-fast within a child.

Question 9 - Observability: how much do you log?

Decide what every run captures:
- Execution status, start/end timestamps, duration
- Row counts per activity (source read, staging write, target write)
- MERGE metrics (inserted, updated, deleted)
- Watermark used and watermark captured
- Retry attempts
- Error message (truncated)

Options for storage:
- Logs in source-side metadata DB (e.g., Azure SQL). Easy to query with SQL, integrates with monitoring tools.
- Logs in a Delta table in the lakehouse. Native to Databricks, queryable with Spark.
- Logs in both. Source-side for ops dashboards, Delta for analytics on the pipeline itself.

When to pick which:

Whatever you pick, make count validation a first-class output. The moment counts mismatch, you want to know, not three reports later.

Question 10 - Schema evolution policy

The cheapest decision to defer and the most painful one to retrofit. Decide which changes are allowed automatically:

Where to enforce:
- At Bronze ingestion: fail loudly if the source schema changes in a disallowed way
- At Silver: handle by transformation; new Bronze columns don’t auto-flow to Silver
- At Gold: strict contracts; consumers depend on the shape

How the contract changes from layer to layer reflects the audience. Bronze is forgiving (data engineers see issues); Gold is strict (consumers can’t tolerate surprises).

Question 11 - Idempotency and replay

Can you re-run yesterday’s load and get the same result?

Options:
- Idempotent by run_id. Re-running the same run_id is a no-op or produces identical output.
- Idempotent by data. Re-running with the same source data produces identical output (regardless of run_id).
- Non-idempotent.
  Replays may produce different results (e.g., timestamps based on current_timestamp()).

Recommendation: aim for data-idempotency in every layer. Concretely:
- Staging: overwrite-per-run → idempotent by construction.
- Bronze: keyed MERGE → idempotent.
- Silver: pure transformation of Bronze inputs → idempotent.
- Gold: pure transformation of Silver inputs → idempotent.

If you can’t replay a layer cleanly, that’s a design bug worth fixing early.

Question 12 - Environment topology

How many environments? How do they differ?

Common patterns:
- Dev / Test / Stage / Prod, separate workspaces and data.
- Per-developer dev, shared Test/Stage, isolated Prod.

What changes between environments (drive these from config):
- Source connection strings
- Target storage paths / catalog names
- Retry counts (often higher in prod)
- Parallelism (often lower in dev to save cost)
- Logging verbosity
- Data masking rules

Keep code identical across environments. Differences live in environment-scoped config (dev.yml, test.yml, stage.yml, prod.yml) loaded at runtime.
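
One way to keep code identical while only configuration varies is to resolve a typed, read-only settings object once at startup. The sketch below is a minimal illustration, not a prescribed layout: the field names (`catalog`, `max_retries`, `parallelism`, `log_level`, `mask_pii`), their values, and the `load_env_config` helper are all hypothetical, and the inline dictionaries stand in for the parsed contents of dev.yml and prod.yml so the example runs without a YAML parser.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EnvConfig:
    """Settings that may differ per environment (all names are hypothetical)."""
    catalog: str       # target catalog name
    max_retries: int   # often higher in prod
    parallelism: int   # often lower in dev to save cost
    log_level: str     # logging verbosity
    mask_pii: bool     # data masking toggle


# Stand-ins for the parsed contents of dev.yml / prod.yml (assumed values).
RAW_CONFIGS = {
    "dev": {"catalog": "lakehouse_dev", "max_retries": 1, "parallelism": 2,
            "log_level": "DEBUG", "mask_pii": True},
    "prod": {"catalog": "lakehouse_prod", "max_retries": 5, "parallelism": 16,
             "log_level": "INFO", "mask_pii": False},
}


def load_env_config(env: str, raw_configs: dict = RAW_CONFIGS) -> EnvConfig:
    """Return the validated config for one environment, failing loudly on drift."""
    if env not in raw_configs:
        raise ValueError(f"unknown environment: {env!r}")
    raw = raw_configs[env]
    expected = {f.name for f in fields(EnvConfig)}
    if set(raw) != expected:
        # Missing or extra keys mean the environment files have diverged.
        raise ValueError(f"{env}: config keys differ by {sorted(set(raw) ^ expected)}")
    return EnvConfig(**raw)


cfg = load_env_config("prod")
print(cfg.catalog, cfg.max_retries, cfg.parallelism)
```

Checking that every environment file carries exactly the same keys is a cheap way to catch a setting that was added to dev but never promoted, before a pipeline discovers it at runtime.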
Putting it together: three example shapes

The same framework, three different projects, three different shapes:

Shape A - Small marketing analytics project
- 15 tables, single source, weekly batch
- No staging: source is reliable, volumes small
- Bronze: lightly cleaned, because analysts query it directly
- Silver: full ownership including light history and aggregations (no separate Gold needed)
- Gold: optional, only for the executive dashboard
- Code-per-table, sequential orchestration, fail-fast, minimal logging

Shape B - Mid-size enterprise data platform
- 80 tables, 5 source systems, daily batch with some hourly
- Staging as transient table for Delta Loads
- Bronze: lightly cleaned + audit columns
- Silver: business entities (Customer, Policy, Claim), DAG orchestration
- Gold: split into Reporting + History zones
- Hybrid metadata-driven (generic ingestion, custom transforms), per-layer retry, structured count logs

Shape C - Large multi-tenant Lakehouse
- 500+ tables, 20+ source systems, mixed batch/streaming
- Staging zone with file-level checkpoints (Auto Loader)
- Bronze: strictly raw + a parallel Bronze-Curated layer for cleansed views
- Silver: shared canonical model, narrow scope
- Gold: per-consumer zones with independent SLAs
- Fully metadata-driven, DAG everywhere, multi-store logging, strict schema contracts

Notice none of these are “wrong.” They’re calibrated to the project.

A short checklist for your own framework

Before writing code, write down your answers to:
- Do we need a Staging layer? Why?
- How clean is Bronze? What’s allowed and what’s not?
- What does Silver own? Where does it stop?
- Is Gold one zone or multiple? How are they divided?
- Which load patterns do we support? Per table or universal?
- How metadata-driven? Where do exceptions live?
- What’s the orchestration shape per layer?
- What’s our retry and failure policy per layer?
- What does every run log? Where?
- What’s our schema evolution policy per layer?
- Are all layers data-idempotent?
- What changes per environment, and what stays the same?

If you have an answer for each, you have a framework design. If you skip any, you have a framework that will surprise you in production.

Closing thought

The medallion architecture isn’t a prescription; it’s a vocabulary. Bronze, Silver, Gold give you words to describe responsibilities. The actual responsibilities are yours to assign, based on what your project actually needs.

Tweak deliberately. Document your tweaks. And revisit them when the project’s requirements change, because they will.

Azure Managed Redis & Azure Databricks: Real-time Feature Serving for Low-Latency Decisions
This blog post is a collaboration between the Azure Databricks and Azure Managed Redis Product and Product Marketing teams.

Executive summary

Modern decisioning systems (fraud scoring, payments authorization, personalization, and step-up authentication) must return answers in tens of milliseconds while still reflecting the most recent behavior. That creates a classic tension: lakehouse platforms excel at large-scale ingestion, feature engineering, governance, training, and replayable history, but they are not designed to sit directly on the synchronous request path for high-QPS, ultra-low-latency lookups. This guide shows a pattern that keeps Azure Databricks as the primary system for building and maintaining features, while using Azure Managed Redis as the online speed layer that serves those features at memory speed for real-time scoring.

The result is a shorter and more predictable critical path for your application: the Payment API (or any online service) reads features from Azure Managed Redis and calls a model endpoint; Azure Databricks continuously refreshes features from streaming and batch sources; and your authoritative systems of record (for example, account/card data) remain durable and governed. You get real-time responsiveness without giving up data correctness, lineage, or operational discipline.

What each service does

Azure Databricks is a first-party analytics and AI platform on Azure built on Apache Spark and the lakehouse architecture. It is commonly used for batch and streaming pipelines, feature engineering, model training, governance, and operationalization of ML workflows. In this architecture, Azure Databricks is the primary data and AI platform: the environment where features are defined, computed, validated, and published, and where governed history is retained.

Azure Managed Redis is a Microsoft-managed, in-memory data store based on Redis Enterprise, designed for low-latency, high-throughput access patterns.
It is commonly used for traditional and real-time caching, counters, and session state, and increasingly as a fast state layer for AI-driven applications. In this architecture, Azure Managed Redis serves as the online feature store and speed layer: it holds the most recent feature values and signals required for real-time scoring and can also support modern agentic patterns such as short- and long-term memory, vector lookups, and fast state access alongside model inference.

Business story: real-time fraud scoring as a running example

Consider a payment system that must decide whether to approve, decline, or step up authentication in tens of milliseconds, faster than the blink of an eye. The decision depends on recent behavioral signals, velocity counters, device changes, geo anomalies, and merchant patterns, combined with a fraud model. If the online service tries to compute or retrieve those features from heavy analytics systems on demand, the request path becomes slower and more variable, especially at peak load. Instead, Azure Databricks pipelines continuously compute and refresh those features, and Azure Managed Redis serves them instantly to the scoring service. Behavioral history, profiles, and outcomes are still written to durable Azure datastores such as Delta tables and Azure Cosmos DB, so fraud models can be retrained with governed, reproducible data.

The pattern: online feature serving with a speed layer

The core idea is to separate responsibilities. Azure Databricks owns “building” features: ingest, join, aggregate, compute windows, and publish validated, governed results. Azure Managed Redis owns “serving” features: fast, repeated key-based access on the hot path. The model endpoint then consumes a feature payload that is already pre-shaped for inference. This division prevents the lakehouse from becoming an online dependency and lets you scale online decisioning independently from offline compute.
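
To make the split concrete, here is a minimal runnable sketch of the two roles. It is an illustration under stated assumptions rather than the implementation behind this pattern: `InMemoryStore` is a hypothetical stand-in for Azure Managed Redis (a plain dictionary exposing only `set` and `mget`), and the key layout, feature names, and default values are invented for the example.

```python
from typing import Dict, List, Optional


class InMemoryStore:
    """Hypothetical stand-in for the online store; this is not a Redis client."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def mget(self, keys: List[str]) -> List[Optional[str]]:
        # One call for many keys; None marks every key that is missing.
        return [self._data.get(k) for k in keys]


# Hypothetical feature names and fallback values used when a key is missing.
FEATURES = ["txn_count_5m", "amount_sum_1h"]
DEFAULTS = {"txn_count_5m": 0.0, "amount_sum_1h": 0.0}


def feature_key(card_id: str, feature: str) -> str:
    """Assumed key layout: one key per entity and feature."""
    return f"card:{card_id}:{feature}"


def publish_features(store: InMemoryStore, card_id: str,
                     values: Dict[str, float]) -> None:
    """'Building' side: the pipeline pushes values it has already computed."""
    for name, value in values.items():
        store.set(feature_key(card_id, name), str(value))


def fetch_features(store: InMemoryStore, card_id: str) -> Dict[str, float]:
    """'Serving' side: one multi-key read, with defaults for anything missing."""
    raw = store.mget([feature_key(card_id, f) for f in FEATURES])
    return {f: float(v) if v is not None else DEFAULTS[f]
            for f, v in zip(FEATURES, raw)}


store = InMemoryStore()
publish_features(store, "c42", {"txn_count_5m": 3})
print(fetch_features(store, "c42"))
```

The serving function never computes anything: it only reads by key and fills gaps, which is what keeps the request path short no matter how heavy the feature computation is upstream.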
Pseudocode: end-to-end flow (online scoring + feature refresh)

The pseudocode below intentionally reads like application logic rather than a single SDK. It highlights what matters: key design, pipelined feature reads, conservative fallbacks, and continuous refresh from Azure Databricks.

    # ----------------------------
    # Online scoring (critical path)
    # ----------------------------
    function handleAuthorization(req):
        schemaV = "v3"
        keys = buildFeatureKeys(schemaV, req)    # card/device/merchant + windows
        feats = redis.MGET(keys)                 # single round trip (pipelined)
        misses = missCount(feats)                # count misses before defaults hide them
        feats = fillDefaults(feats)              # conservative, no blocking
        payload = toModelPayload(req, feats)
        score = modelEndpoint.predict(payload)   # Databricks Model Serving or an Azure-hosted model endpoint
        decision = policy(score, req)            # approve/decline/step-up
        emitEventHub("txn_events", summarize(req, score, decision))   # async
        emitMetrics(redisLatencyMs, modelLatencyMs, misses)
        return decision

    # -----------------------------------------
    # Feature pipeline (async): build + publish
    # -----------------------------------------
    function streamingFeaturePipeline():
        events = readEventHubs("txn_events")
        ref = readCosmos("account_card_reference")     # system of record lookups
        feats = computeFeatures(events, ref)           # windows, counters, signals
        writeDelta("fraud_feature_history", feats)     # ADLS Delta tables (lakehouse)
        publishLatestToRedis(feats, schemaV="v3")      # SET/HSET + TTL (+ jitter)

    # -----------------------------------
    # Training + deploy (async lifecycle)
    # -----------------------------------
    function trainAndDeploy():
        hist = readDelta("fraud_feature_history")
        labels = readCosmos("fraud_outcomes")          # delayed ground truth
        model = train(joinPointInTime(hist, labels))
        register(model)
        deployToDatabricksModelServing(model)

Why it works

This architecture works because each layer does the job it is best at. The lakehouse and feature pipelines handle heavy computation, validation, lineage, and replayable history.
The online speed layer handles locality and frequency: it keeps the “hot” feature state close to the online compute so requests do not pay the cost of re-computation or large fan-out reads. You explicitly control freshness with TTLs and refresh cadence, and you keep clear correctness boundaries by treating Azure Managed Redis as a serving layer rather than the authoritative system of record, with durable, governed feature history and labels stored in Delta tables and Azure data stores such as Azure Cosmos DB.

Design choices that matter

Cost efficiency and availability start with clear separation of concerns. Serving hot features from Azure Managed Redis avoids sizing analytics infrastructure for high-QPS, low-latency SLAs, and enables predictable capacity planning with regional isolation for online services. Azure Databricks remains optimized for correctness, freshness, and replayable history while the online tier scales independently by request rate and working set size.

Freshness and TTLs should reflect business tolerance for staleness and the meaning of each feature. Short velocity windows need TTLs slightly longer than ingestion gaps, while profiles and reference features can live longer. Adding jitter (for example ±10%) prevents synchronized expirations that create load spikes.

Key design is the control plane for safe evolution and availability. Include explicit schema version prefixes and keep keys stable by entity and window. Publish new versions alongside existing ones, switch readers, and retire old versions to enable zero-downtime rollouts.

Protect the online path from stampedes and unnecessary cost. If a hot key is missing, avoid triggering widespread re-computation in downstream systems. Use a short single-flight mechanism and conservative defaults, especially for risk-sensitive decisions.

Keep payloads compact so performance and cost remain predictable. Online feature reads are fastest when values are small and fetched in one or two round trips.
Favor numeric encodings and small blobs, and use atomic writes to avoid partial or inconsistent reads during scoring.

Reference architecture notes (regional first, then global)

Start with a single-region deployment to validate end-to-end freshness and latency. Co-locate the Payment API compute, Azure Managed Redis, the model endpoint, and the primary data sources for feature pipelines to minimize round trips. Once the pattern is proven, extend to multi-region by deploying the online tier and its local speed layer per region, while keeping a clear strategy for how features are published and reconciled across regions (often via regional pipelines that consume the same event stream or replicated event hubs).

Operations and SRE considerations

| Layer | What to Monitor | Why It Matters | Typical Signals / Metrics |
| --- | --- | --- | --- |
| Online service (API / scoring) | End-to-end request latency, error rate, fallback rate | Confirms the critical path meets application SLAs even under partial degradation | p50/p95/p99 latency, error %, step-up or conservative decision rate |
| Azure Managed Redis (speed layer) | Feature fetch latency, hit/miss ratio, memory pressure | Indicates whether the working set fits and whether TTLs align with access patterns | GET/MGET latency, miss %, evictions, memory usage |
| Model serving | Inference latency, throughput, saturation | Separates model execution cost from feature access cost | Inference p95 latency, QPS, concurrency utilization |
| Azure Databricks feature pipelines | Streaming lag, job health, data freshness | Ensures features are being refreshed on time and correctness is preserved | Event lag, job failures, watermark delay |
| Cross-layer boundaries | Correlation between misses, latency spikes, and pipeline lag | Helps identify whether regressions originate in serving, pipelines, or models | Redis miss spikes vs pipeline delays vs API latency |

Monitor each layer independently, then correlate at the boundaries.
This makes it clear whether an SLA issue is caused by online serving pressure, model inference, or delayed feature publication, without turning the lakehouse into a synchronous dependency.

Putting it all together

Adopt the pattern incrementally. First, publish a small, high-value feature set from Azure Databricks into Azure Managed Redis and wire the online service to fetch those features during scoring. Measure end-to-end impact on latency, model quality, and operational stability. Next, extend to streaming refresh for near-real-time behavioral features, and add controlled fallbacks for partial misses. Finally, scale out to multi-region if needed, keeping each region’s online service close to its local speed layer and ensuring the feature pipelines provide consistent semantics across regions.

Sources and further reading

- Azure Databricks documentation: https://learn.microsoft.com/en-us/azure/databricks/
- Azure Managed Redis documentation (overview and architecture): https://learn.microsoft.com/azure/redis/
- Azure Architecture Center: Stream processing with Azure Databricks: https://learn.microsoft.com/azure/architecture/reference-architectures/data/stream-processing-databricks
- Databricks Feature Store / feature engineering docs (Azure Databricks): https://learn.microsoft.com/azure/databricks/