Azure Databricks

Designing a Medallion Framework — A Decision Guide
Everyone draws the same picture: Bronze → Silver → Gold. Three boxes, three arrows. Done. What that picture hides is the dozen design decisions you have to make inside each box — and the ones you make at the boundaries between them. Get those right and onboarding the 200th table feels like onboarding the 2nd. Get them wrong and you’ll be rewriting the framework in 18 months.

This post is a generic walkthrough of how to think about a medallion framework on Databricks (or any other platform): what each layer should own, where the responsibilities blur, and a few opinionated patterns I’ve found worth defending.

The classic template — Bronze → Silver → Gold

Three layers, broadly. This template is intentionally vague — and that’s the point. The same three labels can describe a framework for a 10-table marketing pipeline and a 2,000-table enterprise lakehouse. The differences are in how you tweak the template to match your project. This post walks through the questions that drive those tweaks. There isn’t a single right answer for any of them — only the answer that fits your project’s requirements.

How to read this guide

For each architectural choice, I’ll frame it as:

- The question — the requirement you need to clarify
- The options — the realistic ways to answer it
- When each option fits — what kind of project picks which option

Use this to make your tradeoffs explicit. Document the answers in your design doc. They’ll inform a hundred downstream decisions.

Question 1 — Do you need a Staging layer?

A Staging (stg_*) layer is a transient zone that holds just the current run’s data before it lands in Bronze.

Options:

- No staging. Source → Bronze directly.
- Staging as a transient table per object, overwritten every run.
- Staging as a checkpointed zone (e.g., Auto Loader checkpoints + raw files in a landing path).

When to pick which: The decision usually comes down to failure isolation and incremental capture clarity.
If both are non-issues, you can skip it.

Question 2 — How “raw” should Bronze be?

This is the single biggest tweak point in the medallion architecture. The textbook says “Bronze = raw bytes.” Real projects often deviate.

Options:

- A. Strictly raw. Source schema preserved exactly. All columns as STRING. No casting, no trimming.
- B. Lightly cleaned. Strong typing, whitespace trimmed, null normalization (“”, “N/A” → NULL), audit columns added. Schema stable.
- C. Cleansed + minor enrichment. Above plus reference data lookups, basic standardization (e.g., country codes), key normalization.

When to pick which: A useful rule of thumb: the more sources and consumers you have, the cleaner Bronze should be. The cost of not cleaning compounds with every notebook downstream. If you choose B or C, you’ve shifted some traditional Silver responsibilities into Bronze. That’s fine — just be explicit about it so Silver’s contract changes accordingly.

Question 3 — What does Silver actually own?

Silver is the most overloaded layer in any medallion framework. Decide upfront which of these responsibilities Silver owns vs. defers to other layers.

How to decide what Silver owns:

- If Silver is the only layer business users query, give it more — including light history and aggregations. (Common in smaller projects.)
- If you have a strong Gold layer with multiple marts, keep Silver narrow: business entities only, current state.
- If you have multiple consuming teams with different needs, push everything consumer-specific to Gold and keep Silver as the shared canonical model.

The clearest signal that Silver is overloaded: you have one Silver table per source table. Silver should be organized by business entity, not by source. If they line up 1:1, you’ve effectively built “Bronze with cleaning” and skipped Silver’s real value.

Question 4 — Is Gold one zone or several?

The default picture shows Gold as one box. In real projects it often splits.

Options:

- Single Gold zone.
  Marts and history live together.
- Gold-Reporting + Gold-History. Reporting marts (denormalized, aggregated, fast) separated from historized snapshots (SCD2, point-in-time, append-mostly).
- Gold per consumer. Separate zones per business unit, dashboard family, or external API.

The cost of splitting Gold is some duplication and more pipelines. The benefit is independent SLAs — your dashboard refresh isn’t held hostage by your audit history rebuild.

Question 5 — Load patterns: FullLoad vs DeltaLoad vs CDC

Per source table, decide the load pattern. This decision drives staging design, watermark management, and merge logic. It’s normal to mix patterns inside the same framework. The metadata-driven approach below makes this trivial — load pattern is just a column in your config table.

Question 6 — How metadata-driven should the framework be?

Options:

- Code-per-table. One notebook per ingestion. Simple, easy to reason about, scales poorly.
- Hybrid. Generic ingestion notebooks for common patterns, custom notebooks for exceptions.
- Fully metadata-driven. Generic notebooks for every layer, behavior driven entirely by metadata tables.

When to pick which: A fully metadata-driven framework has higher upfront cost but flattens the per-table cost dramatically. The break-even point is usually around 30–50 tables.

Question 7 — Orchestration shape

How do you fan out work across tables?

Options:

- Sequential. One table at a time. Simple, slow.
- Parallel pool. ThreadPoolExecutor or Databricks Workflows fan-out. Tables run concurrently, no inter-table dependencies.
- DAG. Dependency-aware execution. Required when tables depend on each other.

Per-layer guidance: The decision driver is whether tables in that layer depend on each other. If they don’t, don’t pay the DAG complexity tax.

Question 8 — Failure handling and retries

Options to decide on:

- Retry scope. Per statement, per child notebook, per master run, none.
- Retry counts. Per layer? Per table? Per environment?
- Backoff.
  Fixed, linear, exponential.
- Failure semantics. Fail-fast (stop on first failure) or best-effort (continue and report at the end).

When to pick which: A good default for most projects: process-level retry (the master retries the failed child), exponential backoff, a per-layer max retry count, and fail-fast within a child.

Question 9 — Observability: how much do you log?

Decide what every run captures:

- Execution status, start/end timestamps, duration
- Row counts per activity (source read, staging write, target write)
- MERGE metrics (inserted, updated, deleted)
- Watermark used and watermark captured
- Retry attempts
- Error message (truncated)

Options for storage:

- Logs in a source-side metadata DB (e.g., Azure SQL). Easy to query with SQL, integrates with monitoring tools.
- Logs in a Delta table in the lakehouse. Native to Databricks, queryable with Spark.
- Logs in both. Source-side for ops dashboards, Delta for analytics on the pipeline itself.

When to pick which: Whatever you pick, make count validation a first-class output. The moment counts mismatch, you want to know — not three reports later.

Question 10 — Schema evolution policy

The cheapest decision to defer and the most painful one to retrofit. Decide which schema changes are allowed automatically, then decide where to enforce the policy:

- At Bronze ingestion — fail loudly if the source schema changes in a disallowed way
- At Silver — handle by transformation; new Bronze columns don’t auto-flow to Silver
- At Gold — strict contracts; consumers depend on the shape

The contract changes per layer to reflect the audience. Bronze is forgiving (data engineers see issues); Gold is strict (consumers can’t tolerate surprises).

Question 11 — Idempotency and replay

Can you re-run yesterday’s load and get the same result?

Options:

- Idempotent by run_id. Re-running the same run_id is a no-op or produces identical output.
- Idempotent by data. Re-running with the same source data produces identical output (regardless of run_id).
- Non-idempotent.
  Replays may produce different results (e.g., timestamps based on current_timestamp()).

Recommendation: aim for data-idempotent in every layer. Concretely:

- Staging: overwrite-per-run → idempotent by construction.
- Bronze: keyed MERGE → idempotent.
- Silver: pure transformation of Bronze inputs → idempotent.
- Gold: pure transformation of Silver inputs → idempotent.

If you can’t replay a layer cleanly, that’s a design bug worth fixing early.

Question 12 — Environment topology

How many environments? How do they differ? Common patterns:

- Dev / Test / Stage / Prod, with separate workspaces and data.
- Per-developer dev, shared Test/Stage, isolated Prod.

What changes between environments (drive these from config):

- Source connection strings
- Target storage paths / catalog names
- Retry counts (often higher in prod)
- Parallelism (often lower in dev to save cost)
- Logging verbosity
- Data masking rules

Keep code identical across environments. Differences live in environment-scoped config (dev.yml, test.yml, stage.yml, prod.yml) loaded at runtime.
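As a minimal sketch of the "identical code, environment-scoped config" idea: the function and config keys below are illustrative, and the configs are inline dicts so the example is self-contained; a real framework would load dev.yml/prod.yml with a YAML parser at runtime instead.

```python
# Hypothetical per-environment configs (stand-ins for dev.yml / prod.yml).
CONFIGS = {
    "dev":  {"parallelism": 2,  "max_retries": 1, "log_level": "DEBUG",
             "catalog": "dev_lakehouse"},
    "prod": {"parallelism": 16, "max_retries": 3, "log_level": "WARN",
             "catalog": "prod_lakehouse"},
}

def load_config(env: str) -> dict:
    """Return the environment-scoped config; the code never branches on env elsewhere."""
    if env not in CONFIGS:
        raise ValueError(f"unknown environment: {env}")
    return CONFIGS[env]

def run_pipeline(env: str) -> str:
    cfg = load_config(env)
    # Same code path in every environment; only the config values differ.
    return f"running against {cfg['catalog']} with parallelism={cfg['parallelism']}"

print(run_pipeline("prod"))  # running against prod_lakehouse with parallelism=16
```

The point of the pattern is that promoting a pipeline from dev to prod changes a config file, never the code.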
Putting it together — three example shapes

The same framework, three different projects, three different shapes:

Shape A — Small marketing analytics project

- 15 tables, single source, weekly batch
- No staging — source is reliable, volumes small
- Bronze: lightly cleaned — analysts query it directly
- Silver: full ownership including light history and aggregations (no separate Gold needed)
- Gold: optional, only for the executive dashboard
- Code-per-table, sequential orchestration, fail-fast, minimal logging

Shape B — Mid-size enterprise data platform

- 80 tables, 5 source systems, daily batch with some hourly
- Staging as a transient table for DeltaLoads
- Bronze: lightly cleaned + audit columns
- Silver: business entities (Customer, Policy, Claim), DAG orchestration
- Gold: split into Reporting + History zones
- Hybrid metadata-driven (generic ingestion, custom transforms), per-layer retry, structured count logs

Shape C — Large multi-tenant lakehouse

- 500+ tables, 20+ source systems, mixed batch/streaming
- Staging zone with file-level checkpoints (Auto Loader)
- Bronze: strictly raw + a parallel Bronze-Curated layer for cleansed views
- Silver: shared canonical model, narrow scope
- Gold: per-consumer zones with independent SLAs
- Fully metadata-driven, DAG everywhere, multi-store logging, strict schema contracts

Notice none of these are “wrong.” They’re calibrated to the project.

A short checklist for your own framework

Before writing code, write down your answers to:

- Do we need a Staging layer? Why?
- How clean is Bronze? What’s allowed and what’s not?
- What does Silver own? Where does it stop?
- Is Gold one zone or multiple? How are they divided?
- Which load patterns do we support? Per table or universal?
- How metadata-driven? Where do exceptions live?
- What’s the orchestration shape per layer?
- What’s our retry and failure policy per layer?
- What does every run log? Where?
- What’s our schema evolution policy per layer?
- Are all layers data-idempotent?
- What changes per environment, and what stays the same?

If you have an answer for each, you have a framework design. If you skip any, you have a framework that will surprise you in production.

Closing thought

The medallion architecture isn’t a prescription — it’s a vocabulary. Bronze, Silver, Gold give you words to describe responsibilities. The actual responsibilities are yours to assign, based on what your project actually needs. Tweak deliberately. Document your tweaks. And revisit them when the project’s requirements change — because they will.

New Microsoft Certified: Azure Databricks Data Engineer Associate Certification
As a data engineer, you understand that AI performance depends directly on the quality of its data. If the data isn’t clean, well-managed, and accessible at scale, even the most sophisticated AI models won’t perform as expected. Introducing the Microsoft Certified: Azure Databricks Data Engineer Associate Certification, designed to prove that you have the skills required to build and operate reliable data systems by using Azure Databricks. To earn the Certification, you need to pass Exam DP-750: Implementing Data Engineering Solutions Using Azure Databricks, currently in beta.

Is this Certification right for you?

This Certification offers you the opportunity to prove your skills and validate your expertise in the following areas:

Core technical skills

- Ingesting, transforming, and modeling data using SQL and Python
- Building production data pipelines on Azure Databricks
- Implementing software development lifecycle (SDLC) practices with Git-based workflows
- Integrating Azure Databricks with key Microsoft services, such as Azure Storage, Azure Data Factory, Azure Monitor, Azure Key Vault, and Microsoft Entra ID

Governance and security

- Securing and governing data with Unity Catalog and Microsoft Purview
- Applying workspace, cluster, and data-level security best practices

Performance and reliability

- Optimizing compute, caching, partitioning, and Delta Lake design patterns
- Troubleshooting and resolving issues with jobs and pipelines
- Managing workloads across development, staging, and production

For engineers already familiar with Azure Databricks, this Certification bridges the gap between general Azure Databricks skills and the Azure‑specific architecture, security, and operational patterns that employers increasingly expect.

Ready to prove your skills? The first 300 candidates can save 80%

Take advantage of the discounted beta exam offer. The first 300 people who take Exam DP-750 (beta) on or before April 2, 2026, can get 80% off.
To receive the discount, when you register for the exam and are prompted for payment, use code DP750Deltona. This is not a private access code. The seats are offered on a first-come, first-served basis. As noted, you must take the exam on or before April 2, 2026. Please note that this discount is not available in Turkey, Pakistan, India, or China.

How to prepare

Get ready to take Exam DP-750 (beta):

- Review the Exam DP-750 (beta) exam page for details.
- The Exam DP-750 study guide explores key topics covered in the exam.
- Work through the Plan on Microsoft Learn: Get Exam‑Ready for DP‑750: Azure Databricks Data Engineer Associate Certification.
- Need other preparation ideas? Check out Just How Does One Prepare for Beta Exams?
- You can take Certification exams online, from your home or office. Learn what to expect in Online proctored exams: What to expect and how to prepare.

Interested in unlocking more Azure Databricks expertise? Grow your skills and take the next step by exploring Databricks credentials to show what you can do with Azure Databricks.

Ready to get started? Remember, only the first 300 candidates can get 80% off Exam DP-750 (beta) with code DP750Deltona on or before April 2, 2026. Beta exam rescoring begins when the exam goes live, with final results released approximately 10 days later. For more details, read Creating high-quality exams: The path from beta to live. Stay tuned for general availability of this Certification in early May 2026.

Get involved: Help shape future Microsoft Credentials

Join our Microsoft Worldwide Learning SME Group for Credentials on LinkedIn for beta exam alerts and opportunities to help shape future Microsoft learning and assessments.

Additional information

For more cloud and AI Certification updates, read our recent blog post, The AI job boom is here. Are you ready to showcase your skills?
Explore Microsoft Credentials on AI Skills Navigator.

Approaches to Integrating Azure Databricks with Microsoft Fabric: The Better Together Story!
Azure Databricks and Microsoft Fabric can be combined to create a unified and scalable analytics ecosystem. This document outlines eight distinct integration approaches, each accompanied by step-by-step implementation guidance and key design considerations. These methods are not prescriptive — your cloud architecture team can choose the integration strategy that best aligns with your organization’s governance model, workload requirements, and platform preferences. Whether you prioritize centralized orchestration, direct data access, or seamless reporting, the flexibility of these options allows you to tailor the solution to your specific needs.

Azure Managed Redis & Azure Databricks: Real-time Feature Serving for Low-Latency Decisions
This blog content has been a collective collaboration between the Azure Databricks and Azure Managed Redis Product and Product Marketing teams.

Executive summary

Modern decisioning systems (fraud scoring, payments authorization, personalization, and step-up authentication) must return answers in tens of milliseconds while still reflecting the most recent behavior. That creates a classic tension: lakehouse platforms excel at large-scale ingestion, feature engineering, governance, training, and replayable history, but they are not designed to sit directly on the synchronous request path for high-QPS, ultra-low-latency lookups. This guide shows a pattern that keeps Azure Databricks as the primary system for building and maintaining features, while using Azure Managed Redis as the online speed layer that serves those features at memory speed for real-time scoring.

The result is a shorter and more predictable critical path for your application: the Payment API (or any online service) reads features from Azure Managed Redis and calls a model endpoint; Azure Databricks continuously refreshes features from streaming and batch sources; and your authoritative systems of record (for example, account/card data) remain durable and governed. You get real-time responsiveness without giving up data correctness, lineage, or operational discipline.

What each service does

Azure Databricks is a first-party analytics and AI platform on Azure built on Apache Spark and the lakehouse architecture. It is commonly used for batch and streaming pipelines, feature engineering, model training, governance, and operationalization of ML workflows. In this architecture, Azure Databricks is the primary data and AI platform where features are defined, computed, validated, and published, and where governed history is retained.

Azure Managed Redis is a Microsoft‑managed, in‑memory data store based on Redis Enterprise, designed for low‑latency, high‑throughput access patterns.
It is commonly used for traditional and real‑time caching, counters, and session state, and increasingly as a fast state layer for AI‑driven applications. In this architecture, Azure Managed Redis serves as the online feature store and speed layer: it holds the most recent feature values and signals required for real‑time scoring, and it can also support modern agentic patterns such as short‑ and long‑term memory, vector lookups, and fast state access alongside model inference.

Business story: real-time fraud scoring as a running example

Consider a payment system that must decide to approve, decline, or step up authentication in tens of milliseconds — faster than a blink of an eye! The decision depends on recent behavioral signals, velocity counters, device changes, geo anomalies, and merchant patterns, combined with a fraud model. If the online service tries to compute or retrieve those features from heavy analytics systems on demand, the request path becomes slower and more variable, especially at peak load. Instead, Azure Databricks pipelines continuously compute and refresh those features, and Azure Managed Redis serves them instantly to the scoring service. Behavioral history, profiles, and outcomes are still written to durable Azure data stores such as Delta tables and Azure Cosmos DB, so fraud models can be retrained with governed, reproducible data.

The pattern: online feature serving with a speed layer

The core idea is to separate responsibilities. Azure Databricks owns “building” features: ingest, join, aggregate, compute windows, and publish validated, governed results. Azure Managed Redis owns “serving” features: fast, repeated key-based access on the hot path. The model endpoint then consumes a feature payload that is already pre-shaped for inference. This division prevents the lakehouse from becoming an online dependency and lets you scale online decisioning independently from offline compute.
Pseudocode: end-to-end flow (online scoring + feature refresh)

The pseudocode below intentionally reads like application logic rather than a single SDK. It highlights what matters: key design, pipelined feature reads, conservative fallbacks, and continuous refresh from Azure Databricks.

```
# ----------------------------
# Online scoring (critical path)
# ----------------------------
function handleAuthorization(req):
    schemaV = "v3"
    keys = buildFeatureKeys(schemaV, req)    # card/device/merchant + windows
    feats = redis.MGET(keys)                 # single round trip (pipelined)
    feats = fillDefaults(feats)              # conservative, no blocking
    payload = toModelPayload(req, feats)
    score = modelEndpoint.predict(payload)   # Databricks Model Serving or an
                                             # Azure-hosted model endpoint
    decision = policy(score, req)            # approve/decline/step-up
    emitEventHub("txn_events", summarize(req, score, decision))   # async
    emitMetrics(redisLatencyMs, modelLatencyMs, missCount(feats))
    return decision

# -----------------------------------------
# Feature pipeline (async): build + publish
# -----------------------------------------
function streamingFeaturePipeline():
    events = readEventHubs("txn_events")
    ref = readCosmos("account_card_reference")   # system of record lookups
    feats = computeFeatures(events, ref)         # windows, counters, signals
    writeDelta("fraud_feature_history", feats)   # ADLS Delta tables (lakehouse)
    publishLatestToRedis(feats, schemaV="v3")    # SET/HSET + TTL (+ jitter)

# -----------------------------------
# Training + deploy (async lifecycle)
# -----------------------------------
function trainAndDeploy():
    hist = readDelta("fraud_feature_history")
    labels = readCosmos("fraud_outcomes")        # delayed ground truth
    model = train(joinPointInTime(hist, labels))
    register(model)
    deployToDatabricksModelServing(model)
```

Why it works

This architecture works because each layer does the job it is best at. The lakehouse and feature pipelines handle heavy computation, validation, lineage, and replayable history.
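As a concrete, runnable companion to the feature-read step in the pseudocode, here is a pure-Python sketch of versioned key building and conservative fallbacks. The in-memory dict stands in for Azure Managed Redis (a real service would issue a single MGET over the same keys), and the key layout, feature names, and defaults are illustrative, not a prescribed schema.

```python
# In-memory stand-in for Azure Managed Redis.
FEATURE_STORE = {
    "v3:card:1234:txn_count_1h": "7",
    "v3:card:1234:avg_amount_24h": "52.10",
    # "v3:card:1234:distinct_merchants_1h" is missing -> fallback applies
}

# Conservative defaults: when a feature is missing, fail toward caution
# rather than blocking the request to recompute it.
DEFAULTS = {
    "txn_count_1h": "0",
    "avg_amount_24h": "0.0",
    "distinct_merchants_1h": "0",
}

def build_feature_keys(schema_v: str, card_id: str) -> dict:
    """Stable, versioned key layout: <schema>:<entity>:<id>:<feature>."""
    return {f: f"{schema_v}:card:{card_id}:{f}" for f in DEFAULTS}

def fetch_features(schema_v: str, card_id: str) -> dict:
    keys = build_feature_keys(schema_v, card_id)
    # One logical round trip (MGET in Redis); misses fall back to defaults.
    return {f: FEATURE_STORE.get(k, DEFAULTS[f]) for f, k in keys.items()}

feats = fetch_features("v3", "1234")
print(feats["txn_count_1h"])           # -> 7 (served from the store)
print(feats["distinct_merchants_1h"])  # -> 0 (conservative default)
```

Because the schema version lives in the key prefix, publishing a "v4" feature set alongside "v3" and flipping readers over is a zero-downtime change, exactly the rollout pattern described under key design below.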
The online speed layer handles locality and frequency: it keeps the “hot” feature state close to the online compute so requests do not pay the cost of re-computation or large fan-out reads. You explicitly control freshness with TTLs and refresh cadence, and you keep clear correctness boundaries by treating Azure Managed Redis as a serving layer rather than the authoritative system of record, with durable, governed feature history and labels stored in Delta tables and Azure data stores such as Azure Cosmos DB.

Design choices that matter

Cost efficiency and availability start with clear separation of concerns. Serving hot features from Azure Managed Redis avoids sizing analytics infrastructure for high‑QPS, low‑latency SLAs, and enables predictable capacity planning with regional isolation for online services. Azure Databricks remains optimized for correctness, freshness, and replayable history while the online tier scales independently by request rate and working set size.

Freshness and TTLs should reflect business tolerance for staleness and the meaning of each feature. Short velocity windows need TTLs slightly longer than ingestion gaps, while profiles and reference features can live longer. Adding jitter (for example ±10%) prevents synchronized expirations that create load spikes.

Key design is the control plane for safe evolution and availability. Include explicit schema version prefixes and keep keys stable by entity and window. Publish new versions alongside existing ones, switch readers, and retire old versions to enable zero‑downtime rollouts.

Protect the online path from stampedes and unnecessary cost. If a hot key is missing, avoid triggering widespread re-computation in downstream systems. Use a short single‑flight mechanism and conservative defaults, especially for risk‑sensitive decisions.

Keep payloads compact so performance and cost remain predictable. Online feature reads are fastest when values are small and fetched in one or two round trips.
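To make compact payloads concrete, the sketch below compares a JSON-encoded feature payload against a fixed-layout binary encoding of the same three features, using only the standard library. The field layout (count, integer cents, count) is illustrative, not a prescribed wire format.

```python
import json
import struct

# Three features for one card, stored two ways.
features = {"txn_count_1h": 7, "avg_amount_24h": 52.10, "distinct_merchants_1h": 3}

# Option 1: JSON blob -- readable, but tens of bytes of key names per value.
as_json = json.dumps(features).encode()

# Option 2: fixed binary layout -- 16 bytes total, decoded in microseconds.
# "<IqI" = little-endian unsigned int, signed 64-bit (amount as cents), unsigned int.
as_packed = struct.pack(
    "<IqI",
    features["txn_count_1h"],
    round(features["avg_amount_24h"] * 100),  # store the amount as integer cents
    features["distinct_merchants_1h"],
)

print(len(as_json), len(as_packed))  # binary payload is several times smaller
count, cents, merchants = struct.unpack("<IqI", as_packed)
```

Smaller values mean less network time per MGET and a bigger working set per gigabyte of memory, which is why numeric encodings tend to win on the hot path.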
Favor numeric encodings and small blobs, and use atomic writes to avoid partial or inconsistent reads during scoring.

Reference architecture notes (regional first, then global)

Start with a single-region deployment to validate end-to-end freshness and latency. Co-locate the Payment API compute, Azure Managed Redis, the model endpoint, and the primary data sources for feature pipelines to minimize round trips. Once the pattern is proven, extend to multi-region by deploying the online tier and its local speed layer per region, while keeping a clear strategy for how features are published and reconciled across regions (often via regional pipelines that consume the same event stream or replicated event hubs).

Operations and SRE considerations

| Layer | What to Monitor | Why It Matters | Typical Signals / Metrics |
| --- | --- | --- | --- |
| Online service (API / scoring) | End‑to‑end request latency, error rate, fallback rate | Confirms the critical path meets application SLAs even under partial degradation | p50/p95/p99 latency, error %, step‑up or conservative decision rate |
| Azure Managed Redis (speed layer) | Feature fetch latency, hit/miss ratio, memory pressure | Indicates whether the working set fits and whether TTLs align with access patterns | GET/MGET latency, miss %, evictions, memory usage |
| Model serving | Inference latency, throughput, saturation | Separates model execution cost from feature access cost | Inference p95 latency, QPS, concurrency utilization |
| Azure Databricks feature pipelines | Streaming lag, job health, data freshness | Ensures features are being refreshed on time and correctness is preserved | Event lag, job failures, watermark delay |
| Cross‑layer boundaries | Correlation between misses, latency spikes, and pipeline lag | Helps identify whether regressions originate in serving, pipelines, or models | Redis miss spikes vs pipeline delays vs API latency |

Monitor each layer independently, then correlate at the boundaries.
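As a toy illustration of per-layer percentiles, the sketch below computes a nearest-rank p95 from hypothetical timing samples at two boundaries; a real deployment would pull these signals from Azure Monitor or the platforms' built-in metrics rather than computing them in application code.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=0.95 for p95."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p * len(ranked)) - 1)
    return ranked[idx]

# Hypothetical per-request timings (ms) sampled at each layer boundary.
redis_ms = [1.2, 0.9, 1.1, 14.0, 1.0, 1.3, 0.8, 1.1, 1.2, 0.9]
model_ms = [22.0, 25.0, 21.0, 24.0, 23.0, 26.0, 22.5, 24.5, 23.5, 25.5]

print(f"redis p95: {percentile(redis_ms, 0.95):.1f} ms")  # redis p95: 14.0 ms
print(f"model p95: {percentile(model_ms, 0.95):.1f} ms")  # model p95: 26.0 ms
# If end-to-end p95 regresses while model p95 stays flat, look first at the
# speed layer (misses, evictions) or at feature pipeline lag.
```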
This makes it clear whether an SLA issue is caused by online serving pressure, model inference, or delayed feature publication, without turning the lakehouse into a synchronous dependency.

Putting it all together

Adopt the pattern incrementally. First, publish a small, high-value feature set from Azure Databricks into Azure Managed Redis and wire the online service to fetch those features during scoring. Measure end-to-end impact on latency, model quality, and operational stability. Next, extend to streaming refresh for near-real-time behavioral features, and add controlled fallbacks for partial misses. Finally, scale out to multi-region if needed, keeping each region’s online service close to its local speed layer and ensuring the feature pipelines provide consistent semantics across regions.

Sources and further reading

- Azure Databricks documentation: https://learn.microsoft.com/en-us/azure/databricks/
- Azure Managed Redis documentation (overview and architecture): https://learn.microsoft.com/azure/redis/
- Azure Architecture Center: Stream processing with Azure Databricks: https://learn.microsoft.com/azure/architecture/reference-architectures/data/stream-processing-databricks
- Databricks Feature Store / feature engineering docs (Azure Databricks): https://learn.microsoft.com/azure/databricks/

Announcing the New Home for the Azure Databricks Blog
We’re excited to share that the Azure Databricks blog has moved to a new address on the Microsoft Tech Community Hub: Azure Databricks | Microsoft Community Hub.

Our new blog home is designed to make it easier than ever for you to discover the latest product updates, deep technical insights, and real-world best practices directly from the Azure Databricks product team. Whether you're a data engineer, data scientist, or analytics leader, this is your go-to destination for staying informed and inspired.

What You’ll Find on the New Blog

At our new address, you can expect:

- Latest Announcements – Stay up to date with new features, capabilities, and releases
- Best Practice Guidance – Learn proven approaches for building scalable data and AI solutions
- Technical Deep Dives – Explore detailed walkthroughs and architecture insights
- Customer Stories – See how organizations are driving impact with Azure Databricks

Why the Move?

This new blog gives us the flexibility to deliver a better reading experience, improved navigation, and richer content dedicated to Azure Databricks. It also allows us to bring you more frequent updates and more in-depth resources tailored to your needs.

Stay Connected

We encourage you to bookmark the new blog and check back regularly. Even better — follow along so you never miss an update. By staying connected, you’ll be among the first to hear about new features, performance improvements, and expert recommendations to help you get the most out of Azure Databricks.

👉 Follow the new Azure Databricks blog today and stay ahead with the latest announcements and best practices. We’re looking forward to continuing this journey with you — now at our new home!
Check out the latest blogs if you haven’t already:

- Introducing Lakeflow Connect Free Tier, now available on Azure Databricks | Microsoft Community Hub
- Near–Real-Time CDC to Delta Lake for BI and ML with Lakeflow on Azure Databricks | Microsoft Community Hub

Introducing Lakeflow Connect Free Tier, now available on Azure Databricks
We're excited to introduce the Lakeflow Connect Free Tier on Azure Databricks, so you can easily bring your enterprise data into your lakehouse to build analytics and AI applications faster. Modern applications require reliable access to operational data, especially for training analytics and AI agents, but connecting and gathering data across silos can be challenging. With this new release, you can seamlessly ingest all of your enterprise data from SaaS and database sources to unlock data intelligence for your AI agents.

Ingest millions of records per day, per workspace for free

This new Lakeflow Connect Free Tier provides 100 DBUs per day, per workspace, which allows you to ingest approximately 100 million records* from many popular data sources**, including SaaS applications and databases.

Unlock your enterprise data for free with Lakeflow Connect

This new offering provides all the benefits of Lakeflow Connect, eliminating the heavy lifting so your teams can focus on unlocking data insights instead of managing infrastructure. In the past year, Databricks has continued rolling out several fully managed connectors, supporting popular data sources. The free tier supports popular SaaS applications (Salesforce, ServiceNow, Google Analytics, Workday, Microsoft Dynamics 365) and top-used databases (SQL Server, Oracle, Teradata, PostgreSQL, MySQL, Snowflake, Redshift, Synapse, and BigQuery).

Lakeflow Connect benefits include:

- Simple UI: Avoid complex setups and architectural overhead; these fully managed connectors provide a simple UI and API to democratize data access. Automated features also help simplify pipeline maintenance with minimal overhead.
- Efficient ingestion: Increase efficiency and accelerate time to value. Optimized incremental reads and writes and data transformation help improve the performance and reliability of your pipelines, reduce bottlenecks, and reduce impact to the source data for scalability.
- Unified with the Databricks Platform: Create ingestion pipelines with governance from Unity Catalog, observability from Lakehouse Monitoring, and seamless orchestration with Lakeflow Jobs for analytics, AI, and BI.

Availability

The Lakeflow Connect Free Tier is available starting today on Azure Databricks. If you are at FabCon in Atlanta, join Accelerating Data and AI with Azure Databricks on Thursday, March 19th, 8:00–9:00 AM, room C302 to see how these capabilities come together to accelerate performance, simplify architecture, and maximize value on Azure.

Getting Started Resources

To learn more about the Lakeflow Connect Free Tier and Lakeflow Connect, review our pricing page and documentation. Get started ingesting your data today for free; sign up with an Azure free account.

- Get started with Azure Databricks for free
- Product tour: Databricks Lakeflow Connect for Salesforce: Powering Smarter Selling with AI and Analytics
- Product tour: Effortless ServiceNow Data Ingestion with Databricks Lakeflow Connect
- Product tour: Simplify Data Ingestion with Lakeflow Connect: From Google Analytics to AI
- On-demand video: Use Lakeflow Connect for Salesforce to predict customer churn
- On-demand video: Databricks Lakeflow Connect for Workday Reports: Connect, Ingest, and Analyze Workday Data Without Complexity
- On-demand video: Data Ingestion With Lakeflow Connect

---

* Your actual ingestion capacity will vary based on specific workload characteristics, record sizes, and source types.
** Excludes Zerobus Ingest, Auto Loader, and other self-managed connectors. Customers will continue to incur charges for underlying infrastructure consumption from the cloud vendor.

Near–Real-Time CDC to Delta Lake for BI and ML with Lakeflow on Azure Databricks
The Challenge: Too Many Tools, Not Enough Clarity

Modern data teams on Azure often stitch together separate orchestrators, custom streaming consumers, hand-rolled transformation notebooks, and third-party connectors, each with its own monitoring UI, credential system, and failure modes. The result is observability gaps, weeks of work per new data source, disconnected lineage, and governance bolted on as an afterthought.

Lakeflow, Databricks' unified data engineering solution, solves this by consolidating ingestion, transformation, and orchestration natively inside Azure Databricks, governed end-to-end by Unity Catalog.

| Component | What It Does |
|---|---|
| Lakeflow Connect | Point-and-click connectors for databases (using CDC), SaaS apps, files, streaming, and Zerobus for direct telemetry |
| Lakeflow Spark Declarative Pipelines | Declarative ETL with AutoCDC, data quality enforcement, and automatic incremental processing |
| Lakeflow Jobs | Managed orchestration with 99.95% uptime, a visual task DAG, and repair-and-rerun |

Architecture

Step 1: Stream Application Telemetry with Zerobus Ingest

Zerobus Ingest, part of Lakeflow Connect, lets your application push events directly to a Delta table over gRPC: no message bus, no Structured Streaming job. Sub-5-second latency, up to 100 MB/sec per connection, immediately queryable in Unity Catalog.

Prerequisites

- Azure Databricks workspace with Unity Catalog enabled and serverless compute on
- A service principal with write access to the target table

Setup

First, create the target table in a SQL notebook:

```sql
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA IF NOT EXISTS prod.bronze;

CREATE TABLE IF NOT EXISTS prod.bronze.telemetry_events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  session_id STRING,
  ts BIGINT,
  page STRING,
  duration_ms INT
);
```

1. Go to Settings → Identity and Access → Service Principals → Add service principal
2. Open the service principal → Secrets tab → Generate secret. Save the Client ID and secret.
3. In a SQL notebook, grant access:

```sql
GRANT USE CATALOG ON CATALOG prod TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA prod.bronze TO `<client-id>`;
GRANT MODIFY, SELECT ON TABLE prod.bronze.telemetry_events TO `<client-id>`;
```

4. Derive your Zerobus endpoint from your workspace URL: `<workspace-id>.zerobus.<region>.azuredatabricks.net` (the workspace ID is the number in your workspace URL, e.g. adb-**1234567890**.12.azuredatabricks.net)
5. Install the SDK: `pip install databricks-zerobus-ingest-sdk`
6. In your application, open a stream and push records:

```python
from zerobus.sdk.sync import ZerobusSdk
from zerobus.sdk.shared import RecordType, StreamConfigurationOptions, TableProperties

sdk = ZerobusSdk("<workspace-id>.zerobus.<region>.azuredatabricks.net",
                 "https://<workspace-url>")

stream = sdk.create_stream(
    "<client-id>",
    "<client-secret>",
    TableProperties("prod.bronze.telemetry_events"),
    StreamConfigurationOptions(record_type=RecordType.JSON)
)

stream.ingest_record({"event_id": "e1", "user_id": "u42",
                      "event_type": "page_view", "ts": 1700000000000})
stream.close()
```

7. Verify in Catalog → prod → bronze → telemetry_events → Sample Data

Step 2: Ingest from On-Premises SQL Server via CDC

Lakeflow Connect reads SQL Server's transaction log incrementally: no full table scans, no custom extraction software. Connectivity to your on-prem server is over Azure ExpressRoute.
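To build intuition for what log-based CDC delivers downstream, the sketch below applies a stream of change events (insert, update, delete, ordered by log sequence number) to an in-memory table state. This is an illustration of the general CDC pattern only, not Lakeflow's internals:

```python
def apply_cdc_events(events):
    """Apply ordered CDC change events to an in-memory key -> row state.

    Each event: {"op": "insert"|"update"|"delete", "key": ..., "row": ..., "lsn": int}
    Events are replayed in log-sequence-number order, mirroring how a
    log-based reader replays the source transaction log.
    """
    state = {}
    for ev in sorted(events, key=lambda e: e["lsn"]):
        if ev["op"] in ("insert", "update"):
            state[ev["key"]] = ev["row"]      # upsert the latest image
        elif ev["op"] == "delete":
            state.pop(ev["key"], None)        # tombstone removes the key

    return state

events = [
    {"op": "insert", "key": 1, "row": {"status": "new"},     "lsn": 10},
    {"op": "update", "key": 1, "row": {"status": "shipped"}, "lsn": 20},
    {"op": "insert", "key": 2, "row": {"status": "new"},     "lsn": 15},
    {"op": "delete", "key": 2, "row": None,                  "lsn": 30},
]
print(apply_cdc_events(events))  # {1: {'status': 'shipped'}}
```

Because only the change events move over the wire, the source pays for a log read rather than repeated full scans, which is what keeps the impact on the operational database low.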
Prerequisites

- SQL Server reachable from Databricks over ExpressRoute (TCP port 1433)
- CDC enabled on the source database and tables (see setup below)
- A SQL login with CDC read permissions on the source database
- Databricks: CREATE CONNECTION privilege on the metastore; USE CATALOG, CREATE TABLE on the destination catalog

Setup

Enable CDC on SQL Server:

```sql
USE YourDatabase;
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo',
  @source_name = N'orders',
  @role_name = NULL;

EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo',
  @source_name = N'customers',
  @role_name = NULL;
```

Configure the connector in Databricks:

1. Click Data Ingestion in the sidebar (or + New → Add Data)
2. Select SQL Server from the connector list
3. Ingestion Gateway page: enter a gateway name, select staging catalog/schema, click Next
4. Ingestion Pipeline page: name the pipeline, click Create connection. Host: your on-prem IP (e.g. 10.0.1.50) · Port: 1433 · Database: YourDatabase. Enter credentials, click Create, then Create pipeline and continue
5. Source page: expand the database tree, check dbo.orders and dbo.customers; optionally enable History tracking (SCD Type 2) per table. Set Destination table name to orders_raw and customers_raw respectively.
6. Destination page: set catalog: prod, schema: bronze, click Save and continue
7. Settings page: set a sync schedule (e.g. every 5 minutes), click Save and run pipeline

Step 3: Transform with Spark Declarative Pipelines

The Lakeflow Pipelines Editor is an IDE built for developing pipelines in Lakeflow Spark Declarative Pipelines (SDP), and lets you define Bronze → Silver → Gold in SQL. SDP then handles incremental execution, schema evolution, and lineage automatically.

Prerequisites

- Bronze tables populated (from Steps 1 and 2)
- CREATE TABLE and USE SCHEMA privileges on prod.silver and prod.gold

Setup

1. In the sidebar, click Jobs & Pipelines → ETL pipeline → Start with an empty file → SQL
2. Rename the pipeline (click the name at top) to lakeflow-demo-pipeline
3. Paste the following SQL:

```sql
-- Silver: latest order state (SCD Type 1)
CREATE OR REFRESH STREAMING TABLE prod.silver.orders;
APPLY CHANGES INTO prod.silver.orders
FROM STREAM(prod.bronze.orders_raw)
KEYS (order_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 1;

-- Silver: full customer history (SCD Type 2)
CREATE OR REFRESH STREAMING TABLE prod.silver.customers;
APPLY CHANGES INTO prod.silver.customers
FROM STREAM(prod.bronze.customers_raw)
KEYS (customer_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 2;

-- Silver: telemetry with data quality check
CREATE OR REFRESH STREAMING TABLE prod.silver.telemetry_events (
  CONSTRAINT valid_event_type
    EXPECT (event_type IN ('page_view', 'add_to_cart', 'purchase'))
    ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(prod.bronze.telemetry_events);

-- Gold: materialized view joining all three Silver tables
CREATE OR REFRESH MATERIALIZED VIEW prod.gold.customer_activity AS
SELECT
  o.order_id,
  o.customer_id,
  c.customer_name,
  c.email,
  o.order_amount,
  o.order_status,
  COUNT(e.event_id) AS total_events,
  SUM(CASE WHEN e.event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_events
FROM prod.silver.orders o
LEFT JOIN prod.silver.customers c
  ON o.customer_id = c.customer_id
LEFT JOIN prod.silver.telemetry_events e
  ON CAST(o.customer_id AS STRING) = e.user_id  -- user_id in telemetry is string
GROUP BY o.order_id, o.customer_id, c.customer_name, c.email,
         o.order_amount, o.order_status;
```

4. Click Settings (gear icon) → set Pipeline mode: Continuous → Target catalog: prod → Save
5. Click Start: the editor switches to the live Graph view

Step 4: Govern with Unity Catalog

All tables from Steps 1–3 are automatically registered in Unity Catalog, Databricks' built-in governance and security offering, with full lineage. No manual registration needed.
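The difference between the two APPLY CHANGES modes used in Step 3 (SCD Type 1 keeps only the latest row per key; SCD Type 2 preserves history with validity ranges) can be illustrated with a small, self-contained sketch. This is conceptual only; SDP performs the actual merge logic for you:

```python
def scd_type1(changes):
    """Keep only the latest row per key, ordered by the sequence column."""
    latest = {}
    for ch in sorted(changes, key=lambda c: c["updated_at"]):
        latest[ch["key"]] = ch["row"]
    return latest

def scd_type2(changes):
    """Keep full history: close the previous version when a new one arrives."""
    history = []
    open_rows = {}  # key -> index of the currently open version in history
    for ch in sorted(changes, key=lambda c: c["updated_at"]):
        if ch["key"] in open_rows:
            # close the old version at the new row's timestamp
            history[open_rows[ch["key"]]]["end"] = ch["updated_at"]
        history.append({"key": ch["key"], "row": ch["row"],
                        "start": ch["updated_at"], "end": None})
        open_rows[ch["key"]] = len(history) - 1
    return history

changes = [
    {"key": "c1", "row": {"email": "a@x.com"}, "updated_at": 1},
    {"key": "c1", "row": {"email": "b@x.com"}, "updated_at": 2},
]
print(scd_type1(changes))  # one row per key, carrying the latest email
print(scd_type2(changes))  # two versions; the first closed at t=2
```

This is why the demo stores orders as Type 1 (analytics usually want current order state) and customers as Type 2 (auditing and churn analysis want the history).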
View lineage

1. Go to Catalog → prod → gold → customer_activity
2. Click the Lineage tab → See Lineage Graph
3. Click the expand icon on each upstream node to reveal the full chain: Bronze sources → Silver → Gold

Set permissions

```sql
-- Grant analysts read access to the Gold layer only
GRANT SELECT ON TABLE prod.gold.customer_activity TO `analysts@contoso.com`;

-- Mask PII for non-privileged users
CREATE FUNCTION prod.security.mask_email(email STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('data-engineers') THEN email
  ELSE CONCAT(LEFT(email, 2), '***@***.com')
END;

ALTER TABLE prod.silver.customers
  ALTER COLUMN email SET MASK prod.security.mask_email;
```

Step 5: Orchestrate and Monitor with Lakeflow Jobs

Wire the Connect pipeline and SDP pipeline into a single job with dependencies, scheduling, and alerting, all from the UI with Lakeflow Jobs.

Prerequisites

- Pipelines from Steps 2 and 3 saved in the workspace

Setup

1. Go to Jobs & Pipelines → Create → Job
2. Task 1: click the Pipeline tile → name it ingest_sql_server_cdc → select your Lakeflow Connect pipeline → Create task
3. Task 2: click + Add task → Pipeline → name it transform_bronze_to_gold → select lakeflow-demo-pipeline → set Depends on: ingest_sql_server_cdc → Create task
4. In the Job details panel on the right: click Add schedule → set frequency → add email notification on failure → Save
5. Click Run now to trigger a run, then click the run ID to open the Run detail view

For health monitoring across all jobs, query system tables in any notebook or SQL warehouse:

```sql
SELECT
  job_name,
  result_state,
  DATEDIFF(second, start_time, end_time) AS duration_sec
FROM system.lakeflow.job_run_timeline
WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL 24 HOURS
ORDER BY start_time DESC;
```

Step 6: Visualize with AI/BI Dashboards and Genie

AI/BI Dashboards help you create AI-powered, low-code dashboards.
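Before wiring results into dashboards, it can help to see exactly what the system-table query above computes. The sketch below mirrors the same duration and health calculation over a few hypothetical run records (sample data shaped like, but not taken from, `system.lakeflow.job_run_timeline`):

```python
from datetime import datetime, timedelta

def summarize_runs(runs):
    """Compute per-run duration in seconds and an overall failure count,
    mirroring the system-table monitoring query."""
    summary = []
    failures = 0
    for r in runs:
        duration_sec = int((r["end_time"] - r["start_time"]).total_seconds())
        if r["result_state"] != "SUCCEEDED":
            failures += 1
        summary.append({"job_name": r["job_name"],
                        "result_state": r["result_state"],
                        "duration_sec": duration_sec})
    return summary, failures

t0 = datetime(2024, 1, 1, 8, 0, 0)
runs = [
    {"job_name": "cdc_job", "result_state": "SUCCEEDED",
     "start_time": t0, "end_time": t0 + timedelta(minutes=5)},
    {"job_name": "etl_job", "result_state": "FAILED",
     "start_time": t0, "end_time": t0 + timedelta(seconds=90)},
]
summary, failures = summarize_runs(runs)
print(summary)   # durations of 300 and 90 seconds
print(failures)  # one failed run in the window
```

In practice you would keep this logic in SQL against the system table and attach an alert; the point is that the raw timeline gives you everything needed for duration and failure-rate tracking.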
1. Click + New → Dashboard
2. Click Add a visualization, connect to prod.gold.customer_activity, and build charts
3. Click Publish: viewers see data under their own Unity Catalog permissions automatically

Genie allows you to interact with your data using natural language:

1. In the sidebar, click Genie → New
2. On Choose data sources, select prod.gold.customer_activity → Create
3. Add context in the Instructions box (e.g., table relationships, business definitions)
4. Switch to the Chat tab and ask a question: "Which customers have the highest total events and what were their order amounts?"
5. Genie generates and executes SQL, returning a result table. Click View SQL to inspect the query.

Everything in One Platform

| Capability | Lakeflow | Previously Required |
|---|---|---|
| Telemetry ingestion | Zerobus Ingest | Message bus + custom consumer |
| Database CDC | Lakeflow Connect | Custom scripts or 3rd-party tools |
| Transformation + AutoCDC | Spark Declarative Pipelines | Hand-rolled MERGE logic |
| Data quality | SDP Expectations | Separate validation tooling |
| Orchestration | Lakeflow Jobs | External schedulers (Airflow, etc.) |
| Governance | Unity Catalog | Disconnected ACLs and lineage |
| Monitoring | Job UI + System Tables | Separate APM tools |
| BI + NL Query | AI/BI Dashboards + Genie | External BI tools |

Customers seeing results on Azure Databricks:

- Ahold Delhaize: 4.5x faster deployment and 50% cost reduction running 1,000+ ingestion jobs daily
- Porsche Holding: 85% faster ingestion pipeline development vs. a custom-built solution

Next Steps

- Lakeflow product page
- Lakeflow Connect documentation
- Live demos on Demo Center
- Get started with Azure Databricks

Azure Databricks & Fabric Disaster Recovery: The Better Together Story
Authors: Amudha Palani, Oscar Alvarado, Eric Kwashie, Peter Lo, and Rafia Aqil

Disaster recovery (DR) is a critical component of any cloud-native data analytics platform, ensuring business continuity even during rare regional outages caused by natural disasters, infrastructure failures, or other disruptions.

Identify Business Critical Workloads

Before designing any disaster recovery strategy, organizations must first identify which workloads are truly business-critical and require regional redundancy. Not all Databricks or Fabric processes need full DR protection; instead, customers should evaluate the operational impact of downtime, data freshness requirements, regulatory obligations, SLAs, and dependencies across upstream and downstream systems. By classifying workloads into tiers and aligning DR investments accordingly, customers ensure they protect what matters most without over-engineering the platform.

Azure Databricks

Azure Databricks requires a customer-driven approach to disaster recovery, where organizations are responsible for replicating workspaces, data, infrastructure components, and security configurations across regions.

Full System Failover (Active-Passive) Strategy

A comprehensive approach that replicates all dependent services to the secondary region.
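The workload-tiering exercise described earlier can be made concrete before committing to a strategy. The sketch below classifies workloads into DR tiers from their recovery-time objectives; the thresholds and tier names are illustrative assumptions, not Azure or Databricks guidance:

```python
def dr_tier(workload):
    """Assign an illustrative DR tier from a workload's recovery-time
    objective (in hours) and regulatory flag. Thresholds are examples."""
    if workload["regulated"] or workload["rto_hours"] <= 1:
        return "tier-1: active-active or warm standby"
    if workload["rto_hours"] <= 24:
        return "tier-2: active-passive with scheduled sync"
    return "tier-3: redeploy from code, no regional redundancy"

workloads = [
    {"name": "payments_etl", "rto_hours": 0.5, "regulated": True},
    {"name": "daily_sales_report", "rto_hours": 12, "regulated": False},
    {"name": "adhoc_sandbox", "rto_hours": 72, "regulated": False},
]
for w in workloads:
    print(w["name"], "->", dr_tier(w))
```

Encoding the classification, even as a simple rule table like this, forces the RTO and compliance conversation per workload and keeps the DR scope honest.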
Implementation requirements include:

Infrastructure Components:
- Replicate Azure services (ADLS, Key Vault, SQL databases) using Terraform
- Deploy network infrastructure (subnets) in the secondary region
- Establish data synchronization mechanisms

Data Replication Strategy:
- Use Deep Clone for Delta tables rather than geo-redundant storage
- Implement periodic synchronization jobs using Delta's incremental replication
- Measure data transfer results using time travel syntax

Workspace Asset Synchronization:
- Co-deploy cluster configurations, notebooks, jobs, and permissions using CI/CD
- Utilize Terraform and SCIM for identity and access management
- Keep job concurrencies at zero in the secondary region to prevent execution

Fully Redundant (Active-Active) Strategy

The most sophisticated approach, where all transactions are processed in multiple regions simultaneously. While providing maximum resilience, this strategy:
- Requires complex data synchronization between regions
- Incurs the highest operational costs due to duplicate processing
- Is typically needed only for mission-critical workloads with zero tolerance for downtime
- Can be implemented as partial active-active, processing most workloads in primary with a subset in secondary

Enabling Disaster Recovery

Create a secondary workspace in a paired region. Use CI/CD to keep workspace assets synchronized continuously.

| Requirement | Approach | Tools |
|---|---|---|
| Cluster Configurations | Co-deploy to both regions as code | Terraform |
| Code (Notebooks, Libraries, SQL) | Co-deploy with CI/CD pipelines | Git, Azure DevOps, GitHub Actions |
| Jobs | Co-deploy with CI/CD, set concurrency to zero in secondary | Databricks Asset Bundles, Terraform |
| Permissions (Users, Groups, ACLs) | Use IdP/SCIM and infrastructure as code | Terraform, SCIM |
| Secrets | Co-deploy using secret management | Terraform, Azure Key Vault |
| Table Metadata | Co-deploy with CI/CD workflows | Git, Terraform |
| Cloud Services (ADLS, Network) | Co-deploy infrastructure | Terraform |

Then:

- Update your orchestrator (ADF, Fabric pipelines, etc.) to include a simple region toggle to reroute job execution.
- Replicate all dependent services (Key Vault, Storage accounts, SQL DB).
- Implement Delta "Deep Clone" synchronization jobs to keep datasets continuously aligned between regions.
- Introduce an application-level "Sync Tool" that redirects data ingestion and compute execution.
- Enable parallel processing in both regions for selected or all workloads.
- Use bi-directional synchronization for Delta data to maintain consistency across regions.
- For performance and cost control, run most workloads in primary and only a subset in secondary to keep it warm.

Implement Three-Pillar DR Design

- Primary Workspace: Your production Databricks environment running normal operations
- Secondary Workspace: A standby Databricks workspace in a different (paired) Azure region that remains ready to take over if the primary fails

This architecture ensures business continuity while optimizing costs by keeping the secondary workspace dormant until needed. The DR solution is built on three fundamental pillars that work together to provide comprehensive protection:

1. Infrastructure Provisioning (Terraform)

The infrastructure layer creates and manages all Azure resources required for disaster recovery using Infrastructure as Code (Terraform).
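The region toggle mentioned above can be as simple as a shared lookup that routes job submissions to the currently active workspace. A minimal sketch (the config shape and workspace URLs are hypothetical):

```python
# Hypothetical DR routing config: which region is active, and the
# workspace URL for each region the orchestrator can target.
DR_CONFIG = {
    "active_region": "eastus",
    "workspaces": {
        "eastus": "https://adb-primary.azuredatabricks.net",
        "westus2": "https://adb-secondary.azuredatabricks.net",
    },
}

def route_job(config):
    """Return the workspace URL jobs should be submitted to right now."""
    return config["workspaces"][config["active_region"]]

def fail_over(config, to_region):
    """Flip the toggle; orchestrator runs pick up the new target."""
    if to_region not in config["workspaces"]:
        raise ValueError(f"unknown region: {to_region}")
    config["active_region"] = to_region
    return config

print(route_job(DR_CONFIG))   # primary workspace URL
fail_over(DR_CONFIG, "westus2")
print(route_job(DR_CONFIG))   # secondary workspace URL after failover
```

Storing the toggle in one place (a pipeline parameter, Key Vault secret, or config table) means failover is a single state change rather than edits scattered across dozens of pipelines.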
What It Creates:
- Secondary Resource Group: A dedicated resource group in your paired DR region (e.g., if primary is in East US, secondary might be in West US 2)
- Secondary Databricks Workspace: A standby Databricks workspace with the same SKU as your primary, ready to receive failover traffic
- DR Storage Account: An ADLS Gen2 storage account that serves as the backup destination for your critical data
- Monitoring Infrastructure: Azure Monitor Log Analytics workspace and alert action groups to track DR health
- Protection Locks: Management locks to prevent accidental deletion of critical DR resources

Key Design Principle: The Terraform configuration references your existing primary workspace without modifying it. It only creates new resources in the secondary region, ensuring your production environment remains untouched during setup.

2. Data Synchronization (Delta Notebooks)

The data synchronization layer ensures your critical data is continuously backed up to the secondary region.

How It Works: The solution uses a Databricks notebook that runs in your primary workspace on a scheduled basis. This notebook:
- Connects to Backup Storage: Uses Unity Catalog with Azure Managed Identity for secure, credential-free authentication to the secondary storage account
- Identifies Critical Tables: Reads from a configuration list you define (sales data, customer data, inventory, financial transactions, etc.)
- Performs Deep Clone: Uses Delta Lake's native CLONE functionality to create exact copies of your tables in the backup storage
- Tracks Sync Status: Logs each synchronization operation, tracks row counts, and reports on data freshness

Authentication Flow: The synchronization process leverages Unity Catalog's managed identity capabilities:
1. An existing Access Connector for Unity Catalog is granted "Storage Blob Data Contributor" permissions on the backup storage.
2. Storage credentials are created in Databricks that reference this Access Connector.
3. The notebook uses these credentials transparently; no storage keys or secrets are required.

What Gets Synced: You define which tables are critical to your business operations. The notebook creates backup copies including:
- Full table data and schema
- Table partitioning structure
- Delta transaction logs for point-in-time recovery

3. Failover Automation (Python Scripts)

The failover automation layer orchestrates the switch from primary to secondary workspace when disaster strikes.

Microsoft Fabric

Microsoft Fabric provides built-in disaster recovery capabilities designed to keep analytics and Power BI experiences available during regional outages. Fabric simplifies continuity for reporting workloads, while still requiring customer planning for deeper data and workload replication.

Power BI Business Continuity

Power BI, now integrated into Fabric, provides automatic disaster recovery as a default offering:
- No opt-in required: DR capabilities are automatically included.
- Azure storage geo-redundant replication: Ensures backup instances exist in other regions.
- Read-only access during disasters: Semantic models, reports, and dashboards remain accessible.
- Always supported: BCDR for Power BI remains active regardless of the OneLake DR setting.

Fabric's cross-region DR uses a shared responsibility model between Microsoft and customers.

Microsoft's Responsibilities:
- Ensure baseline infrastructure and platform services availability
- Maintain Azure regional pairings for geo-redundancy
- Provide DR capabilities for Power BI as default
Customer Responsibilities:
- Enable disaster recovery settings for capacities
- Set up secondary capacity and workspaces in paired regions
- Replicate data and configurations

Enabling Disaster Recovery

Organizations can enable BCDR through the Admin portal under Capacity settings:

1. Navigate to Admin portal → Capacity settings
2. Select the appropriate Fabric Capacity
3. Access the Disaster Recovery configuration
4. Enable the disaster recovery toggle

Critical Timing Considerations:
- 30-day minimum activation period: Once enabled, the setting remains active for at least 30 days and cannot be reverted.
- 72-hour activation window: Initial enablement can take up to 72 hours to become fully effective.

Azure Databricks & Microsoft Fabric DR Considerations

Building a resilient analytics platform requires understanding how disaster recovery responsibilities differ between Azure Databricks and Microsoft Fabric. While both platforms operate within Azure's regional architecture, their DR models, failover behaviors, and customer responsibilities are fundamentally different.

Recovery Procedures

| Procedure | Databricks | Fabric |
|---|---|---|
| Failover | Stop workloads, update routing, resume in secondary region. | Microsoft initiates failover; customers restore services in DR capacity. |
| Restore to Primary | Stop secondary workloads, replicate data/code back, test, resume production. | Recreate workspaces and items in new capacity; restore Lakehouse and Warehouse data. |
| Asset Syncing | Use CI/CD and Terraform to sync clusters, jobs, notebooks, permissions. | Use Git integration and pipelines to sync notebooks and pipelines; manually restore Lakehouses. |

Business Considerations

| Consideration | Databricks | Fabric |
|---|---|---|
| Control | Customers manage DR strategy, failover timing, and asset replication. | Microsoft manages failover; customers restore services post-failover. |
| Regional Dependencies | Must ensure secondary region has sufficient capacity and services. | DR only available in Azure regions with Fabric support and paired regions. |
| Power BI Continuity | Not applicable. | Power BI offers built-in BCDR with read-only access to semantic models and reports. |
| Activation Timeline | Immediate upon configuration. | DR setting takes up to 72 hours to activate; 30-day wait before changes allowed. |

How Azure NetApp Files Object REST API powers Azure and ISV Data and AI services – on YOUR data
This article introduces the Azure NetApp Files Object REST API, a transformative solution for enterprises seeking seamless, real-time integration between their data and Azure's advanced analytics and AI services. By enabling direct, secure access to enterprise data—without costly transfers or duplication—the Object REST API accelerates innovation, streamlines workflows, and enhances operational efficiency. With S3-compatible object storage support, it empowers organizations to make faster, data-driven decisions while maintaining compliance and data security. Discover how this new capability unlocks business potential and drives a new era of productivity in the cloud.

Azure Databricks Lakebase is now generally available
Modern applications are real-time, intelligent, and increasingly powered by AI agents that need fast, reliable access to operational data—without sacrificing governance, scale, or simplicity. To solve for this, Azure Databricks Lakebase introduces a serverless Postgres database architecture that separates compute from storage and integrates natively with the Databricks Data Intelligence Platform on Azure. Lakebase is now generally available in Azure Databricks, enabling you and your team to start building and validating real-time and AI-driven applications directly on your lakehouse foundation.

Why Azure Databricks Lakebase?

Lakebase was created for modern workloads and to reduce silos. By decoupling compute from storage, Lakebase treats infrastructure as an on-demand service—scaling automatically with workload needs and scaling to zero when idle.

Key capabilities include:

- Serverless Postgres for Production Workloads: Lakebase delivers a managed Postgres experience with predictable performance and built-in reliability features suitable for production applications, while abstracting away infrastructure management.
- Instant Branching and Point-in-Time Recovery: Teams can create zero-copy branches of production data in seconds for testing, debugging, or experimentation, and restore databases to precise points in time to recover from errors or incidents.
- Unified Governance with Unity Catalog: Operational data in Lakebase can be governed using the same Unity Catalog policies that secure analytics and AI workloads, enabling consistent access control, auditing, and compliance across the platform.
- Built for AI and Real-Time Applications: Lakebase is designed to support AI-native patterns such as real-time feature serving, agent memory, and low-latency application state—while keeping data directly connected to the lakehouse for analytics and learning workflows.
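Point-in-time recovery, as described above, comes down to restoring from the most recent state at or before a chosen timestamp. A toy sketch of that selection step (illustrative only; Lakebase implements this natively):

```python
def restore_point(snapshots, target_ts):
    """Pick the latest snapshot taken at or before the target timestamp.

    `snapshots` is a list of (timestamp, label) pairs; returns the label
    to restore from, or None if no snapshot predates the target.
    """
    eligible = [s for s in snapshots if s[0] <= target_ts]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s[0])[1]

snapshots = [(100, "snap-a"), (200, "snap-b"), (300, "snap-c")]
print(restore_point(snapshots, 250))  # snap-b: latest state before t=250
print(restore_point(snapshots, 50))   # None: nothing predates t=50
```

The practical payoff is recovering from a bad write or deployment by rewinding to just before the incident, rather than restoring a full nightly backup.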
Lakebase allows applications to operate directly on governed, lake-backed data, reducing the complexity of pipeline synchronization and duplicated storage.

On Azure Databricks, this unlocks new scenarios such as:

- Real-time applications built on lakehouse data
- AI agents with persistent, governed memory
- Faster release cycles with safe, isolated database branches
- Simplified architectures with fewer moving parts

All while using familiar Postgres interfaces and tools.

Get Started with Azure Databricks Lakebase

Lakebase is integrated into the Azure Databricks experience and can be provisioned directly within Azure Databricks workspaces. For Azure Databricks customers building intelligent, real-time applications, it offers a new foundation: one designed for the pace and complexity of modern data-driven systems.

We're excited to see what you build. Get started today!