Azure Databricks

Near–Real-Time CDC to Delta Lake for BI and ML with Lakeflow on Azure Databricks

AnaviNahar
Microsoft
Mar 18, 2026

The Challenge: Too Many Tools, Not Enough Clarity 

Modern data teams on Azure often stitch together separate orchestrators, custom streaming consumers, hand-rolled transformation notebooks, and third-party connectors — each with its own monitoring UI, credential system, and failure modes. The result is observability gaps, weeks of work per new data source, disconnected lineage, and governance bolted on as an afterthought. 

Lakeflow, Databricks’ unified data engineering solution, solves this by consolidating ingestion, transformation, and orchestration natively inside Azure Databricks — governed end-to-end by Unity Catalog. 

Component · What It Does
  • Lakeflow Connect · Point-and-click connectors for databases (using CDC), SaaS apps, files, streaming, and Zerobus for direct telemetry
  • Lakeflow Spark Declarative Pipelines · Declarative ETL with AutoCDC, data quality enforcement, and automatic incremental processing
  • Lakeflow Jobs · Managed orchestration with 99.95% uptime, a visual task DAG, and repair-and-rerun

 Architecture 

 Step 1: Stream Application Telemetry with Zerobus Ingest 

Zerobus Ingest, part of Lakeflow Connect, lets your application push events directly to a Delta table over gRPC — no message bus, no Structured Streaming job. Sub-5-second latency, up to 100 MB/sec per connection, immediately queryable in Unity Catalog. 

Prerequisites 

  • Azure Databricks workspace with Unity Catalog and serverless compute enabled
  • A service principal with write access to the target table 

Setup 

First, create the target table in a SQL notebook: 

CREATE CATALOG IF NOT EXISTS prod; 
CREATE SCHEMA IF NOT EXISTS prod.bronze;
CREATE TABLE IF NOT EXISTS prod.bronze.telemetry_events (
  event_id    STRING,
  user_id     STRING,
  event_type  STRING,
  session_id  STRING,
  ts          BIGINT,
  page        STRING,
  duration_ms INT
);

1. Go to Settings → Identity and Access → Service Principals → Add service principal

2. Open the service principal → Secrets tab → Generate secret. Save the Client ID and secret.

3. In a SQL notebook, grant access: 

GRANT USE CATALOG ON CATALOG prod TO `<client-id>`; 
GRANT USE SCHEMA ON SCHEMA prod.bronze TO `<client-id>`;
GRANT MODIFY, SELECT ON TABLE prod.bronze.telemetry_events TO `<client-id>`;

4. Derive your Zerobus endpoint from your workspace URL: <workspace-id>.zerobus.<region>.azuredatabricks.net (the workspace ID is the number in your workspace URL, e.g. adb-1234567890.12.azuredatabricks.net)
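The derivation is just string slicing on the workspace hostname. A minimal sketch (the hostname and region are placeholder values, and `zerobus_endpoint` is a helper name chosen here, not part of any SDK):

```python
def zerobus_endpoint(workspace_host: str, region: str) -> str:
    """Build the Zerobus endpoint from a workspace hostname like
    'adb-1234567890.12.azuredatabricks.net': take the numeric
    workspace ID and splice it into the Zerobus host pattern."""
    workspace_id = workspace_host.split(".")[0].removeprefix("adb-")
    return f"{workspace_id}.zerobus.{region}.azuredatabricks.net"

endpoint = zerobus_endpoint("adb-1234567890.12.azuredatabricks.net", "eastus2")
```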

5. Install the SDK: pip install databricks-zerobus-ingest-sdk

6. In your application, open a stream and push records: 

from zerobus.sdk.sync import ZerobusSdk
from zerobus.sdk.shared import RecordType, StreamConfigurationOptions, TableProperties

sdk = ZerobusSdk("<workspace-id>.zerobus.<region>.azuredatabricks.net",
                 "https://<workspace-url>")
stream = sdk.create_stream(
    "<client-id>", "<client-secret>",
    TableProperties("prod.bronze.telemetry_events"),
    StreamConfigurationOptions(record_type=RecordType.JSON)
)
stream.ingest_record({"event_id": "e1", "user_id": "u42",
                      "event_type": "page_view", "ts": 1700000000000})
stream.close()

7. Verify in Catalog → prod → bronze → telemetry_events → Sample Data 

 Step 2: Ingest from On-Premises SQL Server via CDC 

Lakeflow Connect reads SQL Server's transaction log incrementally — no full table scans, no custom extraction software. Connectivity to your on-prem server is over Azure ExpressRoute. 

Prerequisites 

  • SQL Server reachable from Databricks over ExpressRoute (TCP port 1433)
  • CDC enabled on the source database and tables (see setup below)
  • A SQL login with CDC read permissions on the source database
  • Databricks: CREATE CONNECTION privilege on the metastore; USE CATALOG and CREATE TABLE on the destination catalog 
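Before configuring the connector, it can save a debugging round-trip to confirm the SQL Server host is reachable on port 1433 from a Databricks notebook. A minimal sketch (`can_reach` and the host value are illustrative, not part of any Databricks API):

```python
import socket

def can_reach(host: str, port: int = 1433, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the on-prem SQL Server over ExpressRoute
# can_reach("10.0.1.50")  # hypothetical private IP from the walkthrough
```

If this returns False, check ExpressRoute routing and any firewall rules before blaming the connector configuration.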

Setup

Enable CDC on SQL Server: 
USE YourDatabase;
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo', @source_name = N'orders',    @role_name = NULL;
EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo', @source_name = N'customers', @role_name = NULL; 

 Configure the connector in Databricks: 

  1. Click Data Ingestion in the sidebar (or + New → Add Data)
  2. Select SQL Server from the connector list
  3. Ingestion Gateway page — enter a gateway name, select staging catalog/schema, click Next
  4. Ingestion Pipeline page — name the pipeline, click Create connection:
  5. Host: your on-prem IP (e.g. 10.0.1.50) · Port: 1433 · Database: YourDatabase
  6. Enter credentials, click Create, then Create pipeline and continue
  7. Source page — expand the database tree, check dbo.orders and dbo.customers; optionally enable History tracking (SCD Type 2) per table. Set Destination table name to orders_raw and customers_raw respectively.
  8. Destination page — set catalog: prod, schema: bronze, click Save and continue
  9. Settings page — set a sync schedule (e.g. every 5 minutes), click Save and run pipeline 

 Step 3: Transform with Spark Declarative Pipelines 

The Lakeflow Pipelines Editor is an IDE built for developing pipelines in Lakeflow Spark Declarative Pipelines (SDP), and lets you define Bronze → Silver → Gold in SQL. SDP then handles incremental execution, schema evolution, and lineage automatically. 

Prerequisites 

  • Bronze tables populated (from Steps 1 and 2)
  • CREATE TABLE and USE SCHEMA privileges on prod.silver and prod.gold 

Setup 

1. In the sidebar, click Jobs & Pipelines → ETL pipeline → Start with an empty file → SQL

2. Rename the pipeline (click the name at top) to lakeflow-demo-pipeline

3. Paste the following SQL: 

-- Silver: latest order state (SCD Type 1) 
CREATE OR REFRESH STREAMING TABLE prod.silver.orders;
APPLY CHANGES INTO prod.silver.orders
  FROM STREAM(prod.bronze.orders_raw)
  KEYS (order_id) SEQUENCE BY updated_at STORED AS SCD TYPE 1;

-- Silver: full customer history (SCD Type 2)
CREATE OR REFRESH STREAMING TABLE prod.silver.customers;
APPLY CHANGES INTO prod.silver.customers
  FROM STREAM(prod.bronze.customers_raw)
  KEYS (customer_id) SEQUENCE BY updated_at STORED AS SCD TYPE 2;

-- Silver: telemetry with data quality check
CREATE OR REFRESH STREAMING TABLE prod.silver.telemetry_events (
  CONSTRAINT valid_event_type
    EXPECT (event_type IN ('page_view', 'add_to_cart', 'purchase'))
    ON VIOLATION DROP ROW
) AS SELECT * FROM STREAM(prod.bronze.telemetry_events);

-- Gold: materialized view joining all three Silver tables
CREATE OR REFRESH MATERIALIZED VIEW prod.gold.customer_activity AS
SELECT
  o.order_id, o.customer_id, c.customer_name, c.email,
  o.order_amount, o.order_status,
  COUNT(e.event_id) AS total_events,
  SUM(CASE WHEN e.event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_events
FROM prod.silver.orders o
LEFT JOIN prod.silver.customers c ON o.customer_id = c.customer_id
LEFT JOIN prod.silver.telemetry_events e
  ON CAST(o.customer_id AS STRING) = e.user_id  -- user_id in telemetry is string
GROUP BY o.order_id, o.customer_id, c.customer_name, c.email,
  o.order_amount, o.order_status; 

4. Click Settings (gear icon) → set Pipeline mode: Continuous → Target catalog: prod → Save 

5. Click Start — the editor switches to the live Graph view 
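To build intuition for what APPLY CHANGES does with the two SCD modes, here is a tiny in-memory Python sketch. This is illustrative only, not the SDP implementation; the function names and the `__start_at`/`__end_at` field names are assumptions made for the example:

```python
from typing import Dict, List

def apply_scd1(current: Dict[str, dict], change: dict, key: str) -> Dict[str, dict]:
    """SCD Type 1: upsert in place -- only the latest version per key survives."""
    current[change[key]] = change
    return current

def apply_scd2(history: List[dict], change: dict, key: str, seq: str) -> List[dict]:
    """SCD Type 2: close the currently open row for this key, then
    append the new version, so the full history is preserved."""
    for row in history:
        if row[key] == change[key] and row["__end_at"] is None:
            row["__end_at"] = change[seq]  # close out the previous version
    history.append({**change, "__start_at": change[seq], "__end_at": None})
    return history

orders: Dict[str, dict] = {}
apply_scd1(orders, {"order_id": "o1", "status": "placed",  "updated_at": 1}, "order_id")
apply_scd1(orders, {"order_id": "o1", "status": "shipped", "updated_at": 2}, "order_id")
# orders now holds only the "shipped" version of o1

customers: List[dict] = []
apply_scd2(customers, {"customer_id": "c1", "email": "a@x.com", "updated_at": 1},
           "customer_id", "updated_at")
apply_scd2(customers, {"customer_id": "c1", "email": "b@x.com", "updated_at": 2},
           "customer_id", "updated_at")
# customers holds both versions; the first is closed at updated_at = 2
```

This is exactly the hand-rolled MERGE logic SDP replaces: with AutoCDC you declare KEYS and SEQUENCE BY, and the engine handles ordering, late data, and deletes.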

 Step 4: Govern with Unity Catalog 

All tables from Steps 1–3 are automatically registered in Unity Catalog, Databricks’ built-in governance and security offering, with full lineage. No manual registration needed. 

View lineage 

  1. Go to Catalog → prod → gold → customer_activity
  2. Click the Lineage tab → See Lineage Graph
  3. Click the expand icon on each upstream node to reveal the full chain: Bronze sources → Silver → Gold

Set Permissions


-- Grant analysts read access to the Gold layer only
GRANT SELECT ON TABLE prod.gold.customer_activity TO `analysts@contoso.com`;
-- Mask PII for non-privileged users
CREATE FUNCTION prod.security.mask_email(email STRING)
  RETURNS STRING
  RETURN CASE WHEN is_account_group_member('data-engineers') THEN email
              ELSE CONCAT(LEFT(email, 2), '***@***.com') END;
ALTER TABLE prod.silver.customers
  ALTER COLUMN email SET MASK prod.security.mask_email; 
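The mask's CASE logic is easy to sanity-check outside SQL. A Python equivalent, purely for illustration (the privileged-group check is reduced to a boolean parameter here):

```python
def mask_email(email: str, is_privileged: bool) -> str:
    """Mirror of the SQL column mask: privileged users see the full
    address; everyone else sees the first two characters plus a
    redacted domain."""
    if is_privileged:
        return email
    return email[:2] + "***@***.com"

masked = mask_email("jane.doe@contoso.com", is_privileged=False)
```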

 

Step 5: Orchestrate and Monitor with Lakeflow Jobs 

Wire the Connect pipeline and SDP pipeline into a single job with dependencies, scheduling, and alerting, all from the UI with Lakeflow Jobs. 

Prerequisites 

  • Pipelines from Steps 2 and 3 saved in the workspace 

Setup 

  1. Go to Jobs & Pipelines → Create → Job
  2. Task 1: click the Pipeline tile → name it ingest_sql_server_cdc → select your Lakeflow Connect pipeline → Create task
  3. Task 2: click + Add task → Pipeline → name it transform_bronze_to_gold → select lakeflow-demo-pipeline → set Depends on: ingest_sql_server_cdc → Create task
  4. In the Job details panel on the right: click Add schedule → set frequency → add email notification on failure → Save
  5. Click Run now to trigger a run, then click the run ID to open the Run detail view 

For health monitoring across all jobs, query system tables in any notebook or SQL warehouse: 

SELECT job_name, result_state, DATEDIFF(second, start_time, end_time) AS duration_sec
FROM system.lakeflow.job_run_timeline
WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL 24 HOURS
ORDER BY start_time DESC;
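Downstream of that query you might compute a simple failure rate for alerting. A hypothetical sketch over rows shaped like the query output (the row dicts and threshold are assumptions made for the example):

```python
def failure_rate(runs: list[dict]) -> float:
    """Fraction of runs whose result_state is not SUCCEEDED."""
    if not runs:
        return 0.0
    failed = sum(1 for r in runs if r["result_state"] != "SUCCEEDED")
    return failed / len(runs)

# Rows shaped like the system-table query output above
runs = [
    {"job_name": "cdc_job", "result_state": "SUCCEEDED", "duration_sec": 312},
    {"job_name": "cdc_job", "result_state": "FAILED",    "duration_sec": 45},
]
rate = failure_rate(runs)
```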

 

Step 6: Visualize with AI/BI Dashboards and Genie 

AI/BI Dashboards let you create AI-powered, low-code dashboards. 

  1. Click + New → Dashboard
  2. Click Add a visualization, connect to prod.gold.customer_activity, and build charts
  3. Click Publish — viewers see data under their own Unity Catalog permissions automatically 

Genie lets you interact with your data using natural language. 

1. In the sidebar, click Genie → New

2. On Choose data sources, select prod.gold.customer_activity → Create 

3. Add context in the Instructions box (e.g., table relationships, business definitions)

4. Switch to the Chat tab and ask a question:

"Which customers have the highest total events and what were their order amounts?" 

5. Genie generates and executes SQL, returning a result table. Click View SQL to inspect the query. 

Everything in One Platform 

Capability · Lakeflow · Previously Required
  • Telemetry ingestion · Zerobus Ingest · Message bus + custom consumer
  • Database CDC · Lakeflow Connect · Custom scripts or 3rd-party tools
  • Transformation + AutoCDC · Spark Declarative Pipelines · Hand-rolled MERGE logic
  • Data quality · SDP Expectations · Separate validation tooling
  • Orchestration · Lakeflow Jobs · External schedulers (Airflow, etc.)
  • Governance · Unity Catalog · Disconnected ACLs and lineage
  • Monitoring · Job UI + System Tables · Separate APM tools
  • BI + NL Query · AI/BI Dashboards + Genie · External BI tools

Customers seeing results on Azure Databricks: 

  • Ahold Delhaize — 4.5x faster deployment and 50% cost reduction running 1,000+ ingestion jobs daily
  • Porsche Holding — 85% faster ingestion pipeline development vs. a custom-built solution 

Next Steps 

Updated Mar 16, 2026
Version 1.0