The Challenge: Too Many Tools, Not Enough Clarity
Modern data teams on Azure often stitch together separate orchestrators, custom streaming consumers, hand-rolled transformation notebooks, and third-party connectors — each with its own monitoring UI, credential system, and failure modes. The result is observability gaps, weeks of work per new data source, disconnected lineage, and governance bolted on as an afterthought.
Lakeflow, Databricks’ unified data engineering solution, solves this by consolidating ingestion, transformation, and orchestration natively inside Azure Databricks — governed end-to-end by Unity Catalog.
| Component | What It Does |
| --- | --- |
| Lakeflow Connect | Point-and-click connectors for databases (using CDC), SaaS apps, files, streaming, and Zerobus for direct telemetry |
| Spark Declarative Pipelines | Declarative ETL with AutoCDC, data quality enforcement, and automatic incremental processing |
| Lakeflow Jobs | Managed orchestration with 99.95% uptime, a visual task DAG, and repair-and-rerun |
Architecture
Step 1: Stream Application Telemetry with Zerobus Ingest
Zerobus Ingest, part of Lakeflow Connect, lets your application push events directly to a Delta table over gRPC — no message bus, no Structured Streaming job. You get sub-5-second latency and up to 100 MB/sec per connection, with data immediately queryable in Unity Catalog.
Prerequisites
- Azure Databricks workspace with Unity Catalog enabled and serverless compute on
- A service principal with write access to the target table
Setup
First, create the target table in a SQL notebook:
```sql
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA IF NOT EXISTS prod.bronze;

CREATE TABLE IF NOT EXISTS prod.bronze.telemetry_events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  session_id STRING,
  ts BIGINT,
  page STRING,
  duration_ms INT
);
```
1. Go to Settings → Identity and Access → Service Principals → Add service principal
2. Open the service principal → Secrets tab → Generate secret. Save the Client ID and secret.
3. In a SQL notebook, grant access:
```sql
GRANT USE CATALOG ON CATALOG prod TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA prod.bronze TO `<client-id>`;
GRANT MODIFY, SELECT ON TABLE prod.bronze.telemetry_events TO `<client-id>`;
```
4. Derive your Zerobus endpoint from your workspace URL: <workspace-id>.zerobus.<region>.azuredatabricks.net (The workspace ID is the number in your workspace URL, e.g. adb-**1234567890**.12.azuredatabricks.net)
5. Install the SDK: pip install databricks-zerobus-ingest-sdk
6. In your application, open a stream and push records:
```python
from zerobus.sdk.sync import ZerobusSdk
from zerobus.sdk.shared import RecordType, StreamConfigurationOptions, TableProperties

sdk = ZerobusSdk("<workspace-id>.zerobus.<region>.azuredatabricks.net",
                 "https://<workspace-url>")

stream = sdk.create_stream(
    "<client-id>", "<client-secret>",
    TableProperties("prod.bronze.telemetry_events"),
    StreamConfigurationOptions(record_type=RecordType.JSON),
)

stream.ingest_record({"event_id": "e1", "user_id": "u42",
                      "event_type": "page_view", "ts": 1700000000000})
stream.close()
```
7. Verify in Catalog → prod → bronze → telemetry_events → Sample Data
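The endpoint derivation in step 4 is easy to get wrong by hand. Here is a small sketch that extracts the workspace ID from a standard Azure workspace URL; `derive_zerobus_endpoint` is a hypothetical helper for illustration, not part of the SDK:

```python
import re

def derive_zerobus_endpoint(workspace_url: str, region: str) -> str:
    """Build the Zerobus endpoint from an Azure Databricks workspace URL.

    Assumes the standard URL shape: adb-<workspace-id>.<n>.azuredatabricks.net.
    """
    match = re.search(r"adb-(\d+)\.\d+\.azuredatabricks\.net", workspace_url)
    if not match:
        raise ValueError(f"Unrecognized workspace URL: {workspace_url}")
    workspace_id = match.group(1)
    return f"{workspace_id}.zerobus.{region}.azuredatabricks.net"

print(derive_zerobus_endpoint("adb-1234567890.12.azuredatabricks.net", "eastus2"))
# 1234567890.zerobus.eastus2.azuredatabricks.net
```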
Step 2: Ingest from On-Premises SQL Server via CDC
Lakeflow Connect reads SQL Server's transaction log incrementally — no full table scans, no custom extraction software. Connectivity to your on-prem server is over Azure ExpressRoute.
Prerequisites
- SQL Server reachable from Databricks over ExpressRoute (TCP port 1433)
- CDC enabled on the source database and tables (see setup below)
- A SQL login with CDC read permissions on the source database
- Databricks: CREATE CONNECTION privilege on the metastore; USE CATALOG, CREATE TABLE on the destination catalog
Setup
Enable CDC on SQL Server:
```sql
USE YourDatabase;
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo', @source_name = N'orders', @role_name = NULL;
EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo', @source_name = N'customers', @role_name = NULL;
```
Configure the connector in Databricks:
- Click Data Ingestion in the sidebar (or + New → Add Data)
- Select SQL Server from the connector list
- Ingestion Gateway page — enter a gateway name, select staging catalog/schema, click Next
- Ingestion Pipeline page — name the pipeline, click Create connection:
- Host: your on-prem IP (e.g. 10.0.1.50) · Port: 1433 · Database: YourDatabase
- Enter credentials, click Create, then Create pipeline and continue
- Source page — expand the database tree, check dbo.orders and dbo.customers; optionally enable History tracking (SCD Type 2) per table. Set Destination table name to orders_raw and customers_raw respectively.
- Destination page — set catalog: prod, schema: bronze, click Save and continue
- Settings page — set a sync schedule (e.g. every 5 minutes), click Save and run pipeline
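The History tracking option on the Source page keeps every version of a row (SCD Type 2) instead of overwriting it. To build intuition for the difference, here is a plain-Python sketch of the two behaviors applied to the same change feed; it is a conceptual illustration, not connector code:

```python
def apply_scd1(table, change):
    """SCD Type 1: overwrite the current row for the key."""
    table[change["customer_id"]] = {k: v for k, v in change.items()
                                    if k != "customer_id"}

def apply_scd2(history, change, ts):
    """SCD Type 2: close the open version for the key, then append a new one."""
    key = change["customer_id"]
    for row in history:
        if row["customer_id"] == key and row["end_at"] is None:
            row["end_at"] = ts  # close out the previous version
    history.append({**change, "start_at": ts, "end_at": None})

current, history = {}, []
for ts, change in enumerate([
    {"customer_id": 1, "email": "a@old.com"},
    {"customer_id": 1, "email": "a@new.com"},  # update arrives via CDC
]):
    apply_scd1(current, change)
    apply_scd2(history, change, ts)

print(current)  # only the latest email survives
print(history)  # both versions kept, with validity ranges
```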
Step 3: Transform with Spark Declarative Pipelines
The Lakeflow Pipelines Editor is an IDE for building Lakeflow Spark Declarative Pipelines (SDP): you define the Bronze → Silver → Gold flow in SQL, and SDP handles incremental execution, schema evolution, and lineage automatically.
Prerequisites
- Bronze tables populated (from Steps 1 and 2)
- CREATE TABLE and USE SCHEMA privileges on prod.silver and prod.gold
Setup
1. In the sidebar, click Jobs & Pipelines → ETL pipeline → Start with an empty file → SQL
2. Rename the pipeline (click the name at top) to lakeflow-demo-pipeline
3. Paste the following SQL:
```sql
-- Silver: latest order state (SCD Type 1)
CREATE OR REFRESH STREAMING TABLE prod.silver.orders;
APPLY CHANGES INTO prod.silver.orders
FROM STREAM(prod.bronze.orders_raw)
KEYS (order_id) SEQUENCE BY updated_at STORED AS SCD TYPE 1;

-- Silver: full customer history (SCD Type 2)
CREATE OR REFRESH STREAMING TABLE prod.silver.customers;
APPLY CHANGES INTO prod.silver.customers
FROM STREAM(prod.bronze.customers_raw)
KEYS (customer_id) SEQUENCE BY updated_at STORED AS SCD TYPE 2;

-- Silver: telemetry with data quality check
CREATE OR REFRESH STREAMING TABLE prod.silver.telemetry_events (
  CONSTRAINT valid_event_type
    EXPECT (event_type IN ('page_view', 'add_to_cart', 'purchase'))
    ON VIOLATION DROP ROW
) AS SELECT * FROM STREAM(prod.bronze.telemetry_events);

-- Gold: materialized view joining all three Silver tables
CREATE OR REFRESH MATERIALIZED VIEW prod.gold.customer_activity AS
SELECT
  o.order_id, o.customer_id, c.customer_name, c.email,
  o.order_amount, o.order_status,
  COUNT(e.event_id) AS total_events,
  SUM(CASE WHEN e.event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_events
FROM prod.silver.orders o
LEFT JOIN prod.silver.customers c ON o.customer_id = c.customer_id
LEFT JOIN prod.silver.telemetry_events e
  ON CAST(o.customer_id AS STRING) = e.user_id  -- user_id in telemetry is a string
GROUP BY o.order_id, o.customer_id, c.customer_name, c.email,
         o.order_amount, o.order_status;
```
4. Click Settings (gear icon) → set Pipeline mode: Continuous → Target catalog: prod → Save
5. Click Start — the editor switches to the live Graph view
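The `valid_event_type` constraint in the SQL above silently drops rows outside the allowed set while recording how many were rejected. Conceptually, `EXPECT ... ON VIOLATION DROP ROW` behaves like a counting filter; this plain-Python sketch illustrates the idea (it is not SDP internals):

```python
ALLOWED = {"page_view", "add_to_cart", "purchase"}

def expect_drop_row(rows, predicate):
    """Keep rows that satisfy the expectation; count the ones dropped."""
    kept, dropped = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
        else:
            dropped += 1  # surfaced as a data-quality metric in SDP
    return kept, dropped

events = [
    {"event_id": "e1", "event_type": "page_view"},
    {"event_id": "e2", "event_type": "bot_probe"},  # fails the expectation
    {"event_id": "e3", "event_type": "purchase"},
]
kept, dropped = expect_drop_row(events, lambda r: r["event_type"] in ALLOWED)
print(len(kept), dropped)  # 2 1
```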
Step 4: Govern with Unity Catalog
All tables from Steps 1–3 are automatically registered in Unity Catalog, Databricks’ built-in governance and security offering, with full lineage. No manual registration needed.
View lineage
- Go to Catalog → prod → gold → customer_activity
- Click the Lineage tab → See Lineage Graph
- Click the expand icon on each upstream node to reveal the full chain: Bronze sources → Silver → Gold
Set Permissions
```sql
-- Grant analysts read access to the Gold layer only
GRANT SELECT ON TABLE prod.gold.customer_activity TO `analysts@contoso.com`;

-- Mask PII for non-privileged users
CREATE SCHEMA IF NOT EXISTS prod.security;  -- holds governance helper functions

CREATE FUNCTION prod.security.mask_email(email STRING)
RETURNS STRING
RETURN CASE WHEN is_account_group_member('data-engineers') THEN email
            ELSE CONCAT(LEFT(email, 2), '***@***.com') END;

ALTER TABLE prod.silver.customers
  ALTER COLUMN email SET MASK prod.security.mask_email;
```
Step 5: Orchestrate and Monitor with Lakeflow Jobs
Wire the Connect pipeline and SDP pipeline into a single job with dependencies, scheduling, and alerting, all from the UI with Lakeflow Jobs.
Prerequisites
- Pipelines from Steps 2 and 3 saved in the workspace
Setup
- Go to Jobs & Pipelines → Create → Job
- Task 1: click the Pipeline tile → name it ingest_sql_server_cdc → select your Lakeflow Connect pipeline → Create task
- Task 2: click + Add task → Pipeline → name it transform_bronze_to_gold → select lakeflow-demo-pipeline → set Depends on: ingest_sql_server_cdc → Create task
- In the Job details panel on the right: click Add schedule → set frequency → add email notification on failure → Save
- Click Run now to trigger a run, then click the run ID to open the Run detail view
For health monitoring across all jobs, query system tables in any notebook or SQL warehouse:
```sql
SELECT job_name, result_state,
       DATEDIFF(SECOND, start_time, end_time) AS duration_sec
FROM system.lakeflow.job_run_timeline
WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL 24 HOURS
ORDER BY start_time DESC;
```
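The same health check can also run programmatically, for example in an alerting script. A sketch of the aggregation, with run records hard-coded; in practice you would fetch them from `system.lakeflow.job_run_timeline` through a SQL connector:

```python
from datetime import datetime

runs = [  # stand-in for rows from system.lakeflow.job_run_timeline
    {"job_name": "nightly_etl", "result_state": "SUCCEEDED",
     "start_time": datetime(2024, 1, 1, 0, 0), "end_time": datetime(2024, 1, 1, 0, 5)},
    {"job_name": "nightly_etl", "result_state": "FAILED",
     "start_time": datetime(2024, 1, 2, 0, 0), "end_time": datetime(2024, 1, 2, 0, 1)},
]

def summarize(runs):
    """Per-job failure count and average duration in seconds."""
    out = {}
    for r in runs:
        s = out.setdefault(r["job_name"], {"failures": 0, "durations": []})
        if r["result_state"] == "FAILED":
            s["failures"] += 1
        s["durations"].append((r["end_time"] - r["start_time"]).total_seconds())
    return {job: {"failures": s["failures"],
                  "avg_duration_sec": sum(s["durations"]) / len(s["durations"])}
            for job, s in out.items()}

print(summarize(runs))  # {'nightly_etl': {'failures': 1, 'avg_duration_sec': 180.0}}
```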
Step 6: Visualize with AI/BI Dashboards and Genie
AI/BI Dashboards let you build AI-assisted, low-code dashboards directly on governed tables.
- Click + New → Dashboard
- Click Add a visualization, connect to prod.gold.customer_activity, and build charts
- Click Publish — viewers see data under their own Unity Catalog permissions automatically
Genie lets you query your data using natural language.
1. In the sidebar, click Genie → New
2. On Choose data sources, select prod.gold.customer_activity → Create
3. Add context in the Instructions box (e.g., table relationships, business definitions)
4. Switch to the Chat tab and ask a question:
"Which customers have the highest total events and what were their order amounts?"
5. Genie generates and executes SQL, returning a result table. Click View SQL to inspect the query.
Everything in One Platform
| Capability | Lakeflow | Previously Required |
| --- | --- | --- |
| Telemetry ingestion | Zerobus Ingest | Message bus + custom consumer |
| Database CDC | Lakeflow Connect | Custom scripts or 3rd-party tools |
| Transformation + AutoCDC | Spark Declarative Pipelines | Hand-rolled MERGE logic |
| Data quality | SDP Expectations | Separate validation tooling |
| Orchestration | Lakeflow Jobs | External schedulers (Airflow, etc.) |
| Governance | Unity Catalog | Disconnected ACLs and lineage |
| Monitoring | Job UI + System Tables | Separate APM tools |
| BI + NL Query | AI/BI Dashboards + Genie | External BI tools |
Customers seeing results on Azure Databricks:
- Ahold Delhaize — 4.5x faster deployment and 50% cost reduction running 1,000+ ingestion jobs daily
- Porsche Holding — 85% faster ingestion pipeline development vs. a custom-built solution