# Azure Managed Redis & Azure Databricks: Real-time Feature Serving for Low-Latency Decisions
*This blog is a collaboration between the Azure Databricks and Azure Managed Redis product and product marketing teams.*

## Executive summary

Modern decisioning systems (fraud scoring, payments authorization, personalization, and step-up authentication) must return answers in tens of milliseconds while still reflecting the most recent behavior. That creates a classic tension: lakehouse platforms excel at large-scale ingestion, feature engineering, governance, training, and replayable history, but they are not designed to sit directly on the synchronous request path for high-QPS, ultra-low-latency lookups.

This guide shows a pattern that keeps Azure Databricks as the primary system for building and maintaining features, while using Azure Managed Redis as the online speed layer that serves those features at memory speed for real-time scoring. The result is a shorter and more predictable critical path for your application: the Payment API (or any online service) reads features from Azure Managed Redis and calls a model endpoint; Azure Databricks continuously refreshes features from streaming and batch sources; and your authoritative systems of record (for example, account/card data) remain durable and governed. You get real-time responsiveness without giving up data correctness, lineage, or operational discipline.

## What each service does

**Azure Databricks** is a first-party analytics and AI platform on Azure built on Apache Spark and the lakehouse architecture. It is commonly used for batch and streaming pipelines, feature engineering, model training, governance, and operationalization of ML workflows. In this architecture, Azure Databricks is the primary data and AI platform where features are defined, computed, validated, and published, and where governed history is retained.

**Azure Managed Redis** is a Microsoft-managed, in-memory data store based on Redis Enterprise, designed for low-latency, high-throughput access patterns.
It is commonly used for traditional and real-time caching, counters, and session state, and increasingly as a fast state layer for AI-driven applications. In this architecture, Azure Managed Redis serves as the online feature store and speed layer: it holds the most recent feature values and signals required for real-time scoring, and it can also support modern agentic patterns such as short- and long-term memory, vector lookups, and fast state access alongside model inference.

## Business story: real-time fraud scoring as a running example

Consider a payment system that must decide whether to approve, decline, or step up authentication in tens of milliseconds (faster than a blink of an eye). The decision depends on recent behavioral signals, velocity counters, device changes, geo anomalies, and merchant patterns, combined with a fraud model. If the online service tries to compute or retrieve those features from heavy analytics systems on demand, the request path becomes slower and more variable, especially at peak load. Instead, Azure Databricks pipelines continuously compute and refresh those features, and Azure Managed Redis serves them instantly to the scoring service. Behavioral history, profiles, and outcomes are still written to durable Azure data stores such as Delta tables and Azure Cosmos DB, so fraud models can be retrained with governed, reproducible data.

## The pattern: online feature serving with a speed layer

The core idea is to separate responsibilities. Azure Databricks owns *building* features: ingest, join, aggregate, compute windows, and publish validated, governed results. Azure Managed Redis owns *serving* features: fast, repeated key-based access on the hot path. The model endpoint then consumes a feature payload that is already pre-shaped for inference. This division prevents the lakehouse from becoming an online dependency and lets you scale online decisioning independently from offline compute.
## Pseudocode: end-to-end flow (online scoring + feature refresh)

The pseudocode below intentionally reads like application logic rather than a single SDK. It highlights what matters: key design, pipelined feature reads, conservative fallbacks, and continuous refresh from Azure Databricks.

```
# ----------------------------
# Online scoring (critical path)
# ----------------------------
function handleAuthorization(req):
    schemaV  = "v3"
    keys     = buildFeatureKeys(schemaV, req)   # card/device/merchant + windows
    feats    = redis.MGET(keys)                 # single round trip (pipelined)
    feats    = fillDefaults(feats)              # conservative, no blocking
    payload  = toModelPayload(req, feats)
    score    = modelEndpoint.predict(payload)   # Databricks Model Serving or an Azure-hosted model endpoint
    decision = policy(score, req)               # approve/decline/step-up
    emitEventHub("txn_events", summarize(req, score, decision))  # async
    emitMetrics(redisLatencyMs, modelLatencyMs, missCount(feats))
    return decision

# -----------------------------------------
# Feature pipeline (async): build + publish
# -----------------------------------------
function streamingFeaturePipeline():
    events = readEventHubs("txn_events")
    ref    = readCosmos("account_card_reference")  # system of record lookups
    feats  = computeFeatures(events, ref)          # windows, counters, signals
    writeDelta("fraud_feature_history", feats)     # ADLS Delta tables (lakehouse)
    publishLatestToRedis(feats, schemaV="v3")      # SET/HSET + TTL (+ jitter)

# -----------------------------------
# Training + deploy (async lifecycle)
# -----------------------------------
function trainAndDeploy():
    hist   = readDelta("fraud_feature_history")
    labels = readCosmos("fraud_outcomes")          # delayed ground truth
    model  = train(joinPointInTime(hist, labels))
    register(model)
    deployToDatabricksModelServing(model)
```

## Why it works

This architecture works because each layer does the job it is best at. The lakehouse and feature pipelines handle heavy computation, validation, lineage, and replayable history.
The online speed layer handles locality and frequency: it keeps the "hot" feature state close to the online compute so requests do not pay the cost of re-computation or large fan-out reads. You explicitly control freshness with TTLs and refresh cadence, and you keep clear correctness boundaries by treating Azure Managed Redis as a serving layer rather than the authoritative system of record, with durable, governed feature history and labels stored in Delta tables and Azure data stores such as Azure Cosmos DB.

## Design choices that matter

**Cost efficiency and availability** start with clear separation of concerns. Serving hot features from Azure Managed Redis avoids sizing analytics infrastructure for high-QPS, low-latency SLAs, and enables predictable capacity planning with regional isolation for online services. Azure Databricks remains optimized for correctness, freshness, and replayable history while the online tier scales independently by request rate and working set size.

**Freshness and TTLs** should reflect business tolerance for staleness and the meaning of each feature. Short velocity windows need TTLs slightly longer than ingestion gaps, while profiles and reference features can live longer. Adding jitter (for example ±10%) prevents synchronized expirations that create load spikes.

**Key design** is the control plane for safe evolution and availability. Include explicit schema version prefixes and keep keys stable by entity and window. Publish new versions alongside existing ones, switch readers, and retire old versions to enable zero-downtime rollouts.

**Protect the online path** from stampedes and unnecessary cost. If a hot key is missing, avoid triggering widespread re-computation in downstream systems. Use a short single-flight mechanism and conservative defaults, especially for risk-sensitive decisions.

**Keep payloads compact** so performance and cost remain predictable. Online feature reads are fastest when values are small and fetched in one or two round trips.
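To make the key-design, TTL, and payload guidance concrete, here is a minimal, self-contained Python sketch (not from the original article): versioned keys per entity, publishes with jittered TTLs, and a single-round-trip read that falls back to conservative defaults. A tiny in-memory stub stands in for a Redis client, so all feature names, key layouts, and TTL values are illustrative assumptions.

```python
import random

SCHEMA_V = "v3"  # explicit schema version prefix enables zero-downtime rollouts
DEFAULTS = {"txn_count_5m": 0.0, "txn_amount_1h": 0.0, "device_change_24h": 0.0}

def feature_key(schema_v, card_id, name):
    # keys stay stable by entity and window; only the version prefix evolves
    return f"{schema_v}:card:{card_id}:{name}"

def jittered_ttl(base_seconds, jitter=0.10):
    # +/-10% jitter prevents synchronized expirations that create load spikes
    return int(base_seconds * random.uniform(1 - jitter, 1 + jitter))

def publish_features(client, card_id, feats, base_ttl=600):
    # compact numeric encodings keep payloads small and reads predictable
    for name, value in feats.items():
        client.set(feature_key(SCHEMA_V, card_id, name), str(value), ex=jittered_ttl(base_ttl))

def fetch_features(client, card_id):
    names = list(DEFAULTS)
    raw = client.mget([feature_key(SCHEMA_V, card_id, n) for n in names])  # one round trip
    # conservative defaults on a miss: never block the hot path on recomputation
    return {n: (float(v) if v is not None else DEFAULTS[n]) for n, v in zip(names, raw)}

class FakeRedis:
    """In-memory stand-in for a Redis client; TTLs are accepted but ignored."""
    def __init__(self):
        self.data = {}
    def set(self, key, value, ex=None):
        self.data[key] = value
    def mget(self, keys):
        return [self.data.get(k) for k in keys]

r = FakeRedis()
publish_features(r, "c42", {"txn_count_5m": 7, "txn_amount_1h": 129.5})
print(fetch_features(r, "c42"))
# -> {'txn_count_5m': 7.0, 'txn_amount_1h': 129.5, 'device_change_24h': 0.0}
```

With a real redis-py client the same calls apply (`set(..., ex=ttl)` and `mget`), and the version prefix lets a hypothetical `v4` feature set be published alongside `v3` before readers switch.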
Favor numeric encodings and small blobs, and use atomic writes to avoid partial or inconsistent reads during scoring.

## Reference architecture notes (regional first, then global)

Start with a single-region deployment to validate end-to-end freshness and latency. Co-locate the Payment API compute, Azure Managed Redis, the model endpoint, and the primary data sources for feature pipelines to minimize round trips. Once the pattern is proven, extend to multi-region by deploying the online tier and its local speed layer per region, while keeping a clear strategy for how features are published and reconciled across regions (often via regional pipelines that consume the same event stream or replicated event hubs).

## Operations and SRE considerations

| Layer | What to Monitor | Why It Matters | Typical Signals / Metrics |
|---|---|---|---|
| Online service (API / scoring) | End-to-end request latency, error rate, fallback rate | Confirms the critical path meets application SLAs even under partial degradation | p50/p95/p99 latency, error %, step-up or conservative decision rate |
| Azure Managed Redis (speed layer) | Feature fetch latency, hit/miss ratio, memory pressure | Indicates whether the working set fits and whether TTLs align with access patterns | GET/MGET latency, miss %, evictions, memory usage |
| Model serving | Inference latency, throughput, saturation | Separates model execution cost from feature access cost | Inference p95 latency, QPS, concurrency utilization |
| Azure Databricks feature pipelines | Streaming lag, job health, data freshness | Ensures features are being refreshed on time and correctness is preserved | Event lag, job failures, watermark delay |
| Cross-layer boundaries | Correlation between misses, latency spikes, and pipeline lag | Helps identify whether regressions originate in serving, pipelines, or models | Redis miss spikes vs pipeline delays vs API latency |

Monitor each layer independently, then correlate at the boundaries.
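As a small illustration of correlating these signals (not part of the original table), the sketch below computes a p95 fetch latency and a miss ratio from raw samples using only the Python standard library; the thresholds, sample values, and alert rule are made-up assumptions.

```python
def p95(samples_ms):
    # nearest-rank approximation of the 95th percentile
    s = sorted(samples_ms)
    return s[int(0.95 * (len(s) - 1))]

def miss_ratio(hits, misses):
    total = hits + misses
    return misses / total if total else 0.0

# illustrative one-minute window of Redis MGET latencies and hit/miss counters
latencies = [1.2, 0.9, 1.1, 1.4, 0.8, 1.0, 9.5, 1.3, 1.1, 1.2]
hits, misses = 970, 30

# alert when either the speed layer slows down or the working set stops fitting
alert = p95(latencies) > 5.0 or miss_ratio(hits, misses) > 0.05
print(p95(latencies), miss_ratio(hits, misses), alert)  # -> 1.4 0.03 False
```

In production these numbers would come from your metrics pipeline rather than inline lists; the point is that each layer's signal is computed separately and only the alerting rule correlates them.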
This makes it clear whether an SLA issue is caused by online serving pressure, model inference, or delayed feature publication, without turning the lakehouse into a synchronous dependency.

## Putting it all together

Adopt the pattern incrementally. First, publish a small, high-value feature set from Azure Databricks into Azure Managed Redis and wire the online service to fetch those features during scoring. Measure end-to-end impact on latency, model quality, and operational stability. Next, extend to streaming refresh for near-real-time behavioral features, and add controlled fallbacks for partial misses. Finally, scale out to multi-region if needed, keeping each region's online service close to its local speed layer and ensuring the feature pipelines provide consistent semantics across regions.

## Sources and further reading

- Azure Databricks documentation: https://learn.microsoft.com/en-us/azure/databricks/
- Azure Managed Redis documentation (overview and architecture): https://learn.microsoft.com/azure/redis/
- Azure Architecture Center: Stream processing with Azure Databricks: https://learn.microsoft.com/azure/architecture/reference-architectures/data/stream-processing-databricks
- Databricks Feature Store / feature engineering docs (Azure Databricks): https://learn.microsoft.com/azure/databricks/

# Announcing the New Home for the Azure Databricks Blog
We're excited to share that the Azure Databricks blog has moved to a new address on the Microsoft Tech Community Hub: Azure Databricks | Microsoft Community Hub.

Our new blog home is designed to make it easier than ever for you to discover the latest product updates, deep technical insights, and real-world best practices directly from the Azure Databricks product team. Whether you're a data engineer, data scientist, or analytics leader, this is your go-to destination for staying informed and inspired.

## What You'll Find on the New Blog

At our new address, you can expect:

- **Latest Announcements** – Stay up to date with new features, capabilities, and releases
- **Best Practice Guidance** – Learn proven approaches for building scalable data and AI solutions
- **Technical Deep Dives** – Explore detailed walkthroughs and architecture insights
- **Customer Stories** – See how organizations are driving impact with Azure Databricks

## Why the Move?

This new blog gives us the flexibility to deliver a better reading experience, improved navigation, and richer content dedicated to Azure Databricks. It also allows us to bring you more frequent updates and more in-depth resources tailored to your needs.

## Stay Connected

We encourage you to bookmark the new blog and check back regularly. Even better, follow along so you never miss an update. By staying connected, you'll be among the first to hear about new features, performance improvements, and expert recommendations to help you get the most out of Azure Databricks.

👉 Follow the new Azure Databricks blog today and stay ahead with the latest announcements and best practices. We're looking forward to continuing this journey with you, now at our new home!
Check out the latest blogs if you haven't already:

- Introducing Lakeflow Connect Free Tier, now available on Azure Databricks | Microsoft Community Hub
- Near–Real-Time CDC to Delta Lake for BI and ML with Lakeflow on Azure Databricks | Microsoft Community Hub

# Introducing Lakeflow Connect Free Tier, now available on Azure Databricks
We're excited to introduce the Lakeflow Connect Free Tier on Azure Databricks, so you can easily bring your enterprise data into your lakehouse to build analytics and AI applications faster. Modern applications require reliable access to operational data, especially for training analytics and AI agents, but connecting and gathering data across silos can be challenging. With this new release, you can seamlessly ingest all of your enterprise data from SaaS and database sources to unlock data intelligence for your AI agents.

## Ingest millions of records per day, per workspace for free

This new Lakeflow Connect Free Tier provides 100 DBUs per day, per workspace, which allows you to ingest approximately 100 million records* from many popular data sources**, including SaaS applications and databases.

## Unlock your enterprise data for free with Lakeflow Connect

This new offering provides all the benefits of Lakeflow Connect, eliminating the heavy lifting so your teams can focus on unlocking data insights instead of managing infrastructure. In the past year, Databricks has continued rolling out several fully managed connectors supporting popular data sources. The free tier supports popular SaaS applications (Salesforce, ServiceNow, Google Analytics, Workday, Microsoft Dynamics 365) and top-used databases (SQL Server, Oracle, Teradata, PostgreSQL, MySQL, Snowflake, Redshift, Synapse, and BigQuery).

Lakeflow Connect benefits include:

- **Simple UI**: Avoid complex setups and architectural overhead; these fully managed connectors provide a simple UI and API to democratize data access. Automated features also help simplify pipeline maintenance with minimal overhead.
- **Efficient ingestion**: Increase efficiency and accelerate time to value. Optimized incremental reads and writes and data transformation help improve the performance and reliability of your pipelines, reduce bottlenecks, and reduce impact on the source data for scalability.
- **Unified with the Databricks Platform**: Create ingestion pipelines with governance from Unity Catalog, observability from Lakehouse Monitoring, and seamless orchestration with Lakeflow Jobs for analytics, AI, and BI.

## Availability

The Lakeflow Connect Free Tier is available starting today on Azure Databricks. If you are at FabCon in Atlanta, join the session "Accelerating Data and AI with Azure Databricks" on Thursday, March 19th, 8:00–9:00 AM, in room C302 to see how these capabilities come together to accelerate performance, simplify architecture, and maximize value on Azure.

## Getting Started Resources

To learn more about the Lakeflow Connect Free Tier and Lakeflow Connect, review our pricing page and documentation. Get started ingesting your data today for free; sign up with an Azure free account.

- Get started with Azure Databricks for free
- Product tour: Databricks Lakeflow Connect for Salesforce: Powering Smarter Selling with AI and Analytics
- Product tour: Effortless ServiceNow Data Ingestion with Databricks Lakeflow Connect
- Product tour: Simplify Data Ingestion with Lakeflow Connect: From Google Analytics to AI
- On-demand video: Use Lakeflow Connect for Salesforce to predict customer churn
- On-demand video: Databricks Lakeflow Connect for Workday Reports: Connect, Ingest, and Analyze Workday Data Without Complexity
- On-demand video: Data Ingestion With Lakeflow Connect

\* Your actual ingestion capacity will vary based on specific workload characteristics, record sizes, and source types.

\*\* Excludes Zerobus Ingest, Auto Loader, and other self-managed connectors. Customers will continue to incur charges for underlying infrastructure consumption from the cloud vendor.

# Near–Real-Time CDC to Delta Lake for BI and ML with Lakeflow on Azure Databricks
## The Challenge: Too Many Tools, Not Enough Clarity

Modern data teams on Azure often stitch together separate orchestrators, custom streaming consumers, hand-rolled transformation notebooks, and third-party connectors, each with its own monitoring UI, credential system, and failure modes. The result is observability gaps, weeks of work per new data source, disconnected lineage, and governance bolted on as an afterthought.

Lakeflow, Databricks' unified data engineering solution, solves this by consolidating ingestion, transformation, and orchestration natively inside Azure Databricks, governed end-to-end by Unity Catalog.

| Component | What It Does |
|---|---|
| Lakeflow Connect | Point-and-click connectors for databases (using CDC), SaaS apps, files, streaming, and Zerobus for direct telemetry |
| Lakeflow Spark Declarative Pipelines | Declarative ETL with AutoCDC, data quality enforcement, and automatic incremental processing |
| Lakeflow Jobs | Managed orchestration with 99.95% uptime, a visual task DAG, and repair-and-rerun |

## Architecture

## Step 1: Stream Application Telemetry with Zerobus Ingest

Zerobus Ingest, part of Lakeflow Connect, lets your application push events directly to a Delta table over gRPC: no message bus, no Structured Streaming job. It delivers sub-5-second latency and up to 100 MB/sec per connection, with data immediately queryable in Unity Catalog.

### Prerequisites

- Azure Databricks workspace with Unity Catalog enabled and serverless compute on
- A service principal with write access to the target table

### Setup

First, create the target table in a SQL notebook:

```sql
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA IF NOT EXISTS prod.bronze;

CREATE TABLE IF NOT EXISTS prod.bronze.telemetry_events (
  event_id    STRING,
  user_id     STRING,
  event_type  STRING,
  session_id  STRING,
  ts          BIGINT,
  page        STRING,
  duration_ms INT
);
```

1. Go to Settings → Identity and Access → Service Principals → Add service principal
2. Open the service principal → Secrets tab → Generate secret. Save the Client ID and secret.
3. In a SQL notebook, grant access:

```sql
GRANT USE CATALOG ON CATALOG prod TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA prod.bronze TO `<client-id>`;
GRANT MODIFY, SELECT ON TABLE prod.bronze.telemetry_events TO `<client-id>`;
```

4. Derive your Zerobus endpoint from your workspace URL: `<workspace-id>.zerobus.<region>.azuredatabricks.net` (the workspace ID is the number in your workspace URL, e.g. adb-**1234567890**.12.azuredatabricks.net)
5. Install the SDK: `pip install databricks-zerobus-ingest-sdk`
6. In your application, open a stream and push records:

```python
from zerobus.sdk.sync import ZerobusSdk
from zerobus.sdk.shared import RecordType, StreamConfigurationOptions, TableProperties

sdk = ZerobusSdk(
    "<workspace-id>.zerobus.<region>.azuredatabricks.net",
    "https://<workspace-url>",
)

stream = sdk.create_stream(
    "<client-id>",
    "<client-secret>",
    TableProperties("prod.bronze.telemetry_events"),
    StreamConfigurationOptions(record_type=RecordType.JSON),
)

stream.ingest_record({"event_id": "e1", "user_id": "u42", "event_type": "page_view", "ts": 1700000000000})
stream.close()
```

7. Verify in Catalog → prod → bronze → telemetry_events → Sample Data

## Step 2: Ingest from On-Premises SQL Server via CDC

Lakeflow Connect reads SQL Server's transaction log incrementally: no full table scans, no custom extraction software. Connectivity to your on-prem server is over Azure ExpressRoute.
### Prerequisites

- SQL Server reachable from Databricks over ExpressRoute (TCP port 1433)
- CDC enabled on the source database and tables (see setup below)
- A SQL login with CDC read permissions on the source database
- Databricks: CREATE CONNECTION privilege on the metastore; USE CATALOG and CREATE TABLE on the destination catalog

### Setup

Enable CDC on SQL Server:

```sql
USE YourDatabase;
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo',
  @source_name   = N'orders',
  @role_name     = NULL;

EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo',
  @source_name   = N'customers',
  @role_name     = NULL;
```

Configure the connector in Databricks:

1. Click Data Ingestion in the sidebar (or + New → Add Data)
2. Select SQL Server from the connector list
3. Ingestion Gateway page: enter a gateway name, select staging catalog/schema, click Next
4. Ingestion Pipeline page: name the pipeline, click Create connection (Host: your on-prem IP, e.g. 10.0.1.50 · Port: 1433 · Database: YourDatabase), enter credentials, click Create, then Create pipeline and continue
5. Source page: expand the database tree, check dbo.orders and dbo.customers; optionally enable History tracking (SCD Type 2) per table. Set Destination table name to orders_raw and customers_raw respectively.
6. Destination page: set catalog: prod, schema: bronze, click Save and continue
7. Settings page: set a sync schedule (e.g. every 5 minutes), click Save and run pipeline

## Step 3: Transform with Spark Declarative Pipelines

The Lakeflow Pipelines Editor is an IDE built for developing pipelines with Lakeflow Spark Declarative Pipelines (SDP), and it lets you define Bronze → Silver → Gold in SQL. SDP then handles incremental execution, schema evolution, and lineage automatically.

### Prerequisites

- Bronze tables populated (from Steps 1 and 2)
- CREATE TABLE and USE SCHEMA privileges on prod.silver and prod.gold

### Setup

1. In the sidebar, click Jobs & Pipelines → ETL pipeline → Start with an empty file → SQL
2. Rename the pipeline (click the name at top) to lakeflow-demo-pipeline
3. Paste the following SQL:

```sql
-- Silver: latest order state (SCD Type 1)
CREATE OR REFRESH STREAMING TABLE prod.silver.orders;

APPLY CHANGES INTO prod.silver.orders
FROM STREAM(prod.bronze.orders_raw)
KEYS (order_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 1;

-- Silver: full customer history (SCD Type 2)
CREATE OR REFRESH STREAMING TABLE prod.silver.customers;

APPLY CHANGES INTO prod.silver.customers
FROM STREAM(prod.bronze.customers_raw)
KEYS (customer_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 2;

-- Silver: telemetry with data quality check
CREATE OR REFRESH STREAMING TABLE prod.silver.telemetry_events (
  CONSTRAINT valid_event_type
    EXPECT (event_type IN ('page_view', 'add_to_cart', 'purchase'))
    ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(prod.bronze.telemetry_events);

-- Gold: materialized view joining all three Silver tables
CREATE OR REFRESH MATERIALIZED VIEW prod.gold.customer_activity AS
SELECT
  o.order_id,
  o.customer_id,
  c.customer_name,
  c.email,
  o.order_amount,
  o.order_status,
  COUNT(e.event_id) AS total_events,
  SUM(CASE WHEN e.event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_events
FROM prod.silver.orders o
LEFT JOIN prod.silver.customers c
  ON o.customer_id = c.customer_id
LEFT JOIN prod.silver.telemetry_events e
  ON CAST(o.customer_id AS STRING) = e.user_id  -- user_id in telemetry is a string
GROUP BY
  o.order_id, o.customer_id, c.customer_name, c.email, o.order_amount, o.order_status;
```

4. Click Settings (gear icon) → set Pipeline mode: Continuous → Target catalog: prod → Save
5. Click Start; the editor switches to the live Graph view

## Step 4: Govern with Unity Catalog

All tables from Steps 1–3 are automatically registered in Unity Catalog, Databricks' built-in governance and security offering, with full lineage. No manual registration needed.
### View lineage

1. Go to Catalog → prod → gold → customer_activity
2. Click the Lineage tab → See Lineage Graph
3. Click the expand icon on each upstream node to reveal the full chain: Bronze sources → Silver → Gold

### Set permissions

```sql
-- Grant analysts read access to the Gold layer only
GRANT SELECT ON TABLE prod.gold.customer_activity TO `analysts@contoso.com`;

-- Mask PII for non-privileged users
CREATE FUNCTION prod.security.mask_email(email STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('data-engineers') THEN email
  ELSE CONCAT(LEFT(email, 2), '***@***.com')
END;

ALTER TABLE prod.silver.customers
  ALTER COLUMN email SET MASK prod.security.mask_email;
```

## Step 5: Orchestrate and Monitor with Lakeflow Jobs

Wire the Connect pipeline and SDP pipeline into a single job with dependencies, scheduling, and alerting, all from the UI with Lakeflow Jobs.

### Prerequisites

- Pipelines from Steps 2 and 3 saved in the workspace

### Setup

1. Go to Jobs & Pipelines → Create → Job
2. Task 1: click the Pipeline tile → name it ingest_sql_server_cdc → select your Lakeflow Connect pipeline → Create task
3. Task 2: click + Add task → Pipeline → name it transform_bronze_to_gold → select lakeflow-demo-pipeline → set Depends on: ingest_sql_server_cdc → Create task
4. In the Job details panel on the right: click Add schedule → set frequency → add email notification on failure → Save
5. Click Run now to trigger a run, then click the run ID to open the Run detail view

For health monitoring across all jobs, query system tables in any notebook or SQL warehouse:

```sql
SELECT
  job_name,
  result_state,
  DATEDIFF(second, start_time, end_time) AS duration_sec
FROM system.lakeflow.job_run_timeline
WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL 24 HOURS
ORDER BY start_time DESC;
```

## Step 6: Visualize with AI/BI Dashboards and Genie

AI/BI Dashboards help you create AI-powered, low-code dashboards:

1. Click + New → Dashboard
2. Click Add a visualization, connect to prod.gold.customer_activity, and build charts
3. Click Publish; viewers see data under their own Unity Catalog permissions automatically

Genie lets you interact with your data using natural language:

1. In the sidebar, click Genie → New
2. On Choose data sources, select prod.gold.customer_activity → Create
3. Add context in the Instructions box (e.g., table relationships, business definitions)
4. Switch to the Chat tab and ask a question: "Which customers have the highest total events and what were their order amounts?"
5. Genie generates and executes SQL, returning a result table. Click View SQL to inspect the query.

## Everything in One Platform

| Capability | Lakeflow | Previously Required |
|---|---|---|
| Telemetry ingestion | Zerobus Ingest | Message bus + custom consumer |
| Database CDC | Lakeflow Connect | Custom scripts or 3rd-party tools |
| Transformation + AutoCDC | Spark Declarative Pipelines | Hand-rolled MERGE logic |
| Data quality | SDP Expectations | Separate validation tooling |
| Orchestration | Lakeflow Jobs | External schedulers (Airflow, etc.) |
| Governance | Unity Catalog | Disconnected ACLs and lineage |
| Monitoring | Job UI + System Tables | Separate APM tools |
| BI + NL Query | AI/BI Dashboards + Genie | External BI tools |

Customers seeing results on Azure Databricks:

- **Ahold Delhaize**: 4.5x faster deployment and 50% cost reduction running 1,000+ ingestion jobs daily
- **Porsche Holding**: 85% faster ingestion pipeline development vs. a custom-built solution

## Next Steps

- Lakeflow product page
- Lakeflow Connect documentation
- Live demos on Demo Center
- Get started with Azure Databricks

# New Microsoft Certified: Azure Databricks Data Engineer Associate Certification
As a data engineer, you understand that AI performance depends directly on the quality of its data. If the data isn't clean, well-managed, and accessible at scale, even the most sophisticated AI models won't perform as expected. Introducing the Microsoft Certified: Azure Databricks Data Engineer Associate Certification, designed to prove that you have the skills required to build and operate reliable data systems by using Azure Databricks. To earn the Certification, you need to pass Exam DP-750: Implementing Data Engineering Solutions Using Azure Databricks, currently in beta.

## Is this Certification right for you?

This Certification offers you the opportunity to prove your skills and validate your expertise in the following areas:

**Core technical skills**

- Ingesting, transforming, and modeling data using SQL and Python
- Building production data pipelines on Azure Databricks
- Implementing software development lifecycle (SDLC) practices with Git-based workflows
- Integrating Azure Databricks with key Microsoft services, such as Azure Storage, Azure Data Factory, Azure Monitor, Azure Key Vault, and Microsoft Entra ID

**Governance and security**

- Securing and governing data with Unity Catalog and Microsoft Purview
- Applying workspace, cluster, and data-level security best practices

**Performance and reliability**

- Optimizing compute, caching, partitioning, and Delta Lake design patterns
- Troubleshooting and resolving issues with jobs and pipelines
- Managing workloads across development, staging, and production

For engineers already familiar with Azure Databricks, this Certification bridges the gap between general Azure Databricks skills and the Azure-specific architecture, security, and operational patterns that employers increasingly expect.

## Ready to prove your skills? The first 300 candidates can save 80%

Take advantage of the discounted beta exam offer. The first 300 people who take Exam DP-750 (beta) on or before April 2, 2026, can get 80% off.
To receive the discount, use code DP750Deltona when you register for the exam and are prompted for payment. This is not a private access code. Seats are offered on a first-come, first-served basis, and you must take the exam on or before April 2, 2026. Please note that this discount is not available in Turkey, Pakistan, India, or China.

## How to prepare

Get ready to take Exam DP-750 (beta):

- Review the Exam DP-750 (beta) exam page for details.
- The Exam DP-750 study guide explores key topics covered in the exam.
- Work through the Plan on Microsoft Learn: Get Exam-Ready for DP-750: Azure Databricks Data Engineer Associate Certification.
- Need other preparation ideas? Check out Just How Does One Prepare for Beta Exams?
- You can take Certification exams online, from your home or office. Learn what to expect in Online proctored exams: What to expect and how to prepare.

## Interested in unlocking more Azure Databricks expertise?

Grow your skills and take the next step by exploring Databricks credentials, and show what you can do with Azure Databricks.

## Ready to get started?

Remember, only the first 300 candidates can get 80% off Exam DP-750 (beta) with code DP750Deltona on or before April 2, 2026. Beta exam rescoring begins when the exam goes live, with final results released approximately 10 days later. For more details, read Creating high-quality exams: The path from beta to live. Stay tuned for general availability of this Certification in early May 2026.

## Get involved: Help shape future Microsoft Credentials

Join our Microsoft Worldwide Learning SME Group for Credentials on LinkedIn for beta exam alerts and opportunities to help shape future Microsoft learning and assessments.

## Additional information

For more cloud and AI Certification updates, read our recent blog post, The AI job boom is here. Are you ready to showcase your skills?
Explore Microsoft Credentials on AI Skills Navigator.

# Azure Databricks & Fabric Disaster Recovery: The Better Together Story
Authors: Amudha Palani, Oscar Alvarado, Eric Kwashie, Peter Lo, and Rafia Aqil

Disaster recovery (DR) is a critical component of any cloud-native data analytics platform, ensuring business continuity even during rare regional outages caused by natural disasters, infrastructure failures, or other disruptions.

## Identify Business-Critical Workloads

Before designing any disaster recovery strategy, organizations must first identify which workloads are truly business-critical and require regional redundancy. Not all Databricks or Fabric processes need full DR protection; instead, customers should evaluate the operational impact of downtime, data freshness requirements, regulatory obligations, SLAs, and dependencies across upstream and downstream systems. By classifying workloads into tiers and aligning DR investments accordingly, customers ensure they protect what matters most without over-engineering the platform.

## Azure Databricks

Azure Databricks requires a customer-driven approach to disaster recovery, where organizations are responsible for replicating workspaces, data, infrastructure components, and security configurations across regions.

### Full System Failover (Active-Passive) Strategy

A comprehensive approach that replicates all dependent services to the secondary region.
Implementation requirements include:

Infrastructure Components:
- Replicate Azure services (ADLS, Key Vault, SQL databases) using Terraform
- Deploy network infrastructure (subnets) in the secondary region
- Establish data synchronization mechanisms

Data Replication Strategy:
- Use Deep Clone for Delta tables rather than geo-redundant storage
- Implement periodic synchronization jobs using Delta's incremental replication
- Verify data transfer results using time travel syntax

Workspace Asset Synchronization:
- Co-deploy cluster configurations, notebooks, jobs, and permissions using CI/CD
- Utilize Terraform and SCIM for identity and access management
- Keep job concurrencies at zero in the secondary region to prevent execution

Fully Redundant (Active-Active) Strategy

The most sophisticated approach, where all transactions are processed in multiple regions simultaneously. While providing maximum resilience, this strategy:
- Requires complex data synchronization between regions
- Incurs the highest operational costs due to duplicate processing
- Is typically needed only for mission-critical workloads with zero tolerance for downtime
- Can be implemented as partial active-active, processing most of the workload in primary with a subset in secondary

Enabling Disaster Recovery

Create a secondary workspace in a paired region. Use CI/CD to keep workspace assets synchronized continuously.

| Requirement | Approach | Tools |
| --- | --- | --- |
| Cluster Configurations | Co-deploy to both regions as code | Terraform |
| Code (Notebooks, Libraries, SQL) | Co-deploy with CI/CD pipelines | Git, Azure DevOps, GitHub Actions |
| Jobs | Co-deploy with CI/CD, set concurrency to zero in secondary | Databricks Asset Bundles, Terraform |
| Permissions (Users, Groups, ACLs) | Use IdP/SCIM and infrastructure as code | Terraform, SCIM |
| Secrets | Co-deploy using secret management | Terraform, Azure Key Vault |
| Table Metadata | Co-deploy with CI/CD workflows | Git, Terraform |
| Cloud Services (ADLS, Network) | Co-deploy infrastructure | Terraform |

Update your orchestrator (ADF, Fabric pipelines, etc.)
to include a simple region toggle to reroute job execution. Replicate all dependent services (Key Vault, Storage accounts, SQL DB). Implement Delta “Deep Clone” synchronization jobs to keep datasets continuously aligned between regions. Introduce an application‑level “Sync Tool” that redirects data ingestion and compute execution. Enable parallel processing in both regions for selected or all workloads. Use bi‑directional synchronization for Delta data to maintain consistency across regions. For performance and cost control, run most workloads in primary and only a subset of workloads in secondary to keep it warm.

Implement Three-Pillar DR Design

Primary Workspace: Your production Databricks environment running normal operations
Secondary Workspace: A standby Databricks workspace in a different (paired) Azure region that remains ready to take over if the primary fails.

This architecture ensures business continuity while optimizing costs by keeping the secondary workspace dormant until needed. The DR solution is built on three fundamental pillars that work together to provide comprehensive protection:

1. Infrastructure Provisioning (Terraform)

The infrastructure layer creates and manages all Azure resources required for disaster recovery using Infrastructure as Code (Terraform).
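The Deep Clone synchronization jobs mentioned in the steps above reduce to one Delta SQL statement per critical table. A minimal sketch, assuming illustrative catalog and table names; in a scheduled Databricks notebook each generated statement would be executed with spark.sql(...):

```python
def deep_clone_statement(source_table: str, target_table: str) -> str:
    """Delta's DEEP CLONE copies data and metadata; re-running the same
    statement transfers only files changed since the previous clone."""
    return f"CREATE OR REPLACE TABLE {target_table} DEEP CLONE {source_table}"


def build_sync_plan(tables, source_catalog="prod_east", target_catalog="dr_west"):
    """One clone statement per critical table, primary region to DR region."""
    return [
        deep_clone_statement(f"{source_catalog}.{schema_table}",
                             f"{target_catalog}.{schema_table}")
        for schema_table in tables
    ]


# In the primary workspace, a scheduled job would run:
#   for stmt in build_sync_plan(["sales.orders", "finance.transactions"]):
#       spark.sql(stmt)
```

The catalog names `prod_east` and `dr_west` are placeholders; substitute the Unity Catalog names bound to your primary and secondary storage accounts.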
What It Creates:
- Secondary Resource Group: A dedicated resource group in your paired DR region (e.g., if primary is in East US, secondary might be in West US 2)
- Secondary Databricks Workspace: A standby Databricks workspace with the same SKU as your primary, ready to receive failover traffic
- DR Storage Account: An ADLS Gen2 storage account that serves as the backup destination for your critical data
- Monitoring Infrastructure: Azure Monitor Log Analytics workspace and alert action groups to track DR health
- Protection Locks: Management locks to prevent accidental deletion of critical DR resources

Key Design Principle: The Terraform configuration references your existing primary workspace without modifying it. It only creates new resources in the secondary region, ensuring your production environment remains untouched during setup.

2. Data Synchronization (Delta Notebooks)

The data synchronization layer ensures your critical data is continuously backed up to the secondary region.

How It Works: The solution uses a Databricks notebook that runs in your primary workspace on a scheduled basis. This notebook:
- Connects to Backup Storage: Uses Unity Catalog with Azure Managed Identity for secure, credential-free authentication to the secondary storage account
- Identifies Critical Tables: Reads from a configuration list you define (sales data, customer data, inventory, financial transactions, etc.)
- Performs Deep Clone: Uses Delta Lake's native CLONE functionality to create exact copies of your tables in the backup storage
- Tracks Sync Status: Logs each synchronization operation, tracks row counts, and reports on data freshness

Authentication Flow: The synchronization process leverages Unity Catalog's managed identity capabilities:
- An existing Access Connector for Unity Catalog is granted "Storage Blob Data Contributor" permissions on the backup storage.
- Storage credentials are created in Databricks that reference this Access Connector.
The notebook uses these credentials transparently—no storage keys or secrets are required.

What Gets Synced: You define which tables are critical to your business operations. The notebook creates backup copies including:
- Full table data and schema
- Table partitioning structure
- Delta transaction logs for point-in-time recovery

3. Failover Automation (Python Scripts)

The failover automation layer orchestrates the switch from primary to secondary workspace when disaster strikes.

Microsoft Fabric

Microsoft Fabric provides built‑in disaster recovery capabilities designed to keep analytics and Power BI experiences available during regional outages. Fabric simplifies continuity for reporting workloads, while still requiring customer planning for deeper data and workload replication.

Power BI Business Continuity

Power BI, now integrated into Fabric, provides automatic disaster recovery as a default offering:
- No opt-in required: DR capabilities are automatically included.
- Azure storage geo-redundant replication: Ensures backup instances exist in other regions.
- Read-only access during disasters: Semantic models, reports, and dashboards remain accessible.
- Always supported: BCDR for Power BI remains active regardless of the OneLake DR setting.

Microsoft Fabric

Fabric's cross-region DR uses a shared responsibility model between Microsoft and customers:

Microsoft's Responsibilities:
- Ensure baseline infrastructure and platform services availability
- Maintain Azure regional pairings for geo-redundancy
- Provide DR capabilities for Power BI as default
Customer Responsibilities:
- Enable disaster recovery settings for capacities
- Set up secondary capacity and workspaces in paired regions
- Replicate data and configurations

Enabling Disaster Recovery

Organizations can enable BCDR through the Admin portal under Capacity settings:
1. Navigate to Admin portal → Capacity settings
2. Select the appropriate Fabric Capacity
3. Access Disaster Recovery configuration
4. Enable the disaster recovery toggle

Critical Timing Considerations:
- 30-day minimum activation period: Once enabled, the setting remains active for at least 30 days and cannot be reverted.
- 72-hour activation window: Initial enablement can take up to 72 hours to become fully effective.

Azure Databricks & Microsoft Fabric DR Considerations

Building a resilient analytics platform requires understanding how disaster recovery responsibilities differ between Azure Databricks and Microsoft Fabric. While both platforms operate within Azure’s regional architecture, their DR models, failover behaviors, and customer responsibilities are fundamentally different.

Recovery Procedures

| Procedure | Databricks | Fabric |
| --- | --- | --- |
| Failover | Stop workloads, update routing, resume in secondary region. | Microsoft initiates failover; customers restore services in DR capacity. |
| Restore to Primary | Stop secondary workloads, replicate data/code back, test, resume production. | Recreate workspaces and items in new capacity; restore Lakehouse and Warehouse data. |
| Asset Syncing | Use CI/CD and Terraform to sync clusters, jobs, notebooks, permissions. | Use Git integration and pipelines to sync notebooks and pipelines; manually restore Lakehouses. |

Business Considerations

| Consideration | Databricks | Fabric |
| --- | --- | --- |
| Control | Customers manage DR strategy, failover timing, and asset replication. | Microsoft manages failover; customers restore services post-failover. |
| Regional Dependencies | Must ensure secondary region has sufficient capacity and services. | DR only available in Azure regions with Fabric support and paired regions. |
| Power BI Continuity | Not applicable. | Power BI offers built-in BCDR with read-only access to semantic models and reports. |
| Activation Timeline | Immediate upon configuration. | DR setting takes up to 72 hours to activate; 30-day wait before changes allowed. |

How Azure NetApp Files Object REST API powers Azure and ISV Data and AI services – on YOUR data
This article introduces the Azure NetApp Files Object REST API, a transformative solution for enterprises seeking seamless, real-time integration between their data and Azure's advanced analytics and AI services. By enabling direct, secure access to enterprise data—without costly transfers or duplication—the Object REST API accelerates innovation, streamlines workflows, and enhances operational efficiency. With S3-compatible object storage support, it empowers organizations to make faster, data-driven decisions while maintaining compliance and data security. Discover how this new capability unlocks business potential and drives a new era of productivity in the cloud.

Azure Databricks Lakebase is now generally available
Modern applications are real-time, intelligent, and increasingly powered by AI agents that need fast, reliable access to operational data—without sacrificing governance, scale, or simplicity. To solve for this, Azure Databricks Lakebase introduces a serverless Postgres database architecture that separates compute from storage and integrates natively with the Databricks Data Intelligence Platform on Azure. Lakebase is now generally available in Azure Databricks, enabling you and your team to start building and validating real-time and AI-driven applications directly on your lakehouse foundation.

Why Azure Databricks Lakebase?

Lakebase was created for modern workloads and to reduce silos. By decoupling compute from storage, Lakebase treats infrastructure as an on-demand service—scaling automatically with workload needs and scaling to zero when idle. Key capabilities include:
- Serverless Postgres for Production Workloads: Lakebase delivers a managed Postgres experience with predictable performance and built-in reliability features suitable for production applications, while abstracting away infrastructure management.
- Instant Branching and Point-in-Time Recovery: Teams can create zero-copy branches of production data in seconds for testing, debugging, or experimentation, and restore databases to precise points in time to recover from errors or incidents.
- Unified Governance with Unity Catalog: Operational data in Lakebase can be governed using the same Unity Catalog policies that secure analytics and AI workloads, enabling consistent access control, auditing, and compliance across the platform.
- Built for AI and Real-Time Applications: Lakebase is designed to support AI-native patterns such as real-time feature serving, agent memory, and low-latency application state—while keeping data directly connected to the lakehouse for analytics and learning workflows.
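Because Lakebase exposes a standard Postgres interface, existing drivers connect the usual way. A minimal sketch of building a conventional Postgres connection URI; the host, database, user, and port here are illustrative placeholders, not actual Lakebase endpoints:

```python
from urllib.parse import quote

def lakebase_dsn(host: str, database: str, user: str, token: str) -> str:
    """Build a standard Postgres connection URI, as consumed by any
    libpq-compatible driver (psycopg, JDBC postgresql://, etc.).
    Credentials are URL-encoded so special characters survive."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(token, safe='')}"
        f"@{host}:5432/{database}?sslmode=require"
    )

# e.g. psycopg.connect(lakebase_dsn("my-instance.example.net",
#                                   "appdb", "svc_app", token))
```

The usage comment assumes a psycopg-style driver; any tool that accepts a Postgres URI works the same way.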
Lakebase allows applications to operate directly on governed, lake-backed data—reducing the need for pipeline synchronization or duplicated storage. On Azure Databricks, this unlocks new scenarios such as:
- Real-time applications built on lakehouse data
- AI agents with persistent, governed memory
- Faster release cycles with safe, isolated database branches
- Simplified architectures with fewer moving parts

All while using familiar Postgres interfaces and tools.

Get Started with Azure Databricks Lakebase

Lakebase is integrated into the Azure Databricks experience and can be provisioned directly within Azure Databricks workspaces. For Azure Databricks customers building intelligent, real-time applications, it offers a new foundation—one designed for the pace and complexity of modern data-driven systems. We’re excited to see what you build. Get started today!

Securing A Multi-Agent AI Solution Focused on User Context & the Complexities of On-Behalf-Of.
How we built an enterprise-grade multi-agent system that preserves user identity across AI agents and Databricks

Introduction

When building AI-powered applications for the enterprise, a common challenge emerges: how do you maintain user identity and access controls when an AI agent queries backend services on behalf of a user? In many implementations, AI agents authenticate to backend systems using a shared service account or with PAT (Personal Access Token) tokens, effectively bypassing row-level security (RLS), column masking, and other data governance policies that organizations carefully configure. This creates a security gap where users can potentially access data they shouldn’t see, simply by asking an AI agent. In this post, I’ll walk through how we solved this challenge for a current enterprise customer by implementing the Microsoft Entra ID On-Behalf-Of (OBO) flow in a custom multi-agent LangGraph solution. This lets both the Databricks Genie agent that queries data and the data agent that modifies or updates Delta tables act as the authenticated user, preserving all RBAC policies.

The Architecture

Our system is built on several key components:
- Chainlit: Python-based web interface for LLM-driven conversational applications, integrated with OAuth 2.0–based authentication. Customizing the framework to satisfy customer UI requirements eliminated the need to develop and maintain a bespoke React front end. It fulfilled the majority of requirements while reducing maintenance overhead.
- Azure App Service: Managed hosting with built-in authentication support and autoscaling
- LangGraph: Open-source multi-agent orchestration framework
- Azure Databricks Genie: Natural language to SQL agent
- Azure Cosmos DB: Long-term memory and checkpoint storage
- Microsoft Entra ID: Identity provider with OBO support

This shows:
- Genie: Read-only natural language queries, per-user OBO
- Task Agent: Handles sensitive operations (SQL modifications, etc.)
with HITL approval + OBO
- Memory: Shared agent, no per-user auth needed

The Problem with Chainlit OAuth Provider

Chainlit was integrated with Microsoft Entra ID for OAuth authentication; however, the default implementation assumes Microsoft Graph scopes, requiring extension to support custom resource scopes. This means:
- The access token you receive is scoped for the Microsoft Graph API
- You can’t use it for the OBO flow to downstream services like Databricks
- The token’s audience is graph.microsoft.com, not your application

For OBO to work, you need an access token where:
- The audience is your application’s client ID
- The scope includes your custom API permission (e.g., api://{client_id}/access_as_user)

Solution: Custom Entra ID OBO Provider

We created a custom OAuth provider that replaces Chainlit’s built-in one. Key insight: by requesting api://{client_id}/access_as_user as the scope, the returned access token has the correct audience for OBO exchange. Since we can’t call the Graph API with this token (wrong audience), we extract user information from the ID token claims instead.

The OBO Token Exchange

Once we have the user’s access token (with the correct audience), we exchange it for a Databricks-scoped token using MSAL. The resulting token:
- Has audience = Databricks resource ID
- Contains the user’s identity (UPN, OID)
- Can be used with the Databricks SDK/API
- Respects all Unity Catalog permissions configured for that user

Per-User Agent Creation

A critical design decision: never cache user-specific agents globally. Each user needs their own Genie agent instance.

Using the OBO Token with Databricks Genie

The key integration point is passing the OBO-acquired token to the Databricks SDK’s WorkspaceClient as indicated in the above screenshot, which the Genie agent uses internally for all API calls as shown in the following image.
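At the protocol level, the MSAL exchange described above is a single POST to the Entra ID token endpoint, per the Microsoft identity platform OBO grant. The sketch below only builds that request (stdlib only, names illustrative); in practice MSAL's ConfidentialClientApplication.acquire_token_on_behalf_of performs the equivalent call and handles token caching:

```python
from urllib.parse import urlencode

# Well-known first-party resource ID for Azure Databricks; "/.default"
# resolves to the user_impersonation permission consented on the app.
AZURE_DATABRICKS_RESOURCE = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"

def build_obo_token_request(tenant_id: str, client_id: str,
                            client_secret: str, user_token: str):
    """Return (url, form_body) for the Entra ID on-behalf-of grant.

    user_token must be the token Chainlit obtained with the
    api://{client_id}/access_as_user scope, i.e. its audience is
    this application, not Microsoft Graph.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form_body = urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
        "client_id": client_id,
        "client_secret": client_secret,
        "assertion": user_token,
        "scope": f"{AZURE_DATABRICKS_RESOURCE}/.default",
        "requested_token_use": "on_behalf_of",
    })
    return url, form_body

# The access_token from the JSON response is then passed per user to
# databricks.sdk.WorkspaceClient(host=..., token=...).
```

The trailing comment reflects our setup; any client that accepts a bearer token for Databricks REST calls works the same way.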
Initialize Genie Agent with User’s Access Token:

Wire It Into LangGraph:

The user_access_token flows from Chainlit’s OAuth callback → session config → LangGraph config → agent creation, ensuring every Genie query runs with the authenticated user’s permissions.

Human-in-the-Loop for Destructive SQL Operations

While Databricks Genie handles natural language queries (read-only), our system also supports custom SQL execution for data modifications. Since these operations can DELETE or UPDATE data, we implement human-in-the-loop approval using LangGraph’s interrupt feature. The OBO token ensures that even when executing user-authored SQL, the query runs with the user’s permissions: they can only modify data they’re authorized to change. The destructive operation detector uses LLM-based intent analysis.

Entra ID App Registration Requirements

Your Entra ID app registration needs:
- API Permissions: Azure Databricks → user_impersonation (admin consent required)
- Expose an API: Scope access_as_user on URI api://{client-id}
- Redirect URI: {your-app-url}/auth/oauth/azure-ad/callback

Lessons Learned

- Token audience matters: OBO fails if your initial token has the wrong audience
- Don’t cache user-specific clients: it breaks user isolation
- ID tokens contain user info: use claims when you can’t call the Graph API
- HITL for destructive ops: even with RBAC, require explicit user confirmation

Conclusion

By implementing the Entra ID OBO flow in our multi-agent system, we achieved:
- User identity preservation across AI agents
- RBAC enforcement at the Databricks/Unity Catalog level
- An audit trail showing the actual user making queries
- Zero-trust architecture: the AI agent never has more access than the user
- Human-in-the-loop for destructive SQL operations

This approach enables any organization building AI systems that support OAuth 2.0 to participate in an on‑behalf‑of (OBO) flow.
More importantly, it establishes a critical layer of AI governance for enterprise‑grade, custom multi‑agent solutions, aligning with Microsoft’s Secure Future Initiative (SFI) and Zero Trust principles. As organizations accelerate toward multi‑agent AI architectures and broader AI transformation, centralized services that standardize identity, authorization, and user delegation become foundational. Capabilities such as Microsoft Entra Agent ID and Azure AI Foundry are emerging precisely to address this need - enabling secure, scalable, and user‑context–aware agent interactions. In the next post, I’ll shift the lens from architecture to outcomes - examining what this foundation means from a CXO perspective, and why identity‑first AI governance is quickly becoming a board‑level concern.

Unlocking Advanced Data Analytics & AI with Azure NetApp Files object REST API
Azure NetApp Files object REST API enables object access to enterprise file data stored on Azure NetApp Files, without copying, moving, or restructuring that data. This capability allows analytics and AI platforms that expect object storage to work directly against existing NFS-based datasets, while preserving Azure NetApp Files’ performance, security, and governance characteristics.