azure databricks

98 Topics

Guide for Architecting Azure-Databricks: Design to Deployment
Author's: Chris Walk cwalk, Dan Johnson danjohn1234, Eduardo dos Santos eduardomdossantos, Ted Kim tekim, Eric Kwashie ekwashie, Chris Hayes Chris_Haynes, Tayo Akigbogun takigbogun and Rafia Aqil Rafia_Aqil Peer Reviewed: Mohamed Sharaf mohamedsharaf Note: This article does not cover the Serverless Workspace option, which is currently in Private Preview. We plan to update this article once Serverless Workspaces are Generally Available. Also, while Terraform is the recommended method for production deployments due to its automation and repeatability, for simplicity in this article we will demonstrate deployment through the Azure portal. DESIGN: Architecting a Secure Azure Databricks Environment Step 1: Plan Workspace, Subscription Organization, Analytics Architecture and Compute Planning your Azure Databricks environment can follow various arrangements depending on your organization’s structure, governance model, and workload requirements. The following guidance outlines key considerations to help you design a well-architected foundation. 1.1 Align Workspaces with Business Units A recommended best practice is to align each Azure Databricks workspace with a specific business unit. This approach—often referred to as the “Business Unit Subscription” design pattern—offers several operational and governance advantages. Streamlined Access Control: Each unit manages its own workspace, simplifying permissions and reducing cross-team access risks. For example, Sales can securely access only their data and notebooks. Cost Transparency: Mapping workspaces to business units enables accurate cost attribution and supports internal chargeback models. Each workspace can be tagged to a cost center for visibility and accountability. Even within the same workspace, costs can be controlled using system tables that provide detailed usage metrics and resource consumption insights. Challenges to keep-in-mind: While per-BU workspaces have high impact, be mindful of workspace sprawl. If every small team spins up its own workspace, you might end up with dozens or hundreds of workspaces, which introduces management overhead. Databricks recommends a reasonable upper limit (on Azure, roughly 20–50 workspaces per account/subscription) because managing “collaboration, access, and security across hundreds of workspaces can become extremely difficult, even with good automation” [1]. Each workspace will need governance (user provisioning, monitoring, compliance checks), so there is a balance to strike. 1.2 Workspace Alignment and Shared Metastore Strategy As you align workspaces with business units, it's essential to understand how Unity Catalog and the metastore fit into your architecture. Unity Catalog is Databricks’ unified governance layer that centralizes access control, auditing, and data lineage across workspaces. Each Unity Catalog is backed by a metastore, which acts as the central metadata repository for tables, views, volumes, and other data assets. In Azure Databricks, you can have one metastore per region, and all workspaces within that region share it. This enables consistent governance and simplifies data sharing across teams. If your organization spans multiple regions, you’ll need to plan for cross-region sharing, which Unity Catalog supports through Delta Sharing. By aligning workspaces with business units and connecting them to a shared metastore, you ensure that governance policies are enforced uniformly, while still allowing each team to manage its own data assets securely and independently. 1.3 Distribute Workspaces Across Subscriptions When scaling Azure Databricks, consider not just the number of workspaces, but also how to distribute them across Azure subscriptions. Using multiple Azure subscriptions can serve both organizational needs and technical requirements: Environment Segmentation (Dev/Test/Prod): A common pattern is to put production workspaces in a separate Azure subscription from development or test workspaces. This provides an extra layer of isolation. Microsoft highly recommends separating workspaces into prod and dev, in separate subscriptions. This way, you can apply stricter Azure policies or network rules to the prod subscription and keep the dev subscription a bit more open for experimentation without risking prod resources. Honor Azure Resource Limits: Azure subscriptions come with certain capacity limits and Azure Databricks workspaces have their own limits (since it’s a multi-tenant PaaS). If you put all workspaces in one subscription, or all teams in one workspace, you might hit those limits. Most enterprises naturally end up with multiple subscriptions as they grow – planning this early avoids later migration headaches. If you currently have everything in one subscription, evaluate usage and consider splitting off heavy workloads or prod workloads into a new one to adhere to best practices. 1.4 Consider Completing Azure Landing Zone Assessment When evaluating and planning your next deployment, it’s essential to ensure that your current landing zone aligns with Microsoft best practices. This helps establish a robust Databricks architecture and minimizes the risk of avoidable issues. Additionally, customers who are early in their cloud journey can benefit from Cloud Assessments—such as an Azure Landing Zone Review and a review of the “Prepare for Cloud Adoption” documentation—to build a strong foundation. 1.5 Planning Your Azure Databricks Workspace Architecture Your workspace architecture should reflect the operational model of your organization and support the workloads you intend to run, from exploratory notebooks to production-grade ETL pipelines. To support your planning, Microsoft provides several reference architectures that illustrate well-architected patterns for Databricks deployments. These solution ideas can serve as starting points for designing maintainable environments: Simplified Architecture: Modern Data Platform Architecture, ETL-Intensive Workload Reference Architecture: Building ETL Intensive Architecture, End-to-End Analytics Architecture: Create a Modern Analytics Architecture. 1.6 Planning for that “Right” Compute Choosing the right compute setup in Azure Databricks is crucial for optimizing performance and controlling costs, as billing is based on Databricks Units (DBUs) using a per-second pricing model. Classic Compute: You can fine-tune your own compute by enabling auto-termination and autoscaling, using Photon acceleration, leveraging spot instances, selecting the right VM type and node count for your workload, and choosing SSDs for performance or HDDs for archival storage. Preferred by mature internal teams and developers who need advanced control over clusters—such as custom VM selection, tuning, and specialized configurations. Serverless Compute: Alternatively, managed services can simplify operations with built-in optimizations. Removes infrastructure management and offers instant scaling without cluster warm-up, making it ideal for agility and simplicity. Step 2: Plan the “Right” CIDR Range (Classic Compute) Note: You can skip this step if you plan to use serverless compute for all your resources, as CIDR range planning is not required in serverless deployments. When planning CIDR ranges for your Azure Databricks workspace, it's important to ensure your virtual network has enough IP address capacity to support cluster scaling. Why this matters: If you choose a small VNet address space and your analytics workloads grow, you might hit a ceiling where you simply cannot launch more clusters or scale-out because there are no free IPs in the subnet. The subnet sizes—and by extension, the VNet CIDR—determine how many nodes you can. Databricks recommends using a CIDR block between /16 and /24 for the VNet, and up to /26 for the two required subnets: the container subnet and the host subnet. Here’s a reference Microsoft provides. If your current workspace’s VNet lacks sufficient IP space for active cluster nodes, you can request a CIDR range update through your Azure Databricks account team as noted in the Microsoft documentation. 2.1 Considerations for CIDR Range Workload Type & Concurrency: Consider what kinds of workloads will run (ETL Pipelines, Machine Learning Notebooks, BI Dashboards, etc.) and how many jobs or clusters may need to run in parallel. High concurrency (e.g. multiple ETL jobs or many interactive clusters) means more nodes running at the same time, requiring a larger pool of IP addresses. Data Volume (Historical vs. Incremental): Are you doing a one-time historical data load or only processing new incremental data? A large backfill of terabytes of data may require spinning up a very large cluster (hundreds of nodes) to process in a reasonable time. Ongoing smaller loads might get by with fewer nodes. Estimate how much data needs processing. Transformation Complexity: The complexity of data transformations or machine learning workloads matters. Heavy transformations (joins, aggregations on big data) or complex model training can benefit more workers. If your use cases include these, you may need larger clusters (more nodes) to meet performance SLAs, which in turn demands more IP addresses available in the subnet. Data Sources and Integration: Consider how your Databricks environment will connect to data. If you have multiple data sources or sinks (e.g. ingest from many event hubs, databases, or IoT streams), you might design multiple dedicated clusters or workflows, potentially all active at once. Also, if using separate job clusters per job (Databricks Jobs), multiple clusters might launch concurrently. All these scenarios increase concurrent node count. 2.2 Configuring a Dedicated Network (VNet) per Workspace with Egress Control By default, Azure Databricks deploys its classic compute resources into a Microsoft-managed virtual network (VNet) within your Azure subscription. While this simplifies setup, it limits control over network configuration. For enhanced security and flexibility, it's recommended to use VNet Injection, which allows you to deploy the compute plane into your own customer-managed VNet. This approach enables secure integration with other Azure services using service endpoints or private endpoints, supports user-defined routes for accessing on-premises data sources, allows traffic inspection via network virtual appliances or firewalls, and provides the ability to configure custom DNS and enforce egress restrictions through network security group (NSG) rules. Within this VNet (which must reside in the same region and subscription as the Azure Databricks workspace), two subnets are required for Azure Databricks: a container subnet (referred to as private subnet) and a host subnet (referred to as public subnet). To implement front-end Private Link, back-end Private Link, or both, your workspace VNet needs a third subnet that will contain the private endpoint (PrivateLink subnet). It is recommended to also deploy an Azure Firewall for egress control. Step 3: Plan Network Architecture for Securing Azure-Databricks 3.1 Secure Cluster Connectivity Secure Cluster Connectivity, also known as No Public IP (NPIP), is a foundational security feature for Azure Databricks deployments. When enabled, it ensures that compute resources within the customer-managed virtual network (VNet) do not have public IP addresses, and no inbound ports are exposed. Instead, each cluster initiates a secure outbound connection to the Databricks control plane using port 443 (HTTPS), through a dedicated relay. This tunnel is used exclusively for administrative tasks, separate from the web application and REST API traffic, significantly reducing the attack surface. For the most secure deployment, Microsoft and Databricks strongly recommend enabling Secure Cluster Connectivity, especially in environments with strict compliance or regulatory requirements. When Secure Cluster Connectivity is enabled, both workspace subnets become private, as cluster nodes don’t have public IP addresses. 3.2 Egress with VNet Injection (NVA) For Databricks traffic, you’ll need to assign a UDR to the Databricks-managed VNet with a next hop type of Network Virtual Appliance (NVA)—this could be an Azure Firewall, NAT Gateway, or another routing device. For control plane traffic, Databricks recommends using Azure service tags, which are logical groupings of IP addresses for Azure services and should be routed with the next hop type of internet. This is important because Azure IP ranges can change frequently as new resources are provisioned, and manually maintaining IP lists is not practical. Using service tags ensures that your routing rules automatically stay up to date. 3.3 Front-End Connectivity with Azure Private Link (Standard Deployment) To further enhance security, Azure Databricks supports Private Link for front-end connections. In a standard deployment, Private Link enables users to access the Databricks web application, REST API, and JDBC/ODBC endpoints over a private VNet interface, bypassing the public internet. For organizations with no public internet access from user networks, a browser authentication private endpoint is required. This endpoint supports SSO login callbacks from Microsoft Entra ID and is shared across all workspaces in a region using the same private DNS zone. It is typically hosted in a transit VNet that bridges on-premises networks and Azure. Note: There are two deployment types: standard and simplified. To compare these deployment types, see Choose standard or simplified deployment. 3.4 Serverless Compute Networking Azure Databricks offers serverless compute options that simplify infrastructure management and accelerate workload execution. These resources run in a Databricks-managed serverless compute plane, isolated from the public internet and connected to the control plane via the Microsoft backbone network. To secure outbound traffic from serverless workloads, administrators can configure Serverless Egress Control using network policies that restrict connections by location, FQDN, or Azure resource type. Additionally, Network Connectivity Configurations (NCCs) allow centralized management of private endpoints and firewall rules. NCCs can be attached to multiple workspaces and are essential for enabling secure access to Azure services like Data Lake Storage from serverless SQL warehouses. DEPLOYMENT: Step-to-Step Implementation using Azure Portal Step 1: Create an Azure Resource Group For each new workspace, create a dedicated Resource Group (to contain the Databricks workspace resource and associated resources). Ensure that all resources are deployed in the same Region and Resource Group (i.e. workspace, subnets...) to optimize data movement performance and enhance security. Step 2: Deploy Workspace Specific Virtual Network (VNET) From your Resource Group, create a Virtual Network. Under the Security section, enable Azure Firewall. Deploying an Azure Firewall is recommended for egress control, ensuring that outbound traffic from your Databricks environment is securely managed. Define address spaces for your Virtual Network (Review Step 2 from Design). As documented, you could create a VNet with these values: IP range: First remove the default IP range, and then add IP range 10.28.0.0/23. Create subnet public-subnet with range 10.28.0.0/25. Create subnet private-subnet with range 10.28.0.128/25. Create subnet private-link with range 10.28.1.0/27. Please note: your IP values can be different depending on your IPAM and available scopes. Review + Create your Virtual Network. Step 3: Deploy Azure-Databricks Workspace: Now that networking is in place, create the Databricks workspace. Below are detailed steps your organization should review while creating workspace creation: In Azure Portal, search for Azure Databricks and click Create. Choose the Subscription, RG, Region, select Premium, enter in “Managed Resource Group name” and click Next. Managed Resource Group- will be created after your Databrick workspace is deployed and contains infrastructure resources for the workspace i.e. VNets, DBFS. Required: Enable “Secure Cluster Connectivity” (No Public IP for clusters), to ensure that Databricks clusters are deployed without public IP addresses (Review Section 3.1). Required: Enable the option to deploy into your Virtual Network (VNet Injection), also known as “Bring Your Own VNet” (Review Section 3.2). Select the Virtual Network created in Step 2. Enter Private, Public Subnet Names. Enable or Disable “Deploying Nat Gateway”, according to your workspace requirement. Disable “Allow Public Network Access”. Select “No Azure Databricks Rules” for Required NSG Rules. Select “Click on add to create a private endpoint”, this will open a panel for private endpoint setup. Click “Add” to enter your Private Link details created in Step 2. Also, ensure that Private DNS zone integration is set to “Yes” and that a new Private DNS Zone is created, indicated by (New)privatelink.azuredatabricks.net. Unless an existing DNS zone for this purpose already exists. (Optional) Under Encryption Tab, Enable Infrastructure Encryption, if you have requirement for FIPS 140-2. It comes at a cost, it takes time to encrypt and decrypt. By default your data is already encrypted. If you have a standard regulatory requirement (ex. HIPAA). (Optional) Compliance security profile- for HIPAA. (Optional) Automatic cluster updates, First Sunday of every Month. Review + Create the workspace and wait for it to deploy. Step 4: Create a private endpoint to support SSO for web browser access: Note: This step is required when front-end Private Link is enabled, and client networks cannot access the public internet. After creating your Azure Databricks workspace, if you try to launch it without the proper Private Link configuration, you will see an error like the image below: This happens because the workspace is configured to block public network access, and the necessary Private Endpoints (including the browser_authentication endpoint for SSO) are not yet in place. Create Web-Auth Workspace Note: Deploy a “dummy”: WEB_AUTH_DO_NOT_DELETE_<region> workspace in the same region as your production workspace. Purpose: Host the browser_authentication private endpoint (one required per region). Lock the workspace (Delete lock) to prevent accidental removal. Follow step 2 to create Virtual Network (Vnet) Follow step 3 and create a VNet injected “dummy” workspace. Create Browser Authentication Private Endpoint In Azure Portal, Databricks workspace (dummy), Networking, Private endpoint connections, + Private endpoint. Resource step: Target sub-resource: browser_authentication Virtual Network step: VNet: Transit/Hub VNet (central network for Private Link) Subnet: Private Endpoint subnet in that VNet (not Databricks host subnets) DNS step: Integrate with Private DNS zone: Yes Zone: privatelink.azuredatabricks.net Ensure DNS zone is linked to the Transit VNet After creation: A-records for *.pl-auth.azuredatabricks.net are auto-created in the DNS zone. Workspace Connectivity Testing If you have VPN or ExpressRoute, Bastion is not required. However, for the purposes of this article we will be testing our workpace connectivity through Bastion. If you don’t have private connectivity and need to test from inside the VNet, Azure Bastion is a convenient option. Step 5: Create Storage Account From your Resource Group, click Create and select Storage account. On the configuration page: Select Preferred Storage type as: Azure Blob Storage or Azure Data Lake Storage Gen 2. Choose Performance and Redundancy options based on your business requirements. Click Next to proceed. Under the Advanced tab: Enable Hierarchical namespace under Data Lake Storage Gen2. This is critical for: Directory and file-level operations, Access Control Lists (ACLs). Under the Networking tab: Set Public Network Access to Disabled. Complete the creation process and then create container(s) inside the storage account. Step 6: Create Private Endpoints for Workspace Storage Account Pre-requisite: You need to create two private endpoints from the VNet used for VNet injection to your workspace storage account for the following Target sub-resources: dfs and blob. Navigate to your Storage Account. Go to Networking, Private Endpoints tab and click on to + Create Private Endpoint. In the Create Private Endpoint wizard: Resource tab: Select your Storage Account. Set Target sub-resource to dfs for the first endpoint. Virtual Network tab: Choose the VNet you used for VNet injection. Select the appropriate subnet. Complete the creation process. The private endpoint will be auto approved and visible under Private Endpoints. Repeat the process for the second private endpoint: This time set Target sub-resource to blob. Step 7: Link Storage and Databricks Workspace: Create Access Connector In your Resource Group, create an Access Connector for Azure Databricks. No additional configuration is required during creation. Assign Role to Access Connector Navigate to your Storage Account, Access Control (IAM), Add role assignment. Select: Role: Storage Blob Data Contributor Assign access to: Managed Identity Under Members: Click Select members. Find and select your newly created Access Connector for Azure Databricks. Save the role assignment. Copy Resource ID Go to the Access Connector Overview page. Copy the Resource ID for later use in Databricks configuration. Step 8: Link Storage and Databricks Workspace: Navigate to Unity Catalog In your Databricks Workspace, go to Unity Catalog, External Data and select “Create external Location” button. Configure External Location Select ADLS as the storage type. Enter the ADLS storage URL in the following format: abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/ Update these two parameters: <container_name> and <storage_name> Provide Access Connector Select “Create new storage credential” from Storage credential field. Paste the Resource ID of the Access Connector for Azure Databricks (from Step 10) into the Access Connector ID field. Validate Connection Click Submit. You should see a “Successful” message confirming the connection. Click submit and you should receive a “Successful” message, indicating your connection has succeeded. You can now create Catalogs and link your secure storage. Step 9: Configuring Serverless Compute Networking: If your organization plans to use Serverless SQL Warehouses or Serverless Jobs Compute, you must configure Serverless Networking. Add Network Connectivity Configuration (NCC) Go to the Databricks Account Console: https://accounts.azuredatabricks.net/ Navigate to Cloud resources, click Add Network Connectivity Configuration. Fill in the required fields and create a new NCC. Associate NCC with Workspace In the Account Console, go to Workspaces. Select your workspace, click Update Workspace. From the Network Connectivity Configuration dropdown, select the NCC you just created. Add Private Endpoint Rule In Cloud resources, select your NCC, select Private Endpoint Rules and click Add Private Endpoint Rule. Provide: Resource ID: Enter your Storage Account Resource ID. Note: this can be found from your storage account, click on “JSON View” top right. Azure Subresource type: dfs & blob. Approve Pending Connection Go to your Storage Account, Networking, Private Endpoints. You will see a Pending connection from Databricks. Approve the connection and you will see the Connection status in your Account Console as ESTABLISHED. Step 10: Test Your Workspace: Launch a small test cluster and verify the following: It can start (which means it can talk to the control plane). It can read/write from the storage, following the following code to confirm read/write to storage: Set Spark properties to configure Azure credentials to access Azure storage. Check Private DNS Record has been created. (Optional) If on-prem data is needed: try connecting to an on-prem database (using the ExpressRoute path): Connect your Azure Databricks workspace to your on-premises network - Azure Databricks | Microsoft Learn. Step 11: Account Console, Planning Workspace Access Controls and Getting Started: Once your Azure Databricks workspace is deployed, it's essential to configure access controls and begin onboarding users with the right permissions. From your account console: https://accounts.azuredatabricks.net/, you can centrally manage your environment: add users and groups, enable preview features, and view or configure all your workspaces. Azure Databricks supports fine-grained access management through Unity Catalog, cluster policies, and workspace-level roles. Start by defining who needs access to what—whether it's notebooks, tables, jobs, or clusters—and apply least-privilege principles to minimize risk. DBFS Limitation: DBFS is automatically created upon Databricks Workspace creation. DBFS can be found in your Managed Resource Group. Databricks cannot secure DBFS (see reference image below). If there is a business need to avoid DBFS then you can disable DBFS access following instructions here: Disable access to DBFS root and mounts in your existing Azure Databricks workspace. Use Unity Catalog to manage data access across catalogs, schemas, and tables, and consider implementing cluster policies to standardize compute configurations across teams. To help your teams get started, Microsoft provides a range of tutorials and best practice guides: Best practice articles - Azure Databricks | Microsoft Learn. Step 12: Planning Data Migration: As you prepare to move data into your Azure Databricks environment, it's important to assess your migration strategy early. This includes identifying source systems, estimating data volumes, and determining the appropriate ingestion methods—whether batch, streaming, or hybrid. For organizations with complex migration needs or legacy systems, Microsoft offers specialized support through its internal Azure Cloud Accelerated Factory program. Reach out to your Microsoft account team to explore nomination for Azure Cloud Accelerated Factory, which provides hands-on guidance, tooling, and best practices to accelerate and streamline your data migration journey. Summary Regular maintenance and governance are as important as the initial design. Continuously review the environment and update configurations as needed to address evolving requirements and threats. For example, tag all resources (workspaces, VNets, clusters, etc.) with clear identifiers (workspace name, environment, department) to track costs and ownership effectively. Additionally, enforce least privilege across the platform: ensure that only necessary users are given admin privileges, and use cluster-level access control to restrict who can create or start clusters. By following the above steps, an organization will have an Azure Databricks architecture that is securely isolated, well-governed, and scalable. References: [1] 5 Best Practices for Databricks Workspaces AzureDatabricksBestPractices/toc.md at master · Azure ... - GitHub Deploy a workspace using the Azure Portal Additional Links: Quick Introduction to Databricks: what is databricks | introduction - databricks for dummies Connect Purview with Azure Databricks: Integrating Microsoft Purview with Azure Databricks Secure Databricks Delta Share between Workspaces: Secure Databricks Delta Share for Serverless Compute Azure-Databricks Cost Optimization Guide: Databricks Cost Optimization: A Practical Guide Integrate Azure Databricks with Microsoft Fabric: Integrating Azure Databricks with Microsoft Fabric Databricks Solution Accelerators for Data & AI Azure updates Appendix 3.5 Understanding Data Transfer (Express Route vs. Public Internet) For data transfers, your organization must decide to use ExpressRoute or Internet Egress. There are several considerations that can help you determine your choice: 3.5.1. Connectivity Model • ExpressRoute: Provides a private, dedicated connection between your on-premises infrastructure and Microsoft Azure. It bypasses the public internet entirely and connects through a network service provider. • Internet Egress: Refers to outbound data traffic from Azure to the public internet. This is the default path for most Azure services unless configured otherwise. 3.6 Planning for User-Defined Routes (UDRs) When working with Databricks deployments—especially in VNet-injected workspaces—setting up User Defined Routes (UDRs) is a smart move. It’s a best practice that helps manage and secure network traffic more effectively. By using UDRs, teams can steer traffic between Databricks components and external services in a controlled way, which not only boosts security but also supports compliance efforts. 3.6.1 UDRs and Hub and Spoke Topology If your Databricks workspace is deployed into your own virtual network (VNet), you’ll need to configure standard user-defined routes (UDRs) to manage traffic flow. In a typical hub-and-spoke architecture, UDRs are used to route all traffic from the spoke VNets to the hub VNet. 3.6.2 Hub and Spoke with VWANHUB If your Databricks workspace is deployed into your own virtual network (VNet) and is peered to a Virtual WAN (VWAN) hub as the primary connectivity hub into Azure, a user-defined route (UDR) is not required—provided that a private traffic routing policy or internet traffic routing policy is configured in the VWAN hub. 3.6.3 Use of NVAs and Service Tags For Databricks traffic, you’ll need to assign a UDR to the Databricks-managed VNet with a next hop type of Network Virtual Appliance (NVA)—this could be an Azure Firewall, NAT Gateway, or another routing device. For control plane traffic, Databricks recommends using Azure service tags, which are logical groupings of IP addresses for Azure services and should be routed with the next hop type of internet. This is important because Azure IP ranges can change frequently as new resources are provisioned, and manually maintaining IP lists is not practical. Using service tags ensures that your routing rules automatically stay up to date. 3.6.4 Default Outbound Access Retirement (Non-Serverless Compute) Microsoft is retiring default outbound internet access for new deployments starting September 30,2025. Going forward, outbound connectivity will require an explicit configuration using an NVA, NAT Gateway, Load Balancer, or Public IP address. Also, note that using a Public IP Address in the deployment is discouraged for Security purposes, and it is recommended to deploy the workspace in a ‘Secure Cluster Connectivity ration.” Configure connectivity will require an explicit configuration using an NVA, NAT Gateway, Load Balancer, or Public IP address. Also, note that using a Public IP Address in the deployment is discouraged for Security purposes, and it is recommended to deploy the workspace in a ‘Secure Cluster Connectivity ration.”
Rafia_Aqil
Nov 26, 2025 Place Analytics on Azure Blog
84Views
0likes
0Comments
Azure Databricks Cost Optimization: A Practical Guide
Co-Authored by Sanjeev Nair This guide walks through a proven approach to Databricks cost optimization, structured in three phases: Discovery, Cluster/Data/Code Best Practices, and Team Alignment & Next Steps. Phase 1: Discovery Assessing Your Current State The following questions are designed to guide your initial assessment and help you identify areas for improvement. Documenting answers to each will provide a baseline for optimization and inform the next phases of your cost management strategy. Environment & Organization Cluster Management Cost Optimization Data Management Performance Monitoring Future Planning What is the current scale of your Databricks environment? How many workspaces do you have? How are your workspaces organized (e.g., by environment type, region, use case)? How many clusters are deployed? How many users are active? What are the primary use cases for Databricks in your organization? Data engineering Data science Machine learning Business intelligence How are clusters currently managed? Manual configuration Automated scripts Databricks REST API Cluster policies What is the average cluster uptime? Hours per day Days per week What is the average cluster utilization rate? CPU usage Memory usage What is the current monthly spend on Databricks? Total cost Breakdown by workspace Breakdown by cluster What cost management tools are currently in use? Azure Cost Management Third-party tools Are there any existing cost optimization strategies in place? Reserved instances Spot instances Cluster auto-scaling What is the current data storage strategy? Data lake Data warehouse Hybrid What is the average data ingestion rate? GB per day Number of files What is the average data processing time? ETL jobs Machine learning models What types of data formats are used in your environment? Delta Lake Parquet JSON CSV Other formats relevant to your workloads What performance monitoring tools are currently in use? Databricks Ganglia Azure Monitor Third-party tools What are the key performance metrics tracked? Job execution time Cluster performance Data processing speed Are there any planned expansions or changes to the Databricks environment? New use cases Increased data volume Additional users What are the long-term goals for Databricks cost optimization? Reducing overall spend Improving resource utilization & cost attribution Enhancing performance Understanding Databricks Cost Structure Total Cost = Cloud Cost + DBU Cost Cloud Cost: Compute (VMs, networking, IP addresses), storage (ADLS, MLflow artifacts), other services (firewalls), cluster type (serverless compute, classic compute) DBU Cost: Workload size, cluster/warehouse size, photon acceleration, compute runtime, workspace tier, SKU type (Jobs, Delta Live Tables, All Purpose Clusters, Serverless), model serving, queries per second, model execution time Diagnose Cost and Issues Effectively diagnosing cost and performance issues in Databricks requires a structured approach. Use the following steps and metrics to gain visibility into your environment and uncover actionable insights. 1. Identify Costly Workloads Account Console Usage Reports: Review usage reports to identify usage breakdowns by product, SKU name, and custom tags. Usage Breakdown by Product and SKU: Helps you understand which services and compute types (clusters, SQL warehouses, serverless options) are consuming the most resources. Custom Tags for Attribution: Tags allow you to attribute costs to teams, projects, or departments, making it easier to identify high-cost areas. Workflow and Job Analysis: By correlating usage data with workflows and jobs, you can pinpoint long-running or resource-heavy workloads that drive costs. Focus on Long-Running Workloads: Examine workloads with extended runtimes or high resource utilization. Key Question: Which pipelines or workloads are driving the majority of your costs? Now That You’ve Identified Long-Running Workloads, Review These Key Areas: 2. Review Cluster Metrics CPU Utilization: Track guest, iowait, idle, irq, nice, softirq, steal, system, and user times to understand how compute resources are being used. Memory Utilization: Monitor used, free, buffer, and cached memory to identify over- or under-utilization. Key Question: Is your cluster over- or under-utilized? Are resources being wasted or stretched too thin? 3. Review SQL Warehouse Metrics Live Statistics: Monitor warehouse status, running/queued queries, and current cluster count. Time Scale Filter: Analyze query and cluster activity over different time frames (8 hours, 24 hours, 7 days, 14 days). Peak Query Count Chart: Identify periods of high concurrency. Completed Query Count Chart: Track throughput and query success/failure rates. Running Clusters Chart: Observe cluster allocation and recycling events. Query History Table: Filter and analyze queries by user, duration, status, and statement type. Key Question: Is your SQL Warehouse over- or under-utilized? Are resources being wasted or stretched too thin? 4. Review Spark UI Stages Tab: Look for skewed data, high input/output, and shuffle times. Uneven task durations may indicate data skew or inefficient data handling. Jobs Timeline: Identify long-running jobs or stages that consume excessive resources. Stage Analysis: Determine if stages are I/O bound or suffering from data skew/spill. Executor Metrics: Monitor memory usage, CPU utilization, and disk I/O. Frequent garbage collection or high memory usage may signal the need for better resource allocation. 4.1. Spark UI: Storage & Jobs Tab Storage Level: Check if data is stored in memory, on disk, or both. Size: Assess the size of cached data. Job Analysis: Investigate jobs that dominate the timeline or have unusually long durations. Look for gaps caused by complex execution plans, non-Spark code, driver overload, or cluster malfunction. 4.2. Spark UI: Executor Tab Storage Memory: Compare used vs. available memory. Task Time (Garbage Collection): Review long tasks and garbage collection times. Shuffle Read/Write: Measure data transferred between stages. 5. Additional Diagnostic Methods System Tables in Unity Catalog: Query system tables for cost attribution and resource usage trends. Cost Observability Queries Tagging Analysis: Use tags to identify which teams or projects consume the most resources. Dashboards & Alerts: Set up cost dashboards and budget alerts for proactive monitoring. Phase 2: Cluster/Code/Data Best Practices Alignment Cluster UI Configuration and Cost Attribution Effectively configuring clusters/workloads in Databricks is essential for balancing performance, scalability, and cost. Tunning settings and features when used strategically can help organizations maximize resource efficiency and minimize unnecessary spending. Key Configuration Strategies 1. Reduce Idle Time: Clusters to incur costs even when not actively processing workloads. To avoid paying for unused resources: Enable Auto-Terminate: Set clusters automatically shut down after a period of inactivity. This simple setting can significantly reduce wasted spending. Enable Autoscaling: Workloads fluctuate in size and complexity. Autoscaling allows clusters to dynamically adjust the number of nodes based on demand: Automatic Resource Adjustment: Scale up for heavy jobs and scale down for lighter loads, ensuring you only pay for what you use. It significantly enhances cost efficiency and overall performance. For serverless and streaming, using Delta Live Tables with autoscaling is recommended. This approach leads to better resource management and reliability. Use Spot Instances: For batch processing and non-critical workloads, spot instances offer substantial cost savings: Lower VM Costs: Spot instances are typically much cheaper than standard VMs. However, they are not recommended for jobs requiring constant uptime due to potential interruptions. Considerations: Azure Spot VMs are intended for non-critical, fault-tolerant tasks. They can be evicted without notice, riskingproduction stability. No SLA guarantees mean potentialdowntime for critical applications. Using Spot VMs could lead to reliability issues in production environments. Leverage Photon Engine: Photon is Databricks’ high-performance, vectorized query engine: Accelerate Large Workloads: Photon can dramatically reduce runtime for compute-intensive tasks, improving both speed and cost efficiency. Keep Runtimes Up to Date: Using the latest Databricks runtime ensures optimal performance and security: Benefit from Improvements: Regular updates include performance enhancements, bug fixes, and new features. Apply Cluster Policies: Cluster policies help standardize configurations and enforce cost controls across teams: Governance and Consistency: Policies can restrict certain settings, enforce tagging, and ensure clusters are created with cost-effective defaults. Optimize Storage: type impacts both performance and cost: Switch from HDDs to SSDs: SSDs provide faster caching and shuffle operations, which can improve job efficiency and reduce runtime. Tag Clusters for Cost Attribution: Tagging clusters enables granular tracking and reporting: Visibility and Accountability: Use tags to attribute costs to specific teams, projects, or environments, supporting better budgeting and chargeback processes. Select the Right Cluster Type: Different workloads require different cluster types, see table below for Serverless vs Classic Compute: Feature Classic Compute Serverless Compute Control Full control over config & network Minimal control, fully managed by Databricks Startup Time Slower (unless pre-warmed) Instant Cost Model Hourly, supports reservations Pay-per-use, elastic scaling Security VNet injection, private endpoints NCC-based private connectivity Best For Heavy ETL, ML, compliance workloads Interactive queries, unpredictable demand Job Clusters: Ideal for scheduled jobs and Delta Live Tables. All-Purpose Clusters: Suited for ad-hoc analysis and collaborative work. Single-Node Clusters: Efficient for simple exploratory data analysis or pure Python tasks. Serverless Compute: Scalable, managed workloads with automatic resource management. 11. Monitor and Adjust Regularly: review cluster metrics and query history: Continuous Optimization: Use built-in dashboards to monitor usage, identify bottlenecks, and adjust cluster size or configuration as needed. Code Best Practices Avoid Reprocessing Large Tables Use a CDC (Change Data Capture) architecture with Delta Live Tables (DLT) to process only new or changed data, minimizing unnecessary computation. Ensure Code Parallelizes Well Write Spark code that leverages parallel processing. Avoid loops, deeply nested structures, and inefficient user-defined functions (UDFs) that can hinder scalability. Reduce Memory Consumption Tweak Spark configurations to minimize memory overhead. Clean out legacy or unnecessary settings that may have carried over from previous Spark versions. Prefer SQL Over Complex Python Use SQL (declarative language) for Spark jobs whenever possible. SQL queries are typically more efficient and easier to optimize than complex Python logic. Modularize Notebooks Use %run to split large notebooks into smaller, reusable modules. This improves maintainability. Use LIMIT in Exploratory Queries When exploring data, always use the LIMIT clause to avoid scanning large datasets unnecessarily. Monitor Job Performance Regularly review Spark UI to detect inefficiencies such as high shuffle, input, or output. Review the below table for optimization opportunities: Spark stage high I/O - Azure Databricks | Microsoft Learn Databricks Code Performance Enhancements & Data Engineering Best Practices By enabling the below features and applying best practices, you can significantly lower costs, accelerate job execution, and build Databricks pipelines that are both scalable and highly reliable. For more guidance review: Comprehensive Guide to Optimize Data Workloads | Databricks. Feature / Technique Purpose / Benefit How to Use / Enable / Key Notes Disk Caching Accelerates repeated reads of Parquet files Set spark.databricks.io.cache.enabled = true Dynamic File Pruning (DFP) Skips irrelevant data files during queries, improves query performance Enabled by default in Databricks Low Shuffle Merge Reduces data rewriting during MERGE operations, less need to recalculate ZORDER Use Databricks runtime with feature enabled Adaptive Query Execution (AQE) Dynamically optimizes query plans based on runtime statistics Available in Spark 3.0+, enabled by default Deletion Vectors Efficient row removal/change without rewriting entire Parquet file Enable in workspace settings, use with Delta Lake Materialized Views Faster BI queries, reduced compute for frequently accessed data Create in Databricks SQL Optimize Compacts Delta Lake files, improves query performance Run regularly, combine with ZORDER on high-cardinality columns ZORDER Physically sorts/co-locates data by chosen columns for faster queries Use with OPTIMIZE, select columns frequently used in filters/joins Auto Optimize Automatically compacts small files during writes Enable optimizeWrite and autoCompact table properties Liquid Clustering Simplifies data layout, replaces partitioning/ZORDER, flexible clustering keys Recommended for new Delta tables, enables easy redefinition of clustering keys File Size Tuning Achieve optimal file size for performance and cost Set delta.targetFileSize table property Broadcast Hash Join Optimizes joins by broadcasting smaller tables Adjust spark.sql.autoBroadcastJoinThreshold and spark.databricks.adaptive.autoBroadcastJoinThreshold Shuffle Hash Join Faster join alternative to sort-merge join Prefer over sort-merge join when broadcasting isn’t possible, Photon engine can help Cost-Based Optimizer (CBO) Improves query plans for complex joins Enabled by default, collect column/table statistics with ANALYZE TABLE Data Spilling & Skew Handles uneven data distribution and excessive shuffle Use AQE, set spark.sql.shuffle.partitions=auto, optimize partitioning Data Explosion Management Controls partition sizes after transformations (e.g., explode, join) Adjust spark.sql.files.maxPartitionBytes, use repartition() after reads Delta Merge Efficient upserts and CDC (Change Data Capture) Use MERGE operation in Delta Lake, combine with CDC architecture Data Purging (Vacuum) Removes stale data files, maintains storage efficiency Run VACUUM regularly based on transaction frequency Phase 3: Team Alignment and Next Steps Implementing Cost Observability and Taking Action Effective cost management in Databricks goes beyond configuration and code—it requires robust observability, granular tracking, and proactive measures. Below outlines how your teams can achieve this using system tables, tagging, dashboards, and actionable scripts. Cost Observability with System Tables Databricks Unity Catalog provides system tables that store operational data for your account. These tables enable historical cost observability and empower FinOps teams to analyze spend independently. System Tables Location: Found inside the Unity Catalog under the “system” schema. Key Benefits: Structured data for querying, historical analysis, and cost attribution. Action: Assign permissions to FinOps teams so they can access and analyze dedicated cost tables. Enable Tags for Granular Tracking Tagging is a powerful feature for tracking, reporting, and budgeting at a granular level. Classic Compute: Manually add key/value pairs when creating clusters, jobs, SQL Warehouses, or Model Serving endpoints. Use cluster policies to enforce custom tags. Serverless Compute: Create budget policies and assign permissions to teams or members for serverless workloads. Action: Tag all compute resources to enable detailed cost attribution and reporting. Track Costs with Dashboards and Alerts Databricks offers prebuilt dashboards and queries for cost forecasting and usage analysis. Dashboards: Visualize spend, usage trends, and forecast future costs. Prebuilt Queries: Use top queries with system tables to answer meaningful cost questions. Budget Alerts: Set up alerts in the Account Console (Usage > Budget) to receive notifications when spend approaches defined thresholds. Build Culture of Efficiency To go beyond technical fixes and build a culture of efficiency, by focusing on the below strategic actions: Collaborate with Internal Engineers: Spend time with engineering teams to understand workload patterns and optimization opportunities. Peer Reviews and Code Audits: Conduct regular code review sessions and peer reviews to ensure best practices are followed for Spark jobs, data pipelines, and cluster configurations. Create Internal Best Practice Documentation: Develop clear guidelines for writing optimized code, managing data, and maintaining clusters. Make these resources easily accessible for all teams. Implement Observability Dashboards: Use Databricks’ built-in features to create dashboards that track spend, monitor resource utilization, and highlight anomalies. Set Alerts and Budgets: Configure alerts for long-running workloads and establish budgets using prebuilt Databricks capabilities to prevent cost overruns. 5. Azure Reservations and Azure Savings Plan When optimizing Databricks costs on Azure, it’s important to understand the two main commitment-based savings options: Azure Reservations and Azure Savings Plans. Both can help you reduce compute costs, but they differ in flexibility and how savings are applied. Which Should You Choose? Reservations are ideal if you have stable, predictable Databricks workloads and want maximum savings. Savings Plans are better if you expect your compute needs to change, or if you want a simpler, more flexible way to save across multiple services. Pro Tip: You can combine both options—use Reservations for your baseline, always-on Databricks clusters, and Savings Plans for bursty, variable, or new workloads. Summary Table: Action Steps It’s critical to monitor costs continuously and align your teams with established best practices, while scheduling regular code review sessions to ensure efficiency and consistency. Area Best Practice / Action System Tables Use for historical cost analysis and attribution Tagging Apply to all compute resources for granular tracking Dashboards Visualize spend, usage, and forecasts Alerts Set budget alerts for proactive cost management Scripts/Queries Build custom analysis tools for deep insights Cluster/Data/Code Review & Align Regularly review best practices, share findings, and align teams on optimization Save on your Usage Consider Azure Reservations and Azure Savings Plan
Rafia_Aqil
Nov 21, 2025 Place Analytics on Azure Blog
915Views
1like
0Comments
Approaches to Integrating Azure Databricks with Microsoft Fabric: The Better Together Story!
Azure Databricks and Microsoft Fabric can be combined to create a unified and scalable analytics ecosystem. This document outlines eight distinct integration approaches, each accompanied by step-by-step implementation guidance and key design considerations. These methods are not prescriptive—your cloud architecture team can choose the integration strategy that best aligns with your organization’s governance model, workload requirements and platform preferences. Whether you prioritize centralized orchestration, direct data access, or seamless reporting, the flexibility of these options allows you to tailor the solution to your specific needs.
Rafia_Aqil
Nov 21, 2025 Place Analytics on Azure Blog
1.6KViews
6likes
1Comment
Azure Databricks Genie integration with Copilot Studio and Microsoft Foundry is now live!
This blog was co-authored by Toussaint Webb, Databricks We’re excited to announce the Public Preview availability of AI/BI Genie in Microsoft Copilot Studio and Microsoft Foundry via MCP. This makes it easier than ever for organizations to unlock and scale the power of their Genie spaces across the Microsoft ecosystem (e.g., Teams), ultimately democratizing trusted data and insights to business users. AI/BI Genie opens the power of conversational analytics to everyone in the organization. A user can ask a question such as “What is my revenue growth this month?” and Genie interprets the intent, generates the appropriate query, and returns the data insight. Users can also review the underlying logic for transparency. By supporting iterative questioning, Genie enables users to investigate their data directly and build confidence in their understanding without requiring code or specialist intervention. The Challenge Before: Complex setup via Custom Code Previously, connecting Genie to the Microsoft ecosystem was challenging. Organizations had to develop custom connections to manage API flows, which added architectural overhead. This complexity limited organizations’ ability to distribute Genie’s trusted insights efficiently across Microsoft platforms. What’s Now Possible: Unlock the value of Microsoft Ecosystem The new integrations between Genie and Copilot Studio, as well as Genie and Microsoft Foundry, solve these challenges by providing easy and secure ways to connect each platform. Additionally, by leveraging MCP, updates to the underlying Genie APIs are seamlessly managed for users, eliminating the need to modify your integration. Genie + Copilot Studio Connect a Genie space to a Copilot Studio Agent as a tool (over MCP) Instantly make Genie’s trusted insights available in Teams or M365 Copilot by publishing Genie enabled Copilot Studio agents Genie’s complete context of a customer’s data estate will enable Copilot Studio agents to deliver richer, more accurate responses Genie + Microsoft Foundry Connect a Genie space to Foundry as a tool (over MCP) Instantly make Genie’s trusted insights available to developers building agents using Foundry How It Works: Genie + Copilot Studio Create a connection to your Azure Databricks workspace in Power Apps, using OAuth or a Microsoft Entra ID Service Principal Open Copilot Studio Select an existing Copilot Studio agent or create a new one Open ‘Tools’, click ‘+Add a tool’, and search for “Azure Databricks Genie” or find it within the Model Context Protocol section Select the Genie space to connect and configure the connection Note: It’s important to give your Genie space a clear title and description so the Copilot Studio agent can effectively orchestrate requests. After completing the steps above, you are ready to go and can publish the agent to channels, such as Microsoft Teams, to easily distribute the value of Genie to your organization. How It Works: Genie + Microsoft Foundry Portal Open Microsoft Foundry Go to the tool catalog within the Discover tab Select ‘Azure Databricks Genie’ Configure the connection your Azure Databricks Genie space and click ‘Connect’ Click ‘Use in an agent’ and select the desired agent to connect your Genie space to Once completed, Genie is available for use in your Foundry agent! Try It Out: Get started with Genie in Copilot Studio and Microsoft Foundry To get started with Genie + Copilot Studio, check out our technical documentation To get started with Genie + Microsoft Foundry, check out the documentation To learn more about the Generally Available Azure Databricks Power Platform connector explore this blog
AnaviNahar
Nov 18, 2025 Place Analytics on Azure Blog
2.1KViews
0likes
0Comments
Secure Delta Sharing Between Databricks Workspaces Using NCC and Private Endpoints
This guide walks you through the steps to share Delta tables between two Databricks workspaces (NorthCentral and SouthCentral) and configure Network Connectivity Configuration (NCC) for a Serverless Warehouse. These steps ensure secure data sharing and connectivity for your workloads. Part 1: Delta Sharing Between Workspaces Access Delta Shares From your NorthCentral Workspace, go to Catalog. Hover over Delta Shares Received. When the icon appears, click it. → This will redirect you to the Delta Sharing page. Create a New Recipient On the Delta Sharing page, click Shared by me. Click New Recipient. Fill in the details: Recipient Name: (Enter your recipient name) Recipient Type: Select Databricks Sharing Identifier: azure:southcentralus:3035j6je88e8-91-434a-9aca-e6da87c1e882 To get the sharing identifier using a notebook or Databricks SQL query: (SQL) SELECT CURRENT_METASTORE(); Click Create. Share Data Click "Share Data". Enter a Share Name. Select the data assets you want to share. Note: Please disable History for the selected data assets, as the current data snapshot. Disabling the History option on the Delta Share will simplify the share and prevent unnecessary access to historical versions. Additionally, review whether you can further simplify your share by partitioning the data where appropriate. Add the recipient's name you created earlier. Click Share Data. Add Recipient From the newly created share, click Add Recipient. Select your South-Central Workspace Metastore ID. South-CentralWorkspace In your South-Central Workspace, navigate to the Delta Sharing page. Under Shared with me tab, locate your newly created share and click on it. Add the share to a catalog in Unity Catalog. Part 2: Enable NCC for Serverless Warehouse 6. Add Network Connectivity Configuration (NCC) Go to the Databricks Account Console: https://accounts.azuredatabricks.net/ Navigate to Cloud resources, click Add Network Connectivity Configuration. Fill in the required fields and create a new NCC for SouthCentral. 7. Associate NCC with Workspace In the Account Console, go to Workspaces. Select your SouthCentral workspace, click Update Workspace. From the Network Connectivity Configuration dropdown, select the NCC you just created. 8. Add Private Endpoint Rule In Cloud resources, select your NCC, select Private Endpoint Rules and click Add Private Endpoint Rule. Provide: Resource ID: Enter your Storage Account Resource ID in NorthCentral. Note: This can be found in your storage account (NorthCentral). Click on “JSON View” top right. Azure Subresource type: dfs & blob. 9. Approve Pending Connection Go to your NorthCentral Storage Account, Networking, Private Endpoints. You will see a Pending connection from Databricks. Approve the connection and you will see the Connection status in your Account Console as ESTABLISHED. You will now see your share listed under “Delta Shares Received” Note: If you cannot view your share, run the following SQL command: GRANT USE_PROVIDER ON METASTORE TO `username@xxxx.com`.
Rafia_Aqil
Nov 18, 2025 Place Analytics on Azure Blog
386Views
1like
0Comments
How Great Engineers Make Architectural Decisions — ADRs, Trade-offs, and an ATAM-Lite Checklist
Why Decision-Making Matters Without a shared framework, context fades and teams' re-debate old choices. ADRs solve that by recording the why behind design decisions — what problem we solved, what options we considered, and what trade-offs we accepted. A good ADR: Lives next to the code in your repo. Explains reasoning in plain language. Survives personnel changes and version history. Think of it as your team’s engineering memory. The Five Pillars of Trade-offs At Microsoft, we frame every major design discussion using the Azure Well-Architected pillars: Reliability – Will the system recover gracefully from failures? Performance Efficiency – Can it meet latency and throughput targets? Cost Optimization – Are we using resources efficiently? Security – Are we minimizing blast radius and exposure? Operational Excellence – Can we deploy, monitor, and fix quickly? No decision optimizes all five. Great engineers make conscious trade-offs — and document them. A Practical Decision Flow Step What to Do Output 1. Frame It Clarify the problem, constraints, and quality goals (SLOs, cost caps). Problem statement 2. List Options Identify 2-4 realistic approaches. Options list 3. Score Trade-offs Use a Decision Matrix to rate options (1–5) against pillars. Table of scores 4. ATAM-Lite Review List scenarios, identify sensitivity points (small changes with big impact) and risks. Risk notes 5. Record It as an ADR Capture everything in one markdown doc beside the code. ADR file Example: Adding a Read-Through Cache Decision: Add a Redis cache in front of Cosmos DB to reduce read latency. Context: Average P95 latency from DB is 80 ms; target is < 15 ms. Options: A) Query DB directly B) Add read-through cache using Redis Trade-offs Performance: + Massive improvement in read speed. Cost: + Fewer RU/s on Cosmos DB. Reliability: − Risk of stale data if cache invalidation fails. Operational: + Added complexity for monitoring and TTLs. Templates You Can Re-use ADR Template # ADR-001: Add Read-through Cache in Front of Cosmos DB Status: Accepted Date: 2025-10-21 Context: High read latency; P95 = 80ms, target <15ms Options: A) Direct DB reads B) Redis cache for hot keys ✅ Decision: Adopt Redis cache for performance and cost optimization. Consequences: - Improved read latency and reduced RU/s cost - Risk of data staleness during cache invalidation - Added operational complexity Links: PR#3421, Design Doc #204, Azure Monitor dashboard Decision Matrix Example Pillar Weight Option A Option B Notes Reliability 5 3 4 Redis clustering handles failover Performance 4 2 5 In-memory reads Cost 3 4 5 Reduced RU/s Security 4 4 4 Same auth posture Operational Excellence 3 4 3 More moving parts Weighted total = Σ(weight × score) → best overall score wins. Team Guidelines Create a /docs/adr folder in each repo. One ADR per significant change; supersede old ones instead of editing history. Link ADRs in design reviews and PRs. Revisit when constraints change (incidents, new SLOs, cost shifts). Publish insights as follow-up blogs to grow shared knowledge. Why It Works This practice connects the theory of trade-offs with Microsoft’s engineering culture of reliability and transparency. It improves onboarding, enables faster design reviews, and builds a traceable record of engineering evolution. Join the Conversation Have you tried ADRs or other decision frameworks in your projects? Share your experience in the comments or link to your own public templates — let’s make architectural reasoning part of our shared language.
Antony_nganga
Oct 21, 2025 Place Azure Architecture Blog
547Views
1like
0Comments
SAP Business Data Cloud Connect with Azure Databricks is now generally available
We are excited to share that SAP Business Data Cloud (SAP BDC) Connect for Azure Databricks is generally available. With this announcement, Azure Databricks customers like you, can connect your SAP BDC environment to your existing Azure Databricks instance – without copying the data – to enable bi-directional, live data sharing. Connecting SAP data with other enterprise data prevents governance risk, compliance gaps, and data silos. In addition, maintenance costs are also reduced and manual building of semantics is no longer needed. SAP data products can now be shared directly via Delta Sharing into your existing Azure Databricks instances ensuring complete context for your business. You can now unify your data estate across Azure Databricks and SAP BDC This makes it easier for you to: Enforce governance Power analytics, data warehousing, BI and AI Connecting SAP BDC to Azure Databricks is simple, secure, and fast. The connection is trusted and requires approval from both platforms to enable bi-directional sharing of data products. Once approved, data products in SAP BDC can be directly mounted into Azure Databricks Unity Catalog and are treated like other assets shared using Delta sharing. As a result, your teams can query, analyze, and gather insights on SAP data in addition to your existing business data in one unified way. Instead of spending time gathering the data in once place, your teams can instead focus on unlocking insights from this unified data quickly and securely. This launch complements SAP Databricks in SAP BDC running on Azure that enables AI, ML, data engineering, and data warehousing capabilities directly inside your SAP environment. We have expanded the list of supported regions for SAP Databricks on SAP BDC running on Azure. To learn more with SAP BDC Connect with Azure Databricks review documentation and get started today.
AnaviNahar
Oct 15, 2025 Place Analytics on Azure Blog
1.4KViews
2likes
0Comments
How Azure NetApp Files Object REST API powers Azure and ISV Data and AI services – on YOUR data
This article introduces the Azure NetApp Files Object REST API, a transformative solution for enterprises seeking seamless, real-time integration between their data and Azure's advanced analytics and AI services. By enabling direct, secure access to enterprise data—without costly transfers or duplication—the Object REST API accelerates innovation, streamlines workflows, and enhances operational efficiency. With S3-compatible object storage support, it empowers organizations to make faster, data-driven decisions while maintaining compliance and data security. Discover how this new capability unlocks business potential and drives a new era of productivity in the cloud.
GeertVanTeylingen
Oct 14, 2025 Place Azure Architecture Blog
592Views
0likes
0Comments
Data Vault 2.0 Warehouse Automation on Azure
This is the series of 'Blog Articles' on the topic "Data Vault 2.0 on Azure" where we start from 'What?' and then slowly dwell into 'How To?' implement DV 2.0 on Azure Data Platform Technologies.
Naveed-Hussain
Oct 10, 2025 Place Analytics on Azure Blog
9.2KViews
0likes
1Comment
Secure Medallion Architecture Pattern on Azure Databricks (Part I)
This article presents a security-first pattern for Azure Databricks: a Medallion Architecture where Bronze, Silver and Gold each run as their Lakeflow Job and cluster, orchestrated by a parent job. Run-as identities are Microsoft Entra service principals; storage access is governed via Unity Catalog External Locations backed by the Access Connector’s managed identity. Least-privilege is enforced with cluster policies and UC grants. Prefer managed tables to unlock Predictive Optimisation, Automatic liquid clustering and Automatic statistics. Secrets live in Azure Key Vault and are read at runtime. Monitor reliability and cost with system tables and Jobs UI. Part II covers more low-level concepts and CI/CD.
mscagliola
Oct 08, 2025 Place Analytics on Azure Blog
919Views
11likes
0Comments