azure
3055 TopicsLearn how to build MCP servers with Python and Azure
We just concluded Python + MCP, a three-part livestream series where we: Built MCP servers in Python using FastMCP Deployed them into production on Azure (Container Apps and Functions) Added authentication, including Microsoft Entra as the OAuth provider All of the materials from our series are available for you to keep learning from, and linked below: Video recordings of each stream Powerpoint slides Open-source code samples complete with Azure infrastructure and 1-command deployment If you're an instructor, feel free to use the slides and code examples in your own classes. Spanish speaker? We've got you covered- check out the Spanish version of the series. 🙋🏽♂️Have follow up questions? Join our weekly office hours on Foundry Discord: Tuesdays @ 11AM PT → Python + AI Thursdays @ 8:30 AM PT → All things MCP Building MCP servers with FastMCP 📺 Watch YouTube recording In the intro session of our Python + MCP series, we dive into the hottest technology of 2025: MCP (Model Context Protocol). This open protocol makes it easy to extend AI agents and chatbots with custom functionality, making them more powerful and flexible. We demonstrate how to use the Python FastMCP SDK to build an MCP server running locally. Then we consume that server from chatbots like GitHub Copilot in VS Code, using it's tools, resources, and prompts. Finally, we discover how easy it is to connect AI agent frameworks like Langchain and Microsoft agent-framework to the MCP server. Slides for this session Code repository with examples: python-mcp-demos Deploying MCP servers to the cloud 📺 Watch YouTube recording In our second session of the Python + MCP series, we deploy MCP servers to the cloud! We walk through the process of containerizing a FastMCP server with Docker and deploying to Azure Container Apps. Then we instrument the MCP server with OpenTelemetry and observe the tool calls using Azure Application Insights and Logfire. Finally, we explore private networking options for MCP servers, using virtual networks that restrict external access to internal MCP tools and agents. Slides for this session Code repository with examples: python-mcp-demos Authentication for MCP servers 📺 Watch YouTube recording In our third session of the Python + MCP series, we explore the best ways to build authentication layers on top of your MCP servers. We start off simple, with an API key to gate access, and demonstrate a key-restricted FastMCP server deployed to Azure Functions. Then we move on to OAuth-based authentication for MCP servers that provide user-specific data. We dive deep into MCP authentication, which is built on top of OAuth2 but with additional requirements like PRM and DCR/CIMD, which can make it difficult to implement fully. We demonstrate the full MCP auth flow in the open-souce identity provider KeyCloak, and show how to use an OAuth proxy pattern to implement MCP auth on top of Microsoft Entra. Slides for this session Code repository with Container Apps examples: python-mcp-demos Code repository with Functions examples: python-mcp-demosMicrosoft BizTalk Server Product Lifecycle Update
For more than 25 years, Microsoft BizTalk Server has supported mission-critical integration workloads for organizations around the world. From business process automation and B2B messaging to connectivity across industries such as financial services, healthcare, manufacturing, and government, BizTalk Server has played a foundational role in enterprise integration strategies. To help customers plan confidently for the future, Microsoft is sharing an update to the BizTalk Server product lifecycle and long-term support timelines. BizTalk Server 2020 will be the final version of BizTalk Server. Guidance to support long-term planning for mission-critical workloads This announcement does not change existing support commitments. Customers can continue to rely on BizTalk Server for many years ahead, with a clear and predictable runway to plan modernization at a pace that aligns with their business and regulatory needs. Lifecycle Phase End Date What’s Included Mainstream Support April 11, 2028 Security + non-security updates and Customer Service & Support (CSS) support Extended Support April 9, 2030 CSS support, Security updates, and paid support for fixes (*) End of Support April 10, 2030 No further updates or support (*) Paid Extended Support will be available for BizTalk Server 2020 between April 2028 and April 2030 for customers requiring hotfixes for non-security updates. CSS will continue providing their typical support. BizTalk Server 2016 is already out of mainstream support, and we recommend those customers evaluate a direct modernization path to Azure Logic Apps. Continued Commitment to Enterprise Integration Microsoft remains fully committed to supporting mission-critical integration, including hybrid connectivity, future-ready orchestration, and B2B/EDI modernization. Azure Logic Apps, part of Azure Integration Services — which includes API Management, Service Bus, and Event Grid — delivers the comprehensive integration platform for the next decade of enterprise connectivity. Host Integration Server: Continued Support for Mainframe Workloads Host Integration Server (HIS) has long provided essential connectivity for organizations with mainframe and midrange systems. To ensure continued support for those workloads, Host Integration Server 2028 will ship as a standalone product with its own lifecycle, decoupled from BizTalk Server. This provides customers with more flexibility and a longer planning horizon. Recognizing Mainframe modernization customers might be looking to integrate with their mainframes from Azure, Microsoft provides Logic Apps connectors for mainframe and midrange systems, and we are keen on adding more connectors in this space. Let us know about your HIS plans, and if you require specific features for Mainframe and midranges integration from Logic Apps at: https://aka.ms/lamainframe Azure Logic Apps: The Successor to BizTalk Server Azure Logic Apps, part of Azure Integration Services, is the modern integration platform that carries forward what customers value in BizTalk while unlocking new innovation, scale, and intelligence. With 1,400+ out-of-box connectors supporting enterprise, SaaS, legacy, and mainframe systems, organizations can reuse existing BizTalk maps, schemas, rules, and custom code to accelerate modernization while preserving prior investments including B2B/EDI and healthcare transactions. Logic Apps delivers elastic scalability, enterprise-grade security and compliance, and built-in cost efficiency without the overhead of managing infrastructure. Modern DevOps tooling, Visual Studio Code support, and infrastructure-as-code (ARM/Bicep) ensure consistent, governed deployments with end-to-end observability using Azure Monitor and OpenTelemetry. Modernizing Logic Apps also unlocks agentic business processes, enabling AI-driven routing, predictive insights, and context-aware automation without redesigning existing integrations. Logic Apps adapts to business and regulatory needs, running fully managed in Azure, hybrid via Arc-enabled Kubernetes, or evaluated for air-gapped environments. Throughout this lifecycle transition, customers can continue to rely on the BizTalk investments they have made while moving toward a platform ready for the next decade of integration and AI-driven business. Charting Your Modernization Path Microsoft remains fully committed to supporting customers through this transition. We recognize that BizTalk systems support highly customized and mission-critical business operations. Modernization requires time, planning, and precision. We hope to provide: Proven guidance and recommended design patterns A growing ecosystem of tooling supporting artifact reuse Unified Support engagements for deep migration assistance A strong partner ecosystem specializing in BizTalk modernization Potential incentive programs to help facilitate migration for eligible customers (details forthcoming) Customers can take a phased approach — starting with new workloads while incrementally modernizing existing BizTalk deployments. We’re Here to Help Migration resources are available today: Overview: https://aka.ms/btmig Best practices: https://aka.ms/BizTalkServerMigrationResources Video series: https://aka.ms/btmigvideo Feature request survey: https://aka.ms/logicappsneeds We encourage customers to engage their Microsoft accounts team early to assess readiness, identify modernization opportunities, and explore assistance programs. Your Modernization Journey Starts Now BizTalk Server has played a foundational role in enterprise integration success for more than two decades. As you plan ahead, Microsoft is here to partner with you every step of the way, ensuring operational continuity today while unlocking innovation tomorrow. To begin your transition, please contact your Microsoft account team or visit our migration hub. Thank you for your continued trust in Microsoft and BizTalk Server. We look forward to partnering closely with you as you plan the future of your integration platforms. Frequently Asked Questions Do I need to migrate now? No. BizTalk Server 2020 is fully supported through April 11, 2028, with paid Extended Support available through April 9, 2030, for non-security hotfixes. CSS will continue providing their typical support. You have a long and predictable runway to plan your transition. Will there be a new BizTalk Server version? No. BizTalk Server 2020 is the final version of the product. What happens after April 9, 2030? BizTalk Server will reach End of Support, and security updates or technical assistance will no longer be provided. Workloads will continue running but without Microsoft servicing. Is paid support available past 2028? Yes. Paid extended support will be available through April 2030 for BizTalk Server 2020 customers looking for non-security hotfixes. CSS will continue to provide the typical support. What about BizTalk Server 2016 or earlier versions? Those versions are already out of mainstream support. We strongly encourage moving directly to Logic Apps rather than upgrading to BizTalk Server 2020. Will Host Integration Server continue? Yes. Host Integration Server (HIS) 2028 will be released as a standalone product with its own lifecycle and support commitments. Can I reuse BizTalk Server artifacts in Logic Apps? Yes. Most of BizTalk maps, schemas, rules, assemblies, and custom code can be reused with minimal effort using Microsoft and partner migration tooling. We welcome feature requests here: https://aka.ms/logicappsneeds Does modernization require moving fully to the cloud? No. Logic Apps supports hybrid deployments for scenarios requiring local processing or regulatory compliance, and fully disconnected environments are under evaluation. More information of the Hybrid deployment model here: https://aka.ms/lahybrid. Does modernization unlock AI capabilities? Yes. Logic Apps enables AI-driven automations through Agent Loop, improving routing, decisioning, and operational intelligence. Where do I get planning support? Your Microsoft account team can assist with assessment and planning. Migration resources are also linked in this announcement to help you get started. Microsoft Corporation962Views2likes1CommentCloud Native Identity with Azure Files: Entra-only Secure Access for the Modern Enterprise
Azure Files introduces Entra only identities authentication for SMB shares, enabling cloud-only identity management without reliance on on-premises Active Directory. This advancement supports secure, seamless access to file shares from anywhere, streamlining cloud migration and modernization, and reducing operational complexity and costs.9.8KViews8likes14CommentsAnnouncing the General Availability (GA) of the Premium v2 tier of Azure API Management
Superior capacity, highest entity limits, unlimited included calls, and the most comprehensive set of features set the Premium v2 tier apart from other API Management tiers. Customers rely on the Premium v2 tier for running enterprise-wide API programs at scale, with high availability, and performance. The Premium v2 tier has a new architecture that eliminates management traffic from the customer VNet, making private networking much more secure and easier to setup. During the creation of a Premium v2 instance, you can choose between VNet injection or VNet integration (introduced in the Standard v2 tier) options. In addition, today we are also adding three new features to Premium v2: Inbound Private Link: You can now enable private endpoint connectivity to restrict inbound access to your Premium v2 instance. It can be enabled along with VNet injection or VNet integration or without a VNet. Availability zone support: Premium v2 now supports availability zones (zone redundancy) to enhance the reliability and resilience of your API gateway. Custom CA certificates: Azure API management v2 gateway can now validate TLS connections with the backend service using custom CA certificates. New and improved VNet injection Using VNet injection in Premium v2 no longer requires configuring routes or service endpoints. Customers can secure their API workloads without impacting API Management dependencies, while Microsoft can secure the infrastructure without interfering with customer API workloads. In short, the new VNet injection implementation enables both parties to manage network security and configuration settings independently and without affecting each other. You can now configure your APIs with complete networking flexibility: force tunnel all outbound traffic to on-premises, send all outbound traffic through an NVA, or add a WAF device to monitor all inbound traffic to your API Management Premium v2—all without constraints. Inbound Private Link Customers can now configure an inbound private endpoint for their API Management Premium v2 instance to allow your API consumers securely access the API Management gateway over Azure Private Link. The private endpoint uses an IP address from an Azure virtual network in which it's hosted. Network traffic between a client on your private network and API Management traverses over the virtual network and a Private Link on the Microsoft backbone network, eliminating exposure from the public internet. Further, you can configure custom DNS settings or an Azure DNS private zone to map the API Management hostname to the endpoint's private IP address. With a private endpoint and Private Link, you can: Create multiple Private Link connections to an API Management instance. Use the private endpoint to send inbound traffic on a secure connection. Apply different API Management policies based on whether traffic comes from the private endpoint. Limit incoming traffic only to private endpoints, preventing data exfiltration. Combine with inbound virtual network injection or outbound virtual network integration to provide end-to-end network isolation of your API Management clients and backend services. More details can be found here Today, only the API Management instance’s Gateway endpoint supports inbound private link connections. Each API management instance can support at most 100 Private Link connections. Availability zones Azure API Management Premium v2 now supports Availability Zones (AZ) redundancy to enhance the reliability and resilience of your API gateway. When deploying an API Management instance in an AZ-enabled region, users can choose to enable zone redundancy. This distributes the service's units, including Gateway, management plane, and developer portal, across multiple, physically separate AZs within that region. Learn how to enable AZs here. CA certificates If the API Management Gateway needs to connect to the backends secured with TLS certificates issued by private certificate authorities (CA), you need to configure custom CA certificates in the API Management instance. Custom CA certificates can be added and managed as Authorization Credentials in the Backend entities. The Backend entity has been extended with new properties allowing customers to specify a list of certificate thumbprints or subject name + issuer thumbprint pairs that Gateway should trust when establishing TLS connection with associated backend endpoint. More details can be found here. Region availability The Premium v2 tier is now generally available in six public regions (Australia East, East US2, Germany West Central, Korea Central, Norway East and UK South) with additional regions coming soon. For pricing information and regional availability, please visit the API Management pricing page. Learn more API Management v2 tiers FAQ API Management v2 tiers documentation API Management overview documentationGuide for Architecting Azure-Databricks: Design to Deployment
Author's: Chris Walk cwalk, Dan Johnson danjohn1234, Eduardo dos Santos eduardomdossantos, Ted Kim tekim, Eric Kwashie ekwashie, Chris Haynes Chris_Haynes, Tayo Akigbogun takigbogun and Rafia Aqil Rafia_Aqil Peer Reviewed: Mohamed Sharaf mohamedsharaf Note: This article does not cover the Serverless Workspace option, which is currently in Public Preview. We plan to update this article once Serverless Workspaces are Generally Available. Also, while Terraform is the recommended method for production deployments due to its automation and repeatability, for simplicity in this article we will demonstrate deployment through the Azure portal. DESIGN: Architecting a Secure Azure Databricks Environment Step 1: Plan Workspace, Subscription Organization, Analytics Architecture and Compute Planning your Azure Databricks environment can follow various arrangements depending on your organization’s structure, governance model, and workload requirements. The following guidance outlines key considerations to help you design a well-architected foundation. 1.1 Align Workspaces with Business Units A recommended best practice is to align each Azure Databricks workspace with a specific business unit. This approach—often referred to as the “Business Unit Subscription” design pattern—offers several operational and governance advantages. Streamlined Access Control: Each unit manages its own workspace, simplifying permissions and reducing cross-team access risks. For example, Sales can securely access only their data and notebooks. Cost Transparency: Mapping workspaces to business units enables accurate cost attribution and supports internal chargeback models. Each workspace can be tagged to a cost center for visibility and accountability. Even within the same workspace, costs can be controlled using system tables that provide detailed usage metrics and resource consumption insights. Challenges to keep-in-mind: While per-BU workspaces have high impact, be mindful of workspace sprawl. If every small team spins up its own workspace, you might end up with dozens or hundreds of workspaces, which introduces management overhead. Databricks recommends a reasonable upper limit (on Azure, roughly 20–50 workspaces per account/subscription) because managing “collaboration, access, and security across hundreds of workspaces can become extremely difficult, even with good automation” [1]. Each workspace will need governance (user provisioning, monitoring, compliance checks), so there is a balance to strike. 1.2 Workspace Alignment and Shared Metastore Strategy As you align workspaces with business units, it's essential to understand how Unity Catalog and the metastore fit into your architecture. Unity Catalog is Databricks’ unified governance layer that centralizes access control, auditing, and data lineage across workspaces. Each Unity Catalog is backed by a metastore, which acts as the central metadata repository for tables, views, volumes, and other data assets. In Azure Databricks, you can have one metastore per region, and all workspaces within that region share it. This enables consistent governance and simplifies data sharing across teams. If your organization spans multiple regions, you’ll need to plan for cross-region sharing, which Unity Catalog supports through Delta Sharing. By aligning workspaces with business units and connecting them to a shared metastore, you ensure that governance policies are enforced uniformly, while still allowing each team to manage its own data assets securely and independently. 1.3 Distribute Workspaces Across Subscriptions When scaling Azure Databricks, consider not just the number of workspaces, but also how to distribute them across Azure subscriptions. Using multiple Azure subscriptions can serve both organizational needs and technical requirements: Environment Segmentation (Dev/Test/Prod): A common pattern is to put production workspaces in a separate Azure subscription from development or test workspaces. This provides an extra layer of isolation. Microsoft highly recommends separating workspaces into prod and dev, in separate subscriptions. This way, you can apply stricter Azure policies or network rules to the prod subscription and keep the dev subscription a bit more open for experimentation without risking prod resources. Honor Azure Resource Limits: Azure subscriptions come with certain capacity limits and Azure Databricks workspaces have their own limits (since it’s a multi-tenant PaaS). If you put all workspaces in one subscription, or all teams in one workspace, you might hit those limits. Most enterprises naturally end up with multiple subscriptions as they grow – planning this early avoids later migration headaches. If you currently have everything in one subscription, evaluate usage and consider splitting off heavy workloads or prod workloads into a new one to adhere to best practices. 1.4 Consider Completing Azure Landing Zone Assessment When evaluating and planning your next deployment, it’s essential to ensure that your current landing zone aligns with Microsoft best practices. This helps establish a robust Databricks architecture and minimizes the risk of avoidable issues. Additionally, customers who are early in their cloud journey can benefit from Cloud Assessments—such as an Azure Landing Zone Review and a review of the “Prepare for Cloud Adoption” documentation—to build a strong foundation. 1.5 Planning Your Azure Databricks Workspace Architecture Your workspace architecture should reflect the operational model of your organization and support the workloads you intend to run, from exploratory notebooks to production-grade ETL pipelines. To support your planning, Microsoft provides several reference architectures that illustrate well-architected patterns for Databricks deployments. These solution ideas can serve as starting points for designing maintainable environments: Simplified Architecture: Modern Data Platform Architecture, ETL-Intensive Workload Reference Architecture: Building ETL Intensive Architecture, End-to-End Analytics Architecture: Create a Modern Analytics Architecture. 1.6 Planning for that “Right” Compute Choosing the right compute setup in Azure Databricks is crucial for optimizing performance and controlling costs, as billing is based on Databricks Units (DBUs) using a per-second pricing model. Classic Compute: You can fine-tune your own compute by enabling auto-termination and autoscaling, using Photon acceleration, leveraging spot instances, selecting the right VM type and node count for your workload, and choosing SSDs for performance or HDDs for archival storage. Preferred by mature internal teams and developers who need advanced control over clusters—such as custom VM selection, tuning, and specialized configurations. Serverless Compute: Alternatively, managed services can simplify operations with built-in optimizations. Removes infrastructure management and offers instant scaling without cluster warm-up, making it ideal for agility and simplicity. Step 2: Plan the “Right” CIDR Range (Classic Compute) Note: You can skip this step if you plan to use serverless compute for all your resources, as CIDR range planning is not required in serverless deployments. When planning CIDR ranges for your Azure Databricks workspace, it's important to ensure your virtual network has enough IP address capacity to support cluster scaling. Why this matters: If you choose a small VNet address space and your analytics workloads grow, you might hit a ceiling where you simply cannot launch more clusters or scale-out because there are no free IPs in the subnet. The subnet sizes—and by extension, the VNet CIDR—determine how many nodes you can. Databricks recommends using a CIDR block between /16 and /24 for the VNet, and up to /26 for the two required subnets: the container subnet and the host subnet. Here’s a reference Microsoft provides. If your current workspace’s VNet lacks sufficient IP space for active cluster nodes, you can request a CIDR range update through your Azure Databricks account team as noted in the Microsoft documentation. 2.1 Considerations for CIDR Range Workload Type & Concurrency: Consider what kinds of workloads will run (ETL Pipelines, Machine Learning Notebooks, BI Dashboards, etc.) and how many jobs or clusters may need to run in parallel. High concurrency (e.g. multiple ETL jobs or many interactive clusters) means more nodes running at the same time, requiring a larger pool of IP addresses. Data Volume (Historical vs. Incremental): Are you doing a one-time historical data load or only processing new incremental data? A large backfill of terabytes of data may require spinning up a very large cluster (hundreds of nodes) to process in a reasonable time. Ongoing smaller loads might get by with fewer nodes. Estimate how much data needs processing. Transformation Complexity: The complexity of data transformations or machine learning workloads matters. Heavy transformations (joins, aggregations on big data) or complex model training can benefit more workers. If your use cases include these, you may need larger clusters (more nodes) to meet performance SLAs, which in turn demands more IP addresses available in the subnet. Data Sources and Integration: Consider how your Databricks environment will connect to data. If you have multiple data sources or sinks (e.g. ingest from many event hubs, databases, or IoT streams), you might design multiple dedicated clusters or workflows, potentially all active at once. Also, if using separate job clusters per job (Databricks Jobs), multiple clusters might launch concurrently. All these scenarios increase concurrent node count. 2.2 Configuring a Dedicated Network (VNet) per Workspace with Egress Control By default, Azure Databricks deploys its classic compute resources into a Microsoft-managed virtual network (VNet) within your Azure subscription. While this simplifies setup, it limits control over network configuration. For enhanced security and flexibility, it's recommended to use VNet Injection, which allows you to deploy the compute plane into your own customer-managed VNet. This approach enables secure integration with other Azure services using service endpoints or private endpoints, supports user-defined routes for accessing on-premises data sources, allows traffic inspection via network virtual appliances or firewalls, and provides the ability to configure custom DNS and enforce egress restrictions through network security group (NSG) rules. Within this VNet (which must reside in the same region and subscription as the Azure Databricks workspace), two subnets are required for Azure Databricks: a container subnet (referred to as private subnet) and a host subnet (referred to as public subnet). To implement front-end Private Link, back-end Private Link, or both, your workspace VNet needs a third subnet that will contain the private endpoint (PrivateLink subnet). It is recommended to also deploy an Azure Firewall for egress control. Step 3: Plan Network Architecture for Securing Azure-Databricks 3.1 Secure Cluster Connectivity Secure Cluster Connectivity, also known as No Public IP (NPIP), is a foundational security feature for Azure Databricks deployments. When enabled, it ensures that compute resources within the customer-managed virtual network (VNet) do not have public IP addresses, and no inbound ports are exposed. Instead, each cluster initiates a secure outbound connection to the Databricks control plane using port 443 (HTTPS), through a dedicated relay. This tunnel is used exclusively for administrative tasks, separate from the web application and REST API traffic, significantly reducing the attack surface. For the most secure deployment, Microsoft and Databricks strongly recommend enabling Secure Cluster Connectivity, especially in environments with strict compliance or regulatory requirements. When Secure Cluster Connectivity is enabled, both workspace subnets become private, as cluster nodes don’t have public IP addresses. 3.2 Egress with VNet Injection (NVA) For Databricks traffic, you’ll need to assign a UDR to the Databricks-managed VNet with a next hop type of Network Virtual Appliance (NVA)—this could be an Azure Firewall, NAT Gateway, or another routing device. For control plane traffic, Databricks recommends using Azure service tags, which are logical groupings of IP addresses for Azure services and should be routed with the next hop type of internet. This is important because Azure IP ranges can change frequently as new resources are provisioned, and manually maintaining IP lists is not practical. Using service tags ensures that your routing rules automatically stay up to date. 3.3 Front-End Connectivity with Azure Private Link (Standard Deployment) To further enhance security, Azure Databricks supports Private Link for front-end connections. In a standard deployment, Private Link enables users to access the Databricks web application, REST API, and JDBC/ODBC endpoints over a private VNet interface, bypassing the public internet. For organizations with no public internet access from user networks, a browser authentication private endpoint is required. This endpoint supports SSO login callbacks from Microsoft Entra ID and is shared across all workspaces in a region using the same private DNS zone. It is typically hosted in a transit VNet that bridges on-premises networks and Azure. Note: There are two deployment types: standard and simplified. To compare these deployment types, see Choose standard or simplified deployment. 3.4 Serverless Compute Networking Azure Databricks offers serverless compute options that simplify infrastructure management and accelerate workload execution. These resources run in a Databricks-managed serverless compute plane, isolated from the public internet and connected to the control plane via the Microsoft backbone network. To secure outbound traffic from serverless workloads, administrators can configure Serverless Egress Control using network policies that restrict connections by location, FQDN, or Azure resource type. Additionally, Network Connectivity Configurations (NCCs) allow centralized management of private endpoints and firewall rules. NCCs can be attached to multiple workspaces and are essential for enabling secure access to Azure services like Data Lake Storage from serverless SQL warehouses. DEPLOYMENT: Step-to-Step Implementation using Azure Portal Step 1: Create an Azure Resource Group For each new workspace, create a dedicated Resource Group (to contain the Databricks workspace resource and associated resources). Ensure that all resources are deployed in the same Region and Resource Group (i.e. workspace, subnets...) to optimize data movement performance and enhance security. Step 2: Deploy Workspace Specific Virtual Network (VNET) From your Resource Group, create a Virtual Network. Under the Security section, enable Azure Firewall. Deploying an Azure Firewall is recommended for egress control, ensuring that outbound traffic from your Databricks environment is securely managed. Define address spaces for your Virtual Network (Review Step 2 from Design). As documented, you could create a VNet with these values: IP range: First remove the default IP range, and then add IP range 10.28.0.0/23. Create subnet public-subnet with range 10.28.0.0/25. Create subnet private-subnet with range 10.28.0.128/25. Create subnet private-link with range 10.28.1.0/27. Please note: your IP values can be different depending on your IPAM and available scopes. Review + Create your Virtual Network. Step 3: Deploy Azure-Databricks Workspace: Now that networking is in place, create the Databricks workspace. Below are detailed steps your organization should review while creating workspace creation: In Azure Portal, search for Azure Databricks and click Create. Choose the Subscription, RG, Region, select Premium, enter in “Managed Resource Group name” and click Next. Managed Resource Group- will be created after your Databrick workspace is deployed and contains infrastructure resources for the workspace i.e. VNets, DBFS. Required: Enable “Secure Cluster Connectivity” (No Public IP for clusters), to ensure that Databricks clusters are deployed without public IP addresses (Review Section 3.1). Required: Enable the option to deploy into your Virtual Network (VNet Injection), also known as “Bring Your Own VNet” (Review Section 3.2). Select the Virtual Network created in Step 2. Enter Private, Public Subnet Names. Enable or Disable “Deploying Nat Gateway”, according to your workspace requirement. Disable “Allow Public Network Access”. Select “No Azure Databricks Rules” for Required NSG Rules. Select “Click on add to create a private endpoint”, this will open a panel for private endpoint setup. Click “Add” to enter your Private Link details created in Step 2. Also, ensure that Private DNS zone integration is set to “Yes” and that a new Private DNS Zone is created, indicated by (New)privatelink.azuredatabricks.net. Unless an existing DNS zone for this purpose already exists. (Optional) Under Encryption Tab, Enable Infrastructure Encryption, if you have requirement for FIPS 140-2. It comes at a cost, it takes time to encrypt and decrypt. By default your data is already encrypted. If you have a standard regulatory requirement (ex. HIPAA). (Optional) Compliance security profile- for HIPAA. (Optional) Automatic cluster updates, First Sunday of every Month. Review + Create the workspace and wait for it to deploy. Step 4: Create a private endpoint to support SSO for web browser access: Note: This step is required when front-end Private Link is enabled, and client networks cannot access the public internet. After creating your Azure Databricks workspace, if you try to launch it without the proper Private Link configuration, you will see an error like the image below: This happens because the workspace is configured to block public network access, and the necessary Private Endpoints (including the browser_authentication endpoint for SSO) are not yet in place. Create Web-Auth Workspace Note: Deploy a “dummy”: WEB_AUTH_DO_NOT_DELETE_<region> workspace in the same region as your production workspace. Purpose: Host the browser_authentication private endpoint (one required per region). Lock the workspace (Delete lock) to prevent accidental removal. Follow step 2 to create Virtual Network (Vnet) Follow step 3 and create a VNet injected “dummy” workspace. Create Browser Authentication Private Endpoint In Azure Portal, Databricks workspace (dummy), Networking, Private endpoint connections, + Private endpoint. Resource step: Target sub-resource: browser_authentication Virtual Network step: VNet: Transit/Hub VNet (central network for Private Link) Subnet: Private Endpoint subnet in that VNet (not Databricks host subnets) DNS step: Integrate with Private DNS zone: Yes Zone: privatelink.azuredatabricks.net Ensure DNS zone is linked to the Transit VNet After creation: A-records for *.pl-auth.azuredatabricks.net are auto-created in the DNS zone. Workspace Connectivity Testing If you have VPN or ExpressRoute, Bastion is not required. However, for the purposes of this article we will be testing our workpace connectivity through Bastion. If you don’t have private connectivity and need to test from inside the VNet, Azure Bastion is a convenient option. Step 5: Create Storage Account From your Resource Group, click Create and select Storage account. On the configuration page: Select Preferred Storage type as: Azure Blob Storage or Azure Data Lake Storage Gen 2. Choose Performance and Redundancy options based on your business requirements. Click Next to proceed. Under the Advanced tab: Enable Hierarchical namespace under Data Lake Storage Gen2. This is critical for: Directory and file-level operations, Access Control Lists (ACLs). Under the Networking tab: Set Public Network Access to Disabled. Complete the creation process and then create container(s) inside the storage account. Step 6: Create Private Endpoints for Workspace Storage Account Pre-requisite: You need to create two private endpoints from the VNet used for VNet injection to your workspace storage account for the following Target sub-resources: dfs and blob. Navigate to your Storage Account. Go to Networking, Private Endpoints tab and click on to + Create Private Endpoint. In the Create Private Endpoint wizard: Resource tab: Select your Storage Account. Set Target sub-resource to dfs for the first endpoint. Virtual Network tab: Choose the VNet you used for VNet injection. Select the appropriate subnet. Complete the creation process. The private endpoint will be auto approved and visible under Private Endpoints. Repeat the process for the second private endpoint: This time set Target sub-resource to blob. Step 7: Link Storage and Databricks Workspace: Create Access Connector In your Resource Group, create an Access Connector for Azure Databricks. No additional configuration is required during creation. Assign Role to Access Connector Navigate to your Storage Account, Access Control (IAM), Add role assignment. Select: Role: Storage Blob Data Contributor Assign access to: Managed Identity Under Members: Click Select members. Find and select your newly created Access Connector for Azure Databricks. Save the role assignment. Copy Resource ID Go to the Access Connector Overview page. Copy the Resource ID for later use in Databricks configuration. Step 8: Link Storage and Databricks Workspace: Navigate to Unity Catalog In your Databricks Workspace, go to Unity Catalog, External Data and select “Create external Location” button. Configure External Location Select ADLS as the storage type. Enter the ADLS storage URL in the following format: abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/ Update these two parameters: <container_name> and <storage_name> Provide Access Connector Select “Create new storage credential” from Storage credential field. Paste the Resource ID of the Access Connector for Azure Databricks (from Step 10) into the Access Connector ID field. Validate Connection Click Submit. You should see a “Successful” message confirming the connection. Click submit and you should receive a “Successful” message, indicating your connection has succeeded. You can now create Catalogs and link your secure storage. Step 9: Configuring Serverless Compute Networking: If your organization plans to use Serverless SQL Warehouses or Serverless Jobs Compute, you must configure Serverless Networking. Add Network Connectivity Configuration (NCC) Go to the Databricks Account Console: https://accounts.azuredatabricks.net/ Navigate to Cloud resources, click Add Network Connectivity Configuration. Fill in the required fields and create a new NCC. Associate NCC with Workspace In the Account Console, go to Workspaces. Select your workspace, click Update Workspace. From the Network Connectivity Configuration dropdown, select the NCC you just created. Add Private Endpoint Rule In Cloud resources, select your NCC, select Private Endpoint Rules and click Add Private Endpoint Rule. Provide: Resource ID: Enter your Storage Account Resource ID. Note: this can be found from your storage account, click on “JSON View” top right. Azure Subresource type: dfs & blob. Approve Pending Connection Go to your Storage Account, Networking, Private Endpoints. You will see a Pending connection from Databricks. Approve the connection and you will see the Connection status in your Account Console as ESTABLISHED. Step 10: Test Your Workspace: Launch a small test cluster and verify the following: It can start (which means it can talk to the control plane). It can read/write from the storage, following the following code to confirm read/write to storage: Set Spark properties to configure Azure credentials to access Azure storage. Check Private DNS Record has been created. (Optional) If on-prem data is needed: try connecting to an on-prem database (using the ExpressRoute path): Connect your Azure Databricks workspace to your on-premises network - Azure Databricks | Microsoft Learn. Step 11: Account Console, Planning Workspace Access Controls and Getting Started: Once your Azure Databricks workspace is deployed, it's essential to configure access controls and begin onboarding users with the right permissions. From your account console: https://accounts.azuredatabricks.net/, you can centrally manage your environment: add users and groups, enable preview features, and view or configure all your workspaces. Azure Databricks supports fine-grained access management through Unity Catalog, cluster policies, and workspace-level roles. Start by defining who needs access to what—whether it's notebooks, tables, jobs, or clusters—and apply least-privilege principles to minimize risk. DBFS Limitation: DBFS is automatically created upon Databricks Workspace creation. DBFS can be found in your Managed Resource Group. Databricks cannot secure DBFS (see reference image below). If there is a business need to avoid DBFS then you can disable DBFS access following instructions here: Disable access to DBFS root and mounts in your existing Azure Databricks workspace. Use Unity Catalog to manage data access across catalogs, schemas, and tables, and consider implementing cluster policies to standardize compute configurations across teams. To help your teams get started, Microsoft provides a range of tutorials and best practice guides: Best practice articles - Azure Databricks | Microsoft Learn. Step 12: Planning Data Migration: As you prepare to move data into your Azure Databricks environment, it's important to assess your migration strategy early. This includes identifying source systems, estimating data volumes, and determining the appropriate ingestion methods—whether batch, streaming, or hybrid. For organizations with complex migration needs or legacy systems, Microsoft offers specialized support through its internal Azure Cloud Accelerated Factory program. Reach out to your Microsoft account team to explore nomination for Azure Cloud Accelerated Factory, which provides hands-on guidance, tooling, and best practices to accelerate and streamline your data migration journey. Summary Regular maintenance and governance are as important as the initial design. Continuously review the environment and update configurations as needed to address evolving requirements and threats. For example, tag all resources (workspaces, VNets, clusters, etc.) with clear identifiers (workspace name, environment, department) to track costs and ownership effectively. Additionally, enforce least privilege across the platform: ensure that only necessary users are given admin privileges, and use cluster-level access control to restrict who can create or start clusters. By following the above steps, an organization will have an Azure Databricks architecture that is securely isolated, well-governed, and scalable. References: [1] 5 Best Practices for Databricks Workspaces AzureDatabricksBestPractices/toc.md at master · Azure ... - GitHub Deploy a workspace using the Azure Portal Additional Links: Quick Introduction to Databricks: what is databricks | introduction - databricks for dummies Connect Purview with Azure Databricks: Integrating Microsoft Purview with Azure Databricks Secure Databricks Delta Share between Workspaces: Secure Databricks Delta Share for Serverless Compute Azure-Databricks Cost Optimization Guide: Databricks Cost Optimization: A Practical Guide Integrate Azure Databricks with Microsoft Fabric: Integrating Azure Databricks with Microsoft Fabric Databricks Solution Accelerators for Data & AI Azure updates Appendix 3.5 Understanding Data Transfer (Express Route vs. Public Internet) For data transfers, your organization must decide to use ExpressRoute or Internet Egress. There are several considerations that can help you determine your choice: 3.5.1. Connectivity Model • ExpressRoute: Provides a private, dedicated connection between your on-premises infrastructure and Microsoft Azure. It bypasses the public internet entirely and connects through a network service provider. • Internet Egress: Refers to outbound data traffic from Azure to the public internet. This is the default path for most Azure services unless configured otherwise. 3.6 Planning for User-Defined Routes (UDRs) When working with Databricks deployments—especially in VNet-injected workspaces—setting up User Defined Routes (UDRs) is a smart move. It’s a best practice that helps manage and secure network traffic more effectively. By using UDRs, teams can steer traffic between Databricks components and external services in a controlled way, which not only boosts security but also supports compliance efforts. 3.6.1 UDRs and Hub and Spoke Topology If your Databricks workspace is deployed into your own virtual network (VNet), you’ll need to configure standard user-defined routes (UDRs) to manage traffic flow. In a typical hub-and-spoke architecture, UDRs are used to route all traffic from the spoke VNets to the hub VNet. 3.6.2 Hub and Spoke with VWANHUB If your Databricks workspace is deployed into your own virtual network (VNet) and is peered to a Virtual WAN (VWAN) hub as the primary connectivity hub into Azure, a user-defined route (UDR) is not required—provided that a private traffic routing policy or internet traffic routing policy is configured in the VWAN hub. 3.6.3 Use of NVAs and Service Tags For Databricks traffic, you’ll need to assign a UDR to the Databricks-managed VNet with a next hop type of Network Virtual Appliance (NVA)—this could be an Azure Firewall, NAT Gateway, or another routing device. For control plane traffic, Databricks recommends using Azure service tags, which are logical groupings of IP addresses for Azure services and should be routed with the next hop type of internet. This is important because Azure IP ranges can change frequently as new resources are provisioned, and manually maintaining IP lists is not practical. Using service tags ensures that your routing rules automatically stay up to date. 3.6.4 Default Outbound Access Retirement (Non-Serverless Compute) Microsoft is retiring default outbound internet access for new deployments starting September 30,2025. Going forward, outbound connectivity will require an explicit configuration using an NVA, NAT Gateway, Load Balancer, or Public IP address. Also, note that using a Public IP Address in the deployment is discouraged for Security purposes, and it is recommended to deploy the workspace in a ‘Secure Cluster Connectivity ration.” Configure connectivity will require an explicit configuration using an NVA, NAT Gateway, Load Balancer, or Public IP address. Also, note that using a Public IP Address in the deployment is discouraged for Security purposes, and it is recommended to deploy the workspace in a ‘Secure Cluster Connectivity ration.”1.3KViews1like0CommentsSecure, Seamless Access using Managed Identities with Azure Files SMB
As organizations evolve their application and storage environments, whether on‑premises, hybrid, or cloud, secure access is top of mind. Organizations are vigilant about protecting sensitive data while enabling agile application access across distributed environments. SMB shares are commonly used for persistent storage in applications like AKS for container workloads, web applications, and App Services. Traditional models that rely on credentials like storage account keys do not meet the demands of a Zero Trust architecture, where every access request must be verified explicitly, granted with least privilege, and designed to assume malicious access from bad actors. We are excited to announce the Public Preview of Managed Identities support with Azure Files SMB. This capability provides a secure, identity-driven approach for customer applications that eliminates credentials-based access and integrates seamlessly with MS Entra ID. Azure virtual machines, containers, and applications running in Azure can now authenticate to Azure Files using their own managed identity, and mount shares using short lived OAuth tokens over Kerberos. This unlocks secure file share access for both first party and customer applications, including Azure Kubernetes Service (AKS), Azure Functions, App Services, and other cloud native services By leveraging Managed Identities, customers gain: Zero Trust Alignment–Identity tied to a specific resource, token refreshes every hour, and no passwords or keys to manage or rotate with Azure handling end-to-end identity management Role Based Access Control – Built-in RBAC for least-privilege enforcement Compliance Mandate Resolution – Compliant with FIPS, removing need for NTLMv2 Multi-Client Support – Works with Windows and Linux clients over SMB This capability brings a secure, simple, and scalable access model that helps organizations meet industry standard security requirements while inheriting Microsoft Entra ID’s enterprise grade identity, governance, and security capabilities for file shares. Securing Real World Applications To illustrate how Managed Identities strengthen security, the following example workloads highlight where customers will benefit from this capability. Eliminate Secret Sprawl for Continuous Integration, Continuous Deployment (CI/CD) workloads Azure Files SMB provides a centralized location for storing software development artifacts generated during CI/CD pipelines. CI/CD workloads span far beyond application code, covering infrastructure updates, data engineering workflows, ML pipelines, and compliance automation, making them foundational to modern DevOps practices. Build agents in Azure DevOps or other CI/CD systems often run on both Linux and Windows, requiring a common storage backend for binaries and configuration files. Historically, these agents authenticated to Azure Files using storage account keys. With Managed Identities, build agents can now authenticate using their own identity from Microsoft Entra ID, with authorization governed through Azure RBAC. This enhances security, removes static credentials, and simplifies compliance. “Managed Identities support with SMB shares will enable us to remove dependencies on storage account keys to run our CI/CD pipelines, enabling stronger security and alignment with Zero-Trust principles." Alex Garcia, Staff Dev Ops Engineer, Unity Technologies. Secure Persistent Files Storage with Azure Kubernetes Service (AKS) Stateful AKS workloads rely on persistent volumes for configuration, logs, and application data. Previously, mounting Azure Files required storing account keys or secrets in Kubernetes. Organizations requested exceptions from their security organizations to continue using shared keys until a secure managed identities-based solution was available. With this feature, AKS clusters can authenticate directly to Azure Files SMB without storage account keys. This enables secure, token‑based access for persistent volume mounts, improving security posture and eliminating the need for exceptions to use access tied to storage account keys. Learn more in the Azure Files AKS CSI documentation. Get Started with Managed Identities with SMB Azure Files Start using Managed Identities with Azure Files today at no additional cost. This feature is supported on HDD and SSD SMB shares across all billing models. Refer to our documentation for complete set-up guidance. Whether provisioning new storage or enhancing existing deployments, this capability provides secure, enterprise‑grade access with a streamlined configuration experience. Secure your workloads today! For any questions, reach out to the team at azurefiles@microsoft.com183Views0likes0CommentsAzure passowrd protection
We have a hybrid Azure infrastructure with an AD Connector installed on-prem and configured for PTA. We installed the password protection server and registered it with the Azure tenant, then deployed the DC agent on all domain controllers. Both the proxy and agents are operational. We published a few banned words to block in case anyone uses them. For testing, I changed my password to include one of the banned words. To my surprise, I was able to change the password. I checked the corresponding logon server, and the DC event viewer showed that the password was validated, but the banned word was in the password list that Azure set to enforce. Why is it not blocking the change?Solved34Views0likes1CommentI passed the GH‑900: GitHub Foundations exam!
Hi everyone, I’m excited to share that I cleared the GH‑900 (GitHub Foundations) exam with a good score! This certification validates my understanding of Git, repository collaboration, pull requests, and GitHub’s core features. Preparation Approach: I studied using Microsoft Learn resources and the GH‑900 study guide. For extra practice and exam-style questions, I used dumps-4-azure — it really gave me the extra edge for exam readiness. I also practiced hands-on with real GitHub workflows (branches, pull requests, projects) to reinforce my understanding. Key Takeaways: The exam tests foundational Git + GitHub collaboration skills — not just theory. Practical experience combined with mock questions made a big difference. Consistency in daily preparation is the key. Next Steps: After GH‑900, I’m planning to go for GH‑100 (GitHub Administration) to deepen my GitHub skills at the organizational level.83Views1like1CommentOverload to Optimal: Tuning Microsoft Fabric Capacity
Co-Authored by: Daya Ram, Sr. Cloud Solutions Architect and Rafia Aqil, Could Solutions Architect Optimizing Microsoft Fabric capacity is both a performance and cost exercise. By diagnosing workloads, tuning cluster and Spark settings, and applying data best practices, teams can reduce run times, avoid throttling, and lower total cost of ownership—without compromising SLAs. Use Fabric’s built-in observability (Monitoring Hub, Capacity Metrics, Spark UI) to identify hot spots and then apply cluster- and data-level remediations. Capacity Planning For capacity planning and sizing guidance, see Plan your capacity size. Selecting the wrong SKU can lead to two major issues: Over-provisioning: Paying for resources you don’t need. Under-provisioning: Struggling with performance bottlenecks and failed jobs. To simplify this process, Microsoft provides the Fabric SKU Estimator, a powerful tool designed to help organizations accurately size their capacity based on real-world usage patterns. Run the SKU Estimator before onboarding new workloads or scaling existing ones. Combine its recommendations with monitoring tools like Fabric Capacity Metrics to validate performance and adjust as needed. Options to Diagnose Capacity Issues 1) Monitoring Hub — Start with the Story of the Run What to use it for: Browse Spark activity across applications (notebooks, Spark Job Definitions, and pipelines). Quickly surface long‑running or anomalous runs; view read/write bytes, idle time, core allocation, and utilization. How to use it From the Fabric portal, open Monitoring (Monitor Hub). Select a Notebook or Spark Job Definition to run and choose Historical Runs. Inspect the Run Duration chart; click on a run to see read/write bytes, idle time, core allocation, overall utilization, and other Spark metrics. What to look for Use the guide: application detail monitoring to review and monitor your application. 2) Capacity Metrics App — Measure the Whole Environment What to use it for: Review capacity-wide utilization and system events (overloads, queueing); compare utilization across time windows and identify sustained peaks. How to use it Open the Microsoft Fabric Capacity Metrics app for your capacity. Review the Compute page (ribbon charts, utilization trends) and the System events tab to see overload or throttling windows. Use the Timepoint page to drill into a 30‑second interval and see which operations consumed the most compute. What to look for Use the Troubleshooting guide: Monitor and identify capacity usage to pinpoint top CU‑consuming items. 3) Spark UI — Diagnose at Deeper Level Why it matters: Spark UI exposes skew, shuffle, memory pressure, and long stages. Use it after Monitoring Hub/Capacity Metrics to pinpoint the problematic job. Key tabs to inspect Stages: uneven task durations (data skew), heavy shuffle read/write, large input/output volumes. Executors: storage memory, task time (GC), shuffle metrics. High GC or frequent spills indicate memory tuning is needed. Storage: which RDDs/cached tables occupy memory; any disk spill. Jobs: long‑running jobs and gaps in the timeline (driver compilation, non‑Spark code, driver overload). What to look for Set via environment Spark properties or session config. Data skew, Memory usage, High/Low Shuffles: Adjust Apache Spark settings: i.e. spark.ms.autotune.enabled, spark.task.cpus and spark.sql.shuffle.partitions. Remediation and Optimization Suggestions A) Cluster & Workspace Settings Runtime & Native Execution Engine (NEE) Use Fabric Runtime 1.3 (Spark 3.5, Delta 3.2) and enable the Native Execution Engine to boost performance; enable at the environment level under Spark compute → Acceleration. Starter Pools vs. Custom Pools Starter Pool: prehydrated, medium‑size pools; fast session starts, good for dev/quick runs. Custom Pools: size nodes, enable autoscale, dynamic executors. Create via workspace Spark Settings (requires capacity admin to enable workspace customization). High Concurrency Session Sharing Enable High Concurrency to share Spark Sessions across notebooks (and pipelines) to reduce session startup latency and cost; use session tags in pipelines to group notebooks. Autotune for Spark Enable Autotune (spark.ms.autotune.enabled = true) to auto‑adjust per‑query: spark.sql.shuffle.partitions Spark.sql.autoBroadcastJoinThreshold spark.sql.files.maxPartitionBytes. Autotune is disabled by default and is in preview; enable per environment or session. B) Data‑level best practices Microsoft Fabric offers several approaches to maintain optimal file sizes in Delta tables, review documentation here: Table Compaction - Microsoft Fabric. Intelligent Cache Enabled by default (Runtime 1.1/1.2) for Spark pools: caches frequently read files at node level for Delta/Parquet/CSV; improves subsequent read performance and TCO. OPTIMIZE & Z‑Order Run OPTIMIZE regularly to rewrite files and improve file layout. V‑Order V‑Order (disabled by default in new workspaces) can accelerate reads for read‑heavy workloads; enable via spark.sql.parquet.vorder.default = true. Vacuum Run VACUUM to remove unreferenced files (stale data); default retention is 7 days; align retention across OneLake to control storage costs and maintain time travel. Collaboration & Next Steps Engage Data Engineering Team to Define an Optimization Playbook Start with reviewing capacity sizing guidance, cluster‑level optimizations (runtime/NEE, pools, concurrency, Autotune) and then target data improvements (Z‑order, compaction, caching, query refactors). Triage: Monitor Hub → Capacity Metrics → Spark UI to map workloads and identify high‑impact jobs, and workloads causing throttling. Schedule: Operationalize maintenance: OPTIMIZE (full or selective) during off‑peak windows; enable Auto Compaction for micro‑batch/streaming writes; add VACUUM to your cadence with agreed retention. Add regular code review sessions to ensure consistent performance patterns. Fix: Adjust pool sizing or concurrency; enable Autotune; tune shuffle partitions; refactor problematic queries; re‑run compaction. Verify: Re‑run the job and change, i.e. reduced run time, lower shuffle, improved utilization.543Views1like0CommentsAzure Databricks Cost Optimization: A Practical Guide
Co-Authored by: Sanjeev Nair Sanjeev Nair and Rafia Aqil Rafia_Aqil This guide walks through a proven approach to Databricks cost optimization, structured in three phases: Discovery, Cluster/Data/Code Best Practices, and Team Alignment & Next Steps. Phase 1: Discovery Assessing Your Current State The following questions are designed to guide your initial assessment and help you identify areas for improvement. Documenting answers to each will provide a baseline for optimization and inform the next phases of your cost management strategy. Environment & Organization Cluster Management Cost Optimization Data Management Performance Monitoring Future Planning What is the current scale of your Databricks environment? How many workspaces do you have? How are your workspaces organized (e.g., by environment type, region, use case)? How many clusters are deployed? How many users are active? What are the primary use cases for Databricks in your organization? Data engineering Data science Machine learning Business intelligence How are clusters currently managed? Manual configuration Automated scripts Databricks REST API Cluster policies What is the average cluster uptime? Hours per day Days per week What is the average cluster utilization rate? CPU usage Memory usage What is the current monthly spend on Databricks? Total cost Breakdown by workspace Breakdown by cluster What cost management tools are currently in use? Azure Cost Management Third-party tools Are there any existing cost optimization strategies in place? Reserved instances Spot instances Cluster auto-scaling What is the current data storage strategy? Data lake Data warehouse Hybrid What is the average data ingestion rate? GB per day Number of files What is the average data processing time? ETL jobs Machine learning models What types of data formats are used in your environment? Delta Lake Parquet JSON CSV Other formats relevant to your workloads What performance monitoring tools are currently in use? Databricks Ganglia Azure Monitor Third-party tools What are the key performance metrics tracked? Job execution time Cluster performance Data processing speed Are there any planned expansions or changes to the Databricks environment? New use cases Increased data volume Additional users What are the long-term goals for Databricks cost optimization? Reducing overall spend Improving resource utilization & cost attribution Enhancing performance Understanding Databricks Cost Structure Total Cost = Cloud Cost + DBU Cost Cloud Cost: Compute (VMs, networking, IP addresses), storage (ADLS, MLflow artifacts), other services (firewalls), cluster type (serverless compute, classic compute) DBU Cost: Workload size, cluster/warehouse size, photon acceleration, compute runtime, workspace tier, SKU type (Jobs, Delta Live Tables, All Purpose Clusters, Serverless), model serving, queries per second, model execution time Diagnose Cost and Issues Effectively diagnosing cost and performance issues in Databricks requires a structured approach. Use the following steps and metrics to gain visibility into your environment and uncover actionable insights. 1. Identify Costly Workloads Account Console Usage Reports: Review usage reports to identify usage breakdowns by product, SKU name, and custom tags. Usage Breakdown by Product and SKU: Helps you understand which services and compute types (clusters, SQL warehouses, serverless options) are consuming the most resources. Custom Tags for Attribution: Tags allow you to attribute costs to teams, projects, or departments, making it easier to identify high-cost areas. Workflow and Job Analysis: By correlating usage data with workflows and jobs, you can pinpoint long-running or resource-heavy workloads that drive costs. Focus on Long-Running Workloads: Examine workloads with extended runtimes or high resource utilization. Key Question: Which pipelines or workloads are driving the majority of your costs? Now That You’ve Identified Long-Running Workloads, Review These Key Areas: 2. Review Cluster Metrics CPU Utilization: Track guest, iowait, idle, irq, nice, softirq, steal, system, and user times to understand how compute resources are being used. Memory Utilization: Monitor used, free, buffer, and cached memory to identify over- or under-utilization. Key Question: Is your cluster over- or under-utilized? Are resources being wasted or stretched too thin? 3. Review SQL Warehouse Metrics Live Statistics: Monitor warehouse status, running/queued queries, and current cluster count. Time Scale Filter: Analyze query and cluster activity over different time frames (8 hours, 24 hours, 7 days, 14 days). Peak Query Count Chart: Identify periods of high concurrency. Completed Query Count Chart: Track throughput and query success/failure rates. Running Clusters Chart: Observe cluster allocation and recycling events. Query History Table: Filter and analyze queries by user, duration, status, and statement type. Key Question: Is your SQL Warehouse over- or under-utilized? Are resources being wasted or stretched too thin? 4. Review Spark UI Stages Tab: Look for skewed data, high input/output, and shuffle times. Uneven task durations may indicate data skew or inefficient data handling. Jobs Timeline: Identify long-running jobs or stages that consume excessive resources. Stage Analysis: Determine if stages are I/O bound or suffering from data skew/spill. Executor Metrics: Monitor memory usage, CPU utilization, and disk I/O. Frequent garbage collection or high memory usage may signal the need for better resource allocation. 4.1. Spark UI: Storage & Jobs Tab Storage Level: Check if data is stored in memory, on disk, or both. Size: Assess the size of cached data. Job Analysis: Investigate jobs that dominate the timeline or have unusually long durations. Look for gaps caused by complex execution plans, non-Spark code, driver overload, or cluster malfunction. 4.2. Spark UI: Executor Tab Storage Memory: Compare used vs. available memory. Task Time (Garbage Collection): Review long tasks and garbage collection times. Shuffle Read/Write: Measure data transferred between stages. 5. Additional Diagnostic Methods System Tables in Unity Catalog: Query system tables for cost attribution and resource usage trends. Cost Observability Queries Tagging Analysis: Use tags to identify which teams or projects consume the most resources. Dashboards & Alerts: Set up cost dashboards and budget alerts for proactive monitoring. Phase 2: Cluster/Code/Data Best Practices Alignment Cluster UI Configuration and Cost Attribution Effectively configuring clusters/workloads in Databricks is essential for balancing performance, scalability, and cost. Tunning settings and features when used strategically can help organizations maximize resource efficiency and minimize unnecessary spending. Key Configuration Strategies 1. Reduce Idle Time: Clusters to incur costs even when not actively processing workloads. To avoid paying for unused resources: Enable Auto-Terminate: Set clusters automatically shut down after a period of inactivity. This simple setting can significantly reduce wasted spending. Enable Autoscaling: Workloads fluctuate in size and complexity. Autoscaling allows clusters to dynamically adjust the number of nodes based on demand: Automatic Resource Adjustment: Scale up for heavy jobs and scale down for lighter loads, ensuring you only pay for what you use. It significantly enhances cost efficiency and overall performance. For serverless and streaming, using Delta Live Tables with autoscaling is recommended. This approach leads to better resource management and reliability. Use Spot Instances: For batch processing and non-critical workloads, spot instances offer substantial cost savings: Lower VM Costs: Spot instances are typically much cheaper than standard VMs. However, they are not recommended for jobs requiring constant uptime due to potential interruptions. Considerations: Azure Spot VMs are intended for non-critical, fault-tolerant tasks. They can be evicted without notice, riskingproduction stability. No SLA guarantees mean potentialdowntime for critical applications. Using Spot VMs could lead to reliability issues in production environments. Leverage Photon Engine: Photon is Databricks’ high-performance, vectorized query engine: Accelerate Large Workloads: Photon can dramatically reduce runtime for compute-intensive tasks, improving both speed and cost efficiency. Keep Runtimes Up to Date: Using the latest Databricks runtime ensures optimal performance and security: Benefit from Improvements: Regular updates include performance enhancements, bug fixes, and new features. Apply Cluster Policies: Cluster policies help standardize configurations and enforce cost controls across teams: Governance and Consistency: Policies can restrict certain settings, enforce tagging, and ensure clusters are created with cost-effective defaults. Optimize Storage: type impacts both performance and cost: Switch from HDDs to SSDs: SSDs provide faster caching and shuffle operations, which can improve job efficiency and reduce runtime. Tag Clusters for Cost Attribution: Tagging clusters enables granular tracking and reporting: Visibility and Accountability: Use tags to attribute costs to specific teams, projects, or environments, supporting better budgeting and chargeback processes. Select the Right Cluster Type: Different workloads require different cluster types, see table below for Serverless vs Classic Compute: Feature Classic Compute Serverless Compute Control Full control over config & network Minimal control, fully managed by Databricks Startup Time Slower (unless pre-warmed) Instant Cost Model Hourly, supports reservations Pay-per-use, elastic scaling Security VNet injection, private endpoints NCC-based private connectivity Best For Heavy ETL, ML, compliance workloads Interactive queries, unpredictable demand Job Clusters: Ideal for scheduled jobs and Delta Live Tables. All-Purpose Clusters: Suited for ad-hoc analysis and collaborative work. Single-Node Clusters: Efficient for simple exploratory data analysis or pure Python tasks. Serverless Compute: Scalable, managed workloads with automatic resource management. 11. Monitor and Adjust Regularly: review cluster metrics and query history: Continuous Optimization: Use built-in dashboards to monitor usage, identify bottlenecks, and adjust cluster size or configuration as needed. Code Best Practices Avoid Reprocessing Large Tables Use a CDC (Change Data Capture) architecture with Delta Live Tables (DLT) to process only new or changed data, minimizing unnecessary computation. Ensure Code Parallelizes Well Write Spark code that leverages parallel processing. Avoid loops, deeply nested structures, and inefficient user-defined functions (UDFs) that can hinder scalability. Reduce Memory Consumption Tweak Spark configurations to minimize memory overhead. Clean out legacy or unnecessary settings that may have carried over from previous Spark versions. Prefer SQL Over Complex Python Use SQL (declarative language) for Spark jobs whenever possible. SQL queries are typically more efficient and easier to optimize than complex Python logic. Modularize Notebooks Use %run to split large notebooks into smaller, reusable modules. This improves maintainability. Use LIMIT in Exploratory Queries When exploring data, always use the LIMIT clause to avoid scanning large datasets unnecessarily. Monitor Job Performance Regularly review Spark UI to detect inefficiencies such as high shuffle, input, or output. Review the below table for optimization opportunities: Spark stage high I/O - Azure Databricks | Microsoft Learn Databricks Code Performance Enhancements & Data Engineering Best Practices By enabling the below features and applying best practices, you can significantly lower costs, accelerate job execution, and build Databricks pipelines that are both scalable and highly reliable. For more guidance review: Comprehensive Guide to Optimize Data Workloads | Databricks. Feature / Technique Purpose / Benefit How to Use / Enable / Key Notes Disk Caching Accelerates repeated reads of Parquet files Set spark.databricks.io.cache.enabled = true Dynamic File Pruning (DFP) Skips irrelevant data files during queries, improves query performance Enabled by default in Databricks Low Shuffle Merge Reduces data rewriting during MERGE operations, less need to recalculate ZORDER Use Databricks runtime with feature enabled Adaptive Query Execution (AQE) Dynamically optimizes query plans based on runtime statistics Available in Spark 3.0+, enabled by default Deletion Vectors Efficient row removal/change without rewriting entire Parquet file Enable in workspace settings, use with Delta Lake Materialized Views Faster BI queries, reduced compute for frequently accessed data Create in Databricks SQL Optimize Compacts Delta Lake files, improves query performance Run regularly, combine with ZORDER on high-cardinality columns ZORDER Physically sorts/co-locates data by chosen columns for faster queries Use with OPTIMIZE, select columns frequently used in filters/joins Auto Optimize Automatically compacts small files during writes Enable optimizeWrite and autoCompact table properties Liquid Clustering Simplifies data layout, replaces partitioning/ZORDER, flexible clustering keys Recommended for new Delta tables, enables easy redefinition of clustering keys File Size Tuning Achieve optimal file size for performance and cost Set delta.targetFileSize table property Broadcast Hash Join Optimizes joins by broadcasting smaller tables Adjust spark.sql.autoBroadcastJoinThreshold and spark.databricks.adaptive.autoBroadcastJoinThreshold Shuffle Hash Join Faster join alternative to sort-merge join Prefer over sort-merge join when broadcasting isn’t possible, Photon engine can help Cost-Based Optimizer (CBO) Improves query plans for complex joins Enabled by default, collect column/table statistics with ANALYZE TABLE Data Spilling & Skew Handles uneven data distribution and excessive shuffle Use AQE, set spark.sql.shuffle.partitions=auto, optimize partitioning Data Explosion Management Controls partition sizes after transformations (e.g., explode, join) Adjust spark.sql.files.maxPartitionBytes, use repartition() after reads Delta Merge Efficient upserts and CDC (Change Data Capture) Use MERGE operation in Delta Lake, combine with CDC architecture Data Purging (Vacuum) Removes stale data files, maintains storage efficiency Run VACUUM regularly based on transaction frequency Phase 3: Team Alignment and Next Steps Implementing Cost Observability and Taking Action Effective cost management in Databricks goes beyond configuration and code—it requires robust observability, granular tracking, and proactive measures. Below outlines how your teams can achieve this using system tables, tagging, dashboards, and actionable scripts. Cost Observability with System Tables Databricks Unity Catalog provides system tables that store operational data for your account. These tables enable historical cost observability and empower FinOps teams to analyze spend independently. System Tables Location: Found inside the Unity Catalog under the “system” schema. Key Benefits: Structured data for querying, historical analysis, and cost attribution. Action: Assign permissions to FinOps teams so they can access and analyze dedicated cost tables. Enable Tags for Granular Tracking Tagging is a powerful feature for tracking, reporting, and budgeting at a granular level. Classic Compute: Manually add key/value pairs when creating clusters, jobs, SQL Warehouses, or Model Serving endpoints. Use cluster policies to enforce custom tags. Serverless Compute: Create budget policies and assign permissions to teams or members for serverless workloads. Action: Tag all compute resources to enable detailed cost attribution and reporting. Track Costs with Dashboards and Alerts Databricks offers prebuilt dashboards and queries for cost forecasting and usage analysis. Dashboards: Visualize spend, usage trends, and forecast future costs. Prebuilt Queries: Use top queries with system tables to answer meaningful cost questions. Budget Alerts: Set up alerts in the Account Console (Usage > Budget) to receive notifications when spend approaches defined thresholds. Build Culture of Efficiency To go beyond technical fixes and build a culture of efficiency, by focusing on the below strategic actions: Collaborate with Internal Engineers: Spend time with engineering teams to understand workload patterns and optimization opportunities. Peer Reviews and Code Audits: Conduct regular code review sessions and peer reviews to ensure best practices are followed for Spark jobs, data pipelines, and cluster configurations. Create Internal Best Practice Documentation: Develop clear guidelines for writing optimized code, managing data, and maintaining clusters. Make these resources easily accessible for all teams. Implement Observability Dashboards: Use Databricks’ built-in features to create dashboards that track spend, monitor resource utilization, and highlight anomalies. Set Alerts and Budgets: Configure alerts for long-running workloads and establish budgets using prebuilt Databricks capabilities to prevent cost overruns. 5. Azure Reservations and Azure Savings Plan When optimizing Databricks costs on Azure, it’s important to understand the two main commitment-based savings options: Azure Reservations and Azure Savings Plans. Both can help you reduce compute costs, but they differ in flexibility and how savings are applied. Which Should You Choose? Reservations are ideal if you have stable, predictable Databricks workloads and want maximum savings. Savings Plans are better if you expect your compute needs to change, or if you want a simpler, more flexible way to save across multiple services. Pro Tip: You can combine both options—use Reservations for your baseline, always-on Databricks clusters, and Savings Plans for bursty, variable, or new workloads. Summary Table: Action Steps It’s critical to monitor costs continuously and align your teams with established best practices, while scheduling regular code review sessions to ensure efficiency and consistency. Area Best Practice / Action System Tables Use for historical cost analysis and attribution Tagging Apply to all compute resources for granular tracking Dashboards Visualize spend, usage, and forecasts Alerts Set budget alerts for proactive cost management Scripts/Queries Build custom analysis tools for deep insights Cluster/Data/Code Review & Align Regularly review best practices, share findings, and align teams on optimization Save on your Usage Consider Azure Reservations and Azure Savings Plan1.4KViews3likes0Comments