Step-by-Step Guide: Integrating Microsoft Purview with Azure Databricks and Microsoft Fabric
Co-Authored By: aryananmol, laurenkirkwood and mmanley

This article provides practical guidance on setup, cost considerations, and integration steps for Azure Databricks and Microsoft Fabric to help organizations plan for building a strong data governance framework. It outlines how Microsoft Purview can unify governance efforts across cloud platforms, enabling consistent policy enforcement, metadata management, and lineage tracking. The content is tailored for architects and data leaders seeking to execute governance in scalable, hybrid environments. Note: this article focuses mainly on the Data Governance features of Microsoft Purview.

Why Microsoft Purview

Microsoft Purview enables organizations to discover, catalog, and manage data across environments with clarity and control. Automated scanning and classification build a unified view of your data estate enriched with metadata, lineage, and sensitivity labels, and the Unified Catalog provides business-friendly search and governance constructs such as domains, data products, glossary terms, and data quality. Note: Microsoft Purview Unified Catalog is being rolled out globally, with availability across multiple Microsoft Entra tenant regions; this page lists supported regions, availability dates, and deployment plans for the Unified Catalog service: Unified Catalog Supported Regions.

Understanding Data Governance Features Cost in Purview

Under the classic model, Data Map (Classic), users pay for an “always-on” Data Map capacity and scanning compute. In the new model, those infrastructure costs are subsumed into the consumption meters, meaning there are no direct charges for metadata storage or scanning jobs when using the Unified Catalog (Enterprise tier). Essentially, Microsoft stopped billing separately for the underlying data map and scan vCore-hours once you opt into the new model or start fresh with it. You only incur charges when you govern assets or run data processing tasks. This makes costs more predictable and tied to governance value: you can scan as much as needed to populate the catalog without worrying about scan fees, and then pay only for the assets you actively manage (“govern”) and any data quality processes you execute. In summary, Purview Enterprise’s pricing is usage-based and divided into two primary areas: (1) Governed Assets and (2) Data Processing (DGPUs).

Plan for Governance

Microsoft Purview’s data governance framework is built on two core components: Data Map and Unified Catalog. The Data Map acts as the technical foundation, storing metadata about assets discovered through scans across your data estate. It inventories sources and organizes them into collections and domains for technical administration. The Unified Catalog sits on top as the business-facing layer, leveraging the Data Map’s metadata to create a curated marketplace of data products, glossary terms, and governance domains for data consumers and stewards. Before onboarding sources, align the Unified Catalog (business-facing) and Data Map (technical inventory) and define roles, domains, and collections so ownership and access boundaries are clear. This documentation covers roles and permissions in Purview: Permissions in the Microsoft Purview portal | Microsoft Learn. The image above helps illustrate the relationship between the primary data governance solutions, Unified Catalog and Data Map, and the permissions granted by the roles for each solution.
Considerations and Steps for Setting up Purview

Steps for Setting up Purview:

Step 1: Create a Purview account. In the Azure Portal, use the search bar at the top to navigate to Microsoft Purview Accounts. Once there, click “Create”. This will take you to the following screen:

Step 2: Click Next: Configuration and follow the wizard, completing the necessary fields, including information on Networking, Configurations, and Tags. Then click Review + Create to create your Purview account.

Consideration: Private networking: Use Private Endpoints to secure Unified Catalog/Data Map access and scan traffic; follow the new platform private endpoints guidance in the Microsoft Purview portal or migrate classic endpoints.

Once your Purview account is created, you’ll want to set up and manage your organization’s governance strategy to ensure that your data is classified and managed according to the specific lifecycle guidelines you set. Note: Follow the steps in this guide to set up Microsoft Purview Data Lifecycle Management: Data retention policy, labeling, and records management.

Data Map Best Practices

Design your collections hierarchy to align with organizational strategy—such as by geography, business function, or data domain. Register each data source only once per Purview account to avoid conflicting access controls. If multiple teams consume the same source, register it at a parent collection and create scans under subcollections for visibility. The image above illustrates a recommended approach for structuring your Purview Data Map.

Why Collection Structure Matters

A well-structured Data Map strategy, including a clearly defined hierarchy of collections and domains, is critical because the Data Map serves as the metadata backbone for Microsoft Purview. It underpins the Unified Catalog, enabling consistent governance, role-based access control, and discoverability across the enterprise. Designing this hierarchy thoughtfully ensures scalability, simplifies permissions management, and provides a solid foundation for implementing enterprise-wide data governance.

Purview Integration with Azure Databricks

Databricks Workspace Structure

In Azure Databricks, each region supports a single Unity Catalog metastore, which is shared across all workspaces within that region. This centralized architecture enables consistent data governance, simplifies access control, and facilitates seamless data sharing across teams. As an administrator, you can scan one workspace in the region using Microsoft Purview to discover and classify data managed by Unity Catalog, since the metastore governs all associated workspaces in a region. If your organization operates across multiple regions and utilizes cross-region data sharing, please review the consideration and workaround outlined below to ensure proper configuration and governance. Follow the prerequisite requirements here before you register your workspace: Prerequisites to Connect and manage Azure Databricks Unity Catalog in Microsoft Purview.

Steps to Register Databricks Workspace

Step 1: In the Microsoft Purview portal, navigate to the Data Map section from the left-hand menu. Select Data Sources. Click on Register to begin the process of adding your Databricks workspace.

Step 2: Note: There are two Databricks data sources; please review the documentation here for the differences in capability: Connect to and manage Azure Databricks Unity Catalog in Microsoft Purview | Microsoft Learn. You can choose either source based on your organization’s needs.
Recommended is “Azure Databricks Unity Catalog”:

Step 3: Register your workspace. Here are the steps to register your data source: Steps to Register an Azure Databricks workspace in Microsoft Purview.

Step 4: Initiate a scan for your workspace; follow the steps here: Steps to scan Azure Databricks to automatically identify assets. Once you have entered the required information, test your connection and click Continue to set up a scheduled scan trigger.

Step 5: For the scan trigger, choose whether to set up a schedule or run the scan once according to your business needs.

Step 6: From the left pane, select Data Map and select the data source for your workspace. You can view a list of existing scans on that data source under Recent scans, or you can view all scans on the Scans tab. Review further options here: Manage and Review your Scans. You can review your scanned data sources, history, and details here: Navigate to scan run history for a given scan.

Limitation: The “Azure Databricks Unity Catalog” data source in Microsoft Purview does not currently support connection via a Managed VNet. As a workaround, the product team recommends using the “Azure Databricks Unity Catalog” source in combination with a Self-hosted Integration Runtime (SHIR) to enable scanning and metadata ingestion. You can find setup guidance here: Create and manage SHIR in Microsoft Purview; Choose the right integration runtime configuration. Scoped scan support for Unity Catalog is expected to enter private preview soon. You can sign up here: https://aka.ms/dbxpreview.

Considerations: If you have Delta-Shared Databricks-to-Databricks workspaces, you may see duplication in your data assets if you scan both workspaces. The workaround for this scenario: as you add tables/data assets to a Data Product for governance in Microsoft Purview, identify the duplicated tables/data assets using their Fully Qualified Name (FQN). To make identification easier, look for the keyword “sharing” in the FQN, which indicates a Delta-Shared table. You can also apply tags to these tables for quicker filtering and selection. The screenshot highlights how the FQN appears in the interface, helping you confidently identify and manage your data assets.

Purview Integration with Microsoft Fabric

Understanding Fabric Integration:

Connect Cross-Tenant: This refers to integrating Microsoft Fabric resources across different Microsoft Entra tenants. It enables organizations to share data, reports, and workloads securely between separate tenants, often used in multi-organization collaborations or partner ecosystems. Key considerations include authentication, data governance, and compliance with cross-tenant policies.

Connect In-Same-Tenant: This involves connecting Fabric resources within the same Microsoft Entra tenant. It simplifies integration by leveraging shared identity and governance models, allowing seamless access to data, reports, and pipelines across different workspaces or departments under the same organizational umbrella.

Requirements: An Azure account with an active subscription (create an account for free) and an active Microsoft Purview account. Authentication is supported via Managed Identity, or Delegated Authentication and Service Principal.

Steps to Register Fabric Tenant

Step 1: In the Microsoft Purview portal, navigate to the Data Map section from the left-hand menu. Select Data Sources. Click on Register to begin the process of adding your Fabric tenant (which also includes Power BI).
Step 2: Add in a data source name and keep Tenant ID as the default (auto-populated). Microsoft Fabric and Microsoft Purview should be in the same tenant.

Step 3: Enter a scan name and enable/disable scanning for personal workspaces. Under Credentials, you will notice an automatically created identity for authenticating the Purview account. Note: If your Purview account is behind a private network, follow the guidelines here: Connect to your Microsoft Fabric tenant in same tenant as Microsoft Purview.

Step 4: From Microsoft Fabric, open Settings, click on Tenant Settings, and enable “Service Principals can access read-only admin APIs”, “Enhance admin API responses with detailed metadata”, and “Enhance admin API responses with DAX and Mashup expressions” within the Admin API Settings section.

Step 5: Create a group, add the Purview account’s managed identity to the group, and add the group under the “Service Principals can access read-only admin APIs” section of your tenant settings inside Microsoft Fabric.

Step 6: Test your connection and set up the scope for your scan. Select the required workspaces, click Continue, and set up an automated scan trigger.

Step 7: From the left pane, select Data Map and select the data source for your tenant. You can view a list of existing scans on that data source under Recent scans, or you can view all scans on the Scans tab. Review further options here: Manage and Review your Scans. You can review your scanned data sources, history, and details here: Navigate to scan run history for a given scan.

Why Customers Love Purview

Kern County unified its approach to securing and governing data with Microsoft Purview, ensuring consistent compliance and streamlined data management across departments. EY accelerated secure AI development by leveraging the Microsoft Purview SDK, enabling robust data governance and privacy controls for advanced analytics and AI initiatives. Prince William County Public Schools created a more cyber-safe classroom environment with Microsoft Purview, protecting sensitive student information while supporting digital learning. FSA (Food Standards Agency) helps keep the UK food supply safe using Microsoft Purview Records Management, ensuring regulatory compliance and safeguarding critical data assets.

Conclusion

Purview’s Unified Catalog centralizes governance across Discovery, Catalog Management, and Health Management. The governance features in Purview allow organizations to confidently answer critical questions: What data do we have? Where did it come from? Who is responsible for it? Is it secure and compliant? Can we trust its quality? Microsoft Purview, when integrated with Azure Databricks and Microsoft Fabric, provides a unified approach to cataloging, classifying, and governing data across diverse environments. By leveraging Purview’s Unified Catalog, Data Map, and advanced governance features, organizations can achieve end-to-end visibility, enforce consistent policies, and improve data quality. You might ask, why does data quality matter? Well, in today’s world, data is the new gold.

References

Microsoft Purview | Microsoft Learn
Pricing - Microsoft Purview | Microsoft Azure
Use Microsoft Purview to Govern Microsoft Fabric
Connect to and manage Azure Databricks Unity Catalog in Microsoft Purview

Azure DevOps for Container Apps: End-to-End CI/CD with Self-Hosted Agents
Join this hands-on session to learn how to build a complete CI/CD pipeline for containerized applications on Azure Container Apps using Azure DevOps. You'll discover how to leverage self-hosted agents running as event-driven Container Apps jobs to deploy a full-stack web application with frontend and backend components. In this practical demonstration, you'll see how to create an automated deployment pipeline that builds, tests, and deploys containerized applications to Azure Container Apps. You'll learn how self-hosted agents in Container Apps jobs provide a serverless, cost-effective solution that scales automatically with your pipeline demands—you only pay for the time your agents are running. Don't miss your spot!

AIOps Engineer Blueprint Survey Opportunity
Greetings! Microsoft is considering a certification for AIOps Engineer, and we need your input through our exam blueprinting survey. The blueprint determines how many questions each skill in the exam will be assigned. Please complete the online survey by December 1, 2025. Please also feel free to forward the survey to any colleagues you consider subject matter experts for this certification. If you have any questions, feel free to contact John Sowles at josowles@microsoft.com or Don Tanedo at dtanedo@microsoft.com. AIOps Engineer blueprint survey link: https://microsoftlearning.co1.qualtrics.com/jfe/form/SV_cM8VmV1IeGxxCMm

Automate Defender for Cloud settings: FIM, Vulnerability Assessment, and Guest Configuration Agent
I’m working on automating the configuration of Microsoft Defender for Cloud – Server Plans across multiple subscriptions (100+), including any newly deployed subscriptions. The goal is to avoid manual changes and ensure compliance from day one.

Current Setup: I’ve used the built-in policy Configure Microsoft Defender for Servers plan, which successfully enables: Defender for Cloud Plan P2, Endpoint Protection, and Agentless scanning. I attempted to copy this policy and add parameters for Vulnerability Assessment, but the assignment fails with an error.

What I’ve Tried: For File Integrity Monitoring: Policy name → Configure ChangeTracking Extension for Windows virtual machines. For Vulnerability Assessment: Policy name → Configure machines to receive a vulnerability assessment provider. Assigning these policies works on my non-prod subscription, but the toggle in Defender for Cloud → Environment Settings remains No.

Challenge: How can I ensure these options (File Integrity Monitoring, Vulnerability Assessment, and preferably the Guest Configuration Agent) are automatically enabled for all existing subscriptions and any new subscriptions created in the future?

Goal: No manual intervention in the Defender for Cloud portal; fully automated via Azure Policy or another recommended approach.

Questions: Is there a way to extend the built-in policy or create a custom initiative that enforces these settings at the subscription level? Are there ARM templates, Bicep modules, PowerShell scripts, or REST API calls that can toggle these settings programmatically? Any best practices for ensuring compliance across multiple subscriptions?

Any help is much appreciated and looking forward to your expertise! Thank you in advance.

Best Regards,
Pascal Slot

Unleashing the Power of Model Context Protocol (MCP): A Game-Changer in AI Integration
Artificial Intelligence is evolving rapidly, and one of the most pressing challenges is enabling AI models to interact effectively with external tools, data sources, and APIs. The Model Context Protocol (MCP) solves this problem by acting as a bridge between AI models and external services, creating a standardized communication framework that enhances tool integration, accessibility, and AI reasoning capabilities.

What is Model Context Protocol (MCP)?

MCP is a protocol designed to enable AI models, such as Azure OpenAI models, to interact seamlessly with external tools and services. Think of MCP as a universal USB-C connector for AI, allowing language models to fetch information, interact with APIs, and execute tasks beyond their built-in knowledge.

Key Features of MCP

Standardized Communication – MCP provides a structured way for AI models to interact with various tools.
Tool Access & Expansion – AI assistants can now utilize external tools for real-time insights.
Secure & Scalable – Enables safe and scalable integration with enterprise applications.
Multi-Modal Integration – Supports STDIO, SSE (Server-Sent Events), and WebSocket communication methods.

MCP Architecture & How It Works

MCP follows a client-server architecture that allows AI models to interact with external tools efficiently. Here’s how it works:

Components of MCP

MCP Host – The AI model (e.g., Azure OpenAI GPT) requesting data or actions.
MCP Client – An intermediary service that forwards the AI model's requests to MCP servers.
MCP Server – Lightweight applications that expose specific capabilities (APIs, databases, files, etc.).
Data Sources – Various backend systems, including local storage, cloud databases, and external APIs.

Data Flow in MCP

The AI model sends a request (e.g., "fetch user profile data"). The MCP client forwards the request to the appropriate MCP server. The MCP server retrieves the required data from a database or API. The response is sent back to the AI model via the MCP client.

Integrating MCP with Azure OpenAI Services

Microsoft has integrated MCP with Azure OpenAI Services, allowing GPT models to interact with external services and fetch live data. This means AI models are no longer limited to static knowledge but can access real-time information.

Benefits of Azure OpenAI Services + MCP Integration

✔ Real-time Data Fetching – AI assistants can retrieve fresh information from APIs, databases, and internal systems.
✔ Contextual AI Responses – Enhances AI responses by providing accurate, up-to-date information.
✔ Enterprise-Ready – Secure and scalable for business applications, including finance, healthcare, and retail.

Hands-On Tools for MCP Implementation

To implement MCP effectively, Microsoft provides two powerful tools: Semantic Workbench and AI Gateway.

Microsoft Semantic Workbench

A development environment for prototyping AI-powered assistants and integrating MCP-based functionalities. Features: Build and test multi-agent AI assistants. Configure settings and interactions between AI models and external tools. Supports GitHub Codespaces for cloud-based development. Explore Semantic Workbench (Workbench interface examples)

Microsoft AI Gateway

A plug-and-play interface that allows developers to experiment with MCP using Azure API Management. Features: Credential Manager – Securely handle API credentials. Live Experimentation – Test AI model interactions with external tools. Pre-built Labs – Hands-on learning for developers.
Explore AI Gateway

Setting Up MCP with Azure OpenAI Services

Step 1: Create a Virtual Environment

First, create a virtual environment using Python:

    python -m venv .venv

Activate the environment:

    # Windows
    .venv\Scripts\activate

    # MacOS/Linux
    source .venv/bin/activate

Step 2: Install Required Libraries

Create a requirements.txt file and add the following dependencies:

    langchain-mcp-adapters
    langgraph
    langchain-openai

Then, install the required libraries:

    pip install -r requirements.txt

Step 3: Set Up OpenAI API Key

Ensure you have your OpenAI API key set up:

    # Windows
    setx OPENAI_API_KEY "<your_api_key>"

    # MacOS/Linux
    export OPENAI_API_KEY=<your_api_key>

Building an MCP Server

This server performs basic mathematical operations like addition and multiplication.

Create the Server File

First, create a new Python file:

    touch math_server.py

Then, implement the server:

    from mcp.server.fastmcp import FastMCP

    # Initialize the server
    mcp = FastMCP("Math")

    @mcp.tool()
    def add(a: int, b: int) -> int:
        return a + b

    @mcp.tool()
    def multiply(a: int, b: int) -> int:
        return a * b

    if __name__ == "__main__":
        mcp.run(transport="stdio")

Your MCP server is now ready to run.

Building an MCP Client

This client connects to the MCP server and interacts with it.

Create the Client File

First, create a new file:

    touch client.py

Then, implement the client:

    import asyncio

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client
    from langchain_mcp_adapters.tools import load_mcp_tools
    from langchain_openai import ChatOpenAI
    from langgraph.prebuilt import create_react_agent

    # Define server parameters
    server_params = StdioServerParameters(
        command="python",
        args=["math_server.py"],
    )

    # Define the model
    model = ChatOpenAI(model="gpt-4o")

    async def run_agent():
        async with stdio_client(server_params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                tools = await load_mcp_tools(session)
                agent = create_react_agent(model, tools)
                agent_response = await agent.ainvoke({"messages": "what's (4 + 6) x 14?"})
                return agent_response["messages"][3].content

    if __name__ == "__main__":
        result = asyncio.run(run_agent())
        print(result)

Your client is now set up and ready to interact with the MCP server.

Running the MCP Server and Client

Step 1: Start the MCP Server

Open a terminal and run:

    python math_server.py

This starts the MCP server, making it available for client connections.

Step 2: Run the MCP Client

In another terminal, run:

    python client.py

Expected Output

    140

This means the AI agent correctly computed (4 + 6) x 14 using both the MCP server and GPT-4o.

Conclusion

Integrating MCP with Azure OpenAI Services enables AI applications to securely interact with external tools, enhancing functionality beyond text-based responses. With standardized communication and improved AI capabilities, developers can build smarter and more interactive AI-powered solutions. By following this guide, you can set up an MCP server and client, unlocking the full potential of AI with structured external interactions.

Next Steps: Explore more MCP tools and integrations. Extend your MCP setup to work with additional APIs. Deploy your solution in a cloud environment for broader accessibility.

For further details, visit the GitHub repository for MCP integration examples and best practices.

MCP GitHub Repository
MCP Documentation
Semantic Workbench
AI Gateway
MCP Video Walkthrough
MCP Blog
MCP Github End to End Demo
Duplicate Detection in Logic App Trigger

Overview

In certain scenarios, a Logic App may process the same item multiple times due to various reasons. Common examples include:

SharePoint Polling: The trigger might fire twice because of subsequent edits. (Reference: Microsoft SharePoint Connector for Power Automate | Microsoft Learn)
Dataverse Webhook: This can occur when a field is updated by plugins.

Solution Approach

To address this issue, I implemented a solution using the Logic Apps REST API: Workflow Runs - List - REST API (Azure Logic Apps) | Microsoft Learn. The idea is simple: each time the Logic App is triggered, it searches all available runs within a specified time window for the same clientTrackingId.

Implementation Details

Step 1: Extract clientTrackingId

For the SharePoint trigger example, I used the file name as the clientTrackingId with the following expression:

    @string(trigger()['outputs']['body']['value'][0]['{FilenameWithExtension}'])

Step 2: Pass Payload to Duplicate Detector Flow

The payload includes: the clientTrackingId (chosen identifier), the resource URI for the SharePoint flow, and the time window to check against.

Duplicate Detector Flow Logic

The flow retrieves the run history and checks if the number of runs is greater than 1. If so, the same file was processed before, and the HTTP response will return the status below (see the sketch at the end of this post for what the underlying REST call can look like):

    if(greater(length(body('HTTP-Get_History')['value']), 1), '400', '200')

Important Notes

The Duplicate Detector Flow must have the Logic Apps Standard Reader (Preview) permission for the resource group. The solution is built on Logic App Standard, but it can be adapted for Logic App Consumption or Power Automate.

GitHub Link: logicappfiles/Logic Apps Duplicate Detector at main · mbarqawi/logicappfiles
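For illustration, here is a minimal sketch of the run-history lookup the detector flow performs, using the Workflow Runs - List REST API referenced above. It assumes a Consumption-style Microsoft.Logic/workflows resource URI, the 2016-06-01 api-version, and $filter support for clientTrackingId; for Logic Apps Standard the run history sits under the Logic App (Standard) site, so the resource URI differs. The subscription, resource group, workflow, and tracking-id values are placeholders.

```python
# Hypothetical sketch: look up Logic App run history filtered by clientTrackingId.
# Resource names and the api-version are placeholders/assumptions; adjust for your environment.
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
workflow_name = "<logic-app-workflow-name>"
client_tracking_id = "Invoice-2024.pdf"  # e.g., the SharePoint file name used as the identifier

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.Logic"
    f"/workflows/{workflow_name}/runs"
)
params = {
    "api-version": "2016-06-01",
    "$filter": f"clientTrackingId eq '{client_tracking_id}'",
}
runs = requests.get(url, headers={"Authorization": f"Bearer {token}"}, params=params).json()

# Same check as the detector flow: more than one run with this clientTrackingId
# means the item was already processed, so signal a duplicate.
status = "400" if len(runs.get("value", [])) > 1 else "200"
print(status)
```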
I passed the GH‑900: GitHub Foundations exam!

Hi everyone, I’m excited to share that I cleared the GH‑900 (GitHub Foundations) exam with a good score! This certification validates my understanding of Git, repository collaboration, pull requests, and GitHub’s core features.

Preparation Approach: I studied using Microsoft Learn resources and the GH‑900 study guide. For extra practice and exam-style questions, I used dumps-4-azure — it really gave me the extra edge for exam readiness. I also practiced hands-on with real GitHub workflows (branches, pull requests, projects) to reinforce my understanding.

Key Takeaways: The exam tests foundational Git + GitHub collaboration skills — not just theory. Practical experience combined with mock questions made a big difference. Consistency in daily preparation is the key.

Next Steps: After GH‑900, I’m planning to go for GH‑100 (GitHub Administration) to deepen my GitHub skills at the organizational level.

Approaches to Integrating Azure Databricks with Microsoft Fabric: The Better Together Story!
Azure Databricks and Microsoft Fabric can be combined to create a unified and scalable analytics ecosystem. This document outlines eight distinct integration approaches, each accompanied by step-by-step implementation guidance and key design considerations. These methods are not prescriptive—your cloud architecture team can choose the integration strategy that best aligns with your organization’s governance model, workload requirements and platform preferences. Whether you prioritize centralized orchestration, direct data access, or seamless reporting, the flexibility of these options allows you to tailor the solution to your specific needs.

Azure Databricks Cost Optimization: A Practical Guide
Co-Authored by Sanjeev Nair

This guide walks through a proven approach to Databricks cost optimization, structured in three phases: Discovery, Cluster/Data/Code Best Practices, and Team Alignment & Next Steps.

Phase 1: Discovery

Assessing Your Current State

The following questions are designed to guide your initial assessment and help you identify areas for improvement. Documenting answers to each will provide a baseline for optimization and inform the next phases of your cost management strategy. They span six areas: Environment & Organization, Cluster Management, Cost Optimization, Data Management, Performance Monitoring, and Future Planning.

Environment & Organization
What is the current scale of your Databricks environment? How many workspaces do you have? How are your workspaces organized (e.g., by environment type, region, use case)? How many clusters are deployed? How many users are active?
What are the primary use cases for Databricks in your organization? Data engineering, data science, machine learning, business intelligence?

Cluster Management
How are clusters currently managed? Manual configuration, automated scripts, Databricks REST API, cluster policies?
What is the average cluster uptime (hours per day, days per week)?
What is the average cluster utilization rate (CPU usage, memory usage)?

Cost Optimization
What is the current monthly spend on Databricks? Total cost, breakdown by workspace, breakdown by cluster?
What cost management tools are currently in use? Azure Cost Management, third-party tools?
Are there any existing cost optimization strategies in place? Reserved instances, spot instances, cluster auto-scaling?

Data Management
What is the current data storage strategy? Data lake, data warehouse, hybrid?
What is the average data ingestion rate (GB per day, number of files)?
What is the average data processing time (ETL jobs, machine learning models)?
What types of data formats are used in your environment? Delta Lake, Parquet, JSON, CSV, other formats relevant to your workloads?

Performance Monitoring
What performance monitoring tools are currently in use? Databricks Ganglia, Azure Monitor, third-party tools?
What are the key performance metrics tracked? Job execution time, cluster performance, data processing speed?

Future Planning
Are there any planned expansions or changes to the Databricks environment? New use cases, increased data volume, additional users?
What are the long-term goals for Databricks cost optimization? Reducing overall spend, improving resource utilization & cost attribution, enhancing performance?

Understanding Databricks Cost Structure

Total Cost = Cloud Cost + DBU Cost

Cloud Cost: Compute (VMs, networking, IP addresses), storage (ADLS, MLflow artifacts), other services (firewalls), cluster type (serverless compute, classic compute).
DBU Cost: Workload size, cluster/warehouse size, Photon acceleration, compute runtime, workspace tier, SKU type (Jobs, Delta Live Tables, All Purpose Clusters, Serverless), model serving, queries per second, model execution time.

Diagnose Cost and Issues

Effectively diagnosing cost and performance issues in Databricks requires a structured approach. Use the following steps and metrics to gain visibility into your environment and uncover actionable insights.

1. Identify Costly Workloads

Account Console Usage Reports: Review usage reports to identify usage breakdowns by product, SKU name, and custom tags.
Usage Breakdown by Product and SKU: Helps you understand which services and compute types (clusters, SQL warehouses, serverless options) are consuming the most resources.
Custom Tags for Attribution: Tags allow you to attribute costs to teams, projects, or departments, making it easier to identify high-cost areas.
Workflow and Job Analysis: By correlating usage data with workflows and jobs, you can pinpoint long-running or resource-heavy workloads that drive costs.
Focus on Long-Running Workloads: Examine workloads with extended runtimes or high resource utilization.
Key Question: Which pipelines or workloads are driving the majority of your costs?

Now That You’ve Identified Long-Running Workloads, Review These Key Areas:

2. Review Cluster Metrics

CPU Utilization: Track guest, iowait, idle, irq, nice, softirq, steal, system, and user times to understand how compute resources are being used.
Memory Utilization: Monitor used, free, buffer, and cached memory to identify over- or under-utilization.
Key Question: Is your cluster over- or under-utilized? Are resources being wasted or stretched too thin?

3. Review SQL Warehouse Metrics

Live Statistics: Monitor warehouse status, running/queued queries, and current cluster count.
Time Scale Filter: Analyze query and cluster activity over different time frames (8 hours, 24 hours, 7 days, 14 days).
Peak Query Count Chart: Identify periods of high concurrency.
Completed Query Count Chart: Track throughput and query success/failure rates.
Running Clusters Chart: Observe cluster allocation and recycling events.
Query History Table: Filter and analyze queries by user, duration, status, and statement type.
Key Question: Is your SQL Warehouse over- or under-utilized? Are resources being wasted or stretched too thin?

4. Review Spark UI

Stages Tab: Look for skewed data, high input/output, and shuffle times. Uneven task durations may indicate data skew or inefficient data handling.
Jobs Timeline: Identify long-running jobs or stages that consume excessive resources.
Stage Analysis: Determine if stages are I/O bound or suffering from data skew/spill.
Executor Metrics: Monitor memory usage, CPU utilization, and disk I/O. Frequent garbage collection or high memory usage may signal the need for better resource allocation.

4.1. Spark UI: Storage & Jobs Tab

Storage Level: Check if data is stored in memory, on disk, or both.
Size: Assess the size of cached data.
Job Analysis: Investigate jobs that dominate the timeline or have unusually long durations. Look for gaps caused by complex execution plans, non-Spark code, driver overload, or cluster malfunction.

4.2. Spark UI: Executor Tab

Storage Memory: Compare used vs. available memory.
Task Time (Garbage Collection): Review long tasks and garbage collection times.
Shuffle Read/Write: Measure data transferred between stages.

5. Additional Diagnostic Methods

System Tables in Unity Catalog: Query system tables for cost attribution and resource usage trends (Cost Observability Queries).
Tagging Analysis: Use tags to identify which teams or projects consume the most resources.
Dashboards & Alerts: Set up cost dashboards and budget alerts for proactive monitoring.

Phase 2: Cluster/Code/Data Best Practices Alignment

Cluster UI Configuration and Cost Attribution

Effectively configuring clusters and workloads in Databricks is essential for balancing performance, scalability, and cost. Tuning settings and features, when used strategically, can help organizations maximize resource efficiency and minimize unnecessary spending.

Key Configuration Strategies

1. Reduce Idle Time: Clusters continue to incur costs even when not actively processing workloads.
To avoid paying for unused resources:

Enable Auto-Terminate: Set clusters to automatically shut down after a period of inactivity. This simple setting can significantly reduce wasted spending.

Enable Autoscaling: Workloads fluctuate in size and complexity. Autoscaling allows clusters to dynamically adjust the number of nodes based on demand. Automatic Resource Adjustment: Scale up for heavy jobs and scale down for lighter loads, ensuring you only pay for what you use.

Use Spot Instances: For batch processing and non-critical workloads, spot instances offer substantial cost savings. Lower VM Costs: Spot instances are typically much cheaper than standard VMs. However, they are not recommended for jobs requiring constant uptime due to potential interruptions.

Leverage Photon Engine: Photon is Databricks’ high-performance, vectorized query engine. Accelerate Large Workloads: Photon can dramatically reduce runtime for compute-intensive tasks, improving both speed and cost efficiency.

Keep Runtimes Up to Date: Using the latest Databricks runtime ensures optimal performance and security. Benefit from Improvements: Regular updates include performance enhancements, bug fixes, and new features.

Apply Cluster Policies: Cluster policies help standardize configurations and enforce cost controls across teams. Governance and Consistency: Policies can restrict certain settings, enforce tagging, and ensure clusters are created with cost-effective defaults.

Optimize Storage: Storage type impacts both performance and cost. Switch from HDDs to SSDs: SSDs provide faster caching and shuffle operations, which can improve job efficiency and reduce runtime.

Tag Clusters for Cost Attribution: Tagging clusters enables granular tracking and reporting. Visibility and Accountability: Use tags to attribute costs to specific teams, projects, or environments, supporting better budgeting and chargeback processes.

Select the Right Cluster Type: Different workloads require different cluster types; see the table below for Serverless vs. Classic Compute:

Feature | Classic Compute | Serverless Compute
Control | Full control over config & network | Minimal control, fully managed by Databricks
Startup Time | Slower (unless pre-warmed) | Instant
Cost Model | Hourly, supports reservations | Pay-per-use, elastic scaling
Security | VNet injection, private endpoints | NCC-based private connectivity
Best For | Heavy ETL, ML, compliance workloads | Interactive queries, unpredictable demand

Job Clusters: Ideal for scheduled jobs and Delta Live Tables.
All-Purpose Clusters: Suited for ad-hoc analysis and collaborative work.
Single-Node Clusters: Efficient for simple exploratory data analysis or pure Python tasks.
Serverless Compute: Scalable, managed workloads with automatic resource management.

Monitor and Adjust Regularly: Regularly review cluster metrics and query history. Continuous Optimization: Use built-in dashboards to monitor usage, identify bottlenecks, and adjust cluster size or configuration as needed.

Code Best Practices

Avoid Reprocessing Large Tables: Use a CDC (Change Data Capture) architecture with Delta Live Tables (DLT) to process only new or changed data, minimizing unnecessary computation.

Ensure Code Parallelizes Well: Write Spark code that leverages parallel processing. Avoid loops, deeply nested structures, and inefficient user-defined functions (UDFs) that can hinder scalability.

Reduce Memory Consumption: Tweak Spark configurations to minimize memory overhead. Clean out legacy or unnecessary settings that may have carried over from previous Spark versions.
Prefer SQL Over Complex Python: Use SQL (a declarative language) for Spark jobs whenever possible. SQL queries are typically more efficient and easier to optimize than complex Python logic.

Modularize Notebooks: Use %run to split large notebooks into smaller, reusable modules. This improves maintainability.

Use LIMIT in Exploratory Queries: When exploring data, always use the LIMIT clause to avoid scanning large datasets unnecessarily.

Monitor Job Performance: Regularly review the Spark UI to detect inefficiencies such as high shuffle, input, or output. Review this guide for optimization opportunities: Spark stage high I/O - Azure Databricks | Microsoft Learn.

Databricks Code Performance Enhancements & Data Engineering Best Practices

By enabling the features below and applying best practices, you can significantly lower costs, accelerate job execution, and build Databricks pipelines that are both scalable and highly reliable. For more guidance review: Comprehensive Guide to Optimize Data Workloads | Databricks. A short sketch applying several of these features follows the table.

Feature / Technique | Purpose / Benefit | How to Use / Enable / Key Notes
Disk Caching | Accelerates repeated reads of Parquet files | Set spark.databricks.io.cache.enabled = true
Dynamic File Pruning (DFP) | Skips irrelevant data files during queries, improves query performance | Enabled by default in Databricks
Low Shuffle Merge | Reduces data rewriting during MERGE operations, less need to recalculate ZORDER | Use a Databricks runtime with the feature enabled
Adaptive Query Execution (AQE) | Dynamically optimizes query plans based on runtime statistics | Available in Spark 3.0+, enabled by default
Deletion Vectors | Efficient row removal/change without rewriting entire Parquet file | Enable in workspace settings, use with Delta Lake
Materialized Views | Faster BI queries, reduced compute for frequently accessed data | Create in Databricks SQL
Optimize | Compacts Delta Lake files, improves query performance | Run regularly, combine with ZORDER on high-cardinality columns
ZORDER | Physically sorts/co-locates data by chosen columns for faster queries | Use with OPTIMIZE, select columns frequently used in filters/joins
Auto Optimize | Automatically compacts small files during writes | Enable optimizeWrite and autoCompact table properties
Liquid Clustering | Simplifies data layout, replaces partitioning/ZORDER, flexible clustering keys | Recommended for new Delta tables, enables easy redefinition of clustering keys
File Size Tuning | Achieve optimal file size for performance and cost | Set delta.targetFileSize table property
Broadcast Hash Join | Optimizes joins by broadcasting smaller tables | Adjust spark.sql.autoBroadcastJoinThreshold and spark.databricks.adaptive.autoBroadcastJoinThreshold
Shuffle Hash Join | Faster join alternative to sort-merge join | Prefer over sort-merge join when broadcasting isn’t possible, Photon engine can help
Cost-Based Optimizer (CBO) | Improves query plans for complex joins | Enabled by default, collect column/table statistics with ANALYZE TABLE
Data Spilling & Skew | Handles uneven data distribution and excessive shuffle | Use AQE, set spark.sql.shuffle.partitions=auto, optimize partitioning
Data Explosion Management | Controls partition sizes after transformations (e.g., explode, join) | Adjust spark.sql.files.maxPartitionBytes, use repartition() after reads
Delta Merge | Efficient upserts and CDC (Change Data Capture) | Use MERGE operation in Delta Lake, combine with CDC architecture
Data Purging (Vacuum) | Removes stale data files, maintains storage efficiency | Run VACUUM regularly based on transaction frequency
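As a minimal illustration of a few of the techniques in the table (Auto Optimize table properties, OPTIMIZE with ZORDER, VACUUM, and AQE-managed shuffle partitions), here is a hedged sketch for a Databricks notebook. The table and column names are placeholders, and the right ZORDER columns and vacuum cadence depend on your workload.

```python
# Sketch only: Delta maintenance for a hypothetical Unity Catalog table.
# Assumes a Databricks notebook where `spark` is already defined.
table = "main.sales.transactions"      # placeholder table name

# Auto Optimize: compact small files and optimize writes on this table.
spark.sql(f"""
    ALTER TABLE {table} SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# Compact existing files and co-locate rows on a frequently filtered column.
spark.sql(f"OPTIMIZE {table} ZORDER BY (customer_id)")   # customer_id is a placeholder

# Remove files no longer referenced by the table (default 7-day retention applies).
spark.sql(f"VACUUM {table}")

# Let AQE choose shuffle partition counts, as suggested for spills/skew above.
spark.conf.set("spark.sql.shuffle.partitions", "auto")
```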
Phase 3: Team Alignment and Next Steps

Implementing Cost Observability and Taking Action

Effective cost management in Databricks goes beyond configuration and code—it requires robust observability, granular tracking, and proactive measures. The sections below outline how your teams can achieve this using system tables, tagging, dashboards, and actionable scripts.

Cost Observability with System Tables

Databricks Unity Catalog provides system tables that store operational data for your account. These tables enable historical cost observability and empower FinOps teams to analyze spend independently.

System Tables Location: Found inside Unity Catalog under the “system” schema.
Key Benefits: Structured data for querying, historical analysis, and cost attribution.
Action: Assign permissions to FinOps teams so they can access and analyze dedicated cost tables.

Enable Tags for Granular Tracking

Tagging is a powerful feature for tracking, reporting, and budgeting at a granular level.

Classic Compute: Manually add key/value pairs when creating clusters, jobs, SQL Warehouses, or Model Serving endpoints. Use cluster policies to enforce custom tags.
Serverless Compute: Create budget policies and assign permissions to teams or members for serverless workloads.
Action: Tag all compute resources to enable detailed cost attribution and reporting.

Track Costs with Dashboards and Alerts

Databricks offers prebuilt dashboards and queries for cost forecasting and usage analysis.

Dashboards: Visualize spend, usage trends, and forecast future costs.
Prebuilt Queries: Use top queries with system tables to answer meaningful cost questions (a sample query sketch appears at the end of this section).
Budget Alerts: Set up alerts in the Account Console (Usage > Budget) to receive notifications when spend approaches defined thresholds.

Build a Culture of Efficiency

To go beyond technical fixes and build a culture of efficiency, focus on the following strategic actions:

Collaborate with Internal Engineers: Spend time with engineering teams to understand workload patterns and optimization opportunities.
Peer Reviews and Code Audits: Conduct regular code review sessions and peer reviews to ensure best practices are followed for Spark jobs, data pipelines, and cluster configurations.
Create Internal Best Practice Documentation: Develop clear guidelines for writing optimized code, managing data, and maintaining clusters. Make these resources easily accessible for all teams.
Implement Observability Dashboards: Use Databricks’ built-in features to create dashboards that track spend, monitor resource utilization, and highlight anomalies.
Set Alerts and Budgets: Configure alerts for long-running workloads and establish budgets using prebuilt Databricks capabilities to prevent cost overruns.

5. Azure Reservations and Azure Savings Plan

When optimizing Databricks costs on Azure, it’s important to understand the two main commitment-based savings options: Azure Reservations and Azure Savings Plans. Both can help you reduce compute costs, but they differ in flexibility and how savings are applied.

Which Should You Choose?

Reservations are ideal if you have stable, predictable Databricks workloads and want maximum savings.
Savings Plans are better if you expect your compute needs to change, or if you want a simpler, more flexible way to save across multiple services.
Pro Tip: You can combine both options—use Reservations for your baseline, always-on Databricks clusters, and Savings Plans for bursty, variable, or new workloads.
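To make the system-table and tagging guidance above concrete, here is a hedged sketch of a cost-attribution query, run from a notebook or Databricks SQL. The column names follow the documented system.billing.usage schema but should be verified against your workspace, and the 'team' tag key is a placeholder for whatever tag convention you enforce.

```python
# Sketch only: last 30 days of DBU consumption grouped by SKU and a custom tag.
# Verify the system.billing.usage columns in your workspace before relying on this.
usage_by_team = spark.sql("""
    SELECT
        usage_date,
        sku_name,
        custom_tags['team'] AS team,          -- 'team' is a hypothetical tag key
        SUM(usage_quantity) AS dbus_consumed
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_date, sku_name, custom_tags['team']
    ORDER BY dbus_consumed DESC
""")
display(usage_by_team)
```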
Summary Table: Action Steps

It’s critical to monitor costs continuously and align your teams with established best practices, while scheduling regular code review sessions to ensure efficiency and consistency.

Area | Best Practice / Action
System Tables | Use for historical cost analysis and attribution
Tagging | Apply to all compute resources for granular tracking
Dashboards | Visualize spend, usage, and forecasts
Alerts | Set budget alerts for proactive cost management
Scripts/Queries | Build custom analysis tools for deep insights
Cluster/Data/Code | Review & align: regularly review best practices, share findings, and align teams on optimization
Save on your Usage | Consider Azure Reservations and Azure Savings Plan

Overload to Optimal: Tuning Microsoft Fabric Capacity
Co-Authored by: Daya Ram, Sr. Cloud Solutions Architect

Optimizing Microsoft Fabric capacity is both a performance and cost exercise. By diagnosing workloads, tuning cluster and Spark settings, and applying data best practices, teams can reduce run times, avoid throttling, and lower total cost of ownership—without compromising SLAs. Use Fabric’s built-in observability (Monitoring Hub, Capacity Metrics, Spark UI) to identify hot spots and then apply cluster- and data-level remediations.

Capacity Planning

For capacity planning and sizing guidance, see Plan your capacity size. Selecting the wrong SKU can lead to two major issues:

Over-provisioning: Paying for resources you don’t need.
Under-provisioning: Struggling with performance bottlenecks and failed jobs.

To simplify this process, Microsoft provides the Fabric SKU Estimator, a powerful tool designed to help organizations accurately size their capacity based on real-world usage patterns. Run the SKU Estimator before onboarding new workloads or scaling existing ones. Combine its recommendations with monitoring tools like Fabric Capacity Metrics to validate performance and adjust as needed.

Options to Diagnose Capacity Issues

1) Monitoring Hub — Start with the Story of the Run

What to use it for: Browse Spark activity across applications (notebooks, Spark Job Definitions, and pipelines). Quickly surface long‑running or anomalous runs; view read/write bytes, idle time, core allocation, and utilization.

How to use it: From the Fabric portal, open Monitoring (Monitor Hub). Select a Notebook or Spark Job Definition run and choose Historical Runs. Inspect the Run Duration chart; click on a run to see read/write bytes, idle time, core allocation, overall utilization, and other Spark metrics.

What to look for: Use the guide application detail monitoring to review and monitor your application.

2) Capacity Metrics App — Measure the Whole Environment

What to use it for: Review capacity-wide utilization and system events (overloads, queueing); compare utilization across time windows and identify sustained peaks.

How to use it: Open the Microsoft Fabric Capacity Metrics app for your capacity. Review the Compute page (ribbon charts, utilization trends) and the System events tab to see overload or throttling windows. Use the Timepoint page to drill into a 30‑second interval and see which operations consumed the most compute.

What to look for: Use the troubleshooting guide Monitor and identify capacity usage to pinpoint top CU‑consuming items.

3) Spark UI — Diagnose at a Deeper Level

Why it matters: The Spark UI exposes skew, shuffle, memory pressure, and long stages. Use it after Monitoring Hub/Capacity Metrics to pinpoint the problematic job.

Key tabs to inspect:

Stages: uneven task durations (data skew), heavy shuffle read/write, large input/output volumes.
Executors: storage memory, task time (GC), shuffle metrics. High GC or frequent spills indicate memory tuning is needed.
Storage: which RDDs/cached tables occupy memory; any disk spill.
Jobs: long-running jobs and gaps in the timeline (driver compilation, non-Spark code, driver overload).

What to look for: Data skew, memory usage, and high/low shuffle. Adjust Apache Spark settings (set via environment Spark properties or session config), e.g. spark.ms.autotune.enabled, spark.task.cpus and spark.sql.shuffle.partitions; a minimal session-level example follows below.
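As a minimal sketch of that last step, the session-level settings named above can be applied in a Fabric notebook roughly as follows. Assumptions: the spark session is provided by the Fabric runtime, and the values shown are illustrative starting points rather than recommendations; spark.task.cpus is typically set in the environment/pool Spark properties rather than at session runtime.

```python
# Sketch: session-level Spark settings for a Microsoft Fabric notebook.
# `spark` is provided by the runtime; values are illustrative only.

# Let Autotune adjust shuffle partitions, broadcast thresholds, and file partition sizes.
spark.conf.set("spark.ms.autotune.enabled", "true")

# Or pin the shuffle behavior manually while investigating skew in the Spark UI;
# "auto" lets AQE coalesce shuffle partitions.
spark.conf.set("spark.sql.shuffle.partitions", "auto")

# spark.task.cpus usually belongs in the environment's Spark properties (static config),
# e.g. spark.task.cpus = 1, rather than being changed here at session level.

print(spark.conf.get("spark.ms.autotune.enabled"))
```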
Remediation and Optimization Suggestions

A) Cluster & Workspace Settings

Runtime & Native Execution Engine (NEE): Use Fabric Runtime 1.3 (Spark 3.5, Delta 3.2) and enable the Native Execution Engine to boost performance; enable it at the environment level under Spark compute → Acceleration.

Starter Pools vs. Custom Pools: Starter Pool: prehydrated, medium‑size pools; fast session starts, good for dev/quick runs. Custom Pools: size nodes, enable autoscale, dynamic executors. Create via workspace Spark Settings (requires a capacity admin to enable workspace customization).

High Concurrency Session Sharing: Enable High Concurrency to share Spark sessions across notebooks (and pipelines) to reduce session startup latency and cost; use session tags in pipelines to group notebooks.

Autotune for Spark: Enable Autotune (spark.ms.autotune.enabled = true) to auto‑adjust per query: spark.sql.shuffle.partitions, spark.sql.autoBroadcastJoinThreshold, spark.sql.files.maxPartitionBytes. Autotune is disabled by default and is in preview; enable it per environment or session.

B) Data‑Level Best Practices

Microsoft Fabric offers several approaches to maintain optimal file sizes in Delta tables; review the documentation here: Table Compaction - Microsoft Fabric.

Intelligent Cache: Enabled by default (Runtime 1.1/1.2) for Spark pools; caches frequently read files at the node level for Delta/Parquet/CSV and improves subsequent read performance and TCO.

OPTIMIZE & Z‑Order: Run OPTIMIZE regularly to rewrite files and improve file layout.

V‑Order: V‑Order (disabled by default in new workspaces) can accelerate reads for read‑heavy workloads; enable via spark.sql.parquet.vorder.default = true.

Vacuum: Run VACUUM to remove unreferenced files (stale data); the default retention is 7 days. Align retention across OneLake to control storage costs and maintain time travel.

Collaboration & Next Steps

Engage the Data Engineering Team to Define an Optimization Playbook: Start by reviewing capacity sizing guidance and cluster‑level optimizations (runtime/NEE, pools, concurrency, Autotune), and then target data improvements (Z‑Order, compaction, caching, query refactors).

Triage: Monitor Hub → Capacity Metrics → Spark UI to map workloads and identify high‑impact jobs and workloads causing throttling.
Schedule: Operationalize maintenance: OPTIMIZE (full or selective) during off‑peak windows; enable Auto Compaction for micro‑batch/streaming writes; add VACUUM to your cadence with agreed retention (a short sketch of this maintenance cadence follows below). Add regular code review sessions to ensure consistent performance patterns.
Fix: Adjust pool sizing or concurrency; enable Autotune; tune shuffle partitions; refactor problematic queries; re‑run compaction.
Verify: Re‑run the job and confirm the change, e.g. reduced run time, lower shuffle, improved utilization.
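A hedged sketch of that maintenance cadence for a Fabric Lakehouse Delta table is shown below. The table name is a placeholder, the V-Order property is the one quoted above, and the retention value must respect the minimum configured for your environment.

```python
# Sketch: data-level maintenance for a hypothetical Fabric Lakehouse table.
# `spark` is provided by the Fabric runtime; names and values are illustrative.

# Write V-Ordered Parquet files for read-heavy workloads (property quoted above).
spark.conf.set("spark.sql.parquet.vorder.default", "true")

table = "lakehouse_sales"   # placeholder Lakehouse Delta table

# Compact small files during an off-peak maintenance window.
spark.sql(f"OPTIMIZE {table}")

# Remove unreferenced files, keeping the default 7-day (168-hour) retention.
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
```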