Analytics on Azure Blog

Azure Databricks Cost Optimization: A Practical Guide

Rafia_Aqil, Microsoft
Nov 15, 2025

Co-authored by Sanjeev Nair

 

This guide walks through a proven approach to Databricks cost optimization, structured in three phases: Discovery, Cluster/Data/Code Best Practices, and Team Alignment & Next Steps. 

Phase 1: Discovery 

Assessing Your Current State 

The following questions are designed to guide your initial assessment and help you identify areas for improvement. Documenting answers to each will provide a baseline for optimization and inform the next phases of your cost management strategy. 

Environment & Organization

What is the current scale of your Databricks environment?  

  • How many workspaces do you have? 
  • How are your workspaces organized (e.g., by environment type, region, use case)? 
  • How many clusters are deployed? 
  • How many users are active? 

 

What are the primary use cases for Databricks in your organization?  

  • Data engineering 
  • Data science 
  • Machine learning 
  • Business intelligence

 

Cluster Management 

How are clusters currently managed?  

  • Manual configuration 
  • Automated scripts 
  • Databricks REST API 
  • Cluster policies 

 

What is the average cluster uptime?  

  • Hours per day 
  • Days per week 

 

What is the average cluster utilization rate?  

  • CPU usage 
  • Memory usage 

Cost Optimization 

What is the current monthly spend on Databricks?  

  • Total cost 
  • Breakdown by workspace 
  • Breakdown by cluster 

What cost management tools are currently in use?  

  • Azure Cost Management 
  • Third-party tools 

Are there any existing cost optimization strategies in place?  

  • Reserved instances 
  • Spot instances 
  • Cluster auto-scaling 

Data Management 

What is the current data storage strategy?  

  • Data lake 
  • Data warehouse 
  • Hybrid 

What is the average data ingestion rate?  

  • GB per day 
  • Number of files 

What is the average data processing time?  

  • ETL jobs 
  • Machine learning models 

What types of data formats are used in your environment?  

  • Delta Lake 
  • Parquet 
  • JSON 
  • CSV 
  • Other formats relevant to your workloads 

Performance Monitoring 

What performance monitoring tools are currently in use?  

  • Databricks Ganglia 
  • Azure Monitor 
  • Third-party tools 

What are the key performance metrics tracked?  

  • Job execution time 
  • Cluster performance 
  • Data processing speed 

Future Planning 

Are there any planned expansions or changes to the Databricks environment?  

  • New use cases 
  • Increased data volume 
  • Additional users 

What are the long-term goals for Databricks cost optimization?  

  • Reducing overall spend 
  • Improving resource utilization & cost attribution 
  • Enhancing performance 

Understanding Databricks Cost Structure 

Total Cost = Cloud Cost + DBU Cost 

    • Cloud Cost: Compute (VMs, networking, IP addresses), storage (ADLS, MLflow artifacts), other services (firewalls), cluster type (serverless compute, classic compute) 
    • DBU Cost: Workload size, cluster/warehouse size, photon acceleration, compute runtime, workspace tier, SKU type (Jobs, Delta Live Tables, All Purpose Clusters, Serverless), model serving, queries per second, model execution time 
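
As a rough illustration of how the two components combine, here is a back-of-envelope sketch; all rates and quantities below are hypothetical placeholders, not published prices:

    # Hypothetical monthly estimate; substitute your actual rates and usage.
    vm_hours = 720          # VM-hours consumed by cluster nodes this month
    vm_rate = 0.55          # $/VM-hour paid to Azure (placeholder)
    dbus_consumed = 1800    # DBUs consumed this month (see system.billing.usage)
    dbu_rate = 0.30         # $/DBU for the SKU and workspace tier (placeholder)

    cloud_cost = vm_hours * vm_rate        # Azure infrastructure side
    dbu_cost = dbus_consumed * dbu_rate    # Databricks consumption side
    total_cost = cloud_cost + dbu_cost     # 396 + 540 = 936
    print(f"${total_cost:,.2f}")
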
Diagnosing Cost and Performance Issues 

Effectively diagnosing cost and performance issues in Databricks requires a structured approach. Use the following steps and metrics to gain visibility into your environment and uncover actionable insights. 

  1. Review Cluster Metrics 
    • CPU Utilization: Track guest, iowait, idle, irq, nice, softirq, steal, system, and user times to understand how compute resources are being used. 
    • Memory Utilization: Monitor used, free, buffer, and cached memory to identify over- or under-utilization. 
    • Key Question: Is your cluster over- or under-utilized? Are resources being wasted or stretched too thin? 
  2. Review SQL Warehouse Metrics 
    1. Live Statistics: Monitor warehouse status, running/queued queries, and current cluster count. 
    2. Time Scale Filter: Analyze query and cluster activity over different time frames (8 hours, 24 hours, 7 days, 14 days). 
    3. Peak Query Count Chart: Identify periods of high concurrency. 
    4. Completed Query Count Chart: Track throughput and query success/failure rates. 
    5. Running Clusters Chart: Observe cluster allocation and recycling events. 
    6. Query History Table: Filter and analyze queries by user, duration, status, and statement type. 
  3. Review Spark UI 
    • Stages Tab: Look for skewed data, high input/output, and shuffle times. Uneven task durations may indicate data skew or inefficient data handling. 
    • Jobs Timeline: Identify long-running jobs or stages that consume excessive resources. 
    • Stage Analysis: Determine if stages are I/O bound or suffering from data skew/spill. 
    • Executor Metrics: Monitor memory usage, CPU utilization, and disk I/O. Frequent garbage collection or high memory usage may signal the need for better resource allocation. 
  3.1. Storage & Jobs Tab 
    • Storage Level: Check if data is stored in memory, on disk, or both. 
    • Size: Assess the size of cached data. 
    • Job Analysis: Investigate jobs that dominate the timeline or have unusually long durations. Look for gaps caused by complex execution plans, non-Spark code, driver overload, or cluster malfunction. 
  3.2. Executor Tab 
    • Storage Memory: Compare used vs. available memory. 
    • Task Time (Garbage Collection): Review long tasks and garbage collection times. 
    • Shuffle Read/Write: Measure data transferred between stages. 
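
For example, if the Stages tab suggests skew, a quick way to confirm it is to inspect the distribution of the suspected join or grouping key. A minimal sketch; the orders table and customer_id column are hypothetical names:

    # Rough skew check: a few dominant keys => straggler tasks in shuffles/joins.
    from pyspark.sql import functions as F

    key_counts = (
        spark.table("orders")            # hypothetical table name
             .groupBy("customer_id")     # suspected skewed key
             .count()
             .orderBy(F.desc("count"))
    )
    key_counts.show(10)  # a handful of huge counts vs. a long tail indicates skew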

 

Phase 2: Cluster/Code/Data Best Practices Alignment 

Cluster UI Configuration and Cost Attribution 

Effectively configuring clusters and workloads in Databricks is essential for balancing performance, scalability, and cost. Tuning settings and using features strategically can help organizations maximize resource efficiency and minimize unnecessary spending. 

Key Configuration Strategies 

  1. Reduce Idle Time: Clusters continue to incur costs even when not actively processing workloads. To avoid paying for unused resources: 
    • Enable Auto-Terminate: Set clusters to shut down automatically after a period of inactivity. This simple setting can significantly reduce wasted spend. 
  2. Enable Autoscaling: Workloads fluctuate in size and complexity. Autoscaling allows clusters to dynamically adjust the number of nodes based on demand: 
    • Automatic Resource Adjustment: Scale up for heavy jobs and scale down for lighter loads, ensuring you only pay for what you use. 
  3. Use Spot Instances: For batch processing and non-critical workloads, spot instances offer substantial cost savings: 
    • Lower VM Costs: Spot instances are typically much cheaper than standard VMs. However, they are not recommended for jobs requiring constant uptime due to potential interruptions. 
  4. Leverage Photon Engine: Photon is Databricks’ high-performance, vectorized query engine: 
    • Accelerate Large Workloads: Photon can dramatically reduce runtime for compute-intensive tasks, improving both speed and cost efficiency. 
  5. Keep Runtimes Up to Date: Using the latest Databricks runtime ensures optimal performance and security: 
    • Benefit from Improvements: Regular updates include performance enhancements, bug fixes, and new features. 
  6. Apply Cluster Policies: Cluster policies help standardize configurations and enforce cost controls across teams: 
    • Governance and Consistency: Policies can restrict certain settings, enforce tagging, and ensure clusters are created with cost-effective defaults. 
  7. Optimize Storage: Storage type impacts both performance and cost: 
    • Switch from HDDs to SSDs: SSDs provide faster caching and shuffle operations, which can improve job efficiency and reduce runtime. 
  8. Tag Clusters for Cost Attribution: Tagging clusters enables granular tracking and reporting: 
    • Visibility and Accountability: Use tags to attribute costs to specific teams, projects, or environments, supporting better budgeting and chargeback processes. 
  9. Select the Right Cluster Type: Different workloads require different cluster types: 
    • Job Clusters: Ideal for scheduled jobs and Delta Live Tables. 
    • All-Purpose Clusters: Suited for ad-hoc analysis and collaborative work. 
    • Single-Node Clusters: Efficient for simple exploratory data analysis or pure Python tasks. 
    • Serverless Compute: Scalable, managed workloads with automatic resource management. 
  10. Monitor and Adjust Regularly: Routinely review cluster metrics and query history: 
    • Continuous Optimization: Use built-in dashboards to monitor usage, identify bottlenecks, and adjust cluster size or configuration as needed. 
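
Several of these settings come together in a single cluster specification. Below is a minimal sketch of a payload for the Databricks Clusters API; the node type, runtime version, and tag values are assumptions to adapt to your environment:

    # Sketch of a cost-conscious cluster spec (Clusters API payload shape).
    cluster_spec = {
        "cluster_name": "etl-batch",
        "spark_version": "15.4.x-scala2.12",   # keep runtimes up to date
        "node_type_id": "Standard_D4ds_v5",    # placeholder node type
        "autoscale": {"min_workers": 2, "max_workers": 8},  # autoscaling
        "autotermination_minutes": 30,         # reduce idle time
        "runtime_engine": "PHOTON",            # Photon acceleration
        "azure_attributes": {
            "first_on_demand": 1,              # keep the driver on-demand
            "availability": "SPOT_WITH_FALLBACK_AZURE",  # spot workers
        },
        "custom_tags": {"team": "data-eng", "project": "orders-etl"},  # attribution
    }
    # Send cluster_spec to the /api/2.1/clusters/create endpoint, or supply the
    # same fields through cluster policies or your infrastructure-as-code tool.
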
Code Best Practices 
  1. Avoid Reprocessing Large Tables 
    • Use a CDC (Change Data Capture) architecture with Delta Live Tables (DLT) to process only new or changed data, minimizing unnecessary computation (see the sketch after this list). 
  2. Ensure Code Parallelizes Well 
    • Write Spark code that leverages parallel processing. Avoid loops, deeply nested structures, and inefficient user-defined functions (UDFs) that can hinder scalability. 
  3. Reduce Memory Consumption 
    • Tune Spark configurations to minimize memory overhead. Clean out legacy or unnecessary settings that may have carried over from previous Spark versions. 
  4. Prefer SQL Over Complex Python 
    • Use SQL (a declarative language) for Spark jobs whenever possible. SQL queries are typically more efficient and easier to optimize than complex Python logic. 
  5. Modularize Notebooks 
    • Use %run to split large notebooks into smaller, reusable modules. This improves maintainability. 
  6. Use LIMIT in Exploratory Queries 
    • When exploring data, always use the LIMIT clause to avoid scanning large datasets unnecessarily. 
  7. Monitor Job Performance 
    • Regularly review the Spark UI to detect inefficiencies such as high shuffle, input, or output. Optimize join strategies and data layout accordingly. 
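
As a sketch of the CDC pattern from item 1, the DLT snippet below ingests a change feed and applies only new or changed rows to a target table; the source table and column names are hypothetical:

    # Delta Live Tables CDC sketch: apply only incremental changes.
    import dlt
    from pyspark.sql import functions as F

    @dlt.view
    def orders_updates():
        # Hypothetical stream of CDC events (inserts/updates) to apply.
        return spark.readStream.table("raw.orders_cdc")

    dlt.create_streaming_table("orders_silver")

    dlt.apply_changes(
        target="orders_silver",        # table kept in sync with the change feed
        source="orders_updates",
        keys=["order_id"],             # primary key for matching rows
        sequence_by=F.col("event_ts")  # ordering column for late/duplicate events
    )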

 

Databricks Code Performance Enhancements & Data Engineering Best Practices  

By enabling the features below and applying best practices, you can significantly lower costs, accelerate job execution, and build Databricks pipelines that are both scalable and highly reliable. For more guidance, review: Comprehensive Guide to Optimize Data Workloads | Databricks

  • Disk Caching: Accelerates repeated reads of Parquet files. Enable by setting spark.databricks.io.cache.enabled = true. 
  • Dynamic File Pruning (DFP): Skips irrelevant data files during queries, improving query performance. Enabled by default in Databricks. 
  • Low Shuffle Merge: Reduces data rewriting during MERGE operations, lessening the need to recalculate ZORDER. Use a Databricks runtime with the feature enabled. 
  • Adaptive Query Execution (AQE): Dynamically optimizes query plans based on runtime statistics. Available in Spark 3.0+, enabled by default. 
  • Deletion Vectors: Efficient row removal/changes without rewriting the entire Parquet file. Enable in workspace settings; use with Delta Lake. 
  • Materialized Views: Faster BI queries and reduced compute for frequently accessed data. Create in Databricks SQL. 
  • OPTIMIZE: Compacts Delta Lake files and improves query performance. Run regularly; combine with ZORDER on high-cardinality columns. 
  • ZORDER: Physically sorts/co-locates data by chosen columns for faster queries. Use with OPTIMIZE; select columns frequently used in filters/joins. 
  • Auto Optimize: Automatically compacts small files during writes. Enable the optimizeWrite and autoCompact table properties. 
  • Liquid Clustering: Simplifies data layout; replaces partitioning/ZORDER with flexible clustering keys. Recommended for new Delta tables; allows easy redefinition of clustering keys. 
  • File Size Tuning: Achieves optimal file size for performance and cost. Set the delta.targetFileSize table property. 
  • Broadcast Hash Join: Optimizes joins by broadcasting smaller tables. Adjust spark.sql.autoBroadcastJoinThreshold and spark.databricks.adaptive.autoBroadcastJoinThreshold. 
  • Shuffle Hash Join: A faster join alternative to sort-merge join. Prefer over sort-merge join when broadcasting isn’t possible; the Photon engine can help. 
  • Cost-Based Optimizer (CBO): Improves query plans for complex joins. Enabled by default; collect column/table statistics with ANALYZE TABLE. 
  • Data Spilling & Skew: Handles uneven data distribution and excessive shuffle. Use AQE, set spark.sql.shuffle.partitions=auto, and optimize partitioning. 
  • Data Explosion Management: Controls partition sizes after transformations (e.g., explode, join). Adjust spark.sql.files.maxPartitionBytes; use repartition() after reads. 
  • Delta Merge: Efficient upserts and CDC (Change Data Capture). Use the MERGE operation in Delta Lake; combine with a CDC architecture. 
  • Data Purging (VACUUM): Removes stale data files and maintains storage efficiency. Run VACUUM regularly based on transaction frequency. 
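
A few of these features can be enabled together in one short maintenance notebook. A sketch under the assumption of a Delta table named sales.orders with a customer_id column:

    # Session settings: disk caching plus AQE-managed shuffle partition sizing.
    spark.conf.set("spark.databricks.io.cache.enabled", "true")
    spark.conf.set("spark.sql.shuffle.partitions", "auto")

    # Table properties: Auto Optimize and a target file size (~128 MB).
    spark.sql("""
        ALTER TABLE sales.orders SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',
            'delta.autoOptimize.autoCompact'   = 'true',
            'delta.targetFileSize'             = '134217728'
        )
    """)

    # Routine maintenance: compact and co-locate, then purge stale files.
    spark.sql("OPTIMIZE sales.orders ZORDER BY (customer_id)")
    spark.sql("VACUUM sales.orders RETAIN 168 HOURS")   # 7-day retention
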

 

 Phase 3: Team Alignment and Next Steps

Implementing Cost Observability and Taking Action 

Effective cost management in Databricks goes beyond configuration and code: it requires robust observability, granular tracking, and proactive measures. The steps below outline how your teams can achieve this using system tables, tagging, dashboards, and actionable scripts. 

  1. Cost Observability with System Tables

Databricks Unity Catalog provides system tables that store operational data for your account. These tables enable historical cost observability and empower FinOps teams to analyze spend independently. 

    • System Tables Location: Found inside the Unity Catalog under the “system” schema.
    • Key Benefits: Structured data for querying, historical analysis, and cost attribution.
    • Action: Assign permissions to FinOps teams so they can access and analyze dedicated cost tables. 
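
As a starting point, the query below estimates month-to-date spend by workspace and tag by joining usage records to list prices. It is a sketch: the "team" tag key is an assumption, and list prices may differ from your negotiated rates.

    # Month-to-date estimated spend by workspace and team tag (sketch).
    mtd_spend = spark.sql("""
        SELECT
            u.workspace_id,
            u.custom_tags['team']                     AS team,
            SUM(u.usage_quantity * p.pricing.default) AS est_dollars
        FROM system.billing.usage u
        JOIN system.billing.list_prices p
          ON u.sku_name = p.sku_name
         AND u.usage_start_time >= p.price_start_time
         AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
        WHERE u.usage_date >= date_trunc('month', current_date())
        GROUP BY 1, 2
        ORDER BY est_dollars DESC
    """)
    mtd_spend.show()
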
  2. Enable Tags for Granular Tracking

Tagging is a powerful feature for tracking, reporting, and budgeting at a granular level. 

    • Classic Compute: Manually add key/value pairs when creating clusters, jobs, SQL Warehouses, or Model Serving endpoints. Use cluster policies to enforce custom tags. 
    • Serverless Compute: Create budget policies and assign permissions to teams or members for serverless workloads. 
    • Action: Tag all compute resources to enable detailed cost attribution and reporting. 
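
To enforce tagging rather than rely on convention, a cluster policy can require the tag at creation time. A minimal policy-definition sketch; the tag key and allowed values are assumptions:

    # Cluster policy definition (sketch): require a "team" tag and cap idle time.
    policy_definition = {
        "custom_tags.team": {
            "type": "allowlist",                       # tag must be one of these
            "values": ["data-eng", "data-science", "bi"],
        },
        "autotermination_minutes": {
            "type": "range",                           # bound auto-terminate
            "maxValue": 60,
            "defaultValue": 30,
        },
    }
    # Register this definition as a cluster policy, then grant teams permission
    # to create clusters only through that policy.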

 

  3. Track Costs with Dashboards and Alerts

Databricks offers prebuilt dashboards and queries for cost forecasting and usage analysis. 

    • Dashboards: Visualize spend, usage trends, and forecast future costs.
    • Prebuilt Queries: Use top queries with system tables to answer meaningful cost questions.
    • Budget Alerts: Set up alerts in the Account Console (Usage > Budget) to receive notifications when spend approaches defined thresholds. 
  4. Review Bottlenecks and Optimization Opportunities

With observability in place, regularly review system tables, dashboards, and tagged resources to identify: 

    • Cost Bottlenecks: Clusters or jobs with unusually high spend. 
    • Optimization Opportunities: Underutilized resources, inefficient jobs, or misconfigured clusters. 
    • Team Alignment: Share insights with engineering and FinOps teams to drive collaborative optimization. 
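
A quick way to surface cost bottlenecks is to rank jobs by DBU consumption. A sketch against the billing system table; usage_metadata.job_id is populated for job compute:

    # Top 20 jobs by DBUs over the last 30 days (sketch).
    spark.sql("""
        SELECT
            usage_metadata.job_id AS job_id,
            SUM(usage_quantity)   AS dbus
        FROM system.billing.usage
        WHERE usage_date >= current_date() - INTERVAL 30 DAYS
          AND usage_metadata.job_id IS NOT NULL
        GROUP BY 1
        ORDER BY dbus DESC
        LIMIT 20
    """).show()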

 

Summary: Cost Observability & Action Steps 

  • System Tables: Use for historical cost analysis and attribution. 
  • Tagging: Apply to all compute resources for granular tracking. 
  • Dashboards: Visualize spend, usage, and forecasts. 
  • Alerts: Set budget alerts for proactive cost management. 
  • Scripts/Queries: Build custom analysis tools for deep insights. 
  • Cluster/Data/Code Review & Align: Regularly review best practices, share findings, and align teams on optimization. 

 
