Co-Authored by Sanjeev Nair
This guide walks through a proven approach to Databricks cost optimization, structured in three phases: Discovery, Cluster/Code/Data Best Practices Alignment, and Team Alignment & Next Steps.
Phase 1: Discovery
Assessing Your Current State
The following questions are designed to guide your initial assessment and help you identify areas for improvement. Documenting answers to each will provide a baseline for optimization and inform the next phases of your cost management strategy.
Environment & Organization
- What is the current scale of your Databricks environment?
- How many workspaces do you have?
- How are your workspaces organized (e.g., by environment type, region, use case)?
- How many clusters are deployed? How many users are active?
- What are the primary use cases for Databricks in your organization?

Cluster Management
- How are clusters currently managed?
- What is the average cluster uptime?
- What is the average cluster utilization rate?

Cost Optimization
- What is the current monthly spend on Databricks?
- What cost management tools are currently in use?
- Are there any existing cost optimization strategies in place?

Data Management
- What is the current data storage strategy?
- What is the average data ingestion rate?
- What is the average data processing time?
- What types of data formats are used in your environment?

Performance Monitoring
- What performance monitoring tools are currently in use?
- What are the key performance metrics tracked?

Future Planning
- Are there any planned expansions or changes to the Databricks environment?
- What are the long-term goals for Databricks cost optimization?
Understanding Databricks Cost Structure
Total Cost = Cloud Cost + DBU Cost
- Cloud Cost: Compute (VMs, networking, IP addresses), storage (ADLS, MLflow artifacts), other services (firewalls), and cluster type (serverless compute vs. classic compute)
- DBU Cost: Driven by workload size, cluster/warehouse size, Photon acceleration, compute runtime, workspace tier, SKU type (Jobs, Delta Live Tables, All-Purpose Clusters, Serverless), model serving, queries per second, and model execution time
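To make the formula concrete, the sketch below prices a hypothetical job. Every rate here is a placeholder, not a quoted price; substitute the figures from your cloud provider's and Databricks' pricing pages.

```python
# Illustrative cost model. All rates are hypothetical placeholders;
# check your cloud and Databricks pricing pages for actual figures.
vm_rate_per_hour = 0.60      # hypothetical cloud VM price per node-hour
dbu_rate = 0.15              # hypothetical $/DBU for the Jobs SKU
dbus_per_node_hour = 2.0     # hypothetical DBU consumption per node-hour

nodes, hours = 8, 5          # example workload: 8 nodes running for 5 hours

cloud_cost = nodes * hours * vm_rate_per_hour
dbu_cost = nodes * hours * dbus_per_node_hour * dbu_rate
total_cost = cloud_cost + dbu_cost

print(f"cloud=${cloud_cost:.2f}  dbu=${dbu_cost:.2f}  total=${total_cost:.2f}")
# cloud=$24.00  dbu=$12.00  total=$36.00
```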
Diagnose Cost and Issues
Effectively diagnosing cost and performance issues in Databricks requires a structured approach. Use the following steps and metrics to gain visibility into your environment and uncover actionable insights.
1. Review Cluster Metrics
- CPU Utilization: Track guest, iowait, idle, irq, nice, softirq, steal, system, and user times to understand how compute resources are being used.
- Memory Utilization: Monitor used, free, buffer, and cached memory to identify over- or under-utilization.
- Key Question: Is your cluster over- or under-utilized? Are resources being wasted or stretched too thin?
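You can answer the utilization question at scale by querying the compute system tables rather than inspecting each cluster's UI. A minimal sketch, assuming the system.compute.node_timeline system table is enabled in your workspace (verify the exact column names in your release, and tune the thresholds to your standards):

```python
# Sketch: flag potentially under-utilized clusters over the last 14 days.
# Assumes access to system.compute.node_timeline; schema may vary by release.
low_util = spark.sql("""
    SELECT cluster_id,
           AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu_pct,
           AVG(mem_used_percent)                      AS avg_mem_pct
    FROM system.compute.node_timeline
    WHERE start_time >= current_date() - INTERVAL 14 DAYS
    GROUP BY cluster_id
    HAVING AVG(cpu_user_percent + cpu_system_percent) < 20  -- mostly idle CPU
       AND AVG(mem_used_percent) < 40                       -- mostly idle memory
    ORDER BY avg_cpu_pct
""")
display(low_util)
```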
2. Review SQL Warehouse Metrics
- Live Statistics: Monitor warehouse status, running/queued queries, and current cluster count.
- Time Scale Filter: Analyze query and cluster activity over different time frames (8 hours, 24 hours, 7 days, 14 days).
- Peak Query Count Chart: Identify periods of high concurrency.
- Completed Query Count Chart: Track throughput and query success/failure rates.
- Running Clusters Chart: Observe cluster allocation and recycling events.
- Query History Table: Filter and analyze queries by user, duration, status, and statement type.
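The same review can be scripted against the query-history system table where it is available. A minimal sketch, assuming a system.query.history table with start_time, executed_by, statement_text, and total_duration_ms columns (this table is a newer addition and column names may differ in your release):

```python
# Sketch: surface the longest-running warehouse queries from the last 7 days.
# Verify the system.query.history schema in your workspace before relying on it.
slow_queries = spark.sql("""
    SELECT executed_by,
           statement_text,
           total_duration_ms / 1000 AS duration_s
    FROM system.query.history
    WHERE start_time >= current_date() - INTERVAL 7 DAYS
    ORDER BY duration_s DESC
    LIMIT 20
""")
display(slow_queries)
```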
3. Review the Spark UI
- Stages Tab: Look for skewed data, high input/output, and long shuffle times. Uneven task durations may indicate data skew or inefficient data handling.
- Jobs Timeline: Identify long-running jobs or stages that consume excessive resources.
- Stage Analysis: Determine whether stages are I/O bound or suffering from data skew/spill.
- Executor Metrics: Monitor memory usage, CPU utilization, and disk I/O. Frequent garbage collection or high memory usage may signal the need for better resource allocation.
3.1. Storage & Jobs Tabs
- Storage Level: Check whether data is stored in memory, on disk, or both.
- Size: Assess the size of cached data.
- Job Analysis: Investigate jobs that dominate the timeline or have unusually long durations. Look for gaps caused by complex execution plans, non-Spark code, driver overload, or cluster malfunction.
3.2. Executor Tab
- Storage Memory: Compare used vs. available memory.
- Task Time (Garbage Collection): Review long tasks and garbage collection times.
- Shuffle Read/Write: Measure data transferred between stages.
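To corroborate what the Stages tab shows, you can measure skew directly. A minimal sketch, where df stands in for whatever DataFrame you are investigating and customer_id is a hypothetical repartitioning key:

```python
# Sketch: count rows per Spark partition to quantify skew. A few partitions
# holding most of the rows matches the uneven task durations seen in the
# Stages tab.
from pyspark.sql import functions as F

rows_per_partition = (
    df.groupBy(F.spark_partition_id().alias("partition_id"))
      .count()
      .orderBy(F.desc("count"))
)
rows_per_partition.show(10)

# If a handful of partitions dominate, repartition on a well-distributed key:
# df = df.repartition(200, "customer_id")  # hypothetical column
```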
Phase 2: Cluster/Code/Data Best Practices Alignment
Cluster UI Configuration and Cost Attribution
Effectively configuring clusters and workloads in Databricks is essential for balancing performance, scalability, and cost. Tuning settings and features strategically can help organizations maximize resource efficiency and minimize unnecessary spending.
Key Configuration Strategies
1. Reduce Idle Time: Clusters continue to incur costs even when not actively processing workloads. To avoid paying for unused resources:
   - Enable Auto-Terminate: Set clusters to shut down automatically after a period of inactivity. This simple setting can significantly reduce wasted spend.
2. Enable Autoscaling: Workloads fluctuate in size and complexity. Autoscaling allows clusters to dynamically adjust the number of nodes based on demand:
   - Automatic Resource Adjustment: Scale up for heavy jobs and scale down for lighter loads, ensuring you only pay for what you use.
3. Use Spot Instances: For batch processing and non-critical workloads, spot instances offer substantial cost savings:
   - Lower VM Costs: Spot instances are typically much cheaper than standard VMs. However, they are not recommended for jobs requiring constant uptime due to potential interruptions.
4. Leverage the Photon Engine: Photon is Databricks’ high-performance, vectorized query engine:
   - Accelerate Large Workloads: Photon can dramatically reduce runtime for compute-intensive tasks, improving both speed and cost efficiency.
5. Keep Runtimes Up to Date: Using the latest Databricks runtime ensures optimal performance and security:
   - Benefit from Improvements: Regular updates include performance enhancements, bug fixes, and new features.
6. Apply Cluster Policies: Cluster policies help standardize configurations and enforce cost controls across teams:
   - Governance and Consistency: Policies can restrict certain settings, enforce tagging, and ensure clusters are created with cost-effective defaults.
7. Optimize Storage: Storage type impacts both performance and cost:
   - Switch from HDDs to SSDs: SSDs provide faster caching and shuffle operations, which can improve job efficiency and reduce runtime.
8. Tag Clusters for Cost Attribution: Tagging clusters enables granular tracking and reporting:
   - Visibility and Accountability: Use tags to attribute costs to specific teams, projects, or environments, supporting better budgeting and chargeback processes.
9. Select the Right Cluster Type: Different workloads require different cluster types:
   - Job Clusters: Ideal for scheduled jobs and Delta Live Tables.
   - All-Purpose Clusters: Suited for ad-hoc analysis and collaborative work.
   - Single-Node Clusters: Efficient for simple exploratory data analysis or pure Python tasks.
   - Serverless Compute: Scalable, managed workloads with automatic resource management.
10. Monitor and Adjust Regularly: Regularly review cluster metrics and query history:
    - Continuous Optimization: Use built-in dashboards to monitor usage, identify bottlenecks, and adjust cluster size or configuration as needed.

Several of these settings come together in the cluster spec sketch below.
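The sketch expresses a cost-conscious cluster as a Clusters API payload. The field names follow the public API; the cluster name, runtime version, node type, tags, and policy ID are placeholders for your environment.

```python
# Sketch: cluster spec combining auto-termination, autoscaling, Photon,
# policies, tags, and spot instances. Values are placeholders.
cluster_spec = {
    "cluster_name": "etl-nightly",                    # hypothetical name
    "spark_version": "15.4.x-scala2.12",              # example LTS runtime
    "node_type_id": "Standard_DS3_v2",                # choose per workload
    "autoscale": {"min_workers": 2, "max_workers": 8},  # scale with demand
    "autotermination_minutes": 30,                    # cut idle spend
    "runtime_engine": "PHOTON",                       # enable Photon
    "policy_id": "<your-cluster-policy-id>",          # enforce governance
    "custom_tags": {                                  # cost attribution
        "team": "data-eng",
        "project": "nightly-etl",
        "environment": "prod",
    },
    "azure_attributes": {                             # spot for non-critical jobs
        "availability": "SPOT_WITH_FALLBACK_AZURE",
    },
}
```

The same fields can be supplied through the Clusters REST API, the Databricks SDK, Terraform, or the cluster UI; on AWS or GCP, replace azure_attributes with the corresponding cloud attributes block.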
Code Best Practices
- Avoid Reprocessing Large Tables: Use a CDC (Change Data Capture) architecture with Delta Live Tables (DLT) to process only new or changed data, minimizing unnecessary computation.
- Ensure Code Parallelizes Well: Write Spark code that leverages parallel processing. Avoid loops, deeply nested structures, and inefficient user-defined functions (UDFs) that can hinder scalability (see the sketch after this list).
- Reduce Memory Consumption: Tweak Spark configurations to minimize memory overhead. Clean out legacy or unnecessary settings that may have carried over from previous Spark versions.
- Prefer SQL Over Complex Python: Use declarative SQL for Spark jobs whenever possible. SQL queries are typically more efficient and easier to optimize than complex Python logic.
- Modularize Notebooks: Use %run to split large notebooks into smaller, reusable modules. This improves maintainability.
- Use LIMIT in Exploratory Queries: When exploring data, always use the LIMIT clause to avoid scanning large datasets unnecessarily.
- Monitor Job Performance: Regularly review the Spark UI to detect inefficiencies such as high shuffle, input, or output. Optimize join strategies and data layout accordingly.
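To make the parallelization and SQL-over-Python points concrete, the sketch below contrasts a row-wise Python UDF with the equivalent built-in expression. The toy orders DataFrame and the sales.orders table are hypothetical.

```python
# Sketch: the same transformation as a row-wise Python UDF versus Spark
# built-ins. Built-ins stay inside the optimizer (and Photon) and
# parallelize far better.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Toy data standing in for a real table.
orders = spark.createDataFrame([(100.0, 0.10), (250.0, 0.25)],
                               ["amount", "discount"])

# Slower: opaque Python UDF, rows serialized out to a Python worker.
@F.udf(DoubleType())
def net_amount_udf(amount, discount):
    return float(amount) * (1.0 - float(discount))

slow = orders.withColumn("net", net_amount_udf("amount", "discount"))

# Faster: declarative expression the engine can optimize end to end.
fast = orders.withColumn("net", F.col("amount") * (1 - F.col("discount")))

# Exploration: cap the scan with LIMIT instead of reading the full table.
spark.sql("SELECT * FROM sales.orders LIMIT 100")  # hypothetical table
```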
Databricks Code Performance Enhancements & Data Engineering Best Practices
By enabling the features below and applying these best practices, you can significantly lower costs, accelerate job execution, and build Databricks pipelines that are both scalable and highly reliable. For more guidance, review: Comprehensive Guide to Optimize Data Workloads | Databricks.
| Feature / Technique | Purpose / Benefit | How to Use / Enable / Key Notes |
| --- | --- | --- |
| Disk Cache | Accelerates repeated reads of Parquet files | Set spark.databricks.io.cache.enabled = true |
| Data Skipping | Skips irrelevant data files during queries, improves query performance | Enabled by default in Databricks |
| Low Shuffle Merge | Reduces data rewriting during MERGE operations, less need to recalculate ZORDER | Use a Databricks runtime with the feature enabled |
| Adaptive Query Execution (AQE) | Dynamically optimizes query plans based on runtime statistics | Available in Spark 3.0+, enabled by default |
| Deletion Vectors | Efficient row removal/change without rewriting the entire Parquet file | Enable in workspace settings, use with Delta Lake |
| Materialized Views | Faster BI queries, reduced compute for frequently accessed data | Create in Databricks SQL |
| OPTIMIZE | Compacts Delta Lake files, improves query performance | Run regularly, combine with ZORDER on high-cardinality columns |
| ZORDER | Physically sorts/co-locates data by chosen columns for faster queries | Use with OPTIMIZE, select columns frequently used in filters/joins |
| Auto Compaction / Optimized Writes | Automatically compacts small files during writes | Enable the optimizeWrite and autoCompact table properties |
| Liquid Clustering | Simplifies data layout, replaces partitioning/ZORDER, flexible clustering keys | Recommended for new Delta tables, enables easy redefinition of clustering keys |
| Target File Size | Achieve optimal file size for performance and cost | Set the delta.targetFileSize table property |
| Broadcast Hash Join | Optimizes joins by broadcasting smaller tables | Adjust spark.sql.autoBroadcastJoinThreshold and spark.databricks.adaptive.autoBroadcastJoinThreshold |
| Shuffle Hash Join | Faster join alternative to sort-merge join | Prefer over sort-merge join when broadcasting isn’t possible; the Photon engine can help |
| Cost-Based Optimizer (CBO) | Improves query plans for complex joins | Enabled by default, collect column/table statistics with ANALYZE TABLE |
| Data Skew Handling | Handles uneven data distribution and excessive shuffle | Use AQE, set spark.sql.shuffle.partitions=auto, optimize partitioning |
| Data Explosion Management | Controls partition sizes after transformations (e.g., explode, join) | Adjust spark.sql.files.maxPartitionBytes, use repartition() after reads |
| MERGE (Upserts) | Efficient upserts and CDC (Change Data Capture) | Use the MERGE operation in Delta Lake, combine with a CDC architecture |
| VACUUM | Removes stale data files, maintains storage efficiency | Run VACUUM regularly based on transaction frequency |
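Several of these table-level features can be enabled with a few statements. A minimal sketch against a hypothetical sales.orders Delta table (table, column, and size values are placeholders to adapt):

```python
# Sketch: applying several features from the table above to a Delta table.

# Disk cache for repeated Parquet/Delta reads (classic compute).
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Compact small files and co-locate rows on a frequently filtered column.
spark.sql("OPTIMIZE sales.orders ZORDER BY (customer_id)")

# Opt the table into optimized writes, auto compaction, and a target file size.
spark.sql("""
    ALTER TABLE sales.orders SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true',
        'delta.targetFileSize'             = '128mb'
    )
""")

# Refresh statistics so the cost-based optimizer has current inputs.
spark.sql("ANALYZE TABLE sales.orders COMPUTE STATISTICS FOR ALL COLUMNS")

# Remove stale files older than the retention window (7 days here).
spark.sql("VACUUM sales.orders RETAIN 168 HOURS")
```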
Phase 3: Team Alignment and Next Steps
Implementing Cost Observability and Taking Action
Effective cost management in Databricks goes beyond configuration and code; it requires robust observability, granular tracking, and proactive measures. The following outlines how your teams can achieve this using system tables, tagging, dashboards, and actionable scripts.
1. Cost Observability with System Tables
Databricks Unity Catalog provides system tables that store operational data for your account. These tables enable historical cost observability and empower FinOps teams to analyze spend independently.
- System Tables Location: Found in Unity Catalog under the system catalog (e.g., system.billing, system.compute).
- Key Benefits: Structured data for querying, historical analysis, and cost attribution.
- Action: Assign permissions to FinOps teams so they can access and analyze the dedicated cost tables.
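A minimal sketch of such an analysis, assuming access to the system.billing.usage and system.billing.list_prices tables (the join keys and pricing struct may differ slightly by release, so verify against your account):

```python
# Sketch: estimated DBU spend by SKU for the current month.
spend_by_sku = spark.sql("""
    SELECT u.sku_name,
           SUM(u.usage_quantity)                      AS dbus,
           SUM(u.usage_quantity * lp.pricing.default) AS est_dollars
    FROM system.billing.usage u
    JOIN system.billing.list_prices lp
      ON u.sku_name = lp.sku_name
     AND u.usage_start_time >= lp.price_start_time
     AND (lp.price_end_time IS NULL OR u.usage_start_time < lp.price_end_time)
    WHERE u.usage_date >= date_trunc('month', current_date())
    GROUP BY u.sku_name
    ORDER BY est_dollars DESC
""")
display(spend_by_sku)
```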
2. Enable Tags for Granular Tracking
Tagging is a powerful feature for tracking, reporting, and budgeting at a granular level.
- Classic Compute: Manually add key/value pairs when creating clusters, jobs, SQL Warehouses, or Model Serving endpoints. Use cluster policies to enforce custom tags.
- Serverless Compute: Create budget policies and assign permissions to teams or members for serverless workloads.
- Action: Tag all compute resources to enable detailed cost attribution and reporting.
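Once tags are in place, attribution becomes a simple GROUP BY. A sketch assuming a hypothetical "team" tag key in the custom_tags map on system.billing.usage:

```python
# Sketch: DBU consumption by team over the last 30 days, via custom tags.
spend_by_team = spark.sql("""
    SELECT custom_tags['team'] AS team,      -- hypothetical tag key
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY custom_tags['team']
    ORDER BY dbus DESC
""")
display(spend_by_team)
```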
3. Track Costs with Dashboards and Alerts
Databricks offers prebuilt dashboards and queries for cost forecasting and usage analysis.
- Dashboards: Visualize spend, usage trends, and forecast future costs.
- Prebuilt Queries: Use the top queries with system tables to answer meaningful cost questions.
- Budget Alerts: Set up alerts in the Account Console (Usage > Budget) to receive notifications when spend approaches defined thresholds.
4. Review Bottlenecks and Optimization Opportunities
With observability in place, regularly review system tables, dashboards, and tagged resources to identify:
- Cost Bottlenecks: Clusters or jobs with unusually high spend.
- Optimization Opportunities: Underutilized resources, inefficient jobs, or misconfigured clusters.
- Team Alignment: Share insights with engineering and FinOps teams to drive collaborative optimization.
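A starting point for the bottleneck review is ranking jobs by DBU consumption. A sketch using the usage_metadata struct on system.billing.usage (field names may vary by release, so verify in your account):

```python
# Sketch: top 20 jobs by DBU consumption over the last 30 days.
top_jobs = spark.sql("""
    SELECT workspace_id,
           usage_metadata.job_id AS job_id,
           SUM(usage_quantity)   AS dbus
    FROM system.billing.usage
    WHERE usage_metadata.job_id IS NOT NULL
      AND usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY workspace_id, usage_metadata.job_id
    ORDER BY dbus DESC
    LIMIT 20
""")
display(top_jobs)
```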
Summary Table: Cost Observability & Action Steps
| Area | Best Practice / Action |
| --- | --- |
| System Tables | Use for historical cost analysis and attribution |
| Tagging | Apply to all compute resources for granular tracking |
| Dashboards | Visualize spend, usage, and forecasts |
| Alerts | Set budget alerts for proactive cost management |
| Scripts/Queries | Build custom analysis tools for deep insights |
| Cluster/Code/Data Review & Align | Regularly review best practices, share findings, and align teams on optimization |