Co-Authored by Sanjeev Nair
This guide walks through a proven approach to Databricks cost optimization, structured in three phases: Discovery, Cluster/Code/Data Best Practices Alignment, and Team Alignment & Next Steps.
Phase 1: Discovery
Assessing Your Current State
The following questions are designed to guide your initial assessment and help you identify areas for improvement. Documenting answers to each will provide a baseline for optimization and inform the next phases of your cost management strategy.
Environment & Organization
- What is the current scale of your Databricks environment?
- How many workspaces do you have?
- How are your workspaces organized (e.g., by environment type, region, use case)?
- How many clusters are deployed? How many users are active?
- What are the primary use cases for Databricks in your organization?

Cluster Management
- How are clusters currently managed?
- What is the average cluster uptime?
- What is the average cluster utilization rate?

Cost Optimization
- What is the current monthly spend on Databricks?
- What cost management tools are currently in use?
- Are there any existing cost optimization strategies in place?

Data Management
- What is the current data storage strategy?
- What is the average data ingestion rate?
- What is the average data processing time?
- What types of data formats are used in your environment?

Performance Monitoring
- What performance monitoring tools are currently in use?
- What are the key performance metrics tracked?

Future Planning
- Are there any planned expansions or changes to the Databricks environment?
- What are the long-term goals for Databricks cost optimization?
Understanding Databricks Cost Structure
Total Cost = Cloud Cost + DBU Cost
- Cloud Cost: Compute (VMs, networking, IP addresses), storage (ADLS, MLflow artifacts), other services (firewalls), and cluster type (serverless compute vs. classic compute)
- DBU Cost: Driven by workload size, cluster/warehouse size, Photon acceleration, compute runtime, workspace tier, SKU type (Jobs, Delta Live Tables, All-Purpose Clusters, Serverless), model serving, queries per second, and model execution time
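To make the formula concrete, the sketch below prices a hypothetical job. Every rate here is a placeholder, not a quoted price; substitute the figures from your cloud provider's and Databricks' pricing pages.

```python
# Illustrative cost model. All rates are hypothetical placeholders;
# check your cloud and Databricks pricing pages for actual figures.
vm_rate_per_hour = 0.60      # hypothetical cloud VM price per node-hour
dbu_rate = 0.15              # hypothetical $/DBU for the Jobs SKU
dbus_per_node_hour = 2.0     # hypothetical DBU consumption per node-hour

nodes, hours = 8, 5          # example workload: 8 nodes running for 5 hours

cloud_cost = nodes * hours * vm_rate_per_hour
dbu_cost = nodes * hours * dbus_per_node_hour * dbu_rate
total_cost = cloud_cost + dbu_cost

print(f"cloud=${cloud_cost:.2f}  dbu=${dbu_cost:.2f}  total=${total_cost:.2f}")
# cloud=$24.00  dbu=$12.00  total=$36.00
```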
Diagnose Cost and Issues
Effectively diagnosing cost and performance issues in Databricks requires a structured approach. Use the following steps and metrics to gain visibility into your environment and uncover actionable insights.
1. Review Cluster Metrics
- CPU Utilization: Track guest, iowait, idle, irq, nice, softirq, steal, system, and user times to understand how compute resources are being used.
- Memory Utilization: Monitor used, free, buffer, and cached memory to identify over- or under-utilization.
- Key Question: Is your cluster over- or under-utilized? Are resources being wasted or stretched too thin?
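You can answer the utilization question at scale by querying the compute system tables rather than inspecting each cluster's UI. A minimal sketch, assuming the system.compute.node_timeline system table is enabled in your workspace (verify the exact column names in your release, and tune the thresholds to your standards):

```python
# Sketch: flag potentially under-utilized clusters over the last 14 days.
# Assumes access to system.compute.node_timeline; schema may vary by release.
low_util = spark.sql("""
    SELECT cluster_id,
           AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu_pct,
           AVG(mem_used_percent)                      AS avg_mem_pct
    FROM system.compute.node_timeline
    WHERE start_time >= current_date() - INTERVAL 14 DAYS
    GROUP BY cluster_id
    HAVING AVG(cpu_user_percent + cpu_system_percent) < 20  -- mostly idle CPU
       AND AVG(mem_used_percent) < 40                       -- mostly idle memory
    ORDER BY avg_cpu_pct
""")
display(low_util)
```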
2. Review SQL Warehouse Metrics
- Live Statistics: Monitor warehouse status, running/queued queries, and current cluster count.
- Time Scale Filter: Analyze query and cluster activity over different time frames (8 hours, 24 hours, 7 days, 14 days).
- Peak Query Count Chart: Identify periods of high concurrency.
- Completed Query Count Chart: Track throughput and query success/failure rates.
- Running Clusters Chart: Observe cluster allocation and recycling events.
- Query History Table: Filter and analyze queries by user, duration, status, and statement type.
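The same review can be scripted against the query-history system table where it is available. A minimal sketch, assuming a system.query.history table with start_time, executed_by, statement_text, and total_duration_ms columns (this table is a newer addition and column names may differ in your release):

```python
# Sketch: surface the longest-running warehouse queries from the last 7 days.
# Verify the system.query.history schema in your workspace before relying on it.
slow_queries = spark.sql("""
    SELECT executed_by,
           statement_text,
           total_duration_ms / 1000 AS duration_s
    FROM system.query.history
    WHERE start_time >= current_date() - INTERVAL 7 DAYS
    ORDER BY duration_s DESC
    LIMIT 20
""")
display(slow_queries)
```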
3. Review the Spark UI
- Stages Tab: Look for skewed data, high input/output, and long shuffle times. Uneven task durations may indicate data skew or inefficient data handling.
- Jobs Timeline: Identify long-running jobs or stages that consume excessive resources.
- Stage Analysis: Determine whether stages are I/O bound or suffering from data skew/spill.
- Executor Metrics: Monitor memory usage, CPU utilization, and disk I/O. Frequent garbage collection or high memory usage may signal the need for better resource allocation.
3.1. Storage & Jobs Tabs
- Storage Level: Check whether data is stored in memory, on disk, or both.
- Size: Assess the size of cached data.
- Job Analysis: Investigate jobs that dominate the timeline or have unusually long durations. Look for gaps caused by complex execution plans, non-Spark code, driver overload, or cluster malfunction.
3.2. Executor Tab
- Storage Memory: Compare used vs. available memory.
- Task Time (Garbage Collection): Review long tasks and garbage collection times.
- Shuffle Read/Write: Measure data transferred between stages.
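To corroborate what the Stages tab shows, you can measure skew directly. A minimal sketch, where df stands in for whatever DataFrame you are investigating and customer_id is a hypothetical repartitioning key:

```python
# Sketch: count rows per Spark partition to quantify skew. A few partitions
# holding most of the rows matches the uneven task durations seen in the
# Stages tab.
from pyspark.sql import functions as F

rows_per_partition = (
    df.groupBy(F.spark_partition_id().alias("partition_id"))
      .count()
      .orderBy(F.desc("count"))
)
rows_per_partition.show(10)

# If a handful of partitions dominate, repartition on a well-distributed key:
# df = df.repartition(200, "customer_id")  # hypothetical column
```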
Phase 2: Cluster/Code/Data Best Practices Alignment
Cluster UI Configuration and Cost Attribution
Effectively configuring clusters and workloads in Databricks is essential for balancing performance, scalability, and cost. Tuning settings and features strategically can help organizations maximize resource efficiency and minimize unnecessary spending.
Key Configuration Strategies
1. Reduce Idle Time: Clusters continue to incur costs even when not actively processing workloads. To avoid paying for unused resources:
   - Enable Auto-Terminate: Set clusters to shut down automatically after a period of inactivity. This simple setting can significantly reduce wasted spend.
2. Enable Autoscaling: Workloads fluctuate in size and complexity. Autoscaling allows clusters to dynamically adjust the number of nodes based on demand:
   - Automatic Resource Adjustment: Scale up for heavy jobs and scale down for lighter loads, ensuring you only pay for what you use.
3. Use Spot Instances: For batch processing and non-critical workloads, spot instances offer substantial cost savings:
   - Lower VM Costs: Spot instances are typically much cheaper than standard VMs. However, they are not recommended for jobs requiring constant uptime due to potential interruptions.
4. Leverage the Photon Engine: Photon is Databricks’ high-performance, vectorized query engine:
   - Accelerate Large Workloads: Photon can dramatically reduce runtime for compute-intensive tasks, improving both speed and cost efficiency.
5. Keep Runtimes Up to Date: Using the latest Databricks runtime ensures optimal performance and security:
   - Benefit from Improvements: Regular updates include performance enhancements, bug fixes, and new features.
6. Apply Cluster Policies: Cluster policies help standardize configurations and enforce cost controls across teams:
   - Governance and Consistency: Policies can restrict certain settings, enforce tagging, and ensure clusters are created with cost-effective defaults.
7. Optimize Storage: Storage type impacts both performance and cost:
   - Switch from HDDs to SSDs: SSDs provide faster caching and shuffle operations, which can improve job efficiency and reduce runtime.
8. Tag Clusters for Cost Attribution: Tagging clusters enables granular tracking and reporting:
   - Visibility and Accountability: Use tags to attribute costs to specific teams, projects, or environments, supporting better budgeting and chargeback processes.
9. Select the Right Cluster Type: Different workloads require different cluster types:
   - Job Clusters: Ideal for scheduled jobs and Delta Live Tables.
   - All-Purpose Clusters: Suited for ad-hoc analysis and collaborative work.
   - Single-Node Clusters: Efficient for simple exploratory data analysis or pure Python tasks.
   - Serverless Compute: Scalable, managed workloads with automatic resource management.
10. Monitor and Adjust Regularly: Regularly review cluster metrics and query history:
    - Continuous Optimization: Use built-in dashboards to monitor usage, identify bottlenecks, and adjust cluster size or configuration as needed.

Several of these settings come together in the cluster spec sketch below.
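The sketch expresses a cost-conscious cluster as a Clusters API payload. The field names follow the public API; the cluster name, runtime version, node type, tags, and policy ID are placeholders for your environment.

```python
# Sketch: cluster spec combining auto-termination, autoscaling, Photon,
# policies, tags, and spot instances. Values are placeholders.
cluster_spec = {
    "cluster_name": "etl-nightly",                    # hypothetical name
    "spark_version": "15.4.x-scala2.12",              # example LTS runtime
    "node_type_id": "Standard_DS3_v2",                # choose per workload
    "autoscale": {"min_workers": 2, "max_workers": 8},  # scale with demand
    "autotermination_minutes": 30,                    # cut idle spend
    "runtime_engine": "PHOTON",                       # enable Photon
    "policy_id": "<your-cluster-policy-id>",          # enforce governance
    "custom_tags": {                                  # cost attribution
        "team": "data-eng",
        "project": "nightly-etl",
        "environment": "prod",
    },
    "azure_attributes": {                             # spot for non-critical jobs
        "availability": "SPOT_WITH_FALLBACK_AZURE",
    },
}
```

The same fields can be supplied through the Clusters REST API, the Databricks SDK, Terraform, or the cluster UI; on AWS or GCP, replace azure_attributes with the corresponding cloud attributes block.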
Code Best Practices
- Avoid Reprocessing Large Tables: Use a CDC (Change Data Capture) architecture with Delta Live Tables (DLT) to process only new or changed data, minimizing unnecessary computation.
- Ensure Code Parallelizes Well: Write Spark code that leverages parallel processing. Avoid loops, deeply nested structures, and inefficient user-defined functions (UDFs) that can hinder scalability (see the sketch after this list).
- Reduce Memory Consumption: Tweak Spark configurations to minimize memory overhead. Clean out legacy or unnecessary settings that may have carried over from previous Spark versions.
- Prefer SQL Over Complex Python: Use declarative SQL for Spark jobs whenever possible. SQL queries are typically more efficient and easier to optimize than complex Python logic.
- Modularize Notebooks: Use %run to split large notebooks into smaller, reusable modules. This improves maintainability.
- Use LIMIT in Exploratory Queries: When exploring data, always use the LIMIT clause to avoid scanning large datasets unnecessarily.
- Monitor Job Performance: Regularly review the Spark UI to detect inefficiencies such as high shuffle, input, or output. Optimize join strategies and data layout accordingly.
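To make the parallelization and SQL-over-Python points concrete, the sketch below contrasts a row-wise Python UDF with the equivalent built-in expression. The toy orders DataFrame and the sales.orders table are hypothetical.

```python
# Sketch: the same transformation as a row-wise Python UDF versus Spark
# built-ins. Built-ins stay inside the optimizer (and Photon) and
# parallelize far better.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Toy data standing in for a real table.
orders = spark.createDataFrame([(100.0, 0.10), (250.0, 0.25)],
                               ["amount", "discount"])

# Slower: opaque Python UDF, rows serialized out to a Python worker.
@F.udf(DoubleType())
def net_amount_udf(amount, discount):
    return float(amount) * (1.0 - float(discount))

slow = orders.withColumn("net", net_amount_udf("amount", "discount"))

# Faster: declarative expression the engine can optimize end to end.
fast = orders.withColumn("net", F.col("amount") * (1 - F.col("discount")))

# Exploration: cap the scan with LIMIT instead of reading the full table.
spark.sql("SELECT * FROM sales.orders LIMIT 100")  # hypothetical table
```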
Databricks Code Performance Enhancements & Data Engineering Best Practices
By enabling the features below and applying these best practices, you can significantly lower costs, accelerate job execution, and build Databricks pipelines that are both scalable and highly reliable. For more guidance, review: Comprehensive Guide to Optimize Data Workloads | Databricks.
| Feature / Technique | Purpose / Benefit | How to Use / Enable / Key Notes |
| --- | --- | --- |
| Disk Cache | Accelerates repeated reads of Parquet files | Set spark.databricks.io.cache.enabled = true |
| Data Skipping | Skips irrelevant data files during queries, improves query performance | Enabled by default in Databricks |
| Low Shuffle Merge | Reduces data rewriting during MERGE operations, less need to recalculate ZORDER | Use a Databricks runtime with the feature enabled |
| Adaptive Query Execution (AQE) | Dynamically optimizes query plans based on runtime statistics | Available in Spark 3.0+, enabled by default |
| Deletion Vectors | Efficient row removal/change without rewriting the entire Parquet file | Enable in workspace settings, use with Delta Lake |
| Materialized Views | Faster BI queries, reduced compute for frequently accessed data | Create in Databricks SQL |
| OPTIMIZE | Compacts Delta Lake files, improves query performance | Run regularly, combine with ZORDER on high-cardinality columns |
| ZORDER | Physically sorts/co-locates data by chosen columns for faster queries | Use with OPTIMIZE, select columns frequently used in filters/joins |
| Auto Compaction / Optimized Writes | Automatically compacts small files during writes | Enable the optimizeWrite and autoCompact table properties |
| Liquid Clustering | Simplifies data layout, replaces partitioning/ZORDER, flexible clustering keys | Recommended for new Delta tables, enables easy redefinition of clustering keys |
| Target File Size | Achieve optimal file size for performance and cost | Set the delta.targetFileSize table property |
| Broadcast Hash Join | Optimizes joins by broadcasting smaller tables | Adjust spark.sql.autoBroadcastJoinThreshold and spark.databricks.adaptive.autoBroadcastJoinThreshold |
| Shuffle Hash Join | Faster join alternative to sort-merge join | Prefer over sort-merge join when broadcasting isn’t possible; the Photon engine can help |
| Cost-Based Optimizer (CBO) | Improves query plans for complex joins | Enabled by default, collect column/table statistics with ANALYZE TABLE |
| Data Skew Handling | Handles uneven data distribution and excessive shuffle | Use AQE, set spark.sql.shuffle.partitions=auto, optimize partitioning |
| Data Explosion Management | Controls partition sizes after transformations (e.g., explode, join) | Adjust spark.sql.files.maxPartitionBytes, use repartition() after reads |
| MERGE (Upserts) | Efficient upserts and CDC (Change Data Capture) | Use the MERGE operation in Delta Lake, combine with a CDC architecture |
| VACUUM | Removes stale data files, maintains storage efficiency | Run VACUUM regularly based on transaction frequency |
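Several of these table-level features can be enabled with a few statements. A minimal sketch against a hypothetical sales.orders Delta table (table, column, and size values are placeholders to adapt):

```python
# Sketch: applying several features from the table above to a Delta table.

# Disk cache for repeated Parquet/Delta reads (classic compute).
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Compact small files and co-locate rows on a frequently filtered column.
spark.sql("OPTIMIZE sales.orders ZORDER BY (customer_id)")

# Opt the table into optimized writes, auto compaction, and a target file size.
spark.sql("""
    ALTER TABLE sales.orders SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true',
        'delta.targetFileSize'             = '128mb'
    )
""")

# Refresh statistics so the cost-based optimizer has current inputs.
spark.sql("ANALYZE TABLE sales.orders COMPUTE STATISTICS FOR ALL COLUMNS")

# Remove stale files older than the retention window (7 days here).
spark.sql("VACUUM sales.orders RETAIN 168 HOURS")
```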
Phase 3: Team Alignment and Next Steps
Implementing Cost Observability and Taking Action
Effective cost management in Databricks goes beyond configuration and code; it requires robust observability, granular tracking, and proactive measures. The following outlines how your teams can achieve this using system tables, tagging, dashboards, and actionable scripts.
1. Cost Observability with System Tables
Databricks Unity Catalog provides system tables that store operational data for your account. These tables enable historical cost observability and empower FinOps teams to analyze spend independently.
- System Tables Location: Found in Unity Catalog under the system catalog (e.g., system.billing, system.compute).
- Key Benefits: Structured data for querying, historical analysis, and cost attribution.
- Action: Assign permissions to FinOps teams so they can access and analyze the dedicated cost tables.
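A minimal sketch of such an analysis, assuming access to the system.billing.usage and system.billing.list_prices tables (the join keys and pricing struct may differ slightly by release, so verify against your account):

```python
# Sketch: estimated DBU spend by SKU for the current month.
spend_by_sku = spark.sql("""
    SELECT u.sku_name,
           SUM(u.usage_quantity)                      AS dbus,
           SUM(u.usage_quantity * lp.pricing.default) AS est_dollars
    FROM system.billing.usage u
    JOIN system.billing.list_prices lp
      ON u.sku_name = lp.sku_name
     AND u.usage_start_time >= lp.price_start_time
     AND (lp.price_end_time IS NULL OR u.usage_start_time < lp.price_end_time)
    WHERE u.usage_date >= date_trunc('month', current_date())
    GROUP BY u.sku_name
    ORDER BY est_dollars DESC
""")
display(spend_by_sku)
```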
2. Enable Tags for Granular Tracking
Tagging is a powerful feature for tracking, reporting, and budgeting at a granular level.
- Classic Compute: Manually add key/value pairs when creating clusters, jobs, SQL Warehouses, or Model Serving endpoints. Use cluster policies to enforce custom tags.
- Serverless Compute: Create budget policies and assign permissions to teams or members for serverless workloads.
- Action: Tag all compute resources to enable detailed cost attribution and reporting.
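Once tags are in place, attribution becomes a simple GROUP BY. A sketch assuming a hypothetical "team" tag key in the custom_tags map on system.billing.usage:

```python
# Sketch: DBU consumption by team over the last 30 days, via custom tags.
spend_by_team = spark.sql("""
    SELECT custom_tags['team'] AS team,      -- hypothetical tag key
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY custom_tags['team']
    ORDER BY dbus DESC
""")
display(spend_by_team)
```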
3. Track Costs with Dashboards and Alerts
Databricks offers prebuilt dashboards and queries for cost forecasting and usage analysis.
- Dashboards: Visualize spend, usage trends, and forecast future costs.
- Prebuilt Queries: Use the top queries with system tables to answer meaningful cost questions.
- Budget Alerts: Set up alerts in the Account Console (Usage > Budget) to receive notifications when spend approaches defined thresholds.
4. Review Bottlenecks and Optimization Opportunities
With observability in place, regularly review system tables, dashboards, and tagged resources to identify:
- Cost Bottlenecks: Clusters or jobs with unusually high spend.
- Optimization Opportunities: Underutilized resources, inefficient jobs, or misconfigured clusters.
- Team Alignment: Share insights with engineering and FinOps teams to drive collaborative optimization.
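A starting point for the bottleneck review is ranking jobs by DBU consumption. A sketch using the usage_metadata struct on system.billing.usage (field names may vary by release, so verify in your account):

```python
# Sketch: top 20 jobs by DBU consumption over the last 30 days.
top_jobs = spark.sql("""
    SELECT workspace_id,
           usage_metadata.job_id AS job_id,
           SUM(usage_quantity)   AS dbus
    FROM system.billing.usage
    WHERE usage_metadata.job_id IS NOT NULL
      AND usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY workspace_id, usage_metadata.job_id
    ORDER BY dbus DESC
    LIMIT 20
""")
display(top_jobs)
```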
Summary Table: Cost Observability & Action Steps
| Area | Best Practice / Action |
| --- | --- |
| System Tables | Use for historical cost analysis and attribution |
| Tagging | Apply to all compute resources for granular tracking |
| Dashboards | Visualize spend, usage, and forecasts |
| Alerts | Set budget alerts for proactive cost management |
| Scripts/Queries | Build custom analysis tools for deep insights |
| Cluster/Code/Data Review & Align | Regularly review best practices, share findings, and align teams on optimization |