Analytics on Azure Blog

Overload to Optimal: Tuning Microsoft Fabric Capacity

Rafia_Aqil, Microsoft
Oct 28, 2025

Co-Authored by: Daya Ram, Sr. Cloud Solutions Architect 

 

Optimizing Microsoft Fabric capacity is both a performance and cost exercise. By diagnosing workloads, tuning cluster and Spark settings, and applying data best practices, teams can reduce run times, avoid throttling, and lower total cost of ownership—without compromising SLAs. Use Fabric’s built-in observability (Monitoring Hub, Capacity Metrics, Spark UI) to identify hot spots and then apply cluster- and data-level remediations. For capacity planning and sizing guidance, see Plan your capacity size. 

 

Section 1: Options to Diagnose Capacity Issues 

1) Monitoring Hub — Start with the Story of the Run 

What to use it for: Browse Spark activity across applications (notebooks, Spark Job Definitions, and pipelines). Quickly surface long‑running or anomalous runs; view read/write bytes, idle time, core allocation, and utilization.  

How to use it 

    • From the Fabric portal, open Monitoring (Monitor Hub). 
    • Select a Notebook or Spark Job Definition and choose Historical Runs. 
    • Inspect the Run Duration chart; click on a run to see read/write bytes, idle time, core allocation, overall utilization, and other Spark metrics.  

What to look for 

    • Runs that are unusually long or anomalous compared to prior executions of the same item. 
    • High idle time or low overall utilization relative to the cores allocated. 

2) Capacity Metrics App — Measure the Whole Environment 

What to use it for: Review capacity-wide utilization and system events (overloads, queueing); compare utilization across time windows and identify sustained peaks.  

How to use it 

    • Open the Microsoft Fabric Capacity Metrics app for your capacity. 
    • Review the Compute page (ribbon charts, utilization trends) and the System events tab to see overload or throttling windows. 
    • Use the Timepoint page to drill into a 30‑second interval and see which operations consumed the most compute.  

What to look for 

    • Sustained peaks, overload or throttling windows, and queueing events. 
    • Use the troubleshooting guide Monitor and identify capacity usage to pinpoint the top CU‑consuming items. 

3) Spark UI — Diagnose at a Deeper Level 

Why it matters: Spark UI exposes skew, shuffle, memory pressure, and long stages. Use it after Monitoring Hub/Capacity Metrics to pinpoint the problematic job.  

Key tabs to inspect 

    • Stages: uneven task durations (data skew), heavy shuffle read/write, large input/output volumes. 
    • Executors: storage memory, task time (GC), shuffle metrics. High GC or frequent spills indicate memory tuning is needed. 
    • Storage: which RDDs/cached tables occupy memory; any disk spill. 
    • Jobs: long‑running jobs and gaps in the timeline (driver compilation, non‑Spark code, driver overload).  

What to look for 

    • Data skew, memory pressure, and heavy shuffle read/write. Adjust the relevant Apache Spark settings (e.g., spark.ms.autotune.enabled, spark.task.cpus, and spark.sql.shuffle.partitions), set via environment Spark properties or session config; see the sketch below. 
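
Where each setting lives depends on its scope: spark.sql.shuffle.partitions can be changed mid‑session, while spark.task.cpus must be in place before the session starts. A minimal sketch for a Fabric notebook (where spark is the built‑in SparkSession); the values are illustrative assumptions, not recommendations:

    # Session-level tuning in a Fabric notebook ("spark" is predefined).
    # Derive actual values from what Spark UI shows for your workload.
    spark.conf.set("spark.sql.shuffle.partitions", "400")  # spread heavy or skewed shuffles across more tasks

    # spark.task.cpus is a static setting: set it in the environment's Spark
    # properties before session start rather than via spark.conf.set here.

    print(spark.conf.get("spark.sql.shuffle.partitions"))  # confirm the session picked it up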

 

Section 2: Remediation and Optimization Suggestions 

A) Cluster & Workspace Settings
Runtime & Native Execution Engine (NEE) 
    • Use Fabric Runtime 1.3 (Spark 3.5, Delta 3.2) and enable the Native Execution Engine to boost performance; enable at the environment level under Spark compute → Acceleration.  
Starter Pools vs. Custom Pools 
    • Starter Pool: prehydrated, medium‑size pools; fast session starts, good for dev/quick runs.  
    • Custom Pools: size nodes, enable autoscale, dynamic executors. Create via workspace Spark Settings (requires capacity admin to enable workspace customization).  
High Concurrency Session Sharing 
    • Enable High Concurrency to share Spark Sessions across notebooks (and pipelines) to reduce session startup latency and cost; use session tags in pipelines to group notebooks.  
Autotune for Spark 

Enable Autotune (spark.ms.autotune.enabled = true) to auto‑adjust per‑query:  

    • spark.sql.shuffle.partitions 
    • spark.sql.autoBroadcastJoinThreshold 
    • spark.sql.files.maxPartitionBytes 

Autotune is disabled by default and is in preview; enable per environment or session.  
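
A minimal sketch of the per‑session enablement (per‑environment enablement goes through the environment's Spark properties instead); the property names come from the list above:

    # Enable Autotune (preview) for this session only.
    spark.conf.set("spark.ms.autotune.enabled", "true")

    # Autotune then adjusts spark.sql.shuffle.partitions,
    # spark.sql.autoBroadcastJoinThreshold, and
    # spark.sql.files.maxPartitionBytes per query.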

B) Data‑level best practices

Microsoft Fabric offers several approaches to maintaining optimal file sizes in Delta tables; review the documentation: Table Compaction - Microsoft Fabric. Two write‑time toggles are sketched below.
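
As one example of those approaches, Delta's write‑time tuning can be switched on per session. A hedged sketch, assuming the standard Delta property names for optimized write and auto compaction (check the linked documentation for your runtime's defaults):

    # Write-time file-size hygiene for Delta tables (session scope).
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")  # fewer, larger files per write
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")    # compact small files after writes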

 

Intelligent Cache 

    • Enabled by default (Runtime 1.1/1.2) for Spark pools: caches frequently read files at node level for Delta/Parquet/CSV; improves subsequent read performance and TCO.  
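
Because it is on by default, the usual reason to touch the Intelligent Cache is benchmarking. A minimal sketch, assuming the spark.synapse.vegas.useCache property documented for Fabric Spark:

    # Toggle the Intelligent Cache per session, e.g., for an A/B timing comparison.
    spark.conf.set("spark.synapse.vegas.useCache", "false")  # disable for the test run
    # ... run the workload and record timings ...
    spark.conf.set("spark.synapse.vegas.useCache", "true")   # restore the default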

OPTIMIZE & Z‑Order 

    • Run OPTIMIZE regularly to compact small files and improve file layout; add ZORDER BY on frequently filtered columns to improve data skipping, as in the sketch below.
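
A minimal sketch from a notebook; sales and customer_id are hypothetical placeholders for a Lakehouse table and a commonly filtered column:

    # Compact small files and co-locate rows sharing customer_id values,
    # which improves data skipping for filters on that column.
    spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")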

V‑Order 

    • V‑Order (disabled by default in new workspaces) can accelerate reads for read‑heavy workloads; enable via spark.sql.parquet.vorder.default = true, as sketched below.  
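
A one‑line sketch of that session‑level enablement; note it only affects files written after the setting is applied:

    # Enable V-Order writes for this session (new workspaces default it off).
    spark.conf.set("spark.sql.parquet.vorder.default", "true")
    # Existing files pick up V-Order only when rewritten, e.g., by OPTIMIZE.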

Vacuum  

    • Run VACUUM to remove unreferenced files (stale data); the default retention is 7 days. Align retention across OneLake to control storage costs and maintain time travel; see the sketch below.  
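
A minimal sketch against the same hypothetical sales table; 168 hours restates the default 7‑day retention, so shorten it only with an agreed time‑travel window:

    # Remove files that are unreferenced by the Delta log and older than retention.
    spark.sql("VACUUM sales RETAIN 168 HOURS")  # 168 hours = the 7-day default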

 

Collaboration & Next Steps 

Engage Data Engineering Team to Define an Optimization Playbook 
    • Start with reviewing capacity sizing guidance, cluster‑level optimizations (runtime/NEE, pools, concurrency, Autotune) and then target data improvements (Z‑order, compaction, caching, query refactors).  
    • Operationalize maintenance: schedule OPTIMIZE (full or selective) during off‑peak windows; enable Auto Compaction for micro‑batch/streaming writes; add VACUUM to your cadence with an agreed retention.
    • Add regular code review sessions to ensure consistent performance patterns.
    • Verify: re‑run the job and measure the change, e.g., reduced run time, lower shuffle volume, improved utilization.  

 

 
