Analytics on Azure Blog

Overload to Optimal: Tuning Microsoft Fabric Capacity

Rafia_Aqil, Microsoft
Oct 28, 2025

Co-Authored by: Daya Ram, Sr. Cloud Solutions Architect 

 

Optimizing Microsoft Fabric capacity is both a performance and cost exercise. By diagnosing workloads, tuning cluster and Spark settings, and applying data best practices, teams can reduce run times, avoid throttling, and lower total cost of ownership—without compromising SLAs. Use Fabric’s built-in observability (Monitoring Hub, Capacity Metrics, Spark UI) to identify hot spots and then apply cluster- and data-level remediations. For capacity planning and sizing guidance, see Plan your capacity size. 

 

Section 1: Options to Diagnose Capacity Issues 

1) Monitoring Hub — Start with the Story of the Run 

What to use it for: Browse Spark activity across applications (notebooks, Spark Job Definitions, and pipelines). Quickly surface long‑running or anomalous runs; view read/write bytes, idle time, core allocation, and utilization.  

How to use it 

    • From the Fabric portal, open Monitoring (Monitor Hub). 
    • Select a Notebook or Spark Job Definition and choose Historical Runs. 
    • Inspect the Run Duration chart; click on a run to see read/write bytes, idle time, core allocation, overall utilization, and other Spark metrics.  

What to look for 

    • Runs that are unusually long or anomalous compared to prior executions of the same item. 
    • High idle time or low overall utilization relative to the cores allocated. 

2) Capacity Metrics App — Measure the Whole Environment 

What to use it for: Review capacity-wide utilization and system events (overloads, queueing); compare utilization across time windows and identify sustained peaks.  

How to use it 

    • Open the Microsoft Fabric Capacity Metrics app for your capacity. 
    • Review the Compute page (ribbon charts, utilization trends) and the System events tab to see overload or throttling windows. 
    • Use the Timepoint page to drill into a 30‑second interval and see which operations consumed the most compute.  

What to look for 

    • Sustained peaks, overload or throttling windows, and queueing events. 
    • Use the troubleshooting guide Monitor and identify capacity usage to pinpoint the top CU‑consuming items. 

3) Spark UI — Diagnose at a Deeper Level 

Why it matters: Spark UI exposes skew, shuffle, memory pressure, and long stages. Use it after Monitoring Hub/Capacity Metrics to pinpoint the problematic job.  

Key tabs to inspect 

    • Stages: uneven task durations (data skew), heavy shuffle read/write, large input/output volumes. 
    • Executors: storage memory, task time (GC), shuffle metrics. High GC or frequent spills indicate memory tuning is needed. 
    • Storage: which RDDs/cached tables occupy memory; any disk spill. 
    • Jobs: long‑running jobs and gaps in the timeline (driver compilation, non‑Spark code, driver overload).  

What to look for 

    • Data skew, memory pressure, and heavy shuffle read/write. Adjust the relevant Apache Spark settings (e.g., spark.ms.autotune.enabled, spark.task.cpus, and spark.sql.shuffle.partitions), set via environment Spark properties or session config; see the sketch below. 
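
Where each setting lives depends on its scope: spark.sql.shuffle.partitions can be changed mid‑session, while spark.task.cpus must be in place before the session starts. A minimal sketch for a Fabric notebook (where spark is the built‑in SparkSession); the values are illustrative assumptions, not recommendations:

    # Session-level tuning in a Fabric notebook ("spark" is predefined).
    # Derive actual values from what Spark UI shows for your workload.
    spark.conf.set("spark.sql.shuffle.partitions", "400")  # spread heavy or skewed shuffles across more tasks

    # spark.task.cpus is a static setting: set it in the environment's Spark
    # properties before session start rather than via spark.conf.set here.

    print(spark.conf.get("spark.sql.shuffle.partitions"))  # confirm the session picked it up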

 

Section 2: Remediation and Optimization Suggestions 

A) Cluster & Workspace Settings
Runtime & Native Execution Engine (NEE) 
    • Use Fabric Runtime 1.3 (Spark 3.5, Delta 3.2) and enable the Native Execution Engine to boost performance; enable at the environment level under Spark compute → Acceleration.  
Starter Pools vs. Custom Pools 
    • Starter Pool: prehydrated, medium‑size pools; fast session starts, good for dev/quick runs.  
    • Custom Pools: size nodes, enable autoscale, dynamic executors. Create via workspace Spark Settings (requires capacity admin to enable workspace customization).  
High Concurrency Session Sharing 
    • Enable High Concurrency to share Spark Sessions across notebooks (and pipelines) to reduce session startup latency and cost; use session tags in pipelines to group notebooks.  
Autotune for Spark 

Enable Autotune (spark.ms.autotune.enabled = true) to auto‑adjust per‑query:  

    • spark.sql.shuffle.partitions 
    • spark.sql.autoBroadcastJoinThreshold 
    • spark.sql.files.maxPartitionBytes 

Autotune is disabled by default and is in preview; enable per environment or session.  
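
A minimal sketch of the per‑session enablement (per‑environment enablement goes through the environment's Spark properties instead); the property names come from the list above:

    # Enable Autotune (preview) for this session only.
    spark.conf.set("spark.ms.autotune.enabled", "true")

    # Autotune then adjusts spark.sql.shuffle.partitions,
    # spark.sql.autoBroadcastJoinThreshold, and
    # spark.sql.files.maxPartitionBytes per query.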

B) Data‑level best practices

Microsoft Fabric offers several approaches to maintaining optimal file sizes in Delta tables; review the documentation: Table Compaction - Microsoft Fabric. Two write‑time toggles are sketched below.
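
As one example of those approaches, Delta's write‑time tuning can be switched on per session. A hedged sketch, assuming the standard Delta property names for optimized write and auto compaction (check the linked documentation for your runtime's defaults):

    # Write-time file-size hygiene for Delta tables (session scope).
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")  # fewer, larger files per write
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")    # compact small files after writes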

 

Intelligent Cache 

    • Enabled by default (Runtime 1.1/1.2) for Spark pools: caches frequently read files at node level for Delta/Parquet/CSV; improves subsequent read performance and TCO.  
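
Because it is on by default, the usual reason to touch the Intelligent Cache is benchmarking. A minimal sketch, assuming the spark.synapse.vegas.useCache property documented for Fabric Spark:

    # Toggle the Intelligent Cache per session, e.g., for an A/B timing comparison.
    spark.conf.set("spark.synapse.vegas.useCache", "false")  # disable for the test run
    # ... run the workload and record timings ...
    spark.conf.set("spark.synapse.vegas.useCache", "true")   # restore the default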

OPTIMIZE & Z‑Order 

    • Run OPTIMIZE regularly to compact small files and improve file layout; add ZORDER BY on frequently filtered columns to improve data skipping, as in the sketch below.
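
A minimal sketch from a notebook; sales and customer_id are hypothetical placeholders for a Lakehouse table and a commonly filtered column:

    # Compact small files and co-locate rows sharing customer_id values,
    # which improves data skipping for filters on that column.
    spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")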

V‑Order 

    • V‑Order (disabled by default in new workspaces) can accelerate reads for read‑heavy workloads; enable via spark.sql.parquet.vorder.default = true, as sketched below.  
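
A one‑line sketch of that session‑level enablement; note it only affects files written after the setting is applied:

    # Enable V-Order writes for this session (new workspaces default it off).
    spark.conf.set("spark.sql.parquet.vorder.default", "true")
    # Existing files pick up V-Order only when rewritten, e.g., by OPTIMIZE.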

Vacuum  

    • Run VACUUM to remove unreferenced files (stale data); the default retention is 7 days. Align retention across OneLake to control storage costs and maintain time travel; see the sketch below.  
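
A minimal sketch against the same hypothetical sales table; 168 hours restates the default 7‑day retention, so shorten it only with an agreed time‑travel window:

    # Remove files that are unreferenced by the Delta log and older than retention.
    spark.sql("VACUUM sales RETAIN 168 HOURS")  # 168 hours = the 7-day default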

 

Collaboration & Next Steps 

Engage Data Engineering Team to Define an Optimization Playbook 
    • Start with reviewing capacity sizing guidance, cluster‑level optimizations (runtime/NEE, pools, concurrency, Autotune) and then target data improvements (Z‑order, compaction, caching, query refactors).  
    • Operationalize maintenance: schedule OPTIMIZE (full or selective) during off‑peak windows; enable Auto Compaction for micro‑batch/streaming writes; add VACUUM to your cadence with an agreed retention.
    • Add regular code review sessions to ensure consistent performance patterns.
    • Verify: re‑run the job and measure the change, e.g., reduced run time, lower shuffle volume, improved utilization.  

 

 
