This guide provides a detailed walkthrough on using the Spark UI to diagnose performance issues in Spark jobs. It covers understanding job composition, navigating the Spark UI, interpreting job timelines, and troubleshooting common problems such as failing jobs, executor issues, memory problems, and identifying performance bottlenecks.
Agenda
- Introduction
- Overview of Spark UI
- Navigating to Spark UI
- Jobs Timeline
- Opening Jobs Timeline
- Reading Event Timeline
- Failing Jobs or Executors
- Diagnosing Failing Jobs
- Diagnosing Failing Executors
- Scenario - Memory Issues
- Scenario - Long Running Jobs
- Scenario - Identifying Longest Stage
Introduction
- Goal: diagnose performance issues in your jobs using the Spark UI
- This guide walks you through the Spark UI views and workflows used to do that
Overview of Spark UI
- Job Composition
- A job is composed of multiple stages
- Each stage may contain more than one task
- Task Breakdown
- Tasks are distributed across the cluster’s executors, which run them in parallel (see the sketch below)
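A minimal sketch of how these pieces map onto code, assuming a running Spark session (on Databricks, `spark` already exists and getOrCreate() simply reuses it). One action triggers one job; the shuffle introduced by groupBy splits that job into two stages, and each stage runs as many parallel tasks on the executors.

```python
# One action (count) -> one job in the Jobs tab.
# The groupBy shuffle creates a stage boundary, so the job has two stages;
# each stage is executed as many tasks distributed across the executors.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 10_000_000)                               # narrow work: no shuffle yet
grouped = df.groupBy((df.id % 100).alias("bucket")).count()   # shuffle -> new stage
grouped.count()                                               # the action that appears as a job
```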
Navigating to Spark UI: Navigating to Cluster's Page
- Navigate to your cluster’s page:
Navigating to Spark UI: Clicking Spark UI
- Click Spark UI:
Jobs Timeline
- Jobs timeline
- The jobs timeline is a great starting point for understanding your pipeline or query. It gives you an overview of what was running, how long each step took, and whether there were any failures along the way
Opening Jobs Timeline
- Accessing the Jobs Timeline
- Navigate to the Spark UI
- Click on the Jobs tab
- Viewing the Event Timeline
- Click on Event Timeline
- Highlighted in red in the screenshot
- Example Timeline
- Shows driver and executor 0 being added
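An optional sketch for making the timeline easier to read: setJobDescription is a standard SparkContext call that labels the jobs triggered by a piece of work, so they are easier to spot in the event timeline and Jobs table. The label text and table names below are illustrative only.

```python
# Label the jobs triggered by this work so they stand out in the Jobs timeline.
# setJobDescription is a standard SparkContext method; "sales" and the output
# table name are placeholders, not part of this guide's environment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sparkContext.setJobDescription("daily sales aggregation")   # shown in the Jobs tab
daily = spark.read.table("sales").groupBy("day").count()
daily.write.mode("overwrite").saveAsTable("daily_sales_counts")
```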
Failing Jobs or Executors: Example of Failed Job
- Failed Job Example
- Indicated by a red status
- Shown in the event timeline
- Removed Executors
- Also indicated by a red status
- Shown in the event timeline
Failing Jobs or Executors: Common Reasons for Executors Being Removed
- Autoscaling
- Expected behavior, not an error
- See Enable autoscaling in the Compute configuration reference (Azure Databricks | Microsoft Learn) for more details
- Spot instance losses
- Cloud provider reclaiming your VMs
- Learn more about Spot instances here
- Executors running out of memory
Diagnosing Failing Jobs: Steps to Diagnose Failing Jobs
- Identifying Failing Jobs
- Click on the failing job to access its page
- Reviewing Failure Details
- Scroll down to see the failed stage
- Check the failure reason
Diagnosing Failing Jobs: Generic Errors
You may only get a generic error. Click the link in the failure description to see if you can get more information:
Diagnosing Failing Jobs: Memory Issues
- Task Failure Explanation
- Scroll down the page to see why each task failed
- Memory issue identified as the cause
Scenario – Spot Instances and Autoscaling
Diagnosing Failing Executors: Checking Event Log
- Check Event Log
- Identify any explanations for executor failures
- Spot Instances
- Cloud provider may reclaim spot instances
Diagnosing Failing Executors: Navigating to Executors Tab
- Check Event Log for Executor Loss
- Look for messages indicating cluster resizing or spot instance loss
- Navigate to Spark UI
- Click on the Executors tab
Diagnosing Failing Executors: Getting Logs from Failed Executors
- Here you can get the logs from the failed executors:
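The same executor information is also exposed by Spark's monitoring REST API, which can be handy for scripted checks. A hedged sketch: the /applications and /allexecutors endpoints are part of the standard API, but the localhost:4040 base URL is an assumption and will differ on Databricks, where the Spark UI is proxied behind the workspace.

```python
# Pull executor status (including removed executors and the remove reason)
# from Spark's monitoring REST API instead of the Executors tab.
import requests

base = "http://localhost:4040/api/v1"   # assumption: UI reachable locally
app_id = requests.get(f"{base}/applications").json()[0]["id"]
executors = requests.get(f"{base}/applications/{app_id}/allexecutors").json()

for ex in executors:
    status = "active" if ex.get("isActive", True) else "removed"
    print(ex["id"], status, ex.get("removeReason", ""))
```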
Scenario - Memory Issues
- Memory Issues
- Common cause of problems
- Requires thorough investigation
- Quality of Code
- Potential source of memory issues
- Should be checked for efficiency (a common pitfall is sketched below)
- Data Quality
- Can affect memory usage
- Must be organized correctly
- Spark memory issues - Azure Databricks | Microsoft Learn
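As a concrete example of the code-quality point above, here is a minimal sketch of a common driver-memory pitfall. The table name "events" is illustrative only.

```python
# Pulling a large DataFrame onto the driver with collect() or toPandas() can
# exhaust driver memory; keeping the aggregation on the cluster and collecting
# only the small result avoids that.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.read.table("events")        # placeholder table name

# Risky on large data: materialises every row on the driver.
# all_rows = events.collect()

# Safer: aggregate on the executors, then bring back only the small summary.
summary = events.groupBy("event_type").count()
summary.show()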
Identifying Longest Stage
- Identify the longest stage of the job
- Scroll to the bottom of the job’s page
- Locate the list of stages
- Order the stages by duration
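A hedged, programmatic alternative to sorting the stage table by hand: the standard monitoring REST API can list completed stages, which you can then order by total task run time. The localhost:4040 base URL is an assumption (see the earlier note), and executorRunTime is the sum of task run times, not the stage's wall-clock duration.

```python
# List completed stages and print the five with the most total task time.
import requests

base = "http://localhost:4040/api/v1"
app_id = requests.get(f"{base}/applications").json()[0]["id"]
stages = requests.get(
    f"{base}/applications/{app_id}/stages", params={"status": "complete"}
).json()

for s in sorted(stages, key=lambda s: s.get("executorRunTime", 0), reverse=True)[:5]:
    print(s["stageId"], s.get("name", ""), s.get("executorRunTime", 0), "ms total task time")
```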
Stage I/O Details
- High-Level Data Overview
- Input
- Output
- Shuffle Read
- Shuffle Write
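When the longest stage shows very large Shuffle Read/Write and the shuffle comes from a join with a small lookup table, one common remedy is broadcasting the small side so the shuffle is avoided. A hedged sketch; table and column names are illustrative only.

```python
# Broadcast the small dimension table so the large fact table is not shuffled.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.read.table("sales")      # large fact table (placeholder name)
stores = spark.read.table("stores")    # small lookup table (placeholder name)

joined = sales.join(F.broadcast(stores), "store_id")
joined.groupBy("region").agg(F.sum("amount")).show()
```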
Number of Tasks in Long Stage
- Identifying the number of tasks
- Helps in pinpointing the issue
- The task count is shown in the stage’s row of the stage list (highlighted in the screenshot); the sketch below covers what typically controls it
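As background for interpreting the task count: for shuffle stages in Spark SQL, the number of tasks is typically governed by spark.sql.shuffle.partitions (default 200), although adaptive query execution may coalesce them into fewer tasks. A hedged sketch of inspecting and adjusting the setting; the value shown is an example, not a recommendation.

```python
# Inspect and adjust the shuffle partition count, which usually determines
# how many tasks a shuffle stage gets. Very few tasks in a long stage often
# points at a partitioning problem rather than slow code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "400")   # example value; tune for your data volume
```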
Investigating Stage Details
- Investigate Further if Multiple Tasks
- Check whether the stage has more than one task
- If so, click on the link in the stage’s description
- Get More Info About the Longest Stage
- Click on the link provided to open the stage’s detail page
- Gather detailed task-level information there
Conclusion
- Potential Data Skew Issues
- Data skew can impact performance
- Skew means data is unevenly distributed across partitions, so a few tasks end up doing most of the work (see the sketch below)
- Potential Spill Issues
- Spill happens when data no longer fits in a task’s memory and is written to disk, which slows the stage down
- Reducing skew and spill is key to optimal performance
- Learn More
- See Skew and spill - Azure Databricks | Microsoft Learn
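A quick, hedged way to check for skew before (or alongside) reading the stage's task-level metrics: if a handful of key values account for most of the rows, any join or aggregation on that key will be skewed. Table and column names below are illustrative only.

```python
# Count rows per key and look for a few dominant values, which usually signals skew.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("events")        # placeholder table name

(df.groupBy("customer_id")             # placeholder key column
   .count()
   .orderBy(F.desc("count"))
   .show(10))
```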