Microsoft Mission Critical Blog

Diagnose performance issues in Spark jobs through the Spark UI.

PraveenPentareddy
Oct 13, 2025

This guide provides a detailed walkthrough on using the Spark UI to diagnose performance issues in Spark jobs. It covers understanding job composition, navigating the Spark UI, interpreting job timelines, and troubleshooting common problems such as failing jobs, executor issues, memory problems, and identifying performance bottlenecks.

Agenda

  • Introduction
  • Overview of Spark UI
  • Navigating to Spark UI
  • Jobs Timeline
  • Opening Jobs Timeline
  • Reading Event Timeline
  • Failing Jobs or Executors
  • Diagnosing Failing Jobs
  • Diagnosing Failing Executors
  • Scenario - Memory Issues
  • Scenario - Long Running Jobs
  • Scenario - Identifying Longest Stage

 

Introduction

  • This guide walks you through using the Spark UI to diagnose performance issues in your Spark jobs.

Overview of Spark UI

  • Job Composition
      • A Spark job is composed of multiple stages
      • Each stage may contain one or more tasks
  • Task Breakdown
      • Tasks are distributed across executors, which run them in parallel

 

Navigating to Spark UI: Navigating to Cluster's Page

  • Navigate to your cluster’s page:

Navigating to Spark UI: Clicking Spark UI

  • Click Spark UI:

Jobs Timeline

  • The jobs timeline is a great starting point for understanding your pipeline or query. It gives you an overview of what was running, how long each step took, and whether there were any failures along the way.

Opening Jobs Timeline

 

  • Accessing the Jobs Timeline
      • Navigate to the Spark UI
      • Click on the Jobs tab
  • Viewing the Event Timeline
      • Click on Event Timeline (highlighted in red in the screenshot)
  • Example Timeline
      • Shows the driver and executor 0 being added

Failing Jobs or Executors: Example of Failed Job

  • Failed Job Example
      • Indicated by a red status in the event timeline
  • Removed Executors
      • Also indicated by a red status in the event timeline

Failing Jobs or Executors: Common Reasons for Executors Being Removed

  • Spot instances being reclaimed by the cloud provider
  • Auto-scaling removing executors as the cluster resizes down
  • Executors failing due to memory issues (e.g., OutOfMemoryError)

Diagnosing Failing Jobs: Steps to Diagnose Failing Jobs

  • Identifying Failing Jobs
      • Click on the failing job to access its page
  • Reviewing Failure Details
      • Scroll down to see the failed stage
      • Check the failure reason
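
Beyond clicking through the UI, the same failure details can also be pulled programmatically. A minimal sketch, assuming Spark's REST monitoring API (the `/api/v1/applications/<app-id>/stages` endpoint); the field names follow that API, but the sample response below is illustrative, not real output:

```python
# Sketch: listing failed stages and their failure reasons from the JSON
# returned by Spark's REST monitoring API. The sample data is illustrative.

def failed_stages(stages):
    """Return (stageId, failureReason) for every failed stage."""
    return [
        (s["stageId"], s.get("failureReason", "unknown"))
        for s in stages
        if s["status"] == "FAILED"
    ]

# Illustrative sample of what the stages endpoint might return:
sample = [
    {"stageId": 3, "status": "COMPLETE"},
    {"stageId": 4, "status": "FAILED",
     "failureReason": "Job aborted due to stage failure: OutOfMemoryError"},
]

print(failed_stages(sample))
```

This is handy when a job has many stages and scrolling through the UI is slow.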

 

 

 

Diagnosing Failing Jobs: Generic Errors

 

You may get a generic error. Click on the link in the description to see if you can get more information:

 

 

 

Diagnosing Failing Jobs: Memory Issues

 

  • Task Failure Explanation
      • Scroll down the page to see why each task failed
      • In this example, a memory issue is identified as the cause
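
The failure reasons shown for each task can also be triaged in bulk. A small sketch that flags common memory-related messages; the pattern strings below are typical JVM/YARN errors, and the sample input is illustrative:

```python
# Sketch: classifying memory-related task failures by scanning the failure
# reason text shown in the Spark UI (or pulled from executor logs).

MEMORY_PATTERNS = [
    "java.lang.OutOfMemoryError",
    "GC overhead limit exceeded",
    "Container killed by YARN for exceeding memory limits",
]

def is_memory_failure(reason: str) -> bool:
    """True if the failure reason matches a known memory-error pattern."""
    return any(p in reason for p in MEMORY_PATTERNS)

print(is_memory_failure(
    "ExecutorLostFailure ... java.lang.OutOfMemoryError: Java heap space"))  # True
```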

 

 

 

Scenario – Spot Instances and Auto-scaling
Diagnosing Failing Executors: Checking Event Log

 

  • Check the Event Log
      • Identify any explanations for executor failures
  • Spot Instances
      • The cloud provider may reclaim spot instances
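
The event log can also be scanned without the UI. A sketch that picks out executor-removal events and their reasons; the key names follow Spark's JSON event-log format (`SparkListenerExecutorRemoved`), and the sample lines are illustrative:

```python
import json

# Sketch: scanning a Spark event log (JSON lines) for executor removals.

def executor_removals(lines):
    """Return (executor_id, reason) for each executor-removed event."""
    out = []
    for line in lines:
        event = json.loads(line)
        if event.get("Event") == "SparkListenerExecutorRemoved":
            out.append((event.get("Executor ID"), event.get("Removed Reason")))
    return out

# Illustrative sample event-log lines:
sample_log = [
    '{"Event": "SparkListenerExecutorAdded", "Executor ID": "0"}',
    '{"Event": "SparkListenerExecutorRemoved", "Executor ID": "0", '
    '"Removed Reason": "Spot instance reclaimed by the cloud provider"}',
]

print(executor_removals(sample_log))
```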

 

 

 

Diagnosing Failing Executors: Navigating to Executors Tab

 

  • Check the Event Log for Executor Loss
      • Look for messages indicating cluster resizing or spot instance loss
  • Navigate to the Spark UI
      • Click on the Executors tab

 

 

 

Diagnosing Failing Executors: Getting Logs from Failed Executors

 

  • Here you can get the logs from the failed executors:

 

 

 

Scenario - Memory Issues

 

 

Identifying Longest Stage

  • Identify the longest stage of the job
      • Scroll to the bottom of the job’s page
      • Locate the list of stages
      • Order the stages by duration
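
The same ordering can be done programmatically instead of sorting in the UI. A sketch assuming Spark's REST monitoring API field names (`executorRunTime` is in milliseconds); the sample stages are illustrative:

```python
# Sketch: finding the longest stage from the REST API's stage list.

def longest_stage(stages):
    """Return the stage with the largest total executor run time."""
    return max(stages, key=lambda s: s["executorRunTime"])

# Illustrative sample data:
sample = [
    {"stageId": 1, "name": "scan parquet", "executorRunTime": 12_000},
    {"stageId": 2, "name": "exchange",     "executorRunTime": 95_000},
    {"stageId": 3, "name": "aggregate",    "executorRunTime": 30_000},
]

print(longest_stage(sample)["stageId"])  # 2 (the slowest stage)
```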

 

 

 

Stage I/O Details

  • High-Level Data Overview
      • Input – data read into the stage from storage
      • Output – data written out to storage
      • Shuffle Read – data read from other stages across the shuffle
      • Shuffle Write – data written for consumption by later stages
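
These four metrics are also exposed per stage through Spark's REST monitoring API. A sketch that summarizes them in human-readable units; the field names follow that API, and the sample stage is illustrative:

```python
# Sketch: summarizing a stage's I/O metrics with readable byte units.

def fmt_bytes(n: float) -> str:
    """Format a byte count as B/KiB/MiB/GiB/TiB."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PiB"

def io_summary(stage):
    return {k: fmt_bytes(stage[k]) for k in
            ("inputBytes", "outputBytes", "shuffleReadBytes", "shuffleWriteBytes")}

# Illustrative sample stage:
sample = {"inputBytes": 8 * 1024**3, "outputBytes": 0,
          "shuffleReadBytes": 0, "shuffleWriteBytes": 6 * 1024**3}

print(io_summary(sample))
```

A stage that reads little input but shuffles a lot of data is often a good candidate for closer inspection.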

 

 

 

 

 

 

Number of Tasks in Long Stage

 

  • Identifying the number of tasks
      • The task count helps pinpoint the issue
      • Look at the location shown in the screenshot to determine the number of tasks
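
One quick check the task count enables: if a long stage moves a lot of data through very few tasks, the work is under-parallelized. A minimal sketch; the sample numbers are illustrative, not official guidance:

```python
# Sketch: average data volume per task as a partition-sizing signal.

def avg_bytes_per_task(total_bytes: int, num_tasks: int) -> float:
    """Bytes of data handled by each task, on average."""
    return total_bytes / max(num_tasks, 1)

# e.g. a stage shuffling 64 GiB across only 8 tasks:
per_task = avg_bytes_per_task(64 * 1024**3, 8)
print(per_task / 1024**3)  # 8.0 (GiB per task)
```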

 

 

 

Investigating Stage Details

 

  • Investigate Further if There Are Multiple Tasks
      • Check if the stage has more than one task
      • Click on the link in the stage’s description
  • Get More Info About the Longest Stage
      • Click on the link provided
      • Gather detailed information
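
On the stage detail page, comparing the slowest task against the median is a common way to spot data skew. A sketch of that check; the 5x threshold and the sample durations are illustrative choices, not a Spark default:

```python
from statistics import median

# Sketch: detecting skew from per-task durations on the stage detail page.

def looks_skewed(task_durations_ms, ratio=5.0):
    """True if the slowest task takes >= ratio times the median task."""
    m = median(task_durations_ms)
    return m > 0 and max(task_durations_ms) / m >= ratio

balanced = [950, 1000, 1010, 980, 1020]
skewed   = [1000, 1050, 990, 1010, 60_000]  # one straggler task

print(looks_skewed(balanced), looks_skewed(skewed))  # False True
```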

 

 

 

Conclusion

  • Potential Data Skew Issues
      • Data skew can impact performance
      • It occurs when data is unevenly distributed across partitions
  • Spill
      • Spill occurs when data that does not fit in memory is written to disk
      • Reducing skew and spill improves performance
  • Learn More
      • Navigate to Skew and spill - Azure Databricks | Microsoft Learn
Version 1.0