This guide provides a detailed walkthrough on using the Spark UI to diagnose performance issues in Spark jobs. It covers understanding job composition, navigating the Spark UI, interpreting job timelines, and troubleshooting common problems such as failing jobs, executor issues, memory problems, and identifying performance bottlenecks.
Agenda
- Introduction
- Overview of Spark UI
- Navigating to Spark UI
- Jobs Timeline
- Opening Jobs Timeline
- Reading Event Timeline
- Failing Jobs or Executors
- Diagnosing Failing Jobs
- Diagnosing Failing Executors
- Scenario - Memory Issues
- Scenario - Long Running Jobs
- Scenario - Identifying Longest Stage
Introduction
- Goal: diagnose performance issues in your jobs using the Spark UI
- This guide walks you through the Spark UI views and workflows used to do that
Overview of Spark UI
- Job Composition
- A job is composed of multiple stages
- Each stage may contain more than one task
- Task Breakdown
- Tasks are distributed across the cluster’s executors, which run them in parallel (see the sketch below)
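A minimal sketch of how these pieces map onto code, assuming a running Spark session (on Databricks, `spark` already exists and getOrCreate() simply reuses it). One action triggers one job; the shuffle introduced by groupBy splits that job into two stages, and each stage runs as many parallel tasks on the executors.

```python
# One action (count) -> one job in the Jobs tab.
# The groupBy shuffle creates a stage boundary, so the job has two stages;
# each stage is executed as many tasks distributed across the executors.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 10_000_000)                               # narrow work: no shuffle yet
grouped = df.groupBy((df.id % 100).alias("bucket")).count()   # shuffle -> new stage
grouped.count()                                               # the action that appears as a job
```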
Navigating to Spark UI: Navigating to Cluster's Page
- Navigate to your cluster’s page:
Navigating to Spark UI: Clicking Spark UI
- Click Spark UI:
Jobs Timeline
- Jobs timeline
- The jobs timeline is a great starting point for understanding your pipeline or query. It gives you an overview of what was running, how long each step took, and whether there were any failures along the way
Opening Jobs Timeline
- Accessing the Jobs Timeline
- Navigate to the Spark UI
- Click on the Jobs tab
- Viewing the Event Timeline
- Click on Event Timeline
- Highlighted in red in the screenshot
- Example Timeline
- Shows driver and executor 0 being added
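An optional sketch for making the timeline easier to read: setJobDescription is a standard SparkContext call that labels the jobs triggered by a piece of work, so they are easier to spot in the event timeline and Jobs table. The label text and table names below are illustrative only.

```python
# Label the jobs triggered by this work so they stand out in the Jobs timeline.
# setJobDescription is a standard SparkContext method; "sales" and the output
# table name are placeholders, not part of this guide's environment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sparkContext.setJobDescription("daily sales aggregation")   # shown in the Jobs tab
daily = spark.read.table("sales").groupBy("day").count()
daily.write.mode("overwrite").saveAsTable("daily_sales_counts")
```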
Failing Jobs or Executors: Example of Failed Job
- Failed Job Example
- Indicated by a red status
- Shown in the event timeline
- Removed Executors
- Also indicated by a red status
- Shown in the event timeline
Failing Jobs or Executors: Common Reasons for Executors Being Removed
- Autoscaling
- Expected behavior, not an error
- See Enable autoscaling in the Compute configuration reference (Azure Databricks | Microsoft Learn) for more details
- Spot instance losses
- Cloud provider reclaiming your VMs
- Learn more about Spot instances here
- Executors running out of memory
Diagnosing Failing Jobs: Steps to Diagnose Failing Jobs
- Identifying Failing Jobs
- Click on the failing job to access its page
- Reviewing Failure Details
- Scroll down to see the failed stage
- Check the failure reason
Diagnosing Failing Jobs: Generic Errors
You may only get a generic error. Click the link in the failure description to see if you can get more information:
Diagnosing Failing Jobs: Memory Issues
- Task Failure Explanation
- Scroll down the page to see why each task failed
- Memory issue identified as the cause
Scenario – Spot Instances and Autoscaling
Diagnosing Failing Executors: Checking Event Log
- Check Event Log
- Identify any explanations for executor failures
- Spot Instances
- Cloud provider may reclaim spot instances
Diagnosing Failing Executors: Navigating to Executors Tab
- Check Event Log for Executor Loss
- Look for messages indicating cluster resizing or spot instance loss
- Navigate to Spark UI
- Click on the Executors tab
Diagnosing Failing Executors: Getting Logs from Failed Executors
- Here you can get the logs from the failed executors:
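The same executor information is also exposed by Spark's monitoring REST API, which can be handy for scripted checks. A hedged sketch: the /applications and /allexecutors endpoints are part of the standard API, but the localhost:4040 base URL is an assumption and will differ on Databricks, where the Spark UI is proxied behind the workspace.

```python
# Pull executor status (including removed executors and the remove reason)
# from Spark's monitoring REST API instead of the Executors tab.
import requests

base = "http://localhost:4040/api/v1"   # assumption: UI reachable locally
app_id = requests.get(f"{base}/applications").json()[0]["id"]
executors = requests.get(f"{base}/applications/{app_id}/allexecutors").json()

for ex in executors:
    status = "active" if ex.get("isActive", True) else "removed"
    print(ex["id"], status, ex.get("removeReason", ""))
```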
Scenario - Memory Issues
- Memory Issues
- Common cause of problems
- Requires thorough investigation
- Quality of Code
- Potential source of memory issues
- Should be checked for efficiency (a common pitfall is sketched below)
- Data Quality
- Can affect memory usage
- Must be organized correctly
- Spark memory issues - Azure Databricks | Microsoft Learn
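As a concrete example of the code-quality point above, here is a minimal sketch of a common driver-memory pitfall. The table name "events" is illustrative only.

```python
# Pulling a large DataFrame onto the driver with collect() or toPandas() can
# exhaust driver memory; keeping the aggregation on the cluster and collecting
# only the small result avoids that.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.read.table("events")        # placeholder table name

# Risky on large data: materialises every row on the driver.
# all_rows = events.collect()

# Safer: aggregate on the executors, then bring back only the small summary.
summary = events.groupBy("event_type").count()
summary.show()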
Identifying Longest Stage
- Identify the longest stage of the job
- Scroll to the bottom of the job’s page
- Locate the list of stages
- Order the stages by duration
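A hedged, programmatic alternative to sorting the stage table by hand: the standard monitoring REST API can list completed stages, which you can then order by total task run time. The localhost:4040 base URL is an assumption (see the earlier note), and executorRunTime is the sum of task run times, not the stage's wall-clock duration.

```python
# List completed stages and print the five with the most total task time.
import requests

base = "http://localhost:4040/api/v1"
app_id = requests.get(f"{base}/applications").json()[0]["id"]
stages = requests.get(
    f"{base}/applications/{app_id}/stages", params={"status": "complete"}
).json()

for s in sorted(stages, key=lambda s: s.get("executorRunTime", 0), reverse=True)[:5]:
    print(s["stageId"], s.get("name", ""), s.get("executorRunTime", 0), "ms total task time")
```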
Stage I/O Details
- High-Level Data Overview
- Input
- Output
- Shuffle Read
- Shuffle Write
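When the longest stage shows very large Shuffle Read/Write and the shuffle comes from a join with a small lookup table, one common remedy is broadcasting the small side so the shuffle is avoided. A hedged sketch; table and column names are illustrative only.

```python
# Broadcast the small dimension table so the large fact table is not shuffled.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.read.table("sales")      # large fact table (placeholder name)
stores = spark.read.table("stores")    # small lookup table (placeholder name)

joined = sales.join(F.broadcast(stores), "store_id")
joined.groupBy("region").agg(F.sum("amount")).show()
```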
Number of Tasks in Long Stage
- Identifying the number of tasks
- Helps in pinpointing the issue
- The task count is shown in the stage’s row of the stage list (highlighted in the screenshot); the sketch below covers what typically controls it
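As background for interpreting the task count: for shuffle stages in Spark SQL, the number of tasks is typically governed by spark.sql.shuffle.partitions (default 200), although adaptive query execution may coalesce them into fewer tasks. A hedged sketch of inspecting and adjusting the setting; the value shown is an example, not a recommendation.

```python
# Inspect and adjust the shuffle partition count, which usually determines
# how many tasks a shuffle stage gets. Very few tasks in a long stage often
# points at a partitioning problem rather than slow code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "400")   # example value; tune for your data volume
```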
Investigating Stage Details
- Investigate Further if Multiple Tasks
- Check whether the stage has more than one task
- If so, click on the link in the stage’s description
- Get More Info About the Longest Stage
- Click on the link provided to open the stage’s detail page
- Gather detailed task-level information there
Conclusion
- Potential Data Skew Issues
- Data skew can impact performance
- Skew means data is unevenly distributed across partitions, so a few tasks end up doing most of the work (see the sketch below)
- Potential Spill Issues
- Spill happens when data no longer fits in a task’s memory and is written to disk, which slows the stage down
- Reducing skew and spill is key to optimal performance
- Learn More
- See Skew and spill - Azure Databricks | Microsoft Learn
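A quick, hedged way to check for skew before (or alongside) reading the stage's task-level metrics: if a handful of key values account for most of the rows, any join or aggregation on that key will be skewed. Table and column names below are illustrative only.

```python
# Count rows per key and look for a few dominant values, which usually signals skew.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("events")        # placeholder table name

(df.groupBy("customer_id")             # placeholder key column
   .count()
   .orderBy(F.desc("count"))
   .show(10))
```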