A few basic terms:
Scheduler – each scheduler is mapped to an individual logical processor (DMV sys.dm_os_schedulers)
Worker – each worker is mapped to a thread1 (DMV sys.dm_os_workers)
Task – each task is mapped to a series of programming instructions that form a unit of work. The task is then bound to a worker in order to execute the instructions on a CPU. (DMV sys.dm_os_tasks)
Please reference SQL Server Thread and Task Architecture Guide for comprehensive explanation.
1Unless the SQL Server instance is set to fiber mode, which is generally not recommended.
The following graph is a high-level life cycle workflow of a task from the moment that it is bound to a worker to the time that it is unbound from its worker after task completion.
SQL Server internally tracks the current task count as load factor for each scheduler and always assigns a newly created task to the least loaded scheduler for balanced CPU usage. This mechanism works well for typical short-running query tasks.
For long-running tasks, this mechanism may not always sustain evenly distributed CPU usage across schedulers. It is possible for more than one long-running task to be assigned to the same scheduler when other schedulers have higher task loads (which are all short-running tasks) at the moment the long-running task is assigned to a scheduler. Workers are bound to the same scheduler for the life of the worker, and a task is bound to a worker for its entire execution duration. This means when other schedulers become idle or have fewer tasks, long-running tasks cannot be re-assigned to these schedulers because they are bound to their originally assigned scheduler. This can lead to an unbalanced workload with multiple competing runnable workers on a single scheduler unable to utilize available CPU cycles on other idle schedulers.
With computer hardware evolution, SQL Server customers gradually switch to host machines with a higher number of CPUs. Imbalanced CPU usage in the SQL Server engine becomes a more visible problem these days.
Worker migration (AKA “worker stealing”) allows an idle SOS scheduler to migrate a worker from the runnable queue of another scheduler on the same NUMA node and immediately resume the task of the migrated worker. This enhancement provides more balanced CPU usage and reduces the amount of time long-running tasks spend in the runnable queue.
A long-running task that is enabled for worker migration is no longer bound to a fixed scheduler. Instead, it will frequently move across schedulers within the same NUMA node which naturally results in less loaded schedulers. Together with the existing load factor mechanism, worker migration provides SQL Server with an enriched solution for balanced CPU usage.
In SQL Server 2019, workers associated with availability group parallel redo tasks are enabled for worker migration to address a commonly observed scheduler contention issue among redo tasks on secondary replicas.
Availability group parallel redo is the default redo mode starting with SQL Server 2016 RTM and later versions. For each database in an availability group secondary replica, a set of parallel redo worker tasks are created for this database in addition to its main redo task. The main redo task is dedicated to analyzing incoming log records and dispatching them to different parallel redo workers. Each parallel redo worker processes assigned log records and applies changes to data pages.
Both the main redo task and parallel redo worker tasks are long-running tasks. After a database successfully starts its parallel redo worker tasks during database startup phase, these parallel redo worker tasks will stay alive until the database is shut down. The main redo task also keeps running as long as there are transaction logs to process and does not stay idle for more than 60 seconds.
For a heavy transaction workload database on the availability group primary replica, the main redo task of the database on a secondary replica is generally CPU-intensive. Its parallel redo worker tasks may have high CPU demand depending on database configuration, table schema and transaction types.
It is possible for one or more parallel redo worker tasks of a database to be assigned to the same scheduler as its main redo task. If more than one long-running redo task competes for CPU cycles on the same scheduler, scheduler contention may occur. When this happens, the overall transaction log redo throughput of the impacted database can drop 20-30%.
When several availability group databases have heavy redo workloads during the same period, this scheduler contention pattern is more common because of the increased chance for one or more database main redo tasks to be allocated to the same scheduler with parallel redo worker tasks.
Worker migration avoids this performance degradation and maintains more consistent redo throughput for a database. For multiple database concurrent redo workloads, the overall redo throughput has about 10-15% improvement compared to the situation when there is scheduler contention between some main redo tasks and parallel redo worker tasks.
To monitor worker migration occurrence, a new column “worker_migration_count” is added into the existing DMV sys.dm_os_workers. It is a cumulative indicator of the number of migrations that have happened to a worker since the last SQL Server service startup.
Also, for parallel redo worker task records (command=”PARALLEL REDO TASK” or “PARALLEL REDO HELP TASK”) from the output of sys.dm_exec_requests, their scheduler_id values are expected to change frequently when the corresponding database has ongoing redo workloads.
T-SQL query samples:
-- Query worker_migration_count column in DMV sys.dm_os_workers
SELECT worker_address, worker_migration_count
ORDER BY worker_migration_count DESC
-- Query parallel redo tasks in DMV sys.dm_exec_requests
-- (scheduler_id values may change for some rows when re-running this query)
SELECT scheduler_id, session_id, task_address, command
WHERE command = 'PARALLEL REDO TASK' OR command = 'PARALLEL REDO HELP TASK'
ORDER BY session_id
In the future, other SQL Server system long-running tasks may be evaluated for enabling worker migration to avoid scheduler contention. One such example would be the availability group main redo task, which can also be impacted by scheduler contention. When the main redo tasks of two databases are assigned to the same scheduler, or a main redo task shares the same scheduler with a long-running and CPU intensive query task, redo performance can drop 30% or more.
Principal Software Engineer
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.