SQLServerAlwaysOn
56 TopicsSQL Server 2019 Intelligent Performance -Worker Migration
SQL Server task scheduling basics A few basic terms: Scheduler – each scheduler is mapped to an individual logical processor (DMV sys.dm_os_schedulers) Worker – each worker is mapped to a thread 1 (DMV sys.dm_os_workers) Task – each task is mapped to a series of programming instructions that form a unit of work. The task is then bound to a worker in order to execute the instructions on a CPU. (DMV sys.dm_os_tasks) Please reference SQL Server Thread and Task Architecture Guide for comprehensive explanation. 1 Unless the SQL Server instance is set to fiber mode, which is generally not recommended. The following graph is a high-level life cycle workflow of a task from the moment that it is bound to a worker to the time that it is unbound from its worker after task completion. Mechanism to balance CPU usage in previous versions of SQL Server SQL Server internally tracks the current task count as load factor for each scheduler and always assigns a newly created task to the least loaded scheduler for balanced CPU usage. This mechanism works well for typical short-running query tasks. For long-running tasks, this mechanism may not always sustain evenly distributed CPU usage across schedulers. It is possible for more than one long-running task to be assigned to the same scheduler when other schedulers have higher task loads (which are all short-running tasks) at the moment the long-running task is assigned to a scheduler. Workers are bound to the same scheduler for the life of the worker, and a task is bound to a worker for its entire execution duration. This means when other schedulers become idle or have fewer tasks, long-running tasks cannot be re-assigned to these schedulers because they are bound to their originally assigned scheduler. This can lead to an unbalanced workload with multiple competing runnable workers on a single scheduler unable to utilize available CPU cycles on other idle schedulers. With computer hardware evolution, SQL Server customers gradually switch to host machines with a higher number of CPUs. Imbalanced CPU usage in the SQL Server engine becomes a more visible problem these days. What is worker migration Worker migration (AKA “worker stealing”) allows an idle SOS scheduler to migrate a worker from the runnable queue of another scheduler on the same NUMA node and immediately resume the task of the migrated worker. This enhancement provides more balanced CPU usage and reduces the amount of time long-running tasks spend in the runnable queue. A long-running task that is enabled for worker migration is no longer bound to a fixed scheduler. Instead, it will frequently move across schedulers within the same NUMA node which naturally results in less loaded schedulers. Together with the existing load factor mechanism, worker migration provides SQL Server with an enriched solution for balanced CPU usage. Availability group parallel redo tasks with worker migration In SQL Server 2019, workers associated with availability group parallel redo tasks are enabled for worker migration to address a commonly observed scheduler contention issue among redo tasks on secondary replicas. Availability group parallel redo is the default redo mode starting with SQL Server 2016 RTM and later versions. For each database in an availability group secondary replica, a set of parallel redo worker tasks are created for this database in addition to its main redo task. The main redo task is dedicated to analyzing incoming log records and dispatching them to different parallel redo workers. Each parallel redo worker processes assigned log records and applies changes to data pages. Both the main redo task and parallel redo worker tasks are long-running tasks. After a database successfully starts its parallel redo worker tasks during database startup phase, these parallel redo worker tasks will stay alive until the database is shut down. The main redo task also keeps running as long as there are transaction logs to process and does not stay idle for more than 60 seconds. For a heavy transaction workload database on the availability group primary replica, the main redo task of the database on a secondary replica is generally CPU-intensive. Its parallel redo worker tasks may have high CPU demand depending on database configuration, table schema and transaction types. It is possible for one or more parallel redo worker tasks of a database to be assigned to the same scheduler as its main redo task. If more than one long-running redo task competes for CPU cycles on the same scheduler, scheduler contention may occur. When this happens, the overall transaction log redo throughput of the impacted database can drop 20-30%. When several availability group databases have heavy redo workloads during the same period, this scheduler contention pattern is more common because of the increased chance for one or more database main redo tasks to be allocated to the same scheduler with parallel redo worker tasks. Worker migration avoids this performance degradation and maintains more consistent redo throughput for a database. For multiple database concurrent redo workloads, the overall redo throughput has about 10-15% improvement compared to the situation when there is scheduler contention between some main redo tasks and parallel redo worker tasks. To monitor worker migration occurrence, a new column “worker_migration_count” is added into the existing DMV sys.dm_os_workers. It is a cumulative indicator of the number of migrations that have happened to a worker since the last SQL Server service startup. Also, for parallel redo worker task records (command=”PARALLEL REDO TASK” or “PARALLEL REDO HELP TASK”) from the output of sys.dm_exec_requests, their scheduler_id values are expected to change frequently when the corresponding database has ongoing redo workloads. T-SQL query samples: -- Query worker_migration_count column in DMV sys.dm_os_workers SELECT worker_address, worker_migration_count FROM sys.dm_os_workers ORDER BY worker_migration_count DESC -- Query parallel redo tasks in DMV sys.dm_exec_requests -- (scheduler_id values may change for some rows when re-running this query) SELECT scheduler_id, session_id, task_address, command FROM sys.dm_exec_requests WHERE command = 'PARALLEL REDO TASK' OR command = 'PARALLEL REDO HELP TASK' ORDER BY session_id Next Steps In the future, other SQL Server system long-running tasks may be evaluated for enabling worker migration to avoid scheduler contention. One such example would be the availability group main redo task, which can also be impacted by scheduler contention. When the main redo tasks of two databases are assigned to the same scheduler, or a main redo task shares the same scheduler with a long-running and CPU intensive query task, redo performance can drop 30% or more. Dong Cao Principal Software Engineer16KViews10likes1CommentSQL Server Licensing: High Availability and Disaster Recovery benefits
Learn about how the new SQL Server High Availability and Disaster Recovery Software Assurance benefits can help you architect an optimal business continuity and disaster recovery strategy with the best TCO.79KViews4likes13CommentsHPE SGLX - The new Azure VM extension for SQL Server on Linux
Overview Mission critical SQL Server instances have the need to utilize high availability and disaster recovery features to ensure business continuity. SQL Server on Linux supports HPE Serviceguard as one of the supported clustering solutions.To know more, please refer to HPE Serviceguard for Linux. We have an official documentation that provides detailed steps on how you can configure SQL Server Always On availability groups with HPE Serviceguard for Linux. The purpose of this blog is to inform you about the HPE Serviceguard Extension available in Azure marketplace. This extension allows users to easily create Azure linux-based VMs with SQL Server and HPE Serviceguard pre-installed. As a result, customers can quickly configure SQL Server high availability (HA) solutions. The Serviceguard - Azure VM marketplace extension is available in all Azure regions for use. Let's get started! Create an Azure virtual machine using Azure portal: Log in to Azure portal – portal.azure.com Create Resource group in Azure under the subscription or choose existing resource group of choice. Choose SQL Server on Linux based marketplace image (RHELor SUSE) and configure the VM on the basis of disks, virtual network, etc. On the Advanced tab of VM creation, click on “Select an extension to install.” Search for HPE Serviceguard for Linux Click on load more and select the extension. Under ‘Select Serviceguard Add-on to install’ , Select Microsoft SQL server on Linux if it’s non-quorum server node machine and select ‘Centralized Serviceguard Management and Arbitration’ if you need Serviceguard quorum server and SGMGR+ (UI) to be configured,. Enter Serviceguard administrative user “sgmgr” password and confirm the same. Proceed to create the VM upon entering the SGLX extension details. After you have configured the other two VMs using the steps as outlined above, you can follow this documentation starting from "Create HPE Service guard cluster" section to configure SQL Server Always On availability groups for SQL Server on Linux3.8KViews1like0CommentsAnnouncing General Availability of Azure Portal Experience to Deploy Multi-Subnet Availability Group
Today, we are excited to announce general availability (GA) of the end-to-end experienceto deploy a multi-subnet Availability Group (AG) for SQL Server on Azure Virtual Machines.3.6KViews3likes2CommentsMonitoring AlwaysOn Health with PowerShell - Part 3 : A Simple Monitoring Application
First published on MSDN on Feb 14, 2012 Part 3 – A Simple Monitoring Application for AlwaysOnInPart 1 and Part 2 of this series,we learned quite a bit about how the AlwaysOn health cmdlets operate.1.8KViews0likes2CommentsLift and Shift Always On SQL Server Failover Cluster Instance (SQL FCI) to Azure VMs
Today, we are announcing two new features enabling lift and shift of Always On SQL server Failover Cluster instances (SQL FCI) from on-premises to Azure Virtual Machines: Distributed Network Name (DNN) for SQL FCI and Azure Shared Disks for SQL FCI.23KViews9likes7Comments