SQL Server Always On Availability Group Scalability Experiment and Results on Azure VM
Published Feb 23 2021

Introduction

The purpose of this scalability experiment is to evaluate the performance of an AG cluster that holds a large number of databases, some of which receive heavy traffic. We measure 1) OLTP transaction performance, 2) failover success rate and failover time, and 3) resource consumption, so that the results can serve as a reference for customers configuring large-scale AG clusters. For an AG cluster with a given configuration of CPU cores, memory, and disk capacity, we want to know the maximum number of databases the AG can hold while remaining in a healthy state without breaching resource limits.

 

Configuration

Hardware

Test client configuration: one Standard L8s_v2 (8 vCPUs, 64 GiB RAM) and three Standard E2s_v3 (2 vCPUs, 16 GiB RAM).
Because of the core quota limit on our resource group, we used the L8s and E2s machines. Multiple test clients are needed to simulate multiple user drivers (up to 40) in the test; a single machine can host at most 10 user drivers because of a named-pipe limitation in the user driver.

 

64 vCores test server configuration: two Standard E64s (64 vCPUs, 432 GiB RAM) VMs hosting an AG with one primary replica and one synchronous secondary replica. Each machine has four P30 1 TB disks with read-only caching enabled in a simple storage pool for data files (5,000 max IOPS and 200 MB/s throughput per disk × 4 = 20,000 max IOPS and 800 MB/s throughput total) and two P30 1 TB disks with caching set to none in a simple storage pool for log files (5,000 max IOPS and 200 MB/s throughput per disk × 2 = 10,000 max IOPS and 400 MB/s throughput total).

 

128 vCores test server configuration: two Standard M128ms (128 vCPUs, 3,892 GiB RAM) VMs hosting an AG with one primary replica and one synchronous secondary replica. Each machine has eight P30 1 TB disks with read-only caching enabled in a simple storage pool for data files (5,000 max IOPS and 200 MB/s throughput per disk × 8 = 40,000 max IOPS and 1,600 MB/s throughput total) and four P30 1 TB disks with caching set to none in a simple storage pool for log files (5,000 max IOPS and 200 MB/s throughput per disk × 4 = 20,000 max IOPS and 800 MB/s throughput total).

 

Software

OS version: Windows Server 2019

SQL Server version: SQL Server 2019 Enterprise Edition without additional trace flags

SQL Server configurations: all databases, including the system databases, are on the dedicated data and log disks. Max SQL Server memory is set to 90% of available memory, and soft-NUMA is enabled. Auto growth and auto shrink are disabled on all databases.
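For reference, a minimal T-SQL sketch of these settings is shown below; the memory value is approximate for the 432 GiB E64s machines, and the database and file names are placeholders, not the exact ones used in the experiment.

```sql
-- Cap SQL Server memory at roughly 90% of the 432 GiB on the E64s VM (value in MB, approximate).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 398000;
RECONFIGURE;

-- Disable auto shrink and auto growth on each test database (placeholder names).
ALTER DATABASE [tpce_0001] SET AUTO_SHRINK OFF;
ALTER DATABASE [tpce_0001] MODIFY FILE (NAME = N'tpce_0001_data', FILEGROWTH = 0);
ALTER DATABASE [tpce_0001] MODIFY FILE (NAME = N'tpce_0001_log',  FILEGROWTH = 0);
```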

 

AG types

AG configuration: One primary replica and one synchronous secondary replica
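For readers reproducing the setup, a minimal sketch of such an AG definition is shown below; the AG name, server names, endpoint URLs, and database name are placeholders rather than the exact ones used in the experiment.

```sql
-- Run on the intended primary replica. Names and endpoint URLs are placeholders.
CREATE AVAILABILITY GROUP [ag_scale_test]
FOR DATABASE [tpce_0001]
REPLICA ON
    N'SQLVM1' WITH (
        ENDPOINT_URL      = N'TCP://SQLVM1.contoso.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC,
        SEEDING_MODE      = AUTOMATIC),
    N'SQLVM2' WITH (
        ENDPOINT_URL      = N'TCP://SQLVM2.contoso.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC,
        SEEDING_MODE      = AUTOMATIC);

-- Run on the secondary replica to join it to the group.
ALTER AVAILABILITY GROUP [ag_scale_test] JOIN;
ALTER AVAILABILITY GROUP [ag_scale_test] GRANT CREATE ANY DATABASE;  -- required for automatic seeding
```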

AG Databases:  

During the test, some databases receive workload and the others do not:

1) Databases not receiving workload: 523 MB per database, a subset of the 1K-customer TPC-E database with indexes removed and row counts reduced; no traffic is run against them.

2) Databases receiving workload: 23.8 GB per database, the full 1K-customer TPC-E database, assigned TPC-E traffic generated by our internal tool.
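With several hundred databases, joining them to the group is easiest to script. Below is a minimal sketch, assuming the databases follow a sequential naming pattern and automatic seeding is enabled; the AG and database names are placeholders.

```sql
-- Add databases tpce_0001 .. tpce_0800 to the AG on the primary replica (placeholder names).
-- Each database must be in FULL recovery with at least one full backup before it can join.
DECLARE @i int = 1, @sql nvarchar(max);
WHILE @i <= 800
BEGIN
    SET @sql = N'ALTER AVAILABILITY GROUP [ag_scale_test] ADD DATABASE '
             + QUOTENAME(N'tpce_' + RIGHT(N'000' + CAST(@i AS nvarchar(4)), 4)) + N';';
    EXEC sys.sp_executesql @sql;
    SET @i += 1;
END;
```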

 

Traffic type

TPC-E traffic generated by an internal tool. Every database receiving workload is assigned a driver engine that simulates 50 to 100 users' traffic.

 

Test Types

We performed two types of tests:

  1. Static test: there is no workload traffic against the AG databases, and we perform multiple failovers (automatic and manual) with different numbers of databases configured in the AG, to evaluate the failover success rate and failover time.
  2. Dynamic test: there is TPC-E workload traffic against a certain percentage (5% or 10%) of the AG databases, and we fail over during the traffic, to evaluate the failover success rate and failover time.
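A manual failover in these tests is simply issued from the target secondary replica; a minimal sketch (the AG name is a placeholder):

```sql
-- Run on the synchronous secondary that should become the new primary (placeholder AG name).
ALTER AVAILABILITY GROUP [ag_scale_test] FAILOVER;
```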

Metrics

The most important metrics we evaluate after placing a large number of databases in an AG are the failover success rate and failover time; we also collect the OLTP transaction rate of the AG databases. In addition, we collect resource-consumption metrics, including i) CPU usage, ii) disk IO latency, iii) secondary replica receiving rate, and iv) redo rate, to verify that they stay in a healthy and reasonable state throughout the test.
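One way to sample the receiving rate and redo rate during a run is through the AG DMVs; below is a minimal query sketch, not the exact tooling used in the experiment.

```sql
-- Per-database synchronization metrics; log_send_rate and redo_rate are reported in KB/sec.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)        AS database_name,
       drs.synchronization_state_desc,
       drs.log_send_rate,              -- rate at which the primary sends log (KB/sec)
       drs.redo_rate,                  -- rate at which the secondary redoes log (KB/sec)
       drs.log_send_queue_size,        -- KB of log not yet sent
       drs.redo_queue_size             -- KB of log not yet redone
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id;
```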

 

Test Result

Below are the test results for the different numbers of AG databases:

A. 64 vCores test

a.1 Static Test Result

| # of AG databases | Failover success rate | Manual Failover | Automatic Failover |
| --- | --- | --- | --- |
| 200 | 100% | AG level: 0.8 seconds; DB level: 2 seconds | AG level: 0.8 seconds; DB level: 5 seconds |
| 400 | 100% | AG level: 1 second; DB level: 35 seconds | AG level: 1 second; DB level: 34 seconds |
| 600 | 100% | AG level: 1.2 seconds; DB level: 58 seconds | AG level: 1.6 seconds; DB level: 58 seconds |
| 800 | 100% | AG level: 1.3 seconds; DB level: 256 seconds | AG level: 1 second; DB level: 255 seconds |

 

 

a.2 Dynamic Test Result

| # of AG databases | Percent of AG databases in traffic | Failover Time | Metrics on Primary | Metrics on Secondary | Transactions per second |
| --- | --- | --- | --- | --- | --- |
| 200 | 5%, 100 users' traffic per dynamic database | AG level: 1.3 seconds; DB level: 60 seconds | CPU usage: 17%; avg disk queue length: 50 | Receiving rate: 5 MB per second; redo rate: 5 MB per second; avg disk queue length: 21 | 252 |
| 200 | 10%, 100 users' traffic per dynamic database | AG level: 0.8 seconds; DB level: 180 seconds | CPU usage: 90%; avg disk queue length: 98 | Receiving rate: 8 MB per second; redo rate: 8 MB per second; avg disk queue length: 48 | 417.26 |
| 400 | 5%, 100 users' traffic per dynamic database | AG level: 2 seconds; DB level: 228 seconds | CPU usage: 93%; avg disk queue length: 105 | Receiving rate: 8 MB per second; redo rate: 8 MB per second; avg disk queue length: 40 | 431.26 |

 

 

 

B. 128 vCores test

b.1 Static Test Result

| # of AG databases | Failover success rate | Manual & Automatic Failover time |
| --- | --- | --- |
| 800 | 100% | AG level: 1.1 seconds; DB level: 184 seconds |
| 1000 | 100% | AG level: 2.25 seconds; DB level: 567.5 seconds |

 

b.2 Dynamic Test Result

 

| # of AG databases | Percent of AG databases in traffic | Failover Time | Metrics on Primary | Metrics on Secondary | Transactions per second |
| --- | --- | --- | --- | --- | --- |
| 800 | 5%, 50 users' traffic per dynamic database | AG level: 2.9 seconds; DB level: 282 seconds | CPU usage: >95%; avg disk queue length: 10 | Receiving rate: 5.2 MB per second; redo rate: 5.2 MB per second; avg disk queue length: 3 | 327.06 |

 

Test Summary

 

1. Based on the static test results:

 

1.1 Failover time

When the number of databases in a single AG is extremely high (>100), the more databases the AG holds, the longer a failover takes.

 

Customers with many databases in a single AG should watch the AG's failover time, since it might exceed the RTO their business requires.
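One rough way to keep an eye on the database-level part of the RTO is to compare the redo queue with the redo rate on the secondary; a minimal sketch, assuming the standard AG DMV columns:

```sql
-- Estimate how long each secondary database would need to finish redo (a rough RTO indicator).
SELECT DB_NAME(drs.database_id)                              AS database_name,
       drs.redo_queue_size,                                  -- KB of log waiting to be redone
       drs.redo_rate,                                        -- KB/sec
       drs.redo_queue_size * 1.0 / NULLIF(drs.redo_rate, 0)  AS est_redo_catchup_seconds
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1
  AND drs.is_primary_replica = 0
ORDER BY est_redo_catchup_seconds DESC;
```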

 

 

 

2. Based on the dynamic test results:

 

2.1 Resource usage

The more databases you have in an AG, the more resources it takes to keep the AG healthy and process user transactions.

 

2.2 High traffic databases in AG

The more high-traffic databases you have in an AG, the more resources it takes to keep the AG healthy and process user transactions, and a much longer failover time should also be expected.

 

For customers introducing AG technology to an existing set of databases:

Regarding 2.1, they need to increase the VM capacity to accommodate both the user traffic and the system traffic generated by AG synchronization.

Regarding 2.2, they need to be careful when placing many high-traffic databases within a single AG, since this can lead to high resource usage and longer failovers.

 

 

Test Customization

 

If you wish to perform your own experiment, here are some areas for consideration:

 

1) Test environment

You need to prepare test machines as test clients (where you generate traffic against the test server) and test servers on which the AG cluster resides.
For the test client machines, make sure that 1) they are compatible with the test tool, 2) their capacity (CPU, memory, and disk IO) can sustain the traffic generation, and 3) there is no high network latency between the test client and the test server.
For the test server machines, choose ones with the same capacity as the machines used in your production environment, so that the test results give an accurate reference.
For the software specs, you can refer to the test configuration in this blog.

 

2) Test type
You can customize the test plan based on your business concerns. In this test, our focus is to evaluate AG performance and stability when a single AG holds a large number of databases and a small percentage of them receive traffic. Depending on what you need to evaluate, you can design your own specific test. For example, if you want to know whether the AG can sustain an online index build on a big table in an AG database while there is heavy traffic against the AG, your test could be: from the test client, schedule traffic (for example, OLTP) against the AG database on the test server, and create a job that runs an online index build against the same AG database; then monitor the change in AG traffic, the time the online index build takes, the AG health state, and system resource usage.
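As a concrete illustration, the job could run something like the following (table and index names are placeholders; ONLINE = ON requires Enterprise Edition, which is what this experiment used):

```sql
-- Online rebuild of a large index in one of the AG databases while the OLTP traffic is running.
USE [tpce_0001];  -- placeholder database name
ALTER INDEX [IX_TRADE_HISTORY] ON [dbo].[TRADE_HISTORY]
REBUILD WITH (ONLINE = ON, MAXDOP = 4);
```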


3)Test metrics

From the resource usage perspective, you can monitor CPU usage, memory usage, disk IO throughput, and network throughput on all AG replicas.
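For disk IO latency, a minimal per-file query sketch based on sys.dm_io_virtual_file_stats is shown below; the values are cumulative since instance start, so sample twice and take the difference for interval latency.

```sql
-- Average read/write latency per database file since SQL Server started.
SELECT DB_NAME(vfs.database_id)                                   AS database_name,
       mf.name                                                    AS logical_file_name,
       mf.type_desc,                                              -- ROWS (data) or LOG
       vfs.io_stall_read_ms  * 1.0 / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
       vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id     = vfs.file_id;
```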

 

From the performance perspective, you can monitor the transaction rate, the secondary replica's log receiving rate and redo rate, and the secondary read traffic processing rate. The following article is a reference for AG performance monitoring:
https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/monitor-performance...
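The replication counters are also exposed through sys.dm_os_performance_counters; a minimal sketch follows (the exact counter names may vary slightly by version, and per-second counters are cumulative, so two samples are needed to compute a rate).

```sql
-- AG replication counters for each database replica instance.
SELECT RTRIM(object_name)   AS object_name,
       RTRIM(counter_name)  AS counter_name,
       RTRIM(instance_name) AS database_name,
       cntr_value           -- cumulative; sample twice and divide by the interval for a per-second rate
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Database Replica%'
  AND counter_name IN (N'Log Bytes Received/sec', N'Redone Bytes/sec');
```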

 

From the stability perspective, you can evaluate the success rate and the time taken by automatic and manual failovers.

 

Reference

Tutorial: Prerequisites for an availability group - SQL Server on Azure VM | Microsoft Docs

Tutorial: Configure a SQL Server Always On availability group - SQL Server on Azure VM | Microsoft D...

Performance guidelines for SQL Server in Azure - SQL Server on Azure VM | Microsoft Docs

TPC-E Homepage

https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/monitor-performance...

 
