Written by Mine Tokus, Senior Program Manager; Jamie Reding, Senior Program Manager and Sadashivan Krishnamurthy, Principal Architect
Reviewed by Bob Ward, Principal Architect
Most of you are running OLTP applications today with read heavy and complex transactions. The TPC-E benchmark is a reliable representation of modern OLTP workloads and highly recommended to measure OLTP workload performance. The TPC-E benchmark uses diverse transactions and utilizes wide range of data types with 12 transactions acting on 33 tables all exercising the ACID compliance tests. These transactions cover traditional OLTP processing along with some very light analytic queries. In this article, we will share recommendations to optimize SQL Server performance on Azure VMs based on performance testing with scaled down TPC-E  benchmark.
Azure Es_v3 series offer price/perf advantages with the optimized memory to v-core ratios and commonly selected by SQL Server customers. We used E64s_v3 VM size as the server for this tests which comes with 64 hyper threaded v-cores on 2.3 GHz Intel XEON ® E5-2673 v4 (Broadwell) processor family, 432 GB memory and 1600 GB of Azure Blob Cache capacity. On the client side, we placed the client in the same region and availability zone as the server. We used DS13-v2’s for all clients and scaled down TPC-E workload to 800,000 customers with data size at 800GB.
Azure Blob Cache is the key for the best price/perf
In the first test, workload drove 18,403 IO per second and reached up to 490 TPS (Transaction Per Second) with 256 concurrent users (Table-1). That is a great price/performance optimization level as the paid IO capacity is used only for writes (9,668 write IOPS served by 2-P30 disks) and all reads (8,735 total read IOPS) are served from the blob cache. This is a huge benefit for ready heavy workloads as all the read IOPS are served directly from the host cache with very low response times and did not occur any additional charges.
Table 1- E64s_v3, 2-P30 for data and 1 P20 for Log
Scale up application throughput by simply by adding more storage capacity
The first test was bottlenecked on the writes as 2- P30 disks are throttled at 10,000 IOPS. We repeated this test with 4 P30 disks stripped on a single drive (all disks have ReadOnly cache enabled) to host SQL Server data file and 1 P20 disk for SQL Server log file. The same workload scaled up to 823 TPS with 67% increase on the application throughput. Running 823 TPS (14,672 batch requests/sec) required total 37K IO per second (Table -2). Again, 4 P30 disks are used only for the writes (18,978 writes IOPS) and all reads are served from the free blob cache (17,809 read IOPS).
To scale up the workload and increase the throughput start with adding additional storage capacity until CPU becomes the bottleneck. With 823 TPS on E64s_v3, the CPU was only consumed at 45%, there is still significant bandwidth on this VM to increase throughput by simply increasing storage capacity.
Table 2- E64s_v3, 4-P30 for data and 1 P20 for Log
Maximize throughput with lower storage response times
Premium Disks has great price/performance advantages with ReadOnly caching and monthly low-end storage cost. If your workload requires very low storage response times, then you can use ultra-disk to get microsecond level IO latencies at very high IOPS. Azure ultra-disks deliver high throughput, high IOPS, and consistent low latency disk storage for Azure VMs.
With the same test repeated with SQL server data and log files on Ultra Disk, we got up to 1,489 TPS and utilized all VM resources at peak (the CPU is consumed at 92%). Single ultra-disk provided 60,000 total IOPS (31,431 read IOPS and 29,250 write IOPS) and total 964 MB/sec throughput with microsecond level response times (Table-3).
Table 3- E64s_v3, Ultra Disk for Data and Log
Premium SSD delivers great performance for write heavy workloads
If you are running a write heavy workload with simple transactions, then you can validate performance with the TPC-C benchmark. Hammer DB TPC-C  is a tool that can be used to simplify TPC-C benchmarking. We run Hammer DB TPC-C with 1000 warehouses at 100GB total size. Test server is configured with SQL Server 2017 Enterprise edition on E64_v3; data files located on a storage pool stripped over 2-P30 disks (ReadOnly cache enabled), log file is located on a single P20 disk and Temp DB files are hosted on Local disk. With this test, workload reached to 936,896 TPM (Transaction Per Minute) as total IOPS are throttled at 10,000 with 2-P30 disks (Table - 4).
Table 4 – E64_v3, 2 P30 for Data, 1 P20 for Log
Note that, on this run CPU utilization was only at 12%. Can we double the throughput by simply adding more storage capacity? The answer is yes, simply by adding 2 more P30 drives, workload reached to 1.6 million TPM (Table - 5).
Table 5- E64s_v3, 4 P30 for Data and 1 P20 for log.
Write heavy workloads modify data in memory at a frequent rate. Checkpoint process writes the in-memory modified pages (known as dirty pages) and transaction log information from memory to disk. For write heavy workloads, SQL Server performs intensive IO activity to achieve the 1-minute default recovery time.
To show the impact of the checkpoint process on throughput, we repeated the same test with recovery interval set to the maximum value (checkpoint processes is disabled). When recovery interval is set to maximum, the HammerDB TPC-C workload doubled the throughput and reached up to 3,2 million TPM (Table - 6).
Table 6 – E64s_v3, Data on 4-P30, Recovery Time is set to max.
This setting was experimental only and cannot be used on production environments. But recovery time interval can be relaxed to increase the throughout for write heavy workloads modifying millions of pages per minute. If aiming for recovery in ~5 minutes or more is acceptable for the workload, then relaxing the recovery time interval setting will enable higher throughput at the same cost.
 Scaled down TPC-E model reduces the initial footprint of the database. It is primarily used for ease of benchmarking; we reduced the amount of history rows and then instruct the driver to not search too far back. For our tests, we reduced the number of trade days in the history table from 300 to 30. Please note that results from the scaled down model are non-comparable to the published results.
 HammerDB provides an implementation of the TPC-C benchmark. HammerDB TPC-C is derived from TPC-C and as such is not comparable to published TPC-C results, and the HammerDB TPC-C results do not fully comply with TPC-C.