Cluster Shared Volume - A Systematic Approach to Finding Bottlenecks
Published Mar 15 2019 03:01 PM

First published on MSDN on Jul 29, 2015


In this post we will discuss how to find if performance that you observe on a Cluster Shared Volume (CSV) is what you expect and how to find which layer in your solution may be the bottleneck. This blog assumes you have read the previous blogs in the CSV series (see the bottom of this blog for links to all the blogs in the series).


Cluster Shared Volume (CSV) Inside Out
https://techcommunity.microsoft.com/t5/failover-clustering/cluster-shared-volume-csv-inside-out/ba-p...

 

Cluster Shared Volume Diagnostics
https://techcommunity.microsoft.com/t5/failover-clustering/cluster-shared-volume-diagnostics/ba-p/37...

 

Cluster Shared Volume Performance Counters
https://techcommunity.microsoft.com/t5/failover-clustering/cluster-shared-volume-performance-counter...

 

Cluster Shared Volume Failure Handling
https://techcommunity.microsoft.com/t5/failover-clustering/cluster-shared-volume-failure-handling/ba...

 

Troubleshooting Cluster Shared Volume Auto-Pauses – Event 5120
https://techcommunity.microsoft.com/t5/failover-clustering/troubleshooting-cluster-shared-volume-aut...

 

Troubleshooting Cluster Shared Volume Recovery Failure – System Event 5142
https://techcommunity.microsoft.com/t5/failover-clustering/troubleshooting-cluster-shared-volume-rec...

 

Sometimes someone asks why the CSV performance they observe does not match their expectations and how to investigate. The answer is that CSV consists of multiple layers, and the most straightforward troubleshooting approach is a process of elimination: first remove all the layers and test the speed of the disk itself, then add the layers back one by one until you find the one causing the issue.

You might be tempted to use a file copy as a quick way to test performance. While file copy is an important workload, it is not a good way to test raw storage performance. Review this blog, which goes into more detail on why it does not work well.

 

Using file copy to measure storage performance – Why it’s not a good idea and what you should do instead

https://docs.microsoft.com/en-us/archive/blogs/josebda/using-file-copy-to-measure-storage-performanc...

 

It is still important to understand the file copy performance you can expect from your storage, so I would suggest running file copy as part of workload testing after you are done with the micro-benchmarks.

To test performance you can use DiskSpd, which is described in this blog post.

 

DiskSpd, PowerShell and storage performance: measuring IOPs, throughput and latency for both local disks and SMB file shares

https://docs.microsoft.com/en-us/archive/blogs/josebda/diskspd-powershell-and-storage-performance-me...
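
For illustration, a DiskSpd invocation along these lines approximates the pattern used for the measurements later in this post (a single thread keeping eight 8K IOs outstanding, unbuffered write-through). This is a sketch only: the file path, file size, and duration are my own examples, and you should confirm the flags against your DiskSpd version with diskspd.exe -?.

```powershell
# 8K random reads, 1 thread, 8 outstanding IOs, unbuffered write-through, 60 seconds,
# against a 50 GB test file on the clustered disk K: (path and size are examples)
.\diskspd.exe -b8K -t1 -o8 -r -w0 -d60 -Suw -L -c50G K:\testfile.dat

# Same pattern as 100% sequential writes (drop -r, set -w100; reuses the file created above)
.\diskspd.exe -b8K -t1 -o8 -w100 -d60 -Suw -L K:\testfile.dat
```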

When selecting the size of the file you will run the tests on, be aware of the caches and tiers in your storage. For instance, storage might have a cache on NVRAM or NVMe. All writes that go to the fast tier might be very fast, but once you use up all the space in the cache you will be limited to the speed of the next, slower tier. If your intention is to test the cache, then create a file that fits into the cache; otherwise create a file that is larger than the cache.

Some LUNs might have some offsets mapped to SSDs while others map to HDDs; a tiered space is one example. When creating a file, be aware of which tier the blocks of the file are located on.

Additionally, when measuring performance do not assume that two LUNs created with similar characteristics will give you identical performance. If the LUNs are laid out on the physical spindles differently, that alone might be enough to cause completely different performance behavior. To avoid surprises as you run tests through the different layers (described below), ALWAYS use the same LUN. Several times we have seen cases where someone ran tests against one LUN, then ran tests over CSVFS with another LUN that was believed to be similar, observed worse results in the CSVFS case, and incorrectly concluded that CSVFS was the problem, when in the end removing the disk from CSV and running the test directly on the LUN showed that the two LUNs simply had different performance.

The sample numbers you will see in this post were collected on a 2-node cluster:

CPU: Intel(R) Xeon(R) CPU E5-2450L 0 @ 1.80GHz, Intel64 Family 6 Model 45 Stepping 7, GenuineIntel,
2 NUMA nodes with 8 cores each, Hyper-Threading disabled.
RAM: 32 GB DDR3.
Network: one RDMA Mellanox ConnectX-3 IPoIB adapter (54 Gbps) and one Intel(R) I350 Gigabit network adapter.
The shared disk is a single HDD connected using SAS, model HP EG0300FBLSE, firmware version HPD6. Disk cache is disabled.




With this hardware my expectation is that the disk should be the bottleneck, and going over the network should not have any impact on throughput.

In the samples you will see below I was running a single-threaded test application, which at any time kept eight 8K outstanding IOs on the disk. In your tests you might want to add more variations with different queue depths, different IO sizes, and different numbers of threads/CPU cores utilized. To help, the table below outlines some tests to run and data to capture to get a more exhaustive picture of your disk performance. Running all these variations may take several hours. If you know the IO patterns of your workloads then you can significantly reduce the test matrix.

 

 

 

 

All tests use unbuffered, write-through IO; record IOPS and MB/sec in each cell. Repeat the matrix below for each IO size: 4K, 8K, 16K, 64K, 128K, 256K, 512K, and 1MB.

| IO pattern | QD 1 | QD 4 | QD 16 | QD 32 | QD 64 | QD 128 | QD 256 |
|---|---|---|---|---|---|---|---|
| sequential read | | | | | | | |
| sequential write | | | | | | | |
| random read | | | | | | | |
| random write | | | | | | | |
| random 70% reads / 30% writes | | | | | | | |


If you have Storage Spaces then it might be useful to first collect performance numbers for the individual disks the Space will be created from. This will help set expectations for the best-case and worst-case performance you should expect from the Space.

As you are testing the individual spindles that will be used to build Storage Spaces, pay attention to the different MPIO (Multi-Path IO) modes. For instance, you might expect that round robin over multiple paths would be faster than fail over, but for some HDDs you might find that they give you better throughput with fail over than with round robin. When it comes to a SAN, the MPIO considerations are different: with a SAN, MPIO runs between the computer and a controller in the SAN storage box, while with Storage Spaces, MPIO runs between the computer and the HDD, so it comes down to how efficiently the HDD's firmware handles IO arriving on different paths. In production, for a JBOD connected to multiple computers, IO will be coming from different computers, so in any case the HDD firmware needs to handle IO coming from multiple computers/paths efficiently. As with any kind of performance testing, do not jump to the conclusion that a particular MPIO mode is good or bad; always test first.
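
If you are using the in-box Microsoft DSM, a quick way to flip between load balance policies while testing is something like the following (a sketch; vendor DSMs have their own tooling, and the policy values shown are the standard MSDSM ones):

```powershell
# Show the current default MPIO load balance policy for the Microsoft DSM
Get-MSDSMGlobalDefaultLoadBalancePolicy

# Switch the default policy, then re-run the single-spindle tests and compare
Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy FOO   # fail over only
Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy RR    # round robin

# Review the per-disk policy that is actually applied
mpclaim.exe -s -d
```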

Another commonly discussed topic is what the file system allocation unit size (a.k.a. cluster size) should be. There are a variety of options between 4K and 64K.

 


For starters, CSVFS has no requirements for the underlying file system cluster size; it is fully compatible with all cluster sizes. The primary influence on the cluster size is the workload. For Hyper-V and SQL Server data and log files it is recommended to use a 64K cluster size with NTFS. Since CSV is most commonly used to host VHDs in one form or another, 64K is the recommended allocation unit size with NTFS and 4K with ReFS. Another influence is your storage array, so it is good to discuss with your storage vendor any optimizations unique to your storage device that they recommend. There are also a few other considerations, so let's discuss:

    1. File system fragmentation. If for the moment we set the storage underneath the file system aside and look only at the file system layer by itself, then:

        a. Smaller clusters mean better space utilization on the disk: if your file is only 1K, then with a 64K cluster size this file will consume 64K on the disk, while with a 4K cluster size it will consume only 4K, so sixteen (64/4) such 1K files fit in 64K. If you have lots of small files, then a small cluster size might be a good choice.

        b. On the other hand, if you have large files that are growing, then a smaller cluster size means more fragmentation. For instance, in the worst case a 1 GB file with 4K clusters might have up to (1024x1024/4) 262,144 fragments (a.k.a. runs), while with 64K clusters it will have only (1024x1024/64) 16,384 fragments. So why does fragmentation matter?

            i. If you are constrained on RAM you may care more, as more fragments means more RAM is needed to track all this metadata.

            ii. If your workload generates IO larger than the cluster size, you do not run defrag frequently enough, and consequently have lots of fragments, then the workload's IO might need to be split more often when the cluster size is smaller. For instance, if on average the workload generates a 32K IO, then in the worst case with a 4K cluster size this IO might need to be split into (32/4) eight 4K IOs to the volume, while with a 64K cluster size it would never get split. Why does splitting matter? A production workload is usually close to random IO, but the larger the blocks are, the higher the throughput you will see on average, so ideally we should avoid splitting IO when it is not necessary.

            iii. If you are using storage copy offload, note that some storage boxes support it only at 64K granularity and the operation will fail if the cluster size is smaller. Check with your storage vendor.

            iv. If you anticipate lots of large file-level trim commands (trim is the file system counterpart of the storage block UNMAP). You might care about trim if you are using a thinly provisioned LUN or if you have SSDs. An SSD's garbage collection logic in firmware benefits from knowing which blocks are no longer used by the workload and can be garbage collected. For example, assume we have a VHDX with NTFS inside, and the VHDX file itself is very fragmented. When you run defrag on the NTFS inside the VHDX (most likely inside the VM), then among other steps defrag will do free space consolidation and issue a file-level trim to reclaim the free blocks. If there is a lot of free space this might be a trim for a very large range. This trim comes down to the NTFS that hosts the VHDX, which then needs to translate the large file trim into a block unmap for each fragment of the file. If the file is highly fragmented, this may take a significant amount of time. A similar scenario might happen when you delete a large file or lots of files at once.

            v. The list above is not exhaustive by any means; I am focusing on what I view as the most relevant points.

            vi. From the file system perspective, the rule of thumb is to prefer a larger cluster size unless you are planning to have lots of tiny files and the disk space saving from the smaller cluster size is important. No matter what cluster size you choose, you will be better off periodically running defrag. You can monitor how much fragmentation is affecting your workload by looking at the CSV File System Split IO and PhysicalDisk Split IO performance counters.




 

    2. File system block alignment and storage block alignment. When you create a LUN on a SAN or a Storage Space, it may be built out of multiple disks with different performance characteristics. For instance, a mirrored space (http://blogs.msdn.com/b/b8/archive/2012/01/05/virtualizing-storage-for-scale-resiliency-and-efficien...) would contain slabs on many disks, with some slabs acting as mirrors; the entire space address range is then subdivided into 64K blocks and round-robined across these slabs on different disks in RAID0 fashion to give you the better aggregated throughput of multiple spindles.





This means that if you issue a 128K IO it will have to be split into two 64K IOs that go to different spindles. What if your file system is formatted with a cluster size smaller than 64K? Then a contiguous block in the file system might not be 64K aligned. For example, if the file system is formatted with 4K clusters and we have a file that is 128K, the file can start at any 4K-aligned offset. If an application performs a 128K read, it is possible this 128K block will map to up to three 64K blocks on the storage space.



If you format your file system with a 64K cluster size, then file allocations are always 64K aligned and on average you will see fewer IOPS on the spindles. The performance difference will be even larger when it comes to writes to parity, RAID5, or RAID6 like LUNs. When you overwrite part of a block, the storage has to do a read-modify-write, multiplying the number of IOPS hitting your spindles; if you overwrite the entire block it is exactly one IO. If you want to be precise, evaluate the average block size you expect your workload to produce. If it is larger than 4K, then you want the file system cluster size to be at least as large as your average IO size so that on average it does not get split at the storage layer. A rule of thumb is to simply use the same cluster size as the block size used by the storage layer. Always consult your storage vendor for advice; modern storage arrays have very sophisticated tiering and load balancing logic, and unless you understand everything about how your storage box works you might end up with unexpected results. Alternatively, you can run a variety of performance tests with different cluster sizes and see which one gives you better results. If you do not have time to do that then I recommend a 64K cluster size.
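
As a minimal sketch of setting the allocation unit size explicitly (the drive letter and label are examples, and formatting destroys any data on the volume):

```powershell
# Format the test LUN with a 64K allocation unit size before adding it to CSV
Format-Volume -DriveLetter K -FileSystem NTFS -AllocationUnitSize 65536 -NewFileSystemLabel "CSVTest"

# Verify the cluster size actually in use ("Bytes Per Cluster")
fsutil fsinfo ntfsinfo K:
```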

The performance of an HDD/SSD might change after updating disk or storage box firmware, so it might save you time to rerun the performance tests after an update.

As you are running the tests you can use performance counters described here

 

Cluster Shared Volume Performance Counters
https://techcommunity.microsoft.com/t5/failover-clustering/cluster-shared-volume-performance-counter...

 

to get further insight into the behavior of each layer by monitoring average queue depth, latency, throughput, and IOPS at the CSV, SMB, and physical disk layers. For instance, if your disk is the bottleneck then latency and queue depth at all of these layers will be the same. Once the queue depth and latency at a higher layer exceed what you see on the disk, that layer might be the bottleneck.
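
A sketch of sampling these counters while a test runs is below. The exact CSV counter set and counter names vary by OS version, so treat the CSV path here as an assumption and discover the real names first; the PhysicalDisk and SMB Client Shares paths are standard.

```powershell
# Discover which counter sets are available on this node
Get-Counter -ListSet "*CSV*", "SMB Client Shares", "PhysicalDisk" | Select-Object CounterSetName

# Sample queue depth and latency at the physical disk, SMB, and CSV layers (2-second samples, 1 minute)
Get-Counter -SampleInterval 2 -MaxSamples 30 -Counter @(
    "\PhysicalDisk(*)\Avg. Disk Queue Length",
    "\PhysicalDisk(*)\Avg. Disk sec/Transfer",
    "\SMB Client Shares(*)\Avg. Data Queue Length",
    "\SMB Client Shares(*)\Avg. sec/Data Request",
    "\Cluster CSVFS(*)\Avg. sec/Read"      # assumed CSV file system counter path; confirm with -ListSet
)
```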


Run performance tests only on hardware that is not currently used by any other workloads or tests; otherwise your results may not be valid because of too much variability. You also might want to rerun each variation several times to make sure there is no significant variability.

Baseline 1 – No CSV; Measure Performance of NTFS


In this case IO has to traverse the NTFS file system and disk stack in the OS, so conceptually we can represent it this way:

 


For most disks, the expectation is that sequential read >= sequential write >= random read >= random write. For an SSD you may observe no difference between random and sequential, while for an HDD the difference may be significant. Differences between read and write will vary from disk to disk.

As you are running this test, keep an eye on whether you are saturating the CPU. This might happen when your disk is very fast, for instance if you are using a Simple Space backed by 40 SSDs.

Run the baseline tests multiple times. If you see variance at this level then most likely it is coming from the disk, and it will affect the other tests as well. Below you can see the numbers I've collected on my hardware; the results match expectations.

 

 

| Unbuffered Write-Through, 8K, Queue Depth 8 | | Baseline 1 |
|---|---|---|
| sequential read | IOPS | 19906 |
| | MB/sec | 155 |
| sequential write | IOPS | 17311 |
| | MB/sec | 135 |
| random read | IOPS | 359 |
| | MB/sec | 2 |
| random write | IOPS | 273 |
| | MB/sec | 2 |

 

Baseline 2 - No CSV; Measure SMB Performance between Cluster Nodes


To run this test, online the clustered disk on one cluster node and assign it a drive letter, for example K:. Run the test from another node over SMB using an admin share; for instance, your path might look like \\Node1\K$ . In this case IO has to go over the following layers:

 


You need to be aware of SMB Multichannel and make sure that you are using only the NICs that you expect the cluster to use for traffic between the nodes. You can read more about SMB Multichannel in a clustered environment in this blog post:

 

http://blogs.msdn.com/b/emberger/archive/2014/09/15/force-network-traffic-through-a-specific-nic-wit...
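
To see which interfaces SMB Multichannel is actually using during the test, something like the following can help (a sketch run on the node acting as the SMB client; the server name and interface index are examples):

```powershell
# List the SMB Multichannel connections and the interfaces they use
Get-SmbMultichannelConnection

# Optionally restrict SMB traffic to a specific NIC for the test
New-SmbMultichannelConstraint -ServerName Node1 -InterfaceIndex 12

# Check whether RDMA is enabled on the adapters
Get-NetAdapterRdma
```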

If you have an RDMA network, or your disk is slower than what SMB can pump through all channels and you have a sufficiently large queue depth, then you might see Baseline 2 come close or even equal to Baseline 1. That means your bottleneck is the disk, not the network.

Run the baseline test several times. If you see variance at this level then most likely it is coming from the disk or the network, and it will affect the other tests as well. Assuming you have already sorted out the variance coming from the disk while collecting Baseline 1, you should now focus on variance caused by the network.

Here are the numbers I’ve collected on my hardware. To make it easier for you to compare I am repeating Baseline 1 numbers here.

 

 

| Unbuffered Write-Through, 8K, Queue Depth 8 | | Baseline 2 | Baseline 1 |
|---|---|---|---|
| sequential read | IOPS | 19821 | 19906 |
| | MB/sec | 154 | 155 |
| sequential write | IOPS | 810 | 17311 |
| | MB/sec | 6 | 135 |
| random read | IOPS | 353 | 359 |
| | MB/sec | 2 | 2 |
| random write | IOPS | 272 | 273 |
| | MB/sec | 2 | 2 |



In my case I verified that IO is going over RDMA, and the network indeed adds almost no latency, but there is a difference in sequential write IOPS compared to Baseline 1 that seems odd. First I looked at the performance counters:

Physical disk performance counters for Baseline 1



Physical disk and SMB Server Share performance counters for Baseline 2



SMB Client Share and SMB Direct Connection performance counters for Baseline 2



Observe that in both cases PhysicalDisk\Avg. Disk Queue Length is the same. That tells us SMB does not queue IO, and the disk has all the pending IOs all the time. Second, observe that PhysicalDisk\Avg. Disk sec/Transfer is 0 in Baseline 1 while in Baseline 2 it is 10 milliseconds. Huh!


This tells me that the disk got slower because requests came over SMB!?

The next step was to record a trace using the Windows Performance Toolkit (http://msdn.microsoft.com/en-us/library/windows/hardware/hh162962.aspx) with Disk IO for both Baseline 1 and Baseline 2. Looking at the traces I noticed that the disk service time for some reason got longer for Baseline 2! Then I also noticed that when requests were coming from SMB they hit the disk from two threads, while with my test utility all requests were issued from a single thread. Remember that we are investigating sequential write. Even though the test over SMB issues all writes from one thread in sequential order, SMB on the server was dispatching these writes to the disk using two threads, and sometimes the writes would get reordered. Consequently, the IOPS I am getting for sequential write are close to random write. To verify that, I reran the Baseline 1 test with two threads, and bingo! I got matching numbers.
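
A sketch of collecting such a trace with Windows Performance Recorder is below (profile names vary by Windows Performance Toolkit version; wpr -profiles lists what is available on your build, and the output path is an example):

```powershell
# Start a disk and file IO trace
wpr -start DiskIO -start FileIO

# ... run the DiskSpd variation you want to investigate ...

# Stop the trace and save it
wpr -stop C:\Traces\baseline2_seqwrite.etl
# Open the .etl file in Windows Performance Analyzer (wpa.exe), look at the Disk Usage graphs,
# group by thread, and sort by Init Time to see reordering and compare disk service time between runs.
```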

Here is what you would see in WPA for IO over SMB.



The average disk service time is about 8.1 milliseconds, and the IO time is about 9.6 milliseconds. The green and violet colors correspond to IO issued by different threads. If you look closely, expand the table, remove Thread Id from the grouping, and sort by Init Time, you can see how the IOs are interleaving and Min Offset is not strictly sequential:



Without SMB, when all IOs come from one thread, the disk service time is about 600 microseconds and the IO time is about 4 milliseconds.



If you expand and sort by Init Time you will see Min Offset is strictly increasing



In production, in most cases you will have a workload that is close to random IO; sequential IO only gives you a theoretical best-case scenario.

The next interesting question is why we do not see similar degradation for sequential read. The theory is that for reads the disk might be reading the entire track and keeping it in its cache, so even when reads are rearranged the track is already in the cache and reads on average are not affected. Since the disk cache is disabled for writes, they always have to hit the spindle and more often pay the seek cost.

Baseline 3 - No CSV; Measure SMB Performance between Compute Nodes and Cluster Nodes


If you are planning to run the workload and the storage on the same set of nodes then you can skip this step. If you are planning to disaggregate workload and storage and access the storage using a Scale-Out File Server (SOFS), then you should run the same test as Baseline 2, just select a compute node as the client, and make sure that over the network you are using the NICs that will handle the compute-to-storage traffic once you create the cluster.

Remember that for reliability reasons files opened over SOFS are always opened with write-through, so I suggest always adding write-through to your tests. As an option, you can create a classic singleton (non-SOFS) file server over a clustered disk, create a Continuously Available share on that file server, and run your test there. This makes sure traffic goes only over networks marked in the cluster as public, and because this is a CA share all opens will be write-through.
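
A sketch of setting that up (the role name, disk name, share name, path, and account are examples for illustration):

```powershell
# Create a classic (non-SOFS) clustered file server over the clustered disk
Add-ClusterFileServerRole -Name FS1 -Storage "Cluster Disk 1"

# Create a continuously available share; opens against a CA share are write-through
New-SmbShare -Name PerfTest -Path K:\PerfTest -ContinuouslyAvailable $true -FullAccess "CONTOSO\PerfAdmins"

# Then point the test at \\FS1\PerfTest from the compute node
```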

The layer diagram and performance considerations in this case are exactly the same as for Baseline 2.

CSVFS Case 1 - CSV Direct IO


Now add the disk to CSV.
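
For example, with PowerShell (the resource name is an example; use the name of your clustered disk):

```powershell
# Add the clustered disk to Cluster Shared Volumes
Add-ClusterSharedVolume -Name "Cluster Disk 1"

# The volume now appears under C:\ClusterStorage on every node
Get-ClusterSharedVolume
```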

You can run the same test on the coordinating node and on a non-coordinating node, and you should see the same results; the numbers should match Baseline 1. The length of the code path is the same, just with CSVFS instead of NTFS. The following diagram represents the layers IO will be going through:

 


Here are the number I’ve collected on my hardware, to make it easier for you to compare I am repeating Baseline 1 numbers here.


On coordinating node:

 

 

 

| Unbuffered Write-Through, 8K, Queue Depth 8 | | CSV Direct IO (coordinator) | Baseline 1 |
|---|---|---|---|
| sequential read | IOPS | 19808 | 19906 |
| | MB/sec | 154 | 155 |
| sequential write | IOPS | 17590 | 17311 |
| | MB/sec | 137 | 135 |
| random read | IOPS | 356 | 359 |
| | MB/sec | 2 | 2 |
| random write | IOPS | 273 | 273 |
| | MB/sec | 2 | 2 |



On non-coordinating node

 

 

| Unbuffered Write-Through, 8K, Queue Depth 8 | | CSV Direct IO (non-coordinator) | Baseline 1 |
|---|---|---|---|
| sequential read | IOPS | 19793 | 19906 |
| | MB/sec | 154 | 155 |
| sequential write | IOPS | 177880 | 17311 |
| | MB/sec | 138 | 135 |
| random read | IOPS | 359 | 359 |
| | MB/sec | 2 | 2 |
| random write | IOPS | 273 | 273 |
| | MB/sec | 2 | 2 |

 

CSVFS Case 2 - CSV File System Redirected IO on Coordinating Node


In this case we are not traversing the network, but we do traverse two file systems. If you are disk bound you should see numbers matching Baseline 1. If you have very fast storage and you are CPU bound then you will saturate the CPU a bit faster and will be about 5-10% below Baseline 1.

 


Here are the numbers I’ve got on my hardware. To make it easier for you to compare I am repeating Baseline 1 and Baseline 2 numbers here.

 

 

| Unbuffered Write-Through, 8K, Queue Depth 8 | | CSV FS Redirected IO (coordinator) | Baseline 1 | Baseline 2 |
|---|---|---|---|---|
| sequential read | IOPS | 19807 | 19906 | 19821 |
| | MB/sec | 154 | 155 | 154 |
| sequential write | IOPS | 5670 | 17311 | 810 |
| | MB/sec | 44 | 135 | 6 |
| random read | IOPS | 354 | 359 | 353 |
| | MB/sec | 2 | 2 | 2 |
| random write | IOPS | 271 | 273 | 272 |
| | MB/sec | 2 | 2 | 2 |


It looks like some IO reordering is happening in this case too, so the sequential write numbers are somewhere between Baseline 1 and Baseline 2. All the other numbers line up perfectly with expectations.

CSVFS Case 3 - CSV File System Redirected IO on Non-Coordinating Node


You can put a CSV into file system redirected mode using the Failover Cluster Manager UI



Or by using the PowerShell cmdlet Suspend-ClusterResource with the -RedirectedAccess parameter, as shown below.
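
A sketch (the resource name is an example):

```powershell
# Put the CSV into file system redirected mode
Suspend-ClusterResource -Name "Cluster Disk 1" -RedirectedAccess

# ... run the tests ...

# Return the CSV to normal (direct) IO
Resume-ClusterResource -Name "Cluster Disk 1"
```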

This is the longest IO path, where we are not only traversing two file systems but also going over SMB and the network. If you are network bound then your numbers should be close to Baseline 2. If your network is very fast and your bottleneck is storage then the numbers will be close to Baseline 1. If the storage is also very fast and you are CPU bound then the numbers should be 10-15% below Baseline 1.



Here are the numbers I’ve got on my hardware. To make it easier for you to compare I am repeating Baseline 1 and Baseline 2 numbers here.

 

 

 

 

 

| Unbuffered Write-Through, 8K, Queue Depth 8 | | CSV FS Redirected IO (non-coordinator) | Baseline 1 | Baseline 2 |
|---|---|---|---|---|
| sequential read | IOPS | 19793 | 19906 | 19821 |
| | MB/sec | 154 | 155 | 154 |
| sequential write | IOPS | 835 | 17311 | 810 |
| | MB/sec | 6 | 135 | 6 |
| random read | IOPS | 352 | 359 | 353 |
| | MB/sec | 2 | 2 | 2 |
| random write | IOPS | 273 | 273 | 272 |
| | MB/sec | 2 | 2 | 2 |



In my case the numbers match Baseline 2 and, in all cases except sequential write, are close to Baseline 1.

CSVFS Case 4 - CSV Block Redirected IO on Non-Coordinating Node


If you have a SAN then you can use LUN masking to hide the LUN from the node where you will run this test. If you are using Storage Spaces then a Mirrored Space is always attached only on the coordinator node, and any non-coordinator node will be in block redirected mode as long as you do not have a tiering heatmap enabled on this volume. See this blog post for more details on how Storage Spaces tiering affects the CSV IO mode.

 

Cluster Shared Volume Diagnostics

https://techcommunity.microsoft.com/t5/failover-clustering/cluster-shared-volume-diagnostics/ba-p/37...

Please note that CSV never uses Block Redirected IO on the coordinator node: since the disk is always attached on the coordinator node, CSV will always use Direct IO there. So remember to run this test on a non-coordinating node. If you are network bound then your numbers should be close to Baseline 2. If your network is very fast and your bottleneck is storage then the numbers will be close to Baseline 1. If the storage is also very fast and you are CPU bound then the numbers should be about 10-15% below Baseline 1.
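
Before running the test, you can confirm the IO mode each node is using with something like the following (the node name is an example; StateInfo shows Direct, FileSystemRedirected, or BlockRedirected):

```powershell
# Confirm that the node you test from really is in block redirected mode
Get-ClusterSharedVolumeState -Node Node2
```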

 


Here are the numbers I’ve got on my hardware. To make it easier for you to compare I am repeating Baseline 1 and Baseline 2 numbers here.

 

 

 

 

 

| Unbuffered Write-Through, 8K, Queue Depth 8 | | CSV Block Redirected IO (non-coordinator) | Baseline 1 | Baseline 2 |
|---|---|---|---|---|
| sequential read | IOPS | 19773 | 19906 | 19821 |
| | MB/sec | 154 | 155 | 154 |
| sequential write | IOPS | 820 | 17311 | 810 |
| | MB/sec | 6 | 135 | 6 |
| random read | IOPS | 352 | 359 | 353 |
| | MB/sec | 2 | 2 | 2 |
| random write | IOPS | 274 | 273 | 272 |
| | MB/sec | 2 | 2 | 2 |



In my case the numbers match Baseline 2 and, except for sequential write, are very close to Baseline 1.

Scale-out File Server (SoFS)


To test Scale-Out File Server you need to create the SOFS resource using Failover Cluster Manager or PowerShell, and add a share that maps to the same CSV volume you have been using for the tests so far. Now your baselines are the CSVFS cases. With SOFS, SMB will deliver IO to CSVFS on a coordinating or non-coordinating node (depending on where the client is connected; you can use the PowerShell cmdlet Get-SmbWitnessClient to learn client connectivity), and then it is up to CSVFS to deliver the IO to the disk. The path CSVFS will take is predictable, but depends on the nature of your storage and the current connectivity. You will need to select the appropriate baseline from CSV Cases 1-4.
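
A sketch of creating the SOFS role and share with PowerShell (the role name, share name, path, and account are examples):

```powershell
# Create the Scale-Out File Server role and a share on the CSV volume used above
Add-ClusterScaleOutFileServerRole -Name SOFS1
New-Item -ItemType Directory -Path C:\ClusterStorage\Volume1\VMs
New-SmbShare -Name VMs -Path C:\ClusterStorage\Volume1\VMs -FullAccess "CONTOSO\HyperVHosts"

# See which cluster node each SMB client is connected to
Get-SmbWitnessClient
```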

If the numbers are similar to your CSV baseline then you know that SMB above CSV is not adding overhead, and you can look at the numbers collected for the CSV baseline to detect where the bottleneck is. If the numbers are lower compared to the CSV baseline then your client network is the bottleneck, and you should validate that the difference matches the difference between Baseline 3 and Baseline 1.

 



Summary


In this blog post we looked at how to tell whether CSVFS read and write performance is at expected levels. You can achieve that by running performance tests before and after adding the disk to CSV. Use the 'before' numbers as your baseline, then add the disk to CSV and test the different IO dispatch modes. Compare the observed numbers to the baselines to learn which layer is your bottleneck.

Thanks!
Vladimir Petter
Principal Software Engineer
High-Availability & Storage
Microsoft

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

 

Cluster Shared Volume (CSV) Inside Out
https://techcommunity.microsoft.com/t5/failover-clustering/cluster-shared-volume-csv-inside-out/ba-p...

 

Cluster Shared Volume Diagnostics
https://techcommunity.microsoft.com/t5/failover-clustering/cluster-shared-volume-diagnostics/ba-p/37...

 

Cluster Shared Volume Performance Counters
https://techcommunity.microsoft.com/t5/failover-clustering/cluster-shared-volume-performance-counter...

 

Cluster Shared Volume Failure Handling
https://techcommunity.microsoft.com/t5/failover-clustering/cluster-shared-volume-failure-handling/ba...

 

Troubleshooting Cluster Shared Volume Auto-Pauses – Event 5120
https://techcommunity.microsoft.com/t5/failover-clustering/troubleshooting-cluster-shared-volume-aut...

 

Troubleshooting Cluster Shared Volume Recovery Failure – System Event 5142
https://techcommunity.microsoft.com/t5/failover-clustering/troubleshooting-cluster-shared-volume-rec...
