Heya folks, Ned here again. A customer contacted us about a strange behavior they were seeing when copying large files to a Windows Server 2019 cluster using SMB 3.1.1. Around every 5GB transferred, the copy would temporarily pause for a few seconds, then start sending again until the next 5GB, and then pause again, and so on until it was done. There was no error, it was just mysteriously pausing and resuming, again and again. As usual, there were lots of suspects in a complex system - the network, the drives, third party drivers, Storage Spaces Direct, clustering, SMB, the file system, antivirus, filter drivers, etc. There was no improvement using robocopy versus File Explorer and the disks weren't showing any errors. Extra weirdly, if they copied to a similarly configured Windows Server 2012 R2 destination, they didn't see the behavior! And if it was a bug, why weren't lots of customers reporting this issue? This large file copy scenario probably happens countless millions of times a day!
Well, the kernel team explained it and I'm here to share what I learned.
Years ago they had a case where a customer saw timeouts when copying large files from a fast IO source to slow IO destination on Windows Server 2012 R2 (hmmmmm) over SMB. Back then, if the dirty page threshold in the cache was reached, the subsequent writes used write-through behavior (i.e. where the disk is told it can't cache data, it must commit it). This would cause a very large flush of data to drives, and if the storage was slow, it could lead to long delays and even a timeout for the client. Which the customer was seeing with their particular hardware configuration.
So in Windows Server 2016, the kernel team added a new mitigation: a separate dirty page threshold for remote writes and a new inline flush forced when the threshold was crossed. If there was heavy write IO activity, this had a new effect of occasional short slowdowns but meant that actual timeouts - and therefore total copy failure - were now extremely unlikely. This remote dirty page threshold was 5 GB per file by default (oho, sound familiar?).
They added a registry setting and DWORD value to control all this on the destination server, so I'll quote from their content as it's math and details time:
This threshold can be controlled with the following DWORD registry value name:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\
RemoteFileDirtyPageThreshold
If the default size of 5 GB does not work well for your configuration, it is recommended to try increasing the limit in 256-MB increments until performance is satisfactory. Please note the following:
- A reboot is required for changes to this registry value to take effect.
- The units of RemoteFileDirtyPageThreshold are number of pages (with page size as managed by Cache Manager). This means it should be set to the desired size in bytes, divided by 4096.
- Recommended values are 128MB <= N <= 50% of available memory.
- This threshold can be disabled completely by setting it to -1. This is not recommended as it can result in timeouts for remote connections.
For example, if you want to set the limit to 10 GiB, that's 10,737,418,240 bytes / 4096 = 2,621,440 pages, which is decimal DWORD: 2621440
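If you'd rather not do that division by hand, here is a minimal Python sketch of the same math. The function names (threshold_pages, reg_add_command) are made up for illustration and are not part of any Microsoft tool. The sketch assumes the 4096-byte page size from the quoted guidance, treats "128MB" as 128 MiB, and takes the available memory figure as something you supply yourself. The second function only builds a reg.exe command line as a string; it does not touch the registry.

```python
PAGE_SIZE = 4096   # bytes per page, per the quoted guidance
MIB = 1024 ** 2
GIB = 1024 ** 3

# Key and value name as given in the quoted guidance (HKLM = HKEY_LOCAL_MACHINE).
KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
VALUE_NAME = "RemoteFileDirtyPageThreshold"


def threshold_pages(size_bytes, available_memory_bytes=None):
    """Convert a desired threshold in bytes to a page count for the DWORD.

    Hypothetical helper: enforces the recommended range of
    128 MiB <= N <= 50% of available memory (upper bound checked only
    when available_memory_bytes is supplied by the caller).
    """
    if size_bytes % PAGE_SIZE != 0:
        raise ValueError("size must be a multiple of the 4096-byte page size")
    if size_bytes < 128 * MIB:
        raise ValueError("below the recommended minimum of 128 MiB")
    if available_memory_bytes is not None and size_bytes > available_memory_bytes // 2:
        raise ValueError("above the recommended maximum of 50% of available memory")
    return size_bytes // PAGE_SIZE


def reg_add_command(pages):
    """Build (but do not run) a reg.exe command that sets the value."""
    return f'reg add "{KEY}" /v {VALUE_NAME} /t REG_DWORD /d {pages} /f'


if __name__ == "__main__":
    pages = threshold_pages(10 * GIB)   # the 10 GiB example from the text
    print(pages)
    print(reg_add_command(pages))
```

Run the printed command from an elevated prompt on the destination server only after you've checked the value yourself, and remember a reboot is still required before it takes effect.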
None of this has anything specifically to do with SMB; it's just the most likely thing to be (1) copying remotely and (2) fast enough to keep up with the throughput of the client machine's high-speed drives, getting into this situation of being "too good" at copying for the destination's taste. When I say "client" and "server" here I mean architecture, not operating systems; for instance, a Windows Server 2022 machine copying data to another Windows Server 2022 machine is still an SMB client and an SMB server.
As you'd guess, when they looked at the customer machines, the client copying the data had extremely fast drives with excellent READ throughput and low latency, but the destination server had slower drives and worse WRITE performance. As soon as they changed the default value their pauses went away and they saved... well, just a few extra seconds, naturally :D. Unless your files are truly massive and your pauses are excruciating, it's probably not worth changing the setting to gain a few seconds and risk returning to the original timeout behavior. Most customers aren't reporting this because their SMB client machine's IO isn't faster than their SMB server machine's, especially when their congested Wi-Fi, 1Gbps, or 10Gbps networks get figured in. So don't feel like you need to run out there and touch every server you own, ya wackadoos.
I didn't want people trying to troubleshoot a working SMB, so we decided to update the Performance Tuning Guidelines for Windows Server 2022 so that the Troubleshoot Cache and Memory Manager Performance Issues section explained the issue and the Performance tuning for SMB file servers section gave good tuning advice for IT pros. And now they do.
Oh, and let me take the opportunity here to remind you that if you're into copying very large files, you should try SMB compression. We added that in Windows Server 2022 and Windows 11 and it is extremely dope. I have another blog post coming soon with more examples of said dope'itude, but you can get a feel for it with this demo:
Until next time,
Ned "dirty cash" Pyle