Storage at Microsoft
The mystery of the slow file copy from the fast computer

NedPyle
Former Employee
Nov 05, 2021

Heya folks, Ned here again. A customer contacted us about a strange behavior they were seeing when copying large files to a Windows Server 2019 cluster using SMB 3.1.1. Around every 5GB transferred, the copy would temporarily pause for a few seconds, then start sending again until the next 5GB, and then pause again, and so on until it was done. There was no error, it was just mysteriously pausing and resuming, again and again. As usual, there were lots of suspects in a complex system - the network, the drives, third party drivers, Storage Spaces Direct, clustering, SMB, the file system, antivirus, filter drivers, etc. There was no improvement using robocopy versus File Explorer and the disks weren't showing any errors. Extra weirdly, if they copied to a similarly configured Windows Server 2012 R2 destination, they didn't see the behavior! And if it was a bug, why weren't lots of customers reporting this issue? This large file copy scenario probably happens countless millions of times a day!

 

 

Well, the kernel team explained it and I'm here to share what I learned. 

 

Years ago they had a case where a customer saw timeouts when copying large files from a fast IO source to slow IO destination on Windows Server 2012 R2 (hmmmmm) over SMB. Back then, if the dirty page threshold in the cache was reached, the subsequent writes used write-through behavior (i.e. where the disk is told it can't cache data, it must commit it). This would cause a very large flush of data to drives, and if the storage was slow, it could lead to long delays and even a timeout for the client. Which the customer was seeing with their particular hardware configuration.

 

So in Windows Server 2016, the kernel team added a new mitigation: a separate dirty page threshold for remote writes and a new inline flush forced when the threshold was crossed. If there was heavy write IO activity, this had the side effect of occasional short slowdowns, but it meant that actual timeouts - and therefore total copy failure - were now extremely unlikely. This remote dirty page threshold was 5 GB per file by default (oho, sound familiar?).

 

They added a registry setting and DWORD value to control all this on the destination server, so I'll quote from their content as it's math and details time:

 

This threshold can be controlled with the following DWORD registry value name:

 

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\
RemoteFileDirtyPageThreshold

 

If the default size of 5 GB does not work well for your configuration, it is recommended to try increasing the limit in 256-MB increments until performance is satisfactory. Please note the following:

 

  • A reboot is required for changes to this registry value to take effect.
  • The units of RemoteFileDirtyPageThreshold are number of pages (with page size as managed by Cache Manager). This means it should be set to the desired size in bytes, divided by 4096.
  • Recommended values are 128MB <= N <= 50% of available memory.
  • This threshold can be disabled completely by setting it to -1. This is not recommended as it can result in timeouts for remote connections.

For example, if you want to set the limit to 10GiB that's 10,737,418,240 bytes / 4096 = 2,621,440 which is decimal DWORD: 2621440
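
To make that arithmetic repeatable, here is a minimal Python sketch that converts a desired threshold in bytes to the page count the DWORD expects. Treat it as an illustration, not official tooling. The function names are made up for this example. The 128 MiB lower-bound check assumes the "128MB" in the recommended range means MiB. The upper bound of 50% of available memory is not checked, because it depends on the destination server. The helper only builds a `reg add` command string for you to review and run yourself from an elevated prompt on the destination server; it does not touch the registry.

```python
PAGE_SIZE = 4096          # bytes per page, per the quoted guidance
MIB = 1024 ** 2
GIB = 1024 ** 3

# Key path and value name are taken from the quoted guidance above.
KEY_PATH = r"HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
VALUE_NAME = "RemoteFileDirtyPageThreshold"


def threshold_pages(size_bytes: int) -> int:
    """Convert a desired threshold in bytes to a page count (hypothetical helper)."""
    if size_bytes < 128 * MIB:
        # Assumption: the recommended minimum of "128MB" is read as 128 MiB.
        raise ValueError("below the recommended minimum of 128 MiB")
    if size_bytes % PAGE_SIZE != 0:
        raise ValueError("size must be a whole number of 4096-byte pages")
    pages = size_bytes // PAGE_SIZE
    if pages > 0xFFFFFFFF:
        raise ValueError("page count does not fit in a DWORD")
    return pages


def reg_add_command(size_bytes: int) -> str:
    """Build (but do not run) a reg.exe command that sets the threshold."""
    return (
        f'reg add "{KEY_PATH}" /v {VALUE_NAME} '
        f"/t REG_DWORD /d {threshold_pages(size_bytes)} /f"
    )


if __name__ == "__main__":
    print(threshold_pages(5 * GIB))    # default 5 GiB -> 1310720 pages
    print(threshold_pages(10 * GIB))   # the 10 GiB example -> 2621440 pages
    print(threshold_pages(256 * MIB))  # one 256 MiB tuning step -> 65536 pages
    print(reg_add_command(10 * GIB))
```

Each 256 MiB increment suggested above is 65,536 pages, so stepping up from the 5 GiB default means adding 65536 to 1310720 each time. Remember the reboot before testing each new value.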

 

None of this has anything specifically to do with SMB, it's just the most likely thing to be (1) remote copying and (2) fast enough to keep up with the client machine's high speed drives' throughput to server, getting into this situation of being "too good" at copying for the destination's taste. When I say "client" and "server" here I mean architecture, not operating systems; for instance, a Windows Server 2022 copying data to another Windows Server 2022 is still an SMB client and SMB server.

 

As you'd guess, when they looked at the customer machines, the client copying the data had extremely fast drives with excellent READ throughput and low latency, but the destination server had slower drives and WRITE performance. As soon as they changed the default value their pauses went away and they saved... well, just a few extra seconds, naturally :D. Unless your files are truly massive and your pauses are excruciating, it's probably not worth changing the setting to gain a few seconds and risk returning to the original timeout behavior. Most customers aren't reporting this because their SMB client machine IO isn't faster than their SMB server machine IO, especially when their congested Wi-Fi, 1 Gbps, or 10 Gbps networks get figured in. So don't feel like you need to run out there and touch every server you own, ya wackadoos.

 

I didn't want people trying to troubleshoot a working SMB, so we decided to update the Performance Tuning Guidelines for Windows Server 2022 so that the Troubleshoot Cache and Memory Manager Performance Issues section explained the issue and the Performance tuning for SMB file servers section gave good tuning advice for IT pros. And now they do. 

 

Oh, and let me take the opportunity here to remind you that if you're into copying very large files, you should try SMB compression. We added that in Windows Server 2022 and Windows 11 and it is extremely dope. I have another blog post coming soon with more examples of said dope'itude, but you can get a feel for it with this demo:

 

 

Until next time,

 

Ned "dirty cash" Pyle

 

Updated Nov 07, 2022
Version 3.0

28 Comments

  • DJ8014
    Copper Contributor

    Tagging NedPyle (as I probably should have done the first time around)...

     

    I haven't been able to get this registry edit to take effect (still seems capped at 5GB), but I notice one difference between your quoted language and their current site:

     

    Your quote:

    This threshold can be controlled with the following DWORD registry value name:

    Current https://learn.microsoft.com/en-us/windows-server/administration/performance-tuning/subsystem/cache-memory-management/troubleshoot#remote-file-dirty-page-threshold-is-consistently-exceeded:

    This threshold can be controlled with the following regkey

     

    Does this mean it should be a key value (i.e. registry folder value) rather than a DWORD value? That doesn't seem to work for me either, though. It still drops to 0 once the modified memory hits 5 GB. This screenshot was taken as it dropped from 5 GB (it's at 4 GB). The transfer resumes when it gets back down to under 1 GB.

  • NedPyle
    Former Employee

    DBR14 I wish I could help more, but it's a question for Azure networking support at this point. The intermittency is especially telling of it being a networking issue. SMB is just snitching on a lower stack problem.

  • DBR14
    Iron Contributor

    NedPyle that's where I get stuck. Both servers are in Azure, in the same VNet; the only difference is that the AVD environment sits behind an Azure Firewall whereas the File Server doesn't (different subnets). The error only pops up intermittently: sometimes we see it twice a day, sometimes not at all.

     

  • NedPyle
    Former Employee

    DBR14 From the event, the client is losing connectivity to the server and when it reconnects to some files, it's failing the durable handle reconnect because too much time has passed and the server is no longer preserving the handle (120 seconds on WS2022 - "Reason: Reconnect durable file").

     

    I would suggest checking network connectivity. SMBClient/Connectivity channel on the AVD machines might have some more errors, but the problem will probably be networking itself, with SMB a victim. 

  • DBR14
    Iron Contributor

    NedPyle What are your thoughts on this scenario? I'm going to pump up our file server with this tonight, moving from the default to 5.5 GB.

    Reason being, we have a recently migrated file server that is a Windows Server 2022 Azure Edition which is serving up files to our AVD environment consisting of 20ish+ hosts with 10+ users on each, on the newest feature/quality updates of Win10 Multisession, which are just hammering this thing. As of mid-May we're getting weird intermittent blocks of time where people get an error within a piece of software that the vendor has basically stated is a "lock" error. It's not a backup, EDR/Malware Protection, Windows Defender, or VSS, as this has all been tuned not to do any sort of interference on this data drive, which is separate from the OS. However it is large, a 10 TB SSD (not Azure Premium SSD) -- I originally was going to blame the 2022 Server OS but I found the same errors on its predecessor server, which was a 2012 R2, around mid-May 2023. This "setup" has been non-problematic for 2+ years, and for it to start in May is odd because that was after our busiest workload time of year by about a month.

     

    The errors we're seeing are:

    SMBServer - Operational - Error 1016 -> Occurring on the Windows 2022 Server

    Client Name: \\AVDHOSTIP
    Client Address: AVDHOSTIP:RANDOM_PORT
    User Name: DOMAIN\username
    Session ID: 0x580024000A09
    Share Name: FileShareName
    File Name: FileNameonFileServer
    Resume Key: {#####-####-####-####-#########}
    Status: Object Name not found. (0xC0000034)
    RKF Status: STATUS_SUCCESS (0x0)
    Durable: false
    Resilient: false
    Persistent: false
    Reason: Reconnect durable file

     

    Also considering adding some fine tuning to the AVD Hosts, the FileNotFoundCacheEntriesMax interests me.

    Client tuning example

    The general tuning parameters for client computers can optimize a computer for accessing remote file shares, particularly over some high-latency networks (such as branch offices, cross-datacenter communication, home offices, and mobile broadband). The settings are not optimal or appropriate on all computers. You should evaluate the impact of individual settings before applying them.

    Parameter                     Value   Default
    DisableBandwidthThrottling    1       0
    FileInfoCacheEntriesMax       32768   64
    DirectoryCacheEntriesMax      4096    16
    FileNotFoundCacheEntriesMax   32768   128
    MaxCmds                       32768   15