Business or Enterprise DFS-R fails

We worked with MS Professional Services to design a Windows Server 2016 file server replacement for our NAS. One of our concerns was whether Windows Server could handle larger partitions in the 1-10 TB range.

MS assured us that this would not be an issue and instructed us to create multiple server pairs using DFS Namespaces and DFS replication for redundancy.

Following their instructions and the documentation for DFS (which is quite short and simple), we created the (virtual) servers and partitions. The primary servers and the DFS Namespaces worked just fine (except for VSS, which is a separate issue). However, DFS Replication (DFSR) kept failing.

We contacted MS Premier support, who checked everything and said it was configured correctly and should be working. We created an additional test server with new partitions, and it worked until we populated it. Then DFS-R failed again: it would not complete replication, and in some cases it would not even start.

 

As long as the partition held fewer than a couple thousand files, it was fine, but as soon as we populated it with our normal 100 GB to 2 TB of data (100,000 to 5,000,000 files), replication failed completely. MS support was unable to get it working despite 30-40 hours of working with our tech.

 

This also occurred with one of our sister entities, who tried the same simple design with much less data. They had it much worse: they started using the copied files, expecting them to replicate. Since DFS was sending different users to different copies of the files on the server pairs, the files became badly out of sync. It is costing them hundreds of work hours to manually fix each paired file set. VERY painful and expensive.

 

At this point, it appears that DFS Replication is not enterprise- or business-ready. It is not consistent or reliable, and it fails very quietly, resulting in serious business impact.

 

MS does not appear to have any solutions to the issue. DFS-R simply cannot be depended on.

2 Replies

Additional information from MS Support.

Helpful for small DFS-R instances, but does not apply to large file sets:

 

Elaborate on ConflictAndDeleted and Pre-Existing:

https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2003/cc773238(v=ws.... - DFS Replication: Frequently Asked Questions (FAQ).

Excerpt:

What happens if the primary member suffers a database loss during initial replication?

During initial replication, the primary member's files will always take precedence in the conflict resolution that occurs if the receiving members have different versions of files on the primary member. The primary member designation is stored in Active Directory Domain Services, and the designation is cleared after the primary member is ready to replicate, but before all members of the replication group replicate.

What happens when two users simultaneously update the same file on different servers?

When DFS Replication detects a conflict, it uses the version of the file that was saved last. It moves the other file into the DfsrPrivate\ConflictandDeleted folder (under the local path of the replicated folder on the computer that resolved the conflict). It remains there until Conflict and Deleted folder cleanup, which occurs when the Conflict and Deleted folder exceeds the configured size or DFS Replication encounters an Out of disk space error. The Conflict and Deleted folder is not replicated, and this method of conflict resolution avoids the problem of morphed directories that was possible in FRS.

When a conflict occurs, DFS Replication logs an informational event to the DFS Replication event log. This event does not require user action for the following reasons:

  • It is not visible to users (it is visible only to server administrators).
  • DFS Replication treats the Conflict and Deleted folder as a cache. When a quota threshold is reached, it cleans out some of those files. There is no guarantee that conflicting files will be saved.
  • The conflict could reside on a server different from the origin of the conflict.
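The conflict handling described in the excerpt above can be sketched roughly as follows. This is only a toy model, not DFSR's actual implementation: the `FileVersion` and `ConflictCache` names, the timestamp field, and the oldest-first eviction order are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileVersion:
    server: str      # member on which this version was saved (hypothetical field)
    saved_at: float  # last-save time in seconds (hypothetical field)
    size: int        # size in bytes


@dataclass
class ConflictCache:
    """Toy stand-in for the ConflictandDeleted folder: a size-bounded cache."""
    quota_bytes: int
    entries: List[FileVersion] = field(default_factory=list)

    def add(self, loser: FileVersion) -> None:
        self.entries.append(loser)
        # Assumption: once the quota is exceeded, the oldest entries go first.
        # Evicted versions are lost, which is why nothing is guaranteed to be kept.
        while self.entries and sum(e.size for e in self.entries) > self.quota_bytes:
            self.entries.pop(0)


def resolve(a: FileVersion, b: FileVersion, cache: ConflictCache) -> FileVersion:
    """Keep the version that was saved last; move the other into the cache."""
    winner, loser = (a, b) if a.saved_at >= b.saved_at else (b, a)
    cache.add(loser)
    return winner
```

The point of the sketch is that the losing edit is only parked in a cache. If users on both servers keep editing the same files, older losing versions get pushed out and cannot be recovered from that folder.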

 

https://blogs.technet.microsoft.com/askds/2010/01/05/understanding-dfsr-conflict-algorithms-and-doin... - Understanding DFSR conflict algorithms (and doing something about conflicts).

 

Troubleshooting steps for events 4304 & 5002:

 

Troubleshooting DFSR Event 4304:

 

https://blogs.technet.microsoft.com/askds/2007/10/05/top-10-common-causes-of-slow-replication-with-d... - Top 10 Common Causes of Slow Replication with DFSR.

Excerpt:

Many applications can create a large number of spurious sharing violations, because they create temporary files that shouldn’t be replicated. If they have a predictable extension, you can prevent DFSR from trying to replicate them by setting an exception in DFSMGMT.MSC. The default file filter excludes files matching ~*, *.bak, and *.tmp, so for example the Microsoft Office temporary files (~*) are excluded by default.
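To get a feel for which file names the default filter quoted above would skip, here is a small sketch using Python's `fnmatch` module. It only approximates the matching: DFSR's own wildcard rules may differ in edge cases (for example, `fnmatch` treats `[` and `]` specially), and the `is_excluded` helper is a name made up for this example.

```python
from fnmatch import fnmatchcase

# Default file filter as quoted in the excerpt above.
DEFAULT_FILE_FILTER = ("~*", "*.bak", "*.tmp")


def is_excluded(file_name: str, patterns=DEFAULT_FILE_FILTER) -> bool:
    """Return True if the file name matches any filter pattern (case-insensitive)."""
    name = file_name.lower()
    return any(fnmatchcase(name, pattern.lower()) for pattern in patterns)


if __name__ == "__main__":
    for candidate in ("~$report.docx", "DATA.TMP", "old.bak", "report.docx"):
        print(candidate, is_excluded(candidate))
```

Running a listing of a share through a check like this can show whether an application's temporary files already fall under the default filter or need their own exception.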

 

There are two kinds of DFSR events for sharing violations: Event ID 4302 and Event ID 4304. The DFSR Diagnostics combines both kinds of events, and reports them only as "Event ID 4302". The following information explains more about these two kinds of events:

Event ID 4302: A local sharing violation occurs when the service cannot receive an updated file because the local file is being used. This occurs on the "receive" side of the file change. The file is already replicated. However, it cannot be moved from the installing directory to the final destination.

Event ID 4304: The service cannot stage a file for replication because of a sharing violation. This occurs on the "send" side of the file change. DFSR wants to stage or copy the file for replication. However, an exclusive lock prevents this. (In our case)

 

https://blogs.technet.microsoft.com/filecab/2006/05/15/troubleshooting-erroneous-sharing-violations-... - Troubleshooting erroneous sharing violations in the DFS Replication health report.

https://blogs.technet.microsoft.com/askds/2009/02/20/understanding-the-lack-of-distributed-file-lock... - Understanding (the Lack of) Distributed File Locking in DFSR.

 

Troubleshooting DFSR Event 5002:

 

Event ID 5002 is a very common DFSR warning event that is logged when connection failures occur. There are different root causes of this event, and in each case the event must be evaluated individually. There are some common root causes and resolutions listed below which should be considered only after the error codes within the events are understood.

 

Common errors returned with Event ID 5002

This section lists the error portion of the "Additional Information" section for the Event ID 5014 with known solutions. The majority of the events fall into two main categories: Remote Procedure Call (RPC) failures and errors returned by the service itself.

Remote Procedure Call (RPC) errors

  • Error: 1723 (The RPC server is too busy to complete this operation)
  • Error: 1726 (The remote procedure call failed)
  • Error: 1727 (The remote procedure call failed and did not execute)
  • Error: 1753 (There are no more endpoints available from the endpoint mapper.)  (In our case)

 

Common Solutions for RPC errors:

  1. Make sure all DFSR servers are patched with the latest DFSR releases.

Note: When troubleshooting DFSR always confirm all servers are up to date with latest DFSR hotfixes -

List of currently available hotfixes for Distributed File System (DFS) technologies in Windows Server 2003 and in Windows Server 2003 R2: http://support.microsoft.com/kb/958802  

List of currently available hotfixes for Distributed File System (DFS) technologies in Windows Server 2008 and in Windows Server 2008 R2: http://support.microsoft.com/kb/968429

Updates to DFSR are released as needed, and all the servers using DFSR should be maintained as part of regular patching and maintenance schedules.

  2. Disable Task Offloading on all members of the Replication Group - http://technet.microsoft.com/en-us/library/cc959732.aspx
     a. Windows 2003 R2 (Reboot necessary) - http://support.microsoft.com/default.aspx?scid=kb;EN-US;904946
     b. Windows 2008 and 2008 R2 (No reboot necessary):
        i. Run this command from an elevated command prompt:

           netsh int ip set global taskoffload=disabled

        ii. Disable and then re-enable the network interface card.
        iii. To confirm the command completed successfully, run:

           netsh int ip show offload

  3. Check for WAN accelerators. Exclude DFSR traffic either by IP address or UUID (897e2e5f-93f3-4376-9c9c-fd2277495c27 Frs2 Service)
  4. Check firewall rules. Make sure DFSR traffic is not being blocked
  5. Check max MTU size on your network and adjust accordingly so that servers have a common max MTU size. (Link KB)
  6. Update network card drivers to the latest versions.

 

Service Error codes

These errors are more commonly service related. They can also be caused by AD replication latency and RPC issues.

  • Error: 9026 (The connection is invalid)
  • Error: 9033 (The request was cancelled by a shutdown)
  • Error: 9027 (A failure was reported by the remote partner)
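When going through many 5002 events, it can help to sort them by the error code in the "Additional Information" section before picking a fix. The sketch below does that for the codes listed above. It assumes the event text contains the code in the form `Error: <number> (...)`, as in the lists above; the `classify_event_text` helper and the category labels are made up for this example.

```python
import re
from typing import Optional, Tuple

# Codes and descriptions copied from the lists above.
RPC_ERRORS = {
    1723: "The RPC server is too busy to complete this operation",
    1726: "The remote procedure call failed",
    1727: "The remote procedure call failed and did not execute",
    1753: "There are no more endpoints available from the endpoint mapper",
}
SERVICE_ERRORS = {
    9026: "The connection is invalid",
    9033: "The request was cancelled by a shutdown",
    9027: "A failure was reported by the remote partner",
}

# Assumed layout of the error line inside the event text.
ERROR_RE = re.compile(r"Error:\s*(\d+)")


def classify_event_text(text: str) -> Optional[Tuple[int, str, str]]:
    """Return (code, category, description), or None if no error code is found."""
    match = ERROR_RE.search(text)
    if match is None:
        return None
    code = int(match.group(1))
    if code in RPC_ERRORS:
        return (code, "rpc", RPC_ERRORS[code])
    if code in SERVICE_ERRORS:
        return (code, "service", SERVICE_ERRORS[code])
    return (code, "unknown", "")
```

Events that come back as "rpc" point toward the network-side checks (task offloading, WAN accelerators, firewall, MTU, drivers), while "service" results point toward the service itself, AD replication latency, or the partner.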

 

Describe the data difference:

 

There could be multiple reasons for the data inconsistency, as mentioned above (ConflictAndDeleted, Pre-Existing, Sharing Violations, etc.), as well as those covered in the articles below:

https://social.technet.microsoft.com/wiki/contents/articles/406.dfsr-does-not-replicate-temporary-fi... - DFSR Does Not Replicate Temporary Files.

https://blogs.technet.microsoft.com/askds/2007/09/04/wheres-my-file-root-cause-analysis-of-frs-and-d... - Where’s my file? Root cause analysis of FRS and DFSR data deletion.
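To describe the data difference between two members concretely, one option is to compare the two replicated folder trees directly. The sketch below is a generic comparison script, not a DFSR tool: it hashes every file under two roots and reports files that exist on only one side or whose contents differ. The function names are made up for this example, and skipping the DfsrPrivate folder is an assumption based on the FAQ excerpt saying that folder is not replicated.

```python
import hashlib
import os
from typing import Dict, List, Tuple

# DFSR's private folder lives under the replicated folder but is not replicated.
SKIP_DIRS = {"dfsrprivate"}


def hash_tree(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests: Dict[str, str] = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped folders in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d.lower() not in SKIP_DIRS]
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            digests[os.path.relpath(full, root)] = digest.hexdigest()
    return digests


def compare_trees(root_a: str, root_b: str) -> Tuple[List[str], List[str], List[str]]:
    """Return (only in A, only in B, present in both but different)."""
    a, b = hash_tree(root_a), hash_tree(root_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differ = sorted(path for path in set(a) & set(b) if a[path] != b[path])
    return only_a, only_b, differ
```

Hashing millions of files over the network is slow, so a run like this is better done per share or per subfolder. It also does not say which copy is the right one; that still has to be decided file by file.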

 

Also, as we’re using DFSN (Namespace) in conjunction with DFSR (Replication), we need to consider the articles below as well:

 

https://blogs.technet.microsoft.com/askds/2012/07/24/common-dfsn-configuration-mistakes-and-oversigh... - Common DFSN Configuration Mistakes and Oversights.

 

I used DFS in my last job, and it's not great for large numbers of files. Replication is slow, and if you have any issues it can be a real headache to fix. I would suggest looking up Ned Pyle's articles on DFS; they are really helpful. You could possibly look at cloning to speed up the initial sync of the volumes: https://blogs.technet.microsoft.com/filecab/2013/08/21/dfs-replication-initial-sync-in-windows-serve...

 

You may want to look into Storage Replica as an alternative replication method; again, Ned Pyle has good articles on this.