Oct 29 2018 09:13 AM
We worked with MS Professional Services to design a Windows Server 2016 file server replacement for our NAS. One of our concerns was whether Windows Server could handle larger partitions in the 1-10 TB range.
MS assured us that this would not be an issue and instructed us to create multiple server pairs using DFS Namespaces and DFS replication for redundancy.
Following their instructions and the documentation for DFS (which is quite short and simple), we created the (virtual) servers and partitions. The primary servers and the DFS Namespaces worked just fine (except for VSS, a separate issue). However, DFS Replication (DFSR) kept failing.
We contacted MS Premier support, who checked everything and said it was configured correctly and everything should be working. We created an additional test server with new partitions and it worked, until we populated it. Then DFS-R failed again: it would not complete replication. In some cases, it would not even start.
As long as the partition had under a couple thousand files, it was fine, but as soon as we populated it with the normal 100GB to 2TB of data (100,000 to 5,000,000 files), it completely and totally failed. MS support was unable to get it working despite 30-40 hours working with our tech.
This also occurred with one of our sister entities, who tried the same simple design with much less data. They had it much worse: they started using the copied files, expecting them to replicate. Since DFS was sending different users to different files in the server pairs, the files became badly out of sync. It is costing them hundreds of work hours to manually fix each paired file set. VERY painful and expensive.
At this point, it appears that DFS Replication is not Enterprise or business ready. It is not consistent or reliable, and fails very quietly, resulting in serious business impact.
MS does not appear to have any solutions to the issue. DFS-R simply cannot be depended on.
Oct 29 2018 09:16 AM
Additional information from MS Support.
Helpful for small DFS-R instances, but does not apply to large file sets:
Elaborate on ConflictAndDeleted and Pre-Existing:
https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2003/cc773238(v=ws.... - DFS Replication: Frequently Asked Questions (FAQ).
Excerpt:
What happens if the primary member suffers a database loss during initial replication?
During initial replication, the primary member's files will always take precedence in the conflict resolution that occurs if the receiving members have different versions of files on the primary member. The primary member designation is stored in Active Directory Domain Services, and the designation is cleared after the primary member is ready to replicate, but before all members of the replication group replicate.
What happens when two users simultaneously update the same file on different servers?
When DFS Replication detects a conflict, it uses the version of the file that was saved last. It moves the other file into the DfsrPrivate\ConflictandDeleted folder (under the local path of the replicated folder on the computer that resolved the conflict). It remains there until Conflict and Deleted folder cleanup, which occurs when the Conflict and Deleted folder exceeds the configured size or DFS Replication encounters an Out of disk space error. The Conflict and Deleted folder is not replicated, and this method of conflict resolution avoids the problem of morphed directories that was possible in FRS.
When a conflict occurs, DFS Replication logs an informational event to the DFS Replication event log. This event does not require user action for the following reasons:
https://blogs.technet.microsoft.com/askds/2010/01/05/understanding-dfsr-conflict-algorithms-and-doin... - Understanding DFSR conflict algorithms (and doing something about conflicts).
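To make the "saved last wins" rule from the FAQ excerpt concrete, here is a minimal sketch in Python. It only models the rule as quoted above; the FileVersion type, its fields, and the server names are hypothetical, and the real DFSR algorithm (covered in the linked Ask DS article) has more cases than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileVersion:
    """Hypothetical stand-in for one server's copy of a file."""
    server: str
    last_saved: float  # assumed: save time in seconds since the epoch


def resolve_conflict(a: FileVersion, b: FileVersion):
    """Return (winner, loser) using the rule quoted in the FAQ:
    the version that was saved last wins. The loser is the copy that
    would be moved to DfsrPrivate\\ConflictandDeleted on the resolving
    member. A tie going to `a` is an arbitrary choice in this sketch."""
    if a.last_saved >= b.last_saved:
        return a, b
    return b, a


fs1 = FileVersion(server="FS1", last_saved=100.0)
fs2 = FileVersion(server="FS2", last_saved=200.0)
winner, loser = resolve_conflict(fs1, fs2)
print(f"keep {winner.server}, move {loser.server} copy to ConflictandDeleted")
```

This is why the out-of-sync pairs described earlier are so painful: the older edit does not get merged, it simply loses, and the Conflict and Deleted folder is cleaned up once it exceeds its configured size.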
Troubleshooting steps for events 4304 & 5002:
Troubleshooting DFSR Event 4304:
https://blogs.technet.microsoft.com/askds/2007/10/05/top-10-common-causes-of-slow-replication-with-d... - Top 10 Common Causes of Slow Replication with DFSR.
Excerpt:
Many applications can create a large number of spurious sharing violations, because they create temporary files that shouldn’t be replicated. If they have a predictable extension, you can prevent DFSR from trying to replicate them by setting an exception in DFSMGMT.MSC. The default file filter excludes file extensions ~*, *.bak, and *.tmp, so for example the Microsoft Office temporary files (~*) are excluded by default.
There are two kinds of DFSR events for sharing violations: Event ID 4302 and Event ID 4304. The DFSR Diagnostics combines both kinds of events, and reports them only as "Event ID 4302." The following information explains more about these two kinds of events:
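As a rough illustration of how the default file filter mentioned above behaves, here is a small Python sketch that checks file names against the three default patterns. This is not DFSR code: the function name is made up, and the case-insensitive wildcard matching is an assumption based on how Windows file names normally work.

```python
from fnmatch import fnmatchcase

# Default DFSR file filter patterns, as listed in the excerpt above.
DEFAULT_FILE_FILTER = ["~*", "*.bak", "*.tmp"]


def is_excluded(file_name, patterns=DEFAULT_FILE_FILTER):
    """Return True if file_name matches any filter pattern.
    Both sides are lowercased so matching is case-insensitive
    (an assumption of this sketch)."""
    name = file_name.lower()
    return any(fnmatchcase(name, pattern.lower()) for pattern in patterns)


for candidate in ["~$report.docx", "Budget.TMP", "report.docx"]:
    print(candidate, "excluded" if is_excluded(candidate) else "replicated")
```

Running a listing of your share through a check like this is a quick way to estimate how many files the filter would skip before you add application-specific extensions in DFSMGMT.MSC.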
Event ID 4302: A local sharing violation occurs when the service cannot receive an updated file because the local file is being used. This occurs on the "receive" side of the file change. The file is already replicated. However, it cannot be moved from the installing directory to the final destination.
Event ID 4304: The service cannot stage a file for replication because of a sharing violation. This occurs on the "send" side of the file change. DFSR wants to stage or copy the file for replication. However, an exclusive lock prevents this. (This was our case.)
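Since the diagnostics report lumps both events together, it can help to split them back out by side when going through the event log. A minimal Python sketch, assuming you have already exported the event IDs into a plain list (the export step and the function name are hypothetical):

```python
from collections import Counter

# Which side of the file change each sharing-violation event refers to,
# per the descriptions above.
SHARING_VIOLATION_SIDE = {
    4302: "receive",  # updated file cannot be installed, local file in use
    4304: "send",     # file cannot be staged, exclusive lock on the source
}


def count_by_side(event_ids):
    """Count sharing-violation events per side, ignoring other event IDs."""
    return Counter(
        SHARING_VIOLATION_SIDE[event_id]
        for event_id in event_ids
        if event_id in SHARING_VIOLATION_SIDE
    )


sample = [4302, 4304, 4304, 5002]  # made-up sample, not real log data
print(dict(count_by_side(sample)))
```

A high "send" count points at locks on the source files (our situation), while a high "receive" count points at files being held open on the downstream member.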
https://blogs.technet.microsoft.com/filecab/2006/05/15/troubleshooting-erroneous-sharing-violations-... - Troubleshooting erroneous sharing violations in the DFS Replication health report.
https://blogs.technet.microsoft.com/askds/2009/02/20/understanding-the-lack-of-distributed-file-lock... - Understanding (the Lack of) Distributed File Locking in DFSR.
Troubleshooting DFSR Event 5002:
Event ID 5002 is a very common DFSR warning event that is logged when connection failures occur. There are different root causes of this event, and in each case the event must be evaluated individually. There are some common root causes and resolutions listed below which should be considered only after the error codes within the events are understood.
Common errors returned with Event ID 5002
This section lists the error portion of the "Additional Information" section for Event ID 5002 with known solutions. The majority of the events fall into two main categories: Remote Procedure Call (RPC) failures and errors returned by the service itself.
Remote Procedure Call (RPC) errors
Common Solutions for RPC errors:
Note: When troubleshooting DFSR, always confirm all servers are up to date with the latest DFSR hotfixes:
List of currently available hotfixes for Distributed File System (DFS) technologies in Windows Server 2003 and in Windows Server 2003 R2: http://support.microsoft.com/kb/958802
List of currently available hotfixes for Distributed File System (DFS) technologies in Windows Server 2008 and in Windows Server 2008 R2: http://support.microsoft.com/kb/968429
Updates to DFSR are released as needed, and all the servers using DFSR should be maintained as part of regular patching and maintenance schedules.
netsh int ip set global taskoffload=disabled
iii. To confirm the command completed successfully, run:
netsh int ip show offload
Service Error codes
These errors are more commonly service related. They can be caused by AD replication latency and RPC issues as well.
Error: 9026 (The connection is invalid)
Error: 9033 (The request was cancelled by a shutdown)
Error: 9027 (A failure was reported by the remote partner)
Describe the data difference:
There could be multiple reasons for the data inconsistency, as mentioned above (ConflictAndDeleted, Pre-Existing, Sharing Violations, etc.), and including those below:
https://social.technet.microsoft.com/wiki/contents/articles/406.dfsr-does-not-replicate-temporary-fi... - DFSR Does Not Replicate Temporary Files.
https://blogs.technet.microsoft.com/askds/2007/09/04/wheres-my-file-root-cause-analysis-of-frs-and-d... - Where’s my file? Root cause analysis of FRS and DFSR data deletion.
Also, as we’re using DFSN (Namespace) in conjunction with DFSR (Replication), we need to consider the articles below as well:
https://blogs.technet.microsoft.com/askds/2012/07/24/common-dfsn-configuration-mistakes-and-oversigh... - Common DFSN Configuration Mistakes and Oversights.
Oct 30 2018 08:47 AM
I used DFS in my last job and it's not great for large quantities of files. The replication is slow, and if you have any issues it can be a real headache to fix. I would suggest looking up Ned Pyle's articles on DFS; they are really helpful. You could possibly look at cloning to speed up the initial sync of the volumes. https://blogs.technet.microsoft.com/filecab/2013/08/21/dfs-replication-initial-sync-in-windows-serve...
You may want to look into Storage Replica as an alternative replication method. Again, Ned Pyle has good articles on this.