Using DFS Replication Clone Feature to prepare 100TB of data in 3 days (A Test Perspective)

Apr 10, 2019

First published on TECHNET on Nov 15, 2013

I’m Tsan Zheng, a Senior Test Lead on the DFS team. If you’ve used DFSR (DFS Replication), you’re probably aware that the largest amount of data that we had tested replication with until recently was 10 TB. A few years ago, that was a lot of data, but now, not so much.

In this post, I’m going to talk about how we verified preparing 100 TB of data for replication in 3 days. With Windows Server 2012 R2, we introduced the ability to export a clone of the DFSR database, which dramatically reduces the time needed to get preseeded data ready for replication. It now takes roughly 3 days to get 100 TB of data ready for replication with Windows Server 2012 R2. On Windows Server 2012, we estimate this would’ve taken more than 300 days, based on our testing of 100 GB of data, which took 8 hours to prep on Windows Server 2012 (we decided not to wait around for 300 days). In this blog post, we’ll show you how we tested the replication of 100 TB of data on Windows Server 2012 R2.
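The 300-day figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes prep time on Windows Server 2012 scales linearly with data size and that 1 TB = 1000 GB; both are simplifying assumptions made for illustration, not measured results.

```python
# Back-of-the-envelope extrapolation, assuming prep time scales linearly
# with data size (an assumption for illustration, not a measured result).
MEASURED_GB = 100      # dataset size tested on Windows Server 2012
MEASURED_HOURS = 8     # time that test took to prepare
TARGET_TB = 100        # dataset size discussed in this post
GB_PER_TB = 1000       # decimal units; 1024 would give a slightly larger estimate


def extrapolated_days(target_tb: float) -> float:
    """Scale the measured prep time linearly up to target_tb terabytes."""
    hours = MEASURED_HOURS * (target_tb * GB_PER_TB) / MEASURED_GB
    return hours / 24


days_2012 = extrapolated_days(TARGET_TB)
print(f"Estimated prep time on Windows Server 2012: {days_2012:.0f} days")
```

Under these assumptions the estimate lands a little above 300 days, which is consistent with the "more than 300 days" statement.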

First of all, let’s look at what 100 TB of data could mean: It could be around 340,000 8 megapixel pictures (that’s 10 years of pictures if you take 100 pictures every day), or 3,400 Blu-Ray quality full-length movies, or billions of office documents, or 5,000 decent sized Exchange mailbox files, or 2,000 decent virtual machine files. That’s a lot of data, even in 2013. If you’re using 2 TB hard drives, you need at least 120 of them just to set up two servers to handle this amount of data. One clarification: the absolute performance of cloning a DFSR dataset depends largely on the number of files and directories, not on the actual size of the files (as long as we use validation level 0 or 1, which don’t involve verifying full file hashes).

In designing the test, we need to make sure not only that we set things up correctly, but also that replication happens as expected after the initial preparation of the dataset - you don’t want data corruption when replication is being set up! Preparing the data for replication must also go fast if we’re going to prep 100 TB of data in a reasonable amount of time.

Now let’s look at our test setup. As mentioned earlier, you need some storage. We deployed two virtual machines, each with 8 GB RAM and data volumes using a Storage Spaces simple space (in a production environment you’d probably want to use a mirror space for resiliency). The data volumes were served by a single-node scale-out file server, which provided continuous availability. The Hyper-V host (Fujitsu PRIMERGY CX250, 2.5 GHz, 6 cores, 128 GB RAM) and the file server (HP Mach1 Server, 24 GB, Xeon 2.27 GHz, 8 cores) were connected using a dual 10GbE network to ensure near-local I/O performance. We used 120 drives (2 TB each) in 2 Raid Inc JBODs for the file server.

To get several performance data points from a DFSR perspective (DFSR uses one database per volume), we used the following volume sizes, which total 100 TB on both ends. We used a synthetic file generator to create ~92 TB of unique data; the remaining 8 TB was human-generated data harvested from internal file sets. It’s difficult to have that much real data...not counting VHDx files and peeking into personal archives, of course! We used the robocopy commands provided by DFSR cloning to pre-seed the second member.

Volume	Size	Number of files	Number of folders	Number of Replicated Folders
F	64 TB	68,296,288	2,686,455	1
G	18 TB	21,467,280	70,400	18
H	10 TB	14,510,974	39,122	10
I	7 TB	1,141,246	31,134	7
J	1 TB	1,877,651	7,448	1
TOTAL	100 TB	107,293,439	2,834,559	37
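As a sanity check on the table, the short sketch below recomputes the column totals and shows how file and folder density differs from volume to volume. The numbers are copied from the table; the dictionary layout and function name are only illustrative.

```python
# Volume data copied from the table above:
# (size in TB, number of files, number of folders, number of replicated folders).
VOLUMES = {
    "F": (64, 68_296_288, 2_686_455, 1),
    "G": (18, 21_467_280, 70_400, 18),
    "H": (10, 14_510_974, 39_122, 10),
    "I": (7, 1_141_246, 31_134, 7),
    "J": (1, 1_877_651, 7_448, 1),
}


def column_totals(volumes):
    """Sum each column across all volumes."""
    return tuple(sum(column) for column in zip(*volumes.values()))


totals = column_totals(VOLUMES)

for name, (size_tb, files, folders, _replicated) in VOLUMES.items():
    print(
        f"{name}: {files / size_tb / 1e6:.2f} million files per TB, "
        f"{folders / size_tb:,.0f} folders per TB"
    )
print("Totals (TB, files, folders, replicated folders):", totals)
```

The per-TB figures make it easy to see that the volumes were not simply scaled copies of one another, which is what gives the test several distinct data points.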

In a nutshell, the following diagram shows the test topology used.

Now that the storage and file sets are ready, let’s look at the verification we did during the Export -> Pre-seed -> Import sequence:

No errors in the DFSR event log (checked in Event Viewer).

No skipped or invalid records in the DFSR debug log (checked by searching for “[ERROR]”).

Replication works fine after cloning, verified by probing each replicated folder with canary files to check convergence.

No mismatched records after cloning, verified by checking the DFSR debug log and DFSR event log.

Time taken for cloning was measured using the Windows PowerShell cmdlet Measure-Command:

Measure-Command { Export-DfsrClone…}

Measure-Command { Import-DfsrClone…}
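The debug log check in the list above lends itself to automation. The following is a minimal sketch of such a scan, not the tooling our team used: the function names, the `Dfsr*.log` file name pattern, and the UTF-8 decoding are all assumptions made for illustration.

```python
from pathlib import Path

ERROR_MARKER = "[ERROR]"  # the marker this post says was searched for in the debug log


def find_error_lines(log_path, marker=ERROR_MARKER):
    """Return (line_number, text) for every line in log_path that contains marker."""
    hits = []
    # errors="replace" keeps the scan going if the log holds undecodable bytes
    # (the log encoding is an assumption here).
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if marker in line:
                hits.append((number, line.rstrip("\n")))
    return hits


def scan_logs(log_dir, pattern="Dfsr*.log"):
    """Scan every file matching pattern (an assumed naming scheme) in log_dir."""
    return {
        path.name: find_error_lines(path)
        for path in sorted(Path(log_dir).glob(pattern))
    }
```

Calling `scan_logs` on the folder that holds the debug logs returns a dictionary keyed by file name; an empty list for every file corresponds to the "no skipped or invalid records" condition.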

The following table and graphs summarize the results that one of our testers, Jialin Le, collected on a build that was very close to the RTM build of Windows Server 2012 R2. Given the nature of the DFSR clone validation levels, we don’t recommend validation level 2 (which involves full file hashes and is too time consuming for a large dataset like this one!).

Note that the performance of level 0 and level 1 validation depends largely on the number of files and directories rather than on absolute file size. This explains why the 64 TB volume takes proportionally more time to export than the 18 TB volume: the former has proportionally more folders.
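To explore how export time relates to item count versus raw size, the sketch below computes two rates per volume from the figures reported in this post: files plus folders processed per minute, and terabytes processed per minute, both for the level 0 export. Treating files and folders as equally weighted "items" is an assumption made for illustration.

```python
# Per-volume figures copied from the tables in this post:
# (size in TB, files, folders, level 0 export time in minutes).
VOLUMES = {
    "F": (64, 68_296_288, 2_686_455, 394),
    "G": (18, 21_467_280, 70_400, 111),
    "H": (10, 14_510_974, 39_122, 73),
    "I": (7, 1_141_246, 31_134, 70),
    "J": (1, 1_877_651, 7_448, 11),
}


def export_rates(volumes):
    """Return {volume: (items per minute, TB per minute)} for the level 0 export."""
    rates = {}
    for name, (size_tb, files, folders, minutes) in volumes.items():
        items = files + folders  # assumption: files and folders weighted equally
        rates[name] = (items / minutes, size_tb / minutes)
    return rates


rates = export_rates(VOLUMES)
for name, (items_per_min, tb_per_min) in rates.items():
    print(f"{name}: {items_per_min:,.0f} items/min, {tb_per_min:.3f} TB/min")
```

Comparing the two columns across volumes gives a rough feel for which measure tracks the export time more closely in this particular dataset.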

Validation Level	Volume Size	Time used to Export (minutes)	Time used to Import (minutes)
0 – None	64 TB	394	2129
0 – None	18 TB	111	1229
0 – None	10 TB	73	368
0 – None	7 TB	70	253
0 – None	1 TB	11	17
0 – None	Sum (100 TB)	659 (0.5 days)	3996 (2.8 days)
1 – Basic	64 TB	1043	2701
1 – Basic	18 TB	211	1840
1 – Basic	10 TB	168	577
1 – Basic	7 TB	203	442
1 – Basic	1 TB	17	37
1 – Basic	Sum (100 TB)	1642 (1.1 days)	5597 (3.9 days)
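The sums and the minutes-to-days conversion are easy to double-check. This sketch adds up the per-volume times from the table and converts them to days; like the Sum rows, it adds the volumes together as if they were processed one after another. The data structure and function name are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60

# Minutes per volume in table order (64, 18, 10, 7, 1 TB), copied from the table above.
RESULTS = {
    "0 - None": {
        "export": [394, 111, 73, 70, 11],
        "import": [2129, 1229, 368, 253, 17],
    },
    "1 - Basic": {
        "export": [1043, 211, 168, 203, 17],
        "import": [2701, 1840, 577, 442, 37],
    },
}


def summarize(results):
    """Sum export and import minutes per validation level and convert to days."""
    summary = {}
    for level, phases in results.items():
        export_min = sum(phases["export"])
        import_min = sum(phases["import"])
        summary[level] = {
            "export_min": export_min,
            "import_min": import_min,
            "total_days": (export_min + import_min) / MINUTES_PER_DAY,
        }
    return summary


summary = summarize(RESULTS)
for level, row in summary.items():
    print(
        f"{level}: export {row['export_min']} min, import {row['import_min']} min, "
        f"export + import = {row['total_days']:.1f} days"
    )
```

For validation level 0, export plus import comes to a little over 3 days, which is where the "roughly 3 days" headline figure comes from.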

From the results above, you can see that getting DFSR ready for replication for a large dataset (totaling 100 TB) is becoming much more practical!

Update May 2014: See it all in video! TechEd North America 2014 with live demos and walkthroughs:

I hope you have enjoyed learning more about how we test DFSR features here at Microsoft. For more info on cloning, see our TechNet walkthrough article.

- Tsan Zheng

