If you have an existing system with deduplication enabled on one or more volumes, you can do a quick check to see if your existing volume sizes are adequate.
The following script can help quickly answer whether your current deduplicated volume size is appropriate for the workload churn happening on the storage or whether deduplication is regularly falling behind.
$ddpVol = Get-DedupStatus <volume>
switch ($ddpVol.LastOptimizationResult) {
0 { write-host "Volume size is appropriate for server." }
2153141053 { write-host "The volume could not be optimized in the time available. If this persists over time, this volume may be too large for deduplication to keep up on this server." }
Default { write-host "The last optimization job for this volume failed with an error. See the Deduplication event log for more details." }
}
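To check every deduplicated volume on the server at once, a quick variation of the same idea (the property names shown are as reported by Get-DedupStatus) is:
# List the last optimization result for every deduplicated volume on this server
Get-DedupStatus | Select-Object Volume, LastOptimizationResult, LastOptimizationResultMessage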
If the result is that the volume size is appropriate for your server, then you can stop here (and work on your other tasks!).
If the result from the above script is that the volume cannot be optimized in the time available, administrators should determine the volume size that allows optimization to complete within the available time window.
Let's start with some basic principles: deduplication optimization needs to keep up with the daily data churn on a volume, and the total amount of churn grows with the amount of data the volume holds. Therefore, to estimate the maximum size for a deduplicated volume, you need to understand both the size of the data churn and the speed of optimization processing.
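As a rough illustration of that relationship (a minimal sketch with hypothetical placeholder values, not one of the sizing scripts discussed below), a volume is right-sized when the time needed to optimize the daily churn fits within the daily optimization window:
# Illustrative only: all values below are hypothetical placeholders
$DailyChurnGB = 1024                  # estimated new or changed data per day, in GB
$OptimizationThroughputMB = 45        # estimated optimization throughput, in MB/s
$DailyOptimizationWindowHours = 12    # hours available each day for optimization
# Convert churn to MB, divide by throughput (MB/s) for seconds, then convert to hours
$HoursToProcessChurn = ($DailyChurnGB * 1024) / $OptimizationThroughputMB / 3600
$HoursToProcessChurn -le $DailyOptimizationWindowHours   # True if deduplication can keep up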
The following sections provide guidance on how to determine maximum volume size, using two different methods to determine data churn and deduplication processing speed:
Method 1: Use reference data from our internal testing to estimate the values for your system.
Method 2: Perform measurements directly on your system based on representative samples of your data.
Scripts are provided to then calculate the maximum volume size using these values.
From internal testing, we have measured deduplication processing, or throughput, rates that vary depending on the combination of the underlying hardware as well as the types of workloads being deduplicated. These measured rates can be used as reference points for estimating the rates for your target configuration. The assumption is that you can scale these values according to your estimate of your system and data workload.
For roughly estimating deduplication throughput, we have broken data workloads into two broad types: general-purpose file server workloads and Hyper-V (VDI) workloads.
Notice that the form the data churn takes is very different between the general-purpose file server and the Hyper-V workloads. With the general-purpose file server, data churn usually takes the form of new files. With Hyper-V, data churn takes the form of modifications to the VHD file.
Because of this difference, for the general-purpose file server we normally talk about deduplication throughput in terms of the time to optimize the amount of new file data added, and for Hyper-V we normally talk about it in terms of the time to re-optimize an entire VHD file that contains a percentage of changed data. The two sections below show how to estimate volume size for these two workloads using Method 1.
The script examples given in this section make two important assumptions:
As noted above, the deduplication of general-purpose file server workloads is primarily characterized by the optimization throughput of new data files. We have taken measurements of this throughput rate for two different hardware configurations running both Windows Server 2012 and Windows Server 2012 R2. The details of the system configurations are listed below. Since the throughput rate is primarily dependent on the overall performance of the storage subsystem, you can scale these rates according to your estimate of your system's performance compared to these reference configurations. Scale up the throughput for higher performance storage and scale down the throughput for lower performance storage.
The table below lists the typical deduplication throughput rates for new file data for General Purpose File Server workloads on the two tested reference systems.
| Deduplication throughput rates for new file data (general-purpose file server workload) | System 1 (SATA) | System 2 (SAS) |
| Windows Server 2012 | ~22 MB/s | ~26 MB/s |
| Windows Server 2012 R2 | ~23-31 MB/s | ~45-50 MB/s |
Two points to note from the measured throughput rates in the table: throughput improved in Windows Server 2012 R2 compared to Windows Server 2012, and throughput is higher on the faster storage of System 2, reflecting how the rate scales with the performance of the storage subsystem.
Rough guidelines for estimating the typical churn rates of General Purpose File Servers are to use values in the 1% to 6% range. For the examples below, a conservative (high-end) estimate of 5% is used.
Given the typical optimization throughput values from the table and using an estimate of the churn rates of the files, administrators can estimate if deduplication can keep up with their needs by using the following script to calculate a volume size recommendation.
# General Purpose File Server (GPFS) workload volume size estimation
# TotalVolumeSizeGB = total size in GB of all volumes that host data to be deduplicated
# DailyChurnPercentage = percentage of data churned (new data or modified data) daily
# OptimizationThroughputMB = measured/estimated optimization throughput in MB/s
# DailyOptimizationWindowHours = 24 hours for background mode deduplication, or daily schedule length for throughput optimization
# DeduplicationSavingsPercentage = measured/estimated deduplication savings percentage (entered as a percentage, for example 70 for 70%)
# FreeSpacePercentage = percentage of free space to leave on the volume; leaving some free space is recommended, such as 10% or twice the expected churn
write-host "GPFS workload volume size estimation"
[int] $TotalVolumeSizeGB = Read-Host 'Total Volume Size (in GB)'
$DailyChurnPercentage = Read-Host 'Percentage data churn (example 5 for 5%)'
$OptimizationThroughputMB = Read-Host 'Optimization Throughput (in MB/s)'
$DailyOptimizationWindowHours = Read-Host 'Daily Optimization Window (in hours)'
$DeduplicationSavingsPercentage = Read-Host 'Deduplication Savings percentage (example 70 for 70%)'
$FreeSpacePercentage = Read-Host 'Percentage allocated free space on volume (example 10 for 10%)'
# Convert to percentage values
$DailyChurnPercentage = $DailyChurnPercentage/100
$DeduplicationSavingsPercentage = $DeduplicationSavingsPercentage/100
$FreeSpacePercentage = $FreeSpacePercentage/100
# Total logical data size
$DataLogicalSizeGB = $TotalVolumeSizeGB * (1 - $FreeSpacePercentage) / (1 - $DeduplicationSavingsPercentage)
# Data to optimize daily
$DataToOptimizeGB = $DailyChurnPercentage * $DataLogicalSizeGB
# Time required to optimize data
$OptimizationTimeHours = ($DataToOptimizeGB / $OptimizationThroughputMB) * 1024 / 3600
# Number of volumes required
$VolumeCount = [System.Math]::Ceiling($OptimizationTimeHours / $DailyOptimizationWindowHours)
# Volume size
$VolumeSize = $TotalVolumeSizeGB / $VolumeCount
write-host
write-host "Data to optimize daily: $DataToOptimizeGB GB"
$OptimizationTimeHours = "{0:N2}" -f $OptimizationTimeHours
write-host "Hours required to optimize data: $OptimizationTimeHours"
write-host "$VolumeCount volume(s) of size $VolumeSize GB is recommended to process"
write-host
Assume a general-purpose file server with 8 TB of SAS storage, running Windows Server 2012 R2 with deduplication enabled, is scheduled to run optimization in throughput mode at night for 12 hours. From the Server Manager UI or the Get-DedupVolume cmdlet, the admin sees that deduplication is reporting 70% savings.
Using the table above, we get the typical optimization throughput for SAS (45 MB/s) and assume 5% file churn for the General Purpose File Server.
Plugging the scenario's input values into the script:
PS C:\deduptest> .\calculate-gpfs.ps1
GPFS workload volume size estimation
Total Volume Size (in GB): 8192
Percentage data churn (example 5 for 5%): 5
Optimization Throughput (in MB/s): 45
Daily Optimization Window (in hours): 12
Deduplication Savings percentage (example 70 for 70%): 70
Percentage allocated free space on volume (example 10 for 10%): 10
the calculation script outputs:
Data to optimize daily: 1228.8 GB
Hours required to optimize data: 7.77
1 volume(s) of size 8192 GB is recommended.
So, we can expect a server with a single 8 TB volume and 5% churn to be able to process the ~1.2 TB of changes in under 8 hours. The deduplication server should be able to complete the optimization work within the scheduled 12-hour night window.
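To see where the 1228.8 GB and 7.77-hour figures come from, here is the script's arithmetic spelled out with the inputs above (the values in the trailing comments are the computed results):
# Logical data size: 8192 GB total, 10% free space, 70% deduplication savings
$DataLogicalSizeGB = 8192 * (1 - 0.10) / (1 - 0.70)   # 24576 GB
# Daily churn of 5% of the logical data
$DataToOptimizeGB = 0.05 * $DataLogicalSizeGB          # 1228.8 GB
# Hours required at 45 MB/s
($DataToOptimizeGB / 45) * 1024 / 3600                 # ~7.77 hours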
We can also see that if the same server were using SATA instead of SAS ($OptimizationThroughputMB = 23 MB/s), the script would recommend 2 volumes to complete the optimization work within the same 12-hour window.
Data to optimize daily: 1228.8 GB
Hours required to optimize data: 15.20
2 volume(s) of size 4096 GB is recommended.
If a 17-hour optimization window were available for the same SATA hardware, only a single 8 TB volume would be needed.
Data to optimize daily: 1228.8 GB
Hours required to optimize data: 15.20
1 volume(s) of size 8192 GB is recommended.
As noted above, the deduplication of Hyper-V workloads is primarily characterized by the re-optimization throughput of existing VHD files. We have taken measurements of this throughput rate for a VDI reference hardware deployment running Windows Server 2012 R2.
The table below lists the measured re-optimization deduplication throughput rates for Hyper-V VDI workloads running on the VDI reference system.
| Deduplication throughput rates for VHD files (Hyper-V VDI workload) 2 | Storage Spaces configuration (SSD, HDD) 1 |
| Hyper-V (Windows Server 2012 R2) | Re-optimization (background mode) of VHD file: ~200 MB/s; Re-optimization (throughput mode) of VHD file: ~300 MB/s |
1 Using a VDI reference hardware deployment (with JBODs) detailed here.
2 Note that these rates are much larger than those listed for processing new file data in the general-purpose file server scenario in the previous section. This is not because the actual deduplication operation is faster, but because the full size of the file is counted when calculating the rates, and for VHD files in this scenario only a small percentage of the data is new. For example, at 10% churn, an apparent whole-file rate of ~300 MB/s corresponds to roughly 30 MB/s of actual new-data processing, in line with the general-purpose file server rates.
Note from the table that the throughput rates will typically differ depending on the scheduling mode chosen for deduplication. When the "BackgroundModeOptimization" job schedule is chosen, the optimization jobs run at low priority with a smaller memory allocation. When the "ThroughputModeOptimization" job schedule is chosen, the optimization jobs run at normal priority with a larger memory allocation. (For more information on configuring deduplication, refer to Install and Configure Data Deduplication on Microsoft TechNet.)
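As a sketch of how such a throughput-mode window might be configured (the schedule name, days, start time, and duration here are illustrative; check New-DedupSchedule for the full parameter set on your server):
# Inspect the optimization schedules currently defined on this server
Get-DedupSchedule
# Create a nightly 12-hour throughput-mode optimization window (illustrative values)
New-DedupSchedule -Name "NightlyThroughputOptimization" -Type Optimization -Start "20:00" -DurationHours 12 -Days Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday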
Typical churn rates for Hyper-V VDI workloads are around 5-10%, which is reflected in the deduplication throughput rates listed above. If you expect more or less churn, you can scale these values accordingly to estimate the impact on the recommended volume size (the apparent processing rate increases with less churn and decreases with more churn).
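As a rough, hypothetical illustration of that scaling (assuming the reference rate reflects roughly 10% churn and ignoring fixed per-file overhead, so the whole-file rate scales inversely with the churn percentage):
# Rough scaling of a reference re-optimization rate for a different expected churn
$ReferenceRateMB = 300    # throughput-mode reference rate from the table, in MB/s
$ReferenceChurn = 0.10    # churn assumed here for the reference measurement (an assumption)
$ExpectedChurn = 0.20     # your estimated churn
$ScaledRateMB = $ReferenceRateMB * ($ReferenceChurn / $ExpectedChurn)   # ~150 MB/s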
Given the typical optimization throughput in the table, administrators can estimate if deduplication can keep up with their needs by using the following script to calculate a volume size recommendation.
# Hyper-V VDI workload volume size estimation
# TotalVolumeSizeGB = total size in GB of all volumes that host data to be deduplicated
# VHDOptimizationThroughputMB = measured/estimated optimization of VHD file throughput in MB/s
# DailyOptimizationWindowHours = 24 hours for background mode deduplication, or daily schedule length for throughput optimization
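# The calculation below is a sketch that mirrors the GPFS script above, using only the
# three inputs described in the comments: it assumes the full data on the volume must be
# re-optimized within the daily window at the whole-file VHD rates listed in the table.
write-host "Hyper-V VDI workload volume size estimation"
[int] $TotalVolumeSizeGB = Read-Host 'Total Volume Size (in GB)'
$VHDOptimizationThroughputMB = Read-Host 'VHD Re-optimization Throughput (in MB/s)'
$DailyOptimizationWindowHours = Read-Host 'Daily Optimization Window (in hours)'
# Time required to re-optimize the data (GB to MB, divide by MB/s for seconds, then hours)
$OptimizationTimeHours = ($TotalVolumeSizeGB / $VHDOptimizationThroughputMB) * 1024 / 3600
# Number of volumes required so that each volume finishes within the daily window
$VolumeCount = [System.Math]::Ceiling($OptimizationTimeHours / $DailyOptimizationWindowHours)
# Volume size
$VolumeSize = $TotalVolumeSizeGB / $VolumeCount
write-host
$OptimizationTimeHours = "{0:N2}" -f $OptimizationTimeHours
write-host "Hours required to re-optimize data: $OptimizationTimeHours"
write-host "$VolumeCount volume(s) of size $VolumeSize GB is recommended."
write-host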