Hi everyone! Sujay here from the Program Management team for System Center Data Protection Manager (DPM) and Microsoft Azure Backup Server (MABS).
While talking to DPM and MABS customers, I come across many questions about the tiered storage requirement for the storage pool. In this blog post I will try to answer the most important question, "Why tiered storage?", along with some frequently asked questions (FAQs) about using tiered storage.
This post is going to be a long one, but I promise it will be worth your time. And if you are looking for the TL;DR version: use tiered storage for the DPM storage pool, period.
Before I jump into the questions, I recommend familiarizing yourself with the following terminology, which I reference multiple times in this blog.
Read the Introduction to Modern Backup Storage blog to learn more about Modern Backup Storage (MBS), which was introduced with DPM 2016.
Note: I refer to DPM in many places, but everything in this blog applies to MABS as well. Make sure you are using MABS v3 with UR1.
Let’s get started.
Why do backup jobs take longer on Modern Backup Storage over time?
During the backup process, one of the important steps is taking a snapshot of the replica on the DPM server to create a recovery point. This is done using the block cloning feature of ReFS. Over time, as daily and weekly backup jobs run and snapshots are taken, the replica becomes increasingly fragmented. The size of the metadata also increases with each clone operation. Block cloning is metadata intensive: during each block cloning operation, the file metadata and the global reference count table are read from disk into memory. This results in a large number of I/O operations against the underlying storage.
Let’s take an example of a 400GB replica (the replica contains the backup data stored on the DPM server for a specific data source). We observed that with the worst possible fragmentation, the metadata size can grow to 4% of the replica size, which would be 16GB in this case. That means 16GB of metadata is read during every cloning operation. If we assume a ReFS cluster size of 4K, a single cloning operation would require around 4 million I/O operations. With a standard disk (HDD) delivering around 200 IOPS, this can take roughly 20K seconds, or between 5 and 6 hours, just to clone the replica. This is the worst-case scenario: we are assuming that all I/O is random and that fragmentation is as bad as it can get. This may not always be the case in your environment.
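The estimate above can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only: the function name and parameters are hypothetical, it treats 1GB as 2^30 bytes, and it assumes one random I/O per 4K cluster of metadata, as in the worst case described above.

```python
def estimate_clone_io(replica_gb, metadata_percent, cluster_bytes, iops):
    """Return (io_count, seconds) for reading all ReFS metadata once.

    Hypothetical helper: assumes one random I/O per cluster of metadata.
    """
    replica_bytes = replica_gb * 1024**3
    # Integer arithmetic keeps the byte counts exact.
    metadata_bytes = replica_bytes * metadata_percent // 100
    io_count = metadata_bytes // cluster_bytes
    seconds = io_count / iops
    return io_count, seconds


io_count, seconds = estimate_clone_io(
    replica_gb=400, metadata_percent=4, cluster_bytes=4096, iops=200
)
print(f"I/O operations: {io_count:,}")
print(f"Clone time on HDD: {seconds:,.0f} s (~{seconds / 3600:.1f} h)")
```

With the exact binary figures this works out to a little over 4 million I/O operations and a little over 20K seconds, in line with the rounded numbers in the example.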
How does tiered storage solve this problem?
With DPM 2019, we enhanced MBS to take advantage of hybrid storage, also known as the tiered storage feature in ReFS, configured using Storage Spaces. ReFS divides a volume into two logical storage groups known as tiers. Each tier can have its own drive and resiliency types, allowing it to be optimized for either performance or capacity. Once these tiers are configured, ReFS uses them to deliver fast storage for hot data and capacity-efficient storage for cold data. DPM uses this in a slightly different way to improve overall backup performance.
DPM (with the help of ReFS) uses the performance tier (SSD) to store the entire file system metadata required for block cloning operations. This improves cloning performance to a great extent. If we take the same example as above and assume the cheapest SSD available on the market, which offers around 10K IOPS, the cloning time drops drastically to around 6 to 7 minutes. Tiered storage is the cost-efficient solution offered by ReFS for applications that demand high IOPS, and DPM makes the most of it while using only a small amount of SSD.
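To make the HDD versus SSD comparison concrete, here is a minimal Python sketch using the rounded figures from the example (about 4 million I/O operations, 200 IOPS for the HDD, 10K IOPS for the SSD). The function and constant names are hypothetical, and real devices will behave differently under mixed workloads.

```python
def clone_minutes(io_count, iops):
    """Minutes needed to issue io_count random I/Os at a given IOPS rate."""
    return io_count / iops / 60


IO_COUNT = 4_000_000  # "around 4 million" I/Os from the 400GB example
HDD_IOPS = 200        # standard disk figure used in the example
SSD_IOPS = 10_000     # entry-level SSD figure used in the example

hdd = clone_minutes(IO_COUNT, HDD_IOPS)
ssd = clone_minutes(IO_COUNT, SSD_IOPS)
print(f"HDD: {hdd:.0f} min, SSD: {ssd:.1f} min, speedup: {hdd / ssd:.0f}x")
```

Because the estimate is purely IOPS-bound, the speedup is simply the ratio of the two IOPS figures.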
What should the size of the SSD tier be, and why?
As mentioned above, in the worst possible fragmentation scenario the file system metadata can grow to 4% of the replica size. Since DPM needs the SSD tier only to store ReFS metadata (which is required for cloning operations), we recommend an SSD tier sized at 4% of the total DPM storage.
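As a quick sizing aid, the 4% guideline can be expressed as a small calculation. This Python sketch is an illustration only: the function name is hypothetical and the pool sizes are made-up examples, not recommendations.

```python
def recommended_ssd_tier_gb(total_storage_gb, metadata_percent=4):
    """SSD tier size in GB, rounded up, for a given total DPM storage size."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_storage_gb * metadata_percent // 100)


for total_gb in (10_000, 50_000, 100_000):
    ssd_gb = recommended_ssd_tier_gb(total_gb)
    print(f"{total_gb:>7,} GB pool -> {ssd_gb:,} GB SSD tier")
```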
Do I need to upgrade my server or external storage if it doesn’t support SSD?
The storage for the DPM pool can be directly attached (internal) storage or external storage such as a SAN.
If you are using internal storage, you need to connect an SSD to the DPM server. If your server has no provision for a standard SSD, you can use PCIe-based SSDs instead.
If you are using external storage (such as a SAN device) and it supports SSDs, you can add the SSD to the external storage. If that’s not possible, you can still use an SSD connected directly to the DPM server (including a PCIe-based SSD) in combination with HDDs from the external storage to create tiered storage.
What if my DPM server is running on a virtual machine? Can I use tiered storage on a VM?
Yes. Configuring tiered storage using Windows Storage Spaces is supported on a virtual machine.
The VM should have a virtual SSD carved out of a physical SSD on the Hyper-V host. The Hyper-V host can have the physical SSDs connected locally or through external storage. While configuring Windows Storage Spaces on the virtual machine, make sure that the media type for each disk is configured correctly.
Use the Set-PhysicalDisk PowerShell cmdlet with its -MediaType parameter to set the appropriate media type (SSD or HDD) for each disk.
Please refer to our documentation, which provides step-by-step guidance for configuring tiered storage. Additionally, I strongly recommend reviewing the prerequisites for Storage Spaces.
What resiliency options can be configured for a tiered volume in DPM?
DPM supports all the resiliency options supported by Windows Storage Spaces: Simple, Mirror, and Parity. The Storage Spaces prerequisites document explains the pros and cons of each option.
For DPM, you can also combine resiliency options to optimize for performance or capacity.
Note: When you are configuring Storage Spaces, you should not configure any resiliency (RAID) at the hardware layer. This is a prerequisite for Storage Spaces.
What is the recommended sector size for tiered storage?
We strongly recommend keeping the sector size at 4096 bytes. To avoid multiplexing and demultiplexing of I/O operations, the sector size should be consistent across all layers, from ReFS down to the storage hardware.
Why is Windows Server 2019 recommended for installing DPM 2019 or MABS v3?
As you know, DPM is highly dependent on the Resilient File System (ReFS). Windows Server 2019 brings various enhancements to ReFS that are not available in Windows Server 2016. It is also important to always install the latest Cumulative Update for Windows Server 2019, which brings additional fixes to ReFS. For example, the January 23, 2020 update (KB4534321) included a performance improvement for ReFS block cloning operations.
Is there any alternative way to improve backup job performance?
Although we highly recommend migrating to tiered storage, there is another way to improve backup performance. If backup jobs are taking longer only for specific data sources, you can reduce fragmentation by moving those data sources to another volume that has enough free space. This allows the replica to be stored in more contiguous space, which should reduce fragmentation. Remember that having enough free space on the volume is important for getting contiguous space. While this should improve performance, you might see it degrade again over time as fragmentation increases.
With DPM 2019 UR2, you can use optimized migration to move only the active replica to a new volume while retaining the older recovery points on the existing volume. Read more about how to migrate a data source to a new volume in our documentation.
How do I check the current fragmentation status of a replica?
We have added the -CheckReplicaFragmentation parameter to the Copy-DPMDatasourceReplica PowerShell cmdlet. It lets you check the fragmentation status of the replica for a data source.
Note: To use this cmdlet you need to have at least UR1 for DPM 2019 or UR9 for DPM 2016.
I hope this information was helpful and that you now know why using tiered storage for the DPM storage pool improves backup job performance. So, start using tiered storage and let us know about your experience. While I have tried to cover the most frequently asked questions in this blog, I am sure there will be more. Please post your questions in the comments and I will answer them as best I can.