Large Scale Study and System Design for Primary Data Deduplication accepted by USENIX
Published Apr 10 2019 02:39 AM
First published on TECHNET on Jul 02, 2012
Microsoft Research (MSR) and the Windows File Server team worked together to build the new Data Deduplication feature in Windows Server 2012. The feature is the result of two years of design collaboration with MSR. The development of the architecture and the deduplication algorithms was driven, in part, by analysis of data in a large global enterprise. The USENIX Annual Technical Conference (ATC) was held on June 13-15, and we submitted a Large Scale Study and System Design paper and gave a talk about our findings. The paper and the presentation video are now public on the USENIX website.

The paper describes the algorithms used to chunk data and to identify unique data chunks using indexes on chunk hashes, explains how deduplication resources scale to large amounts of data, and includes performance evaluation numbers. The paper and talk review the analysis carried out on the datasets and show how the resulting insights determined the design points that address the challenges of primary data deduplication. Many of the design decisions were made to balance on-disk space savings, resource usage, performance, and transparency. The key result is that deduplication can be installed on primary data volumes without impacting the server’s regular workload and still offer significant savings.
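To make the idea of chunking plus a chunk-hash index concrete, here is a minimal sketch in Python. It is not the algorithm from the paper or the Windows Server 2012 implementation: the boundary rule is a simple Gear-style rolling hash swapped in for illustration, the size limits and mask are arbitrary toy values, and the names `split_chunks` and `ChunkStore` are hypothetical.

```python
import hashlib

# Illustrative 256-entry table for a Gear-style rolling hash (an assumption,
# not taken from the paper). Derived from SHA-256 so it is deterministic.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
        for i in range(256)]


def split_chunks(data, mask=0x3F, min_size=16, max_size=256):
    """Split data into variable-size chunks at content-defined boundaries."""
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Update the rolling hash with the current byte.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        length = i - start + 1
        # Cut when the hash matches the mask (after min_size) or at max_size.
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


class ChunkStore:
    """Toy in-memory chunk store: each unique chunk is kept once, keyed by hash."""

    def __init__(self):
        self.index = {}  # SHA-256 hex digest -> chunk bytes

    def put(self, data):
        """Store data and return its recipe (ordered list of chunk hashes)."""
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.index.setdefault(digest, chunk)  # only new chunks use space
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild the original data from its recipe."""
        return b"".join(self.index[d] for d in recipe)

    def stored_bytes(self):
        return sum(len(c) for c in self.index.values())


if __name__ == "__main__":
    store = ChunkStore()
    payload = bytes(range(256)) * 20
    first = store.put(payload)
    size_after_first = store.stored_bytes()
    second = store.put(payload)  # identical content adds no new chunks
    print(store.get(first) == payload, store.stored_bytes() == size_after_first)
```

Because boundaries depend on content rather than fixed offsets, an insertion near the start of a file tends to change only the chunks around the edit, so the remaining chunks still match entries already in the index. A production system would also need a persistent, memory-efficient index, which this sketch does not attempt.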

Overview:

  • A large-scale study of primary data deduplication on 7 TB of data across 15 globally distributed servers in a large enterprise.
  • Architecture overview of deduplication in Windows Server 2012 and the design decisions that were driven by data analysis.
  • How deduplication is made friendly to the server’s primary workload, and how its CPU, memory, and disk I/O usage scales efficiently with the size of the data.
  • Highlights of the innovations that went into the areas of data chunking / compression, chunk indexing, data partitioning and reconciliation.

Primary data serving, reliability, and resiliency aspects of the system are not covered in this paper.

Check out the live video of the talk given by Sudipta Sengupta and Adi Oltean and download the PDF of the paper here: https://www.usenix.org/conference/usenixfederatedconferencesweek/primary-data-deduplication%E2%...

Cheers,
Scott M. Johnson
Program Manager II
Data Deduplication Team
