Large Scale Study and System Design for Primary Data Deduplication accepted by USENIX
Published Apr 10 2019 02:39 AM
First published on TECHNET on Jul 02, 2012
Microsoft Research (MSR) and the Windows File Server team worked together to build the new Data Deduplication feature in Windows Server 2012. The feature grew out of two years of collaboration with MSR on the design, and the architecture and deduplication algorithms were driven, in part, by analysis of data in a large global enterprise. The USENIX Annual Technical Conference (ATC) was held on June 13-15, where we presented our large-scale study and system design paper and gave a talk about our findings. The paper and the presentation video have just gone public on the USENIX website.

The paper describes the algorithms used to chunk data and to identify unique chunks using indexes on chunk hashes, and shows how deduplication resources scale with large amounts of data, including performance evaluation numbers. The paper and talk review the analysis carried out on the datasets and how the insights were used to determine design points that address the challenges of primary data deduplication. Many of the design decisions for deduplication balance on-disk space savings, resource usage, performance, and transparency. The key feature is that deduplication can be enabled on primary data volumes without impacting the server's regular workload while still offering significant savings.
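To make the "index on chunk hashes" idea concrete, here is a minimal sketch of a hash-indexed chunk store. Everything here (the `DedupStore` name, storing chunks in a dictionary, SHA-256 as the chunk hash) is an illustrative assumption; the production system uses its own chunking, compression, and on-disk index structures described in the paper.

```python
import hashlib

class DedupStore:
    """Toy chunk store: each unique chunk is kept once, and a file
    becomes a "recipe" -- the ordered list of its chunk hashes.
    (Illustrative sketch only, not the Windows Server 2012 design.)"""

    def __init__(self):
        # chunk hash -> chunk bytes; stands in for the on-disk chunk store
        self.index = {}

    def write(self, chunks):
        """Store a file given as a list of chunks; return its recipe."""
        recipe = []
        for chunk in chunks:
            h = hashlib.sha256(chunk).hexdigest()
            if h not in self.index:      # only unique chunks consume space
                self.index[h] = chunk
            recipe.append(h)
        return recipe

    def read(self, recipe):
        """Reassemble a file from its recipe of chunk hashes."""
        return b"".join(self.index[h] for h in recipe)

store = DedupStore()
r1 = store.write([b"header", b"shared-block", b"tail-1"])
r2 = store.write([b"header", b"shared-block", b"tail-2"])
# Six chunks were written, but only four unique chunks are stored.
print(len(store.index))                                   # 4
print(store.read(r1) == b"headershared-blocktail-1")      # True
```

The space savings come from the index lookup: the two files share their first two chunks, so those bytes are stored only once.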

Highlights of the paper include:
  • A large-scale study of primary data deduplication on 7TB of data across 15 globally-distributed servers in a large enterprise.
  • Architecture overview of deduplication in Windows Server 2012 and the design decisions that were driven by data analysis.
  • How deduplication is made friendly to the server’s primary workload, and how CPU, memory, and disk I/O resource usage for deduplication scales efficiently with the size of the data.
  • Highlights of the innovations in the areas of data chunking/compression, chunk indexing, data partitioning, and reconciliation.
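The data-chunking idea in the list above can be sketched with a toy content-defined chunker built on a rolling hash. All parameters below (window width, target average size, min/max bounds, the polynomial hash, and the boundary rule) are illustrative assumptions for this sketch, not the algorithm described in the paper, which also applies per-chunk compression.

```python
# Toy content-defined chunking with a Rabin-style polynomial rolling hash.
# A chunk boundary is declared where the rolling hash of the last WINDOW
# bytes satisfies a fixed condition, so boundaries depend on content, not
# on byte offsets -- insertions shift nearby boundaries, not all of them.
WINDOW = 16                       # sliding-window width in bytes
AVG = 64                          # target average chunk size (toy value)
MIN_CHUNK, MAX_CHUNK = 32, 256    # hard bounds on chunk size
BASE, MOD = 257, (1 << 31) - 1
POW_W = pow(BASE, WINDOW, MOD)    # weight of the byte leaving the window

def chunk(data: bytes):
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD              # bring new byte in
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW_W) % MOD   # drop old byte
        size = i - start + 1
        at_boundary = (h % AVG == AVG - 1) and size >= MIN_CHUNK
        if at_boundary or size >= MAX_CHUNK:  # MAX forces a cut regardless
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):                     # trailing partial chunk
        chunks.append(data[start:])
    return chunks

data = bytes(range(256)) * 8                  # 2 KB of sample content
pieces = chunk(data)
print(b"".join(pieces) == data)               # True: chunking is lossless
```

Feeding the resulting chunks to a hash index then deduplicates any content that repeats anywhere in the stream, which is why chunk boundaries need to be content-defined rather than fixed-size.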

Primary data serving, reliability, and resiliency aspects of the system are not covered in this paper.

Check out the video of the talk given by Sudipta Sengupta and Adi Oltean and download the PDF of the paper on the USENIX website.

Scott M. Johnson
Program Manager II
Data Deduplication Team
