Dataset Similarity Detection for Global Deduplication in the DD File System
Tony Wong, Smriti Thakkar, Kao-Feng Hsieh, Zachary Tom, Hetaben Saraiya, Philip Shilane
2023 IEEE 39th International Conference on Data Engineering (ICDE), April 2023
DOI: 10.1109/ICDE55515.2023.00255 (https://doi.org/10.1109/ICDE55515.2023.00255)
Abstract
Deduplication has become a widely used technique to reduce space requirements for storage systems by replacing redundant chunks of data with references. While storage systems continue to grow in size, there remain practical limits to the size of any deduplication node, and enterprise businesses may have dozens to hundreds of nodes. In a multi-node environment, it is therefore important to place datasets on nodes so that deduplication savings are realized globally. For customers of the DD File System (DDFS), we provide the Global Deduplication Service, which advises customers on data placement to maximize deduplication-related space savings. This paper describes our currently shipping approach, which uses a Fingerprint Dictionary to intelligently cluster customer data and generate a plan to relocate datasets to improve global deduplication. We report results from thousands of deployed systems at customer sites. We have also developed a further improvement using MinHashes that lowers resource requirements, and we provide proofs of the similarity estimates. Our results on a real-world dataset show that MinHashes improve clustering speed by up to 400X relative to our previous method and reduce memory consumption by up to 260X.
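To make the MinHash idea in the abstract concrete, the following is a minimal Python sketch of estimating the similarity of two datasets from fixed-size signatures rather than from their full fingerprint sets. It is an illustration only, not the implementation described in the paper: the function names (`minhash_signature`, `estimate_jaccard`, `fake_fingerprints`), the use of keyed BLAKE2b as the hash family, the signature length of 128, and the synthetic SHA-1 "chunk fingerprints" are all assumptions made for this example.

```python
import hashlib
import random


def minhash_signature(fingerprints, num_hashes=128, seed=0):
    """Return a MinHash signature for a non-empty set of byte-string fingerprints.

    Assumption: each of the num_hashes hash functions is BLAKE2b keyed with a
    different random 8-byte key; the signature keeps the minimum value per key.
    """
    rng = random.Random(seed)
    keys = [rng.getrandbits(64).to_bytes(8, "big") for _ in range(num_hashes)]
    signature = []
    for key in keys:
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(fp, digest_size=8, key=key).digest(), "big"
            )
            for fp in fingerprints
        )
        signature.append(smallest)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def fake_fingerprints(start, stop):
    """Hypothetical stand-in for chunk fingerprints: SHA-1 digests of integers."""
    return {hashlib.sha1(str(i).encode()).digest() for i in range(start, stop)}


# Two synthetic datasets that share half of their chunks.
dataset_a = fake_fingerprints(0, 1000)
dataset_b = fake_fingerprints(500, 1500)

# Exact Jaccard similarity needs both full sets in memory.
exact = len(dataset_a & dataset_b) / len(dataset_a | dataset_b)

# The estimate needs only the two 128-entry signatures.
sig_a = minhash_signature(dataset_a)
sig_b = minhash_signature(dataset_b)
estimate = estimate_jaccard(sig_a, sig_b)

print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

Both signatures must be built with the same seed so that the same hash functions are compared slot by slot. The estimate is approximate, and its error shrinks as `num_hashes` grows; the memory saving comes from comparing signatures of constant size instead of complete fingerprint sets.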