Tony Wong, Smriti Thakkar, Kao-Feng Hsieh, Zachary Tom, Hetaben Saraiya, Philip Shilane
2023 IEEE 39th International Conference on Data Engineering (ICDE), published 2023-04-01. DOI: 10.1109/ICDE55515.2023.00255 (https://doi.org/10.1109/ICDE55515.2023.00255)
Dataset Similarity Detection for Global Deduplication in the DD File System
Deduplication has become a widely used technique to reduce space requirements for storage systems by replacing redundant chunks of data with references. While storage systems continue to grow in size, there remain practical limits to the size of any deduplication node, and enterprise businesses may have dozens to hundreds of nodes. It is important to place datasets on nodes in a multi-node environment so that deduplication savings are realized globally. For customers of the DD File System (DDFS), we provide the Global Deduplication Service, which advises customers on data placement to maximize deduplication-related space savings. This paper describes our currently shipping approach, which uses a Fingerprint Dictionary to intelligently cluster customer data and generate a plan to relocate datasets to improve global deduplication. We report results from thousands of deployed systems at customer sites. We have also developed a further improvement using MinHashes that lowers resource requirements, and we provide proofs of the similarity estimates. Our results on a real-world dataset show that MinHashes improve clustering speed by up to 400X relative to our previous method and reduce memory consumption by up to 260X.
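To make the MinHash idea concrete, here is a minimal sketch of how the similarity of two datasets can be estimated from small signatures instead of full fingerprint sets. It is not the paper's implementation: the abstract does not describe the DDFS fingerprint format or the exact MinHash construction, so the function names (`minhash_signature`, `estimate_jaccard`, `exact_jaccard`), the salted BLAKE2b hashing, the SHA-1 stand-in fingerprints, and the signature length are all illustrative assumptions.

```python
import hashlib
import random


def minhash_signature(fingerprints, num_hashes=256, seed=0):
    """Return a MinHash signature for a non-empty set of chunk fingerprints (bytes).

    Assumption: each of the num_hashes hash functions is BLAKE2b keyed with a
    random 8-byte salt. Signatures are only comparable when built with the
    same num_hashes and seed.
    """
    rng = random.Random(seed)
    salts = [rng.getrandbits(64).to_bytes(8, "big") for _ in range(num_hashes)]
    signature = []
    for salt in salts:
        # Keep only the smallest hash value seen under this salted hash function.
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(fp, digest_size=8, key=salt).digest(), "big"
            )
            for fp in fingerprints
        )
        signature.append(smallest)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|, for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)


def fake_fingerprints(chunk_ids):
    """Hypothetical stand-in for real chunk fingerprints: SHA-1 of the chunk id."""
    return {hashlib.sha1(str(i).encode()).digest() for i in chunk_ids}


if __name__ == "__main__":
    # Two synthetic datasets that share half of their chunks.
    dataset_a = fake_fingerprints(range(0, 1000))
    dataset_b = fake_fingerprints(range(500, 1500))
    sig_a = minhash_signature(dataset_a)
    sig_b = minhash_signature(dataset_b)
    print("exact    :", exact_jaccard(dataset_a, dataset_b))
    print("estimated:", estimate_jaccard(sig_a, sig_b))
```

The point of the sketch is the resource trade-off the abstract alludes to: each dataset is reduced to a fixed-size signature (256 integers here) regardless of how many fingerprints it contains, and two datasets are compared by scanning their signatures rather than intersecting full fingerprint sets. The estimate is approximate; a longer signature reduces the error at the cost of more memory and hashing work.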