Yinjin Fu, Jun Su, Jiahao Ning, Jian Wu, Yutong Lu, Nong Xiao
ACM Computing Surveys. Published 2025-05-10. DOI: https://doi.org/10.1145/3735508
Distributed Data Deduplication for Big Data: A Survey
To address the throughput and capacity limitations of single-node centralized deduplication, distributed data deduplication has become a popular technology in big data management for saving storage space, enhancing I/O performance, and improving system scalability. It consists of two parts: inter-node data assignment, in which a data routing scheme distributes data from clients across multiple deduplication nodes, and independent intra-node redundancy suppression within each node. In this paper, we first describe the background of big data deduplication. We then summarize and classify the state of the art in the key techniques of distributed data deduplication, including data partitioning, chunk fingerprinting, data routing, index lookup, data restoring, garbage collection, and the security and reliability of deduplicated data. This classification helps readers identify and understand how existing distributed data deduplication methods are implemented. Moreover, we present some representative industrial products that have successfully applied distributed data deduplication technologies. Finally, we discuss the main challenges and industry trends of distributed data deduplication, and outline open problems and future research directions.
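The two-level structure described in the abstract (routing between nodes, then suppressing redundancy inside each node) can be illustrated with a minimal sketch. It is not taken from the survey: the names `DedupNode`, `route`, `write`, and `restore` are hypothetical, and fixed-size chunking, SHA-256 fingerprints, and hash-modulo routing are simply one easy-to-read choice for each step, assumed here for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so a short demo input contains duplicates


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Data partitioning: split the input into fixed-size chunks (assumed scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(piece: bytes) -> str:
    """Chunk fingerprinting: a cryptographic hash identifies the chunk content."""
    return hashlib.sha256(piece).hexdigest()


class DedupNode:
    """Hypothetical deduplication node with its own independent fingerprint index."""

    def __init__(self) -> None:
        self.index: dict[str, bytes] = {}

    def store(self, fp: str, piece: bytes) -> bool:
        """Index lookup: keep the chunk only if its fingerprint is new.

        Returns True when the chunk was stored, False when it was a duplicate.
        """
        if fp in self.index:
            return False
        self.index[fp] = piece
        return True


def route(fp: str, num_nodes: int) -> int:
    """Data routing: map a fingerprint to a node (hash-modulo, assumed scheme)."""
    return int(fp, 16) % num_nodes


def write(data: bytes, nodes: list[DedupNode]) -> list[tuple[int, str]]:
    """Route every chunk to a node and return the recipe needed to restore the data."""
    recipe = []
    for piece in chunk(data):
        fp = fingerprint(piece)
        node_id = route(fp, len(nodes))
        nodes[node_id].store(fp, piece)
        recipe.append((node_id, fp))
    return recipe


def restore(recipe: list[tuple[int, str]], nodes: list[DedupNode]) -> bytes:
    """Data restoring: fetch each chunk from the node named in the recipe."""
    return b"".join(nodes[node_id].index[fp] for node_id, fp in recipe)


if __name__ == "__main__":
    nodes = [DedupNode() for _ in range(3)]
    data = b"abcdabcdabcdwxyz"
    recipe = write(data, nodes)
    stored = sum(len(node.index) for node in nodes)
    print(f"logical chunks: {len(recipe)}, stored chunks: {stored}")
    print("restored correctly:", restore(recipe, nodes) == data)
```

Because identical chunks always produce the same fingerprint, this routing rule sends them to the same node, so each node can deduplicate independently without consulting the others. Real systems trade this simplicity against load balance and routing overhead, which is the kind of design space the survey classifies.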
About the journal:
ACM Computing Surveys (CSUR) is an academic journal that publishes surveys and tutorials on areas of computing research and practice. The journal aims to provide comprehensive, accessible articles that guide readers through the literature and help them understand topics outside their specialties. CSUR had a 2022 Impact Factor of 16.6 and was ranked 3rd out of 111 journals in the Computer Science, Theory & Methods category.
ACM Computing Surveys is indexed and abstracted in various services, including AI2 Semantic Scholar, Baidu, Clarivate/ISI: JCR, CNKI, DeepDyve, DTU, EBSCO: EDS/HOST, and IET Inspec, among others.