MAD2: A scalable high-throughput exact deduplication approach for network backup services

Jiansheng Wei, Hong Jiang, Ke Zhou, D. Feng
{"title":"MAD2: A scalable high-throughput exact deduplication approach for network backup services","authors":"Jiansheng Wei, Hong Jiang, Ke Zhou, D. Feng","doi":"10.1109/MSST.2010.5496987","DOIUrl":null,"url":null,"abstract":"Deduplication has been widely used in disk-based secondary storage systems to improve space efficiency. However, there are two challenges facing scalable high-throughput deduplication storage. The first is the duplicate-lookup disk bottleneck due to the large size of data index that usually exceeds the available RAM space, which limits the deduplication throughput. The second is the storage node island effect resulting from duplicate data among multiple storage nodes that are difficult to eliminate. Existing approaches fail to completely eliminate the duplicates while simultaneously addressing the challenges. This paper proposes MAD2, a scalable high-throughput exact deduplication approach for network backup services. MAD2 eliminates duplicate data both at the file level and at the chunk level by employing four techniques to accelerate the deduplication process and evenly distribute data. First, MAD2 organizes fingerprints into a Hash Bucket Matrix (HBM), whose rows can be used to preserve the data locality in backups. Second, MAD2 uses Bloom Filter Array (BFA) as a quick index to quickly identify non-duplicate incoming data objects or indicate where to find a possible duplicate. Third, Dual Cache is integrated in MAD2 to effectively capture and exploit data locality. Finally, MAD2 employs a DHT-based Load-Balance technique to evenly distribute data objects among multiple storage nodes in their backup sequences to further enhance performance with a well-balanced load. We evaluate our MAD2 approach on the backend storage of B-Cloud, a research-oriented distributed system that provides network backup services. Experimental results show that MAD2 significantly outperforms the state-of-the-art approximate deduplication approaches in terms of deduplication efficiency, supporting a deduplication throughput of at least 100MB/s for each storage component.","PeriodicalId":350968,"journal":{"name":"2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"98","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSST.2010.5496987","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 98

Abstract

Deduplication has been widely used in disk-based secondary storage systems to improve space efficiency. However, scalable high-throughput deduplication storage faces two challenges. The first is the duplicate-lookup disk bottleneck: the data index usually exceeds the available RAM space, which limits deduplication throughput. The second is the storage node island effect, caused by duplicate data spread across multiple storage nodes that is difficult to eliminate. Existing approaches fail to eliminate duplicates completely while simultaneously addressing both challenges. This paper proposes MAD2, a scalable high-throughput exact deduplication approach for network backup services. MAD2 eliminates duplicate data at both the file level and the chunk level by employing four techniques to accelerate the deduplication process and evenly distribute data. First, MAD2 organizes fingerprints into a Hash Bucket Matrix (HBM), whose rows can be used to preserve data locality in backups. Second, MAD2 uses a Bloom Filter Array (BFA) as a quick index to identify non-duplicate incoming data objects or to indicate where a possible duplicate may be found. Third, a Dual Cache is integrated into MAD2 to effectively capture and exploit data locality. Finally, MAD2 employs a DHT-based Load-Balance technique to evenly distribute data objects among multiple storage nodes in their backup sequences, further enhancing performance with a well-balanced load. We evaluate MAD2 on the backend storage of B-Cloud, a research-oriented distributed system that provides network backup services. Experimental results show that MAD2 significantly outperforms state-of-the-art approximate deduplication approaches in terms of deduplication efficiency, supporting a deduplication throughput of at least 100 MB/s for each storage component.
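
As a rough illustration of the quick-index idea described above (a sketch, not the paper's implementation), the Python snippet below pairs one small Bloom filter with each hash-bucket row: a filter miss guarantees a non-duplicate and admits the object without touching the on-disk index, while a filter hit names the row where a possible duplicate may reside. All names and parameters here (SimpleBloomFilter, BloomFilterArray, num_rows, filter sizing) are hypothetical.

```python
import hashlib


class SimpleBloomFilter:
    """Fixed-size Bloom filter using k double-hashed probe positions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive k probe positions from two independent digests
        # (the standard double-hashing construction).
        h1 = int.from_bytes(hashlib.sha1(fingerprint).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.md5(fingerprint).digest()[:8], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, fingerprint: bytes):
        for p in self._positions(fingerprint):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, fingerprint: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(fingerprint))


class BloomFilterArray:
    """One filter per hash-bucket row; a hit names the row to search."""

    def __init__(self, num_rows=128):
        self.filters = [SimpleBloomFilter() for _ in range(num_rows)]

    def row_of(self, fingerprint: bytes) -> int:
        # Map a fingerprint to a bucket row, here by its leading byte.
        return fingerprint[0] % len(self.filters)

    def lookup(self, fingerprint: bytes):
        """Return None for a definite non-duplicate, or the row index
        where a possible duplicate may reside (false positives possible)."""
        row = self.row_of(fingerprint)
        return row if self.filters[row].might_contain(fingerprint) else None

    def insert(self, fingerprint: bytes):
        self.filters[self.row_of(fingerprint)].add(fingerprint)


# Example usage:
bfa = BloomFilterArray()
fp = hashlib.sha1(b"chunk data").digest()
if bfa.lookup(fp) is None:
    bfa.insert(fp)   # definitely new: store the chunk and index its fingerprint
else:
    pass             # possible duplicate: verify against the matching bucket row on disk
```

Because Bloom filters produce no false negatives, a miss is conclusive and incurs no disk I/O; only the (tunably rare) filter hits require consulting the on-disk fingerprint index, which is what lets such a quick index relieve the duplicate-lookup disk bottleneck.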