Extreme Binning:用于基于块的文件备份的可伸缩的并行重复数据删除

Deepavali Bhagwat, K. Eshghi, D. Long, Mark Lillibridge
{"title":"Extreme Binning:用于基于块的文件备份的可伸缩的并行重复数据删除","authors":"Deepavali Bhagwat, K. Eshghi, D. Long, Mark Lillibridge","doi":"10.1109/MASCOT.2009.5366623","DOIUrl":null,"url":null,"abstract":"Data deduplication is an essential and critical component of backup systems. Essential, because it reduces storage space requirements, and critical, because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams with high locality, which existing deduplication techniques require to provide reasonable throughput. We present Extreme Binning, a scalable deduplication technique for non-traditional backup workloads that are made up of individual files with no locality among consecutive files in a given window of time. Due to lack of locality, existing techniques perform poorly on these workloads. Extreme Binning exploits file similarity instead of locality, and makes only one disk access for chunk lookup per file, which gives reasonable throughput. Multi-node backup systems built with Extreme Binning scale gracefully with the amount of input data; more backup nodes can be added to boost throughput. Each file is allocated using a stateless routing algorithm to only one node, allowing for maximum parallelization, and each backup node is autonomous with no dependency across nodes, making data management tasks robust with low overhead.","PeriodicalId":275737,"journal":{"name":"2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems","volume":"110 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"355","resultStr":"{\"title\":\"Extreme Binning: Scalable, parallel deduplication for chunk-based file backup\",\"authors\":\"Deepavali Bhagwat, K. Eshghi, D. Long, Mark Lillibridge\",\"doi\":\"10.1109/MASCOT.2009.5366623\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data deduplication is an essential and critical component of backup systems. Essential, because it reduces storage space requirements, and critical, because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams with high locality, which existing deduplication techniques require to provide reasonable throughput. We present Extreme Binning, a scalable deduplication technique for non-traditional backup workloads that are made up of individual files with no locality among consecutive files in a given window of time. Due to lack of locality, existing techniques perform poorly on these workloads. Extreme Binning exploits file similarity instead of locality, and makes only one disk access for chunk lookup per file, which gives reasonable throughput. Multi-node backup systems built with Extreme Binning scale gracefully with the amount of input data; more backup nodes can be added to boost throughput. Each file is allocated using a stateless routing algorithm to only one node, allowing for maximum parallelization, and each backup node is autonomous with no dependency across nodes, making data management tasks robust with low overhead.\",\"PeriodicalId\":275737,\"journal\":{\"name\":\"2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems\",\"volume\":\"110 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-12-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"355\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MASCOT.2009.5366623\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MASCOT.2009.5366623","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 355

摘要

重复数据删除是备份系统必不可少的重要组成部分。必不可少,因为它减少了存储空间需求;至关重要,因为整个备份操作的性能取决于它的吞吐量。传统的备份工作负载由高度局部性的大数据流组成,现有的重复数据删除技术需要提供合理的吞吐量。我们介绍Extreme Binning,这是一种可扩展的重复数据删除技术,适用于非传统备份工作负载,这些工作负载由单个文件组成,在给定的时间窗口内,连续文件之间没有位置。由于缺乏局部性,现有技术在这些工作负载上的性能很差。极限分组利用文件相似性而不是局部性,并且对每个文件只进行一次磁盘访问以进行块查找,从而提供合理的吞吐量。使用Extreme bining构建的多节点备份系统可以随输入数据量优雅地扩展;可以添加更多的备份节点来提高吞吐量。每个文件使用无状态路由算法只分配给一个节点,允许最大程度的并行化,并且每个备份节点都是自治的,没有节点之间的依赖关系,从而使数据管理任务具有较低的开销。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Extreme Binning: Scalable, parallel deduplication for chunk-based file backup
Data deduplication is an essential and critical component of backup systems. Essential, because it reduces storage space requirements, and critical, because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams with high locality, which existing deduplication techniques require to provide reasonable throughput. We present Extreme Binning, a scalable deduplication technique for non-traditional backup workloads that are made up of individual files with no locality among consecutive files in a given window of time. Due to lack of locality, existing techniques perform poorly on these workloads. Extreme Binning exploits file similarity instead of locality, and makes only one disk access for chunk lookup per file, which gives reasonable throughput. Multi-node backup systems built with Extreme Binning scale gracefully with the amount of input data; more backup nodes can be added to boost throughput. Each file is allocated using a stateless routing algorithm to only one node, allowing for maximum parallelization, and each backup node is autonomous with no dependency across nodes, making data management tasks robust with low overhead.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信