Extreme Binning: Scalable, parallel deduplication for chunk-based file backup

2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems. Pub Date: 2009-12-28. DOI: 10.1109/MASCOT.2009.5366623

Deepavali Bhagwat, K. Eshghi, D. Long, Mark Lillibridge


Citations: 355

Abstract

Data deduplication is an essential and critical component of backup systems. Essential, because it reduces storage space requirements, and critical, because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams with high locality, which existing deduplication techniques require to provide reasonable throughput. We present Extreme Binning, a scalable deduplication technique for non-traditional backup workloads that are made up of individual files with no locality among consecutive files in a given window of time. Due to lack of locality, existing techniques perform poorly on these workloads. Extreme Binning exploits file similarity instead of locality, and makes only one disk access for chunk lookup per file, which gives reasonable throughput. Multi-node backup systems built with Extreme Binning scale gracefully with the amount of input data; more backup nodes can be added to boost throughput. Each file is allocated using a stateless routing algorithm to only one node, allowing for maximum parallelization, and each backup node is autonomous with no dependency across nodes, making data management tasks robust with low overhead.
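
The abstract names three mechanisms (similarity-based grouping, one chunk-lookup disk access per file, and stateless routing of each file to a single node) without giving details. The following is a minimal Python sketch of how they could fit together, not the authors' implementation. Every name in it (`chunk_ids`, `representative_id`, `route`, `BackupNode`) is hypothetical, and so are the design choices: fixed-size chunking, SHA-1 chunk IDs, using the minimum chunk ID as a file's representative, modulo routing, and an in-memory dictionary standing in for on-disk bins.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: fixed-size chunks, chosen only for brevity


def chunk_ids(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks and return the SHA-1 digest of each.

    An empty file is treated as a single empty chunk.
    """
    return [
        hashlib.sha1(data[i:i + chunk_size]).digest()
        for i in range(0, max(len(data), 1), chunk_size)
    ]


def representative_id(ids: list[bytes]) -> bytes:
    """Pick one chunk ID to stand for the whole file.

    Assumption: the minimum ID is used, so files that share most of their
    chunks are likely to share a representative and land in the same bin.
    """
    return min(ids)


def route(rep_id: bytes, num_nodes: int) -> int:
    """Stateless routing: the node depends only on the representative ID."""
    return int.from_bytes(rep_id, "big") % num_nodes


class BackupNode:
    """One autonomous node; it never consults any other node."""

    def __init__(self) -> None:
        # Maps representative ID -> set of chunk IDs (a stand-in for a disk bin).
        self.bins: dict[bytes, set[bytes]] = {}
        self.bin_loads = 0  # counts bin fetches, modelled as disk accesses

    def backup(self, ids: list[bytes], rep_id: bytes) -> int:
        """Deduplicate a file against a single bin; return new chunks stored."""
        bin_ = self.bins.setdefault(rep_id, set())  # the only lookup for this file
        self.bin_loads += 1
        new_chunks = {c for c in ids if c not in bin_}
        bin_.update(new_chunks)
        return len(new_chunks)
```

In this sketch a file is sent to `nodes[route(rep_id, len(nodes))]`, and the chosen node fetches exactly one bin regardless of how many chunks the file contains. Chunks that match only chunks held in other bins or on other nodes are stored again, which is the deduplication a per-file, single-bin design gives up in exchange for throughput and node independence.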
