Characterizing datasets for data deduplication in backup applications

Nohhyun Park, D. Lilja
{"title":"对备份应用中的数据集进行特征描述,以便重复数据删除","authors":"Nohhyun Park, D. Lilja","doi":"10.1109/IISWC.2010.5650369","DOIUrl":null,"url":null,"abstract":"The compression and throughput performance of data deduplication system is directly affected by the input dataset. We propose two sets of evaluation metrics, and the means to extract those metrics, for deduplication systems. The First set of metrics represents how the composition of segments changes within the deduplication system over five full backups. This in turn allows more insights into how the compression ratio will change as data accumulate. The second set of metrics represents index table fragmentation caused by duplicate elimination and the arrival rate at the underlying storage system. We show that, while shorter sequences of unique data may be bad for index caching, they provide a more uniform arrival rate which improves the overall throughput. Finally, we compute the metrics derived from the datasets under evaluation and show how the datasets perform with different metrics. Our evaluation shows that backup datasets typically exhibit patterns in how they change over time and that these patterns are quantifiable in terms of how they affect the deduplication process. This quantification allows us to: 1) decide whether deduplication is applicable, 2) provision resources, 3) tune the data deduplication parameters and 4) potentially decide which portion of the dataset is best suited for deduplication.","PeriodicalId":107589,"journal":{"name":"IEEE International Symposium on Workload Characterization (IISWC'10)","volume":"216 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"49","resultStr":"{\"title\":\"Characterizing datasets for data deduplication in backup applications\",\"authors\":\"Nohhyun Park, D. Lilja\",\"doi\":\"10.1109/IISWC.2010.5650369\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The compression and throughput performance of data deduplication system is directly affected by the input dataset. We propose two sets of evaluation metrics, and the means to extract those metrics, for deduplication systems. The First set of metrics represents how the composition of segments changes within the deduplication system over five full backups. This in turn allows more insights into how the compression ratio will change as data accumulate. The second set of metrics represents index table fragmentation caused by duplicate elimination and the arrival rate at the underlying storage system. We show that, while shorter sequences of unique data may be bad for index caching, they provide a more uniform arrival rate which improves the overall throughput. Finally, we compute the metrics derived from the datasets under evaluation and show how the datasets perform with different metrics. Our evaluation shows that backup datasets typically exhibit patterns in how they change over time and that these patterns are quantifiable in terms of how they affect the deduplication process. 
This quantification allows us to: 1) decide whether deduplication is applicable, 2) provision resources, 3) tune the data deduplication parameters and 4) potentially decide which portion of the dataset is best suited for deduplication.\",\"PeriodicalId\":107589,\"journal\":{\"name\":\"IEEE International Symposium on Workload Characterization (IISWC'10)\",\"volume\":\"216 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-12-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"49\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE International Symposium on Workload Characterization (IISWC'10)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IISWC.2010.5650369\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE International Symposium on Workload Characterization (IISWC'10)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IISWC.2010.5650369","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 49

Abstract

The compression and throughput performance of a data deduplication system are directly affected by the input dataset. We propose two sets of evaluation metrics, and the means to extract those metrics, for deduplication systems. The first set of metrics captures how the composition of segments within the deduplication system changes over five full backups, which in turn gives insight into how the compression ratio will change as data accumulate. The second set of metrics captures the index-table fragmentation caused by duplicate elimination and the arrival rate at the underlying storage system. We show that, while shorter sequences of unique data may be bad for index caching, they provide a more uniform arrival rate, which improves overall throughput. Finally, we compute these metrics for the datasets under evaluation and show how the datasets perform under each metric. Our evaluation shows that backup datasets typically exhibit patterns in how they change over time, and that these patterns are quantifiable in terms of how they affect the deduplication process. This quantification allows us to: 1) decide whether deduplication is applicable, 2) provision resources, 3) tune the deduplication parameters, and 4) potentially decide which portion of the dataset is best suited for deduplication.
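
To make the two metric families concrete, the following is a minimal Python sketch, not the authors' tooling: it fingerprints fixed-size segments across a series of full backups, reporting a per-generation duplicate-elimination ratio (loosely corresponding to the first metric set, segment composition over generations) and a histogram of consecutive never-seen-segment run lengths (loosely corresponding to the unique-sequence behavior behind the second set). The 4 KiB segment size, SHA-1 fingerprints, and synthetic workload are illustrative assumptions; the paper's actual chunking and metric definitions may differ.

```python
import hashlib
import os
import random
from collections import Counter

SEGMENT_SIZE = 4096  # hypothetical fixed segment size; the paper's chunking may differ


def segment_hashes(data, size=SEGMENT_SIZE):
    """Split a backup stream into fixed-size segments and fingerprint each one."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def characterize_backups(backups):
    """For each successive full backup, yield (generation, dedup ratio,
    histogram of unique-segment run lengths). The dedup ratio tracks how
    segment composition shifts as generations accumulate; the run-length
    histogram hints at index-cache friendliness and arrival-rate uniformity."""
    seen = set()  # fingerprints of every segment stored so far
    for generation, data in enumerate(backups, start=1):
        flags = []  # True where a segment has never been seen before
        for h in segment_hashes(data):
            flags.append(h not in seen)
            seen.add(h)
        n_unique = sum(flags)
        dedup_ratio = len(flags) / max(n_unique, 1)  # logical / newly stored segments
        runs, run = [], 0  # run lengths of consecutive never-seen segments
        for is_unique in flags:
            if is_unique:
                run += 1
            else:
                if run:
                    runs.append(run)
                run = 0
        if run:
            runs.append(run)
        yield generation, dedup_ratio, Counter(runs)


# Synthetic stand-in for five full backups: each generation overwrites
# roughly 10% of the segments, mimicking incremental change between backups.
random.seed(0)
image = bytearray(os.urandom(SEGMENT_SIZE * 1000))
backups = []
for _ in range(5):
    for _ in range(100):
        i = random.randrange(1000) * SEGMENT_SIZE
        image[i:i + SEGMENT_SIZE] = os.urandom(SEGMENT_SIZE)
    backups.append(bytes(image))

for gen, ratio, run_hist in characterize_backups(backups):
    print(f"backup {gen}: dedup ratio {ratio:.2f}, "
          f"unique-run histogram {sorted(run_hist.items())[:5]}")
```

On this synthetic workload the first generation is almost entirely unique (ratio near 1), while later generations approach a ratio of about 10, and the scattered mutations produce many short unique runs, the case the abstract identifies as bad for index caching but good for arrival-rate uniformity.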