{"title":"排序重复数据删除:如何处理数千个备份流","authors":"J. Kaiser, Tim Süß, Lars Nagel, A. Brinkmann","doi":"10.1109/MSST.2016.7897082","DOIUrl":null,"url":null,"abstract":"The requirements of deduplication systems have changed in the last years. Early deduplication systems had to process dozens to hundreds of backup streams at the same time while today they are able to process hundreds to thousands of them. Traditional approaches rely on stream-locality, which supports parallelism, but which easily leads to many non-contiguous disk accesses, as each stream competes with all other streams for the available resources. This paper presents a new exact deduplication approach designed for processing thousands of backup streams at the same time on the same fingerprint index. The underlying approach destroys the traditionally exploited temporal chunk locality and creates a new one by sorting fingerprints. The sorting leads to perfectly sequential disk access patterns on the backup servers, while only slightly increasing the load on the clients. In our experiments, the new approach generates up to 113 times less I/Os than the exact Data Domain deduplication file system and up to 12 times less I/Os than the approximate Sparse Indexing, while consuming less memory at the same time.","PeriodicalId":299251,"journal":{"name":"2016 32nd Symposium on Mass Storage Systems and Technologies (MSST)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Sorted deduplication: How to process thousands of backup streams\",\"authors\":\"J. Kaiser, Tim Süß, Lars Nagel, A. Brinkmann\",\"doi\":\"10.1109/MSST.2016.7897082\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The requirements of deduplication systems have changed in the last years. Early deduplication systems had to process dozens to hundreds of backup streams at the same time while today they are able to process hundreds to thousands of them. Traditional approaches rely on stream-locality, which supports parallelism, but which easily leads to many non-contiguous disk accesses, as each stream competes with all other streams for the available resources. This paper presents a new exact deduplication approach designed for processing thousands of backup streams at the same time on the same fingerprint index. The underlying approach destroys the traditionally exploited temporal chunk locality and creates a new one by sorting fingerprints. The sorting leads to perfectly sequential disk access patterns on the backup servers, while only slightly increasing the load on the clients. 
In our experiments, the new approach generates up to 113 times less I/Os than the exact Data Domain deduplication file system and up to 12 times less I/Os than the approximate Sparse Indexing, while consuming less memory at the same time.\",\"PeriodicalId\":299251,\"journal\":{\"name\":\"2016 32nd Symposium on Mass Storage Systems and Technologies (MSST)\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-05-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 32nd Symposium on Mass Storage Systems and Technologies (MSST)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MSST.2016.7897082\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 32nd Symposium on Mass Storage Systems and Technologies (MSST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSST.2016.7897082","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Sorted deduplication: How to process thousands of backup streams
Abstract
The requirements of deduplication systems have changed in recent years. Early deduplication systems had to process dozens to hundreds of backup streams at the same time, while today they must handle hundreds to thousands of them. Traditional approaches rely on stream locality, which supports parallelism but easily leads to many non-contiguous disk accesses, as each stream competes with all other streams for the available resources. This paper presents a new exact deduplication approach designed to process thousands of backup streams at the same time on the same fingerprint index. The approach abandons the traditionally exploited temporal chunk locality and creates a new locality by sorting fingerprints. The sorting leads to perfectly sequential disk access patterns on the backup servers while only slightly increasing the load on the clients. In our experiments, the new approach generates up to 113 times fewer I/Os than the exact Data Domain deduplication file system and up to 12 times fewer I/Os than the approximate Sparse Indexing, while also consuming less memory.
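To make the core idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of sort-based fingerprint lookup: instead of probing the fingerprint index in each stream's arrival order, chunk fingerprints from many streams are collected, sorted, and matched against a sorted on-disk index in a single sequential sweep. All names here (Chunk, batch_lookup, the SHA-1 choice, and the in-memory list standing in for the on-disk index) are assumptions made for the sketch.

```python
# Sketch of sorted deduplication lookups, assuming a fingerprint index kept
# in sorted order (simulated here as a sorted Python list).
import hashlib
from typing import Iterable, NamedTuple

class Chunk(NamedTuple):
    stream_id: int      # which backup stream the chunk came from
    offset: int         # position of the chunk within its stream
    fingerprint: bytes  # content hash used for duplicate detection

def fingerprint(data: bytes) -> bytes:
    # SHA-1 is a common choice for chunk fingerprints in deduplication work.
    return hashlib.sha1(data).digest()

def batch_lookup(chunks: Iterable[Chunk], sorted_index: list[bytes]) -> dict[bytes, bool]:
    """Check which fingerprints already exist in the (sorted) index.

    Because both the queries and the index are sorted, one forward merge pass
    answers all lookups with strictly sequential access, no matter how many
    streams contributed chunks."""
    queries = sorted({c.fingerprint for c in chunks})
    known: dict[bytes, bool] = {}
    i = 0
    for fp in queries:
        # Advance through the index until we reach or pass the query fingerprint.
        while i < len(sorted_index) and sorted_index[i] < fp:
            i += 1
        known[fp] = i < len(sorted_index) and sorted_index[i] == fp
    return known

# Usage: chunks from many streams can be mixed freely; only the sorted
# fingerprint order matters for the index sweep.
index = sorted(fingerprint(b) for b in [b"old-block-1", b"old-block-2"])
incoming = [Chunk(s, 0, fingerprint(d)) for s, d in enumerate([b"old-block-1", b"new-block"])]
print(batch_lookup(incoming, index))
```

In a real system the sorted index would live on disk and the sweep would translate into large sequential reads, which is what replaces the per-stream random accesses described in the abstract; this sketch only models that access pattern in memory.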