Boosting the Restoring Performance of Deduplication Data by Classifying Backup Metadata

Ru Yang, Yuhui Deng, Yi Zhou, Ping Huang
{"title":"Boosting the Restoring Performance of Deduplication Data by Classifying Backup Metadata","authors":"Ru Yang, Yuhui Deng, Yi Zhou, Ping Huang","doi":"10.1145/3437261","DOIUrl":null,"url":null,"abstract":"Restoring data is the main purpose of data backup in storage systems. The fragmentation issue, caused by physically scattering logically continuous data across a variety of disk locations, poses a negative impact on the restoring performance of a deduplication system. Rewriting algorithms are used to alleviate the fragmentation problem by improving the restoring speed of a deduplication system. However, rewriting methods give birth to a big sacrifice in terms of deduplication ratio, leading to a huge storage space waste. Furthermore, traditional backup approaches treat file metadata and chunk metadata as the same, which causes frequent on-disk metadata accesses. In this article, we start by analyzing storage characteristics of backup metadata. An intriguing finding shows that with 10 million files, the file metadata merely takes up approximately 340 MB. Motivated by this finding, we propose a Classified-Metadata based Restoring method (CMR) that classifies backup metadata into file metadata and chunk metadata. Because the file metadata merely takes up a meager amount of space, CMR maintains all file metadata in memory, whereas chunk metadata are aggressively prefetched to memory in a greedy manner. A deduplication system with CMR in place exhibits three salient features: (i) It avoids rewriting algorithms’ additional overhead by reducing the number of disk reads in a restoring process, (ii) it increases the restoring throughput without sacrificing the deduplication ratio, and (iii) it thoroughly leverages the hardware resources to boost the restoring performance. To quantitatively evaluate the performance of CMR, we compare our CMR against two state-of-the-art approaches, namely, a history-aware rewriting method (HAR) and a context-based rewriting scheme (CAP). The experimental results show that compared to HAR and CAP, CMR reduces the restoring time by 27.2% and 29.3%, respectively. Moreover, the deduplication ratio is improved by 1.91% and 4.36%, respectively.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 16"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3437261","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM/IMS transactions on data science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3437261","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 2

Abstract

Restoring data is the main purpose of data backup in storage systems. The fragmentation issue, caused by physically scattering logically continuous data across a variety of disk locations, negatively affects the restoring performance of a deduplication system. Rewriting algorithms alleviate the fragmentation problem and thereby improve the restoring speed of a deduplication system. However, rewriting methods sacrifice a considerable amount of deduplication ratio, wasting a large amount of storage space. Furthermore, traditional backup approaches treat file metadata and chunk metadata identically, which causes frequent on-disk metadata accesses. In this article, we start by analyzing the storage characteristics of backup metadata. An intriguing finding shows that with 10 million files, the file metadata takes up only approximately 340 MB (roughly 34 bytes per file). Motivated by this finding, we propose a Classified-Metadata based Restoring method (CMR) that classifies backup metadata into file metadata and chunk metadata. Because the file metadata occupies only a meager amount of space, CMR maintains all file metadata in memory, whereas chunk metadata are prefetched into memory in a greedy manner. A deduplication system with CMR in place exhibits three salient features: (i) it avoids the additional overhead of rewriting algorithms by reducing the number of disk reads in a restoring process, (ii) it increases the restoring throughput without sacrificing the deduplication ratio, and (iii) it thoroughly leverages the hardware resources to boost the restoring performance. To quantitatively evaluate the performance of CMR, we compare it against two state-of-the-art approaches, namely, a history-aware rewriting method (HAR) and a context-based rewriting scheme (CAP). The experimental results show that, compared to HAR and CAP, CMR reduces the restoring time by 27.2% and 29.3%, respectively. Moreover, the deduplication ratio is improved by 1.91% and 4.36%, respectively.
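The abstract describes CMR only at the architectural level: file metadata is small enough to pin entirely in memory, while chunk metadata is fetched from disk a whole container at a time. The Python sketch below is a minimal illustration of that idea under stated assumptions, not the paper's implementation: the `ClassifiedMetadataRestorer` class, the `store.read_container` interface, and the LRU-style container cache standing in for CMR's greedy prefetching are all names and simplifications introduced here.

```python
# Minimal sketch of a classified-metadata restore, assuming a hypothetical
# on-disk container store with a read_container(container_id) method that
# returns {fingerprint: chunk_bytes}. Names and cache policy are invented
# for illustration; they are not taken from the paper.
from collections import OrderedDict

class ClassifiedMetadataRestorer:
    def __init__(self, file_meta, chunk_index, store, cache_capacity=64):
        # File metadata is tiny (~340 MB for 10M files, i.e. ~34 B/file),
        # so the entire file-to-fingerprint mapping is kept in memory.
        self.file_meta = file_meta            # {path: [fingerprint, ...]}
        self.chunk_index = chunk_index        # {fingerprint: container_id}
        self.store = store                    # hypothetical on-disk store
        self.cache = OrderedDict()            # container_id -> {fp: bytes}
        self.cache_capacity = cache_capacity  # max containers held in RAM

    def _load_container(self, container_id):
        # Greedy prefetch stand-in: one disk read pulls in every chunk of
        # a container, so later chunks from it are served from memory.
        if container_id not in self.cache:
            if len(self.cache) >= self.cache_capacity:
                self.cache.popitem(last=False)  # evict the oldest container
            self.cache[container_id] = self.store.read_container(container_id)
        self.cache.move_to_end(container_id)    # mark as most recently used
        return self.cache[container_id]

    def restore(self, path, out):
        # No disk access is needed for file metadata; the only disk reads
        # during a restore are container reads.
        for fp in self.file_meta[path]:
            container = self._load_container(self.chunk_index[fp])
            out.write(container[fp])
```

The point the sketch makes concrete is the paper's stated trade-off: once the file recipes live entirely in RAM, the remaining disk reads are container reads only, which the greedy prefetching amortizes across all chunks of each container, so restore speed improves without rewriting any chunks and hence without losing deduplication ratio.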