Performance Centric Primary Storage Deduplication Systems Exploiting Caching and Block Similarity

Rakesh Gururaj, M. Moh, Teng-Sheng Moh, Philip Shilane, Bhimsen Bhanjois
{"title":"Performance Centric Primary Storage Deduplication Systems Exploiting Caching and Block Similarity","authors":"Rakesh Gururaj, M. Moh, Teng-Sheng Moh, Philip Shilane, Bhimsen Bhanjois","doi":"10.1109/IMCOM53663.2022.9721761","DOIUrl":null,"url":null,"abstract":"Data deduplication is a concept of physically storing a single instance of data by eliminating redundant copies to save the storage space by matching strong data hashes (e.g., fingerprints). The adoption of deduplication for primary storage has been hampered because of its complexities, such as random-access patterns to data and the need for quicker request response time. Most of the solutions designed for primary storage are offline and dependent on the concept of locality. This paper proposes an inline deduplication system with a Machine Learning based cache eviction policy to reduce the metadata overhead in the deduplication process, eliminate the redundant writes and improve the overall throughput in latency-sensitive storage workload.Caching of the fingerprints plays a vital role in improving performance during deduplication. A novel Machine Learning model for cache eviction is built based on the recency, frequency, Logical Block Address, and category of a data block. The experimental results show that 33% of redundant writes are eliminated, 54.5% of metadata overhead is reduced by exploiting block similarity, and the metadata cache hit rates based on the Machine Learning model are higher by 5.43% and 10.36% over systems with Least Recently Used eviction and Least Frequently Used eviction policy respectively. We achieved 14.4% better throughput with a workload-dependent Machine Learning-based cache eviction policy than a system with traditional cache eviction policy. The cache system learns the past evicted block I/O statistics and refines itself while choosing an eviction candidate. Our system was evaluated on real-world I/O traces in experiments.","PeriodicalId":367038,"journal":{"name":"2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IMCOM53663.2022.9721761","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

Data deduplication is a technique that physically stores a single instance of data, eliminating redundant copies to save storage space by matching strong data hashes (e.g., fingerprints). The adoption of deduplication for primary storage has been hampered by its complexities, such as random-access patterns to data and the need for faster request response times. Most solutions designed for primary storage are offline and depend on the concept of locality. This paper proposes an inline deduplication system with a Machine Learning-based cache eviction policy to reduce the metadata overhead of the deduplication process, eliminate redundant writes, and improve overall throughput for latency-sensitive storage workloads. Caching of fingerprints plays a vital role in improving performance during deduplication. A novel Machine Learning model for cache eviction is built based on the recency, frequency, Logical Block Address, and category of a data block. The experimental results show that 33% of redundant writes are eliminated, 54.5% of metadata overhead is reduced by exploiting block similarity, and metadata cache hit rates under the Machine Learning model are 5.43% and 10.36% higher than those of systems with Least Recently Used and Least Frequently Used eviction policies, respectively. We achieved 14.4% better throughput with a workload-dependent Machine Learning-based cache eviction policy than with a traditional cache eviction policy. The cache system learns I/O statistics of previously evicted blocks and refines itself when choosing an eviction candidate. Our system was evaluated in experiments on real-world I/O traces.
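The abstract's central mechanism, choosing an eviction victim by scoring each cached fingerprint with a learned model over recency, frequency, Logical Block Address, and category features, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the class names (LearnedFingerprintCache, CacheEntry), the scikit-learn-style model.predict interface, and the feature encoding are all assumptions made for the example.

```python
from dataclasses import dataclass, field
import time

# Minimal sketch of a fingerprint cache with a learned eviction policy.
# All names and interfaces here are illustrative assumptions; the paper
# trains a Machine Learning model on recency, frequency, Logical Block
# Address (LBA), and block category, and refines it with I/O statistics
# from previously evicted blocks.

@dataclass
class CacheEntry:
    fingerprint: bytes        # strong hash of the block's contents
    lba: int                  # Logical Block Address of the block
    category: int             # workload-derived block category
    frequency: int = 1        # number of cache hits so far
    last_access: float = field(default_factory=time.monotonic)

class LearnedFingerprintCache:
    def __init__(self, capacity: int, model):
        self.capacity = capacity
        self.model = model     # assumed: predicts re-reference likelihood from features
        self.entries: dict[bytes, CacheEntry] = {}
        self.evicted_stats: list[tuple] = []   # feedback for refining the model

    def _features(self, e: CacheEntry) -> list[float]:
        # Feature vector: recency, frequency, LBA, category.
        recency = time.monotonic() - e.last_access
        return [recency, float(e.frequency), float(e.lba), float(e.category)]

    def lookup(self, fingerprint: bytes) -> bool:
        e = self.entries.get(fingerprint)
        if e is None:
            return False       # miss: the on-disk dedup index must be consulted
        e.frequency += 1
        e.last_access = time.monotonic()
        return True            # hit: the redundant write can be eliminated inline

    def insert(self, entry: CacheEntry) -> None:
        if len(self.entries) >= self.capacity:
            # Evict the entry the model scores least likely to be re-referenced.
            victim = min(self.entries.values(),
                         key=lambda e: self.model.predict([self._features(e)])[0])
            # Record the victim's features so the model can later learn whether
            # evicting it was a good decision.
            self.evicted_stats.append(tuple(self._features(victim)))
            del self.entries[victim.fingerprint]
        self.entries[entry.fingerprint] = entry
```

In this sketch, model could be any regressor exposing a scikit-learn-style predict method; the evicted_stats list stands in for the feedback loop the abstract describes, where the policy is periodically retrained once it is known whether an evicted fingerprint was referenced again.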