HyperPart: A Hypergraph-Based Abstraction for Deduplicated Storage Systems

IF 5; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Cloud Computing Pub Date : 2024-11-19 DOI:10.1109/TCC.2024.3502464

Geyao Cheng;Junxu Xia;Lailong Luo;Haibo Mi;Deke Guo;Richard T. B. Ma

IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 46-60. Journal Article. https://ieeexplore.ieee.org/document/10758297/

Citations: 0

Abstract

Deduplication techniques are widely used to reduce storage overhead by removing redundant data blocks across large-scale servers in data centers. However, deduplication exacerbates the fragmentation of data blocks, which causes more cross-server file retrievals and sharply lowers retrieval throughput. Some approaches favor retrieval performance by confining all blocks of a file to a single server, at the cost of non-trivial extra space for blocks replicated across servers. An ideal network storage system should account for both deduplication and retrieval performance by assigning the detected unique blocks to servers judiciously. Such a fine-grained assignment requires an accurate and comprehensive abstraction of the files, the blocks, and the file-block affiliation relationships. To this end, we design a weighted hypergraph to profile these multivariate data correlations. With this abstraction in place, we propose HyperPart, which transforms the complex block allocation problem into a hypergraph partition problem. For more general scenarios with dynamic file updates, we further propose a two-phase incremental hypergraph repartition scheme, which mitigates performance degradation with minimal migration volume. We implement a prototype system of HyperPart, and the experimental results show that, under the balance constraints, it saves around 50% of the storage space and improves retrieval throughput by approximately 30% compared with state-of-the-art methods.
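The hypergraph view described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so the following is an assumption rather than the authors' method: each unique block is a vertex weighted by its size, each file is a hyperedge over its blocks, and a block-to-server assignment is scored by the standard connectivity metric from hypergraph partitioning (servers touched per file, minus one). All function and variable names here are hypothetical.

```python
from collections import defaultdict


def build_hyperedges(files, block_sizes):
    """Treat each file as a hyperedge over its unique blocks (assumed model)."""
    for name, blocks in files.items():
        missing = set(blocks) - set(block_sizes)
        if missing:
            raise ValueError(f"{name} references unknown blocks: {sorted(missing)}")
    return {name: frozenset(blocks) for name, blocks in files.items()}


def servers_touched(hyperedges, assignment):
    """Number of distinct servers that must be contacted to read each file."""
    return {
        name: len({assignment[b] for b in blocks})
        for name, blocks in hyperedges.items()
    }


def connectivity_cost(hyperedges, assignment):
    """Sum over files of (servers touched - 1); 0 means no cross-server reads."""
    return sum(n - 1 for n in servers_touched(hyperedges, assignment).values())


def server_loads(block_sizes, assignment):
    """Total stored block size per server, used to check balance."""
    loads = defaultdict(int)
    for block, server in assignment.items():
        loads[server] += block_sizes[block]
    return dict(loads)


# Toy data: three files sharing one deduplicated block (b2); sizes in KB.
files = {"a.txt": {"b1", "b2"}, "b.txt": {"b2", "b3"}, "c.txt": {"b4"}}
block_sizes = {"b1": 4, "b2": 4, "b3": 4, "b4": 4}
edges = build_hyperedges(files, block_sizes)

# Two candidate placements of the four unique blocks on two servers.
colocated = {"b1": 0, "b2": 0, "b3": 0, "b4": 1}
scattered = {"b1": 0, "b2": 1, "b3": 0, "b4": 1}

for label, plan in (("colocated", colocated), ("scattered", scattered)):
    print(label, connectivity_cost(edges, plan), server_loads(block_sizes, plan))
```

In this toy case the colocated plan needs no cross-server reads but loads the servers unevenly, while the scattered plan is balanced but forces two files to span both servers. A hypergraph partitioner searches for assignments that keep the connectivity cost low while respecting a balance constraint.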


Source journal

IEEE Transactions on Cloud Computing Computer Science-Software

CiteScore: 9.40

Self-citation rate: 6.20%

Number of published articles: 167

About the journal: The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.