Whole-File Chunk-Based Deduplication Using Reinforcement Learning for Cloud Storage

Xincheng Yuan, M. Moh, Teng-Sheng Moh
2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), published 2022-11-10
DOI: 10.1109/ASONAM55673.2022.10068661 (https://doi.org/10.1109/ASONAM55673.2022.10068661)

Abstract

Deduplication is the process of removing replicated data content from storage facilities such as online databases, cloud datastores, and local file systems. It is commonly performed as part of data preprocessing to eliminate redundant data that would otherwise require extra storage space and computing power, and it is crucial for data storage management in cloud computing. Deduplication is essential for file backup systems, since duplicated files consume additional storage space, especially when the backup period is short, such as daily. A common technique in this field splits files into chunks whose hashes can be compared using data structures or techniques like clustering. This paper explores the possibility of performing such file chunk deduplication with an innovative reinforcement learning approach to achieve a high deduplication ratio. The proposed system, named SegDup, achieves a 13% higher deduplication ratio than Extreme Binning, a state-of-the-art deduplication algorithm.
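
The chunk-and-hash technique mentioned in the abstract can be illustrated with a minimal sketch. This is not SegDup and involves no reinforcement learning; it only shows the baseline idea of splitting files into chunks, hashing each chunk, and storing each distinct chunk once. The function names, the fixed chunk size, the choice of SHA-256, and the definition of the deduplication ratio as logical bytes divided by stored bytes are all assumptions made for illustration, since the abstract does not specify them.

```python
import hashlib


def split_fixed(data: bytes, chunk_size: int = 4) -> list[bytes]:
    """Split data into fixed-size chunks (hypothetical helper; the last chunk may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def dedup_ratio(files: list[bytes], chunk_size: int = 4) -> float:
    """Return logical bytes / stored bytes after chunk-level deduplication.

    Assumption: the ratio is defined this way for illustration only;
    the abstract does not give its own definition.
    """
    store: dict[str, bytes] = {}  # chunk hash -> chunk content, one entry per unique chunk
    logical = 0                   # total bytes before deduplication
    for data in files:
        for chunk in split_fixed(data, chunk_size):
            logical += len(chunk)
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy of each chunk
    physical = sum(len(c) for c in store.values())
    return logical / physical if physical else 1.0


# Two "backups" that share their first two chunks.
files = [b"AAAABBBBCCCC", b"AAAABBBBDDDD"]
print(dedup_ratio(files))  # 24 logical bytes / 16 stored bytes = 1.5
```

Fixed-size chunking is used here only for brevity; the tiny chunk size of 4 bytes is chosen so the shared chunks are easy to see in the example input.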