Towards a Better Replica Management for Hadoop Distributed File System

Hilmi Egemen Ciritoglu, Takfarinas Saber, Teodora Sandra Buda, John Murphy, Christina Thorpe
2018 IEEE International Congress on Big Data (BigData Congress), published 2018-07-01
DOI: 10.1109/BigDataCongress.2018.00021 (https://doi.org/10.1109/BigDataCongress.2018.00021)
Citation count: 13

Abstract

The Hadoop Distributed File System (HDFS) is the storage of choice when it comes to large-scale distributed systems. In addition to being efficient and scalable, HDFS provides high throughput and reliability through the replication of data. Recent work exploits this replication feature by dynamically varying the replication factor of in-demand data as a means of increasing data locality and achieving a performance improvement. However, to the best of our knowledge, no study has been performed on the consequences of varying the replication factor. In particular, our work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it. This leads to unbalanced data, hot spots, and performance degradation. In order to address this problem, we propose a new workload-aware balanced replica deletion algorithm. We also show that our algorithm successfully maintains the data balance and achieves up to 48% improvement in execution time when compared to HDFS, while only creating an overhead of 1.69% on average.
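
To make the balance problem concrete, the following is a minimal Python sketch of one way replicas could be chosen for deletion when the replication factor is lowered. It is not the authors' algorithm, whose details the abstract does not give. It assumes a toy model in which each block maps to the list of DataNodes holding a replica, and it greedily removes the surplus replica from whichever holder currently stores the most replicas. The function name `balanced_replica_deletion`, the block and node names, and the tie-breaking rule are all hypothetical, and workload awareness is not modelled here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def balanced_replica_deletion(
    block_locations: Dict[str, List[str]],
    target_replication: int,
) -> Tuple[Dict[str, List[str]], List[Tuple[str, str]]]:
    """Greedy sketch (an assumption, not the paper's method).

    For every block with more than `target_replication` replicas, delete
    replicas one at a time from the holder that currently stores the most
    replicas overall, so that storage stays as even as the greedy choice allows.
    """
    if target_replication < 1:
        raise ValueError("target_replication must be at least 1")

    # Current number of replicas stored on each node.
    load = Counter(node for nodes in block_locations.values() for node in nodes)

    # Work on a copy so the caller's mapping is left untouched.
    remaining = {block: list(nodes) for block, nodes in block_locations.items()}
    deletions: List[Tuple[str, str]] = []

    for block in sorted(remaining):
        nodes = remaining[block]
        while len(nodes) > target_replication:
            # Most loaded holder of this block; ties broken by node name.
            victim = max(nodes, key=lambda n: (load[n], n))
            nodes.remove(victim)
            load[victim] -= 1
            deletions.append((block, victim))

    return remaining, deletions


if __name__ == "__main__":
    # Hypothetical four-node cluster with three blocks at replication factor 3.
    cluster = {
        "b1": ["dn1", "dn2", "dn3"],
        "b2": ["dn1", "dn2", "dn4"],
        "b3": ["dn1", "dn3", "dn4"],
    }
    placement, removed = balanced_replica_deletion(cluster, target_replication=2)
    for block, nodes in placement.items():
        print(block, nodes)
    print("deleted:", removed)
```

The sketch considers the global load at each step rather than each block in isolation; choosing victims per block without that view is one way node utilisation can drift apart after a decrease in the replication factor.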