Investigation of Replication Factor for Performance Enhancement in the Hadoop Distributed File System

Hilmi Egemen Ciritoglu, Leandro Batista de Almeida, E. Almeida, Teodora Sandra Buda, John Murphy, Christina Thorpe
Published in: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering
Publication date: 2018-04-02
DOI: 10.1145/3185768.3186359 (https://doi.org/10.1145/3185768.3186359)
Citations: 8

Abstract

The massive growth in the volume of data and the demand for big data utilisation have led to an increasing prevalence of Hadoop Distributed File System (HDFS) solutions. However, the performance of Hadoop, and of HDFS in particular, has limitations and remains an open problem in the research community. The ultimate goal of our research is to develop an adaptive replication system. This paper presents the first phase of that work: an investigation into the replication factor used in HDFS, to determine whether increasing the replication factor for in-demand data can improve the performance of the system. We constructed a physical Hadoop cluster as our experimental environment and used TestDFSIO, together with a real-world data set (NOAA) and a synthetic data set (TPC-H) queried through Hive, to validate our proposal. Results show that increasing the replication factor of the "hot" data increases the availability and locality of the data and thus decreases the job execution time.
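The idea of raising the replication factor only for in-demand data can be illustrated with a minimal sketch. This is not the authors' system: the access counts, the threshold, the target factor, and the example paths are all hypothetical, and the sketch only renders `hdfs dfs -setrep` commands as strings rather than running them. It assumes the usual HDFS default replication factor of 3.

```python
from typing import Dict, List

# Assumption: the cluster uses the stock HDFS default for dfs.replication.
DEFAULT_REPLICATION = 3


def plan_replication(access_counts: Dict[str, int],
                     hot_threshold: int,
                     hot_replication: int) -> Dict[str, int]:
    """Map each path to a target replication factor.

    A path counts as "hot" when its access count reaches hot_threshold
    (a hypothetical criterion chosen for illustration only).
    """
    return {
        path: hot_replication if count >= hot_threshold else DEFAULT_REPLICATION
        for path, count in access_counts.items()
    }


def setrep_commands(plan: Dict[str, int]) -> List[str]:
    """Render one `hdfs dfs -setrep` command per path that departs from the default."""
    return [
        f"hdfs dfs -setrep -w {factor} {path}"  # -w waits until replication completes
        for path, factor in sorted(plan.items())
        if factor != DEFAULT_REPLICATION
    ]


if __name__ == "__main__":
    # Hypothetical access statistics for two data sets.
    counts = {"/data/noaa/2017": 120, "/data/tpch/lineitem": 4}
    for command in setrep_commands(plan_replication(counts, 50, 5)):
        print(command)
```

Keeping the planning step separate from command rendering means the selection rule for "hot" data can be swapped out without touching how the change is applied to the cluster.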