The Evolution of the Hadoop Distributed File System

Stathis Maneas, Bianca Schroeder
Published in: 2018 32nd International Conference on Advanced Information Networking and Applications Workshops (WAINA), 2018-05-16
DOI: 10.1109/WAINA.2018.00065 (https://doi.org/10.1109/WAINA.2018.00065)
Citations: 7

Abstract

Frameworks for large-scale distributed data processing, such as the Hadoop ecosystem, are at the core of the big data revolution of the last decade. In this paper, we conduct an extensive study of the code evolution of the Hadoop Distributed File System (HDFS). Our study is based on the reports and patch files (patches) available from the official Apache issue tracker (JIRA), and our goal was to make complete use of the entire history of HDFS available at the time and of the richness of the available data. The purpose of our study is to assist developers in improving the design of similar systems and in implementing more robust systems in general. In contrast to prior work, our study covers all reports submitted over HDFS's lifetime, rather than a sampled subset. Additionally, we include all associated patch files that have been verified by the developers of the system, and we classify the root causes of issues at a finer granularity than prior work by manually inspecting all 3302 reports from the first nine years, using a two-level classification scheme that we developed. This allows us to present a different perspective on HDFS, including a focus on the system's evolution over time, as well as an analysis of characteristics that have not previously been studied in detail. These include, for example, the scope and complexity of an issue, measured by the size of the patch that fixes it and the number of files it affects; the time before an issue is exposed; the time it takes to resolve an issue; and how these vary over time. Our results indicate that bug reports are the dominant report type and that their rate increases continuously over time. Moreover, the overall scope and complexity of reports and patch files remain surprisingly stable throughout HDFS's lifetime, despite the significant growth of the code base. Finally, as part of our work, we created a detailed database that includes all reports and patches, along with the key characteristics we extracted.
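The abstract measures the scope and complexity of an issue by the size of its patch and the number of files it affects. As a rough illustration only, and not the authors' tooling, the following Python sketch derives those two quantities from a unified diff. The names `PatchStats` and `patch_stats` are hypothetical, and the sketch assumes the patch is in standard unified diff format with `@@ -a,b +c,d @@` hunk headers.

```python
import re
from dataclasses import dataclass

# Hunk header of a unified diff, e.g. "@@ -10,2 +10,1 @@".
# A missing line count defaults to 1.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


@dataclass
class PatchStats:  # hypothetical name, for illustration only
    files_changed: int
    lines_added: int
    lines_removed: int


def patch_stats(patch_text: str) -> PatchStats:
    """Count files touched and lines added/removed in a unified diff."""
    files = added = removed = 0
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in patch_text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk: the first character decides the line type.
            if line.startswith("+"):
                added += 1
                new_left -= 1
            elif line.startswith("-"):
                removed += 1
                old_left -= 1
            elif line.startswith("\\"):
                pass  # "\ No newline at end of file" marker
            else:
                old_left -= 1  # context line counts on both sides
                new_left -= 1
            continue
        match = HUNK_RE.match(line)
        if match:
            old_left = int(match.group(1)) if match.group(1) is not None else 1
            new_left = int(match.group(2)) if match.group(2) is not None else 1
        elif line.startswith("+++ "):
            files += 1  # one "+++" header per file in the patch
    return PatchStats(files, added, removed)


SAMPLE = """\
--- a/src/Foo.java
+++ b/src/Foo.java
@@ -1,3 +1,4 @@
 class Foo {
-  int x = 0;
+  int x = 1;
+  int y = 2;
 }
--- a/src/Bar.java
+++ b/src/Bar.java
@@ -10,2 +10,1 @@
 void run() {
--- stale marker
"""

stats = patch_stats(SAMPLE)
print(stats)
```

Tracking the line counts from each hunk header, rather than matching on the `---` and `+++` prefixes alone, keeps a removed line whose content begins with `-- ` (the last line of the sample) from being mistaken for a file header.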