The Evolution of the Hadoop Distributed File System

2018 32nd International Conference on Advanced Information Networking and Applications Workshops (WAINA) Pub Date : 2018-05-16 DOI:10.1109/WAINA.2018.00065

Stathis Maneas, Bianca Schroeder


Citations: 7

Abstract

Frameworks for large-scale distributed data processing, such as the Hadoop ecosystem, are at the core of the big data revolution we have experienced over the last decade. In this paper, we conduct an extensive study of the code evolution of the Hadoop Distributed File System (HDFS). Our study is based on the reports and patch files (patches) available from the official Apache issue tracker (JIRA), and our goal was to make complete use of the entire history of HDFS at the time and the richness of the available data. The purpose of our study is to assist developers in improving the design of similar systems and implementing more solid systems in general. In contrast to prior work, our study covers all reports that have been submitted over HDFS's lifetime, rather than a sampled subset. Additionally, we include all associated patch files that have been verified by the developers of the system, and we classify the root causes of issues at a finer granularity than prior work by manually inspecting all 3302 reports over the first nine years, based on a two-level classification scheme that we developed. This allows us to present a different perspective of HDFS, including a focus on the system's evolution over time, as well as a detailed analysis of characteristics that have not been previously studied in detail. These include, for example, the scope and complexity of issues in terms of the size of the patch that fixes each issue and the number of files it affects, the time it takes before an issue is exposed, the time it takes to resolve an issue, and how these vary over time. Our results indicate that bug reports constitute the dominant report type, having a continuously increasing rate over time. Moreover, the overall scope and complexity of reports and patch files remain surprisingly stable throughout HDFS's lifetime, despite the significant growth the code base experiences over time. Finally, as part of our work, we created a detailed database that includes all reports and patches, along with the key characteristics we extracted.
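
As an illustration only, the two scope measures named in the abstract (patch size and number of files affected) could be computed from a unified-diff patch file roughly as follows. This is a minimal sketch, not the authors' tooling: the names `PatchStats`, `patch_stats`, and `SAMPLE_PATCH` are hypothetical, and defining "size" as added plus removed lines is an assumption, since the abstract does not say how patch size is measured.

```python
import re
from dataclasses import dataclass

# Matches a unified-diff hunk header such as "@@ -1,3 +1,4 @@".
# A missing line count (e.g. "@@ -5 +5 @@") means a count of 1.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


@dataclass
class PatchStats:
    files_changed: int
    lines_added: int
    lines_removed: int

    @property
    def size(self) -> int:
        # Assumed definition of patch size: added plus removed lines.
        return self.lines_added + self.lines_removed


def patch_stats(patch_text: str) -> PatchStats:
    """Count files touched and lines added/removed in a unified diff."""
    files = set()
    added = removed = 0
    old_left = new_left = 0  # lines still expected in the current hunk
    last_old_path = None
    for line in patch_text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk: classify the line by its first character.
            if line.startswith("+"):
                added += 1
                new_left -= 1
            elif line.startswith("-"):
                removed += 1
                old_left -= 1
            elif line.startswith("\\"):
                pass  # "\ No newline at end of file" marker
            else:
                old_left -= 1  # context line counts on both sides
                new_left -= 1
            continue
        match = HUNK_RE.match(line)
        if match:
            old_left = int(match.group(1)) if match.group(1) is not None else 1
            new_left = int(match.group(2)) if match.group(2) is not None else 1
        elif line.startswith("--- "):
            last_old_path = line[4:].split("\t")[0].strip()
        elif line.startswith("+++ "):
            path = line[4:].split("\t")[0].strip()
            # A deleted file has "/dev/null" as its new path; use the old one.
            files.add(last_old_path if path == "/dev/null" else path)
    return PatchStats(len(files), added, removed)


# Hypothetical two-file patch, used only to exercise the function.
SAMPLE_PATCH = """\
--- a/src/Foo.java
+++ b/src/Foo.java
@@ -1,3 +1,4 @@
 class Foo {
-    int x = 0;
+    int x = 1;
+    int y = 2;
 }
--- a/src/Bar.java
+++ b/src/Bar.java
@@ -10,2 +10,1 @@
 void run() {
-    // unused
"""

if __name__ == "__main__":
    print(patch_stats(SAMPLE_PATCH))
```

Tracking the line counts from each hunk header, rather than just checking for a leading `+` or `-`, keeps file headers of a later file from being miscounted as changed lines.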

