The Evolution of the Hadoop Distributed File System
Stathis Maneas, Bianca Schroeder
2018 32nd International Conference on Advanced Information Networking and Applications Workshops (WAINA), published 2018-05-16
DOI: 10.1109/WAINA.2018.00065 (https://doi.org/10.1109/WAINA.2018.00065)
Citation count: 7
Abstract
Frameworks for large-scale distributed data processing, such as the Hadoop ecosystem, are at the core of the big data revolution we have experienced over the last decade. In this paper, we conduct an extensive study of the code evolution of the Hadoop Distributed File System (HDFS). Our study is based on the reports and patch files (patches) available from the official Apache issue tracker (JIRA), and our goal was to make complete use of the entire history of HDFS available at the time and of the richness of the available data. The purpose of our study is to assist developers in improving the design of similar systems and in implementing more solid systems in general. In contrast to prior work, our study covers all reports that have been submitted over HDFS's lifetime, rather than a sampled subset. Additionally, we include all associated patch files that have been verified by the developers of the system, and we classify the root causes of issues at a finer granularity than prior work by manually inspecting all 3302 reports submitted over the first nine years, using a two-level classification scheme that we developed. This allows us to present a different perspective on HDFS, including a focus on the system's evolution over time and an analysis of characteristics that have not previously been studied in detail. These include, for example, the scope and complexity of an issue, measured by the size of the patch that fixes it and the number of files that patch affects, the time it takes before an issue is exposed, the time it takes to resolve an issue, and how these vary over time. Our results indicate that bug reports are the dominant report type and that their rate increases continuously over time. Moreover, the overall scope and complexity of reports and patch files remain surprisingly stable throughout HDFS's lifetime, despite the significant growth of the code base over time. Finally, as part of our work, we created a detailed database that includes all reports and patches, along with the key characteristics we extracted.