{"title":"提高Hadoop中小文件访问的性能","authors":"Chatuporn Vorapongkitipun, N. Nupairoj","doi":"10.1109/JCSSE.2014.6841867","DOIUrl":null,"url":null,"abstract":"The Hadoop Distributed File System (HDFS) is an open source system which is designed to run on commodity hardware and is suitable for applications that have large data sets (terabytes). As HDFS architecture bases on single master (NameNode) to handle metadata management for multiple slaves (Datanode), NameNode often becomes bottleneck, especially when handling large number of small files. To maximize efficiency, NameNode stores the entire metadata of HDFS in its main memory. With too many small files, NameNode can be running out of memory. In this paper, we propose a mechanism based on Hadoop Archive (HAR), called New Hadoop Archive (NHAR), to improve the memory utilization for metadata and enhance the efficiency of accessing small files in HDFS. In addition, we also extend HAR capabilities to allow additional files to be inserted into the existing archive files. Our experiment results show that our approach can to improve the access efficiencies of small files drastically as it outperforms HAR up to 85.47%.","PeriodicalId":331610,"journal":{"name":"2014 11th International Joint Conference on Computer Science and Software Engineering (JCSSE)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"45","resultStr":"{\"title\":\"Improving performance of small-file accessing in Hadoop\",\"authors\":\"Chatuporn Vorapongkitipun, N. Nupairoj\",\"doi\":\"10.1109/JCSSE.2014.6841867\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Hadoop Distributed File System (HDFS) is an open source system which is designed to run on commodity hardware and is suitable for applications that have large data sets (terabytes). As HDFS architecture bases on single master (NameNode) to handle metadata management for multiple slaves (Datanode), NameNode often becomes bottleneck, especially when handling large number of small files. To maximize efficiency, NameNode stores the entire metadata of HDFS in its main memory. With too many small files, NameNode can be running out of memory. In this paper, we propose a mechanism based on Hadoop Archive (HAR), called New Hadoop Archive (NHAR), to improve the memory utilization for metadata and enhance the efficiency of accessing small files in HDFS. In addition, we also extend HAR capabilities to allow additional files to be inserted into the existing archive files. Our experiment results show that our approach can to improve the access efficiencies of small files drastically as it outperforms HAR up to 85.47%.\",\"PeriodicalId\":331610,\"journal\":{\"name\":\"2014 11th International Joint Conference on Computer Science and Software Engineering (JCSSE)\",\"volume\":\"38 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-05-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"45\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 11th International Joint Conference on Computer Science and Software Engineering (JCSSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/JCSSE.2014.6841867\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 11th International Joint Conference on Computer Science and Software Engineering (JCSSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/JCSSE.2014.6841867","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Improving performance of small-file accessing in Hadoop
The Hadoop Distributed File System (HDFS) is an open source system which is designed to run on commodity hardware and is suitable for applications that have large data sets (terabytes). As HDFS architecture bases on single master (NameNode) to handle metadata management for multiple slaves (Datanode), NameNode often becomes bottleneck, especially when handling large number of small files. To maximize efficiency, NameNode stores the entire metadata of HDFS in its main memory. With too many small files, NameNode can be running out of memory. In this paper, we propose a mechanism based on Hadoop Archive (HAR), called New Hadoop Archive (NHAR), to improve the memory utilization for metadata and enhance the efficiency of accessing small files in HDFS. In addition, we also extend HAR capabilities to allow additional files to be inserted into the existing archive files. Our experiment results show that our approach can to improve the access efficiencies of small files drastically as it outperforms HAR up to 85.47%.