Balanced MultiFileInput Split (BaMS) technique to solve small file problem in Hadoop
L. Mohan, M. Elayidom
2016 11th International Conference on Industrial and Information Systems (ICIIS), 2016-12-01
DOI: https://doi.org/10.1109/ICIINFS.2016.8262923
Citations: 2
Abstract
The Hadoop Distributed File System (HDFS) is designed to process large amounts of data. However, processing a large number of small files is inefficient, since Hadoop supports only block-level operations. Small-file processing is nonetheless unavoidable in workloads such as real-time processing and log processing. To address this performance bottleneck, we propose a Balanced MultiFileInput Split (BaMS) technique in which files are merged together and stored. Data is converted to bytes and stored collectively in ArrayWritable format. To avoid the need for separate indexing, we follow a hierarchical file naming and storing scheme. The method also describes how to access the merged files through MapReduce programs. Analysis of BaMS shows that it is more efficient than existing methods such as HAR and sequence files in terms of storage and access efficiency.
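The merge-and-name idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the balancing is done, so the first-fit-decreasing packing, the `part-<n>/<i>` naming pattern, and the function names `pack_small_files`, `merge_files`, and `read_record` are all assumptions made for illustration. A plain Python list of `bytes` stands in for Hadoop's ArrayWritable.

```python
from typing import Dict, List, Tuple


def pack_small_files(sizes: Dict[str, int], capacity: int) -> List[List[str]]:
    """Group file names so each group totals at most `capacity` bytes.

    Assumption: first-fit decreasing is used here only as a simple
    balancing heuristic; the paper's actual strategy is not given.
    A file larger than `capacity` ends up alone in its own group.
    """
    groups: List[list] = []  # each entry is [used_bytes, [names]]
    ordered = sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        for group in groups:
            if group[0] + size <= capacity:
                group[0] += size
                group[1].append(name)
                break
        else:
            # No existing group has room, so open a new one.
            groups.append([size, [name]])
    return [names for _, names in groups]


def merge_files(
    files: Dict[str, bytes], capacity: int
) -> Tuple[Dict[str, List[bytes]], Dict[str, str]]:
    """Merge small files into parts and assign hierarchical names.

    Returns (store, names). `store` maps a part id to a list of byte
    records (a stand-in for ArrayWritable). `names` maps each original
    name to a hypothetical hierarchical name "part-<n>/<i>" that encodes
    the record's location, so no separate index file is consulted on read.
    """
    sizes = {name: len(data) for name, data in files.items()}
    store: Dict[str, List[bytes]] = {}
    names: Dict[str, str] = {}
    for g, group in enumerate(pack_small_files(sizes, capacity)):
        part = f"part-{g:05d}"
        store[part] = [files[name] for name in group]
        for i, name in enumerate(group):
            names[name] = f"{part}/{i}"
    return store, names


def read_record(store: Dict[str, List[bytes]], hier_name: str) -> bytes:
    """Resolve a hierarchical name directly to its byte record."""
    part, idx = hier_name.rsplit("/", 1)
    return store[part][int(idx)]
```

In this sketch, a reader holding only the hierarchical name can locate a record by parsing the name itself, which is one way to read the abstract's claim that separate indexing is avoided. The `names` mapping is returned only so the example can be checked against the original file names.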