Efficient information retrieval using Lucene, LIndex and HIndex in Hadoop
Anita Brigit Mathew, P. Pattnaik, S. D. M. Kumar
2014 IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA), published 2014-11-01
DOI: 10.1109/AICCSA.2014.7073217 (https://doi.org/10.1109/AICCSA.2014.7073217)
Citations: 15
Abstract
The growth of unstructured and partially structured data in biological networks, social media, geographical information and other web-based applications presents an open challenge to the cloud database community. Hence, an approach to exhaustive BigData analysis that integrates structured and unstructured data processing has become increasingly critical. MapReduce has recently emerged as a popular framework for extensive data analytics. Powerful indexing techniques would allow users to significantly speed up query processing in MapReduce jobs. Currently, there are a number of indexing techniques, such as Hadoop++, HAIL, LIAH and Adaptive Indexing, but none of them provides an optimized technique for text-based selection operations. This paper proposes two indexing approaches in HDFS, namely LIndex and HIndex. These approaches are found to perform selection operations better than the existing Lucene index approach. A fast retrieval technique is suggested in the MapReduce framework using the new LIndex and HIndex approaches. LIndex provides a complete-text index and informs the Hadoop implementation engine to scan only those data blocks that contain the terms of interest. LIndex also improves throughput (reducing response time) and overcomes some drawbacks of index creation, such as its upfront cost and long idle time. LIndex performed better than Lucene but still fell short in response and computation time. Hence a new index named HIndex is suggested. This scheme is found to perform better than LIndex in response and computation time.
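The idea of scanning only the data blocks that contain the terms of interest can be illustrated with a minimal sketch. This is not the paper's LIndex or HIndex implementation, whose internals the abstract does not describe; it is a plain in-memory inverted index from terms to block ids. The names `build_block_index` and `blocks_to_scan`, the sample blocks, whitespace tokenization, and the choice to require every query term to be present are all assumptions made for illustration.

```python
from collections import defaultdict


def build_block_index(blocks):
    """Map each lowercased term to the set of block ids that contain it.

    `blocks` is a hypothetical dict of block id -> block text.
    """
    index = defaultdict(set)
    for block_id, text in blocks.items():
        for term in text.lower().split():
            index[term].add(block_id)
    return index


def blocks_to_scan(index, query_terms):
    """Return the ids of blocks that contain every query term.

    Assumption: a conjunctive (AND) selection. Blocks outside this set
    would not need to be read by the job at all.
    """
    postings = [index.get(term.lower(), set()) for term in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Illustrative data only; these are not blocks from the paper's experiments.
blocks = {
    0: "protein interaction network data",
    1: "social media posts about travel",
    2: "protein folding and social behaviour",
}
index = build_block_index(blocks)
candidates = blocks_to_scan(index, ["protein", "social"])
```

In this sketch a selection on both "protein" and "social" narrows the scan to a single block out of three, which is the kind of saving a text index is meant to provide to a MapReduce job.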