Parallelization of Vertical Search Engine using Hadoop and MapReduce
Rajat Pasari, Vaibhav Chaudhari, Atharva Borkar, Amit D. Joshi
Proceedings of the International Conference on Advances in Information Communication Technology & Computing, 2016. DOI: 10.1145/2979779.2979830
Abstract
In this paper, we build a parallelized Vertical Search Engine on an Apache Hadoop cluster. The domain of our Vertical Search Engine is computer-related terminology, and it takes seed URLs for the computer domain extracted from Wikipedia. These web pages are then crawled and parsed with the Apache Nutch crawler and stored in Apache HBase. Linguistic processing, such as stop-word removal and stemming, is performed on the content of the web pages stored in the database. An inverted index is constructed, and the results are ranked according to the ranking algorithm described later in the paper. This paper mainly focuses on obtaining the most relevant results in the shortest possible time through distributed processing.
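The abstract outlines a crawl-process-index-rank pipeline. As a rough illustration of the inverted-index step on Hadoop MapReduce (a minimal sketch, not the authors' implementation), the code below assumes input lines of the hypothetical form "<docId>\t<text>"; the tiny stop-word list and the omission of stemming are simplifications for brevity.

```java
// Minimal sketch of inverted-index construction with Hadoop MapReduce.
// Assumption: each input line is "<docId>\t<page text>" exported from the page store.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    public static class TokenMapper extends Mapper<Object, Text, Text, Text> {
        // Illustrative stop-word list; a real run would load a full list,
        // and stemming would be applied to each surviving token.
        private static final Set<String> STOP_WORDS = new HashSet<>();
        static {
            STOP_WORDS.add("the");
            STOP_WORDS.add("a");
            STOP_WORDS.add("and");
        }

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;          // skip malformed lines
            String docId = parts[0];
            for (String token : parts[1].toLowerCase().split("\\W+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
                // Emit (term, docId) pairs; the shuffle groups them by term.
                context.write(new Text(token), new Text(docId));
            }
        }
    }

    public static class PostingsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            // Build the posting list (deduplicated document ids) for this term.
            StringBuilder postings = new StringBuilder();
            Set<String> seen = new HashSet<>();
            for (Text id : docIds) {
                if (seen.add(id.toString())) {
                    if (postings.length() > 0) postings.append(",");
                    postings.append(id);
                }
            }
            context.write(term, new Text(postings.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inverted index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(PostingsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Submitted with "hadoop jar" over the exported pages, this job would emit one posting list per term; a ranking stage such as the one described in the paper would then consume these lists to score and order the results.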