Rajat Pasari, Vaibhav Chaudhari, Atharva Borkar, Amit D. Joshi
Proceedings of the International Conference on Advances in Information Communication Technology & Computing, published 2016-08-12. DOI: https://doi.org/10.1145/2979779.2979830
Parallelization of Vertical Search Engine using Hadoop and MapReduce
In this paper, we build a parallelized vertical search engine on an Apache Hadoop cluster. The domain of our vertical search engine is computer-related terminology, and it takes seed URLs for the computer domain extracted from Wikipedia. These web pages are then crawled and parsed with the Apache Nutch crawler and stored in Apache HBase. Linguistic processing, such as stop-word removal and stemming, is performed on the content of the web pages stored in the database. An inverted index is constructed, and results are ranked according to the ranking algorithm described later in the paper. This paper mainly focuses on obtaining the most relevant results in the shortest possible time through distributed processing.
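The stop-word removal, stemming, and inverted-index steps mentioned in the abstract can be illustrated with a minimal single-process sketch in the map/reduce style. This is not the authors' implementation: the stop-word list, the `naive_stem` suffix stripper (a crude stand-in for a real stemmer), the function names, and the sample pages are all assumptions made for illustration. The paper's ranking algorithm is not reproduced here, and on a real cluster the map and reduce phases would run as Hadoop jobs over pages stored in HBase.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "to", "for"}


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def map_page(url, text):
    """Map phase: emit one (stemmed_term, url) pair per non-stop-word token."""
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token not in STOP_WORDS:
            yield naive_stem(token), url


def reduce_postings(pairs):
    """Reduce phase: group pairs by term and count occurrences per URL."""
    index = defaultdict(dict)
    for term, url in pairs:
        index[term][url] = index[term].get(url, 0) + 1
    return dict(index)


def build_index(pages):
    """Run the map phase over every page, then reduce into an inverted index."""
    pairs = (pair for url, text in pages.items() for pair in map_page(url, text))
    return reduce_postings(pairs)


# Made-up sample pages standing in for crawled content.
pages = {
    "wiki/Compiler": "A compiler translates the source code of programs",
    "wiki/Interpreter": "An interpreter executes programs without compiling",
}
index = build_index(pages)
print(index["program"])
```

Each entry of `index` maps a stemmed term to the pages containing it, together with a per-page term count that a ranking function could later consume.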