Parallelization of Vertical Search Engine using Hadoop and MapReduce

Proceedings of the International Conference on Advances in Information Communication Technology & Computing. Pub Date: 2016-08-12. DOI: 10.1145/2979779.2979830

Rajat Pasari, Vaibhav Chaudhari, Atharva Borkar, Amit D. Joshi


Citations: 3

Abstract

In this paper, we build a parallelized vertical search engine on an Apache Hadoop cluster. The domain of our vertical search engine is computer-related terminology, and it takes seed URLs for the computer domain extracted from Wikipedia. These web pages are then crawled and parsed with the Apache Nutch crawler and stored in Apache HBase. Linguistic processing, such as stop-word removal and stemming, is performed on the content of the web pages stored in the database. An inverted index is constructed, and results are ranked according to the ranking algorithm described later in the paper. This paper mainly focuses on obtaining the most relevant results in the shortest possible time through distributed processing.
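The indexing steps named in the abstract (stop-word removal, stemming, inverted index construction) can be illustrated with a small single-process sketch in the map/shuffle/reduce style. This is not the authors' implementation: the stop-word list, the naive `stem` function (a stand-in for a real stemmer such as Porter's), the function names, and the sample pages are all assumptions made for illustration, and the paper's ranking algorithm is not reproduced here.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the paper does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to", "on"}


def stem(token):
    """Naive suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def map_page(url, text):
    """Map phase: emit (stemmed term, url) for every non-stop-word token."""
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token not in STOP_WORDS:
            yield stem(token), url


def reduce_term(term, urls):
    """Reduce phase: collapse a term's URLs into {url: term frequency}."""
    counts = defaultdict(int)
    for url in urls:
        counts[url] += 1
    return term, dict(counts)


def build_inverted_index(pages):
    """Run map, shuffle (group by term), and reduce in a single process."""
    grouped = defaultdict(list)
    for url, text in pages.items():
        for term, page_url in map_page(url, text):
            grouped[term].append(page_url)
    return dict(reduce_term(term, urls) for term, urls in grouped.items())


# Hypothetical sample pages standing in for crawled content.
pages = {
    "wiki/Compiler": "A compiler translates the source code",
    "wiki/Interpreter": "An interpreter executes source code; compilers differ",
}
index = build_inverted_index(pages)
```

On a real Hadoop cluster the map and reduce functions would run as separate tasks over page content read from storage, with the framework performing the grouping step; here a dictionary plays that role so the data flow stays visible.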


Source journal

Proceedings of the International Conference on Advances in Information Communication Technology & Computing

Self-citation rate

0.00%

Number of publications