Design and Implementation of a Search Engine Based on Apache Spark

The Journal of the Korean Institute of Information and Communication Engineering | Pub Date: 2017-01-31 | DOI: 10.6109/jkiice.2017.21.1.17

Ki-Sung Park, Jaehyun Choi, Jong-Bae Kim, Jae-won Park

Citations: 1

Abstract

Research on data has become increasingly active as the value of data has grown. Web crawlers, programs that collect data, have recently attracted attention because they can be applied in many fields. A web crawler can be defined as a tool that traverses web servers in an automated manner, analyzes web pages, and collects URLs. For big-data processing, distributed web crawlers based on Hadoop MapReduce are widely used, but they are difficult to use and constrained in performance. Apache Spark, an in-memory computing platform, is an alternative to MapReduce. A search engine, one of the main applications of a web crawler, displays the information gathered by the crawler in response to keyword searches. If a search engine uses a Spark-based web crawler instead of a traditional MapReduce-based one, data collection is expected to be faster.

Key words: Search Engine, Crawler, Nutch, Spark, Solr

Received 10 October 2016, Revised 17 October 2016, Accepted 28 October 2016

Corresponding Author: Jae-Won Park (E-mail: jwpark@ssu.ac.kr, Tel: +82-2-828-7014), Graduate School of Software, Soongsil University, Seoul 06978, Korea

Open Access: http://doi.org/10.6109/jkiice.2017.21.1.17 (print ISSN: 2234-4772, online ISSN: 2288-4165)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Copyright © The Korea Institute of Information and Communication Engineering.

Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지; J. Korea Inst. Inf. Commun. Eng.), Vol. 21, No. 1: 17-28, Jan. 2017
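
The abstract's definition of a web crawler (traverse pages automatically, analyze them, collect URLs) can be illustrated with a minimal single-process sketch. This is not the authors' implementation and does not use Spark, Nutch, or Solr. The names `LinkExtractor`, `extract_links`, `crawl`, and the in-memory `FAKE_WEB` dictionary are hypothetical, chosen only for illustration, and the `fetch` callable stands in for a real HTTP client.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Analyze one page and return its outgoing links as absolute URLs."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(seed, fetch, max_pages=100):
    """Breadth-first traversal from `seed`; `fetch` maps a URL to HTML or None."""
    seen = {seed}
    frontier = deque([seed])
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved; skip it
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

if __name__ == "__main__":
    print(crawl("http://example.test/", FAKE_WEB.get))
```

In a distributed crawler, the per-URL fetch-and-extract step is the natural unit to run in parallel across workers; how the paper partitions that work on Spark is not stated in the abstract.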

Source journal

The Journal of the Korean Institute of Information and Communication Engineering

Self-citation rate

0.00%

Publication count