Design and Implementation of a Search Engine based on Apache Spark

Ki-Sung Park, Jaehyun Choi, Jong-Bae Kim, Jae-won Park
Abstract: As the value of data has grown, research on data has been actively conducted. The web crawler, a program for collecting data, has recently drawn attention because it can be applied in various fields. A web crawler can be defined as a tool that traverses web servers in an automated manner, analyzes web pages, and collects URLs. For processing big data, distributed web crawlers based on Hadoop MapReduce are widely used; however, MapReduce is difficult to use and imposes performance constraints. Apache Spark, an in-memory computing platform, is an alternative to MapReduce. A search engine, one of the main applications of a web crawler, displays the information a user searches for by keyword from the data the crawler has gathered. If a search engine adopts a Spark-based web crawler instead of a traditional MapReduce-based one, it can collect data more rapidly.

Keywords: Search Engine, Crawler, Nutch, Spark, Solr

Received 10 October 2016, Revised 17 October 2016, Accepted 28 October 2016
* Corresponding Author: Jae-Won Park (E-mail: jwpark@ssu.ac.kr, Tel: +82-2-828-7014), Graduate School of Software, Soongsil University, Seoul 06978, Korea
Open Access: http://doi.org/10.6109/jkiice.2017.21.1.17 | print ISSN: 2234-4772 | online ISSN: 2288-4165
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright © The Korea Institute of Information and Communication Engineering.
Journal of the Korea Institute of Information and Communication Engineering (J. Korea Inst. Inf. Commun. Eng.) Vol. 21, No. 1: 17-28, Jan. 2017
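The parse step the abstract describes (analyzing a web page and collecting its URLs) can be sketched in a few lines of plain Python using only the standard library. This is only an illustration of the crawler's per-page work, not the paper's actual implementation, which runs Nutch on Spark; the page content and URLs below are made-up examples, and a real crawler would fetch pages over HTTP rather than use an inline string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href> tags, as a crawler's parse step would."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

# Hypothetical page content; in a distributed crawler, Spark would apply
# this parse step in parallel across a partitioned list of fetched pages.
html = '<a href="/about">About</a> <a href="http://example.org/news">News</a>'
collector = LinkCollector("http://example.com/")
collector.feed(html)
print(collector.links)
# → ['http://example.com/about', 'http://example.org/news']
```

In a MapReduce- or Spark-based crawler, this per-page extraction is the "map" side of each crawl round: the newly collected URLs are deduplicated against the already-visited set and become the fetch list for the next iteration, which is where Spark's in-memory caching avoids the per-round disk I/O that MapReduce incurs.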