The Research of a Lightweight Distributed Crawling System

2018 IEEE 16th International Conference on Software Engineering Research, Management and Applications (SERA). Pub Date: 2018-06-01. DOI: 10.1109/SERA.2018.8477212

Feng Ye, Zongfei Jing, Qian Huang, Yong Chen


Citations: 6

Abstract

Information on the Internet is growing explosively. Stand-alone web crawling systems have reached a performance bottleneck, so more and more companies are turning to distributed web crawling. Existing distributed crawlers, however, have shortcomings. Their thread management modules, which handle thread synchronization and resource contention, are usually built with purely multithreaded asynchronous methods, and running these modules noticeably degrades performance. In addition, their deduplication algorithms are either inefficient on large data sets or occupy a large amount of storage. To address these problems, this paper proposes a lightweight, practical distributed crawling system that combines Docker with distributed computing techniques. The system makes full use of the cluster's computing resources and improves crawling efficiency. Experiments on NetEase news pages show that the proposed distributed crawler achieves higher execution efficiency.
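The abstract does not say which deduplication algorithm the system uses. As an illustration only of the time-versus-space trade-off it mentions, the following is a minimal sketch of URL deduplication with a Bloom filter, a common compact alternative to storing every visited URL. The class name, the `dedupe` helper, the sizing defaults, and the example URLs are all assumptions made for this sketch, not details taken from the paper.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, tunable false positives."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        bits = -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        self.num_bits = max(8, math.ceil(bits))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url: str):
        # Double hashing: derive k bit positions from two slices of one SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))


def dedupe(urls, seen):
    """Yield only URLs not yet recorded in `seen`, recording each one as it passes."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the filter.
    queue = [
        "https://example.com/news/1",
        "https://example.com/news/2",
        "https://example.com/news/1",
    ]
    seen = BloomFilter(expected_items=1000)
    print(list(dedupe(queue, seen)))
    print(len(seen.bits), "bytes used by the filter")
```

Because `dedupe` relies only on `add` and `in`, passing a plain `set()` as `seen` gives exact deduplication at the cost of storing every URL, while the Bloom filter keeps memory fixed and accepts a small chance of skipping a URL it has not actually seen.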



