The Research of a Lightweight Distributed Crawling System

Feng Ye, Zongfei Jing, Qian Huang, Yong Chen
2018 IEEE 16th International Conference on Software Engineering Research, Management and Applications (SERA)
Publication date: 2018-06-01
DOI: 10.1109/SERA.2018.8477212 (https://doi.org/10.1109/SERA.2018.8477212)
Citations: 6

Abstract

Information on the Internet is growing explosively, and stand-alone web crawlers have hit a performance bottleneck, so more and more companies are turning to distributed web crawling. Existing distributed crawling systems, however, have shortcomings. Their thread management modules, which handle thread synchronization and resource contention, are usually built on purely multithreaded asynchronous methods, and running these modules noticeably reduces performance. Their deduplication algorithms are either inefficient on large data sets or occupy a large amount of storage. To address these problems, this paper proposes a lightweight, practical distributed crawling system that combines Docker with distributed computing techniques. The system makes full use of the cluster's computing resources and effectively improves crawling efficiency. Experiments on NetEase news pages show that the proposed distributed crawler achieves higher execution efficiency.
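The abstract does not say which deduplication algorithm the proposed system uses. As a hedged illustration of the efficiency-versus-storage trade-off it describes, the sketch below shows one common space-saving option, a Bloom filter over crawled URLs. The names `BloomFilter` and `should_crawl`, the sizing parameters, and the choice of SHA-256 are all assumptions made for this example and are not taken from the paper.

```python
import hashlib
import math


class BloomFilter:
    """Probabilistic set: no false negatives, a tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is never zero
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def should_crawl(url: str, seen: BloomFilter) -> bool:
    """Return True the first time a URL is offered, False for probable repeats."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

With these assumed parameters, a filter sized for 1,000 URLs at a 1% false-positive rate needs roughly 1.2 KB, whereas an exact set would store every URL string in full. The cost is that a small fraction of unseen URLs may be skipped as false positives.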