The Research of a Lightweight Distributed Crawling System
Authors: Feng Ye, Zongfei Jing, Qian Huang, Yong Chen
Published in: 2018 IEEE 16th International Conference on Software Engineering Research, Management and Applications (SERA), 2018-06-01
DOI: 10.1109/SERA.2018.8477212 (https://doi.org/10.1109/SERA.2018.8477212)
Citations: 6
Abstract
Information on the Internet is growing at an explosive rate. Stand-alone web crawling systems have reached a performance bottleneck, so more and more companies are turning to distributed web crawling techniques. However, existing distributed web crawling systems have shortcomings. Their thread management modules, which handle thread synchronization and resource contention, are usually built with purely multithreaded asynchronous methods, and executing such modules noticeably reduces performance. Moreover, their deduplication algorithms either become inefficient on large data sets or occupy a large amount of storage space. To address these problems, this paper proposes a lightweight and practical distributed crawling system that combines Docker with distributed computing techniques. The system makes full use of the cluster's computing resources and effectively improves crawling efficiency. In experiments that use Netease news pages as an example, the proposed distributed crawler achieves higher execution efficiency.
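
The storage-versus-speed trade-off in URL deduplication mentioned above can be illustrated with a small sketch. The abstract does not say which deduplication method the paper adopts, so the Bloom filter below is an assumed, commonly used technique and not the authors' algorithm. The names BloomFilter and filter_new_urls, the capacity and error_rate parameters, and the example.com URLs are all hypothetical.

```python
import hashlib
import math


class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate.

    Illustrative assumption only; not the deduplication method of the paper.
    """

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing formulas: m bits and k hash functions for n items.
        self.num_bits = max(
            8, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        )
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def filter_new_urls(urls, seen: BloomFilter):
    """Yield only URLs not yet recorded in `seen`, recording each one."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url


if __name__ == "__main__":
    seen = BloomFilter(capacity=1000)
    crawl_queue = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",  # duplicate, should be skipped
    ]
    print(list(filter_new_urls(crawl_queue, seen)))
    print(len(seen.bits), "bytes used for up to 1000 URLs")
```

Under these assumptions the filter stores a fixed-size bit array instead of the URLs themselves, which keeps memory small at the cost of occasionally treating a new URL as already seen. In a distributed setting the bit array would have to live in a store shared by all crawler nodes; that part is not shown here.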