A dynamic URL assignment method for parallel web crawler

A. Guerriero, F. Ragni, Claudio Martines
{"title":"A dynamic URL assignment method for parallel web crawler","authors":"A. Guerriero, F. Ragni, Claudio Martines","doi":"10.1109/CIMSA.2010.5611764","DOIUrl":null,"url":null,"abstract":"A web crawler is a relatively simple automated program or script that methodically scans or “crawls” through Internet pages to retrieval information from data. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer. There are many different uses for a web crawler. Their primary purpose is to collect data so that when Internet surfers enter a search term on their site, they can quickly provide the surfer with relevant web sites. In this work we propose the model of a low cost web crawler for distributed environments based on an efficient URL assignment algorithm. The function of every module of the crawler is analyzed and main rules that crawlers must follow to maintain load balancing and robustness of system when they are searching on the web simultaneously, are discussed. The proposed a dynamic URL assignment method, based on grid computing technology and dynamic clustering, results efficient increasing web crawler performance.","PeriodicalId":162890,"journal":{"name":"2010 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIMSA.2010.5611764","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 25

Abstract

A web crawler is a relatively simple automated program or script that methodically scans or “crawls” through Internet pages to retrieve information. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer. Web crawlers have many different uses; their primary purpose is to collect data so that when Internet surfers enter a search term on a site, it can quickly return relevant web pages. In this work we propose the model of a low-cost web crawler for distributed environments based on an efficient URL assignment algorithm. The function of every module of the crawler is analyzed, and the main rules that crawlers must follow to maintain load balancing and system robustness when searching the web simultaneously are discussed. The proposed dynamic URL assignment method, based on grid computing technology and dynamic clustering, efficiently increases web crawler performance.
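The paper itself does not include code. As a rough illustration of what URL assignment in a parallel crawler involves, the sketch below shows one common scheme: hash each URL's hostname so a single node owns each site, with a load-aware fallback standing in for the dynamic rebalancing the abstract mentions. This is a minimal sketch under assumed names (`URLAssigner`, `node-0`, `max_queue` are all hypothetical), not the authors' algorithm.

```python
from urllib.parse import urlparse
import hashlib


class URLAssigner:
    """Illustrative dynamic URL assignment for a parallel crawler.

    URLs are partitioned by host hash so one node owns each site
    (keeping per-site politeness state local), and the least-loaded
    node takes over when the owner's queue grows too long. Generic
    sketch only -- not the method proposed in the paper.
    """

    def __init__(self, node_ids, max_queue=1000):
        self.queues = {node: [] for node in node_ids}
        self.max_queue = max_queue

    def _owner(self, url):
        # Hash the hostname so all URLs from one site map to the
        # same node by default.
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode()).hexdigest()
        nodes = sorted(self.queues)
        return nodes[int(digest, 16) % len(nodes)]

    def assign(self, url):
        node = self._owner(url)
        # "Dynamic" part of the sketch: if the owner is overloaded,
        # fall back to the least-loaded node instead of letting one
        # hot partition stall the crawl.
        if len(self.queues[node]) >= self.max_queue:
            node = min(self.queues, key=lambda n: len(self.queues[n]))
        self.queues[node].append(url)
        return node


# Example usage with three hypothetical crawler nodes:
assigner = URLAssigner(["node-0", "node-1", "node-2"])
print(assigner.assign("http://example.com/page1"))
```

Static host hashing alone cannot keep load balanced when a few large sites dominate the frontier; the overflow fallback above is one simple stand-in for the grid-computing and dynamic-clustering approach the paper proposes for that problem.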