Geographical partition for distributed web crawling

J. Exposto, J. Macedo, A. Pina, A. Alves, J. Rufino
Workshop on Geographic Information Retrieval, published 2005-11-04. DOI: 10.1145/1096985.1096999 (https://doi.org/10.1145/1096985.1096999)
Citation count: 37

Abstract

This paper evaluates scalable distributed crawling based on a geographical partition of the Web. The approach relies on multiple distributed crawlers, each responsible for the pages belonging to one or more previously identified geographical zones. Pages are assigned to crawlers according to the geographical scope of their content. For the initial assignment of a page to a partition, we use a simple heuristic that places the page in the same scope as the geographical location of its hosting web server. During download, if analysis of the page's contents suggests a different geographical scope, the page is forwarded to the appropriately located server. A sample of Portuguese Web pages, extracted during 2005, was used to evaluate (a) page download communication times and (b) the overhead of exchanging pages among servers. The evaluation results allow our approach to be compared with conventional hash partitioning strategies.
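The assignment scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the host-to-zone table, the keyword-based content analysis, the zone names, and all function names are hypothetical stand-ins, since the abstract does not say how zones are identified or how page content is analysed. The sketch only shows the control flow: an initial zone taken from the hosting server, a possible forward when the content points elsewhere, and a host-hash baseline for comparison.

```python
import hashlib
from collections import Counter

# Hypothetical lookup tables (assumptions, not from the paper).
# Place names are written without diacritics to keep the example ASCII-only.
HOST_ZONE = {"www.ipb.pt": "Braganca", "www.uminho.pt": "Braga"}
ZONE_TERMS = {
    "Braganca": {"braganca", "mirandela"},
    "Braga": {"braga", "guimaraes"},
}
DEFAULT_ZONE = "Braga"  # assumed fallback for hosts with unknown location


def initial_zone(host):
    """Initial heuristic: a page inherits the zone of its hosting server."""
    return HOST_ZONE.get(host, DEFAULT_ZONE)


def content_zone(text):
    """Toy content analysis: the zone whose place names occur most often.

    Returns None when no known place name appears in the text.
    """
    words = text.lower().split()
    counts = Counter()
    for zone, terms in ZONE_TERMS.items():
        counts[zone] = sum(1 for word in words if word in terms)
    zone, hits = counts.most_common(1)[0]
    return zone if hits > 0 else None


def assign(host, text):
    """Return (zone, forwarded).

    forwarded is True when the content scope differs from the host zone,
    i.e. the page would be handed over to another crawler.
    """
    start = initial_zone(host)
    scope = content_zone(text)
    if scope is not None and scope != start:
        return scope, True
    return start, False


def hash_partition(host, n_crawlers):
    """Baseline: conventional partitioning by a hash of the host name."""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_crawlers
```

In this sketch, a page hosted in one zone whose text mainly mentions places in another zone is reported as forwarded. Counting such forwards over a crawl is one plausible way to approximate the page-exchange overhead that the evaluation measures.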