Aiding web crawlers; projecting web page last modification

A. Anjum, Adnan Anjum
Published in: 2012 15th International Multitopic Conference (INMIC), 2012-12-01
DOI: 10.1109/INMIC.2012.6511443 (https://doi.org/10.1109/INMIC.2012.6511443)
Citations: 3

Abstract

Because of the colossal amount of data on the Web, Web archivists typically rely on Web crawlers for automated collection. The Internet Archive is the largest organization that uses a crawling approach to maintain an archive of the entire Web. The most important requirement of a Web crawler, especially when it is used for Web archiving, is to know the date (and time) of the last modification of a Web page. Knowing this date has several advantages, the most important being i) presenting an up-to-date version of a Web page to the end user, and ii) making it easier to adjust the crawl rate, which allows future retrieval of a Web page's version at a given date or computation of its refresh rate. Unfortunately, the typical way of obtaining this information, the Last-Modified: HTTP header, does not always provide correct information. In this work, we discuss, with the help of experiments, various techniques that can be used to determine the date of last modification of a Web page. These techniques help in adjusting the crawl rate for a specific page and in presenting users with up-to-date information, and thus make future versioning of a Web page more meticulous.
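The abstract does not list the specific techniques the paper evaluates, so the following is only a minimal sketch of one plausible sanity check on the Last-Modified: header, not the authors' method. The function name `plausible_last_modified`, the one-second tolerance, and the two rejection rules (a timestamp in the future, or a timestamp identical to the response's Date: header, which often indicates a dynamically generated page) are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def _parse_http_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an HTTP date string; return None if it is missing or malformed."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    # Treat timezone-less results as UTC so comparisons are well defined.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def plausible_last_modified(
    headers: Mapping[str, str],
    now: Optional[datetime] = None,
    tolerance: timedelta = timedelta(seconds=1),  # assumed threshold
) -> Optional[datetime]:
    """Return the Last-Modified time if it passes basic checks, else None."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}
    last_modified = _parse_http_date(lowered.get("last-modified"))
    if last_modified is None:
        return None
    now = now or datetime.now(timezone.utc)
    if last_modified > now:
        return None  # a modification time in the future cannot be trusted
    served_at = _parse_http_date(lowered.get("date"))
    if served_at is not None and abs(served_at - last_modified) <= tolerance:
        # Assumed heuristic: the page was stamped at response time.
        return None
    return last_modified
```

A crawler could call this on each response and fall back to another signal (for example, comparing the content with the previously archived copy) whenever it returns None.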