Aiding web crawlers; projecting web page last modification

2012 15th International Multitopic Conference (INMIC) Pub Date : 2012-12-01 DOI:10.1109/INMIC.2012.6511443

A. Anjum, Adnan Anjum


Citations: 3

Abstract

Because of the colossal amount of data on the Web, Web archivists typically rely on Web crawlers for automated collection. The Internet Archive is the largest organization that uses a crawling approach to maintain an archive of the entire Web. The most important requirement of a Web crawler, especially when it is used for Web archiving, is to know the date (and time) of the last modification of a Web page. Knowing this has several advantages, the most important of which are (i) presenting an up-to-date version of a Web page to the end user and (ii) making it easier to adjust the crawl rate, which allows future retrieval of a Web page's version at a given date or computation of its refresh rate. The typical way to obtain this modification information, the Last-Modified: HTTP header, unfortunately does not always provide correct information. In this work, we discuss, with the help of experiments, various techniques that can be used to determine the date of last modification of a Web page. This helps in adjusting the crawl rate for a specific page and in presenting users with up-to-date information, and thus makes future versioning of a Web page more meticulous.
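To make the Last-Modified problem concrete, the following is a minimal Python sketch of how a crawler might sanity-check the header before trusting it. It is not the method evaluated in the paper: the function names, the labels, and the heuristics (a missing or malformed value, a timestamp later than the response's Date header, or a timestamp identical to the response time, which can suggest a dynamically generated page) are assumptions chosen for illustration only.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_http_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an HTTP date header value; return None if missing or malformed."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Assumption: treat a zone-less timestamp as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def classify_last_modified(
    last_modified: Optional[str],
    date: Optional[str],
    tolerance: timedelta = timedelta(seconds=1),
) -> str:
    """Label a Last-Modified value using illustrative (hypothetical) heuristics."""
    modified = parse_http_date(last_modified)
    if modified is None:
        return "missing"  # header absent or unparsable
    served = parse_http_date(date)
    if served is None:
        return "unverified"  # no Date header to compare against
    if modified > served + tolerance:
        return "future"  # claims a change after the response was sent
    if served - modified <= tolerance:
        return "same-as-response-time"  # possibly generated on the fly
    return "plausible"
```

A crawler could fall back to other signals, such as timestamps found in the page content, whenever the label is anything other than "plausible"; which fallback to use is left open here.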

