协助网络爬虫;投影网页上次修改

2012 15th International Multitopic Conference (INMIC) Pub Date : 2012-12-01 DOI:10.1109/INMIC.2012.6511443

A. Anjum, Adnan Anjum

{"title":"协助网络爬虫;投影网页上次修改","authors":"A. Anjum, Adnan Anjum","doi":"10.1109/INMIC.2012.6511443","DOIUrl":null,"url":null,"abstract":"Due to colossal amount of data on the Web, Web archivists typically make use of Web crawlers for automated collection. The Internet Archive is the largest organization based on a crawling approach in order to maintain an archive of the entire Web. The most important requirement of a Web crawler, specially when they are used for Web archiving, is to be aware of the date (and time) of last modification of a Web page. This strategy has various advantages, most important of them include i) presentation of an up-to-date version of a Web page to the end user ii) ease of adjusting the crawl rate that allows future retrieval of a Web page's version at a given date, or to compute its refresh rate. The typical way for this modification information of a Web page, that is, to use the Last-Modified: HTTP header, unfortunately does not provide correct information every time. In this work, we discuss various techniques that can be used to determine the date of last modification of a Web page with the help of experiments. This will help in adjusting the crawl rate for a specific page and also helps in presenting users with up to date information and thus allowing future versioning of a Web page more meticulous.","PeriodicalId":396084,"journal":{"name":"2012 15th International Multitopic Conference (INMIC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Aiding web crawlers; projecting web page last modification\",\"authors\":\"A. Anjum, Adnan Anjum\",\"doi\":\"10.1109/INMIC.2012.6511443\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Due to colossal amount of data on the Web, Web archivists typically make use of Web crawlers for automated collection. The Internet Archive is the largest organization based on a crawling approach in order to maintain an archive of the entire Web. The most important requirement of a Web crawler, specially when they are used for Web archiving, is to be aware of the date (and time) of last modification of a Web page. This strategy has various advantages, most important of them include i) presentation of an up-to-date version of a Web page to the end user ii) ease of adjusting the crawl rate that allows future retrieval of a Web page's version at a given date, or to compute its refresh rate. The typical way for this modification information of a Web page, that is, to use the Last-Modified: HTTP header, unfortunately does not provide correct information every time. In this work, we discuss various techniques that can be used to determine the date of last modification of a Web page with the help of experiments. This will help in adjusting the crawl rate for a specific page and also helps in presenting users with up to date information and thus allowing future versioning of a Web page more meticulous.\",\"PeriodicalId\":396084,\"journal\":{\"name\":\"2012 15th International Multitopic Conference (INMIC)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 15th International Multitopic Conference (INMIC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/INMIC.2012.6511443\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 15th International Multitopic Conference (INMIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INMIC.2012.6511443","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

由于Web上有大量的数据，Web档案管理员通常使用Web爬虫进行自动收集。Internet Archive是基于爬行方法的最大组织，目的是维护整个Web的存档。Web爬虫最重要的要求，特别是当它们用于Web存档时，是要知道最后修改Web页面的日期(和时间)。这种策略有许多优点，其中最重要的优点包括:i)向最终用户显示Web页面的最新版本;ii)易于调整抓取速率，以便将来在给定日期检索Web页面的版本，或计算其刷新率。获取Web页面的这种修改信息的典型方法，即使用Last-Modified: HTTP标头，不幸的是并不是每次都提供正确的信息。在这项工作中，我们讨论了各种技术，这些技术可以在实验的帮助下用于确定网页的最后修改日期。这将有助于调整特定页面的抓取速度，还有助于向用户提供最新信息，从而允许对Web页面的未来版本进行更细致的控制。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Aiding web crawlers; projecting web page last modification

Due to colossal amount of data on the Web, Web archivists typically make use of Web crawlers for automated collection. The Internet Archive is the largest organization based on a crawling approach in order to maintain an archive of the entire Web. The most important requirement of a Web crawler, specially when they are used for Web archiving, is to be aware of the date (and time) of last modification of a Web page. This strategy has various advantages, most important of them include i) presentation of an up-to-date version of a Web page to the end user ii) ease of adjusting the crawl rate that allows future retrieval of a Web page's version at a given date, or to compute its refresh rate. The typical way for this modification information of a Web page, that is, to use the Last-Modified: HTTP header, unfortunately does not provide correct information every time. In this work, we discuss various techniques that can be used to determine the date of last modification of a Web page with the help of experiments. This will help in adjusting the crawl rate for a specific page and also helps in presenting users with up to date information and thus allowing future versioning of a Web page more meticulous.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 15th International Multitopic Conference (INMIC)

自引率

0.00%

发文量