{"title":"web-repository爬虫抓取策略的评估","authors":"F. McCown, Michael L. Nelson","doi":"10.1145/1149941.1149972","DOIUrl":null,"url":null,"abstract":"We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live versions of the websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.","PeriodicalId":134809,"journal":{"name":"UK Conference on Hypertext","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"33","resultStr":"{\"title\":\"Evaluation of crawling policies for a web-repository crawler\",\"authors\":\"F. McCown, Michael L. Nelson\",\"doi\":\"10.1145/1149941.1149972\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live versions of the websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.\",\"PeriodicalId\":134809,\"journal\":{\"name\":\"UK Conference on Hypertext\",\"volume\":\"70 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-08-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"33\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"UK Conference on Hypertext\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1149941.1149972\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"UK Conference on Hypertext","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1149941.1149972","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Evaluation of crawling policies for a web-repository crawler
We have developed a web-repository crawler for reconstructing websites when backups are unavailable. The crawler retrieves web resources from the Internet Archive, Google, Yahoo, and MSN. We examine the challenges of crawling web repositories and discuss strategies for overcoming some of these obstacles. We propose three crawling policies that can be used to reconstruct websites and evaluate their effectiveness by reconstructing 24 websites and comparing the results with the live versions. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.
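The abstract describes the approach at a high level rather than the mechanics, so a sketch may help. Below is a minimal Python illustration of the core operation such a web-repository crawler performs: asking a repository whether it holds a copy of a lost URL and, if so, recovering it. It uses only the Internet Archive's current Wayback Machine "availability" API; the Google, Yahoo, and MSN cache interfaces the 2006 paper drew on are no longer offered, so this is a simplified modern analogue of the idea, not the authors' implementation, and the function names are hypothetical.

```python
# Sketch of a single-repository lookup step for a web-repository crawler.
# Assumption: illustrative only -- the paper's crawler also queried search
# engine caches and implemented the crawling policies evaluated there.
from __future__ import annotations

import json
import urllib.parse
import urllib.request

WAYBACK_API = "https://archive.org/wayback/available?url={}"


def closest_snapshot(url: str) -> str | None:
    """Ask the Internet Archive for the snapshot of `url` closest to now.

    Returns the archived copy's URL, or None if the repository has no copy.
    """
    api_url = WAYBACK_API.format(urllib.parse.quote(url, safe=""))
    with urllib.request.urlopen(api_url, timeout=30) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None


def recover(lost_urls: list[str]) -> dict[str, str | None]:
    """Map each lost URL to an archived copy, if one exists."""
    return {u: closest_snapshot(u) for u in lost_urls}


if __name__ == "__main__":
    for url, archived in recover(["http://example.com/"]).items():
        print(url, "->", archived or "not found in the repository")
```

A full reconstruction crawler would go further: parse each recovered page for links to other lost resources, queue them, and decide in what order and from which repositories to fetch them, which is precisely where the crawling policies evaluated in the paper come in.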