Structure-driven crawler generation by example

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Pub Date: 2006-08-06
DOI: 10.1145/1148170.1148223

Márcio L. A. Vidal, A. D. Silva, E. Moura, J. Cavalcanti


Citations: 61

Abstract

Many Web IR and Digital Library applications require a crawling process to collect pages, with the ultimate goal of exploiting useful information available on Web sites. For some of these applications, the criteria that determine whether a page belongs in a collection are related to the page content. However, there are situations in which the inner structure of the pages provides better criteria to guide the crawling process than their content. In this paper, we present a structure-driven approach for generating Web crawlers that requires minimal effort from users. The idea is to take as input a sample page and an entry point to a Web site and to generate a structure-driven crawler based on navigation patterns, that is, sequences of patterns for the links a crawler has to follow to reach the pages structurally similar to the sample page. In the experiments we have carried out, structure-driven crawlers generated by our new approach were able to collect all pages that match the samples given, including those pages added after their generation.
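The abstract does not give implementation details, so the following is only a minimal sketch of how a crawler driven by a navigation pattern might operate, not the authors' method. Several things here are assumptions made for illustration: the navigation pattern is modelled as a list of regular expressions over link URLs, structural similarity is approximated by Jaccard overlap of root-to-node tag paths, and `SITE`, `PageParser`, `parse`, `similarity`, and `crawl` are hypothetical names. The site is an in-memory dictionary, so no network access is needed.

```python
import re
from html.parser import HTMLParser

# Hypothetical in-memory site standing in for real HTTP fetches.
SITE = {
    "/": '<html><body><ul><li><a href="/artists/a">A</a></li>'
         '<li><a href="/about">About</a></li></ul></body></html>',
    "/artists/a": '<html><body><div><a href="/artist/1">One</a>'
                  '<a href="/artist/2">Two</a><a href="/">Home</a></div></body></html>',
    "/artist/1": '<html><body><h1>One</h1><table><tr><td>bio</td></tr></table></body></html>',
    "/artist/2": '<html><body><h1>Two</h1><table><tr><td>bio</td></tr></table></body></html>',
    "/about": '<html><body><p>About us</p></body></html>',
}


class PageParser(HTMLParser):
    """Collects outgoing links and the set of root-to-node tag paths."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.paths = set()
        self._stack = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        self.paths.add("/".join(self._stack))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()


def parse(html):
    parser = PageParser()
    parser.feed(html)
    return parser


def similarity(paths_a, paths_b):
    """Jaccard overlap of tag-path sets (an assumed stand-in for structural similarity)."""
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0


def crawl(entry, nav_pattern, sample_paths, threshold=0.8, fetch=SITE.get):
    """Follow one link pattern per level, then keep pages that resemble the sample."""
    frontier = [entry]
    for link_re in nav_pattern:
        next_frontier = []
        for url in frontier:
            html = fetch(url)
            if html is None:
                continue
            for href in parse(html).links:
                if re.fullmatch(link_re, href) and href not in next_frontier:
                    next_frontier.append(href)
        frontier = next_frontier
    collected = []
    for url in frontier:
        html = fetch(url)
        if html is not None and similarity(parse(html).paths, sample_paths) >= threshold:
            collected.append(url)
    return collected


# Sample page plus entry point, as in the abstract; the pattern itself is hand-written here.
sample = parse(SITE["/artist/1"]).paths
pages = crawl("/", [r"/artists/\w+", r"/artist/\d+"], sample)
```

In this toy setup `pages` holds the two artist pages and skips `/about` and the link back to the home page. Because the crawler follows link patterns rather than a fixed URL list, a page added to the listing later and matching the same patterns would also be collected on the next run, which is the behaviour the abstract reports. In the paper the navigation pattern is generated from the sample page and entry point; here it is supplied by hand.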

