对象级垂直搜索的基于模式的选择性抓取方法

Proceedings of the 22nd ACM international conference on Information & Knowledge Management Pub Date : 2013-10-27 DOI:10.1145/2505515.2505707

Yaqian Zhou, Qi Zhang, Xuanjing Huang, Lide Wu

{"title":"对象级垂直搜索的基于模式的选择性抓取方法","authors":"Yaqian Zhou, Qi Zhang, Xuanjing Huang, Lide Wu","doi":"10.1145/2505515.2505707","DOIUrl":null,"url":null,"abstract":"Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To deal with this problem, we propose a new hypertext resource discovery method, called ``selective recrawling'' for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns, then select those pages that have the widest coverage, and least irrelevance and redundancy relative to a pre-defined vertical domain. This method only requires a few seed objects and can select the set of URL patterns that covers the greatest number of objects. The selected set can continue to be used for some time to recrawl web pages and can be renewed periodically. This leads to significant savings in hardware and network resources. In this paper we present a detailed framework of selective recrawling for object-level vertical search. The selective recrawling method automatically extends the set of candidate websites from initial seed objects. Based on the objects extracted from these websites it learns a set of URL patterns which covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method can greatly reduce downloading of web pages while maintaining comparative object coverage.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"31 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A pattern-based selective recrawling approach for object-level vertical search\",\"authors\":\"Yaqian Zhou, Qi Zhang, Xuanjing Huang, Lide Wu\",\"doi\":\"10.1145/2505515.2505707\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To deal with this problem, we propose a new hypertext resource discovery method, called ``selective recrawling'' for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns, then select those pages that have the widest coverage, and least irrelevance and redundancy relative to a pre-defined vertical domain. This method only requires a few seed objects and can select the set of URL patterns that covers the greatest number of objects. The selected set can continue to be used for some time to recrawl web pages and can be renewed periodically. This leads to significant savings in hardware and network resources. In this paper we present a detailed framework of selective recrawling for object-level vertical search. The selective recrawling method automatically extends the set of candidate websites from initial seed objects. Based on the objects extracted from these websites it learns a set of URL patterns which covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method can greatly reduce downloading of web pages while maintaining comparative object coverage.\",\"PeriodicalId\":20528,\"journal\":{\"name\":\"Proceedings of the 22nd ACM international conference on Information & Knowledge Management\",\"volume\":\"31 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-10-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 22nd ACM international conference on Information & Knowledge Management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2505515.2505707\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2505515.2505707","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

传统的抓取方法通过学习导航模式来抓取相关的网页。然而，它们不能消除在web上发现的冗余，特别是在对象级别。为了解决这个问题，我们提出了一种新的超文本资源发现方法，称为“选择性抓取”，用于对象级垂直搜索应用。选择性重新抓取的目标是自动生成URL模式，然后选择那些覆盖范围最广、相对于预定义的垂直域不相关和冗余最少的页面。此方法只需要几个种子对象，并且可以选择覆盖最多对象的URL模式集。选择的集合可以继续使用一段时间来重新抓取网页，并且可以定期更新。这将大大节省硬件和网络资源。在本文中，我们提出了一个对象级垂直搜索的选择性抓取的详细框架。选择性抓取方法自动从初始种子对象扩展候选网站集。基于从这些网站中提取的对象，它学习一组URL模式，这些模式覆盖了最多的目标对象，并且冗余少。最后，从所选URL模式集生成的导航模式用于指导将来的爬行。在局部事件数据上的实验表明，该方法可以在保持相对对象覆盖率的同时大大减少网页下载。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A pattern-based selective recrawling approach for object-level vertical search

Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To deal with this problem, we propose a new hypertext resource discovery method, called ``selective recrawling'' for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns, then select those pages that have the widest coverage, and least irrelevance and redundancy relative to a pre-defined vertical domain. This method only requires a few seed objects and can select the set of URL patterns that covers the greatest number of objects. The selected set can continue to be used for some time to recrawl web pages and can be renewed periodically. This leads to significant savings in hardware and network resources. In this paper we present a detailed framework of selective recrawling for object-level vertical search. The selective recrawling method automatically extends the set of candidate websites from initial seed objects. Based on the objects extracted from these websites it learns a set of URL patterns which covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method can greatly reduce downloading of web pages while maintaining comparative object coverage.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 22nd ACM international conference on Information & Knowledge Management

自引率

0.00%

发文量