A pattern-based selective recrawling approach for object-level vertical search

Yaqian Zhou, Qi Zhang, Xuanjing Huang, Lide Wu
{"title":"A pattern-based selective recrawling approach for object-level vertical search","authors":"Yaqian Zhou, Qi Zhang, Xuanjing Huang, Lide Wu","doi":"10.1145/2505515.2505707","DOIUrl":null,"url":null,"abstract":"Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To deal with this problem, we propose a new hypertext resource discovery method, called ``selective recrawling'' for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns, then select those pages that have the widest coverage, and least irrelevance and redundancy relative to a pre-defined vertical domain. This method only requires a few seed objects and can select the set of URL patterns that covers the greatest number of objects. The selected set can continue to be used for some time to recrawl web pages and can be renewed periodically. This leads to significant savings in hardware and network resources. In this paper we present a detailed framework of selective recrawling for object-level vertical search. The selective recrawling method automatically extends the set of candidate websites from initial seed objects. Based on the objects extracted from these websites it learns a set of URL patterns which covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method can greatly reduce downloading of web pages while maintaining comparative object coverage.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"31 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2505515.2505707","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To deal with this problem, we propose a new hypertext resource discovery method, called ``selective recrawling'' for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns, then select those pages that have the widest coverage, and least irrelevance and redundancy relative to a pre-defined vertical domain. This method only requires a few seed objects and can select the set of URL patterns that covers the greatest number of objects. The selected set can continue to be used for some time to recrawl web pages and can be renewed periodically. This leads to significant savings in hardware and network resources. In this paper we present a detailed framework of selective recrawling for object-level vertical search. The selective recrawling method automatically extends the set of candidate websites from initial seed objects. Based on the objects extracted from these websites it learns a set of URL patterns which covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method can greatly reduce downloading of web pages while maintaining comparative object coverage.
对象级垂直搜索的基于模式的选择性抓取方法
传统的抓取方法通过学习导航模式来抓取相关的网页。然而,它们不能消除在web上发现的冗余,特别是在对象级别。为了解决这个问题,我们提出了一种新的超文本资源发现方法,称为“选择性抓取”,用于对象级垂直搜索应用。选择性重新抓取的目标是自动生成URL模式,然后选择那些覆盖范围最广、相对于预定义的垂直域不相关和冗余最少的页面。此方法只需要几个种子对象,并且可以选择覆盖最多对象的URL模式集。选择的集合可以继续使用一段时间来重新抓取网页,并且可以定期更新。这将大大节省硬件和网络资源。在本文中,我们提出了一个对象级垂直搜索的选择性抓取的详细框架。选择性抓取方法自动从初始种子对象扩展候选网站集。基于从这些网站中提取的对象,它学习一组URL模式,这些模式覆盖了最多的目标对象,并且冗余少。最后,从所选URL模式集生成的导航模式用于指导将来的爬行。在局部事件数据上的实验表明,该方法可以在保持相对对象覆盖率的同时大大减少网页下载。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信