Efficient web harvesting strategies for monitoring deep web content

Mohammadreza Khelghati, D. Hiemstra, M. V. Keulen
Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services
DOI: 10.1145/3011141.3011198 (https://doi.org/10.1145/3011141.3011198)
Publication date: 2016-05-15

Abstract

Web content changes rapidly [18]. In Focused Web Harvesting [17], whose aim is to achieve a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need access to all the web data relevant to their topics of interest. Whether you are a fan following your favorite idol or a journalist investigating a topic, you may need access not only to all the relevant information but also to recent changes and updates. General search engines like Google apply several techniques to enhance the freshness of their crawled data. However, in focused web harvesting, we lack an efficient approach that detects changes for a given topic over time. In this paper, we focus on techniques that can keep the content relevant to a given query up to date. To do so, we test four different approaches to efficiently harvest all the changed documents matching a given entity by querying web search engines. We define a changed document as a document whose content has changed, or one that has been newly created or removed. Among the proposed change detection approaches, the FedWeb method outperforms the others in finding changed content on the web for a given query, performing 20 percent better on average.
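
The definition of a changed document given in the abstract can be illustrated with a minimal sketch. This is not one of the four approaches evaluated in the paper, and it does not query any search engine; it only assumes that two harvests of the same query are available as hypothetical dictionaries mapping a URL to the page text, and it classifies each URL as created, removed, or modified. The names `ChangeSet`, `fingerprint`, and `detect_changes` are illustrative and do not come from the paper.

```python
import hashlib
from typing import Dict, NamedTuple, Set


class ChangeSet(NamedTuple):
    """URLs grouped by the three kinds of change named in the abstract."""
    created: Set[str]
    removed: Set[str]
    modified: Set[str]

    def all_changed(self) -> Set[str]:
        # A "changed document" is any of: newly created, removed, or modified.
        return self.created | self.removed | self.modified


def fingerprint(text: str) -> str:
    """Return a SHA-256 digest of the page text (assumed comparison key)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_changes(old: Dict[str, str], new: Dict[str, str]) -> ChangeSet:
    """Compare two harvests (URL -> text) and classify the differences."""
    old_urls, new_urls = set(old), set(new)
    created = new_urls - old_urls   # present only in the later harvest
    removed = old_urls - new_urls   # present only in the earlier harvest
    modified = {
        url
        for url in old_urls & new_urls
        if fingerprint(old[url]) != fingerprint(new[url])
    }
    return ChangeSet(created, removed, modified)


# Hypothetical harvests of the same query at two points in time.
earlier = {"a": "first page", "b": "second page", "c": "third page"}
later = {"a": "first page", "b": "second page, edited", "d": "fourth page"}
changes = detect_changes(earlier, later)
```

Comparing exact digests treats any byte-level difference as a change; a real harvester would likely need to ignore boilerplate such as timestamps or advertisements, which is outside the scope of this sketch.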