Learning to Discover Domain-Specific Web Content

Kien Pham, Aécio Santos, J. Freire
Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. Published 2018-02-02.
DOI: 10.1145/3159652.3159724 (https://doi.org/10.1145/3159652.3159724)
Citations: 8

Abstract

The ability to discover all content relevant to an information domain has many applications, from helping to understand humanitarian crises to countering human and arms trafficking. In such applications, time is of the essence: it is crucial both to maximize coverage and to identify new content as soon as it becomes available, so that appropriate actions can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield of new content. By learning patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach can achieve 150% higher coverage than existing state-of-the-art techniques. In addition, it is able to capture 80% of new relevant content in less than 4 hours after publication.
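
The idea of selecting a small set of high-yield pages for frequent re-crawling can be illustrated with a minimal sketch. This is not the authors' method: it assumes yield is estimated as the average number of new relevant links seen per visit, and it picks pages greedily under a fixed budget. The names `estimate_yield`, `select_recrawl_set`, and `budget`, as well as the example URLs and counts, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical observation format: (page URL, new relevant links found on that visit).
Observation = Tuple[str, int]


def estimate_yield(observations: Iterable[Observation]) -> Dict[str, float]:
    """Return the average number of new relevant links per visit for each page."""
    totals: Dict[str, int] = defaultdict(int)
    visits: Dict[str, int] = defaultdict(int)
    for url, new_links in observations:
        totals[url] += new_links
        visits[url] += 1
    return {url: totals[url] / visits[url] for url in visits}


def select_recrawl_set(observations: Iterable[Observation], budget: int) -> List[str]:
    """Pick the `budget` pages with the highest estimated yield (ties broken by URL)."""
    yields = estimate_yield(observations)
    ranked = sorted(yields, key=lambda url: (-yields[url], url))
    return ranked[:budget]


# Made-up crawl history, for illustration only.
history = [
    ("https://example.org/news", 5),
    ("https://example.org/news", 3),
    ("https://example.org/about", 0),
    ("https://example.org/forum", 2),
    ("https://example.org/forum", 4),
    ("https://example.org/about", 0),
]

selected = select_recrawl_set(history, budget=2)
print(selected)
```

A strategy that adapts as the crawl progresses, as the abstract describes, would re-estimate yields as new observations arrive; this sketch computes them once from a fixed history.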