Learning to Discover Domain-Specific Web Content

Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining | Pub Date: 2018-02-02 | DOI: 10.1145/3159652.3159724

Kien Pham, Aécio Santos, J. Freire


Citation count: 8

Abstract

The ability to discover all content relevant to an information domain has many applications, from helping to understand humanitarian crises to countering human and arms trafficking. In such applications, time is of the essence: it is crucial both to maximize coverage and to identify new content as soon as it becomes available, so that appropriate actions can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield of new content. By learning patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach can achieve 150% higher coverage than existing state-of-the-art techniques. It is also able to capture 80% of new relevant content less than 4 hours after publication.
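The idea of re-crawling a small set of high-yield pages can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not describe the learned model or the factors it combines, so the sketch below simply assumes that yield is estimated as the average number of new relevant URLs found per visit, and that a fixed budget limits how many pages are re-crawled per round. The names `YieldTracker`, `record`, `estimated_yield`, and `select` are hypothetical.

```python
from collections import defaultdict


class YieldTracker:
    """Hypothetical helper: tracks how many new relevant URLs each re-crawl found."""

    def __init__(self):
        self.visits = defaultdict(int)     # page -> number of re-crawls so far
        self.new_links = defaultdict(int)  # page -> total new relevant URLs found

    def record(self, page, new_relevant_count):
        """Update the statistics after one re-crawl of `page`."""
        self.visits[page] += 1
        self.new_links[page] += new_relevant_count

    def estimated_yield(self, page):
        """Assumed yield estimate: mean new relevant URLs per visit."""
        visits = self.visits.get(page, 0)
        if visits == 0:
            return 0.0
        return self.new_links.get(page, 0) / visits

    def select(self, budget):
        """Return up to `budget` pages with the highest estimated yield.

        Ties are broken alphabetically so the result is deterministic.
        """
        ranked = sorted(self.visits, key=lambda p: (-self.estimated_yield(p), p))
        return ranked[:budget]


# Toy usage with made-up observations.
tracker = YieldTracker()
tracker.record("hub", 5)
tracker.record("hub", 3)
tracker.record("static", 0)
tracker.record("blog", 1)
selected = tracker.select(budget=2)
print(selected)
```

Because `record` is called after every re-crawl, the ranking changes as observations accumulate, which loosely mirrors the abstract's point about adapting the strategy as the crawl progresses.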

