Publication Date Prediction through Reverse Engineering of the Web

Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. Pub Date: 2016-02-08. DOI: 10.1145/2835776.2835796

L. Ostroumova, P. Prokhorenkov, E. Samosvat, P. Serdyukov

Citations: 5

Abstract

In this paper, we focus on one of the most challenging tasks in temporal information retrieval: detecting the publication date of a web page. The natural approach to this problem is to find the publication date in the HTML body of a page. However, there are two fundamental problems with this approach. First, not all web pages contain their publication dates in their text. Second, it is hard to distinguish the publication date from all the other dates found in the page's text. The approach we suggest in this paper supplements methods of date extraction from the page's text with novel link-based methods of dating. Some of our link-based methods are based on a probabilistic model of the evolution of the Web graph structure, which takes the publication dates of web pages as its parameters. We use this model to estimate the publication dates of web pages: based only on the link structure currently observed, we perform a "reverse engineering" to reveal the whole process of the Web's evolution.
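
The second problem named in the abstract, telling the publication date apart from the other dates in a page's text, can be illustrated with a minimal sketch. This is not the authors' extraction method: the `ISO_DATE` pattern, the `candidate_dates` helper, and the sample `page_text` are all hypothetical, and the pattern assumes dates are written in ISO `YYYY-MM-DD` form only.

```python
import re
from datetime import date

# Hypothetical pattern: matches ISO-style dates such as 2016-02-08 only.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def candidate_dates(text):
    """Return every valid ISO-style date found in text, in order of appearance."""
    found = []
    for year, month, day in ISO_DATE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Skip strings that look like dates but are not valid calendar dates.
            continue
    return found


# Hypothetical page text: one publication date plus several unrelated dates.
page_text = (
    "Posted 2016-02-08. The conference ran until 2016-02-25. "
    "Last comment: 2016-03-01. Build 2016-13-45."
)

for candidate in candidate_dates(page_text):
    print(candidate.isoformat())
```

Even on this small example the extractor returns three plausible candidates, and nothing in the text alone marks the first as the publication date. A page with no dates in its text yields an empty list, which corresponds to the first problem the abstract describes.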

