Source Retrieval for Web-Scale Text Reuse Detection

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. Pub Date: 2017-11-06. DOI: 10.1145/3132847.3133097

Matthias Hagen, Martin Potthast, Payam Adineh, Ehsan Fatehifar, Benno Stein

Citations: 17

Abstract

The first step of text reuse detection addresses the source retrieval problem: given a suspicious document, a set of candidate sources from which text might have been reused has to be retrieved by querying a search engine. In a second step, the retrieved candidates are aligned with the suspicious document in order to identify reused passages. Any true source of text reuse that is not retrieved during the source retrieval step reduces the overall recall of a reuse detector. Hence, source retrieval is a recall-oriented task, a fact ignored even by experts: only 3 of 20 teams participating in the corresponding task at PAN 2012-2016 managed to find more than half of the sources, with the best one achieving a recall of only 0.59. We propose a new approach that reaches a recall of 0.89, a performance gain of 51%.
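The recall figures in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: the function names `recall` and `relative_gain` and the toy document identifiers are assumptions introduced here. It assumes that recall is the fraction of true sources found among the retrieved candidates, and that the reported gain is relative to the best previous recall.

```python
def recall(retrieved: set[str], true_sources: set[str]) -> float:
    """Fraction of the true sources that appear among the retrieved candidates."""
    if not true_sources:
        raise ValueError("true_sources must not be empty")
    return len(retrieved & true_sources) / len(true_sources)


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Hypothetical toy example: 4 true sources, 3 of them retrieved.
toy_recall = recall({"a", "b", "c", "x"}, {"a", "b", "c", "d"})

# Figures from the abstract: best PAN recall 0.59 vs. the proposed 0.89.
gain = relative_gain(0.59, 0.89)
print(f"toy recall = {toy_recall:.2f}, relative gain = {gain:.0%}")
```

With the rounded values from the abstract, (0.89 - 0.59) / 0.59 is about 0.508, which matches the reported gain of 51%. The paper may have computed the gain from unrounded recall values.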


