Source Retrieval for Web-Scale Text Reuse Detection

Matthias Hagen, Martin Potthast, Payam Adineh, Ehsan Fatehifar, Benno Stein
Published in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017-11-06. DOI: https://doi.org/10.1145/3132847.3133097
Citations: 17

Abstract

The first step of text reuse detection addresses the source retrieval problem: given a suspicious document, a set of candidate sources from which text might have been reused has to be retrieved by querying a search engine. In a second step, the retrieved candidates are aligned with the suspicious document in order to identify reused passages. Any true source of text reuse that is not retrieved during the source retrieval step reduces the overall recall of a reuse detector. Hence, source retrieval is a recall-oriented task, a fact ignored even by experts: only 3 of the 20 teams participating in the corresponding task at PAN 2012-2016 managed to find more than half of the sources, and the best one achieved a recall of only 0.59. We propose a new approach that reaches a recall of 0.89, a performance gain of 51%.
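To make the recall figures concrete, the following is a minimal sketch of how source retrieval recall and the relative gain quoted above can be computed. It is not the authors' method or evaluation code: the function names `source_recall` and `relative_gain` and the toy document IDs are assumptions made for illustration, and only the values 0.59 and 0.89 come from the abstract.

```python
def source_recall(retrieved, true_sources):
    """Fraction of true sources that appear among the retrieved candidates."""
    true_sources = set(true_sources)
    if not true_sources:
        return 0.0
    return len(true_sources & set(retrieved)) / len(true_sources)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


# Hypothetical toy data: these document IDs are made up for illustration.
true_sources = {"d1", "d2", "d3", "d4"}
retrieved = ["d2", "d9", "d4", "d7"]
toy_recall = source_recall(retrieved, true_sources)  # 2 of 4 true sources found

# Recall values reported in the abstract: 0.59 (best PAN team) and 0.89 (proposed).
gain = relative_gain(0.89, 0.59)

print(f"toy recall: {toy_recall:.2f}")
print(f"relative gain: {gain:.0%}")
```

Under this reading, the 51% gain is relative to the previous best recall (0.30 / 0.59), not an absolute difference of recall points.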