Matthias Hagen, Martin Potthast, Payam Adineh, Ehsan Fatehifar, Benno Stein
Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, published 2017-11-06. DOI: 10.1145/3132847.3133097 (https://doi.org/10.1145/3132847.3133097)
Source Retrieval for Web-Scale Text Reuse Detection
The first step of text reuse detection addresses the source retrieval problem: given a suspicious document, a set of candidate sources from which text might have been reused has to be retrieved by querying a search engine. Afterwards, in a second step, the retrieved candidates are aligned with the suspicious document in order to identify reused passages. Obviously, any true source of text reuse that is not retrieved during the source retrieval step reduces the overall recall of a reuse detector. Hence, source retrieval is a recall-oriented task, a fact ignored even by experts: only 3 of 20 teams participating in the corresponding task at PAN 2012-2016 managed to find more than half of the sources, the best one achieving a recall of only 0.59. We propose a new approach that reaches a recall of 0.89, a performance gain of 51% relative to that best result.
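
The recall figures above can be made concrete with a minimal sketch. It assumes recall is the fraction of true sources that appear among the retrieved candidates, and that the 51% gain is relative to the best earlier recall of 0.59. The function names and the toy document IDs are hypothetical and are not taken from the paper.

```python
def recall(retrieved, true_sources):
    """Fraction of true sources that appear among the retrieved candidates."""
    true_sources = set(true_sources)
    if not true_sources:
        raise ValueError("need at least one true source")
    return len(true_sources & set(retrieved)) / len(true_sources)


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Hypothetical toy data: these document IDs are made up for illustration.
true_sources = {"d1", "d2", "d3", "d4"}
retrieved = ["d2", "d9", "d4", "d7", "d1"]

# 3 of the 4 true sources were retrieved; the missed one ("d3") can no
# longer be found by the later text alignment step.
print(recall(retrieved, true_sources))

# Recall values quoted in the abstract: 0.89 (new) versus 0.59 (best earlier).
print(round(relative_gain(0.89, 0.59) * 100))
```

Under these assumptions, a source missed at retrieval time lowers recall no matter how good the subsequent alignment is, which is why the task is described as recall-oriented.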