基于引文的抄袭检测的引文模式匹配算法:贪婪引文平铺、引文分块和最长公共引文序列

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2011-09-19 DOI:10.1145/2034691.2034741

Bela Gipp, Norman Meuschke

{"title":"基于引文的抄袭检测的引文模式匹配算法:贪婪引文平铺、引文分块和最长公共引文序列","authors":"Bela Gipp, Norman Meuschke","doi":"10.1145/2034691.2034741","DOIUrl":null,"url":null,"abstract":"Plagiarism Detection Systems have been developed to locate instances of plagiarism e.g. within scientific papers. Studies have shown that the existing approaches deliver reasonable results in identifying copy&paste plagiarism, but fail to detect more sophisticated forms such as paraphrased plagiarism, translation plagiarism or idea plagiarism. The authors of this paper demonstrated in recent studies that the detection rate can be significantly improved by not only relying on text analysis, but by additionally analyzing the citations of a document. Citations are valuable language independent markers that are similar to a fingerprint. In fact, our examinations of real world cases have shown that the order of citations in a document often remains similar even if the text has been strongly paraphrased or translated in order to disguise plagiarism.\n This paper introduces three algorithms and discusses their suitability for the purpose of citation-based plagiarism detection. Due to the numerous ways in which plagiarism can occur, these algorithms need to be versatile. They must be capable of detecting transpositions, scaling and combinations in a local and global form. The algorithms are coined Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence. The evaluation showed that if these algorithms are combined, common forms of plagiarism can be detected reliably.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"242 1","pages":"249-258"},"PeriodicalIF":0.0000,"publicationDate":"2011-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"76","resultStr":"{\"title\":\"Citation pattern matching algorithms for citation-based plagiarism detection: greedy citation tiling, citation chunking and longest common citation sequence\",\"authors\":\"Bela Gipp, Norman Meuschke\",\"doi\":\"10.1145/2034691.2034741\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Plagiarism Detection Systems have been developed to locate instances of plagiarism e.g. within scientific papers. Studies have shown that the existing approaches deliver reasonable results in identifying copy&paste plagiarism, but fail to detect more sophisticated forms such as paraphrased plagiarism, translation plagiarism or idea plagiarism. The authors of this paper demonstrated in recent studies that the detection rate can be significantly improved by not only relying on text analysis, but by additionally analyzing the citations of a document. Citations are valuable language independent markers that are similar to a fingerprint. In fact, our examinations of real world cases have shown that the order of citations in a document often remains similar even if the text has been strongly paraphrased or translated in order to disguise plagiarism.\\n This paper introduces three algorithms and discusses their suitability for the purpose of citation-based plagiarism detection. Due to the numerous ways in which plagiarism can occur, these algorithms need to be versatile. They must be capable of detecting transpositions, scaling and combinations in a local and global form. The algorithms are coined Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence. The evaluation showed that if these algorithms are combined, common forms of plagiarism can be detected reliably.\",\"PeriodicalId\":91385,\"journal\":{\"name\":\"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering\",\"volume\":\"242 1\",\"pages\":\"249-258\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-09-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"76\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2034691.2034741\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2034691.2034741","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 76

摘要

抄袭检测系统已经开发出来，以定位抄袭的实例，例如在科学论文中。研究表明，现有的方法在识别复制粘贴剽窃方面取得了不错的结果，但无法检测更复杂的形式，如释义剽窃、翻译剽窃或思想剽窃。本文作者在最近的研究中证明，除了依赖文本分析之外，还可以通过对文档的引用进行分析来显著提高检测率。引文是有价值的独立于语言的标记，类似于指纹。事实上，我们对现实世界案例的研究表明，即使为了掩盖剽窃而对文本进行了强烈的释义或翻译，文档中的引用顺序通常仍然相似。本文介绍了三种算法，并讨论了它们在基于引文的抄袭检测中的适用性。由于抄袭可能以多种方式发生，这些算法需要是通用的。它们必须能够在局部和全局形式中检测换位、缩放和组合。算法包括贪心引文平铺、引文分块和最长公共引文序列。评估表明，如果将这些算法结合起来，可以可靠地检测出常见的抄袭形式。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Citation pattern matching algorithms for citation-based plagiarism detection: greedy citation tiling, citation chunking and longest common citation sequence

Plagiarism Detection Systems have been developed to locate instances of plagiarism e.g. within scientific papers. Studies have shown that the existing approaches deliver reasonable results in identifying copy&paste plagiarism, but fail to detect more sophisticated forms such as paraphrased plagiarism, translation plagiarism or idea plagiarism. The authors of this paper demonstrated in recent studies that the detection rate can be significantly improved by not only relying on text analysis, but by additionally analyzing the citations of a document. Citations are valuable language independent markers that are similar to a fingerprint. In fact, our examinations of real world cases have shown that the order of citations in a document often remains similar even if the text has been strongly paraphrased or translated in order to disguise plagiarism. This paper introduces three algorithms and discusses their suitability for the purpose of citation-based plagiarism detection. Due to the numerous ways in which plagiarism can occur, these algorithms need to be versatile. They must be capable of detecting transpositions, scaling and combinations in a local and global form. The algorithms are coined Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence. The evaluation showed that if these algorithms are combined, common forms of plagiarism can be detected reliably.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering

自引率

0.00%

发文量