Countering Plagiarism by Exposing Irregularities in Authors' Grammar

2013 European Intelligence and Security Informatics Conference. Pub Date: 2013-08-12. DOI: 10.1109/EISIC.2013.10

Michael Tschuggnall, Günther Specht


Citations: 15

Abstract

Unauthorized copying or stealing of the intellectual property of others is a serious problem in modern society. In the case of textual plagiarism, it is becoming ever easier to find appropriate sources using the huge amount of data available through online databases. To counter this problem, the two main approaches are categorized as external and intrinsic plagiarism detection, respectively. While external algorithms can compare a suspicious document with numerous sources, intrinsic algorithms are allowed to inspect only the suspicious document in order to predict plagiarism, which is especially important if no sources are available. In this paper we present a novel approach in the field of intrinsic plagiarism detection that analyzes syntactic information of authors and finds irregularities in sentence constructions. The main idea follows the assumption that authors have their own, mostly unconsciously used, set of rules for building sentences, which can be utilized to distinguish authors. The algorithm therefore splits a suspicious document into single sentences, tags each word with part-of-speech (POS) classifiers and creates POS sequences representing each sentence. Subsequently, the distance between every distinct pair of sentences is calculated by applying modified sequence alignment algorithms and stored in a distance matrix. After applying a Gaussian normal distribution function to the mean distance of each sentence, suspicious sentences are selected, grouped and predicted to be plagiarized. Finally, the thresholds and parameters used by the algorithm are optimized by applying genetic algorithms. The approach has been evaluated against a large test corpus of English documents, showing promising results.
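The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the modified sequence alignment, the grouping step, or the genetically optimized thresholds. The sketch assumes plain Levenshtein distance normalized by sequence length as a stand-in for the alignment, a hypothetical cutoff k = 1.5 standard deviations above the mean, and hand-written Penn Treebank style POS sequences in place of a real tagger. The names edit_distance, distance_matrix and suspicious_sentences are illustrative only.

```python
from statistics import mean, pstdev


def edit_distance(a, b):
    """Plain Levenshtein distance between two POS-tag sequences.

    Assumption: a stand-in for the paper's modified sequence alignment.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of length-normalized distances for every sentence pair."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            norm = max(len(seqs[i]), len(seqs[j])) or 1
            d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j]) / norm
    return d


def suspicious_sentences(seqs, k=1.5):
    """Indices of sentences whose mean distance to all other sentences lies
    more than k standard deviations above the average mean distance.

    Assumption: k is a hypothetical threshold; the paper tunes its
    thresholds with genetic algorithms.
    """
    n = len(seqs)
    if n < 2:
        return []
    d = distance_matrix(seqs)
    means = [sum(row) / (n - 1) for row in d]  # diagonal entries are 0
    mu, sigma = mean(means), pstdev(means)
    if sigma == 0:
        return []
    return [i for i, m in enumerate(means) if m > mu + k * sigma]


# Hand-written POS sequences (one per sentence); the last one is built
# to differ structurally from the rest.
SAMPLE = [
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["DT", "NN", "VBZ", "DT", "JJ", "NN"],
    ["DT", "JJ", "NN", "VBZ", "DT", "NN"],
    ["DT", "NN", "VBZ", "JJ"],
    ["IN", "VBG", "PRP", "RB", "VBD", "CC", "MD", "VB", "TO", "VB", "NNS"],
]

if __name__ == "__main__":
    print(suspicious_sentences(SAMPLE))
```

In this toy input only the structurally different last sentence is flagged. In practice the POS sequences would come from a tagger run over each sentence of the suspicious document, and flagged sentences would then be grouped into passages as the abstract describes.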
