Countering Plagiarism by Exposing Irregularities in Authors' Grammar
Michael Tschuggnall, Günther Specht
2013 European Intelligence and Security Informatics Conference, published 2013-08-12
DOI: 10.1109/EISIC.2013.10 (https://doi.org/10.1109/EISIC.2013.10)
Unauthorized copying or theft of others' intellectual property is a serious problem in modern society. In the case of textual plagiarism, it is becoming increasingly easy to find appropriate sources in the huge amount of data available through online databases. To counter this problem, the two main approaches are external and intrinsic plagiarism detection. While external algorithms can compare a suspicious document with numerous sources, intrinsic algorithms may inspect only the suspicious document itself in order to predict plagiarism, which is especially important if no sources are available. In this paper we present a novel approach to intrinsic plagiarism detection that analyzes syntactic information of authors and finds irregularities in sentence construction. The main idea follows the assumption that authors have a mostly unconsciously used set of ways to build sentences, which can be utilized to distinguish authors. The algorithm therefore splits a suspicious document into single sentences, tags each word with its part-of-speech (POS) class, and creates a POS sequence representing each sentence. Subsequently, the distance between every distinct pair of sentences is calculated by applying modified sequence alignment algorithms and stored in a distance matrix. After applying a Gaussian normal distribution function to the mean distances of each sentence, suspicious sentences are selected, grouped and predicted to be plagiarized. Finally, the thresholds and parameters used by the algorithm are optimized with genetic algorithms. The approach has been evaluated against a large test corpus of English documents, showing promising results.
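
The core of the pipeline described in the abstract (POS sequences, pairwise distances, a distance matrix, and a Gaussian-based selection over mean distances) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: sentences are taken to be already split and POS-tagged, plain Levenshtein distance stands in for the "modified sequence alignment algorithms" that the abstract does not specify, and the names `edit_distance`, `distance_matrix`, `suspicious_sentences` and the parameter `z_threshold` are hypothetical. Grouping of flagged sentences and the genetic optimization of thresholds are left out.

```python
from statistics import mean, pstdev


def edit_distance(a, b):
    """Plain Levenshtein distance between two tag sequences.

    Assumption: a stand-in for the paper's modified alignment algorithms.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distance_matrix(pos_sequences):
    """Symmetric matrix of length-normalized distances between all sentence pairs."""
    n = len(pos_sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            longest = max(len(pos_sequences[i]), len(pos_sequences[j]), 1)
            d = edit_distance(pos_sequences[i], pos_sequences[j]) / longest
            matrix[i][j] = matrix[j][i] = d
    return matrix


def suspicious_sentences(pos_sequences, z_threshold=1.5):
    """Indices of sentences whose mean distance to all others is unusually high.

    Assumption: "unusually high" means more than z_threshold standard
    deviations above the mean of a normal distribution fitted to the
    per-sentence mean distances.
    """
    n = len(pos_sequences)
    if n < 3:
        return []
    matrix = distance_matrix(pos_sequences)
    mean_dist = [sum(row) / (n - 1) for row in matrix]
    mu, sigma = mean(mean_dist), pstdev(mean_dist)
    if sigma == 0:
        return []
    return [i for i, m in enumerate(mean_dist) if (m - mu) / sigma > z_threshold]


# Hypothetical input: five sentences with the same structure and one outlier.
regular = ["DT", "NN", "VBZ", "JJ"]
outlier = ["IN", "PRP", "VBD", "RB", "VBN", "TO", "VB", "CC", "MD"]
flagged = suspicious_sentences([regular] * 5 + [outlier])
print(flagged)
```

In this toy input only the last sentence differs structurally from the rest, so it is the only candidate for being flagged. On real text the threshold would need tuning, which is the role the abstract assigns to genetic algorithms.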