Design and Development of a Plagiarism Corpus in Thai for Plagiarism Detection

2019 11th International Conference on Knowledge and Systems Engineering (KSE). Pub Date: 2019-10-01. DOI: 10.1109/KSE.2019.8919436

Santipong Thaiprayoon, P. Palingoon, Kanokorn Trakultaweekoon



Abstract

One of the main problems in creating a Thai plagiarism corpus is that real cases of plagiarized documents are difficult to acquire because of copyright issues. To address this problem, we present the design and development of a Thai plagiarism corpus for evaluating and comparing plagiarism detection algorithms for Thai. The corpus is built with the simulated plagiarism method, using Thai Wikipedia articles and web page articles as source material. To support this method, we provide a Thai plagiarism annotation tool and a Thai plagiarism guideline that help human annotators plagiarize text passages. The corpus contains simulated cases of plagiarized documents covering four classes of Thai plagiarism and linguistic mechanisms: copy-based change, lexicon-based change, structure-based change, and semantic-based change. We show that the suspicious documents in the corpus are created manually using different obfuscation strategies, which makes them more realistic and challenging. We therefore believe that the corpus will be a valuable contribution to the development, comparison, and evaluation of plagiarism detection algorithms. Moreover, the corpus is free and publicly available for research purposes.
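As an illustration of how a corpus of this kind could be used to compare detectors, the following is a minimal sketch, not the authors' format or tooling. The abstract does not describe the record layout or the evaluation measure, so the `PlagiarismCase` fields, the document identifiers, and the character-level precision and recall computation are all assumptions made for the example. Only the four obfuscation class names come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Obfuscation(Enum):
    # The four classes named in the abstract.
    COPY = "copy-based change"
    LEXICON = "lexicon-based change"
    STRUCTURE = "structure-based change"
    SEMANTIC = "semantic-based change"


@dataclass(frozen=True)
class PlagiarismCase:
    # Hypothetical record layout; the corpus's real format may differ.
    suspicious_doc: str
    source_doc: str
    start: int  # character offset in the suspicious document (inclusive)
    end: int    # character offset in the suspicious document (exclusive)
    obfuscation: Obfuscation


def char_set(spans):
    """Expand (start, end) spans into the set of character offsets they cover."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def precision_recall(gold_cases, detected_spans):
    """Character-level precision and recall for one suspicious document."""
    gold = char_set((c.start, c.end) for c in gold_cases)
    detected = char_set(detected_spans)
    overlap = len(gold & detected)
    precision = overlap / len(detected) if detected else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Made-up annotations and detector output, for illustration only.
cases = [
    PlagiarismCase("susp-001", "src-017", 0, 100, Obfuscation.COPY),
    PlagiarismCase("susp-001", "src-042", 200, 300, Obfuscation.SEMANTIC),
]
detections = [(50, 150)]

precision, recall = precision_recall(cases, detections)
print(precision, recall)  # 0.5 0.25
```

Because each case carries its obfuscation class, the same scores could also be computed per class to see which kinds of change a detector handles worst.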

