Design and Development of a Plagiarism Corpus in Thai for Plagiarism Detection

Santipong Thaiprayoon, P. Palingoon, Kanokorn Trakultaweekoon
Published in: 2019 11th International Conference on Knowledge and Systems Engineering (KSE)
Publication date: 2019-10-01
DOI: 10.1109/KSE.2019.8919436 (https://doi.org/10.1109/KSE.2019.8919436)
Citations: 1

Abstract

A main obstacle to building a plagiarism corpus for Thai is that real plagiarized documents are difficult to acquire because of copyright restrictions. To address this, we present the design and development of a Thai plagiarism corpus for evaluating and comparing plagiarism detection algorithms for Thai. The corpus is built with the simulated plagiarism method, using Thai Wikipedia articles and web page articles as sources. To support this method, we provide a Thai plagiarism annotation tool and a Thai plagiarism guideline that help human annotators plagiarize text passages. The corpus contains simulated cases of plagiarized documents covering four classes of Thai plagiarism and linguistic mechanisms: copy-based change, lexicon-based change, structure-based change, and semantic-based change. The suspicious documents are created manually with different obfuscation strategies, which makes them more realistic and challenging. We believe the corpus will be a valuable contribution to the development, comparison, and evaluation of plagiarism detection algorithms. The corpus is free and publicly available for research purposes.
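
As a rough illustration of how a corpus organized around the four classes above might be handled in code, the following Python sketch models one simulated case and tallies cases per class. The record layout (`PlagiarismCase` and its field names) and the helper `count_by_class` are hypothetical and are not taken from the paper or from the released corpus; only the four class names come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Obfuscation(Enum):
    # The four classes named in the abstract.
    COPY = "copy-based change"
    LEXICON = "lexicon-based change"
    STRUCTURE = "structure-based change"
    SEMANTIC = "semantic-based change"


@dataclass(frozen=True)
class PlagiarismCase:
    # Hypothetical record layout; the actual corpus format is not described here.
    source_doc: str         # identifier of the source article
    suspicious_doc: str     # identifier of the manually written suspicious document
    source_passage: str     # original passage
    rewritten_passage: str  # passage after the annotator's obfuscation
    obfuscation: Obfuscation


def count_by_class(cases):
    """Return the number of cases in each obfuscation class, including empty classes."""
    counts = Counter(case.obfuscation for case in cases)
    return {cls: counts.get(cls, 0) for cls in Obfuscation}
```

A per-class tally like this is one simple way to check whether a corpus is balanced across obfuscation strategies before using it to compare detection algorithms.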