Fuzzy Semantic-Based String Similarity Experiments to Detect Plagiarism in Indonesian Documents

Chonan Firda Odayakana Umareta, Siti Mariyah
Published in: 2019 3rd International Conference on Informatics and Computational Sciences (ICICoS)
Publication date: 2019-10-01
DOI: 10.1109/ICICoS48119.2019.8982501
URL: https://doi.org/10.1109/ICICoS48119.2019.8982501
Citation count: 0

Abstract

Plagiarism is a concern in education. One way to counter it is to compare documents against one another. Because the number of documents is large, an extrinsic plagiarism detection framework is needed to carry out these comparisons at scale. There is also intelligent plagiarism, in which plagiarists try to hide their actions, for example by replacing words with semantically equivalent ones. This study therefore applies an extrinsic plagiarism detection system using a Fuzzy Semantic-Based String Similarity method, divided into three stages: Preprocessing, Heuristic Retrieval (HR), and Detailed Analysis (DA). The preprocessing stage removes irrelevant characters, splits the text into sentences, and performs stemming, tokenization, and stopword removal. The HR stage searches for candidate document pairs using fingerprints and Jaccard similarity. The DA stage applies fuzzy semantic-based similarity. The experiments compare the document similarity levels produced by Jaccard similarity in the HR stage with those produced by fuzzy semantic-based similarity in the DA stage, since both yield a document-level similarity score. The results show that fuzzy semantic-based similarity performs better than Jaccard similarity because it can detect semantic similarity in the form of synonyms.
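The contrast between the two measures can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes word bigrams as fingerprints and one commonly used fuzzy formulation in which a word pair scores 1.0 if identical, 0.5 if synonymous, and 0.0 otherwise. The stopword list, the synonym pair, and all function names are hypothetical toy values, and stemming is omitted.

```python
import re
from math import prod

# Hypothetical toy resources; a real system would load full Indonesian lists.
STOPWORDS = {"yang", "di", "dan", "itu", "ini"}
SYNONYMS = {frozenset({"pintar", "cerdas"})}


def tokenize(sentence):
    """Lowercase, strip non-letters, split on whitespace, drop stopwords."""
    words = re.sub(r"[^a-z\s]", " ", sentence.lower()).split()
    return [w for w in words if w not in STOPWORDS]


def fingerprints(tokens, k=2):
    """Set of word k-grams, used here as a simple fingerprint (assumption)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """|A n B| / |A u B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def word_sim(w1, w2):
    """1.0 for identical words, 0.5 for listed synonyms, else 0.0."""
    if w1 == w2:
        return 1.0
    return 0.5 if frozenset({w1, w2}) in SYNONYMS else 0.0


def fuzzy_sim(tokens_a, tokens_b):
    """Average fuzzy membership of each word of A in sentence B."""
    if not tokens_a or not tokens_b:
        return 0.0
    memberships = [
        1.0 - prod(1.0 - word_sim(w, v) for v in tokens_b) for w in tokens_a
    ]
    return sum(memberships) / len(memberships)


# The suspect sentence swaps "pintar" for its synonym "cerdas".
source = tokenize("Anak itu pintar dan rajin")
suspect = tokenize("Anak itu cerdas dan rajin")
hr_score = jaccard(fingerprints(source), fingerprints(suspect))
da_score = fuzzy_sim(suspect, source)
print(f"Jaccard: {hr_score:.3f}  fuzzy semantic: {da_score:.3f}")
```

In this toy pair the single synonym substitution breaks every shared bigram, so the Jaccard score falls to zero while the fuzzy score stays high, which is the kind of case the abstract describes.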