Penerapan Cosine Similarity dan Pembobotan TF-IDF untuk Mendeteksi Kemiripan Dokumen (Application of Cosine Similarity and TF-IDF Weighting to Detect Document Similarity)

Muhammad Zidny Naf’an, Auliya Burhanuddin, Ade Riyani
Jurnal Linguistik Komputasional (JLK). Published 2019-03-26. DOI: https://doi.org/10.26418/jlk.v2i1.17
Citations: 21

Abstract

Plagiarism is the act of taking part or all of another person's ideas, in the form of documents or text, without citing the source. This study aims to detect the similarity of text documents using the cosine similarity algorithm with TF-IDF weighting, so that the result can be used to estimate the degree of plagiarism. The documents compared are Indonesian-language abstracts. The results show that similarity values are on average 10% higher when stemming is applied than when it is not. Documents with a high degree of similarity produce similarity values above 50%, whereas documents with low similarity or no plagiarism produce values below 40%. Preprocessing consists of case folding, tokenizing, stopword removal, and stemming. After preprocessing, the TF-IDF weights are calculated and cosine similarity is applied to them, giving a similarity value expressed as a percentage. The experiments show that cosine similarity with TF-IDF weighting yields a similarity value for each compared document.
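
The pipeline described in the abstract (preprocessing, TF-IDF weighting, then cosine similarity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stopword list is a tiny made-up sample, the `stem` hook is a hypothetical placeholder that defaults to no stemming, and the smoothed IDF formula is an assumption, since the abstract does not state which TF-IDF variant was used.

```python
import math
import re
from collections import Counter

# Illustrative sample only (assumption); a real system needs a full Indonesian list.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "ini", "itu"}


def preprocess(text, stem=lambda token: token):
    """Case folding, tokenizing, stopword removal, then stemming.

    `stem` is a hypothetical hook; the default leaves tokens unchanged.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def tfidf_vectors(docs_tokens):
    """Build one sparse TF-IDF vector (dict term -> weight) per document.

    Assumes raw term counts and a smoothed IDF, log((1 + N) / (1 + df)) + 1,
    so that terms shared by every document keep a non-zero weight.
    """
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in docs_tokens
    ]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_percent(text_a, text_b, stem=lambda token: token):
    """Similarity of two texts as a percentage in the range 0 to 100."""
    tokens = [preprocess(text_a, stem), preprocess(text_b, stem)]
    vec_a, vec_b = tfidf_vectors(tokens)
    return 100.0 * cosine_similarity(vec_a, vec_b)
```

Passing a real Indonesian stemmer as `stem` would allow the with-stemming and without-stemming scores of the same document pair to be compared, which is the comparison the abstract reports.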