Document Similarity Detection Using Indonesian Language Word2vec Model

Nahda Rosa Ramadhanti, Siti Mariyah
Published in: 2019 3rd International Conference on Informatics and Computational Sciences (ICICoS), 2019-10-01
DOI: 10.1109/ICICoS48119.2019.8982432
URL: https://doi.org/10.1109/ICICoS48119.2019.8982432
Citations: 4

Abstract

Most research on text duplication in Bahasa Indonesia uses the TF-IDF method. In this method, each word receives a different weight: the more frequently a word appears, the greater its weight. This study aims to detect document similarity by calculating the cosine similarity of word vectors. The corpus was built from a collection of Indonesian Wikipedia articles. The study proposes two techniques for calculating similarity: simultaneous comparison and partial comparison. Simultaneous comparison compares documents directly, without dividing them into chapters, while partial comparison divides each document into several chapters before calculating the similarity. The similarity results from partial comparison are more accurate than those from simultaneous comparison. The study uses the Unicheck application, which applies the TF-IDF method, as a benchmark. The similarity results from Unicheck differ from those of this study because the methods differ. The similarity scores obtained with TF-IDF are lower than those obtained with Word2vec because TF-IDF cannot detect paraphrases. This study has two limitations: the Unicheck application used as a benchmark does not use the same method as this study, and the determination of the expected value is still subjective.
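The two comparison techniques described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the toy three-dimensional vectors stand in for a Word2vec model trained on Indonesian Wikipedia, a document vector is taken as the average of its word vectors, "chapters" are approximated by equal-sized token chunks, and the partial score is the mean of the chunk-by-chunk cosine similarities. The function names are hypothetical.

```python
import math

# Hypothetical toy vectors standing in for a trained Indonesian Word2vec model.
TOY_VECTORS = {
    "mobil": [0.9, 0.1, 0.0],
    "kendaraan": [0.8, 0.2, 0.1],
    "cepat": [0.1, 0.9, 0.0],
    "laju": [0.2, 0.8, 0.1],
    "makan": [0.0, 0.1, 0.9],
    "nasi": [0.1, 0.0, 0.8],
}


def document_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def simultaneous_similarity(doc_a, doc_b, vectors):
    """Compare two whole documents without splitting them."""
    vec_a = document_vector(doc_a, vectors)
    vec_b = document_vector(doc_b, vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


def split_parts(tokens, n_parts):
    """Split a token list into up to n_parts chunks of near-equal size."""
    size = max(1, math.ceil(len(tokens) / n_parts))
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def partial_similarity(doc_a, doc_b, vectors, n_parts=2):
    """Assumed aggregation: mean similarity of corresponding chunks."""
    parts_a = split_parts(doc_a, n_parts)
    parts_b = split_parts(doc_b, n_parts)
    scores = [
        simultaneous_similarity(pa, pb, vectors)
        for pa, pb in zip(parts_a, parts_b)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# A document and a paraphrase that replaces two words with near-synonyms.
doc_a = ["mobil", "cepat", "makan", "nasi"]
doc_b = ["kendaraan", "laju", "makan", "nasi"]
print(simultaneous_similarity(doc_a, doc_b, TOY_VECTORS))
print(partial_similarity(doc_a, doc_b, TOY_VECTORS))
```

In this sketch the paraphrased pair still scores high because near-synonyms have nearby vectors, which is the property the abstract contrasts with TF-IDF. With a real model, the toy dictionary would be replaced by vectors loaded from the trained Word2vec model, and the chunking would follow the documents' actual chapter boundaries.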