基于改进Rabin-Karp算法的最佳K-Gram值的选择

IJCCS Indonesian Journal of Computing and Cybernetics Systems Pub Date : 2022-01-31 DOI:10.22146/ijccs.63686

Wahyu Hidayat, Ema Utami, A. Sunyoto

{"title":"基于改进Rabin-Karp算法的最佳K-Gram值的选择","authors":"Wahyu Hidayat, Ema Utami, A. Sunyoto","doi":"10.22146/ijccs.63686","DOIUrl":null,"url":null,"abstract":"The Rabin-Karp algorithm is used to detect similarity using hashing techniques, from related studies modifications have been made in the hashing process but in previous studies have not been conducted research for the best k value in the K-Gram process. At the stage of stemming the Nazief & Adriani algorithm is used to transform the words into basic words. The researcher uses several variations of K-Gram values to determine the best K-Gram values. The analysis was performed using Ukara Enhanced public data obtained from the Kaggle with a total of 12215 data. The student essay answers data totaled to 258 data in the group A and 305 in the group B, every student essay answers data in each group will be compared with the answers of other fellow group member. Research results are the value of k = 3 has the best performance which has the highest some interpretations of 1-14% (Little degree of similarity) and 15-50% (Medium level of similarity) compared to values of k = 5, 7, and 9 which have the highest number of interpretation results 0%-0.99% (Document is different). However, if the students essay answers compared have 100% (Exactly the same) interpretations, the k value on K-Gram does not affect the results.","PeriodicalId":31625,"journal":{"name":"IJCCS Indonesian Journal of Computing and Cybernetics Systems","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Selection of the Best K-Gram Value on Modified Rabin-Karp Algorithm\",\"authors\":\"Wahyu Hidayat, Ema Utami, A. Sunyoto\",\"doi\":\"10.22146/ijccs.63686\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Rabin-Karp algorithm is used to detect similarity using hashing techniques, from related studies modifications have been made in the hashing process but in previous studies have not been conducted research for the best k value in the K-Gram process. At the stage of stemming the Nazief & Adriani algorithm is used to transform the words into basic words. The researcher uses several variations of K-Gram values to determine the best K-Gram values. The analysis was performed using Ukara Enhanced public data obtained from the Kaggle with a total of 12215 data. The student essay answers data totaled to 258 data in the group A and 305 in the group B, every student essay answers data in each group will be compared with the answers of other fellow group member. Research results are the value of k = 3 has the best performance which has the highest some interpretations of 1-14% (Little degree of similarity) and 15-50% (Medium level of similarity) compared to values of k = 5, 7, and 9 which have the highest number of interpretation results 0%-0.99% (Document is different). However, if the students essay answers compared have 100% (Exactly the same) interpretations, the k value on K-Gram does not affect the results.\",\"PeriodicalId\":31625,\"journal\":{\"name\":\"IJCCS Indonesian Journal of Computing and Cybernetics Systems\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-01-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IJCCS Indonesian Journal of Computing and Cybernetics Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.22146/ijccs.63686\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IJCCS Indonesian Journal of Computing and Cybernetics Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.22146/ijccs.63686","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

使用Rabin-Karp算法使用哈希技术检测相似性，相关研究对哈希过程进行了修改，但在以往的研究中没有对k - gram过程中的最佳k值进行研究。在词干提取阶段，使用Nazief & Adriani算法将单词转化为基本单词。研究人员使用K-Gram值的几种变化来确定最佳K-Gram值。分析使用了从Kaggle获得的Ukara Enhanced公共数据，共有12215个数据。A组学生作文答案数据为258个数据，B组为305个数据，每组的每个学生作文答案数据将与其他组员的答案进行比较。研究结果表明，k = 3的值表现最好，解释结果最多，为1-14%(相似度小)和15-50%(相似度中等)，而k = 5、7和9的值解释结果最多，为0%-0.99%(文献不同)。然而，如果学生的作文答案有100%(完全相同)的解释，k - gram上的k值不影响结果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Selection of the Best K-Gram Value on Modified Rabin-Karp Algorithm

The Rabin-Karp algorithm is used to detect similarity using hashing techniques, from related studies modifications have been made in the hashing process but in previous studies have not been conducted research for the best k value in the K-Gram process. At the stage of stemming the Nazief & Adriani algorithm is used to transform the words into basic words. The researcher uses several variations of K-Gram values to determine the best K-Gram values. The analysis was performed using Ukara Enhanced public data obtained from the Kaggle with a total of 12215 data. The student essay answers data totaled to 258 data in the group A and 305 in the group B, every student essay answers data in each group will be compared with the answers of other fellow group member. Research results are the value of k = 3 has the best performance which has the highest some interpretations of 1-14% (Little degree of similarity) and 15-50% (Medium level of similarity) compared to values of k = 5, 7, and 9 which have the highest number of interpretation results 0%-0.99% (Document is different). However, if the students essay answers compared have 100% (Exactly the same) interpretations, the k value on K-Gram does not affect the results.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IJCCS Indonesian Journal of Computing and Cybernetics Systems

自引率

0.00%

发文量

审稿时长

12 weeks