Performance Comparison of Similarity Measure Algorithm as Data Preprocessing Stage: Text Normalization in Bahasa

Achmad Yohni Wahyu Finansyah, Fnu Afiahayati, Vincent Michael Sutanto
Journal: Scientific Journal of Informatics
Publication date: 2022-05-31
DOI: https://doi.org/10.15294/sji.v9i1.30052
Citations: 1

Abstract

Purpose: Technological developments mean that more and more data are stored as text, which makes text data processing more difficult. This also causes problems for text preprocessing algorithms, one of which is that two texts that should be treated as identical are considered distinct by the algorithm. It is therefore necessary to normalize the text to obtain the standard form of words in a particular language. Spelling correction is often used to normalize text, but there has been little research on spelling correction algorithms for Bahasa Indonesia. A comparison is thus needed to find the most appropriate spelling correction algorithm for an effective normalization process.

Methods: In this study, we compared three algorithms: Levenshtein Distance, Jaro-Winkler Distance, and Smith-Waterman. The algorithms were evaluated on questionnaire data and tweet data, both in Bahasa Indonesia.

Result: Jaro-Winkler Distance achieved the fastest normalization time, averaging 31.01 seconds on the questionnaire data and 59.27 seconds on the tweet data. Levenshtein Distance achieved the best accuracy, with 44.90% on the questionnaire data and 60.04% on the tweet data.

Novelty: The novelty of this research is its comparison of similarity measure algorithms for Bahasa Indonesia, which identifies the most suitable similarity measure algorithm for the language.
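To make the comparison concrete, the following is a minimal Python sketch of the three similarity measures named in the abstract, together with a dictionary-lookup normalizer. It is not the authors' implementation. The scoring parameters for Smith-Waterman (`match`, `mismatch`, `gap`), the scaling of each measure to the range 0 to 1, the `threshold` value, the `normalize` helper, and the three-word lexicon are all assumptions made for illustration.

```python
from typing import Callable, Iterable


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity based on matching characters and transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro_similarity(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


def smith_waterman_similarity(a: str, b: str, match: int = 2,
                              mismatch: int = -1, gap: int = -1) -> float:
    """Best local alignment score, divided by the score of a perfect match
    over the longer string (this normalization is an assumption)."""
    if not a or not b:
        return 1.0 if a == b else 0.0
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best / (match * max(len(a), len(b)))


def normalize(word: str, lexicon: Iterable[str],
              similarity: Callable[[str, str], float],
              threshold: float = 0.7) -> str:
    """Return the most similar lexicon entry, or the word itself if no
    entry reaches the (hypothetical) similarity threshold."""
    best = max(lexicon, key=lambda cand: similarity(word, cand), default=word)
    return best if similarity(word, best) >= threshold else word


# Hypothetical three-word lexicon of standard Indonesian word forms.
lexicon = ["tidak", "makan", "sekolah"]
print(normalize("skolah", lexicon, levenshtein_similarity))  # sekolah
```

Because every measure shares the same `(str, str) -> float` signature, swapping the algorithm inside `normalize` is a one-argument change, which is the kind of setup needed to compare running time and accuracy across the three measures on the same data.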