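Comparative Analysis of Machine Learning Algorithms and Imbalanced Data Handling for Potable Water Quality Classification (original title: Analisis Komparatif Algoritme Machine Learning dan Penanganan Imbalanced Data pada Klasifikasi Kualitas Air Layak Minum)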

Generosa Lukhayu Pritalia
Journal: KONSTELASI: Konvergensi Teknologi dan Sistem Informasi
Published: 2022-04-21 (Journal Article)
DOI: 10.24002/konstelasi.v2i1.5630 (https://doi.org/10.24002/konstelasi.v2i1.5630)
Citations: 1

Abstract

Water is essential for survival. Water quality now needs to be monitored, assessed, and classified in order to understand the impact of industrialization. Classification has been carried out both with traditional methods, such as the Water Quality Index (WQI) and Storet, and with machine learning methods. In machine learning, imbalanced data can bias a model toward predicting the majority class. In addition, using all available features can degrade classification performance and increase computation time. To address these problems, this study resamples the data to balance the classes, selects the most suitable and most contributing features, and compares the performance of machine learning algorithms in classifying potable water. Handling the imbalanced data and applying feature selection improved algorithm performance; accuracy in particular rose by 24.8% relative to a previous study. The best overall performance came from Random Forest using the seven best features: 87% precision, 84% recall, a 16% miss rate, an 85% F-measure, and 85% test accuracy. However, the lowest miss rate, 15%, was obtained with the Decision Tree algorithm.
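
The abstract does not state which resampling technique or software the study used, so the following is only an illustrative sketch. It assumes plain random oversampling (duplicating minority-class rows until the classes are equal in size) and shows how the reported metrics relate to one another, in particular that the miss rate is the complement of recall (84% recall corresponds to a 16% miss rate). The function names `random_oversample` and `binary_metrics`, and the toy data, are hypothetical and not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, miss rate, F-measure and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        # Share of true positives that were missed; 1 - recall when positives exist.
        "miss_rate": fn / (tp + fn) if tp + fn else 0.0,
        "f_measure": f_measure,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Toy data (made up): four rows of class 0, two rows of class 1.
rows = [[7.0], [6.5], [8.1], [7.4], [9.0], [5.9]]
labels = [0, 0, 0, 0, 1, 1]
bal_rows, bal_labels = random_oversample(rows, labels)
print(Counter(bal_labels))

# Toy predictions (made up) to exercise the metric definitions.
metrics = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(metrics)
```

In practice, resampling is normally applied to the training split only, so that duplicated rows do not leak into the test set used to report accuracy.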