Random Forest and CatBoost with Handling Imbalanced Class for Detection of Risk Factors Anemia in Children (5-12 Years)

International Journal of Scientific Research in Science, Engineering and Technology. Pub Date: 2024-06-05. DOI: 10.32628/ijsrset24113134

Ditia Yosmita Praptiwi, Anang Kurnia, Anwar Fitrianto, Fitrah Ernawati



Abstract

Anemia in children aged 5-12 years remains a public health issue in Indonesia, and early detection and control of risk factors are crucial for prevention. Machine learning models, particularly ensemble learning models, offer a practical approach to this problem. However, health data are often affected by class imbalance. This study therefore performs classification modeling with two ensemble learning models, Random Forest (RF) and CatBoost, combined with six methods for handling class imbalance: Random Over Sampling, SMOTE, G-SMOTE, Random Under Sampling, Instance Hardness Threshold (IHT), and SMOTE-ENN. SHAP is then used to explain the best-performing model based on Shapley values. The findings indicate that CatBoost with G-SMOTE performs best among the combinations tested. Averaged over 100 validation replicates, the CatBoost G-SMOTE model achieves a sensitivity of 0.7104, a specificity of 0.7043, a G-Mean of 0.7067, and an AUC of 0.7844. G-SMOTE effectively increases sensitivity in both ensemble learning models, while SMOTE-ENN yields good G-Mean values for the Random Forest algorithm. Based on Shapley values, the features contributing most to the prediction of anemia in children (5-12 years) are ferritin, vitamin A, vegetable consumption, diagnosed pneumonia, zinc, total calcium, and consumption of soft or carbonated drinks.
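To make the evaluation metrics and the simplest of the resampling methods concrete, the following is a minimal sketch in plain Python. It is not the authors' code: the function names (`confusion_counts`, `sensitivity_specificity_gmean`, `random_over_sample`) and the toy labels are hypothetical and chosen only for illustration. It assumes a binary outcome where the positive class (anemia) is coded as 1, and it uses the usual definition of G-Mean as the geometric mean of sensitivity and specificity.

```python
import math
import random


def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false negatives, true negatives, false positives."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, G-Mean) for binary labels."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


def random_over_sample(rows, labels, seed=0):
    """Random Over Sampling: duplicate randomly chosen rows of each
    smaller class until every class matches the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy example (made-up labels, not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec, gmean = sensitivity_specificity_gmean(y_true, y_pred)
print(round(sens, 4), round(spec, 4), round(gmean, 4))  # 0.75 0.6667 0.7071

# Balance a 4-vs-1 toy training set; resampling is normally applied
# to the training split only, never to the validation data.
train_rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
train_labels = [0, 0, 0, 0, 1]
bal_rows, bal_labels = random_over_sample(train_rows, train_labels)
print(bal_labels.count(0), bal_labels.count(1))  # 4 4
```

G-Mean rewards a model only when both classes are classified well, which is why it is a common summary metric for imbalanced data. Note that a G-Mean averaged across replicates, as reported above, does not have to equal the square root of the product of the averaged sensitivity and specificity.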
