Addressing Class Imbalance in Healthcare Data: Machine Learning Solutions for Age-Related Macular Degeneration and Preeclampsia

IF 1.3 | Zone 4 (Engineering and Technology) | JCR Q3 | COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Latin America Transactions Pub Date : 2024-10-04 DOI:10.1109/TLA.2024.10705995

Antonieta Martinez-Velasco; Lourdes Martínez-Villaseñor; Luis Miralles-Pechuán

IEEE Latin America Transactions, vol. 22, no. 10, pp. 806-820, 2024. Journal Article. Full text: https://ieeexplore.ieee.org/document/10705995/

Citations: 0

Abstract

The use of machine learning in healthcare has transformed the way diseases are diagnosed and treatments are optimized. However, medical databases often lack balanced data due to challenges in data collection caused by privacy regulations. Certain health conditions are underrepresented, which hampers machine learning performance. To address this problem, a hybrid approach is proposed that combines the Synthetic Minority Oversampling Technique (SMOTE) with undersampling and uses two specific techniques tailored for imbalanced datasets. Comparative evaluations were conducted with popular machine learning methods, using various thresholds to reduce one class and employing Balanced Accuracy to mitigate bias toward the majority class. The results showed that Balanced Bagging and Balanced Random Forest consistently outperformed other methods, performing the best with average rankings of 1.42 and 3.58 out of 32 configurations in the two datasets, respectively. Tree-based approaches such as Random Forest and Gradient Boosting demonstrated similar effectiveness, emphasizing the power of aggregating predictions from multiple trees to reduce bias. Notably, undersampling and SMOTE proved advantageous for non-tree-based models such as KNN, SVM, and Logistic Regression, showcasing their usefulness across different algorithms. This study provides a robust solution for handling imbalanced datasets in healthcare, which could potentially optimize healthcare interventions and improve patient outcomes and care.
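As an illustration of why the abstract relies on Balanced Accuracy rather than plain accuracy, the following minimal sketch computes both metrics on toy labels. The labels, the 95/5 class split, and the function names are assumptions made for this example and are not taken from the paper; Balanced Accuracy is computed here as the mean of per-class recall, which is its standard definition.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 95 majority-class (0) and 5 minority-class (1) cases.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
print(f"accuracy={plain:.2f}, balanced accuracy={balanced:.2f}")
# prints: accuracy=0.95, balanced accuracy=0.50
```

In this toy case the always-majority model looks strong under plain accuracy but scores no better than chance under Balanced Accuracy, which is the bias toward the majority class that the metric is meant to expose.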

Source journal

IEEE Latin America Transactions (COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 3.50
Self-citation rate: 7.70%
Articles published: 192
Review time: 3-8 weeks

Journal description: IEEE Latin America Transactions (IEEE LATAM) is an interdisciplinary journal focused on the dissemination of original, high-quality research papers and review articles in Spanish and Portuguese on emerging topics in three main areas: Computing, Electric Energy and Electronics. Sub-areas of the journal include, but are not limited to: automatic control, communications, instrumentation, artificial intelligence, power and industrial electronics, fault diagnosis and detection, transportation electrification, internet of things, electrical machines, circuits and systems, biomedicine and biomedical / haptic applications, secure communications, robotics, sensors and actuators, computer networks, and smart grids.