Optimizing machine learning algorithms for diabetes data: A metaheuristic approach to balancing and tuning classifiers parameters

Franklin Open, Volume 8, Article 100153. Pub Date: 2024-08-28. DOI: 10.1016/j.fraope.2024.100153

Hauwau Abdulrahman Aliyu, Ibrahim Olawale Muritala, Habeeb Bello-Salau, Salisu Mohammed, Adeiza James Onumanyi, Ore-Ofe Ajayi



Abstract

Diabetes mellitus is a global health concern, which has prompted the development of machine learning models that classify patients accurately and so support precise diagnosis and early-stage treatment. However, the effectiveness of such classifiers depends on the training datasets, which are often imbalanced, leading to biased classification and inaccurate diagnoses. Previous work using random sampling (under-sampling and oversampling) and the Synthetic Minority Oversampling Technique (SMOTE) has struggled to produce optimally balanced datasets. In addition, choosing the best parameters for machine learning classifiers remains a challenging task. To address these issues, this research devises a metaheuristic optimization methodology for diabetes data balancing and classifier hyperparameter tuning. It uses the Particle Swarm Optimization (PSO) algorithm to balance the diabetes data and a genetic algorithm to select the optimal architecture for various machine learning classifiers. The study evaluates the proposed metaheuristic data balancer and classifier architecture parameter tuner using four classification metrics: F1 score, Average Precision–Recall (APR), area under the receiver operating characteristic curve (AUC), and accuracy. The PSO-balanced dataset emerges as the most effective for classifying diabetes, with an Average Percentage Improvement (API) of 20.78% in accuracy, 16.79% in AUC, and 32.78% in APR. Moreover, the XGBoost classifier trained with a genetic algorithm requires the least training time on the Centers for Disease Control and Prevention (CDC) diabetes dataset compared with the artificial neural network and random forest classifiers. Notably, the imbalanced CDC diabetes dataset yields the lowest APR compared with the datasets balanced by random under-sampling and by PSO.
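As a rough illustration of two ideas named in the abstract, the Python sketch below shows the random under-sampling baseline and one plausible way to compute an average percentage improvement. It is not the authors' code. The function names, the toy data, and in particular the formula used for the improvement (the mean relative gain over paired scores) are assumptions, since the abstract does not define API or the baseline it is measured against.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


def average_percentage_improvement(baseline_scores, improved_scores):
    """Mean relative gain in percent over paired scores (assumed definition)."""
    gains = [
        100.0 * (new - old) / old
        for old, new in zip(baseline_scores, improved_scores)
    ]
    return sum(gains) / len(gains)


# Toy data, not the CDC dataset: 90 negatives and 10 positives.
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10
balanced_rows, balanced_labels = random_undersample(rows, labels, seed=42)
class_counts = Counter(balanced_labels)
```

After the call, `class_counts` holds ten samples per class. Under-sampling of this kind discards most of the majority class at random, which is one reason a search-based balancer such as PSO might be preferred when the discarded rows carry useful information.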

