Comparative performance analysis of Boruta, SHAP, and Borutashap for disease diagnosis: A study with multiple machine learning algorithms.

IF 1.6 | Zone 3, Computer Science | JCR Q4: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Network-Computation in Neural Systems Pub Date : 2025-08-01 Epub Date: 2024-03-21 DOI:10.1080/0954898X.2024.2331506

Chukwuebuka Joseph Ejiyi, Zhen Qin, Chiagoziem Chima Ukwuoma, Grace Ugochi Nneji, Happy Nkanta Monday, Makuachukwu Bennedith Ejiyi, Thomas Ugochukwu Ejiyi, Uchenna Okechukwu, Olusola O Bamisile

Publication type: Journal Article. Pages: 507-544.

Citations: 0

Abstract

Interpretable machine learning models are instrumental in disease diagnosis and clinical decision-making because they reveal which features are relevant. In this study, Boruta, SHAP (SHapley Additive exPlanations), and BorutaShap were employed for feature selection, each contributing to the identification of crucial features. The selected features were then used to train six machine learning algorithms (LR, SVM, ETC, AdaBoost, RF, and LGBM) on diverse medical datasets obtained from public sources after rigorous preprocessing. The performance of each feature selection technique was evaluated across the ML models in terms of accuracy, precision, recall, and F1-score. Among the three techniques, SHAP showed the best performance, achieving average accuracies of 80.17%, 85.13%, 90.00%, and 99.55% on the diabetes, cardiovascular, statlog, and thyroid disease datasets, respectively. LGBM emerged as the most effective algorithm, with an average accuracy of 91.00% for most disease states. Moreover, SHAP enhanced the interpretability of the models, providing valuable insights into the underlying mechanisms driving disease diagnosis. This comprehensive study contributes insights into feature selection techniques and machine learning algorithms for disease diagnosis, benefiting researchers and practitioners in the medical field. Further exploration of feature selection methods and algorithms holds promise for advancing disease diagnosis methodologies, paving the way for more accurate and interpretable diagnostic models.
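As a rough illustration of the Shapley-value idea that SHAP-based feature selection builds on, the sketch below computes exact Shapley attributions for a tiny linear model and keeps the features whose attribution is non-negligible. It is not the authors' pipeline and does not use the `shap` library: the model, its weights, the instance `x`, the `baseline`, and the `threshold` are all made-up assumptions chosen so the result is easy to check by hand.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i when added to the coalition
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


# Hypothetical toy model: f(x) = 2*x0 + 0*x1 - 1*x2 (weights are made up).
weights = [2.0, 0.0, -1.0]
x = [3.0, 5.0, 4.0]          # instance to explain (made up)
baseline = [1.0, 1.0, 2.0]   # reference values, e.g. feature means (made up)


def model(features):
    return sum(w * v for w, v in zip(weights, features))


def value_fn(coalition):
    # Features in the coalition take their real value; the rest fall back
    # to the baseline.
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(mixed)


phi = shapley_values(value_fn, len(x))

# Keep features whose absolute attribution clears a (hypothetical) threshold.
threshold = 1e-9
selected = [j for j, p in enumerate(phi) if abs(p) > threshold]
print(phi, selected)
```

For a linear model each attribution reduces to weight times the difference between the feature value and its baseline, so the zero-weight feature receives no attribution and is dropped. The exact enumeration grows exponentially with the number of features, which is why practical tools rely on approximations or model-specific algorithms; in a real study the attributions would also be averaged in absolute value over many samples before ranking features.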


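Source journal

Network-Computation in Neural Systems (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 3.70

Self-citation rate: 1.30%

Articles published:

Review time: >12 weeks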

About the journal: Network: Computation in Neural Systems welcomes submissions of research papers that integrate theoretical neuroscience with experimental data, emphasizing the utilization of cutting-edge technologies. We invite authors and researchers to contribute their work in the following areas: Theoretical Neuroscience: This section encompasses neural network modeling approaches that elucidate brain function. Neural Networks in Data Analysis and Pattern Recognition: We encourage submissions exploring the use of neural networks for data analysis and pattern recognition, including but not limited to image analysis and speech processing applications. Neural Networks in Control Systems: This category encompasses the utilization of neural networks in control systems, including robotics, state estimation, fault detection, and diagnosis. Analysis of Neurophysiological Data: We invite submissions focusing on the analysis of neurophysiology data obtained from experimental studies involving animals. Analysis of Experimental Data on the Human Brain: This section includes papers analyzing experimental data from studies on the human brain, utilizing imaging techniques such as MRI, fMRI, EEG, and PET. Neurobiological Foundations of Consciousness: We encourage submissions exploring the neural bases of consciousness in the brain and its simulation in machines.