A comparative analysis of boosting algorithms for chronic liver disease prediction
Shahid Mohammad Ganie, Pijush Kanti Dutta Pramanik
Healthcare analytics (New York, N.Y.), Volume 5, Article 100313
Published: 2024-02-23 (Journal Article)
DOI: 10.1016/j.health.2024.100313
URL: https://www.sciencedirect.com/science/article/pii/S2772442524000157
Citations: 0
Abstract
Chronic liver disease (CLD) is a major health concern for millions of people worldwide. Early prediction and identification are critical for taking appropriate action at the earliest stages of the disease. Applying machine learning methods to CLD prediction can greatly improve medical outcomes, reduce the burden of the condition, and promote proactive and preventive healthcare for those at risk. However, traditional machine learning has limitations that can be mitigated through ensemble learning, of which boosting is the most advantageous approach. This study aims to improve the performance of available boosting techniques for CLD prediction. Seven popular boosting algorithms, namely Gradient Boosting (GB), AdaBoost, LogitBoost, SGBoost, XGBoost, LightGBM, and CatBoost, and two publicly available CLD datasets of dissimilar size and demography, the Liver disease patient dataset (LDPD) and the Indian liver disease patient dataset (ILPD), are considered. The features of the datasets are ascertained by exploratory data analysis. Additionally, hyperparameter tuning, normalisation, and upsampling are used for predictive analytics. The proportional importance of every feature contributing to CLD is assessed for each algorithm. Each algorithm's performance on both datasets is evaluated using k-fold cross-validation, twelve metrics, and runtime. Among the seven boosting algorithms, GB emerged as the best overall performer on both datasets, attaining accuracy rates of 98.80% on LDPD and 98.29% on ILPD. GB also outperformed the other boosting algorithms on all other performance metrics except runtime.
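To illustrate the core technique behind the algorithms benchmarked above, the following is a minimal pure-Python sketch of gradient boosting for binary classification: regression stumps are fitted to the negative gradient of the log loss at each round. This is an illustrative toy, not the paper's pipeline; the study used library implementations (e.g. XGBoost, LightGBM, CatBoost) with tuning, normalisation, and upsampling, and the function names and toy data here are assumptions.

```python
import math

def stump_fit(X, r):
    """Fit a one-feature, one-threshold regression stump to residuals r
    by minimising squared error over all candidate splits."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [r[i] for i, row in enumerate(X) if row[j] <= t]
            right = [r[i] for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((ri - lm) ** 2 for ri in left)
                   + sum((ri - rm) ** 2 for ri in right))
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    _, j, t, lm, rm = best
    return lambda row: lm if row[j] <= t else rm

def gb_fit(X, y, n_rounds=20, lr=0.5):
    """Gradient boosting with log loss for labels y in {0, 1}.
    Each round fits a stump to the pseudo-residuals y - sigmoid(F)."""
    p = sum(y) / len(y)
    f0 = math.log(p / (1 - p))        # initial raw score: log-odds
    F = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        probs = [1 / (1 + math.exp(-f)) for f in F]
        resid = [yi - pi for yi, pi in zip(y, probs)]  # negative gradient
        s = stump_fit(X, resid)
        stumps.append(s)
        F = [f + lr * s(row) for f, row in zip(F, X)]
    def predict(row):
        f = f0 + lr * sum(s(row) for s in stumps)
        return 1 if f > 0 else 0
    return predict

# Toy, linearly separable data (illustrative only)
X = [[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]
predict = gb_fit(X, y)
print(predict([2.5]), predict([8.5]))  # -> 0 1
```

Production boosters differ mainly in scale and detail: they grow full trees rather than stumps, add regularisation, and (in XGBoost's case) use second-order gradient information, but the additive fit-to-residuals loop is the same.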