Enhancing automatic early arteriosclerosis prediction: an explainable machine learning evidence
Eka Miranda, Suko Adiarto
Clinical eHealth, volume 7, pages 153-163. Published 2024-12-01. Journal Article.
DOI: 10.1016/j.ceh.2024.12.003
https://www.sciencedirect.com/science/article/pii/S2588914124000169
Citation count: 0
Abstract
Objective
This paper proposed a machine learning (ML) model for the early prediction of arteriosclerotic heart disease (AHD) in patients. We also used model-agnostic ML approaches to identify and analyze the features that inform the model's predictions.
Methods
We used hematology data from an Electronic Health Record (EHR) covering erythrocytes, hematocrit, hemoglobin, mean corpuscular hemoglobin, mean corpuscular hemoglobin concentration, leukocytes, thrombocytes, age, and sex. We evaluated Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), Bagging Decision Tree (BDT), and Bagging Logistic Regression (BLR) classifiers for ML-based AHD detection. To handle imbalanced data and increase classifier accuracy, we used bagging and the Synthetic Minority Oversampling Technique (SMOTE). We then used the SHapley Additive exPlanations (SHAP) framework to explain the ML model and quantify each feature's contribution to its predictions.
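The oversampling step named above can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority-class neighbours. This is not the authors' code. The function name `smote`, its parameters, and the two-feature sample values are assumptions made for illustration, and a real pipeline would more likely rely on an established library implementation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority-class neighbours (Euclidean)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the chosen one.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # New point lies at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical (hemoglobin, hematocrit) pairs for the minority class.
minority = [(11.2, 34.0), (12.0, 36.5), (10.8, 33.1), (11.6, 35.2)]
new_points = smote(minority, n_new=6, k=2)
balanced_minority = minority + new_points
print(len(balanced_minority), "minority samples after oversampling")
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the range spanned by the observed minority data. Oversampling of this kind is normally applied to the training split only, so that evaluation data contains no synthetic records.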
Results
RF trained on SMOTE-balanced data outperformed the other models on nearly all performance measures, achieving accuracy, precision, recall, F1-score, and ROC AUC of 82.12%, 81.31%, 83.37%, 82.57%, and 89%, respectively. According to the SHAP summary bar plot of global feature importance, hemoglobin was the most important attribute for detecting and predicting AHD patients. For local interpretability, a force plot illustrated how the features drove the prediction for a single observation, along with the magnitude of each feature's SHAP value. On a case-by-case basis, hemoglobin, erythrocytes, hematocrit, hermch (mean corpuscular hemoglobin), khermchc (mean corpuscular hemoglobin concentration), leukocytes, thrombocytes, and age all contributed positively to the prediction of class 1 (AHD patients), whereas gender had a negative impact. For class 0 (patients with no AHD), thrombocytes, hematocrit, and gender contributed positively, whereas leukocytes, erythrocytes, hemoglobin, and khermchc contributed negatively.
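For readers unfamiliar with the performance measures reported above, the following sketch shows how accuracy, precision, recall, F1-score, and ROC AUC can be computed from labels and predicted scores. The labels, scores, the 0.5 threshold, and the function names are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores, positive=1):
    """ROC AUC as the share of (positive, negative) pairs in which the
    positive case gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = AHD, 0 = no AHD) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.6, 0.75]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen decision threshold, while ROC AUC summarizes ranking quality across all thresholds, which is why the two kinds of figures are reported side by side.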
Conclusion
Explainable ML paved the way for early AHD prediction by examining black-box ML models to determine how each feature contributed to the final prediction.
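The idea of attributing a prediction to individual features can be sketched with an exact Shapley value computation on a toy model. This is a simplified illustration, not the SHAP library and not the study's model: "absent" features are replaced by a single baseline value, whereas SHAP typically averages over a background dataset. The function `toy_model`, its coefficients, and the patient and baseline values are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for predict at x. Features outside a
    coalition take their baseline value (a simplifying assumption)."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical risk score; the coefficients are made up."""
    hemoglobin, leukocytes, age = z
    return 0.2 + 0.05 * (15.0 - hemoglobin) + 0.02 * leukocytes + 0.004 * age


patient = [11.0, 9.0, 62.0]    # hypothetical observation
baseline = [14.0, 7.0, 50.0]   # hypothetical reference values

phi = shapley_values(toy_model, patient, baseline)
for name, contribution in zip(["hemoglobin", "leukocytes", "age"], phi):
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the model output for the observation and the output at the baseline. A force plot visualizes exactly this decomposition: positive contributions push the prediction above the base value and negative ones pull it below.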