MultiThal-classifier, a machine learning-based multi-class model for thalassemia diagnosis and classification
WenQiang Wang, RenQing Ye, BaoJia Tang, YuYing Qi
Clinica Chimica Acta, volume 567, Article 120025. Published 2024-11-07.
DOI: 10.1016/j.cca.2024.120025
https://www.sciencedirect.com/science/article/pii/S0009898124022782
Impact factor: 3.2. JCR: Q2 (Medical Laboratory Technology).
Citation count: 0
Abstract
Background
The differential diagnosis between iron deficiency anemia (IDA) and thalassemia trait (TT) remains a significant clinical challenge. This study aimed to develop a machine learning-based multi-class model to differentiate among Microcytic-TT (TT with low mean corpuscular volume), Normocytic-TT (TT with normal mean corpuscular volume), IDA, and healthy individuals.
Methods
A dataset of 1,819 individuals was analyzed using six machine learning algorithms. The eXtreme Gradient Boosting (XGBoost) algorithm was ultimately selected to construct the MultiThal-Classifier (M-THAL) model. SMOTENC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features) was employed to address class imbalance. Model performance was evaluated using various metrics, and SHAP values were applied to interpret the model's predictions. Additionally, external validation was conducted to assess the model's robustness and generalizability.
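To make the oversampling step concrete, the following is a minimal pure-Python sketch of how SMOTENC synthesizes minority-class rows: continuous features are interpolated between a sample and one of its nearest neighbors, and nominal features take the most common category among those neighbors. It is an illustration, not the authors' pipeline. The function name `smotenc_sketch`, the toy rows, the choice of sex as the nominal feature, and `k=3` are all assumptions made for this example.

```python
import random
import statistics
from collections import Counter


def smotenc_sketch(minority, nominal_idx, n_new, k=3, seed=0):
    """Create n_new synthetic rows from minority-class rows (SMOTENC-style).

    minority: list of rows; columns listed in nominal_idx hold categories,
    every other column is treated as a continuous number.
    """
    rng = random.Random(seed)
    n_cols = len(minority[0])
    cont_idx = [j for j in range(n_cols) if j not in nominal_idx]
    # Penalty for a nominal mismatch: median of the standard deviations
    # of the continuous features within the minority class.
    stds = [statistics.pstdev([r[j] for r in minority]) for j in cont_idx]
    med = statistics.median(stds)

    def dist(a, b):
        d = sum((a[j] - b[j]) ** 2 for j in cont_idx)
        d += sum(med ** 2 for j in nominal_idx if a[j] != b[j])
        return d ** 0.5

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [r for r in minority if r is not base]
        neighbors = sorted(others, key=lambda r: dist(base, r))[:k]
        partner = rng.choice(neighbors)
        gap = rng.random()  # interpolation weight in [0, 1)
        row = []
        for j in range(n_cols):
            if j in nominal_idx:
                # Nominal feature: most common category among the neighbors.
                votes = Counter(r[j] for r in neighbors)
                row.append(votes.most_common(1)[0][0])
            else:
                # Continuous feature: point on the segment base -> partner.
                row.append(base[j] + gap * (partner[j] - base[j]))
        synthetic.append(row)
    return synthetic


# Hypothetical minority-class rows: [MCV (fL), MCH (pg), sex]
minority = [
    [62.0, 19.5, "F"],
    [65.1, 20.2, "F"],
    [60.3, 18.9, "M"],
    [66.4, 21.0, "F"],
]
new_rows = smotenc_sketch(minority, nominal_idx={2}, n_new=5)
for r in new_rows:
    print(r)
```

Oversampling of this kind is normally applied to the training split only, so that synthetic rows never leak into the data used for evaluation.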
Results
After 1,000 bootstrap resamples of the test set, the mean performance metrics of M-THAL, with 95% confidence intervals (CI), were as follows: sensitivity 90.27% (95% CI: 84.88–95.26), specificity 97.87% (95% CI: 97.10–98.55), PPV 93.42% (95% CI: 89.34–96.48), NPV 97.82% (95% CI: 97.00–98.53), F1-score 91.50% (95% CI: 87.29–95.34), Youden's index 88.15% (95% CI: 82.33–93.70), accuracy 97.06% (95% CI: 96.06–97.99), and AUC 94.07% (95% CI: 91.17–96.84). Feature importance analysis identified mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), red cell distribution width-standard deviation (RDW-SD), and hemoglobin (HGB) as the most important features. External validation confirmed the model's robustness and generalizability.
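As an illustration of how such intervals can be obtained, the sketch below resamples a test set with replacement and reports the mean and percentile 95% CI of sensitivity, specificity, and Youden's index (sensitivity + specificity − 1). The abstract does not state how the four-class metrics were aggregated, so macro-averaged one-vs-rest metrics are an assumption here, as are the function names and the toy labels.

```python
import random

CLASSES = ["Microcytic-TT", "Normocytic-TT", "IDA", "Healthy"]


def macro_metrics(y_true, y_pred, classes=CLASSES):
    """Macro-averaged one-vs-rest sensitivity, specificity, Youden's index."""
    sens, spec = [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        # A class can be absent from a bootstrap resample; skip it then.
        if tp + fn:
            sens.append(tp / (tp + fn))
        if tn + fp:
            spec.append(tn / (tn + fp))
    s = sum(sens) / len(sens)
    sp = sum(spec) / len(spec)
    return {"sensitivity": s, "specificity": sp, "youden": s + sp - 1}


def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Return {metric: (mean, lower, upper)} using percentile intervals."""
    rng = random.Random(seed)
    n = len(y_true)
    samples = {"sensitivity": [], "specificity": [], "youden": []}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        m = macro_metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])
        for name, value in m.items():
            samples[name].append(value)
    result = {}
    for name, values in samples.items():
        values.sort()
        lower = values[int((alpha / 2) * n_boot)]
        upper = values[int((1 - alpha / 2) * n_boot) - 1]
        result[name] = (sum(values) / n_boot, lower, upper)
    return result


# Toy test set: 10 cases per class with three misclassifications.
y_true = [c for c in CLASSES for _ in range(10)]
y_pred = list(y_true)
y_pred[0] = "IDA"             # a Microcytic-TT case predicted as IDA
y_pred[12] = "Healthy"        # a Normocytic-TT case predicted as Healthy
y_pred[25] = "Microcytic-TT"  # an IDA case predicted as Microcytic-TT

point = macro_metrics(y_true, y_pred)
ci = bootstrap_ci(y_true, y_pred, n_boot=200)
for name, (mean, lower, upper) in ci.items():
    print(f"{name}: {mean:.4f} (95% CI: {lower:.4f}-{upper:.4f})")
```

Note that the reported point values are consistent with this definition of Youden's index: 90.27% + 97.87% − 100% ≈ 88.15%.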
Conclusion
M-THAL effectively distinguishes Normocytic-TT, Microcytic-TT, IDA, and healthy individuals using hematological parameters, offering a rapid and cost-effective screening tool that can be readily implemented in diverse healthcare settings.
About the journal:
The Official Journal of the International Federation of Clinical Chemistry and Laboratory Medicine (IFCC)
Clinica Chimica Acta is a high-quality journal which publishes original Research Communications in the field of clinical chemistry and laboratory medicine, defined as the diagnostic application of chemistry, biochemistry, immunochemistry, biochemical aspects of hematology, toxicology, and molecular biology to the study of human disease in body fluids and cells.
The objective of the journal is to publish novel information leading to a better understanding of biological mechanisms of human diseases, their prevention, diagnosis, and patient management. Reports of an applied clinical character are also welcome. Papers concerned with normal metabolic processes or with constituents of normal cells or body fluids, such as reports of experimental or clinical studies in animals, are only considered when they are clearly and directly relevant to human disease. Evaluations of commercial products have a low priority for publication, unless they are novel or represent a technological breakthrough. Studies dealing with effects of drugs and natural products and studies dealing with the redox status in various diseases are not within the journal's scope. Development and evaluation of novel analytical methodologies where applicable to diagnostic clinical chemistry and laboratory medicine, including point-of-care testing, and topics on laboratory management and informatics will also be considered. Studies focused on emerging diagnostic technologies and (big) data analysis procedures, including digitalization, mobile health, and artificial intelligence applied to laboratory medicine, are also of interest.