Subrata Saha, Prashant Sharma, Atul Kumar Jain, Bapi Dutta, Luis Martínez, Sarkaft Saleh, Tuphan Kanti Dolai, Anilava Kaviraj, Tanmay Sanyal, Izabela Nielsen, Reena Das
Title: Detection of β-Thalassemia trait from a heterogeneous population with red cell indices and parameters
Journal: Computers in Biology and Medicine, volume 192, Article 110151 (impact factor 7.0; JCR Q1, Biology)
Publication date: 2025-04-26
Publication type: Journal Article
DOI: 10.1016/j.compbiomed.2025.110151
URL: https://www.sciencedirect.com/science/article/pii/S0010482525005025
Citations: 0
Abstract
Background:
India is home to about 42 million people with β-thalassemia trait (βTT), which makes βTT screening necessary to curb the spread of the disease. Over the years, researchers have developed discrimination formulae based on red blood cell (RBC) parameters to distinguish β-thalassemia trait from iron deficiency anemia (IDA). However, screening programs often encounter normal subjects (NS) along with other hemoglobinopathy variants. Because the outcome of existing formulae is binary, they often group NS or variants such as Hemoglobin E (HbE) trait with either βTT or IDA. Therefore, it is necessary to segregate βTT, IDA, HbE, and NS in mixed population data for rational screening.
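The limitation of a binary formula can be illustrated with a minimal sketch. The Mentzer index (MCV divided by RBC count, with values below 13 conventionally suggesting thalassemia trait) is used here only because it is a widely known discrimination formula; the abstract does not name it, and the function names and sample values below are hypothetical.

```python
def mentzer_index(mcv_fl: float, rbc_millions_per_ul: float) -> float:
    """Mentzer index: MCV (fL) divided by RBC count (10^6 cells per microliter)."""
    if rbc_millions_per_ul <= 0:
        raise ValueError("RBC count must be positive")
    return mcv_fl / rbc_millions_per_ul


def classify_binary(mcv_fl: float, rbc_millions_per_ul: float,
                    cutoff: float = 13.0) -> str:
    """Return one of exactly two labels; no 'NS' or 'HbE' outcome is possible."""
    index = mentzer_index(mcv_fl, rbc_millions_per_ul)
    return "βTT" if index < cutoff else "IDA"


# Hypothetical red cell indices, chosen only for illustration.
samples = {
    "microcytic, high RBC count": (62.0, 5.8),
    "microcytic, low RBC count": (70.0, 4.0),
    "normal-range indices": (90.0, 4.8),
}
for description, (mcv, rbc) in samples.items():
    print(description, "->", classify_binary(mcv, rbc))
```

The third sample has normal-range indices, yet the formula still assigns it to the IDA side of the cutoff because it has only two possible outputs. This is the kind of misgrouping that motivates a multi-class approach.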
Methods:
A test data set of 2877 subjects (1226 NS, 425 HbE, 223 IDA, and 1003 βTT) was collected from the Postgraduate Institute of Medical Education and Research (PGIMER), Chandigarh, India and NRS Medical College and Hospital, Kolkata, India. First, we evaluated the performance of 25 discrimination formulae and four machine learning algorithms (MLAs): Multi-Layer Perceptron (MLP), Neighborhood Components Analysis (NCA), eXtreme Gradient Boosting Classifier (XGBC), and SKope-Rules (SKR), using seven performance measures. Based on these measures, we selected four discrimination formulae and two MLAs for further evaluation. SHapley Additive exPlanations (SHAP) were employed to explore the interpretability of outcomes. We generated four rules using the SKR algorithm to discriminate variants of hemoglobinopathies. Finally, a step-wise implementation scheme for screening is proposed.
Results:
The results demonstrate that no single formula ensures high performance on all the performance measures. When tested on a data set containing βTT and IDA samples, the best-performing formulae are SCS_βTT in terms of sensitivity (SE) and negative predictive value (NPV); Sirachainan in terms of specificity (SP) and positive predictive value (PPV); CRUISE in terms of Youden index (YI); and RF-4 in terms of Matthews correlation coefficient (MCC) and κ-coefficient. Among the MLAs, SKR performs best on SP, YI, and PPV, and XGBC on the remaining measures. When tested on a heterogeneous data set, the MCC and κ-coefficient of these four formulae decrease, whereas the performance of the two MLAs remains steady. The proposed scheme achieves around 97.33–97.62% accuracy when applied to two validation data sets collected from different sources.
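For readers unfamiliar with the measures named above, the following sketch computes all seven from a binary confusion matrix using their standard textbook definitions. It assumes the κ-coefficient is Cohen's kappa and that βTT is the positive class; the function name and the counts are hypothetical and are not taken from the study.

```python
import math


def binary_measures(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard screening measures from confusion-matrix counts."""
    actual_pos, actual_neg = tp + fn, tn + fp
    pred_pos, pred_neg = tp + fp, tn + fn
    if min(actual_pos, actual_neg, pred_pos, pred_neg) == 0:
        raise ValueError("every row and column of the matrix must be non-empty")
    n = actual_pos + actual_neg

    se = tp / actual_pos    # sensitivity
    sp = tn / actual_neg    # specificity
    ppv = tp / pred_pos     # positive predictive value
    npv = tn / pred_neg     # negative predictive value
    yi = se + sp - 1.0      # Youden index
    mcc = (tp * tn - fp * fn) / math.sqrt(
        pred_pos * actual_pos * actual_neg * pred_neg
    )
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_expected = (pred_pos * actual_pos + pred_neg * actual_neg) / (n * n)
    kappa = (p_observed - p_expected) / (1.0 - p_expected)

    return {"SE": se, "SP": sp, "PPV": ppv, "NPV": npv,
            "YI": yi, "MCC": mcc, "kappa": kappa}


# Hypothetical counts, for illustration only.
for name, value in binary_measures(tp=90, fp=20, tn=80, fn=10).items():
    print(f"{name}: {value:.3f}")
```

Because each measure weights the four cells differently, a formula tuned for high SE can still score poorly on SP or MCC, which is consistent with the finding that no single formula leads on every measure.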
Conclusion:
The performance of the XGBC and SKR algorithms in multi-class classification remains steady when segregating different variants of hemoglobinopathies. The developed rules may be helpful for pre-screening individuals and offer a possible solution for sustainable, cost-effective, and resource-saving screening in a mixed population with multiple variants.
About the journal:
Computers in Biology and Medicine is an international forum for sharing groundbreaking advancements in the use of computers in bioscience and medicine. This journal serves as a medium for communicating essential research, instruction, ideas, and information regarding the rapidly evolving field of computer applications in these domains. By encouraging the exchange of knowledge, we aim to facilitate progress and innovation in the utilization of computers in biology and medicine.