Frederik Christensen, Deniz Kenan Kılıç, Izabela Ewa Nielsen, Tarec Christoffer El-Galaly, Andreas Glenthøj, Jens Helby, Henrik Frederiksen, Sören Möller, Alexander Djupnes Fuglkjær
Computer Methods and Programs in Biomedicine, volume 260, Article 108581. Published 2025-01-06. DOI: 10.1016/j.cmpb.2024.108581
https://www.sciencedirect.com/science/article/pii/S0169260724005741
Classification of α-thalassemia data using machine learning models
Background:
Around 7% of the global population has congenital hemoglobin disorders, with over 300,000 new cases of α-thalassemia annually. Diagnosis is costly and inaccurate in low-income regions, often relying on complete blood count (CBC) tests. This study employs machine learning (ML) to classify α-thalassemia traits based on gender and CBC, exploring the effects of grouping silent- and non-carriers.
Methods:
The dataset includes 288 individuals with suspected α-thalassemia from Sri Lanka. It was classified using eleven discriminant formulae and nine ML models. Outliers were removed using Mahalanobis distance, and resampling was conducted with the synthetic minority oversampling technique (SMOTE) and SMOTE-nominal continuous (SMOTE-NC). The Mann–Whitney U test was used for feature extraction and class grouping. ML performance was evaluated with eight criteria.
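To make the role of the Mann–Whitney U test concrete, the following is a minimal sketch of the test for one feature and two groups. It is not the authors' implementation: the function name `mann_whitney_u`, the sample values, and the use of a normal approximation without tie or continuity correction are all assumptions made for illustration.

```python
import math


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value from a normal approximation).

    Assumption: no tie or continuity correction, so the p-value is only
    approximate, especially for small samples.
    """
    n1, n2 = len(x), len(y)
    # Count pairs where the x value exceeds the y value; a tie counts one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p


# Hypothetical MCV values (fL) for two groups; not data from the study.
silent_carriers = [78.0, 80.5, 82.1, 79.3, 81.0]
non_carriers = [79.0, 81.2, 80.0, 82.5, 78.4]

u_stat, p_value = mann_whitney_u(silent_carriers, non_carriers)
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```

Under this reading, a large p-value for a feature would suggest that the two groups are not distinguishable on that feature, which is one plausible basis for merging classes or dropping the feature. The thresholds and exact procedure used in the study are not given in the abstract.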
Results:
The Ehsani formula achieved an area under the receiver operating characteristic curve (ROC-AUC) of 0.66 when silent- and non-carriers were grouped. The convolutional neural network (CNN) without feature extraction performed better, with an accuracy of 0.85, sensitivity of 0.8, specificity of 0.86, and ROC-AUC of 0.95/0.93 (micro/macro). Performance was maintained even without preprocessing.
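As a reminder of how accuracy, sensitivity, and specificity relate to a confusion matrix in a multiclass setting, here is a minimal sketch using a one-vs-rest view with macro averaging. The function name `one_vs_rest_metrics` and the confusion matrix are hypothetical, and the abstract does not state how the reported sensitivity and specificity were averaged, so this is only one possible convention.

```python
def one_vs_rest_metrics(cm):
    """Compute accuracy plus macro-averaged sensitivity and specificity.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Assumes every class has at least one true sample and that no single
    class accounts for all samples, so no denominator is zero.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total

    sensitivities, specificities = [], []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(k)) - tp  # other classes predicted as c
        tn = total - tp - fn - fp
        sensitivities.append(tp / (tp + fn))
        specificities.append(tn / (tn + fp))

    macro_sensitivity = sum(sensitivities) / k
    macro_specificity = sum(specificities) / k
    return accuracy, macro_sensitivity, macro_specificity


# Hypothetical 3-class confusion matrix; not results from the study.
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

acc, sens, spec = one_vs_rest_metrics(confusion)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Macro averaging weights every class equally, whereas micro averaging pools the counts over classes first, which is why the two ROC-AUC values reported above can differ when classes are imbalanced.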
Conclusion:
ML models outperformed classical discriminant formulae in classifying α-thalassemia using sex and CBC features. A larger dataset could improve ML model generalization and the impact of feature extraction. Grouping silent- and non-carriers improved ML results, especially with resampling. The silent carriers were not separable from non-carriers using the available features.
About the journal:
To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine.
Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.