Classification of α-thalassemia data using machine learning models

Impact factor 4.9 | Zone 2, Medicine | JCR Q1: COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Frederik Christensen, Deniz Kenan Kılıç, Izabela Ewa Nielsen, Tarec Christoffer El-Galaly, Andreas Glenthøj, Jens Helby, Henrik Frederiksen, Sören Möller, Alexander Djupnes Fuglkjær
DOI: 10.1016/j.cmpb.2024.108581
Journal: Computer Methods and Programs in Biomedicine, volume 260, Article 108581
Published: 2025-01-06
Source: https://www.sciencedirect.com/science/article/pii/S0169260724005741
Citations: 0

Abstract

Background:

Around 7% of the global population has a congenital hemoglobin disorder, and over 300,000 new cases of α-thalassemia occur annually. In low-income regions, diagnosis is costly and often inaccurate, and it frequently relies on complete blood count (CBC) tests. This study uses machine learning (ML) to classify α-thalassemia traits from sex and CBC features, and explores the effect of grouping silent carriers with non-carriers.

Methods:

The dataset comprises 288 individuals from Sri Lanka with suspected α-thalassemia. The data were classified using eleven discriminant formulae and nine ML models. Outliers were removed using the Mahalanobis distance, and resampling was performed with the synthetic minority oversampling technique (SMOTE) and its nominal-continuous variant (SMOTE-NC). The Mann–Whitney U test was used for feature extraction and to inform class grouping. ML performance was evaluated against eight criteria.
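
The outlier-removal step named above can be illustrated with a minimal sketch. It assumes only two CBC features per person (MCV and RBC), a hypothetical toy sample, and a cutoff taken from the 97.5% chi-square quantile with two degrees of freedom. The abstract does not state which features, cutoff, or software the authors used, so every name and value below is an assumption.

```python
import math
from statistics import mean


def mahalanobis_2d(points):
    """Return the Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    # Entries of the sample covariance matrix (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = v^T * inverse(cov) * v, with the 2x2 inverse written out.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


def remove_outliers(points, cutoff):
    """Keep only the points whose Mahalanobis distance is at most `cutoff`."""
    return [p for p, d in zip(points, mahalanobis_2d(points)) if d <= cutoff]


# Hypothetical (MCV in fL, RBC in 10^6 cells/uL) pairs; the last one is an
# implausible record planted to show the filter at work.
samples = [
    (78.0, 5.4), (79.0, 5.2), (80.0, 5.3), (80.0, 5.0),
    (81.0, 5.1), (81.0, 4.9), (82.0, 5.0), (82.0, 5.2),
    (83.0, 4.8), (84.0, 4.9), (80.0, 5.1), (120.0, 2.0),
]

# For two features the chi-square quantile has a closed form: -2 * ln(1 - p).
# Assumed cutoff: square root of the 97.5% quantile.
cutoff = math.sqrt(-2.0 * math.log(1.0 - 0.975))

kept = remove_outliers(samples, cutoff)
print(len(samples), "records in,", len(kept), "records kept")
```

Because the mean and covariance are estimated from the same data, extreme records also distort the distances they are judged by, which is why robust estimators are sometimes preferred for this step.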

Results:

The Ehsani formula achieved an area under the receiver operating characteristic curve (ROC-AUC) of 0.66 when silent carriers and non-carriers were grouped. The convolutional neural network (CNN) without feature extraction performed better, with an accuracy of 0.85, sensitivity of 0.80, specificity of 0.86, and ROC-AUC of 0.95/0.93 (micro/macro). Performance was maintained even without preprocessing.
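
To make the reported metrics concrete, here is a minimal sketch that scores a discriminant formula against known labels. It assumes the commonly cited form of the Ehsani index, MCV − 10 × RBC, with values below 15 flagged as thalassemia trait. The abstract gives neither the formula nor the cutoff, and the eight toy records are hypothetical, not study data.

```python
def ehsani_index(mcv_fl, rbc_millions_per_ul):
    """Ehsani index in its commonly cited form (assumed here): MCV - 10 * RBC."""
    return mcv_fl - 10.0 * rbc_millions_per_ul


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from paired boolean labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # share of true carriers detected
        "specificity": tn / (tn + fp),  # share of non-carriers cleared
    }


# Hypothetical records: (MCV in fL, RBC in 10^6 cells/uL, is_carrier).
records = [
    (62.0, 5.9, True), (68.0, 5.6, True), (75.0, 5.2, True),
    (70.0, 5.8, False), (82.0, 4.6, False), (88.0, 4.4, False),
    (79.0, 4.9, False), (85.0, 4.7, False),
]

CUTOFF = 15.0  # assumed threshold; an index below it is read as carrier

y_true = [is_carrier for _, _, is_carrier in records]
y_pred = [ehsani_index(mcv, rbc) < CUTOFF for mcv, rbc, _ in records]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this toy data the formula misses one carrier and flags one non-carrier, giving an accuracy of 0.75, a sensitivity of 2/3, and a specificity of 0.80.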

Conclusion:

ML models outperformed classical discriminant formulae in classifying α-thalassemia from sex and CBC features. A larger dataset could improve ML model generalization and strengthen the effect of feature extraction. Grouping silent carriers with non-carriers improved ML results, especially with resampling. Silent carriers could not be separated from non-carriers using the available features.
Source journal
Computer Methods and Programs in Biomedicine (Engineering & Technology, Engineering: Biomedical)
CiteScore: 12.30
Self-citation rate: 6.60%
Articles published: 601
Review time: 135 days
About the journal: To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine. Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.