Classification and Explanation of Iron Deficiency Anemia from Complete Blood Count Data Using Machine Learning

BioMedInformatics. Pub Date: 2024-03-01. DOI: 10.3390/biomedinformatics4010036

Siddartha Pullakhandam, S. McRoy



Abstract

Background: Currently, discriminating Iron Deficiency Anemia (IDA) from other anemias requires an expensive test (serum ferritin). Complete Blood Count (CBC) tests are less costly and more widely available. Machine learning models have not yet been applied to discriminating IDA, but they perform well on similar tasks. Methods: We built multiple machine learning models to classify IDA from CBC data using a US NHANES dataset of over 19,000 instances, calculating accuracy, precision, recall, and precision-recall AUC (PR AUC). We validated the results by applying the same model to an unseen dataset from Kenya. We calculated ranked feature importance to explain the global behavior of the model. Results: Our model classifies IDA with a PR AUC of 0.87 and recall/sensitivity of 0.98 and 0.89 for the original dataset and the unseen Kenya dataset, respectively. The explanations indicate that a low blood level of hemoglobin, higher age, and higher Red Blood Cell distribution width were the most critical features. We also found that optimization made only minor changes to the explanations and that the features used remained consistent with professional practice. Conclusions: The overall high performance and consistency of the results suggest that the approach would be acceptable to health professionals and would support enhancements to current automated CBC analyzers.
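
The evaluation metrics named in the Methods can be illustrated with a minimal sketch. The labels and scores below are made-up toy values, not data from the study, and the function names are hypothetical. Average precision is used here as one common estimator of PR AUC; the abstract does not say which estimator the authors used.

```python
from typing import List, Tuple


def accuracy_precision_recall(
    y_true: List[int], y_pred: List[int]
) -> Tuple[float, float, float]:
    """Compute accuracy, precision, and recall for binary labels (1 = IDA)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def average_precision(y_true: List[int], scores: List[float]) -> float:
    """Average precision: mean of the precision values at each true positive,
    taken in order of decreasing score. Assumes no tied scores."""
    total_pos = sum(y_true)
    if total_pos == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / total_pos


# Toy example (hypothetical values): 1 = IDA, 0 = not IDA.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc, prec, rec = accuracy_precision_recall(y_true, y_pred)
ap = average_precision(y_true, scores)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} AP={ap:.3f}")
```

Recall (sensitivity) depends on the chosen threshold, while average precision summarizes performance across all thresholds, which is why the two are reported separately.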

