An optimized data analytics pipeline for improving healthcare diagnosis using ensemble learning

Q1 Medicine
Lomat Haider Chowdhury, Shaira Tabassum, Swakkhar Shatabda, Ashir Ahmed
Informatics in Medicine Unlocked, vol. 53, Article 101623. Published 2025-01-01. DOI: 10.1016/j.imu.2025.101623
https://www.sciencedirect.com/science/article/pii/S2352914825000115
Citations: 0

Abstract

Healthcare diagnosis is the process physicians follow before prescribing treatment to patients. Doctors may make an early prediction by observing physical signs and symptoms. Imposing a treatment without proper diagnosis cannot guarantee a cure and may sometimes leave the patient worse off. However, the cost of healthcare diagnosis discourages people from going through the process. Big data and machine learning already contribute to healthcare diagnosis, drawing on available data that is growing enormously through the digitalization of the system. Yet difficulty remains, since raw data contains noise, including missing values, outliers, and an imbalanced number of samples. These properties make it challenging to implement any diagnosis model. Missing values prevent a complete patient profile from being generated, which may affect the final prediction. Outliers in a medical dataset may represent extreme cases and rare conditions, or may result from data entry errors. An excessive number of outliers may lead to skewed and incorrect predictions. An imbalanced dataset makes it difficult to identify the minority classes correctly and tends to produce a model biased toward majority-class instances. A combination of advanced preprocessing techniques and reliable model selection is required to address these challenges effectively. This paper proposes a data analytics pipeline on a Portable Health Clinic (PHC) dataset. It systematically evaluates different preprocessing methods for missing value imputation, outlier detection, and data balancing, and offers a comprehensive preprocessing framework. Five state-of-the-art ensemble models for healthcare diagnosis were then implemented along with a proposed ensemble machine learning model, KNN-XGBoost-SVM-Random Forest (KNN-X-SVM-R). The proposed model achieved an accuracy of 97.03%, surpassing all the other state-of-the-art models. To further validate the model, we also tested it on a separate COVID-19 routine blood test dataset. In both cases, the proposed model achieved better results across different performance measures. Validating the approach on a secondary dataset strengthens the case for the robustness of the proposed methodology. The recommended preprocessing and modeling approach can be adopted to enhance diagnostic systems and improve patient outcomes.
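The three preprocessing stages and the ensemble step named in the abstract can be illustrated with a small standard-library sketch. This is not the authors' implementation: the abstract does not say which imputation, outlier detection, or balancing methods were selected, nor how the four base learners are combined. Median imputation, Tukey's IQR rule, random oversampling, and hard majority voting are stand-ins chosen here for illustration, and all function names and sample values are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def impute_median(column):
    """Replace None entries with the median of the observed values (assumed method)."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def iqr_outlier_flags(column, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; needs at least two values."""
    ordered = sorted(column)
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the lower half
    q3 = median(ordered[-half:])   # median of the upper half
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in column]


def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every class has equal size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def majority_vote(predictions):
    """Hard voting over one sample's per-model labels; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical readings with one missing value and one extreme value.
readings = [5.1, None, 5.4, 5.0, 19.8, 5.3]
filled = impute_median(readings)
flags = iqr_outlier_flags(filled)
print(flags)  # [False, False, False, False, True, False]

# Hypothetical labels from four base learners for a single patient.
print(majority_vote(["sick", "healthy", "sick", "sick"]))  # sick
```

In practice each stage would be swapped for whichever method the paper's evaluation favors, and the voting function would take predictions from trained KNN, XGBoost, SVM, and Random Forest models.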
Source journal
Informatics in Medicine Unlocked (Medicine, Health Informatics)
CiteScore: 9.50
Self-citation rate: 0.00%
Articles published: 282
Review time: 39 days
Journal description: Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmological, nursing and translational medicine informatics. The full papers that are published in the journal are accessible to all who visit the website.