An optimized data analytics pipeline for improving healthcare diagnosis using ensemble learning
Authors: Lomat Haider Chowdhury, Shaira Tabassum, Swakkhar Shatabda, Ashir Ahmed
Journal: Informatics in Medicine Unlocked, volume 53, Article 101623 (published 2025-01-01)
DOI: 10.1016/j.imu.2025.101623
URL: https://www.sciencedirect.com/science/article/pii/S2352914825000115
Healthcare diagnosis is the process physicians follow before prescribing treatment to patients. Doctors may make an early prediction by observing physical signs and symptoms. Treating a patient without a proper diagnosis cannot guarantee a cure and may sometimes worsen the patient's condition. However, the cost of diagnosis discourages many people from undergoing it. Big data and machine learning already contribute to healthcare diagnosis, drawing on data that is growing rapidly as healthcare systems are digitized. Yet difficulties remain because raw data contains noise, including missing values, outliers, and an imbalanced number of samples per class. These properties make it challenging to build a diagnosis model. Missing values prevent a complete patient profile from being generated, which may affect the final prediction. Outliers in a medical dataset may represent extreme cases and rare conditions, or they may result from data entry errors. An excessive number of outliers may lead to skewed and incorrect predictions. An imbalanced dataset makes it difficult to identify minority classes correctly and tends to produce a model biased toward the majority class. A combination of advanced preprocessing techniques and reliable model selection is required to address these challenges effectively. This paper proposes a data analytics pipeline for a Portable Health Clinic (PHC) dataset. It systematically evaluates different preprocessing methods for missing value imputation, outlier detection, and data balancing, and offers a comprehensive preprocessing framework. Five state-of-the-art ensemble models for healthcare diagnosis were then implemented alongside a proposed ensemble machine learning model, KNN-XGBoost-SVM-Random Forest (KNN-X-SVM-R).
The proposed model achieved an accuracy of 97.03%, surpassing all the other state-of-the-art models. To reaffirm the validity of the model, we also tested it on a separate COVID-19 routine blood test dataset. In both cases, the proposed model achieved better results across different performance measures. Validating the approach on a secondary dataset strengthens confidence in the robustness of the proposed methodology. The recommended preprocessing and modeling approach can be adopted to enhance diagnostic systems and improve patient outcomes.
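The three preprocessing stages and the ensemble step named in the abstract can be illustrated with a small standard-library sketch. The abstract does not say which imputation, outlier, or balancing method the authors selected, nor how KNN-X-SVM-R combines its base models, so median imputation, IQR clipping, random oversampling, and hard majority voting below are all assumptions chosen for illustration. The toy data and function names are hypothetical.

```python
import random
import statistics
from collections import Counter

def impute_median(rows):
    """Replace None in each column with the median of that column's observed values."""
    medians = [statistics.median(v for v in col if v is not None)
               for col in zip(*rows)]
    return [[m if v is None else v for v, m in zip(row, medians)] for row in rows]

def clip_outliers_iqr(rows, k=1.5):
    """Clip each column to Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].
    Assumes the rows contain no missing values."""
    bounds = []
    for col in zip(*rows):
        q1, _, q3 = statistics.quantiles(col, n=4)
        iqr = q3 - q1
        bounds.append((q1 - k * iqr, q3 + k * iqr))
    return [[min(max(v, lo), hi) for v, (lo, hi) in zip(row, bounds)]
            for row in rows]

def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(list(rng.choice(pool)))
            out_labels.append(cls)
    return out_rows, out_labels

def majority_vote(predictions):
    """Hard voting over per-model label lists; ties go to the label seen first."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Hypothetical toy data: two numeric features, binary label, class 1 is rare.
rows = [[120, 5.1], [None, 5.4], [130, None], [125, 5.0], [400, 5.2], [118, 5.3]]
labels = [0, 0, 0, 0, 0, 1]

imputed = impute_median(rows)
clipped = clip_outliers_iqr(imputed)
bal_rows, bal_labels = oversample_minority(clipped, labels)

# Stand-in predictions from three base models on three samples.
combined = majority_vote([[0, 1, 1], [0, 1, 0], [1, 1, 0]])
print(Counter(bal_labels), combined)
```

In a real study, balancing would be applied to the training split only, so that duplicated samples never leak into the evaluation data, and the base models would be actual KNN, XGBoost, SVM, and Random Forest classifiers rather than hard-coded predictions.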
About the journal:
Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmological, nursing and translational medicine informatics. The full papers that are published in the journal are accessible to all who visit the website.