An effective multi-step feature selection framework for clinical outcome prediction using electronic medical records.

IF 3.3 · Zone 3 (Medicine) · JCR Q2 (Medical Informatics)

BMC Medical Informatics and Decision Making Pub Date : 2025-02-17 DOI:10.1186/s12911-025-02922-y

Hongnian Wang, Mingyang Zhang, Liyi Mai, Xin Li, Abdelouahab Bellou, Lijuan Wu

Volume 25(1), p. 84. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11834488/pdf/


Abstract

Background: Identifying key variables is essential for developing clinical outcome prediction models based on high-dimensional electronic medical records (EMR). However, despite the abundance of feature selection (FS) methods available, challenges remain in choosing the most appropriate method, deciding how many top-ranked variables to include, and ensuring these selections are meaningful from a medical perspective.

Methods: We developed a practical multi-step FS framework that integrates data-driven statistical inference with a knowledge verification strategy. This framework was validated using two distinct EMR datasets targeting different clinical outcomes. The first cohort, sourced from the Medical Information Mart for Intensive Care III (MIMIC-III), focused on predicting acute kidney injury (AKI) in ICU patients. The second cohort, drawn from the MIMIC-IV Emergency Department (MIMIC-IV-ED), aimed to estimate in-hospital mortality (IHM) for patients transferred from the ED to the ICU. We employed various machine learning (ML) methods and conducted a comparative analysis considering accuracy, stability, similarity, and interpretability. The effectiveness of our FS framework was evaluated using discrimination and calibration metrics, with SHapley Additive exPlanations (SHAP) applied to enhance the interpretability of model decisions.

Results: Cohort 1 comprised 48,780 ICU encounters, of which 8,883 (18.21%) developed AKI. Cohort 2 included 29,197 transfers from the ED to the ICU, with 3,219 (11.03%) resulting in IHM. Among the ten ML methods evaluated, the tree-based ensemble method achieved the highest accuracy. As the number of top-ranking features increased, the models' accuracy began to stabilize, while feature subset stability (considering sample variations) and inter-method feature similarity reached optimal levels, confirming the validity of the FS framework. The integration of interpretative methods and expert knowledge in the final step further improved feature interpretability. The FS framework effectively reduced the number of features (e.g., from 380 to 35 for Cohort 1, and from 273 to 54 for Cohort 2) without significantly affecting prediction performance (DeLong test, p > 0.05).
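The abstract does not say which measures were used for feature subset stability or inter-method similarity. As an illustration only, the sketch below assumes the Jaccard index over top-k feature sets, a common choice for both. The feature names, rankings, and the value of `k` are hypothetical placeholders, not data from the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index between two feature collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(subsets):
    """Average Jaccard index over all unordered pairs of feature subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def top_k(ranking, k):
    """First k feature names from a ranking ordered best to worst."""
    return ranking[:k]


# Hypothetical rankings produced by ONE method on three resamples of the data
# (assumption: stability is measured across sample variations like this).
resample_rankings = [
    ["f1", "f2", "f3", "f4"],
    ["f1", "f3", "f2", "f5"],
    ["f2", "f1", "f4", "f3"],
]

# Hypothetical rankings produced by TWO different FS methods on the same data.
method_rankings = {
    "method_a": ["f1", "f2", "f3", "f4"],
    "method_b": ["f1", "f4", "f6", "f3"],
}

k = 3
stability = mean_pairwise_jaccard([top_k(r, k) for r in resample_rankings])
similarity = jaccard(
    top_k(method_rankings["method_a"], k),
    top_k(method_rankings["method_b"], k),
)
print(f"stability@{k} = {stability:.3f}, similarity@{k} = {similarity:.3f}")
```

Repeating the calculation over a range of `k` values gives stability and similarity curves that can be read alongside the accuracy curve.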

Conclusion: The multi-step FS method developed in this study successfully reduces the dimensionality of features in EMR while preserving the accuracy of clinical outcome prediction. Furthermore, it improves the interpretability of risk factors by incorporating expert knowledge validation.
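One simple way to turn "reduce features while preserving accuracy" into a rule is to pick the smallest number of top-ranked features whose score is within a tolerance of the best score. The sketch below is a heuristic stand-in, not the study's procedure; the study reports a DeLong test for the performance comparison. The function name, the tolerance, and the AUC values are all made up for illustration.

```python
def smallest_k_on_plateau(auc_by_k, tolerance=0.005):
    """Return the smallest k whose AUC is within `tolerance` of the best AUC.

    auc_by_k maps a candidate number of top-ranked features to the
    validation AUC obtained with that many features.
    """
    if not auc_by_k:
        raise ValueError("auc_by_k is empty")
    best = max(auc_by_k.values())
    return min(k for k, auc in auc_by_k.items() if best - auc <= tolerance)


# Made-up validation AUCs for increasing numbers of top-ranked features.
auc_by_k = {10: 0.70, 20: 0.75, 40: 0.78, 80: 0.781, 160: 0.782}

chosen_k = smallest_k_on_plateau(auc_by_k)
print(f"smallest k on the plateau: {chosen_k}")
```

A fixed tolerance is easy to reason about but ignores sampling variability, which is why a formal comparison of ROC curves is the more defensible check.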


Source journal

BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 297
Review time: 1 month

About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.