Incorporating informatively collected laboratory data from EHR in clinical prediction models.

IF 3.3; Zone 3 (Medicine); JCR Q2, Medical Informatics
Minghui Sun, Matthew M Engelhard, Armando D Bedoya, Benjamin A Goldstein
Journal: BMC Medical Informatics and Decision Making. Published: 2024-07-24. DOI: https://doi.org/10.1186/s12911-024-02612-1. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11270887/pdf/
Citations: 0

Abstract

Incorporating informatively collected laboratory data from EHR in clinical prediction models.

Background: Electronic Health Records (EHR) are widely used to develop clinical prediction models (CPMs). One challenge is that EHR data often contain informative missingness. For example, a laboratory test is typically ordered only when a clinician is concerned enough to see a need for it. When data are so-called Not Missing at Random (NMAR), analytic strategies that assume other missingness mechanisms are inappropriate. In this work, we compare the impact of different strategies for handling missing data on CPM performance.
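To make the idea of informative missingness concrete, the following is a minimal simulation sketch and not anything taken from the study. It assumes a hypothetical latent `severity` score that drives both the chance that a lab is ordered and the chance of deterioration; all probabilities are invented for illustration.

```python
import random

def simulate(n=20000, seed=0):
    """Simulate encounters in which sicker patients are more likely to have a lab drawn."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severity = rng.random()  # hypothetical latent illness severity in [0, 1)
        # Assumption: the clinician orders the lab more often when the patient looks sicker.
        measured = rng.random() < 0.1 + 0.8 * severity
        # Assumption: the outcome is rare and also more likely with higher severity.
        outcome = rng.random() < 0.15 * severity
        rows.append((measured, outcome))
    return rows

def outcome_rate(rows, measured_flag):
    """Outcome rate among encounters whose lab was (or was not) measured."""
    outcomes = [o for m, o in rows if m == measured_flag]
    return sum(outcomes) / len(outcomes)

rows = simulate()
rate_measured = outcome_rate(rows, True)
rate_missing = outcome_rate(rows, False)
print(f"outcome rate when lab measured: {rate_measured:.3f}")
print(f"outcome rate when lab missing:  {rate_missing:.3f}")
```

Under these assumptions the outcome is more common when the lab was measured, so the fact that a value is missing carries predictive signal of its own. A strategy that discards that fact throws the signal away.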

Methods: We considered a predictive model for rapid inpatient deterioration as an exemplar implementation. This model incorporated twelve laboratory measures with varying levels of missingness: five labs had missingness rates around 50%, and the other seven had missingness rates around 90%. We included them based on the belief that their missingness status can be highly informative for the prediction. We explicitly compared several missing data strategies: mean imputation, normal-value imputation, conditional imputation, categorical encoding, and missingness embeddings. Some of these were also combined with last observation carried forward (LOCF). We implemented logistic LASSO regression, multilayer perceptron (MLP), and long short-term memory (LSTM) models as the downstream classifiers. We compared AUROC on the test data and used bootstrapping to construct 95% confidence intervals.
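Several of the strategies named above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function names, the `None` marker for an unmeasured lab, and the cut-offs are all assumptions. Conditional imputation and missingness embeddings are omitted because the abstract does not describe them in enough detail.

```python
MISSING = None  # assumed marker for "lab not measured in this time window"

def mean_impute(values):
    """Replace missing entries with the mean of the observed entries.
    In practice the mean would come from the training data only."""
    observed = [v for v in values if v is not MISSING]
    mean = sum(observed) / len(observed)
    return [mean if v is MISSING else v for v in values]

def normal_value_impute(values, normal_value):
    """Replace missing entries with a clinically 'normal' value supplied by the caller."""
    return [normal_value if v is MISSING else v for v in values]

def locf(values):
    """Last observation carried forward; gaps before the first observation stay missing."""
    out, last = [], MISSING
    for v in values:
        if v is not MISSING:
            last = v
        out.append(last)
    return out

def categorical_encode(values, low, high):
    """Bin each entry, keeping 'missing' as its own category."""
    def label(v):
        if v is MISSING:
            return "missing"
        if v < low:
            return "low"
        if v > high:
            return "high"
        return "normal"
    return [label(v) for v in values]

# One hypothetical lab over five time windows of a single encounter.
lab = [None, 1.0, None, 3.0, None]
mean_filled = mean_impute(lab)
normal_filled = normal_value_impute(lab, normal_value=1.0)
carried = locf(lab)
normal_with_locf = normal_value_impute(locf(lab), normal_value=1.0)
categories = categorical_encode(lab, low=0.5, high=2.0)
```

Combining LOCF with normal-value imputation, as in `normal_with_locf`, first carries forward any earlier measurement and falls back to the normal value only where nothing has been observed yet. The numeric values and thresholds here are placeholders, not clinical reference ranges.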

Results: We had 105,198 inpatient encounters, of which 4.7% experienced the deterioration outcome of interest. LSTM models generally outperformed the cross-sectional models, with embedding approaches and categorical encoding yielding the best results. For the cross-sectional models, normal-value imputation with LOCF generated the best results.
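The comparisons reported here rest on test-set AUROC with bootstrap 95% confidence intervals, as stated in the Methods. A minimal sketch of that evaluation follows. It assumes a percentile bootstrap over encounters, which the abstract does not specify, and the labels and scores are made-up toy data.

```python
import random

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUROC needs both classes in the resample
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy test set: 1 = deterioration, 0 = no deterioration (invented values).
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.15, 0.9]
point = auroc(labels, scores)
lower, upper = bootstrap_ci(labels, scores)
print(f"AUROC = {point:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The pairwise AUROC here is quadratic in the number of cases, which is fine for a toy example. On a data set of the size reported above, a rank-based implementation from an established library would be the practical choice.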

Conclusion: Strategies that accounted for the possibility of NMAR missing data yielded better model performance than those that did not. The embedding method had the advantage of not requiring prior clinical knowledge. Using LOCF could enhance the performance of cross-sectional models but was counterproductive in LSTM models.

Source journal: CiteScore 7.20; self-citation rate 5.70%; articles published 297; review time 1 month
About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.