ACCOUNTING FOR DEPENDENT ERRORS IN PREDICTORS AND TIME-TO-EVENT OUTCOMES USING ELECTRONIC HEALTH RECORDS, VALIDATION SAMPLES, AND MULTIPLE IMPUTATION.

Impact factor 1.4; Zone 4, Mathematics; JCR Q2, Statistics & Probability

Annals of Applied Statistics Pub Date : 2020-06-01 Epub Date: 2020-06-29 DOI:10.1214/20-aoas1343

Mark J Giganti, Pamela A Shaw, Guanhua Chen, Sally S Bebawy, Megan M Turner, Timothy R Sterling, Bryan E Shepherd

Annals of Applied Statistics, volume 14, issue 2, pages 1045-1061. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7523695/pdf/nihms-1619324.pdf

Abstract

Data from electronic health records (EHR) are prone to errors, which are often correlated across multiple variables. The error structure is further complicated when analysis variables are derived as functions of two or more error-prone variables. Such errors can substantially impact estimates, yet we are unaware of methods that simultaneously account for errors in covariates and time-to-event outcomes. Using EHR data from 4217 patients, the hazard ratio for an AIDS-defining event associated with a 100 cells/mm³ increase in CD4 count at ART initiation was 0.74 (95% CI: 0.68-0.80) using unvalidated data and 0.60 (95% CI: 0.53-0.68) using fully validated data. Our goal is to obtain unbiased and efficient estimates after validating a random subset of records. We propose fitting discrete failure time models to the validated subsample and then multiply imputing values for unvalidated records. We demonstrate how this approach simultaneously addresses dependent errors in predictors, time-to-event outcomes, and inclusion criteria. Using the fully validated dataset as a gold standard, we compare the mean squared error of our estimates with those from the unvalidated dataset and the corresponding subsample-only dataset for various subsample sizes. With reasonably sized validated subsamples and appropriate imputation models, our approach improved estimation over both the naive analysis and the analysis using only the validation subsample.
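
The abstract does not say how estimates from the multiply imputed datasets are combined. A minimal sketch, assuming the standard Rubin's rules are used to pool a log hazard ratio across imputations, might look like the following. The function names, the five per-imputation estimates, and their standard errors are hypothetical illustrations, not values from the paper, and the interval uses a normal quantile rather than the t reference distribution that Rubin's rules usually pair with.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and variances with Rubin's rules.

    Returns the pooled point estimate and the total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


def pooled_hazard_ratio(log_hrs, std_errs, z=1.96):
    """Pool on the log scale, then exponentiate to a hazard ratio and interval.

    Simplification: a normal quantile z is assumed instead of a t quantile.
    """
    q_bar, total = pool_rubin(log_hrs, [se ** 2 for se in std_errs])
    se = math.sqrt(total)
    return math.exp(q_bar), math.exp(q_bar - z * se), math.exp(q_bar + z * se)


# Hypothetical per-imputation results (NOT taken from the paper).
log_hrs = [-0.50, -0.52, -0.48, -0.51, -0.49]
std_errs = [0.06, 0.06, 0.06, 0.06, 0.06]

hr, lo, hi = pooled_hazard_ratio(log_hrs, std_errs)
print(f"pooled HR = {hr:.3f}, approx. 95% CI = ({lo:.3f}, {hi:.3f})")
```

Pooling on the log scale keeps the interval asymmetric around the hazard ratio after exponentiation, which matches how intervals such as 0.53-0.68 around 0.60 are typically reported.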

Source journal: Annals of Applied Statistics (Social Sciences: Statistics & Probability)

CiteScore: 3.10

Self-citation rate: 5.60%

Articles published: 131

Review time: 6-12 weeks

Journal description: Statistical research spans an enormous range, from direct subject-matter collaborations to pure mathematical theory. The Annals of Applied Statistics, the newest journal from the IMS, is aimed at papers in the applied half of this range. Published quarterly in both print and electronic form, the journal aims to provide a timely and unified forum for all areas of applied statistics.