ACCOUNTING FOR DEPENDENT ERRORS IN PREDICTORS AND TIME-TO-EVENT OUTCOMES USING ELECTRONIC HEALTH RECORDS, VALIDATION SAMPLES, AND MULTIPLE IMPUTATION
Mark J Giganti, Pamela A Shaw, Guanhua Chen, Sally S Bebawy, Megan M Turner, Timothy R Sterling, Bryan E Shepherd
Annals of Applied Statistics, 14(2), 1045-1061. Published 2020-06-01. DOI: https://doi.org/10.1214/20-aoas1343
Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7523695/pdf/nihms-1619324.pdf
Abstract
Data from electronic health records (EHR) are prone to errors, which are often correlated across multiple variables. The error structure is further complicated when analysis variables are derived as functions of two or more error-prone variables. Such errors can substantially impact estimates, yet we are unaware of methods that simultaneously account for errors in covariates and time-to-event outcomes. In EHR data from 4217 patients, the estimated hazard ratio for an AIDS-defining event associated with a 100 cells/mm³ increase in CD4 count at antiretroviral therapy (ART) initiation was 0.74 (95% CI: 0.68-0.80) using unvalidated data and 0.60 (95% CI: 0.53-0.68) using fully validated data. Our goal is to obtain unbiased and efficient estimates after validating a random subset of records. We propose fitting discrete failure time models to the validated subsample and then multiply imputing values for unvalidated records. We demonstrate how this approach simultaneously addresses dependent errors in predictors, time-to-event outcomes, and inclusion criteria. Using the fully validated dataset as a gold standard, we compare the mean squared error of our estimates with those from the unvalidated dataset and the corresponding subsample-only dataset for various subsample sizes. With reasonably sized validated subsamples and appropriate imputation models, our approach improved estimation over both the naive analysis and the analysis using only the validation subsample.
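The abstract does not say how estimates from the multiply imputed datasets are combined. A minimal sketch, assuming the conventional choice of Rubin's rules, is given below. The function name `pool_rubin` and all numbers are hypothetical illustrations, not values or code from the paper.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates with Rubin's rules.

    estimates: point estimates (e.g. log hazard ratios), one per imputed dataset
    variances: the corresponding squared standard errors
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)       # average of the per-imputation estimates
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical log hazard ratios from five imputed datasets (illustrative only).
log_hr = [-0.50, -0.52, -0.48, -0.51, -0.49]
se_squared = [0.004] * 5

pooled, total_var = pool_rubin(log_hr, se_squared)
hazard_ratio = math.exp(pooled)  # back-transform to the hazard ratio scale
```

The total variance adds a between-imputation term to the average within-imputation variance, so the pooled standard error reflects uncertainty about the imputed values as well as ordinary sampling variability.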
About the journal:
Statistical research spans an enormous range from direct subject-matter collaborations to pure mathematical theory. The Annals of Applied Statistics, the newest journal from the IMS, is aimed at papers in the applied half of this range. The journal is published quarterly in both print and electronic form, and its goal is to provide a timely and unified forum for all areas of applied statistics.