Evaluating Methods for Imputing Race and Ethnicity in Electronic Health Record Data

IF 3.2, Zone 2 (Medicine), JCR Q2 (HEALTH CARE SCIENCES & SERVICES)

Health Services Research Pub Date : 2025-05-27 DOI:10.1111/1475-6773.14649

Sarah Conderino, Jasmin Divers, John A. Dodson, Lorna E. Thorpe, Mark G. Weiner, Samrachana Adhikari


Citations: 0

Abstract



Objective

To compare anonymized and non-anonymized approaches for imputing race and ethnicity in descriptive studies of chronic disease burden using electronic health record (EHR)-based datasets.

Study Setting and Design

In this New York City-based study, we first conducted simulation analyses under different missing data mechanisms to assess the performance of Bayesian Improved Surname Geocoding (BISG), single imputation using neighborhood majority information, random forest imputation, and multiple imputation with chained equations (MICE). Imputation performance was measured using sensitivity, precision, and overall accuracy; agreement with self-reported race and ethnicity was measured with Cohen's kappa (κ). We then applied these methods to impute race and ethnicity in two EHR-based data sources and compared chronic disease burden (95% CIs) by race and ethnicity across imputation approaches.
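As an illustration of the Bayesian update at the core of BISG, the sketch below multiplies a surname-based probability of each category by the probability of living in a given area conditional on that category, then normalizes. The abstract does not describe the authors' implementation; the function name `bisg_posterior`, the placeholder categories, and every probability here are hypothetical values chosen only to show the arithmetic.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    Both arguments are dicts keyed by category:
      p_race_given_surname[c] = P(category c | surname)
      p_geo_given_race[c]     = P(residence area | category c)
    Returns P(category c | surname, residence area) for every c.
    """
    unnormalized = {
        category: p_race_given_surname[category] * p_geo_given_race[category]
        for category in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no category has nonzero probability")
    return {category: value / total for category, value in unnormalized.items()}


# Hypothetical inputs (not real census or study figures).
surname_prior = {"group_a": 0.70, "group_b": 0.20, "group_c": 0.10}
geo_likelihood = {"group_a": 0.001, "group_b": 0.004, "group_c": 0.002}

posterior = bisg_posterior(surname_prior, geo_likelihood)

# A single classification takes the most probable category.
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

With these made-up inputs the surname alone favors `group_a`, but the geography term shifts the most probable category to `group_b`, which is the kind of correction the combined estimate is meant to provide.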

Data Sources and Analytic Sample

Our data sources included EHR data from NYU Langone Health and the INSIGHT Clinical Research Network from 3/6/2016 to 3/7/2020 extracted for a parent study on older adults in NYC with multiple chronic conditions.

Principal Findings

Under simulation analyses, the non-anonymized BISG imputation provided the most accurate classification of race and ethnicity, ranging from 66% to 73% across missing data mechanisms. Anonymized imputation methods were more sensitive to the missing data mechanism, with agreement dropping when race and ethnicity was missing not at random (MNAR) (κ_single = 0.25, κ_MICE = 0.25, κ_randomforest = 0.33). When these methods were applied to the NYU and INSIGHT cohorts, however, racial and ethnic distributions and chronic disease burden were consistent across all imputation methods. Slight improvements in the precision of estimates were observed under all imputation approaches compared to a complete case analysis.
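To make the agreement figures easier to interpret, here is a minimal sketch of how Cohen's kappa, sensitivity, and precision can be computed from paired self-reported and imputed labels. The helper names and the ten-record toy dataset are hypothetical and are not drawn from the study.

```python
from collections import Counter


def cohens_kappa(truth, predicted):
    """Chance-corrected agreement between two label sequences."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(truth)
    # Observed agreement: share of records where the two labels match.
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    # Expected agreement under independence, from the marginal counts.
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    expected = sum(truth_counts[c] * pred_counts[c] for c in truth_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when only one label is used
    return (observed - expected) / (1 - expected)


def sensitivity_precision(truth, predicted, label):
    """Per-category sensitivity (recall) and precision for one label."""
    true_pos = sum(t == label and p == label for t, p in zip(truth, predicted))
    actual = sum(t == label for t in truth)
    called = sum(p == label for p in predicted)
    sensitivity = true_pos / actual if actual else float("nan")
    precision = true_pos / called if called else float("nan")
    return sensitivity, precision


# Hypothetical toy data: self-reported vs. imputed category for 10 records.
self_reported = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
imputed = ["A", "A", "A", "B", "B", "B", "A", "C", "C", "A"]

kappa = cohens_kappa(self_reported, imputed)
sens_a, prec_a = sensitivity_precision(self_reported, imputed, "A")
print(round(kappa, 3), sens_a, prec_a)
```

In this toy example 7 of 10 labels match (70% accuracy), yet kappa is about 0.54 because 35% agreement would be expected by chance alone. This is why accuracy and kappa can tell different stories when category distributions are uneven.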

Conclusions

BISG imputation may provide a more accurate racial and ethnic classification than single or multiple imputation using anonymized covariates, particularly if the missing data mechanism is MNAR. Descriptive studies of disease burden may not be sensitive to methods for imputing missing data.


Source journal

Health Services Research (Medicine: Health Care)

CiteScore

4.80

Self-citation rate

5.90%

Articles published

193

Review time

4-8 weeks

About the journal: Health Services Research (HSR) is a peer-reviewed scholarly journal that provides researchers and public and private policymakers with the latest research findings, methods, and concepts related to the financing, organization, delivery, evaluation, and outcomes of health services. Rated as one of the top journals in the fields of health policy and services and health care administration, HSR publishes outstanding articles reporting the findings of original investigations that expand knowledge and understanding of the wide-ranging field of health care and that will help to improve the health of individuals and communities.