Addressing Selection Biases within Electronic Health Record Data for Estimation of Diabetes Prevalence among New York City Young Adults: A Cross-Sectional Study.

Sarah Conderino, Lorna E Thorpe, Jasmin Divers, Sandra S Albrecht, Shannon M Farley, David C Lee, Rebecca Anthopolos
Journal: BMJ Public Health, 2024;2(2). Publication type: Journal Article. Publication date: 2024-01-01.
DOI: 10.1136/bmjph-2024-001666 (https://doi.org/10.1136/bmjph-2024-001666)
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11578099/pdf/

Abstract

Introduction: There is growing interest in using electronic health records (EHRs) for chronic disease surveillance. However, these data are convenience samples of in-care individuals and are not representative of the target populations for public health surveillance, which are generally defined, for the relevant period, as the resident populations of a city, state, or other jurisdiction. We focus on using EHR data to estimate diabetes prevalence among young adults in New York City, as the rising diabetes burden at younger ages calls for better surveillance capacity.

Methods: This article applies common methods for nonprobability samples, including raking, post-stratification, and multilevel regression with post-stratification, to real and simulated data for the cross-sectional estimation of diabetes prevalence among those aged 18-44 years. Within real data analyses, we externally validate city- and neighborhood-level EHR-based estimates against gold-standard estimates from a local health survey. Within data simulations, we probe the extent to which residual biases remain when selection into the EHR sample is non-ignorable.
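
As a minimal sketch of two of the adjustment methods named above (not the authors' implementation), the following Python applies post-stratification and raking to a toy age-by-sex table. Every count, stratum, and function name (`poststratify`, `rake`) is hypothetical and chosen only for illustration. Multilevel regression with post-stratification is not shown because it requires fitting a model.

```python
# Toy illustration only: all counts below are made up, not study data.

def poststratify(cell_prev, pop_cells):
    """Weight cell-specific sample prevalences by known population cell counts."""
    total = sum(sum(row) for row in pop_cells)
    weighted = sum(p * n
                   for prev_row, pop_row in zip(cell_prev, pop_cells)
                   for p, n in zip(prev_row, pop_row))
    return weighted / total

def rake(start, row_targets, col_targets, iters=100):
    """Iterative proportional fitting: rescale cell weights until the
    row and column sums match the population margins."""
    w = [row[:] for row in start]
    for _ in range(iters):
        for i, target in enumerate(row_targets):
            s = sum(w[i])
            w[i] = [x * target / s for x in w[i]]
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in w)
            for row in w:
                row[j] *= target / s
    return w

# Hypothetical EHR sample. Rows: ages 18-29, 30-44. Columns: two sex groups.
sample_n = [[120, 80], [500, 300]]   # patients per cell
cases = [[2, 2], [35, 13]]           # patients with diabetes per cell
cell_prev = [[c / n for c, n in zip(crow, nrow)]
             for crow, nrow in zip(cases, sample_n)]

# Unadjusted (crude) prevalence in the sample.
crude = sum(map(sum, cases)) / sum(map(sum, sample_n))

# Post-stratification needs the joint population counts for every cell.
pop_cells = [[245_000, 255_000], [245_000, 255_000]]
ps_est = poststratify(cell_prev, pop_cells)

# Raking needs only the population margins (age totals and sex totals).
age_totals = [500_000, 500_000]
sex_totals = [490_000, 510_000]
raked_w = rake(sample_n, age_totals, sex_totals)
raked_est = (sum(w * p
                 for wrow, prow in zip(raked_w, cell_prev)
                 for w, p in zip(wrow, prow))
             / sum(map(sum, raked_w)))

print(f"crude={crude:.4f} post-stratified={ps_est:.4f} raked={raked_est:.4f}")
```

In this toy table the older stratum is over-represented in the sample, so both adjustments pull the estimate below the crude value. Raking is the usual choice when only marginal population totals are available, while post-stratification requires the full cross-classification.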

Results: Within the real data analyses, these methods reduced the impact of selection biases on the citywide prevalence estimate relative to the gold standard. Residual biases remained at the neighborhood level, where prevalence tended to be overestimated, especially in neighborhoods where a higher proportion of residents were captured in the sample. Simulation results demonstrated that these methods may be sufficient, except when selection into the EHR is non-ignorable, that is, when it depends on unmeasured factors or on diabetes status.
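
The role of non-ignorable selection can be illustrated with a small simulation. This is a sketch under stated assumptions, not the study's simulation design: the stratum prevalences, the selection probabilities, and the doubling of the selection probability for people with diabetes are all hypothetical values.

```python
import random

def simulate(selection_depends_on_diabetes, n=200_000, seed=0):
    """Draw a synthetic population, select an 'EHR sample', and
    post-stratify the sample prevalence on age. All rates are hypothetical."""
    rng = random.Random(seed)
    true_prev = {"18-29": 0.02, "30-44": 0.06}     # assumed prevalence by age
    base_select = {"18-29": 0.10, "30-44": 0.30}   # assumed selection by age
    pop = {s: 0 for s in true_prev}
    pop_cases = 0
    samp_n = {s: 0 for s in true_prev}
    samp_cases = {s: 0 for s in true_prev}
    for i in range(n):
        stratum = "18-29" if i % 2 == 0 else "30-44"
        diabetic = rng.random() < true_prev[stratum]
        pop[stratum] += 1
        pop_cases += diabetic
        p_sel = base_select[stratum]
        if selection_depends_on_diabetes and diabetic:
            p_sel *= 2   # assumed: people with diabetes are twice as likely to be in care
        if rng.random() < p_sel:
            samp_n[stratum] += 1
            samp_cases[stratum] += diabetic
    truth = pop_cases / n
    crude = sum(samp_cases.values()) / sum(samp_n.values())
    # Post-stratify: weight each stratum's sample prevalence by its population size.
    adjusted = sum(samp_cases[s] / samp_n[s] * pop[s] for s in pop) / n
    return truth, crude, adjusted

# Scenario A: selection depends only on age (ignorable given age).
truth_a, crude_a, adj_a = simulate(False)
# Scenario B: selection also depends on diabetes status (non-ignorable).
truth_b, crude_b, adj_b = simulate(True)

print(f"A: truth={truth_a:.3f} crude={crude_a:.3f} adjusted={adj_a:.3f}")
print(f"B: truth={truth_b:.3f} crude={crude_b:.3f} adjusted={adj_b:.3f}")
```

When selection depends only on the adjustment variable, post-stratifying on age recovers the population prevalence closely. When selection also depends on diabetes status, the within-stratum sample prevalences are themselves inflated, so reweighting by age cannot remove the bias.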

Conclusions: While EHRs offer potential to innovate on chronic disease surveillance, care is needed when estimating prevalence for small geographies or when selection is non-ignorable.
