To weight or not to weight? The effect of selection bias in 3 large electronic health record-linked biobanks and recommendations for practice.

IF 4.7 2区 医学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Maxwell Salvatore, Ritoban Kundu, Xu Shi, Christopher R Friese, Seunggeun Lee, Lars G Fritsche, Alison M Mondul, David Hanauer, Celeste Leigh Pearce, Bhramar Mukherjee
{"title":"To weight or not to weight? The effect of selection bias in 3 large electronic health record-linked biobanks and recommendations for practice.","authors":"Maxwell Salvatore, Ritoban Kundu, Xu Shi, Christopher R Friese, Seunggeun Lee, Lars G Fritsche, Alison M Mondul, David Hanauer, Celeste Leigh Pearce, Bhramar Mukherjee","doi":"10.1093/jamia/ocae098","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>To develop recommendations regarding the use of weights to reduce selection bias for commonly performed analyses using electronic health record (EHR)-linked biobank data.</p><p><strong>Materials and methods: </strong>We mapped diagnosis (ICD code) data to standardized phecodes from 3 EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n = 244 071), Michigan Genomics Initiative (MGI; n = 81 243), and UK Biobank (UKB; n = 401 167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to represent the US adult population more. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted 4 common analyses comparing unweighted and weighted results.</p><p><strong>Results: </strong>For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted phenome-wide association study for colorectal cancer, the strongest associations remained unaltered, with considerable overlap in significant hits. Weighting affected the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates.</p><p><strong>Discussion: </strong>Weighting had a limited impact on dimensionality estimation and large-scale hypothesis testing but impacted prevalence and association estimation. When interested in estimating effect size, specific signals from untargeted association analyses should be followed up by weighted analysis.</p><p><strong>Conclusion: </strong>EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":null,"pages":null},"PeriodicalIF":4.7000,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11187425/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Medical Informatics Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1093/jamia/ocae098","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Objectives: To develop recommendations regarding the use of weights to reduce selection bias for commonly performed analyses using electronic health record (EHR)-linked biobank data.

Materials and methods: We mapped diagnosis (ICD code) data to standardized phecodes from 3 EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n = 244 071), Michigan Genomics Initiative (MGI; n = 81 243), and UK Biobank (UKB; n = 401 167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to represent the US adult population more. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted 4 common analyses comparing unweighted and weighted results.

Results: For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted phenome-wide association study for colorectal cancer, the strongest associations remained unaltered, with considerable overlap in significant hits. Weighting affected the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates.

Discussion: Weighting had a limited impact on dimensionality estimation and large-scale hypothesis testing but impacted prevalence and association estimation. When interested in estimating effect size, specific signals from untargeted association analyses should be followed up by weighted analysis.

Conclusion: EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.

加权还是不加权?3 个大型电子健康记录链接生物库中选择偏差的影响及实践建议。
目的:针对使用电子健康记录(EHR)链接生物库数据进行的常见分析,提出使用权重减少选择偏差的建议:针对使用电子健康记录(EHR)链接生物库数据进行的常见分析,提出有关使用权重减少选择偏倚的建议:我们将诊断(ICD 代码)数据与 3 个与电子病历关联的生物库中的标准化嗜血代码进行了映射,这些生物库的招募策略各不相同:All of Us (AOU; n = 244 071)、Michigan Genomics Initiative (MGI; n = 81 243)和UK Biobank (UKB; n = 401 167)。利用 2019 年全国健康访谈调查数据,我们为 AOU 和 MGI 建立了选择权重,以更多地代表美国成年人口。我们使用之前为英国生物样本库开发的权重来代表符合英国生物样本库资格的人群。我们对未加权和加权结果进行了 4 项共同分析:结果:对于 AOU 和 MGI,加权后估计的嗜铬细胞编码流行率有所下降(加权-非加权中位嗜铬细胞编码流行率比 [MPR]:0.82 和 0.61),而 UKB 估计值有所上升(MPR:1.06)。加权对潜在表型维度估计的影响很小。比较加权与未加权的结直肠癌全表型关联研究,最强的关联没有改变,显著的关联有相当大的重叠。加权影响了性别和结直肠癌的估计对数比,使其更接近于基于国家登记的估计值:讨论:加权对维度估计和大规模假设检验的影响有限,但对流行率和关联估计有影响。当对效应大小的估计感兴趣时,应通过加权分析来跟进非目标关联分析的特定信号:结论:与电子病历相连的生物库应报告招募和筛选机制,并提供界定目标人群的筛选权重。研究人员应考虑他们的预期估计值,指定来源和目标人群,并相应地对与电子病历关联的生物库分析进行加权。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Journal of the American Medical Informatics Association
Journal of the American Medical Informatics Association 医学-计算机:跨学科应用
CiteScore
14.50
自引率
7.80%
发文量
230
审稿时长
3-8 weeks
期刊介绍: JAMIA is AMIA''s premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA''s articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信