Biases in Race and Ethnicity Introduced by Filtering Electronic Health Records for "Complete Data": Observational Clinical Data Analysis
Jose Miguel Acitores Cortina, Yasaman Fatapour, Kathleen LaRow Brown, Undina Gisladottir, Michael Zietz, Oliver John Bear Don't Walk Iv, Danner Peter, Jacob S Berkowitz, Nadine A Friedrich, Sophia Kivelson, Aditi Kuchi, Hongyu Liu, Apoorva Srinivasan, Kevin K Tsang, Nicholas P Tatonetti
JMIR Medical Informatics, volume 13, e67591. Published 2025-03-27. Journal Article. DOI: 10.2196/67591 (https://doi.org/10.2196/67591). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11967746/pdf/
Impact factor: 3.1; JCR: Q2 (Medical Informatics)
Abstract
Background: Integrated clinical databases from national biobanks have advanced the capacity for disease research. Data quality and completeness filters are used when building clinical cohorts to address limitations of data missingness. However, these filters may unintentionally introduce systemic biases when they are correlated with race and ethnicity.
Objective: In this study, we examined the race and ethnicity biases introduced by applying common filters to 4 clinical records databases. Specifically, we evaluated whether these filters introduce biases that disproportionately exclude minoritized groups.
Methods: We applied 19 commonly used data filters to electronic health record datasets from 4 geographically varied locations comprising close to 12 million patients to understand how using these filters introduces sample bias along racial and ethnic groupings. These filters covered a range of information, including demographics, medication records, visit details, and observation periods. We observed the variation in sample drop-off between self-reported ethnic and racial groups for each site as we applied each filter individually.
Results: Applying the observation period filter substantially reduced data availability across all races and ethnicities in all 4 datasets. However, data availability in the white group remained consistently higher than in other racial groups after each filter was applied. Conversely, the Black or African American group was the most affected by each filter in 3 of the datasets: the Cedars-Sinai dataset, UK Biobank, and the Columbia University dataset. Of the 4 datasets, only All of Us showed minimal deviation from the baseline when the filters were applied, with most racial and ethnic groups following a similar pattern.
Conclusions: Our findings underscore the importance of using only necessary filters, as filters might disproportionately reduce data availability for minoritized racial and ethnic populations. Researchers must consider these unintentional biases when performing data-driven research and explore techniques to minimize the impact of these filters, such as probabilistic methods or adjusted cohort selection methods. Additionally, we recommend disclosing sample sizes for racial and ethnic groups both before and after data filters are applied to help readers understand the generalizability of the results. Future work should focus on exploring the effects of filters on downstream analyses.
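The per-filter comparison described in Methods, together with the before-and-after reporting recommended in Conclusions, can be illustrated with a minimal sketch. It is not the authors' code: the records are synthetic, and the field names (`group`, `has_medication`, `observation_days`), the two filters, and the 365-day threshold are assumptions chosen only for illustration. Each filter is applied on its own to the full cohort, and the sample size per group is reported before and after.

```python
from collections import Counter

# Synthetic toy records; field names and values are illustrative assumptions,
# not the schema or data of any dataset in the study.
patients = [
    {"group": "group_1", "has_medication": True,  "observation_days": 400},
    {"group": "group_1", "has_medication": True,  "observation_days": 30},
    {"group": "group_1", "has_medication": False, "observation_days": 900},
    {"group": "group_1", "has_medication": True,  "observation_days": 500},
    {"group": "group_2", "has_medication": False, "observation_days": 20},
    {"group": "group_2", "has_medication": True,  "observation_days": 10},
    {"group": "group_2", "has_medication": True,  "observation_days": 700},
    {"group": "group_2", "has_medication": False, "observation_days": 100},
]

# Hypothetical completeness filters; the 365-day threshold is an assumption.
filters = {
    "any_medication_record": lambda p: p["has_medication"],
    "observation_period_365d": lambda p: p["observation_days"] >= 365,
}


def retention_by_group(records, keep):
    """Return {group: (n_before, n_after, fraction_retained)} for one filter."""
    before = Counter(r["group"] for r in records)
    after = Counter(r["group"] for r in records if keep(r))
    return {g: (before[g], after[g], after[g] / before[g]) for g in sorted(before)}


# Apply each filter individually to the unfiltered cohort.
report = {name: retention_by_group(patients, keep) for name, keep in filters.items()}

for name, groups in report.items():
    for group, (n_before, n_after, frac) in groups.items():
        print(f"{name:25s} {group}: {n_before} -> {n_after} ({frac:.0%} retained)")
```

In this toy cohort the observation period filter retains 3 of 4 records in `group_1` but only 1 of 4 in `group_2`. A filter that looks neutral can therefore still shift the composition of the sample, and reporting the per-group counts before and after makes that shift visible.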
About the journal:
JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It focuses on applied, translational research and has a broad readership that includes clinicians, CIOs, engineers, and industry and health informatics professionals.
Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope: it places more emphasis on applications for clinicians and health professionals than on consumers and citizens, who are the focus of JMIR. It also publishes even faster and accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.