Priyanka Anand, Yinzhu Jin, Jun Liu, Joyce Lii, Shruti Belitkar, Kueiyu Joshua Lin
Minimizing Racial Algorithmic Bias when Predicting Electronic Health Record Data Completeness
Clinical Pharmacology & Therapeutics. Journal Article. Published 2025-07-22. DOI: 10.1002/cpt.3758 (https://doi.org/10.1002/cpt.3758)
Abstract
The previously developed algorithm for identifying subjects with high electronic health record (EHR)-continuity performed suboptimally in racially diverse populations. We aimed to improve its performance by optimizing the race modeling strategy. We randomly divided a TriNetX claims-linked EHR dataset from 11 US-based healthcare organizations into training (70%) and testing (30%) data to develop and test models with and without race interactions, as well as race-specific models. We held out a Medicaid-linked EHR dataset as validation data. Study subjects were ≥18 years old with ≥365 days of continuous insurance enrollment overlapping an EHR encounter. We used cross-validated least absolute shrinkage and selection operator (LASSO) to select predictors of high EHR-continuity. We compared model performance using the area under the receiver operating characteristic curve (AUC). There were 550,859, 236,089, and 65,956 subjects in the training, testing, and validation datasets, respectively. In the validation set, introducing race-interaction terms improved model performance in the Black (AUC 0.821 vs. 0.812, P < 0.001) and other non-White race (AUC 0.828 vs. 0.812, P < 0.001) subgroups. The performance of the race-specific models did not differ substantially from that of the models with race-interaction terms in the race subgroups. Using the race-interaction model, subjects in the top 50% of predicted EHR-continuity had 2- to 3-fold less misclassification of 40 variables relevant to comparative effectiveness research (CER). The inclusion of race-interaction terms improved model performance in the race subgroups. Using the EHR-continuity prediction algorithm with race-interaction terms can potentially reduce algorithmic bias for racial minorities.
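The two ideas the abstract relies on, race-interaction terms and subgroup-level AUC, can be illustrated with a minimal sketch. This is not the authors' code and it does not fit a LASSO model. It only shows, under stated assumptions, how a feature row might be expanded with race indicators and predictor-by-race products, and how AUC can be computed separately per race subgroup. The function names, the race levels, the choice of "White" as reference category, and the toy numbers are all hypothetical and are not taken from the study.

```python
from collections import defaultdict


def add_race_interactions(row, race, race_levels, reference="White"):
    """Expand one feature row with race indicators and predictor-by-race products.

    For every non-reference race level, append a 0/1 indicator followed by
    indicator * x for each original predictor x. The reference level is an
    assumption made for this sketch.
    """
    out = list(row)
    for level in race_levels:
        if level == reference:
            continue
        indicator = 1.0 if race == level else 0.0
        out.append(indicator)
        out.extend(indicator * x for x in row)
    return out


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_group(labels, scores, groups):
    """Compute AUC separately within each subgroup (e.g. each race category)."""
    grouped = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        grouped[g][0].append(y)
        grouped[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in grouped.items()}


if __name__ == "__main__":
    # Toy values invented for illustration only; they are not study data.
    levels = ["White", "Black", "Other"]
    print(add_race_interactions([2.0, 3.0], "Black", levels))

    labels = [1, 0, 1, 0, 1, 0]
    scores = [0.9, 0.2, 0.7, 0.4, 0.3, 0.6]
    groups = ["White", "White", "Black", "Black", "Other", "Other"]
    print(auc_by_group(labels, scores, groups))
```

In a full pipeline, the expanded rows would be passed to an L1-penalized model with cross-validated penalty strength, and the per-group AUCs of models with and without the interaction columns would then be compared on held-out data.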
About the journal:
Clinical Pharmacology & Therapeutics (CPT) is the authoritative cross-disciplinary journal in experimental and clinical medicine devoted to publishing advances in the nature, action, efficacy, and evaluation of therapeutics. CPT welcomes original Articles in the emerging areas of translational, predictive and personalized medicine; new therapeutic modalities including gene and cell therapies; pharmacogenomics, proteomics and metabolomics; bioinformation and applied systems biology complementing areas of pharmacokinetics and pharmacodynamics, human investigation and clinical trials, pharmacovigilance, pharmacoepidemiology, pharmacometrics, and population pharmacology.