Evaluating the impact of data biases on algorithmic fairness and clinical utility of machine learning models for prolonged opioid use prediction.

Behzad Naderalvojoud, Catherine Curtin, Steven M Asch, Keith Humphreys, Tina Hernandez-Boussard

JAMIA Open. 2025;8(5):ooaf115. doi: 10.1093/jamiaopen/ooaf115. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12483547/pdf/
Objectives: The growing use of machine learning (ML) in healthcare raises concerns about how data biases affect real-world model performance. While existing frameworks evaluate algorithmic fairness, they often overlook the impact of bias on generalizability and clinical utility, which are critical for safe deployment. Building on prior methods, this study extends bias analysis to include clinical utility, addressing a key gap between fairness evaluation and decision-making.
Materials and methods: We applied a three-phase evaluation to a previously developed model predicting prolonged opioid use (POU), validating it on Veterans Health Administration (VHA) data. The analysis included internal and external validation, model retraining on VHA data, and subgroup evaluation across demographic, vulnerable, risk, and comorbidity groups. We assessed performance using the area under the receiver operating characteristic curve (AUROC), calibration, and decision curve analysis, incorporating standardized net benefit to evaluate clinical utility alongside fairness and generalizability.
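The article itself publishes no code, but the decision curve analysis it describes follows the standard net benefit definition, NB(t) = TP/n − (FP/n) · t/(1 − t), with the standardized variant dividing by outcome prevalence. A minimal illustrative sketch (all names and the toy data below are assumptions, not the authors' implementation):

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit at threshold t: NB(t) = TP/n - (FP/n) * t / (1 - t)."""
    y_true = np.asarray(y_true)
    pred_pos = np.asarray(y_prob) >= threshold
    n = len(y_true)
    tp = np.sum(pred_pos & (y_true == 1))
    fp = np.sum(pred_pos & (y_true == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)

def standardized_net_benefit(y_true, y_prob, threshold):
    """NB(t) divided by outcome prevalence, so the best attainable value is 1
    regardless of event rate -- convenient when prevalence differs across
    cohorts, as it does here (14.7% internal vs 34.3% external)."""
    return net_benefit(y_true, y_prob, threshold) / np.mean(y_true)

# Trace standardized net benefit over a grid of threshold probabilities
# to build a decision curve; the labels and risk scores are toy data.
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.147, 10_000)  # toy labels at the internal cohort's prevalence
y_prob = np.clip(0.15 + 0.4 * y_true + rng.normal(0, 0.15, 10_000), 0, 1)
thresholds = np.linspace(0.05, 0.50, 46)
curve = [standardized_net_benefit(y_true, y_prob, t) for t in thresholds]
```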
Results: The internal cohort (N = 41 929) had a 14.7% POU prevalence, compared to 34.3% in the external VHA cohort (N = 397 150). The model's AUROC decreased from 0.74 in the internal test cohort to 0.70 in the full external cohort. Subgroup-level performance averaged 0.69 (SD = 0.01), showing minimal deviation from the external cohort overall. Retraining on VHA data improved AUROCs to 0.82. Clinical utility analysis showed systematic shifts in net-benefit across threshold probabilities.
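As a hypothetical sketch of how a subgroup summary like the one above (mean AUROC with its SD across groups) can be computed, assuming illustrative column names that are not taken from the paper:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_aurocs(df: pd.DataFrame, group_col: str,
                    label_col: str = "pou", score_col: str = "risk"):
    """AUROC per subgroup plus the mean/SD across subgroups.
    Column names are illustrative placeholders."""
    scores = {}
    for name, sub in df.groupby(group_col):
        if sub[label_col].nunique() == 2:  # AUROC is undefined without both classes
            scores[name] = roc_auc_score(sub[label_col], sub[score_col])
    values = np.array(list(scores.values()))
    return scores, values.mean(), values.std()

# Usage on a hypothetical cohort table with demographic subgroups:
# per_group, mean_auc, sd_auc = subgroup_aurocs(cohort_df, "race_ethnicity")
```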
Discussion: While the POU model showed generalizability and fairness internally, external validation and retraining revealed performance and utility shifts across subgroups.
Conclusion: Population-specific biases affect clinical utility, an often-overlooked dimension in fairness evaluation; addressing it is key to ensuring equitable benefits across diverse patient groups.