Quantitative assessment of impact of technical and population-based factors on fairness of AI models for chest X-ray scans.

IF 6.3 | Zone 2, Medicine | JCR Q1, Biology

Computers in biology and medicine | Pub Date: 2025-10-06 | DOI: 10.1016/j.compbiomed.2025.111147

Dmitry Cherezov, Pingfu Fu, Anant Madabhushi

Volume 198 Pt A, article 111147 | Publication type: Journal Article

Citations: 0

Abstract

Ensuring fairness in diagnostic AI models is essential for their safe deployment in clinical practice. This study investigates fairness by jointly analyzing population-based factors (sex and race) and technical factors (imaging site and X-ray energy) using chest X-ray data. A total of 49 datasets covering over 321,000 patients and 960,000 images were used. Six experiments were conducted to evaluate the effect of these factors on model performance across classification scores, class activation maps (CAMs), and deep features (DFs). Fairness was assessed using effect sizes derived from Kolmogorov-Smirnov statistics. Within single datasets, performance differences between demographic groups were generally small, with effect sizes below 0.1 for classification scores and CAMs, and up to 0.2 for deep features by sex. However, much larger discrepancies were observed when comparing the same patient group across different imaging sites, with effect sizes ranging from 0.1 to 0.6 across all metrics. Our findings suggest that technical variability has a greater impact on model behavior than population-based factors. Notably, deep features revealed more substantial group differences than surface-level outputs like diagnostic probability scores or CAMs. The findings emphasize the need to evaluate fairness not only within datasets but also across institutions, comparing model performance on training versus external populations, thereby helping to identify fairness limitations that might not be visible through single-cohort analyses.
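As a rough illustration of the kind of measure the abstract refers to, the sketch below computes a two-sample Kolmogorov-Smirnov statistic between the score distributions of two groups. The abstract does not say how the effect size is derived from the KS statistic, so using the statistic itself (the largest gap between the two empirical CDFs, ranging from 0 to 1) is an assumption made here. The function name `ks_statistic` and the score lists are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical classification scores for two groups (made-up numbers,
# not data from the study).
scores_group_1 = [0.10, 0.20, 0.30, 0.40, 0.50]
scores_group_2 = [0.30, 0.40, 0.50, 0.60, 0.70]

# Assumption: the KS statistic is used directly as the effect size.
effect = ks_statistic(scores_group_1, scores_group_2)
print(f"KS-based effect size: {effect:.2f}")
```

The same function could be applied to any per-image scalar, for example one deep-feature dimension, comparing either two demographic groups within one dataset or the same group across two imaging sites.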

Source journal

Computers in biology and medicine (Engineering & Technology: Biomedical Engineering)

CiteScore

11.70

Self-citation rate

10.40%

Articles published

1086

Review time

74 days

About the journal: Computers in Biology and Medicine is an international forum for sharing groundbreaking advancements in the use of computers in bioscience and medicine. This journal serves as a medium for communicating essential research, instruction, ideas, and information regarding the rapidly evolving field of computer applications in these domains. By encouraging the exchange of knowledge, we aim to facilitate progress and innovation in the utilization of computers in biology and medicine.