Assessing the generalisation of artificial intelligence across mammography manufacturers

Alistair J Hickman, Sandra Gomes, Lucy M Warren, Nadia A S Smith, Caroline Shenton-Taylor

PLOS Digital Health, 4(8): e0000973, published 2025-08-12. DOI: 10.1371/journal.pdig.0000973
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12342238/pdf/
Abstract
The aim of this study was to determine whether differences between the manufacturers of mammogram images affect the performance of artificial intelligence tools for classifying breast density. Processed mammograms from 10,156 women were used to train and validate three deep learning algorithms on three retrospective datasets: Hologic, General Electric, and Mixed (equal numbers of Hologic, General Electric and Siemens images). The models were tested on four independent withheld test sets (Hologic, General Electric, Mixed and Siemens), and the area under the receiver operating characteristic curve (AUC) was compared. Women aged 47-73 with normal breasts (routine recall, no cancer) and Volpara ground truth were selected from the OPTIMAM Mammography Image Database for the years 2012-2015. 95% confidence intervals were used for significance testing, with a Bayesian signed-rank test used to rank the overall performance of the models. The best single-test performance was seen when a model was trained and tested on images from a single manufacturer (Hologic train/test: 0.98; General Electric train/test: 0.97); however, the same models performed significantly worse on images from any other manufacturer (General Electric AUCs: 0.68 and 0.63; Hologic AUCs: 0.56 and 0.90). The model trained on the mixed dataset exhibited the best overall performance. Performance was better when the training and test sets contained the same distribution of manufacturers, and generalisation was better when more manufacturers were included in training. Models in clinical use should be trained on data representing the different vendors of mammogram machines used across screening programs. This is clinically relevant because models will be affected by changes and upgrades to mammogram machines in screening centres.
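As a rough illustration of the kind of per-manufacturer evaluation the abstract describes, the sketch below computes a test-set AUC with a percentile-bootstrap 95% confidence interval for each manufacturer's held-out images. This is not the authors' pipeline: the function name `auc_with_ci`, the `test_sets` dictionary, the synthetic labels and scores, and the binary formulation of the density task are all illustrative assumptions, and the Bayesian signed-rank ranking step is not reproduced here.

```python
# Minimal sketch (not the paper's code): AUC with a percentile-bootstrap 95% CI
# for one model evaluated on per-manufacturer held-out test sets.
# Assumes binary labels and model scores are already available as NumPy arrays.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Return the point AUC and a percentile-bootstrap (1 - alpha) confidence interval."""
    rng = np.random.default_rng(seed)
    point = roc_auc_score(y_true, y_score)
    boot = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)           # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:   # skip resamples with a single class
            continue
        boot.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)

# Hypothetical usage: one trained model scored on manufacturer-specific test sets.
# Here the labels and scores are random placeholders, so AUC will be near 0.5.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    test_sets = {
        "Hologic": (rng.integers(0, 2, 500), rng.random(500)),
        "General Electric": (rng.integers(0, 2, 500), rng.random(500)),
        "Siemens": (rng.integers(0, 2, 500), rng.random(500)),
    }
    for name, (y, s) in test_sets.items():
        auc, (lo, hi) = auc_with_ci(y, s)
        print(f"{name}: AUC={auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

In this style of analysis, non-overlapping confidence intervals between, say, a Hologic-trained model scored on Hologic versus General Electric test sets would support the abstract's conclusion that performance drops when the test manufacturer is not represented in training.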