Reproducibility and Explainability of Deep Learning in Mammography: A Systematic Review of Literature

Impact factor: 0.9 | JCR Q4 | Radiology, Nuclear Medicine & Medical Imaging

Indian Journal of Radiology and Imaging | Pub Date: 2023-10-10 | DOI: 10.1055/s-0043-1775737

Deeksha Bhalla, Krithika Rangarajan, Tany Chandra, Subhashis Banerjee, Chetan Arora


Citations: 0

Abstract

Background: Although abundant literature is available on the use of deep learning for breast cancer detection in mammography, its quality varies widely.

Purpose: To evaluate published literature on breast cancer detection in mammography for reproducibility and to ascertain best practices for model design.

Methods: The PubMed and Scopus databases were searched to identify records that described the use of deep learning to detect lesions or to classify images as cancer or noncancer. A modification of the Quality Assessment of Diagnostic Accuracy Studies tool (mQUADAS-2) was developed for this review and applied to the included studies. The reported results of each study (area under the receiver operating characteristic [ROC] curve [AUC], sensitivity, and specificity) were recorded.

Results: A total of 12,123 records were screened, of which 107 met the inclusion criteria. Training and test datasets, the key idea behind the model architecture, and results were recorded for these studies. Based on the mQUADAS-2 assessment, 103 studies had a high risk of bias due to nonrepresentative patient selection. Four studies were of adequate quality, of which three trained their own model and one used a commercial network. Ensemble models were used in two of these. Common strategies used for model training included patch classifiers, image classification networks (ResNet in 67%), and object detection networks (RetinaNet in 67%). The highest reported AUC was 0.927 ± 0.008 on a screening dataset and 0.945 (0.919–0.968) on an enriched subset. Combined radiologist and artificial intelligence (AI) readings reached a higher AUC (0.955) and specificity (98.5%) than either alone. None of the studies provided explainability beyond localization accuracy, and none examined the interaction between AI and radiologists in a real-world setting.

Conclusion: While deep learning holds much promise in mammography interpretation, evaluation in a reproducible clinical setting and explainable networks are urgently needed.
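For readers unfamiliar with the metrics recorded in this review, the following is a minimal sketch of how sensitivity, specificity, and AUC can be computed from per-image model scores. It is not taken from any of the reviewed studies: the function names, the 0.5 threshold, and the toy labels and scores are all hypothetical, and the AUC uses the standard rank-based definition (the probability that a randomly chosen cancer case scores higher than a randomly chosen noncancer case, with ties counted as one half).

```python
from typing import Sequence


def sensitivity_specificity(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> tuple[float, float]:
    """Sensitivity and specificity when score >= threshold is called cancer.

    labels: 1 = cancer, 0 = noncancer (hypothetical encoding).
    """
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of cases.

    Every (cancer, noncancer) pair contributes 1 if the cancer case has
    the higher score, 0.5 on a tie, and 0 otherwise.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 cancer and 5 noncancer images.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
auc = roc_auc(labels, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={auc:.3f}")
```

Sensitivity and specificity depend on the chosen operating threshold, whereas AUC summarizes performance across all thresholds. The AUC is also shaped by the case mix of the test set, which is why the abstract gives the screening-dataset and enriched-subset values separately.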


