An Exploration of Discrepant Recalls Between AI and Human Readers of Malignant Lesions in Digital Mammography Screening.

IF 3; Zone 3, Medicine; Q1 MEDICINE, GENERAL & INTERNAL

Diagnostics Pub Date : 2025-06-19 DOI:10.3390/diagnostics15121566

Suzanne L van Winkel, Ioannis Sechopoulos, Alejandro Rodríguez-Ruiz, Wouter J H Veldkamp, Gisella Gennaro, Margarita Chevalier, Thomas H Helbich, Tianyu Zhang, Matthew G Wallis, Ritse M Mann

Citations: 0

Abstract

Background: The integration of artificial intelligence (AI) in digital mammography (DM) screening holds promise for early breast cancer detection, potentially enhancing accuracy and efficiency. However, AI performance is not identical to that of human observers. We aimed to identify common morphological image characteristics of true cancers that are missed by either AI or human screening when their interpretations are discrepant.

Methods: Twenty-six breast cancer-positive cases, identified from a large retrospective multi-institutional digital mammography dataset based on discrepant AI and human interpretations, were included in a reader study. Ground truth was confirmed by histopathology or ≥1-year follow-up. Fourteen radiologists assessed lesion visibility, morphological features, and likelihood of malignancy. AI performance was evaluated using receiver operating characteristic (ROC) analysis and area under the curve (AUC). The reader study results were analyzed using interobserver agreement measures and descriptive statistics.

Results: AI demonstrated high discriminative capability in the full dataset, with AUCs ranging from 0.903 (95% CI: 0.862-0.944) to 0.946 (95% CI: 0.896-0.996). Cancers missed by AI had a significantly smaller median size (9.0 mm, IQR 6.5-12.0) than those missed by human readers (21.0 mm, IQR 10.5-41.0) (p = 0.0014). Cancers in discrepant cases were often described as having 'low visibility', 'indistinct margins', or 'irregular shape'. Calcifications were more frequent among the human-missed cancers (42/154; 27%) than among the AI-missed cancers (38/210; 18%) (p = 0.396). Furthermore, 50/154 (32.5%) human-missed cancers were deemed to have a very high likelihood of malignancy, compared with 41/210 (19.5%) AI-missed cancers (p = 0.8). Overall inter-rater agreement on the items assessed during the reader study was poor to fair (<0.40), suggesting that interpretation of the selected images was challenging.

Conclusions: Lesions missed by AI were smaller and less often calcified than cancers missed by human readers. Cancers missed by AI tended to show lower levels of suspicion than those missed by human readers. While definitive conclusions are premature, the findings highlight the complementary roles of AI and human readers in mammographic interpretation.
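The proportions quoted in the Results can be rechecked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' analysis code: the dictionary labels and the `percent` helper are hypothetical names, and the reading of each denominator as "readers × cases" is an assumption inferred from the fourteen radiologists and twenty-six cases mentioned in the Methods, not something the abstract states.

```python
from fractions import Fraction

# Counts quoted in the abstract: (assessments with the finding, total assessments).
counts = {
    "calcifications, human-missed": (42, 154),
    "calcifications, AI-missed": (38, 210),
    "very high likelihood, human-missed": (50, 154),
    "very high likelihood, AI-missed": (41, 210),
}


def percent(events: int, total: int, digits: int = 1) -> float:
    """Return events/total as a percentage rounded to `digits` decimals."""
    return round(100 * events / total, digits)


percentages = {label: percent(e, n) for label, (e, n) in counts.items()}

# ASSUMPTION (not stated in the abstract): each denominator equals
# number of readers x number of cases in that group.
READERS = 14
cases_human_missed = Fraction(154, READERS)  # reduces to 11 if the assumption holds
cases_ai_missed = Fraction(210, READERS)     # reduces to 15 if the assumption holds

if __name__ == "__main__":
    for label, value in percentages.items():
        print(f"{label}: {value}%")
    print("implied cases:", cases_human_missed, "+", cases_ai_missed,
          "=", cases_human_missed + cases_ai_missed)
```

Under that assumption the denominators divide evenly into 11 human-missed and 15 AI-missed cases, which sum to the 26 cases in the reader study. Because each case would then contribute fourteen correlated assessments, a naive test on the pooled counts would not be expected to reproduce the reported p-values.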


Source journal: Diagnostics (Biochemistry, Genetics and Molecular Biology - Clinical Biochemistry)

CiteScore: 4.70

Self-citation rate: 8.30%

Articles published: 2699

Review time: 19.64 days

Journal description: Diagnostics (ISSN 2075-4418) is an international scholarly open access journal on medical diagnostics. It publishes original research articles, reviews, communications and short notes on the research and development of medical diagnostics. There is no restriction on the length of the papers. Our aim is to encourage scientists to publish their experimental and theoretical research in as much detail as possible. Full experimental and/or methodological details must be provided for research articles.