Title: An Exploration of Discrepant Recalls Between AI and Human Readers of Malignant Lesions in Digital Mammography Screening.
Authors: Suzanne L van Winkel, Ioannis Sechopoulos, Alejandro Rodríguez-Ruiz, Wouter J H Veldkamp, Gisella Gennaro, Margarita Chevalier, Thomas H Helbich, Tianyu Zhang, Matthew G Wallis, Ritse M Mann
Journal: Diagnostics, vol. 15, no. 12
Publication date: 2025-06-19
Publication type: Journal Article
DOI: https://doi.org/10.3390/diagnostics15121566
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12191860/pdf/
Impact factor: 3.0 (JCR Q1, Medicine, General & Internal)
Citations: 0
Abstract
Background: The integration of artificial intelligence (AI) in digital mammography (DM) screening holds promise for early breast cancer detection, potentially enhancing accuracy and efficiency. However, AI performance is not identical to that of human observers. We aimed to identify common morphological image characteristics of true cancers that are missed by either AI or human screening when their interpretations are discrepant.
Methods: Twenty-six breast cancer-positive cases, identified from a large retrospective multi-institutional digital mammography dataset based on discrepant AI and human interpretations, were included in a reader study. Ground truth was confirmed by histopathology or ≥1-year follow-up. Fourteen radiologists assessed lesion visibility, morphological features, and likelihood of malignancy. AI performance was evaluated using receiver operating characteristic (ROC) analysis and area under the curve (AUC). The reader study results were analyzed using interobserver agreement measures and descriptive statistics.
Results: AI demonstrated high discriminative capability in the full dataset, with AUCs ranging from 0.903 (95% CI: 0.862-0.944) to 0.946 (95% CI: 0.896-0.996). Cancers missed by AI had a significantly smaller median size (9.0 mm, IQR 6.5-12.0) than those missed by human readers (21.0 mm, IQR 10.5-41.0) (p = 0.0014). Cancers in discrepant cases were often described as having 'low visibility', 'indistinct margins', or 'irregular shape'. Among the human-missed cancers, calcifications were more frequent (42/154; 27%) than among the AI-missed cancers (38/210; 18%) (p = 0.396). Furthermore, 50/154 (32.5%) human-missed cancers were deemed to have a very high likelihood of malignancy, compared to 41/210 (19.5%) AI-missed cancers (p = 0.8). Overall inter-rater agreement on the items assessed during the reader study was poor to fair (<0.40), suggesting that interpretation of the selected images was challenging.
Conclusions: Lesions missed by AI were smaller and less often calcified than cancers missed by human readers. Cancers missed by AI tended to show lower levels of suspicion than those missed by human readers. While definitive conclusions are premature, the findings highlight the complementary roles of AI and human readers in mammographic interpretation.
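The proportions quoted in the abstract can be rechecked with a few lines of arithmetic. The sketch below is only an illustration, not the authors' analysis: the counts come from the abstract, while the labels, the `percent` helper, and the reading of the denominators as cases times readers are assumptions made here.

```python
from fractions import Fraction

# Counts as reported in the abstract: (events, total assessments).
# The dictionary labels are illustrative names, not terms from the paper.
reported = {
    "calcifications, human-missed": (42, 154),
    "calcifications, AI-missed": (38, 210),
    "very high likelihood, human-missed": (50, 154),
    "very high likelihood, AI-missed": (41, 210),
}


def percent(events: int, total: int) -> float:
    """Return events/total as a percentage rounded to one decimal place."""
    return round(100 * events / total, 1)


percentages = {label: percent(*counts) for label, counts in reported.items()}

for label, value in percentages.items():
    print(f"{label}: {value}%")

# Assumption (not stated in the abstract): each of the 14 radiologists
# assessed each of the 26 cases once, so every denominator equals
# (number of cases in the group) x (number of readers).
N_READERS = 14
N_CASES = 26
human_missed_cases = Fraction(154, N_READERS)
ai_missed_cases = Fraction(210, N_READERS)

print(f"implied human-missed cases: {human_missed_cases}")
print(f"implied AI-missed cases: {ai_missed_cases}")
```

Under that assumption the denominators split the 26 cases into two whole-number groups, which is consistent with the reported totals. The abstract does not say how the p-values were computed, so this sketch does not attempt to reproduce them.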
Journal: Diagnostics
Subject area: Biochemistry, Genetics and Molecular Biology - Clinical Biochemistry
CiteScore: 4.70
Self-citation rate: 8.30%
Articles published: 2699
Review time: 19.64 days
About the journal:
Diagnostics (ISSN 2075-4418) is an international scholarly open access journal on medical diagnostics. It publishes original research articles, reviews, communications and short notes on the research and development of medical diagnostics. There is no restriction on the length of the papers. Our aim is to encourage scientists to publish their experimental and theoretical research in as much detail as possible. Full experimental and/or methodological details must be provided for research articles.