Iris E Chen, Melissa Joines, Nina Capiro, Reema Dawar, Christopher Sears, James Sayre, James Chalfant, Cheryce Fischer, Anne C Hoyt, William Hsu, Hannah S Milch
{"title":"商业人工智能与放射科医生:基于人群的数字乳房x线照相术和断层合成筛查乳房x线照相术队列的NPV和召回率。","authors":"Iris E Chen, Melissa Joines, Nina Capiro, Reema Dawar, Christopher Sears, James Sayre, James Chalfant, Cheryce Fischer, Anne C Hoyt, William Hsu, Hannah S Milch","doi":"10.2214/AJR.25.32889","DOIUrl":null,"url":null,"abstract":"<p><p><b>Background:</b> By reliably classifying screening mammograms as negative, artificial intelligence (AI) could minimize radiologists' time spent reviewing high volumes of normal examinations and help prioritize examinations with high likelihood of malignancy. <b>Objective:</b> To compare performance of AI, classified as positive at different thresholds, with that of radiologists, focusing on NPV and recall rates, in large population-based digital mammography (DM) and digital breast tomosynthesis (DBT) screening cohorts. <b>Methods:</b> This retrospective single-institution study included women enrolled in the observational population-based Athena Breast Health Network. Stratified random sampling was used to identify cohorts of DM and DBT screening examinations performed from January 2010 through December 2019. Radiologists' interpretations were extracted from clinical reports. A commercial AI system classified examinations as low, intermediate, or elevated risk. Breast cancer diagnoses within 1 year after screening examinations were identified from a state cancer registry. AI and radiologist performance were compared. <b>Results:</b> The DM cohort included 26,693 examinations in 20,409 women (mean age, 58.1 years). AI classified 58.2%, 27.7%, and 14.0% of examinations as low, intermediate, and elevated risk, respectively. Sensitivity, specificity, recall rate and NPV for radiologists were 88.6%, 93.3%, 7.2%, and 99.9%; for AI (defining positive as elevated risk), 74.4%, 86.3%, 14.0%, and 99.8%; and for AI (defining positive as intermediate/elevated risk), 94.0%, 58.6%, 41.8%, and 99.9%. The DBT cohort included 4824 examinations in 4379 women (mean age, 61.3 years). AI classified 68.1%, 19.8%, and 12.1% of examinations as low, intermediate, and elevated risk, respectively. Sensitivity, specificity, recall rate, and NPV for radiologists were 83.8%, 93.7%, 6.9%, and 99.9%; for AI (defining positive results as elevated risk), 78.4%, 88.4%, 12.1%, and 99.8%; and for AI (defining positive results as intermediate/elevated risk), 89.2%, 68.5%, 31.9%, and 99.8%. <b>Conclusion:</b> In large DM and DBT cohorts, AI at either diagnostic threshold achieved high NPV but had higher recall rates than radiologists. Defining positive AI results to include intermediate-risk examinations, versus only elevated-risk examinations, detected additional cancers but yielded markedly increased recall rates. <b>Clinical Impact:</b> The findings support AI's potential to aid radiologists' workflow efficiency. 
Yet, strategies are needed to address frequent false-positive results, particularly in the intermediate-risk category.</p>","PeriodicalId":55529,"journal":{"name":"American Journal of Roentgenology","volume":" ","pages":""},"PeriodicalIF":6.1000,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Commercial Artificial Intelligence Versus Radiologists: NPV and Recall Rate in Large Population-Based Digital Mammography and Tomosynthesis Screening Mammography Cohorts.\",\"authors\":\"Iris E Chen, Melissa Joines, Nina Capiro, Reema Dawar, Christopher Sears, James Sayre, James Chalfant, Cheryce Fischer, Anne C Hoyt, William Hsu, Hannah S Milch\",\"doi\":\"10.2214/AJR.25.32889\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b>Background:</b> By reliably classifying screening mammograms as negative, artificial intelligence (AI) could minimize radiologists' time spent reviewing high volumes of normal examinations and help prioritize examinations with high likelihood of malignancy. <b>Objective:</b> To compare performance of AI, classified as positive at different thresholds, with that of radiologists, focusing on NPV and recall rates, in large population-based digital mammography (DM) and digital breast tomosynthesis (DBT) screening cohorts. <b>Methods:</b> This retrospective single-institution study included women enrolled in the observational population-based Athena Breast Health Network. Stratified random sampling was used to identify cohorts of DM and DBT screening examinations performed from January 2010 through December 2019. Radiologists' interpretations were extracted from clinical reports. A commercial AI system classified examinations as low, intermediate, or elevated risk. Breast cancer diagnoses within 1 year after screening examinations were identified from a state cancer registry. AI and radiologist performance were compared. <b>Results:</b> The DM cohort included 26,693 examinations in 20,409 women (mean age, 58.1 years). AI classified 58.2%, 27.7%, and 14.0% of examinations as low, intermediate, and elevated risk, respectively. Sensitivity, specificity, recall rate and NPV for radiologists were 88.6%, 93.3%, 7.2%, and 99.9%; for AI (defining positive as elevated risk), 74.4%, 86.3%, 14.0%, and 99.8%; and for AI (defining positive as intermediate/elevated risk), 94.0%, 58.6%, 41.8%, and 99.9%. The DBT cohort included 4824 examinations in 4379 women (mean age, 61.3 years). AI classified 68.1%, 19.8%, and 12.1% of examinations as low, intermediate, and elevated risk, respectively. Sensitivity, specificity, recall rate, and NPV for radiologists were 83.8%, 93.7%, 6.9%, and 99.9%; for AI (defining positive results as elevated risk), 78.4%, 88.4%, 12.1%, and 99.8%; and for AI (defining positive results as intermediate/elevated risk), 89.2%, 68.5%, 31.9%, and 99.8%. <b>Conclusion:</b> In large DM and DBT cohorts, AI at either diagnostic threshold achieved high NPV but had higher recall rates than radiologists. Defining positive AI results to include intermediate-risk examinations, versus only elevated-risk examinations, detected additional cancers but yielded markedly increased recall rates. <b>Clinical Impact:</b> The findings support AI's potential to aid radiologists' workflow efficiency. 
Yet, strategies are needed to address frequent false-positive results, particularly in the intermediate-risk category.</p>\",\"PeriodicalId\":55529,\"journal\":{\"name\":\"American Journal of Roentgenology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":6.1000,\"publicationDate\":\"2025-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"American Journal of Roentgenology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2214/AJR.25.32889\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Journal of Roentgenology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2214/AJR.25.32889","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Commercial Artificial Intelligence Versus Radiologists: NPV and Recall Rate in Large Population-Based Digital Mammography and Tomosynthesis Screening Mammography Cohorts.
Background: By reliably classifying screening mammograms as negative, artificial intelligence (AI) could minimize radiologists' time spent reviewing high volumes of normal examinations and help prioritize examinations with high likelihood of malignancy.

Objective: To compare performance of AI, classified as positive at different thresholds, with that of radiologists, focusing on NPV and recall rates, in large population-based digital mammography (DM) and digital breast tomosynthesis (DBT) screening cohorts.

Methods: This retrospective single-institution study included women enrolled in the observational population-based Athena Breast Health Network. Stratified random sampling was used to identify cohorts of DM and DBT screening examinations performed from January 2010 through December 2019. Radiologists' interpretations were extracted from clinical reports. A commercial AI system classified examinations as low, intermediate, or elevated risk. Breast cancer diagnoses within 1 year after screening examinations were identified from a state cancer registry. AI and radiologist performance were compared.

Results: The DM cohort included 26,693 examinations in 20,409 women (mean age, 58.1 years). AI classified 58.2%, 27.7%, and 14.0% of examinations as low, intermediate, and elevated risk, respectively. Sensitivity, specificity, recall rate, and NPV for radiologists were 88.6%, 93.3%, 7.2%, and 99.9%; for AI (defining positive results as elevated risk), 74.4%, 86.3%, 14.0%, and 99.8%; and for AI (defining positive results as intermediate/elevated risk), 94.0%, 58.6%, 41.8%, and 99.9%. The DBT cohort included 4824 examinations in 4379 women (mean age, 61.3 years). AI classified 68.1%, 19.8%, and 12.1% of examinations as low, intermediate, and elevated risk, respectively. Sensitivity, specificity, recall rate, and NPV for radiologists were 83.8%, 93.7%, 6.9%, and 99.9%; for AI (defining positive results as elevated risk), 78.4%, 88.4%, 12.1%, and 99.8%; and for AI (defining positive results as intermediate/elevated risk), 89.2%, 68.5%, 31.9%, and 99.8%.

Conclusion: In large DM and DBT cohorts, AI at either diagnostic threshold achieved high NPV but had higher recall rates than radiologists. Defining positive AI results to include intermediate-risk examinations, versus only elevated-risk examinations, detected additional cancers but yielded markedly increased recall rates.

Clinical Impact: The findings support AI's potential to aid radiologists' workflow efficiency. Yet, strategies are needed to address frequent false-positive results, particularly in the intermediate-risk category.
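To make the reported metrics concrete, the sketch below shows how sensitivity, specificity, recall rate, and NPV are derived from a 2 × 2 table of screening outcomes, and how loosening an AI positivity threshold from elevated-risk-only to intermediate/elevated risk trades a higher recall rate for added sensitivity. The counts are hypothetical, chosen only so the resulting percentages fall in the same ballpark as the DM-cohort AI figures above; they are not the study's data.

```python
# Illustrative sketch with hypothetical counts (not the study's data):
# standard screening metrics from a 2x2 confusion matrix, compared across
# two AI positivity thresholds.

def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Screening metrics from counts of true/false positives and negatives.

    tp: cancers flagged positive, fn: cancers missed,
    fp: non-cancers flagged positive, tn: non-cancers cleared.
    """
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),     # cancers detected / all cancers
        "specificity": tn / (tn + fp),     # normals cleared / all normals
        "recall_rate": (tp + fp) / total,  # fraction of exams called back
        "npv": tn / (tn + fn),             # negatives that are cancer-free
    }

# Hypothetical cohort of 10,000 screens with 50 cancers (0.5% prevalence).
# Threshold A: positive = elevated risk only (fewer call-backs).
strict = screening_metrics(tp=37, fp=1350, tn=8600, fn=13)

# Threshold B: positive = intermediate or elevated risk (more call-backs,
# a few additional cancers detected).
lenient = screening_metrics(tp=47, fp=4100, tn=5850, fn=3)

for name, m in [("elevated only", strict), ("intermediate/elevated", lenient)]:
    print(f"{name:>22}: " + ", ".join(f"{k}={v:.3f}" for k, v in m.items()))
```

Under these hypothetical counts, both thresholds keep NPV above 99.8% because cancer prevalence is low, while the looser threshold roughly triples the recall rate to gain a few percentage points of sensitivity, mirroring the trade-off described in the Conclusion.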
Journal Introduction:
Founded in 1907, the monthly American Journal of Roentgenology (AJR) is the world’s longest continuously published general radiology journal. AJR is recognized as among the specialty’s leading peer-reviewed journals and has a worldwide circulation of close to 25,000. The journal publishes clinically oriented articles across all radiology subspecialties, seeking relevance to radiologists’ daily practice. The journal publishes hundreds of articles annually in a diverse range of formats, including original research, reviews, clinical perspectives, editorials, and other short reports. The journal engages its audience through a spectrum of social media and digital communication activities.