Cary J G Oberije, Rachel Currie, Alice Leaver, Alan Redman, William Teh, Nisha Sharma, Georgia Fox, Ben Glocker, Galvin Khara, Jonathan Nash, Annie Y Ng, Peter D Kecskemethy
BMJ Health & Care Informatics, vol. 32, no. 1. Published 2025-05-14. DOI: 10.1136/bmjhci-2024-101318 (https://doi.org/10.1136/bmjhci-2024-101318). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12083354/pdf/
Assessing artificial intelligence in breast screening with stratified results on 306 839 mammograms across geographic regions, age, breast density and ethnicity: A Retrospective Investigation Evaluating Screening (ARIES) study.
Objectives: Evaluate an Artificial Intelligence (AI) system in breast screening through stratified results across age, breast density, ethnicity and screening centres, from different UK regions.
Methods: A large-scale retrospective study was conducted to evaluate two variations of using AI as an independent second reader in double reading. Results were stratified for clinical and operational metrics. Data came from 306 839 mammography cases screened between 2017 and 2021 in three different UK regions. The impact on safety and effectiveness was assessed using clinical metrics: cancer detection rate and positive predictive value, stratified according to age, breast density and ethnicity. Operational impact was assessed through reading workload and recall rate, measured overall and per centre. Non-inferiority was tested for AI workflows compared with human double reading, and when passed, superiority was tested. AI interval cancer (IC) flag rate was assessed to estimate the additional cancer detection opportunity with AI that cannot be assessed retrospectively.
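The metrics and the non-inferiority check named in the Methods can be sketched as follows. This is a minimal illustration only: the abstract does not state the margins, the test statistic, or how paired readings were handled, so the counts, the absolute margin, the `ScreeningCounts` class and the unpaired Wald interval below are all assumptions made for the example, not the study's actual procedure.

```python
import math
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Hypothetical aggregate counts for one workflow in one subgroup."""
    screened: int  # screening cases read
    recalled: int  # cases recalled for further assessment
    cancers: int   # screen-detected cancers among the recalled cases


def cancer_detection_rate(c: ScreeningCounts) -> float:
    """Screen-detected cancers per 1000 cases screened."""
    return 1000.0 * c.cancers / c.screened


def recall_rate(c: ScreeningCounts) -> float:
    """Fraction of screened cases that were recalled."""
    return c.recalled / c.screened


def positive_predictive_value(c: ScreeningCounts) -> float:
    """Fraction of recalled cases in which cancer was found."""
    return c.cancers / c.recalled


def non_inferior(p_new: float, n_new: int, p_ref: float, n_ref: int,
                 margin: float, z: float = 1.96) -> bool:
    """Non-inferiority check for a metric where higher is better.

    Assumption: an unpaired Wald interval for the difference of two
    proportions. The new workflow is called non-inferior when the lower
    confidence bound of (new - reference) lies above -margin.
    """
    diff = p_new - p_ref
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_ref * (1 - p_ref) / n_ref)
    return diff - z * se > -margin


# Made-up counts, chosen only to exercise the functions.
human = ScreeningCounts(screened=10_000, recalled=400, cancers=80)
ai = ScreeningCounts(screened=10_000, recalled=390, cancers=81)

ppv_ok = non_inferior(
    positive_predictive_value(ai), ai.recalled,
    positive_predictive_value(human), human.recalled,
    margin=0.10,  # hypothetical absolute margin
)
print(cancer_detection_rate(human), cancer_detection_rate(ai), ppv_ok)
```

For a metric where lower is better, such as recall rate, the direction of the comparison would be reversed. Because the same cases are read under both workflows, a paired analysis would normally be more appropriate than the unpaired interval used here.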
Results: The AI workflows passed non-inferiority or superiority tests for every metric across all subgroups, with workload savings between 38.3% and 43.7%. Standalone AI flagged 41.2% of ICs overall, ranging from 33.3% to 46.8% across subgroups, with the highest flag rate for dense breasts.
Discussion: Human double reading and AI workflows showed the same performance disparities across subgroups. The AI integrations maintained or improved performance on all metrics for all subgroups while achieving significant workload reduction. Moreover, complementing these integrations with AI as an additional reader can improve cancer detection.
Conclusion: The granularity of assessment showed that screening with the AI-system integrations was as safe as standard double reading across heterogeneous populations.