Evaluation of ChatGPT-4o in Breast Cancer Screening: Insights from the 5th Edition BI-RADS Atlas and ACR Guidelines.

Bilgen Mehpare Özer, Eda Nur Korkmaz

Journal of Imaging Informatics in Medicine. Published 2025-09-12. DOI: 10.1007/s10278-025-01663-8 (https://doi.org/10.1007/s10278-025-01663-8)

Abstract

The aim of this study is to evaluate the potential, reliability, and limitations of ChatGPT-4o in answering text-based questions, and its effectiveness in clinical decision support processes, based on the 5th edition of the BI-RADS Atlas and the ACR breast cancer screening guidelines. A total of 100 questions (50 multiple-choice and 50 true/false) prepared by two radiologists were submitted to ChatGPT-4o between November 5 and 19. The answers provided by ChatGPT-4o were evaluated at baseline and 14 days later by both radiologists for accuracy and comprehensiveness using a Likert scale. Group comparisons were performed using the Mann-Whitney U and Wilcoxon tests; response consistency was evaluated with Cohen's Kappa, and the difference in overall accuracy with a two-proportion z-test. The increase in overall accuracy from 86% to 95% was statistically significant according to the two-proportion z-test (p = .030). Comparisons between the two sessions revealed statistically significant increases in the accuracy (p = .013, r = .35, 95% CI [0.09, 0.61]) and comprehensiveness (p = .014, r = .35, 95% CI [0.09, 0.61]) ratings for true/false questions. For multiple-choice questions, by contrast, neither accuracy (p = .180, r = .19, 95% CI [-0.09, 0.47]) nor comprehensiveness (p = .180, r = .19, 95% CI [-0.09, 0.47]) differed significantly between sessions. In addition, group comparisons evaluating the effect of question format on performance revealed no significant difference in accuracy (p = .661, r = -0.04, 95% CI [-0.23, 0.16]) or comprehensiveness (p = .708, r = -0.04, 95% CI [-0.23, 0.16]). The consistency of ChatGPT-4o responses was supported by Cohen's Kappa coefficients, all statistically significant (p < .001), with 95% confidence intervals ranging from -.038 to 1.084. ChatGPT-4o demonstrated remarkable performance in answering multiple-choice and true/false questions, with overall accuracy improving from 86% in the first test to 95% after 14 days. ChatGPT-4o holds significant potential as a clinical decision support tool for healthcare professionals.
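
The overall-accuracy comparison reported above can be roughly reproduced with a short sketch. It assumes that the 86% and 95% figures correspond to 86 and 95 correct answers out of 100 questions per session, and that a pooled, two-sided two-proportion z-test was used; the abstract does not state the exact variant, so these are assumptions, and the function name `two_proportion_z_test` is illustrative only. Note that this test treats the two sessions as independent samples, even though the same questions were asked twice.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-sided two-proportion z-test; returns (z, p_value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal accuracy.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts: 86/100 correct at baseline, 95/100 after 14 days.
z, p = two_proportion_z_test(86, 100, 95, 100)
print(f"z = {z:.2f}, p = {p:.3f}")  # z = 2.17, p = 0.030
```

Under these assumptions the sketch yields a p-value that rounds to the reported .030, which suggests the reported result is consistent with a pooled two-sided test on 100 questions per session.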
