Evaluating GPT-4's visual interpretation and clinical reasoning on emergency settings: A 5-year analysis

Impact Factor: 2.4
Te-Hao Wang, Jing-Cheng Jheng, Yen-Ting Tseng, Li-Fu Chen, Yu-Chun Chen
{"title":"评估GPT-4在紧急情况下的视觉解释和临床推理:一项为期五年的分析","authors":"Te-Hao Wang, Jing-Cheng Jheng, Yen-Ting Tseng, Li-Fu Chen, Yu-Chun Chen","doi":"10.1097/JCMA.0000000000001273","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The use of generative AI, particularly large language models such as GPT-4, is expanding in medical education. This study evaluated GPT-4's ability to interpret emergency medicine board exam questions, both text- and image-based, to assess its cognitive and decision-making performance in emergency settings.</p><p><strong>Methods: </strong>An observational study was conducted using Taiwan Emergency Medicine Board Exam questions (2018-2022). GPT-4's performance was assessed in terms of accuracy and reasoning across question types. Statistical analyses examined factors influencing performance, including knowledge dimension, cognitive level, clinical vignette presence, and question polarity.</p><p><strong>Results: </strong>GPT-4 achieved an overall accuracy of 60.1%, with similar results on text-based (60.2%) and image-based questions (59.3%). It showed perfect accuracy in identifying image types (100%) and high proficiency in interpreting findings (86.4%). However, accuracy declined in diagnostic reasoning (83.1%) and further dropped in final decision-making (59.3%). This stepwise decrease highlights GPT-4's difficulty integrating image analysis into clinical conclusions. No significant associations were found between question characteristics and AI performance.</p><p><strong>Conclusion: </strong>GPT-4 demonstrates strong image recognition and moderate diagnostic reasoning but limited decision-making capabilities, especially when synthesizing visual and clinical data. Although promising as a training tool, its reliance on pattern recognition over clinical understanding restricts real-world applicability. Further refinement is needed before AI can reliably support emergency medical decisions.</p>","PeriodicalId":94115,"journal":{"name":"Journal of the Chinese Medical Association : JCMA","volume":" ","pages":"672-680"},"PeriodicalIF":2.4000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating GPT-4's visual interpretation and clinical reasoning on emergency settings: A 5-year analysis.\",\"authors\":\"Te-Hao Wang, Jing-Cheng Jheng, Yen-Ting Tseng, Li-Fu Chen, Yu-Chun Chen\",\"doi\":\"10.1097/JCMA.0000000000001273\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>The use of generative AI, particularly large language models such as GPT-4, is expanding in medical education. This study evaluated GPT-4's ability to interpret emergency medicine board exam questions, both text- and image-based, to assess its cognitive and decision-making performance in emergency settings.</p><p><strong>Methods: </strong>An observational study was conducted using Taiwan Emergency Medicine Board Exam questions (2018-2022). GPT-4's performance was assessed in terms of accuracy and reasoning across question types. Statistical analyses examined factors influencing performance, including knowledge dimension, cognitive level, clinical vignette presence, and question polarity.</p><p><strong>Results: </strong>GPT-4 achieved an overall accuracy of 60.1%, with similar results on text-based (60.2%) and image-based questions (59.3%). 
It showed perfect accuracy in identifying image types (100%) and high proficiency in interpreting findings (86.4%). However, accuracy declined in diagnostic reasoning (83.1%) and further dropped in final decision-making (59.3%). This stepwise decrease highlights GPT-4's difficulty integrating image analysis into clinical conclusions. No significant associations were found between question characteristics and AI performance.</p><p><strong>Conclusion: </strong>GPT-4 demonstrates strong image recognition and moderate diagnostic reasoning but limited decision-making capabilities, especially when synthesizing visual and clinical data. Although promising as a training tool, its reliance on pattern recognition over clinical understanding restricts real-world applicability. Further refinement is needed before AI can reliably support emergency medical decisions.</p>\",\"PeriodicalId\":94115,\"journal\":{\"name\":\"Journal of the Chinese Medical Association : JCMA\",\"volume\":\" \",\"pages\":\"672-680\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2025-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of the Chinese Medical Association : JCMA\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1097/JCMA.0000000000001273\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/7/28 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Chinese Medical Association : JCMA","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1097/JCMA.0000000000001273","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/28 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background: The use of generative AI, particularly large language models such as GPT-4, is expanding in medical education. This study evaluated GPT-4's ability to interpret emergency medicine board exam questions, both text- and image-based, to assess its cognitive and decision-making performance in emergency settings.

Methods: An observational study was conducted using Taiwan Emergency Medicine Board Exam questions (2018-2022). GPT-4's performance was assessed in terms of accuracy and reasoning across question types. Statistical analyses examined factors influencing performance, including knowledge dimension, cognitive level, clinical vignette presence, and question polarity.
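The abstract does not specify which statistical tests were used. As a rough sketch of how such an association analysis could be run, assuming one record per exam question and a chi-square test of independence (both the data layout and the choice of test are assumptions, not details from the paper):

```python
# Illustrative sketch only: column names, example rows, and the chi-square
# test are assumptions; the study does not publish its analysis code.
import pandas as pd
from scipy.stats import chi2_contingency

# One row per exam question: its characteristics and whether GPT-4 answered correctly.
questions = pd.DataFrame({
    "cognitive_level": ["recall", "application", "application", "recall", "analysis", "recall"],
    "has_vignette":    [True, True, False, False, True, False],
    "correct":         [True, False, True, True, False, True],
})

# Test each question characteristic for an association with GPT-4 correctness.
for factor in ["cognitive_level", "has_vignette"]:
    table = pd.crosstab(questions[factor], questions["correct"])
    chi2, p_value, dof, _ = chi2_contingency(table)
    print(f"{factor}: chi2={chi2:.2f}, dof={dof}, p={p_value:.3f}")
```

With per-question records in this shape, accuracy by question type (text vs. image) is a simple group-wise mean of the `correct` column.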

Results: GPT-4 achieved an overall accuracy of 60.1%, with similar results on text-based (60.2%) and image-based questions (59.3%). It showed perfect accuracy in identifying image types (100%) and high proficiency in interpreting findings (86.4%). However, accuracy declined in diagnostic reasoning (83.1%) and further dropped in final decision-making (59.3%). This stepwise decrease highlights GPT-4's difficulty integrating image analysis into clinical conclusions. No significant associations were found between question characteristics and AI performance.
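The stepwise accuracies above imply each image-based question was graded at successive interpretation stages. A minimal sketch of that tabulation, with hypothetical stage names and grading records, might look like this:

```python
# Minimal sketch: stage labels and grades are invented for illustration,
# not taken from the study's data.
from collections import defaultdict

# Each image-based question is graded at four successive stages.
grades = [
    {"identify_type": True, "interpret_findings": True,  "diagnostic_reasoning": True,  "final_decision": True},
    {"identify_type": True, "interpret_findings": True,  "diagnostic_reasoning": True,  "final_decision": False},
    {"identify_type": True, "interpret_findings": False, "diagnostic_reasoning": False, "final_decision": False},
]

correct = defaultdict(int)
for record in grades:
    for stage, ok in record.items():
        correct[stage] += ok  # True counts as 1

for stage in ["identify_type", "interpret_findings", "diagnostic_reasoning", "final_decision"]:
    print(f"{stage}: {correct[stage] / len(grades):.1%}")
```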

Conclusion: GPT-4 demonstrates strong image recognition and moderate diagnostic reasoning but limited decision-making capabilities, especially when synthesizing visual and clinical data. Although promising as a training tool, its reliance on pattern recognition over clinical understanding restricts real-world applicability. Further refinement is needed before AI can reliably support emergency medical decisions.
