{"title":"Evaluating GPT-4's visual interpretation and clinical reasoning on emergency settings: A 5-year analysis.","authors":"Te-Hao Wang, Jing-Cheng Jheng, Yen-Ting Tseng, Li-Fu Chen, Yu-Chun Chen","doi":"10.1097/JCMA.0000000000001273","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The use of generative AI, particularly large language models such as GPT-4, is expanding in medical education. This study evaluated GPT-4's ability to interpret emergency medicine board exam questions, both text- and image-based, to assess its cognitive and decision-making performance in emergency settings.</p><p><strong>Methods: </strong>An observational study was conducted using Taiwan Emergency Medicine Board Exam questions (2018-2022). GPT-4's performance was assessed in terms of accuracy and reasoning across question types. Statistical analyses examined factors influencing performance, including knowledge dimension, cognitive level, clinical vignette presence, and question polarity.</p><p><strong>Results: </strong>GPT-4 achieved an overall accuracy of 60.1%, with similar results on text-based (60.2%) and image-based questions (59.3%). It showed perfect accuracy in identifying image types (100%) and high proficiency in interpreting findings (86.4%). However, accuracy declined in diagnostic reasoning (83.1%) and further dropped in final decision-making (59.3%). This stepwise decrease highlights GPT-4's difficulty integrating image analysis into clinical conclusions. No significant associations were found between question characteristics and AI performance.</p><p><strong>Conclusion: </strong>GPT-4 demonstrates strong image recognition and moderate diagnostic reasoning but limited decision-making capabilities, especially when synthesizing visual and clinical data. Although promising as a training tool, its reliance on pattern recognition over clinical understanding restricts real-world applicability. Further refinement is needed before AI can reliably support emergency medical decisions.</p>","PeriodicalId":94115,"journal":{"name":"Journal of the Chinese Medical Association : JCMA","volume":" ","pages":"672-680"},"PeriodicalIF":2.4000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Chinese Medical Association : JCMA","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1097/JCMA.0000000000001273","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/28 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Background: The use of generative AI, particularly large language models such as GPT-4, is expanding in medical education. This study evaluated GPT-4's ability to interpret emergency medicine board exam questions, both text- and image-based, to assess its cognitive and decision-making performance in emergency settings.
Methods: An observational study was conducted using Taiwan Emergency Medicine Board Exam questions (2018-2022). GPT-4's performance was assessed in terms of accuracy and reasoning across question types. Statistical analyses examined factors influencing performance, including knowledge dimension, cognitive level, clinical vignette presence, and question polarity.
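As a rough illustration of the kind of analysis described here, the sketch below tests whether one binary question characteristic (clinical vignette presence) is associated with answer correctness using a chi-square test of independence. The abstract does not specify which statistical tests were used or report any counts, so both the test choice and the numbers below are assumptions for illustration only.

```python
# Hypothetical sketch of an association test between a question
# characteristic and GPT-4 correctness. Counts are invented, not study data.
from scipy.stats import chi2_contingency

# Rows: vignette present / absent; columns: GPT-4 correct / incorrect.
table = [
    [120, 80],  # vignette present: 120 correct, 80 incorrect (illustrative)
    [60, 40],   # vignette absent:   60 correct, 40 incorrect (illustrative)
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
# p > 0.05 would indicate no significant association, consistent with
# the null findings reported in the Results below.
```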
Results: GPT-4 achieved an overall accuracy of 60.1%, with similar results on text-based (60.2%) and image-based questions (59.3%). It showed perfect accuracy in identifying image types (100%) and high proficiency in interpreting findings (86.4%). However, accuracy declined in diagnostic reasoning (83.1%) and further dropped in final decision-making (59.3%). This stepwise decrease highlights GPT-4's difficulty integrating image analysis into clinical conclusions. No significant associations were found between question characteristics and AI performance.
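The stepwise percentages above are consistent with a denominator of 59 image-based questions, although the abstract does not state the actual counts. The snippet below simply re-derives the reported accuracies from that assumed denominator, as a worked check of the arithmetic.

```python
# Re-deriving the reported stepwise accuracies from a hypothetical
# denominator of 59 image-based questions (not stated in the abstract).
steps = {
    "identify image type":  (59, 59),  # reported 100%
    "interpret findings":   (51, 59),  # reported 86.4%
    "diagnostic reasoning": (49, 59),  # reported 83.1%
    "final decision":       (35, 59),  # reported 59.3%
}
for step, (correct, total) in steps.items():
    print(f"{step}: {correct}/{total} = {correct / total:.1%}")
```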
Conclusion: GPT-4 demonstrates strong image recognition and moderate diagnostic reasoning but limited decision-making capabilities, especially when synthesizing visual and clinical data. Although GPT-4 shows promise as a training tool, its reliance on pattern recognition over clinical understanding restricts real-world applicability. Further refinement is needed before AI can reliably support emergency medical decisions.