{"title":"评估GPT-4在紧急情况下的视觉解释和临床推理:一项为期五年的分析","authors":"Te-Hao Wang, Jing-Cheng Jheng, Yen-Ting Tseng, Li-Fu Chen, Yu-Chun Chen","doi":"10.1097/JCMA.0000000000001273","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The use of generative AI, particularly large language models such as GPT-4, is expanding in medical education. This study evaluated GPT-4's ability to interpret emergency medicine board exam questions, both text- and image-based, to assess its cognitive and decision-making performance in emergency settings.</p><p><strong>Methods: </strong>An observational study was conducted using Taiwan Emergency Medicine Board Exam questions (2018-2022). GPT-4's performance was assessed in terms of accuracy and reasoning across question types. Statistical analyses examined factors influencing performance, including knowledge dimension, cognitive level, clinical vignette presence, and question polarity.</p><p><strong>Results: </strong>GPT-4 achieved an overall accuracy of 60.1%, with similar results on text-based (60.2%) and image-based questions (59.3%). It showed perfect accuracy in identifying image types (100%) and high proficiency in interpreting findings (86.4%). However, accuracy declined in diagnostic reasoning (83.1%) and further dropped in final decision-making (59.3%). This stepwise decrease highlights GPT-4's difficulty integrating image analysis into clinical conclusions. No significant associations were found between question characteristics and AI performance.</p><p><strong>Conclusion: </strong>GPT-4 demonstrates strong image recognition and moderate diagnostic reasoning but limited decision-making capabilities, especially when synthesizing visual and clinical data. Although promising as a training tool, its reliance on pattern recognition over clinical understanding restricts real-world applicability. Further refinement is needed before AI can reliably support emergency medical decisions.</p>","PeriodicalId":94115,"journal":{"name":"Journal of the Chinese Medical Association : JCMA","volume":" ","pages":"672-680"},"PeriodicalIF":2.4000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating GPT-4's visual interpretation and clinical reasoning on emergency settings: A 5-year analysis.\",\"authors\":\"Te-Hao Wang, Jing-Cheng Jheng, Yen-Ting Tseng, Li-Fu Chen, Yu-Chun Chen\",\"doi\":\"10.1097/JCMA.0000000000001273\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>The use of generative AI, particularly large language models such as GPT-4, is expanding in medical education. This study evaluated GPT-4's ability to interpret emergency medicine board exam questions, both text- and image-based, to assess its cognitive and decision-making performance in emergency settings.</p><p><strong>Methods: </strong>An observational study was conducted using Taiwan Emergency Medicine Board Exam questions (2018-2022). GPT-4's performance was assessed in terms of accuracy and reasoning across question types. Statistical analyses examined factors influencing performance, including knowledge dimension, cognitive level, clinical vignette presence, and question polarity.</p><p><strong>Results: </strong>GPT-4 achieved an overall accuracy of 60.1%, with similar results on text-based (60.2%) and image-based questions (59.3%). 
It showed perfect accuracy in identifying image types (100%) and high proficiency in interpreting findings (86.4%). However, accuracy declined in diagnostic reasoning (83.1%) and further dropped in final decision-making (59.3%). This stepwise decrease highlights GPT-4's difficulty integrating image analysis into clinical conclusions. No significant associations were found between question characteristics and AI performance.</p><p><strong>Conclusion: </strong>GPT-4 demonstrates strong image recognition and moderate diagnostic reasoning but limited decision-making capabilities, especially when synthesizing visual and clinical data. Although promising as a training tool, its reliance on pattern recognition over clinical understanding restricts real-world applicability. Further refinement is needed before AI can reliably support emergency medical decisions.</p>\",\"PeriodicalId\":94115,\"journal\":{\"name\":\"Journal of the Chinese Medical Association : JCMA\",\"volume\":\" \",\"pages\":\"672-680\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2025-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of the Chinese Medical Association : JCMA\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1097/JCMA.0000000000001273\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/7/28 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Chinese Medical Association : JCMA","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1097/JCMA.0000000000001273","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/28 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
Evaluating GPT-4's visual interpretation and clinical reasoning on emergency settings: A 5-year analysis.
Background: The use of generative AI, particularly large language models such as GPT-4, is expanding in medical education. This study evaluated GPT-4's ability to interpret emergency medicine board exam questions, both text- and image-based, to assess its cognitive and decision-making performance in emergency settings.
Methods: An observational study was conducted using Taiwan Emergency Medicine Board Exam questions (2018-2022). GPT-4's performance was assessed in terms of accuracy and reasoning across question types. Statistical analyses examined factors influencing performance, including knowledge dimension, cognitive level, clinical vignette presence, and question polarity.
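The factor analysis described in the Methods can be illustrated with a minimal sketch. It assumes a hypothetical per-question table (board_exam_items.csv with columns correct, knowledge_dimension, cognitive_level, has_vignette, and polarity) and uses a chi-square test of independence as one plausible procedure; the abstract does not specify the exact statistical tests the authors used.

```python
# Minimal sketch of the kind of item-level analysis described in the Methods.
# Not the authors' code: the file name, column names, and the choice of a
# chi-square test are assumptions for illustration only.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("board_exam_items.csv")  # one row per exam question

# Overall accuracy: share of questions GPT-4 answered correctly (0/1 column).
print(f"Overall accuracy: {df['correct'].mean():.1%}")

# Test each question characteristic for association with correctness
# using a chi-square test of independence on the contingency table.
for factor in ["knowledge_dimension", "cognitive_level", "has_vignette", "polarity"]:
    table = pd.crosstab(df[factor], df["correct"])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{factor}: chi2={chi2:.2f}, dof={dof}, p={p:.3f}")
```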
Results: GPT-4 achieved an overall accuracy of 60.1%, with similar results on text-based (60.2%) and image-based questions (59.3%). It showed perfect accuracy in identifying image types (100%) and high proficiency in interpreting findings (86.4%). However, accuracy declined for diagnostic reasoning (83.1%) and dropped further for final decision-making (59.3%). This stepwise decrease highlights GPT-4's difficulty integrating image analysis into clinical conclusions. No significant associations were found between question characteristics and AI performance.
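The stepwise decline on image-based questions can be made concrete with a short sketch that recomputes the drop between consecutive stages from the figures reported above; this is illustrative arithmetic only and introduces no data beyond the abstract.

```python
# Per-stage accuracy on image-based questions, taken from the reported results.
stages = [
    ("identify image type", 1.000),
    ("interpret findings", 0.864),
    ("diagnostic reasoning", 0.831),
    ("final decision-making", 0.593),
]

# Print each stage's accuracy and the percentage-point drop from the prior stage.
prev = None
for name, acc in stages:
    drop = "" if prev is None else f"  (-{(prev - acc) * 100:.1f} pts)"
    print(f"{name:>22}: {acc:.1%}{drop}")
    prev = acc
```

The largest single drop (23.8 percentage points) occurs at the final decision-making stage, which is where the abstract locates GPT-4's main weakness.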
Conclusion: GPT-4 demonstrates strong image recognition and moderate diagnostic reasoning but limited decision-making capabilities, especially when synthesizing visual and clinical data. Although promising as a training tool, its reliance on pattern recognition over clinical understanding restricts real-world applicability. Further refinement is needed before AI can reliably support emergency medical decisions.