Evaluation of the clinical reasoning of GPT-4o, a multimodal generative artificial intelligence model, in 18 public gastroenterology case studies.

IF 4.0 | Zone 4, Medicine | JCR Q2, Gastroenterology & Hepatology
Alejandro García-Rudolph, Elena Hernández-Pena, Nuria Del Cacho, Claudia Teixido-Font, Marc Navarro-Berenguel, Eloy Opisso
Revista Española de Enfermedades Digestivas. Journal Article. Published 2025-09-29. DOI: 10.17235/reed.2025.11369/2025 (https://doi.org/10.17235/reed.2025.11369/2025)

Abstract

Introduction and aim: Although generative language models have been extensively studied in the field of digestive diseases, further progress requires addressing underexplored aspects such as linguistic bias, the evaluation of clinical reasoning underlying model responses, and the use of realistic clinical material in non-English-speaking contexts. The aim of this study was to evaluate the accuracy of GPT-4o in answering clinical questions in Spanish and to qualitatively analyze its errors.

Methods: We used the most recent official board examination for Specialist in Gastroenterology (Spain, 2023), focusing on its practical section, which includes 18 real clinical cases described through text and images, totaling 50 multiple-choice questions (200 options in total). Forty-nine valid questions were analyzed, excluding one withdrawn by the organizing committee. GPT-4o answered 39 questions correctly (79.6%). No significant differences were observed between questions with clinical images (22/29 correct, 75.9%) and those without images (17/20 correct, 85.0%).
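The counts reported above can be checked with a few lines of arithmetic. The sketch below recomputes the accuracy figures and compares the two groups with a two-sided Fisher's exact test. The abstract does not say which statistical test the authors used, so the choice of Fisher's exact test is an assumption made here for illustration, and the function names `accuracy` and `fisher_exact_two_sided` are hypothetical helpers, not anything taken from the study.

```python
from math import comb


def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly."""
    return correct / total


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumption: the source does not name its test; this one is illustrative.
    Rows are groups, columns are (correct, incorrect).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def weight(k: int) -> int:
        # Unnormalized hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Sum every table that is at most as likely as the observed one.
    # Integer weights avoid floating-point ties.
    extreme = sum(
        weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed
    )
    return extreme / comb(n, col1)


# Counts taken from the abstract.
overall = accuracy(39, 49)
with_images = accuracy(22, 29)
without_images = accuracy(17, 20)
p_value = fisher_exact_two_sided(22, 29 - 22, 17, 20 - 17)

print(f"overall: {overall:.1%}")
print(f"with images: {with_images:.1%}, without images: {without_images:.1%}")
print(f"two-sided Fisher exact p-value: {p_value:.3f}")
```

Under this assumed test, the p-value comes out at roughly 0.5, far above the conventional 0.05 threshold, which is consistent with the statement that image and non-image questions did not differ significantly.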

Results: Ten of the 49 answers (20.4%) were incorrect. In those cases, the model was prompted to provide its reasoning, which was then qualitatively analyzed by a team of experts. Errors were associated with inappropriate therapeutic generalizations, confusion regarding diagnostic or therapeutic sequencing, poor integration of contextual information, failure to recognize contraindications, and omission of key temporal criteria in clinical decision-making.

Conclusions: Clinical images did not increase the error rate; however, the observed failures revealed that the model tends to omit information already provided (such as clinical context or temporal criteria), thereby compromising the quality of its reasoning.

Source journal
CiteScore: 2.00
Self-citation rate: 25.00%
Articles published: 400
Review time: 6-12 weeks
About the journal: The Revista Española de Enfermedades Digestivas, the official journal of the Sociedad Española de Patología Digestiva (SEPD), the Sociedad Española de Endoscopia Digestiva (SEED), and the Asociación Española de Ecografía Digestiva (AEED), publishes original articles, editorials, reviews, clinical cases, letters to the editor, images in digestive pathology, and other special articles on all aspects of digestive diseases.