Alejandro García-Rudolph, Elena Hernández-Pena, Nuria Del Cacho, Claudia Teixido-Font, Marc Navarro-Berenguel, Eloy Opisso
Revista Española de Enfermedades Digestivas. Journal Article. Published 2025-09-29. DOI: 10.17235/reed.2025.11369/2025 (https://doi.org/10.17235/reed.2025.11369/2025)
Evaluation of the clinical reasoning of GPT-4o, a multimodal generative artificial intelligence model, in 18 public gastroenterology case studies.
Introduction and aim: Although generative language models have been extensively studied in the field of digestive diseases, further progress requires addressing underexplored aspects such as linguistic bias, the evaluation of clinical reasoning underlying model responses, and the use of realistic clinical material in non-English-speaking contexts. The aim of this study was to evaluate the accuracy of GPT-4o in answering clinical questions in Spanish and to qualitatively analyze its errors.
Methods: We used the most recent official board examination for Specialist in Gastroenterology (Spain, 2023), focusing on its practical section, which includes 18 real clinical cases described through text and images, totaling 50 multiple-choice questions (200 options in total). Forty-nine valid questions were analyzed, excluding one withdrawn by the organizing committee. GPT-4o answered 39 of the 49 questions correctly (79.6%). No significant differences were observed between questions with clinical images (22/29 correct, 75.9%) and those without images (17/20 correct, 85.0%).
Results: Ten of the 49 answers (20.4%) were incorrect. In those cases, the model was prompted to provide its reasoning, which was then qualitatively analyzed by a team of experts. Errors were associated with inappropriate therapeutic generalizations, confusion regarding diagnostic or therapeutic sequencing, poor integration of contextual information, unawareness of contraindications, and omission of key temporal criteria in clinical decision-making.
Conclusions: Clinical images did not increase the error rate; however, the observed failures revealed that the model tends to omit information already provided (such as clinical context or temporal criteria), thereby compromising the quality of its reasoning.
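The counts reported in the abstract can be checked with a short script. The sketch below recomputes the accuracy figures and compares the two question groups with a two-sided Fisher exact test. The abstract does not state which statistical test the authors used, so the choice of Fisher's test is an assumption made here for illustration, and the function names are hypothetical.

```python
from math import comb


def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly."""
    return correct / total


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Counts taken from the abstract.
overall = accuracy(39, 49)          # all valid questions
with_images = accuracy(22, 29)      # questions with clinical images
without_images = accuracy(17, 20)   # questions without images

# Rows: with / without images. Columns: correct / incorrect.
# ASSUMPTION: Fisher's exact test; the source does not name the test used.
p_value = fisher_exact_two_sided(22, 7, 17, 3)

print(f"Overall accuracy:   {overall:.1%}")
print(f"With images:        {with_images:.1%}")
print(f"Without images:     {without_images:.1%}")
print(f"Fisher exact p (two-sided): {p_value:.3f}")
```

Under this assumed test the p-value is well above 0.05, which is consistent with the abstract's statement that the two groups did not differ significantly. With only 49 questions, the comparison has limited power to detect a moderate difference.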
About the journal:
The Revista Española de Enfermedades Digestivas, the official journal of the Sociedad Española de Patología Digestiva (SEPD), the Sociedad Española de Endoscopia Digestiva (SEED), and the Asociación Española de Ecografía Digestiva (AEED), publishes original articles, editorials, reviews, clinical cases, letters to the editor, images in digestive pathology, and other special articles on all aspects of digestive diseases.