Solving Complex Pediatric Surgical Case Studies: A Comparative Analysis of Copilot, ChatGPT-4 and Experienced Pediatric Surgeons' Performance.

Impact Factor 1.5 · CAS Zone 3 (Medicine) · JCR Q2 (Pediatrics)
Richard Gnatzy, Martin Lacher, Michael Berger, Michael Boettcher, Oliver Johannes Deffaa, Joachim Kübler, Omid Madadi-Sanjani, Illya Martynov, Steffi Mayer, Mikko P Pakarinen, Richard Wagner, Tomas Wester, Augusto Zani, Ophelia Aubert
{"title":"Solving Complex Pediatric Surgical Case Studies: A Comparative Analysis of Copilot, ChatGPT-4 and Experienced Pediatric Surgeons' Performance.","authors":"Richard Gnatzy, Martin Lacher, Michael Berger, Michael Boettcher, Oliver Johannes Deffaa, Joachim Kübler, Omid Madadi-Sanjani, Illya Martynov, Steffi Mayer, Mikko P Pakarinen, Richard Wagner, Tomas Wester, Augusto Zani, Ophelia Aubert","doi":"10.1055/a-2551-2131","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>The emergence of large language models (LLMs) has led to notable advancements across multiple sectors, including medicine. Yet, their effect in pediatric surgery remains largely unexplored. This study aims to assess the ability of the AI models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures, primary and differential diagnoses, as well as answer clinical questions using complex clinical case vignettes of classic pediatric surgical diseases.</p><p><strong>Methods: </strong>We conducted the study in April 2024. We evaluated the performance of LLMs using 13 complex clinical case vignettes of pediatric surgical diseases and compared responses to a human cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the diagnostic recommendations of LLMs for completeness and accuracy. To determine differences in performance we performed statistical analyses.</p><p><strong>Results: </strong>ChatGPT-4 achieved a higher test score (52.1%) compared to Copilot (47.9%), but less than pediatric surgeons (68.8%). Overall differences in performance between ChatGPT-4, Copilot, and pediatric surgeons were found to be statistically significant (p <0.01). ChatGPT-4 demonstrated a superior performance in generating differential diagnoses compared to Copilot (p<0.05). No statistically significant differences were found between the AI models regarding suggestions for diagnostics and primary diagnosis. Overall, recommendations of LLMs were rated as average by pediatric surgeons.</p><p><strong>Conclusion: </strong>This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs exhibit potential across various areas, their reliability and accuracy in handling clinical decision-making tasks is limited. Further research is needed to improve AI capabilities and establish its usefulness in the clinical setting.</p>","PeriodicalId":56316,"journal":{"name":"European Journal of Pediatric Surgery","volume":" ","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Pediatric Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/a-2551-2131","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PEDIATRICS","Score":null,"Total":0}
Citations: 0

Abstract

Introduction: The emergence of large language models (LLMs) has led to notable advances across multiple sectors, including medicine. Yet their impact on pediatric surgery remains largely unexplored. This study aims to assess the ability of the AI models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures, primary and differential diagnoses, and to answer clinical questions using complex clinical case vignettes of classic pediatric surgical diseases.

Methods: The study was conducted in April 2024. We evaluated the performance of both LLMs on 13 complex clinical case vignettes of pediatric surgical diseases and compared their responses with those of a cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the LLMs' diagnostic recommendations for completeness and accuracy. Statistical analyses were performed to determine differences in performance.
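
The abstract does not describe a programmatic pipeline; the vignettes were presumably entered into the chat interfaces of ChatGPT-4 and Copilot and the answers rated manually. Purely as an illustrative sketch of such an evaluation loop, the Python fragment below assumes a hypothetical query_llm() wrapper (not part of the study's methods) that submits one structured prompt per case and collects the free-text answer for later rating:

```python
# Illustrative sketch only: the paper does not describe a programmatic pipeline,
# and query_llm() is a hypothetical placeholder, not part of the study's methods.
from dataclasses import dataclass, field

@dataclass
class Vignette:
    case_id: int
    text: str                                            # complex clinical case description
    questions: list[str] = field(default_factory=list)   # clinical questions for this case

def query_llm(prompt: str) -> str:
    """Placeholder for a call to ChatGPT-4 or Microsoft Copilot (chat UI or API)."""
    raise NotImplementedError("Submit the prompt to the model of interest here.")

def collect_answers(vignettes: list[Vignette]) -> dict[int, str]:
    """Submit one prompt per case and keep the free-text answer for manual rating."""
    answers: dict[int, str] = {}
    for v in vignettes:
        prompt = (
            f"{v.text}\n\n"
            "Please state the recommended diagnostic work-up, the most likely primary "
            "diagnosis, and relevant differential diagnoses, then answer the following:\n"
            + "\n".join(f"- {q}" for q in v.questions)
        )
        answers[v.case_id] = query_llm(prompt)
    return answers
```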

Results: ChatGPT-4 achieved a higher test score (52.1%) than Copilot (47.9%), but a lower score than the pediatric surgeons (68.8%). Overall differences in performance between ChatGPT-4, Copilot, and the pediatric surgeons were statistically significant (p < 0.01). ChatGPT-4 outperformed Copilot in generating differential diagnoses (p < 0.05). No statistically significant differences were found between the two AI models regarding suggested diagnostic procedures and primary diagnoses. Overall, the pediatric surgeons rated the LLMs' recommendations as average.
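
The abstract does not name the statistical tests used. As a minimal sketch, assuming a Kruskal-Wallis test across the three groups followed by a pairwise Mann-Whitney U test between the two LLMs (an assumption made here for illustration, not taken from the paper), and using fabricated placeholder scores rather than the study's data:

```python
# Minimal sketch of a between-group comparison. The abstract does not name the tests used;
# Kruskal-Wallis plus a pairwise Mann-Whitney U test is an assumption made for illustration.
from scipy import stats

# Fabricated placeholder scores per vignette (NOT the study's data), one value per case.
chatgpt4_scores = [55, 48, 60, 42, 50, 58, 45, 53, 49, 57, 51, 62, 47]
copilot_scores  = [50, 44, 52, 40, 47, 55, 43, 49, 46, 50, 45, 56, 45]
surgeon_scores  = [70, 65, 74, 60, 68, 72, 63, 69, 66, 71, 67, 75, 64]

# Omnibus test across the three groups.
h_stat, p_overall = stats.kruskal(chatgpt4_scores, copilot_scores, surgeon_scores)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_overall:.4f}")

# Pairwise follow-up between the two LLMs (e.g., for the differential-diagnosis subtask).
u_stat, p_pair = stats.mannwhitneyu(chatgpt4_scores, copilot_scores, alternative="two-sided")
print(f"ChatGPT-4 vs Copilot: U = {u_stat:.1f}, p = {p_pair:.4f}")
```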

Conclusion: This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs show potential across various areas, their reliability and accuracy in clinical decision-making tasks are limited. Further research is needed to improve AI capabilities and to establish their usefulness in the clinical setting.

Source journal
CiteScore: 3.90
Self-citation rate: 5.60%
Articles published: 66
Review time: 6-12 weeks
Journal description: This broad-based international journal updates you on vital developments in pediatric surgery through original articles, abstracts of the literature, and meeting announcements. You will find state-of-the-art information on: abdominal and thoracic surgery, neurosurgery, urology, gynecology, oncology, orthopaedics, traumatology, anesthesiology, child pathology, embryology, and morphology. Written by surgeons, physicians, anesthesiologists, radiologists, and others involved in the surgical care of neonates, infants, and children, the EJPS is an indispensable resource for all specialists.