Solving Complex Pediatric Surgical Case Studies: A Comparative Analysis of Copilot, ChatGPT-4 and Experienced Pediatric Surgeons' Performance.

Impact Factor 1.5 · CAS Zone 3 (Medicine) · JCR Q2 (Pediatrics)
Richard Gnatzy, Martin Lacher, Michael Berger, Michael Boettcher, Oliver Johannes Deffaa, Joachim Kübler, Omid Madadi-Sanjani, Illya Martynov, Steffi Mayer, Mikko P Pakarinen, Richard Wagner, Tomas Wester, Augusto Zani, Ophelia Aubert
{"title":"Solving Complex Pediatric Surgical Case Studies: A Comparative Analysis of Copilot, ChatGPT-4 and Experienced Pediatric Surgeons' Performance.","authors":"Richard Gnatzy, Martin Lacher, Michael Berger, Michael Boettcher, Oliver Johannes Deffaa, Joachim Kübler, Omid Madadi-Sanjani, Illya Martynov, Steffi Mayer, Mikko P Pakarinen, Richard Wagner, Tomas Wester, Augusto Zani, Ophelia Aubert","doi":"10.1055/a-2551-2131","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>The emergence of large language models (LLMs) has led to notable advancements across multiple sectors, including medicine. Yet, their effect in pediatric surgery remains largely unexplored. This study aims to assess the ability of the AI models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures, primary and differential diagnoses, as well as answer clinical questions using complex clinical case vignettes of classic pediatric surgical diseases.</p><p><strong>Methods: </strong>We conducted the study in April 2024. We evaluated the performance of LLMs using 13 complex clinical case vignettes of pediatric surgical diseases and compared responses to a human cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the diagnostic recommendations of LLMs for completeness and accuracy. To determine differences in performance we performed statistical analyses.</p><p><strong>Results: </strong>ChatGPT-4 achieved a higher test score (52.1%) compared to Copilot (47.9%), but less than pediatric surgeons (68.8%). Overall differences in performance between ChatGPT-4, Copilot, and pediatric surgeons were found to be statistically significant (p <0.01). ChatGPT-4 demonstrated a superior performance in generating differential diagnoses compared to Copilot (p<0.05). No statistically significant differences were found between the AI models regarding suggestions for diagnostics and primary diagnosis. Overall, recommendations of LLMs were rated as average by pediatric surgeons.</p><p><strong>Conclusion: </strong>This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs exhibit potential across various areas, their reliability and accuracy in handling clinical decision-making tasks is limited. Further research is needed to improve AI capabilities and establish its usefulness in the clinical setting.</p>","PeriodicalId":56316,"journal":{"name":"European Journal of Pediatric Surgery","volume":" ","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Pediatric Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/a-2551-2131","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PEDIATRICS","Score":null,"Total":0}
Citations: 0

Abstract

Introduction: The emergence of large language models (LLMs) has led to notable advances across multiple sectors, including medicine. Yet their impact on pediatric surgery remains largely unexplored. This study aims to assess the ability of the AI models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures, primary and differential diagnoses, and to answer clinical questions using complex clinical case vignettes of classic pediatric surgical diseases.

Methods: The study was conducted in April 2024. We evaluated the performance of both LLMs on 13 complex clinical case vignettes of pediatric surgical diseases and compared their responses with those of a cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the LLMs' diagnostic recommendations for completeness and accuracy. Statistical analyses were performed to determine differences in performance.
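
The abstract does not describe a programmatic pipeline; the vignettes were presumably entered into the chat interfaces of ChatGPT-4 and Copilot and the answers rated manually. Purely as an illustrative sketch of such an evaluation loop, the Python fragment below assumes a hypothetical query_llm() wrapper (not part of the study's methods) that submits one structured prompt per case and collects the free-text answer for later rating:

```python
# Illustrative sketch only: the paper does not describe a programmatic pipeline,
# and query_llm() is a hypothetical placeholder, not part of the study's methods.
from dataclasses import dataclass, field

@dataclass
class Vignette:
    case_id: int
    text: str                                            # complex clinical case description
    questions: list[str] = field(default_factory=list)   # clinical questions for this case

def query_llm(prompt: str) -> str:
    """Placeholder for a call to ChatGPT-4 or Microsoft Copilot (chat UI or API)."""
    raise NotImplementedError("Submit the prompt to the model of interest here.")

def collect_answers(vignettes: list[Vignette]) -> dict[int, str]:
    """Submit one prompt per case and keep the free-text answer for manual rating."""
    answers: dict[int, str] = {}
    for v in vignettes:
        prompt = (
            f"{v.text}\n\n"
            "Please state the recommended diagnostic work-up, the most likely primary "
            "diagnosis, and relevant differential diagnoses, then answer the following:\n"
            + "\n".join(f"- {q}" for q in v.questions)
        )
        answers[v.case_id] = query_llm(prompt)
    return answers
```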

Results: ChatGPT-4 achieved a higher test score (52.1%) than Copilot (47.9%), but a lower score than the pediatric surgeons (68.8%). Overall differences in performance between ChatGPT-4, Copilot, and the pediatric surgeons were statistically significant (p < 0.01). ChatGPT-4 outperformed Copilot in generating differential diagnoses (p < 0.05). No statistically significant differences were found between the two AI models regarding suggested diagnostic procedures and primary diagnoses. Overall, the pediatric surgeons rated the LLMs' recommendations as average.
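
The abstract does not name the statistical tests used. As a minimal sketch, assuming a Kruskal-Wallis test across the three groups followed by a pairwise Mann-Whitney U test between the two LLMs (an assumption made here for illustration, not taken from the paper), and using fabricated placeholder scores rather than the study's data:

```python
# Minimal sketch of a between-group comparison. The abstract does not name the tests used;
# Kruskal-Wallis plus a pairwise Mann-Whitney U test is an assumption made for illustration.
from scipy import stats

# Fabricated placeholder scores per vignette (NOT the study's data), one value per case.
chatgpt4_scores = [55, 48, 60, 42, 50, 58, 45, 53, 49, 57, 51, 62, 47]
copilot_scores  = [50, 44, 52, 40, 47, 55, 43, 49, 46, 50, 45, 56, 45]
surgeon_scores  = [70, 65, 74, 60, 68, 72, 63, 69, 66, 71, 67, 75, 64]

# Omnibus test across the three groups.
h_stat, p_overall = stats.kruskal(chatgpt4_scores, copilot_scores, surgeon_scores)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_overall:.4f}")

# Pairwise follow-up between the two LLMs (e.g., for the differential-diagnosis subtask).
u_stat, p_pair = stats.mannwhitneyu(chatgpt4_scores, copilot_scores, alternative="two-sided")
print(f"ChatGPT-4 vs Copilot: U = {u_stat:.1f}, p = {p_pair:.4f}")
```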

Conclusion: This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs show potential across various areas, their reliability and accuracy in clinical decision-making tasks are limited. Further research is needed to improve AI capabilities and to establish their usefulness in the clinical setting.

Source journal
CiteScore: 3.90
Self-citation rate: 5.60%
Articles published: 66
Review time: 6-12 weeks
Journal description: This broad-based international journal updates you on vital developments in pediatric surgery through original articles, abstracts of the literature, and meeting announcements. You will find state-of-the-art information on: abdominal and thoracic surgery, neurosurgery, urology, gynecology, oncology, orthopaedics, traumatology, anesthesiology, child pathology, embryology, and morphology. Written by surgeons, physicians, anesthesiologists, radiologists, and others involved in the surgical care of neonates, infants, and children, the EJPS is an indispensable resource for all specialists.