Solving Complex Pediatric Surgical Case Studies: A Comparative Analysis of Copilot, ChatGPT-4, and Experienced Pediatric Surgeons' Performance.

IF 1.4 · CAS Tier 3 (Medicine) · JCR Q2 (PEDIATRICS)
European Journal of Pediatric Surgery, pp. 382-389. Pub Date: 2025-10-01 (Epub: 2025-03-05). DOI: 10.1055/a-2551-2131
Richard Gnatzy, Martin Lacher, Michael Berger, Michael Boettcher, Oliver J Deffaa, Joachim Kübler, Omid Madadi-Sanjani, Illya Martynov, Steffi Mayer, Mikko P Pakarinen, Richard Wagner, Tomas Wester, Augusto Zani, Ophelia Aubert
Citations: 0

Abstract


The emergence of large language models (LLMs) has led to notable advancements across multiple sectors, including medicine. Yet their impact on pediatric surgery remains largely unexplored. This study aimed to assess the ability of the artificial intelligence (AI) models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures, primary and differential diagnoses, and to answer clinical questions using complex clinical case vignettes of classic pediatric surgical diseases.

Methods: We conducted the study in April 2024. We evaluated the performance of the LLMs on 13 complex clinical case vignettes of pediatric surgical diseases and compared their responses with those of a cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the diagnostic recommendations of the LLMs for completeness and accuracy. To determine differences in performance, we performed statistical analyses.

Results: ChatGPT-4 achieved a higher test score (52.1%) than Copilot (47.9%), but a lower score than the pediatric surgeons (68.8%). The overall differences in performance between ChatGPT-4, Copilot, and the pediatric surgeons were statistically significant (p < 0.01). ChatGPT-4 outperformed Copilot in generating differential diagnoses (p < 0.05). No statistically significant differences were found between the two AI models regarding suggested diagnostics and primary diagnoses. Overall, the pediatric surgeons rated the recommendations of the LLMs as average.

Conclusion: This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs show potential across various areas, their reliability and accuracy in clinical decision-making tasks are limited. Further research is needed to improve AI capabilities and establish their usefulness in the clinical setting.
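The abstract reports statistically significant score differences between the models and the surgeons but does not publish the raw per-vignette scores or name the exact statistical test used. As a purely illustrative sketch of this kind of group comparison, the following runs a two-sided permutation test on a difference in mean scores; the 13-item group size matches the study's vignette count, but all score values below are invented for demonstration and are not the study's data.

```python
import random
import statistics

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Repeatedly shuffles the pooled scores into two pseudo-groups and counts
    how often the shuffled mean difference is at least as extreme as the
    observed one; that fraction is the p-value.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(perm_a) - statistics.mean(perm_b)) >= observed:
            count += 1
    return count / n_iter

# Hypothetical per-vignette score fractions (13 vignettes, as in the study);
# these values are invented, NOT the published data.
chatgpt4 = [0.5, 0.6, 0.4, 0.55, 0.5, 0.45, 0.6, 0.5, 0.55, 0.5, 0.5, 0.6, 0.45]
surgeons = [0.7, 0.75, 0.6, 0.7, 0.65, 0.7, 0.75, 0.65, 0.7, 0.7, 0.65, 0.75, 0.7]

p = permutation_test(chatgpt4, surgeons)
print(f"permutation p-value: {p:.4f}")
```

With clearly separated groups like these, the permutation p-value falls well below conventional significance thresholds; with a nonparametric test of this kind, no normality assumption about the score distributions is needed.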

Source journal
CiteScore: 3.90
Self-citation rate: 5.60%
Articles per year: 66
Review time: 6-12 weeks
Journal description: This broad-based international journal updates you on vital developments in pediatric surgery through original articles, abstracts of the literature, and meeting announcements. You will find state-of-the-art information on: abdominal and thoracic surgery, neurosurgery, urology, gynecology, oncology, orthopaedics, traumatology, anesthesiology, child pathology, embryology, and morphology. Written by surgeons, physicians, anesthesiologists, radiologists, and others involved in the surgical care of neonates, infants, and children, the EJPS is an indispensable resource for all specialists.