Pediatric surgical trainees and artificial intelligence: a comparative analysis of DeepSeek, Copilot, Google Bard and pediatric surgeons' performance on the European Pediatric Surgical In-Training Examinations (EPSITE)

IF 1.6 · CAS Tier 3 (Medicine) · JCR Q2 (PEDIATRICS)
Richard Gnatzy, Martin Lacher, Salvatore Cascio, Oliver Münsterer, Richard Wagner, Ophelia Aubert
{"title":"儿科外科培训生与人工智能:DeepSeek、Copilot、b谷歌Bard和儿科外科医生在欧洲儿科外科培训考试(EPSITE)中的表现比较分析","authors":"Richard Gnatzy, Martin Lacher, Salvatore Cascio, Oliver Münsterer, Richard Wagner, Ophelia Aubert","doi":"10.1007/s00383-025-06104-9","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>Large language models (LLMs) have advanced rapidly, but their utility in pediatric surgery remains uncertain. This study assessed the performance of three AI models-DeepSeek, Microsoft Copilot (GPT-4) and Google Bard-on the European Pediatric Surgery In-Training Examination (EPSITE).</p><p><strong>Methods: </strong>We evaluated model performance using 294 EPSITE questions from 2021 to 2023. Data for Copilot and Bard were collected in early 2024, while DeepSeek was assessed in 2025. Responses were compared to those of pediatric surgical trainees. Statistical analyses determined performance differences.</p><p><strong>Results: </strong>DeepSeek achieved the highest accuracy (85.0%), followed by Copilot (55.4%) and Bard (48.0%). Pediatric surgical trainees averaged 60.1%. Performance differences were statistically significant (p < 0.0001). DeepSeek significantly outperformed both human trainees and other models (p < 0.0001), while Bard was consistently outperformed by trainees across all training levels (p < 0.01). Sixth-year trainees performed better than Copilot (p < 0.05). Copilot and Bard failed to answer a small portion of questions (3.4% and 4.7%, respectively) due to ethical concerns or perceived lack of correct choices. The time gap between model assessments reflects the rapid evolution of LLMs, contributing to the superior performance of newer models like DeepSeek.</p><p><strong>Conclusion: </strong>LLMs show variable performance in pediatric surgery, with newer models like DeepSeek demonstrating marked improvement. These findings highlight the rapid progression of LLM capabilities and emphasize the need for ongoing evaluation before clinical integration, especially in high-stakes decision-making contexts.</p>","PeriodicalId":19832,"journal":{"name":"Pediatric Surgery International","volume":"41 1","pages":"247"},"PeriodicalIF":1.6000,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12334501/pdf/","citationCount":"0","resultStr":"{\"title\":\"Pediatric surgical trainees and artificial intelligence: a comparative analysis of DeepSeek, Copilot, Google Bard and pediatric surgeons' performance on the European Pediatric Surgical In-Training Examinations (EPSITE).\",\"authors\":\"Richard Gnatzy, Martin Lacher, Salvatore Cascio, Oliver Münsterer, Richard Wagner, Ophelia Aubert\",\"doi\":\"10.1007/s00383-025-06104-9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>Large language models (LLMs) have advanced rapidly, but their utility in pediatric surgery remains uncertain. This study assessed the performance of three AI models-DeepSeek, Microsoft Copilot (GPT-4) and Google Bard-on the European Pediatric Surgery In-Training Examination (EPSITE).</p><p><strong>Methods: </strong>We evaluated model performance using 294 EPSITE questions from 2021 to 2023. Data for Copilot and Bard were collected in early 2024, while DeepSeek was assessed in 2025. Responses were compared to those of pediatric surgical trainees. 
Statistical analyses determined performance differences.</p><p><strong>Results: </strong>DeepSeek achieved the highest accuracy (85.0%), followed by Copilot (55.4%) and Bard (48.0%). Pediatric surgical trainees averaged 60.1%. Performance differences were statistically significant (p < 0.0001). DeepSeek significantly outperformed both human trainees and other models (p < 0.0001), while Bard was consistently outperformed by trainees across all training levels (p < 0.01). Sixth-year trainees performed better than Copilot (p < 0.05). Copilot and Bard failed to answer a small portion of questions (3.4% and 4.7%, respectively) due to ethical concerns or perceived lack of correct choices. The time gap between model assessments reflects the rapid evolution of LLMs, contributing to the superior performance of newer models like DeepSeek.</p><p><strong>Conclusion: </strong>LLMs show variable performance in pediatric surgery, with newer models like DeepSeek demonstrating marked improvement. These findings highlight the rapid progression of LLM capabilities and emphasize the need for ongoing evaluation before clinical integration, especially in high-stakes decision-making contexts.</p>\",\"PeriodicalId\":19832,\"journal\":{\"name\":\"Pediatric Surgery International\",\"volume\":\"41 1\",\"pages\":\"247\"},\"PeriodicalIF\":1.6000,\"publicationDate\":\"2025-08-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12334501/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pediatric Surgery International\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1007/s00383-025-06104-9\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PEDIATRICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pediatric Surgery International","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00383-025-06104-9","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PEDIATRICS","Score":null,"Total":0}
引用次数: 0

Abstract

Objective: Large language models (LLMs) have advanced rapidly, but their utility in pediatric surgery remains uncertain. This study assessed the performance of three AI models, DeepSeek, Microsoft Copilot (GPT-4), and Google Bard, on the European Pediatric Surgery In-Training Examination (EPSITE).

Methods: We evaluated model performance using 294 EPSITE questions from 2021 to 2023. Data for Copilot and Bard were collected in early 2024, while DeepSeek was assessed in 2025. Responses were compared to those of pediatric surgical trainees. Statistical analyses determined performance differences.
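
The scoring procedure implied by these methods reduces to marking each multiple-choice answer against the official key. The Python sketch below illustrates that step under stated assumptions: the study's actual pipeline is not published, the question IDs and answers are hypothetical, and counting a refusal as incorrect is one common convention that the paper does not confirm.

```python
# Hypothetical scoring sketch for multiple-choice exam responses.
# Tuples are (question_id, model_answer, correct_answer); None marks
# a question the model declined to answer.
responses = [
    ("q001", "B", "B"),
    ("q002", "D", "A"),
    ("q003", None, "C"),  # e.g., a refusal on ethical grounds
]

def accuracy(items):
    """Fraction answered correctly; declined questions count as incorrect here."""
    correct = sum(1 for _, given, truth in items if given == truth)
    return correct / len(items)

print(f"accuracy: {accuracy(responses):.1%}")  # -> accuracy: 33.3%
```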

Results: DeepSeek achieved the highest accuracy (85.0%), followed by Copilot (55.4%) and Bard (48.0%). Pediatric surgical trainees averaged 60.1%. Performance differences were statistically significant (p < 0.0001). DeepSeek significantly outperformed both human trainees and the other models (p < 0.0001), while Bard was consistently outperformed by trainees across all training levels (p < 0.01). Sixth-year trainees performed better than Copilot (p < 0.05). Copilot and Bard failed to answer a small portion of questions (3.4% and 4.7%, respectively) due to ethical concerns or a perceived lack of correct choices. The time gap between model assessments reflects the rapid evolution of LLMs and likely contributed to the superior performance of newer models such as DeepSeek.
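
The abstract reports p-values without naming the tests used. As one hedged illustration, a chi-squared test on correct/incorrect counts reconstructed from the reported accuracies could compare two models; this is an assumption, not necessarily the authors' analysis. Because both models answered the same 294 questions, a paired test such as McNemar's would arguably be more appropriate, but it requires per-question data the abstract does not provide.

```python
from scipy.stats import chi2_contingency

# Illustrative comparison of DeepSeek vs. Copilot accuracy on n = 294 questions,
# with correct counts reconstructed from the reported percentages (85.0%, 55.4%).
# This reconstructs a plausible analysis; the paper does not state its method.
n = 294
deepseek_correct = round(0.850 * n)  # 250
copilot_correct = round(0.554 * n)   # 163
table = [
    [deepseek_correct, n - deepseek_correct],
    [copilot_correct, n - copilot_correct],
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")
```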

Conclusion: LLMs show variable performance in pediatric surgery, with newer models like DeepSeek demonstrating marked improvement. These findings highlight the rapid progression of LLM capabilities and emphasize the need for ongoing evaluation before clinical integration, especially in high-stakes decision-making contexts.

Source journal: Pediatric Surgery International
CiteScore: 3.00
Self-citation rate: 5.60%
Annual articles: 215
Review turnaround: 3-6 weeks

About the journal: Pediatric Surgery International is a journal devoted to the publication of new and important information from the entire spectrum of pediatric surgery. The major purpose of the journal is to promote postgraduate training and further education in the surgery of infants and children. The contents include articles in clinical and experimental surgery, as well as related fields. One section of each issue is devoted to a special topic, with invited contributions from recognized authorities. Other sections include review articles, original articles, technical innovations, and letters to the editor.