Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot.

IF 5.2 · CAS Tier 2 (Education) · JCR Q1, EDUCATION, SCIENTIFIC DISCIPLINES
Olena Bolgova, Paul Ganguly, Volodymyr Mavrych
{"title":"Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot.","authors":"Olena Bolgova, Paul Ganguly, Volodymyr Mavrych","doi":"10.1002/ase.70044","DOIUrl":null,"url":null,"abstract":"<p><p>Integrating artificial intelligence, particularly large language models (LLMs), into medical education represents a significant new step in how medical knowledge is accessed, processed, and evaluated. The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots in different topics of medical embryology courses. Two hundred United States Medical Licensing Examination (USMLE)-style multiple-choice questions were selected from the course exam database and distributed across 20 topics. The results of 3 attempts by GPT-4o, Claude, Gemini, Copilot, and GPT-3.5 to answer the assessment items were evaluated. Statistical analyses included intraclass correlation coefficients for reliability, one-way and two-way mixed ANOVAs for performance comparisons, and post hoc analyses. Effect sizes were calculated using Cohen's f and eta-squared (η<sup>2</sup>). On average, the selected chatbots correctly answered 78.7% ± 15.1% of the questions. GPT-4o and Claude performed best, correctly answering 89.7% and 87.5% of the questions, respectively, without a statistical difference in their performance (p = 0.238). The performance of other chatbots was significantly lower (p < 0.01): Copilot (82.5%), Gemini (74.8%), and GPT-3.5 (59.0%). Test-retest reliability analysis showed good reliability for GPT-4o (ICC = 0.803), Claude (ICC = 0.865), and Gemini (ICC = 0.876), with moderate reliability for Copilot and GPT-3.5. This study suggests that AI models like GPT-4o and Claude show promise for providing tailored embryology instruction, though instructor verification remains essential.</p>","PeriodicalId":124,"journal":{"name":"Anatomical Sciences Education","volume":" ","pages":""},"PeriodicalIF":5.2000,"publicationDate":"2025-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Anatomical Sciences Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1002/ase.70044","RegionNum":2,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Citations: 0

Abstract

Integrating artificial intelligence, particularly large language models (LLMs), into medical education represents a significant new step in how medical knowledge is accessed, processed, and evaluated. The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots across different topics of a medical embryology course. Two hundred United States Medical Licensing Examination (USMLE)-style multiple-choice questions were selected from the course exam database and distributed across 20 topics. The results of three attempts by GPT-4o, Claude, Gemini, Copilot, and GPT-3.5 to answer the assessment items were evaluated. Statistical analyses included intraclass correlation coefficients (ICC) for reliability, one-way and two-way mixed ANOVAs for performance comparisons, and post hoc analyses. Effect sizes were calculated using Cohen's f and eta-squared (η²). On average, the selected chatbots correctly answered 78.7% ± 15.1% of the questions. GPT-4o and Claude performed best, correctly answering 89.7% and 87.5% of the questions, respectively, with no statistically significant difference between them (p = 0.238). The performance of the other chatbots was significantly lower (p < 0.01): Copilot (82.5%), Gemini (74.8%), and GPT-3.5 (59.0%). Test-retest reliability analysis showed good reliability for GPT-4o (ICC = 0.803), Claude (ICC = 0.865), and Gemini (ICC = 0.876), and moderate reliability for Copilot and GPT-3.5. This study suggests that AI models like GPT-4o and Claude show promise for providing tailored embryology instruction, though instructor verification remains essential.
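The statistical pipeline summarized above (one-way ANOVA with Cohen's f and η² effect sizes, plus intraclass correlation coefficients for test-retest reliability) can be illustrated with a short sketch. The item-level responses are not published, so the snippet below simulates data around the reported means; the variable names, the simulated spreads, and the choice of the ICC(3,1) variant are assumptions for illustration only, not the authors' exact procedure.

```python
# Minimal sketch of the reported analysis pipeline on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-topic accuracies: 20 topics x 5 chatbots, centred on
# the mean scores reported in the abstract (spread of 0.08 is assumed).
reported_means = {"GPT-4o": 0.897, "Claude": 0.875, "Copilot": 0.825,
                  "Gemini": 0.748, "GPT-3.5": 0.590}
groups = [np.clip(rng.normal(mu, 0.08, size=20), 0.0, 1.0)
          for mu in reported_means.values()]

# One-way ANOVA comparing chatbot performance across topics.
f_stat, p_val = stats.f_oneway(*groups)

# Effect sizes: eta^2 = SS_between / SS_total; Cohen's f follows from it.
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_scores - grand_mean) ** 2).sum()
eta_sq = ss_between / ss_total
cohens_f = np.sqrt(eta_sq / (1.0 - eta_sq))
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.3g}, "
      f"eta^2 = {eta_sq:.3f}, Cohen's f = {cohens_f:.3f}")

# Test-retest reliability: ICC(3,1) from a two-way ANOVA decomposition
# of an items x attempts score matrix.
def icc_3_1(y):
    """ICC(3,1) for an (n_items, k_attempts) matrix of scores."""
    n, k = y.shape
    grand = y.mean()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()   # between items
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()   # between attempts
    ss_total = ((y - grand) ** 2).sum()
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Simulate 200 questions x 3 attempts of 0/1 answers for one model: each
# question has a latent difficulty shared across attempts, which is what
# makes repeated attempts correlate.
p_correct = np.clip(rng.normal(0.897, 0.25, size=(200, 1)), 0.02, 0.98)
attempts = (rng.random((200, 3)) < p_correct).astype(float)
print(f"ICC(3,1) = {icc_3_1(attempts):.3f}")
```

ICC(3,1) treats the three attempts as fixed "raters"; a two-way random-effects variant such as ICC(2,1) would add a between-attempts term to the denominator. The paper does not state which ICC form was used, so the choice here is illustrative.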

Source journal

Anatomical Sciences Education
CiteScore: 10.30
Self-citation rate: 39.70%
Articles published per year: 91
Journal description: Anatomical Sciences Education, affiliated with the American Association for Anatomy, serves as an international platform for sharing ideas, innovations, and research related to education in the anatomical sciences. Covering gross anatomy, embryology, histology, and neurosciences, the journal addresses education at various levels, including undergraduate, graduate, post-graduate, allied health, medical (both allopathic and osteopathic), and dental. It fosters collaboration and discussion in the field of anatomical sciences education.