Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience

Volodymyr Mavrych, Ahmed Yaqinuddin, Olena Bolgova

Advances in Physiology Education, 430-437. DOI: 10.1152/advan.00093.2024. Published 2025-06-01 (Epub 2025-01-17).
Citations: 0
Abstract
Despite extensive studies on large language models and their ability to answer questions from various licensing exams, there has been limited focus on employing chatbots for specific subjects within the medical curriculum, particularly medical neuroscience. This research compared the performance of Claude 3.5 Sonnet (Anthropic), GPT-3.5 and GPT-4-1106 (OpenAI), the free version of Copilot (Microsoft), and Gemini 1.5 Flash (Google) against students on multiple-choice questions (MCQs) from a medical neuroscience course database to evaluate chatbot reliability. Five successive attempts by each chatbot to answer 200 United States Medical Licensing Examination (USMLE)-style questions were evaluated for accuracy, relevance, and comprehensiveness. The MCQs were grouped into 12 categories/topics. The results indicated that, at their current level of development, the selected AI-driven chatbots could, on average, accurately answer 67.2% of MCQs from the medical neuroscience course, 7.4% below the students' average. However, Claude and GPT-4 outperformed the other chatbots, with 83% and 81.7% correct answers, respectively, both above the average student result. They were followed by Copilot (59.5%), GPT-3.5 (58.3%), and Gemini (53.6%). Across categories, Neurocytology, Embryology, and Diencephalon were the three strongest topics, with average results of 78.1-86.7%, while the lowest results were for Brain stem, Special senses, and Cerebellum, with 54.4-57.7% correct answers. Our study suggests that Claude and GPT-4 are currently two of the most advanced chatbots: they answer neuroscience-related MCQs with a proficiency that surpasses that of the average medical student. This marks a significant milestone in how AI can supplement and enhance educational tools and techniques.

NEW & NOTEWORTHY: This research evaluates the effectiveness of different AI-driven large language models (Claude, ChatGPT, Copilot, and Gemini) compared with medical students in answering neuroscience questions. The study offers insights into the specific areas of neuroscience in which these chatbots may excel or have limitations, providing a comprehensive analysis of chatbots' current capabilities in processing and interacting with certain topics of the basic medical sciences curriculum.
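The abstract describes the scoring protocol only in outline: five successive attempts per chatbot on 200 MCQs, aggregated overall and across 12 topic categories, then compared with the students' average. The sketch below is a minimal, hypothetical illustration of that kind of aggregation; the paper does not publish analysis code, and the record layout, the `score_attempts` helper, and the toy data are assumptions made purely for illustration.

```python
# Hypothetical sketch (not the authors' code): average a chatbot's accuracy
# over repeated attempts, overall and per topic category.
from collections import defaultdict
from statistics import mean

def score_attempts(attempts):
    """attempts: list of runs; each run is a list of dicts with
    'category' and 'correct' keys, one dict per MCQ."""
    overall = []                       # overall accuracy for each attempt
    per_category = defaultdict(list)   # category -> accuracy per attempt

    for run in attempts:
        overall.append(mean(1.0 if q["correct"] else 0.0 for q in run))
        by_cat = defaultdict(list)
        for q in run:
            by_cat[q["category"]].append(1.0 if q["correct"] else 0.0)
        for cat, scores in by_cat.items():
            per_category[cat].append(mean(scores))

    return {
        "overall_pct": 100 * mean(overall),
        "per_category_pct": {c: 100 * mean(s) for c, s in per_category.items()},
    }

# Toy usage: two attempts over a three-question subset (illustrative data only).
example_attempts = [
    [{"category": "Neurocytology", "correct": True},
     {"category": "Cerebellum", "correct": False},
     {"category": "Brain stem", "correct": True}],
    [{"category": "Neurocytology", "correct": True},
     {"category": "Cerebellum", "correct": True},
     {"category": "Brain stem", "correct": False}],
]
print(score_attempts(example_attempts))
```

Applied to a chatbot's five full 200-question attempts, this style of aggregation would produce attempt-averaged overall and per-topic percentages comparable in form to those reported in the abstract.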
Journal Description
Advances in Physiology Education promotes and disseminates educational scholarship in order to enhance teaching and learning of physiology, neuroscience and pathophysiology. The journal publishes peer-reviewed descriptions of innovations that improve teaching in the classroom and laboratory, essays on education, and review articles based on our current understanding of physiological mechanisms. Submissions that evaluate new technologies for teaching and research, and educational pedagogy, are especially welcome. The audience for the journal includes educators at all levels: K–12, undergraduate, graduate, and professional programs.