Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience

Volodymyr Mavrych, Ahmed Yaqinuddin, Olena Bolgova

Advances in Physiology Education, 430-437. DOI: 10.1152/advan.00093.2024. Published 2025-06-01 (Epub 2025-01-17).
Citations: 0
Abstract
Despite extensive studies on large language models and their ability to answer questions from various licensing exams, there has been limited focus on employing chatbots for specific subjects within the medical curriculum, particularly medical neuroscience. This research compared the performance of Claude 3.5 Sonnet (Anthropic), GPT-3.5 and GPT-4-1106 (OpenAI), the free version of Copilot (Microsoft), and Gemini 1.5 Flash (Google) against students on multiple-choice questions (MCQs) from a medical neuroscience course database to evaluate chatbot reliability. Five successive attempts by each chatbot to answer 200 United States Medical Licensing Examination (USMLE)-style questions were evaluated for accuracy, relevance, and comprehensiveness. The MCQs were grouped into 12 categories/topics. The results indicated that, at their current level of development, the selected AI-driven chatbots could, on average, accurately answer 67.2% of MCQs from the medical neuroscience course, 7.4% below the students' average. However, Claude and GPT-4 outperformed the other chatbots, with 83% and 81.7% correct answers, respectively, both above the average student result. They were followed by Copilot (59.5%), GPT-3.5 (58.3%), and Gemini (53.6%). Across categories, Neurocytology, Embryology, and Diencephalon were the three strongest topics, with average results of 78.1-86.7%, while the lowest results were for Brain stem, Special senses, and Cerebellum, with 54.4-57.7% correct answers. Our study suggests that Claude and GPT-4 are currently two of the most advanced chatbots: they answer neuroscience-related MCQs with a proficiency that surpasses that of the average medical student. This marks a significant milestone in how AI can supplement and enhance educational tools and techniques.

NEW & NOTEWORTHY: This research evaluates the effectiveness of different AI-driven large language models (Claude, ChatGPT, Copilot, and Gemini) compared with medical students in answering neuroscience questions. The study offers insights into the specific areas of neuroscience in which these chatbots may excel or have limitations, providing a comprehensive analysis of chatbots' current capabilities in processing and interacting with certain topics of the basic medical sciences curriculum.
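The abstract describes the scoring protocol only in outline: five successive attempts per chatbot on 200 MCQs, aggregated overall and across 12 topic categories, then compared with the students' average. The sketch below is a minimal, hypothetical illustration of that kind of aggregation; the paper does not publish analysis code, and the record layout, the `score_attempts` helper, and the toy data are assumptions made purely for illustration.

```python
# Hypothetical sketch (not the authors' code): average a chatbot's accuracy
# over repeated attempts, overall and per topic category.
from collections import defaultdict
from statistics import mean

def score_attempts(attempts):
    """attempts: list of runs; each run is a list of dicts with
    'category' and 'correct' keys, one dict per MCQ."""
    overall = []                       # overall accuracy for each attempt
    per_category = defaultdict(list)   # category -> accuracy per attempt

    for run in attempts:
        overall.append(mean(1.0 if q["correct"] else 0.0 for q in run))
        by_cat = defaultdict(list)
        for q in run:
            by_cat[q["category"]].append(1.0 if q["correct"] else 0.0)
        for cat, scores in by_cat.items():
            per_category[cat].append(mean(scores))

    return {
        "overall_pct": 100 * mean(overall),
        "per_category_pct": {c: 100 * mean(s) for c, s in per_category.items()},
    }

# Toy usage: two attempts over a three-question subset (illustrative data only).
example_attempts = [
    [{"category": "Neurocytology", "correct": True},
     {"category": "Cerebellum", "correct": False},
     {"category": "Brain stem", "correct": True}],
    [{"category": "Neurocytology", "correct": True},
     {"category": "Cerebellum", "correct": True},
     {"category": "Brain stem", "correct": False}],
]
print(score_attempts(example_attempts))
```

Applied to a chatbot's five full 200-question attempts, this style of aggregation would produce attempt-averaged overall and per-topic percentages comparable in form to those reported in the abstract.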
Journal Description
Advances in Physiology Education promotes and disseminates educational scholarship in order to enhance teaching and learning of physiology, neuroscience and pathophysiology. The journal publishes peer-reviewed descriptions of innovations that improve teaching in the classroom and laboratory, essays on education, and review articles based on our current understanding of physiological mechanisms. Submissions that evaluate new technologies for teaching and research, and educational pedagogy, are especially welcome. The audience for the journal includes educators at all levels: K–12, undergraduate, graduate, and professional programs.