Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.

JMIR Medical Education · IF 3.2 · Q1 (Education, Scientific Disciplines)
Olena Bolgova, Inna Shypilova, Volodymyr Mavrych
{"title":"Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.","authors":"Olena Bolgova, Inna Shypilova, Volodymyr Mavrych","doi":"10.2196/67244","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have started a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies indicated that at the current level of development, LLMs can pass different board exams. However, the ability to answer specific subject-related questions requires validation.</p><p><strong>Objective: </strong>The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots-Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google), and Copilot (Microsoft)-against the academic results of medical students in the medical biochemistry course.</p><p><strong>Methods: </strong>We used 200 USMLE (United States Medical Licensing Examination)-style multiple-choice questions (MCQs) selected from the course exam database. They encompassed various complexity levels and were distributed across 23 distinctive topics. The questions with tables and images were not included in the study. The results of 5 successive attempts by Claude 3.5 Sonnet, GPT-4-1106, Gemini 1.5 Flash, and Copilot to answer this questionnaire set were evaluated based on accuracy in August 2024. Statistica 13.5.0.17 (TIBCO Software Inc) was used to analyze the data's basic statistics. Considering the binary nature of the data, the chi-square test was used to compare results among the different chatbots, with a statistical significance level of P<.05.</p><p><strong>Results: </strong>On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students' performance by 8.3% (P=.02). In this study, Claude showed the best performance in biochemistry MCQs, correctly answering 92.5% (185/200) of questions, followed by GPT-4 (170/200, 85%), Gemini (157/200, 78.5%), and Copilot (128/200, 64%). The chatbots demonstrated the best results in the following 4 topics: eicosanoids (mean 100%, SD 0%), bioenergetics and electron transport chain (mean 96.4%, SD 7.2%), hexose monophosphate pathway (mean 91.7%, SD 16.7%), and ketone bodies (mean 93.8%, SD 12.5%). The Pearson chi-square test indicated a statistically significant association between the answers of all 4 chatbots (P<.001 to P<.04).</p><p><strong>Conclusions: </strong>Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. This performance highlights the potential of AI in medical education and assessment.</p>","PeriodicalId":36236,"journal":{"name":"JMIR Medical Education","volume":"11 ","pages":"e67244"},"PeriodicalIF":3.2000,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12005600/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/67244","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}

Abstract

Background: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have ushered in a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies have indicated that, at their current level of development, LLMs can pass various board examinations. However, their ability to answer subject-specific questions still requires validation.

Objective: The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots, namely Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google), and Copilot (Microsoft), against the academic results of medical students in the medical biochemistry course.

Methods: We used 200 USMLE (United States Medical Licensing Examination)-style multiple-choice questions (MCQs) selected from the course exam database. They spanned various complexity levels and were distributed across 23 distinct topics. Questions containing tables or images were excluded from the study. In August 2024, each model (Claude 3.5 Sonnet, GPT-4-1106, Gemini 1.5 Flash, and Copilot) made 5 successive attempts at the full question set, and the results were evaluated for accuracy. Statistica 13.5.0.17 (TIBCO Software Inc) was used for basic statistical analysis. Given the binary nature of the data, the chi-square test was used to compare results among the different chatbots, with statistical significance set at P<.05.
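As a concrete illustration of the scoring protocol described above, the sketch below aggregates 5 successive attempts at a 200-question set into a mean accuracy and standard deviation, the summary statistics reported in the Results. All answer data here are hypothetical placeholders; the study's raw responses are not reproduced in the abstract.

```python
# Minimal sketch of the scoring protocol: each model answers the
# 200-question set 5 times, and per-attempt accuracy is summarized
# as a mean and SD. All data below are hypothetical placeholders.
from statistics import mean, stdev

def accuracy(answers: list[str], key: list[str]) -> float:
    """Fraction of MCQs answered correctly in one attempt."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

key = ["A"] * 200  # placeholder answer key
attempts = [       # 5 successive attempts by one hypothetical model
    ["A"] * 190 + ["B"] * 10,  # 95.0% correct
    ["A"] * 186 + ["B"] * 14,  # 93.0%
    ["A"] * 184 + ["B"] * 16,  # 92.0%
    ["A"] * 188 + ["B"] * 12,  # 94.0%
    ["A"] * 182 + ["B"] * 18,  # 91.0%
]
scores = [accuracy(a, key) for a in attempts]
print(f"mean {mean(scores):.1%}, SD {stdev(scores):.1%}")
```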

Results: On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students' performance by 8.3% (P=.02). In this study, Claude showed the best performance in biochemistry MCQs, correctly answering 92.5% (185/200) of questions, followed by GPT-4 (170/200, 85%), Gemini (157/200, 78.5%), and Copilot (128/200, 64%). The chatbots demonstrated the best results in the following 4 topics: eicosanoids (mean 100%, SD 0%), bioenergetics and electron transport chain (mean 96.4%, SD 7.2%), hexose monophosphate pathway (mean 91.7%, SD 16.7%), and ketone bodies (mean 93.8%, SD 12.5%). The Pearson chi-square test indicated a statistically significant association between the answers of all 4 chatbots (P<.001 to P<.04).
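To make the reported comparison concrete, the sketch below runs a Pearson chi-square test over the per-model correct/incorrect counts given above (Claude 185/200, GPT-4 170/200, Gemini 157/200, Copilot 128/200), arranged as a 4x2 contingency table. This is a test of homogeneity of accuracy across the 4 chatbots; the abstract does not specify the authors' exact pairwise procedure, which may differ.

```python
# Minimal sketch: chi-square test of homogeneity across the 4 chatbots,
# using the correct/incorrect counts reported in the Results above.
# Not necessarily the authors' exact analysis.
from scipy.stats import chi2_contingency

correct = {"Claude": 185, "GPT-4": 170, "Gemini": 157, "Copilot": 128}
table = [[c, 200 - c] for c in correct.values()]  # rows: [correct, incorrect]

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, P = {p:.2g}")
```

With these counts, the differences in accuracy across models come out strongly significant, consistent with the significance levels reported above.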

Conclusions: Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. This performance highlights the potential of AI in medical education and assessment.
