Large language models in medical education: a comparative cross-platform evaluation in answering histological questions.

IF 3.8 | Tier 2 (Medicine) | Q1 EDUCATION & EDUCATIONAL RESEARCH
Medical Education Online | Pub Date: 2025-12-01 | Epub Date: 2025-07-12 | DOI: 10.1080/10872981.2025.2534065
Volodymyr Mavrych, Einas M Yousef, Ahmed Yaqinuddin, Olena Bolgova
{"title":"医学教育中的大型语言模型:回答组织学问题的比较跨平台评估。","authors":"Volodymyr Mavrych, Einas M Yousef, Ahmed Yaqinuddin, Olena Bolgova","doi":"10.1080/10872981.2025.2534065","DOIUrl":null,"url":null,"abstract":"<p><p>Large language models (LLMs) have shown promising capabilities across medical disciplines, yet their performance in basic medical sciences remains incompletely characterized. Medical histology, requiring factual knowledge and interpretative skills, provides a unique domain for evaluating AI capabilities in medical education. To evaluate and compare the performance of five current LLMs: GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, Copilot, and DeepSeek R1 on correctly answering medical histology multiple choice questions (MCQs). This cross-sectional comparative study used 200 USMLE-style histology MCQs across 20 topics. Each LLM completed all the questions in three separate attempts. Performance metrics included accuracy rates, test-retest reliability (ICC), and topic-specific analysis. Statistical analysis employed ANOVA with post-hoc Tukey's tests and two-way mixed ANOVA for system-topic interactions. All LLMs achieved exceptionally high accuracy (Mean 91.1%, SD 7.2). Gemini performed best (92.0%), followed by Claude (91.5%), Copilot (91.0%), GPT-4 (90.8%), and DeepSeek (90.3%), with no significant differences between systems (<i>p</i> > 0.05). Claude showed the highest reliability (ICC = 0.931), followed by GPT-4 (ICC = 0.882). Complete accuracy and reproducibility (100%) were detected in Histological Methods, Blood and Hemopoiesis, and Circulatory System, while Muscle tissue (76.0%) and Lymphoid System (84.7%) presented the greatest challenges. LLMs demonstrate exceptional accuracy and reliability in answering histological MCQs, significantly outperforming other medical disciplines. Minimal inter-system variability suggests technological maturity, though topic-specific challenges and reliability concerns indicate the continued need for human expertise. These findings reflect rapid AI advancement and identify histology as particularly suitable for AI-assisted medical education.<b>Clinical trial number</b>: The clinical trial number is not pertinent to this study as it does not involve medicinal products or therapeutic interventions.</p>","PeriodicalId":47656,"journal":{"name":"Medical Education Online","volume":"30 1","pages":"2534065"},"PeriodicalIF":3.8000,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12258195/pdf/","citationCount":"0","resultStr":"{\"title\":\"Large language models in medical education: a comparative cross-platform evaluation in answering histological questions.\",\"authors\":\"Volodymyr Mavrych, Einas M Yousef, Ahmed Yaqinuddin, Olena Bolgova\",\"doi\":\"10.1080/10872981.2025.2534065\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Large language models (LLMs) have shown promising capabilities across medical disciplines, yet their performance in basic medical sciences remains incompletely characterized. Medical histology, requiring factual knowledge and interpretative skills, provides a unique domain for evaluating AI capabilities in medical education. To evaluate and compare the performance of five current LLMs: GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, Copilot, and DeepSeek R1 on correctly answering medical histology multiple choice questions (MCQs). 
This cross-sectional comparative study used 200 USMLE-style histology MCQs across 20 topics. Each LLM completed all the questions in three separate attempts. Performance metrics included accuracy rates, test-retest reliability (ICC), and topic-specific analysis. Statistical analysis employed ANOVA with post-hoc Tukey's tests and two-way mixed ANOVA for system-topic interactions. All LLMs achieved exceptionally high accuracy (Mean 91.1%, SD 7.2). Gemini performed best (92.0%), followed by Claude (91.5%), Copilot (91.0%), GPT-4 (90.8%), and DeepSeek (90.3%), with no significant differences between systems (<i>p</i> > 0.05). Claude showed the highest reliability (ICC = 0.931), followed by GPT-4 (ICC = 0.882). Complete accuracy and reproducibility (100%) were detected in Histological Methods, Blood and Hemopoiesis, and Circulatory System, while Muscle tissue (76.0%) and Lymphoid System (84.7%) presented the greatest challenges. LLMs demonstrate exceptional accuracy and reliability in answering histological MCQs, significantly outperforming other medical disciplines. Minimal inter-system variability suggests technological maturity, though topic-specific challenges and reliability concerns indicate the continued need for human expertise. These findings reflect rapid AI advancement and identify histology as particularly suitable for AI-assisted medical education.<b>Clinical trial number</b>: The clinical trial number is not pertinent to this study as it does not involve medicinal products or therapeutic interventions.</p>\",\"PeriodicalId\":47656,\"journal\":{\"name\":\"Medical Education Online\",\"volume\":\"30 1\",\"pages\":\"2534065\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12258195/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Medical Education Online\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1080/10872981.2025.2534065\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/7/12 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION & EDUCATIONAL RESEARCH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Education Online","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/10872981.2025.2534065","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/12 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0

Abstract

Large language models (LLMs) have shown promising capabilities across medical disciplines, yet their performance in the basic medical sciences remains incompletely characterized. Medical histology, which requires both factual knowledge and interpretative skill, provides a unique domain for evaluating AI capabilities in medical education. This study aimed to evaluate and compare the performance of five current LLMs (GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, Copilot, and DeepSeek R1) in correctly answering medical histology multiple-choice questions (MCQs). This cross-sectional comparative study used 200 USMLE-style histology MCQs spanning 20 topics. Each LLM completed all questions in three separate attempts. Performance metrics included accuracy rates, test-retest reliability (intraclass correlation coefficient, ICC), and topic-specific analysis. Statistical analysis employed ANOVA with post-hoc Tukey's tests and a two-way mixed ANOVA for system-topic interactions. All LLMs achieved exceptionally high accuracy (mean 91.1%, SD 7.2%). Gemini performed best (92.0%), followed by Claude (91.5%), Copilot (91.0%), GPT-4.1 (90.8%), and DeepSeek (90.3%), with no significant differences between systems (p > 0.05). Claude showed the highest reliability (ICC = 0.931), followed by GPT-4.1 (ICC = 0.882). Complete accuracy and reproducibility (100%) were observed for Histological Methods, Blood and Hemopoiesis, and the Circulatory System, whereas Muscle Tissue (76.0%) and the Lymphoid System (84.7%) presented the greatest challenges. LLMs demonstrate exceptional accuracy and reliability in answering histology MCQs, substantially exceeding their reported performance in other medical disciplines. Minimal inter-system variability suggests technological maturity, though topic-specific challenges and reliability concerns indicate a continued need for human expertise. These findings reflect rapid AI advancement and identify histology as a domain particularly suitable for AI-assisted medical education. Clinical trial number: not applicable, as the study does not involve medicinal products or therapeutic interventions.
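For readers who want a concrete picture of how such an evaluation could be analyzed, the sketch below mirrors the pipeline named in the abstract: per-system accuracy across three attempts, test-retest reliability via ICC, one-way ANOVA with post-hoc Tukey tests across systems, and a two-way mixed ANOVA for the system-by-topic interaction. This is a minimal sketch, not the authors' code; the data are simulated, and the column names, the base accuracy used for simulation, and the choice of pandas/pingouin/statsmodels are assumptions made purely for illustration.

```python
# A minimal analysis sketch, NOT the authors' code: the data are simulated and
# every column name, the ~91% base accuracy, and the library choices
# (pandas, pingouin, statsmodels) are assumptions made for illustration only.
import numpy as np
import pandas as pd
import pingouin as pg
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical long-format results: one row per (system, attempt, question),
# with `correct` coded 0/1 and `topic` one of 20 histology topics.
rng = np.random.default_rng(0)
systems = ["GPT-4.1", "Claude 3.7 Sonnet", "Gemini 2.0 Flash", "Copilot", "DeepSeek R1"]
rows = []
for system in systems:
    for attempt in (1, 2, 3):          # three separate attempts per model
        for q in range(200):           # 200 USMLE-style MCQs
            rows.append({"system": system,
                         "attempt": attempt,
                         "question": q,
                         "topic": f"topic_{q % 20}",
                         "correct": int(rng.random() < 0.91)})
df = pd.DataFrame(rows)

# 1. Accuracy (%) per system and attempt.
acc = (df.groupby(["system", "attempt"])["correct"].mean() * 100).reset_index(name="accuracy")
print(acc.groupby("system")["accuracy"].agg(["mean", "std"]))

# 2. Test-retest reliability: ICC across the three attempts, per system.
for system, sub in df.groupby("system"):
    icc = pg.intraclass_corr(data=sub, targets="question",
                             raters="attempt", ratings="correct")
    print(system, icc.loc[icc["Type"] == "ICC2", "ICC"].round(3).tolist())

# 3. One-way ANOVA on attempt-level accuracy, then post-hoc Tukey HSD.
print(pg.anova(data=acc, dv="accuracy", between="system"))
print(pairwise_tukeyhsd(acc["accuracy"], acc["system"]))

# 4. Two-way mixed ANOVA: system (between) x topic (within) on per-topic accuracy.
topic_acc = (df.groupby(["system", "attempt", "topic"])["correct"]
               .mean().reset_index(name="accuracy"))
topic_acc["run_id"] = topic_acc["system"] + "_" + topic_acc["attempt"].astype(str)
print(pg.mixed_anova(data=topic_acc, dv="accuracy", within="topic",
                     subject="run_id", between="system"))
```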

Source journal
Medical Education Online (EDUCATION & EDUCATIONAL RESEARCH)
CiteScore: 6.00
Self-citation rate: 2.20%
Articles per year: 97
Review time: 8 weeks
Journal description: Medical Education Online is an open access journal of health care education, publishing peer-reviewed research, perspectives, reviews, and early documentation of new ideas and trends. Medical Education Online aims to disseminate information on the education and training of physicians and other health care professionals. Manuscripts may address any aspect of health care education and training, including, but not limited to: basic science education, clinical science education, residency education, learning theory, problem-based learning (PBL), curriculum development, research design and statistics, measurement and evaluation, faculty development, and informatics/web.