Volodymyr Mavrych, Einas M Yousef, Ahmed Yaqinuddin, Olena Bolgova
{"title":"医学教育中的大型语言模型:回答组织学问题的比较跨平台评估。","authors":"Volodymyr Mavrych, Einas M Yousef, Ahmed Yaqinuddin, Olena Bolgova","doi":"10.1080/10872981.2025.2534065","DOIUrl":null,"url":null,"abstract":"<p><p>Large language models (LLMs) have shown promising capabilities across medical disciplines, yet their performance in basic medical sciences remains incompletely characterized. Medical histology, requiring factual knowledge and interpretative skills, provides a unique domain for evaluating AI capabilities in medical education. To evaluate and compare the performance of five current LLMs: GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, Copilot, and DeepSeek R1 on correctly answering medical histology multiple choice questions (MCQs). This cross-sectional comparative study used 200 USMLE-style histology MCQs across 20 topics. Each LLM completed all the questions in three separate attempts. Performance metrics included accuracy rates, test-retest reliability (ICC), and topic-specific analysis. Statistical analysis employed ANOVA with post-hoc Tukey's tests and two-way mixed ANOVA for system-topic interactions. All LLMs achieved exceptionally high accuracy (Mean 91.1%, SD 7.2). Gemini performed best (92.0%), followed by Claude (91.5%), Copilot (91.0%), GPT-4 (90.8%), and DeepSeek (90.3%), with no significant differences between systems (<i>p</i> > 0.05). Claude showed the highest reliability (ICC = 0.931), followed by GPT-4 (ICC = 0.882). Complete accuracy and reproducibility (100%) were detected in Histological Methods, Blood and Hemopoiesis, and Circulatory System, while Muscle tissue (76.0%) and Lymphoid System (84.7%) presented the greatest challenges. LLMs demonstrate exceptional accuracy and reliability in answering histological MCQs, significantly outperforming other medical disciplines. Minimal inter-system variability suggests technological maturity, though topic-specific challenges and reliability concerns indicate the continued need for human expertise. These findings reflect rapid AI advancement and identify histology as particularly suitable for AI-assisted medical education.<b>Clinical trial number</b>: The clinical trial number is not pertinent to this study as it does not involve medicinal products or therapeutic interventions.</p>","PeriodicalId":47656,"journal":{"name":"Medical Education Online","volume":"30 1","pages":"2534065"},"PeriodicalIF":3.8000,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12258195/pdf/","citationCount":"0","resultStr":"{\"title\":\"Large language models in medical education: a comparative cross-platform evaluation in answering histological questions.\",\"authors\":\"Volodymyr Mavrych, Einas M Yousef, Ahmed Yaqinuddin, Olena Bolgova\",\"doi\":\"10.1080/10872981.2025.2534065\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Large language models (LLMs) have shown promising capabilities across medical disciplines, yet their performance in basic medical sciences remains incompletely characterized. Medical histology, requiring factual knowledge and interpretative skills, provides a unique domain for evaluating AI capabilities in medical education. To evaluate and compare the performance of five current LLMs: GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, Copilot, and DeepSeek R1 on correctly answering medical histology multiple choice questions (MCQs). 
This cross-sectional comparative study used 200 USMLE-style histology MCQs across 20 topics. Each LLM completed all the questions in three separate attempts. Performance metrics included accuracy rates, test-retest reliability (ICC), and topic-specific analysis. Statistical analysis employed ANOVA with post-hoc Tukey's tests and two-way mixed ANOVA for system-topic interactions. All LLMs achieved exceptionally high accuracy (Mean 91.1%, SD 7.2). Gemini performed best (92.0%), followed by Claude (91.5%), Copilot (91.0%), GPT-4 (90.8%), and DeepSeek (90.3%), with no significant differences between systems (<i>p</i> > 0.05). Claude showed the highest reliability (ICC = 0.931), followed by GPT-4 (ICC = 0.882). Complete accuracy and reproducibility (100%) were detected in Histological Methods, Blood and Hemopoiesis, and Circulatory System, while Muscle tissue (76.0%) and Lymphoid System (84.7%) presented the greatest challenges. LLMs demonstrate exceptional accuracy and reliability in answering histological MCQs, significantly outperforming other medical disciplines. Minimal inter-system variability suggests technological maturity, though topic-specific challenges and reliability concerns indicate the continued need for human expertise. These findings reflect rapid AI advancement and identify histology as particularly suitable for AI-assisted medical education.<b>Clinical trial number</b>: The clinical trial number is not pertinent to this study as it does not involve medicinal products or therapeutic interventions.</p>\",\"PeriodicalId\":47656,\"journal\":{\"name\":\"Medical Education Online\",\"volume\":\"30 1\",\"pages\":\"2534065\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12258195/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Medical Education Online\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1080/10872981.2025.2534065\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/7/12 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION & EDUCATIONAL RESEARCH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Education Online","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/10872981.2025.2534065","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/12 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Large language models in medical education: a comparative cross-platform evaluation in answering histological questions.
Large language models (LLMs) have shown promising capabilities across medical disciplines, yet their performance in the basic medical sciences remains incompletely characterized. Medical histology, which requires both factual knowledge and interpretative skills, provides a unique domain for evaluating AI capabilities in medical education. This study aimed to evaluate and compare the performance of five current LLMs (GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, Copilot, and DeepSeek R1) in correctly answering medical histology multiple-choice questions (MCQs). This cross-sectional comparative study used 200 USMLE-style histology MCQs spanning 20 topics. Each LLM answered the full question set in three separate attempts. Performance metrics included accuracy rates, test-retest reliability (intraclass correlation coefficient, ICC), and topic-specific analysis. Statistical analysis employed one-way ANOVA with post-hoc Tukey's tests, and two-way mixed ANOVA for system-topic interactions. All LLMs achieved exceptionally high accuracy (mean 91.1%, SD 7.2). Gemini performed best (92.0%), followed by Claude (91.5%), Copilot (91.0%), GPT-4.1 (90.8%), and DeepSeek (90.3%), with no significant differences between systems (p > 0.05). Claude showed the highest reliability (ICC = 0.931), followed by GPT-4.1 (ICC = 0.882). Complete accuracy and reproducibility (100%) were observed for Histological Methods, Blood and Hemopoiesis, and the Circulatory System, whereas Muscle Tissue (76.0%) and the Lymphoid System (84.7%) presented the greatest challenges. LLMs demonstrate exceptional accuracy and reliability in answering histology MCQs, substantially exceeding their reported performance in other medical disciplines. Minimal inter-system variability suggests technological maturity, though topic-specific challenges and reliability concerns indicate a continued need for human expertise. These findings reflect rapid AI advancement and identify histology as particularly suitable for AI-assisted medical education. Clinical trial number: not applicable, as this study does not involve medicinal products or therapeutic interventions.
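The analysis plan described in the abstract (per-model accuracy over repeated attempts, test-retest reliability via ICC, and ANOVA with post-hoc Tukey's tests across systems) can be illustrated with a minimal Python sketch. This is not the authors' actual pipeline: the file name histology_mcq_results.csv, the column names, and the choice of a two-way mixed-effects ICC (ICC3) are assumptions made for illustration only.

```python
# Illustrative sketch only (not the study's actual code): computing per-model
# accuracy, test-retest reliability (ICC), and one-way ANOVA with Tukey's HSD
# from a long-format results table. File and column names are assumed.
import pandas as pd
import pingouin as pg
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Assumed columns: model, attempt (1-3), topic, question_id, correct (0/1)
df = pd.read_csv("histology_mcq_results.csv")

# Accuracy per model and attempt (% of the 200 MCQs answered correctly)
acc = (df.groupby(["model", "attempt"])["correct"].mean() * 100).reset_index(name="accuracy")
print(acc.groupby("model")["accuracy"].agg(["mean", "std"]))

# Test-retest reliability per model: ICC across the three attempts,
# using per-topic accuracy as the rated scores (ICC3 = two-way mixed, single rater)
topic_scores = (df.groupby(["model", "attempt", "topic"])["correct"].mean() * 100).reset_index(name="score")
for model, sub in topic_scores.groupby("model"):
    icc = pg.intraclass_corr(data=sub, targets="topic", raters="attempt", ratings="score")
    print(model, icc.loc[icc["Type"] == "ICC3", "ICC"].values)

# One-way ANOVA across models on per-attempt accuracy, then Tukey's HSD post hoc
groups = [g["accuracy"].values for _, g in acc.groupby("model")]
f_stat, p_val = stats.f_oneway(*groups)
print(f"ANOVA: F={f_stat:.2f}, p={p_val:.3f}")
print(pairwise_tukeyhsd(acc["accuracy"], acc["model"]))
```

The two-way mixed ANOVA for system-topic interactions mentioned in the abstract could be added on top of the same long-format table (for example with pingouin's mixed_anova), but the core metrics reported (accuracy, ICC, between-system ANOVA with Tukey's tests) are covered by the sketch above.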
Journal description:
Medical Education Online is an open access journal of health care education, publishing peer-reviewed research, perspectives, reviews, and early documentation of new ideas and trends.
Medical Education Online aims to disseminate information on the education and training of physicians and other health care professionals. Manuscripts may address any aspect of health care education and training, including, but not limited to:
-Basic science education
-Clinical science education
-Residency education
-Learning theory
-Problem-based learning (PBL)
-Curriculum development
-Research design and statistics
-Measurement and evaluation
-Faculty development
-Informatics/web