Is AI the future of evaluation in medical education? AI vs. human evaluation in objective structured clinical examination.

IF 2.7 · Medicine (Tier 2) · Q1 Education & Educational Research
Murat Tekin, Mustafa Onur Yurdal, Çetin Toraman, Güneş Korkmaz, İbrahim Uysal
{"title":"人工智能是医学教育评估的未来吗?人工智能与人工评价在客观结构化临床检查中的比较。","authors":"Murat Tekin, Mustafa Onur Yurdal, Çetin Toraman, Güneş Korkmaz, İbrahim Uysal","doi":"10.1186/s12909-025-07241-4","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Objective Structured Clinical Examinations (OSCEs) are widely used in medical education to assess students' clinical and professional skills. Recent advancements in artificial intelligence (AI) offer opportunities to complement human evaluations. This study aims to explore the consistency between human and AI evaluators in assessing medical students' clinical skills during OSCE.</p><p><strong>Methods: </strong>This cross-sectional study was conducted at a state university in Turkey, focusing on pre-clinical medical students (Years 1, 2, and 3). Four clinical skills-intramuscular injection, square knot tying, basic life support, and urinary catheterization-were evaluated during OSCE at the end of the 2023-2024 academic year. Video recordings of the students' performances were assessed by five evaluators: a real-time human assessor, two video-based expert human assessors, and two AI-based systems (ChatGPT-4o and Gemini Flash 1.5). The evaluations were based on standardized checklists validated by the university. Data were collected from 196 students, with sample sizes ranging from 43 to 58 for each skill. Consistency among evaluators was analyzed using statistical methods.</p><p><strong>Results: </strong>AI models consistently assigned higher scores than human evaluators across all skills. For intramuscular injection, the mean total score given by AI was 28.23, while human evaluators averaged 25.25. For knot tying, AI scores averaged 16.07 versus 10.44 for humans. In basic life support, AI scores were 17.05 versus 16.48 for humans. For urinary catheterization, mean scores were similar (AI: 26.68; humans: 27.02), but showed considerable variance in individual criteria. Inter-rater consistency was higher for visually observable steps, while auditory tasks led to greater discrepancies between AI and human evaluators.</p><p><strong>Conclusions: </strong>AI shows promise as a supplemental tool for OSCE evaluation, especially for visually based clinical skills. However, its reliability varies depending on the perceptual demands of the skill being assessed. The higher and more uniform scores given by AI suggest potential for standardization, yet refinement is needed for accurate assessment of skills requiring verbal communication or auditory cues.</p>","PeriodicalId":51234,"journal":{"name":"BMC Medical Education","volume":"25 1","pages":"641"},"PeriodicalIF":2.7000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12046780/pdf/","citationCount":"0","resultStr":"{\"title\":\"Is AI the future of evaluation in medical education?? AI vs. human evaluation in objective structured clinical examination.\",\"authors\":\"Murat Tekin, Mustafa Onur Yurdal, Çetin Toraman, Güneş Korkmaz, İbrahim Uysal\",\"doi\":\"10.1186/s12909-025-07241-4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Objective Structured Clinical Examinations (OSCEs) are widely used in medical education to assess students' clinical and professional skills. Recent advancements in artificial intelligence (AI) offer opportunities to complement human evaluations. 
This study aims to explore the consistency between human and AI evaluators in assessing medical students' clinical skills during OSCE.</p><p><strong>Methods: </strong>This cross-sectional study was conducted at a state university in Turkey, focusing on pre-clinical medical students (Years 1, 2, and 3). Four clinical skills-intramuscular injection, square knot tying, basic life support, and urinary catheterization-were evaluated during OSCE at the end of the 2023-2024 academic year. Video recordings of the students' performances were assessed by five evaluators: a real-time human assessor, two video-based expert human assessors, and two AI-based systems (ChatGPT-4o and Gemini Flash 1.5). The evaluations were based on standardized checklists validated by the university. Data were collected from 196 students, with sample sizes ranging from 43 to 58 for each skill. Consistency among evaluators was analyzed using statistical methods.</p><p><strong>Results: </strong>AI models consistently assigned higher scores than human evaluators across all skills. For intramuscular injection, the mean total score given by AI was 28.23, while human evaluators averaged 25.25. For knot tying, AI scores averaged 16.07 versus 10.44 for humans. In basic life support, AI scores were 17.05 versus 16.48 for humans. For urinary catheterization, mean scores were similar (AI: 26.68; humans: 27.02), but showed considerable variance in individual criteria. Inter-rater consistency was higher for visually observable steps, while auditory tasks led to greater discrepancies between AI and human evaluators.</p><p><strong>Conclusions: </strong>AI shows promise as a supplemental tool for OSCE evaluation, especially for visually based clinical skills. However, its reliability varies depending on the perceptual demands of the skill being assessed. The higher and more uniform scores given by AI suggest potential for standardization, yet refinement is needed for accurate assessment of skills requiring verbal communication or auditory cues.</p>\",\"PeriodicalId\":51234,\"journal\":{\"name\":\"BMC Medical Education\",\"volume\":\"25 1\",\"pages\":\"641\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2025-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12046780/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Medical Education\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12909-025-07241-4\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION & EDUCATIONAL RESEARCH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Education","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12909-025-07241-4","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0

Abstract


Background: Objective Structured Clinical Examinations (OSCEs) are widely used in medical education to assess students' clinical and professional skills. Recent advancements in artificial intelligence (AI) offer opportunities to complement human evaluations. This study aims to explore the consistency between human and AI evaluators in assessing medical students' clinical skills during OSCE.

Methods: This cross-sectional study was conducted at a state university in Turkey, focusing on pre-clinical medical students (Years 1, 2, and 3). Four clinical skills (intramuscular injection, square knot tying, basic life support, and urinary catheterization) were evaluated during OSCE at the end of the 2023-2024 academic year. Video recordings of the students' performances were assessed by five evaluators: a real-time human assessor, two video-based expert human assessors, and two AI-based systems (ChatGPT-4o and Gemini Flash 1.5). The evaluations were based on standardized checklists validated by the university. Data were collected from 196 students, with sample sizes ranging from 43 to 58 for each skill. Consistency among evaluators was analyzed using statistical methods.
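
The abstract does not name the specific consistency statistics used. The sketch below shows one common way item-level OSCE checklist marks from two raters (for example, an AI system and a human assessor) could be compared, using raw agreement and Cohen's kappa; the data, column names, and agreement rate are hypothetical and not taken from the study.

```python
# Minimal sketch (not the authors' code): quantifying rater agreement on
# OSCE checklist data. Assumes each rater marks every checklist item as
# performed (1) or not performed (0); the toy data below are hypothetical.
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n = 200  # hypothetical student-by-item observations for one skill

human = rng.integers(0, 2, size=n)             # human rater marks (0/1)
ai = np.where(rng.random(n) < 0.85, human, 1)  # AI agrees ~85%, skews high

df = pd.DataFrame({"human": human, "ai": ai})

# Raw percentage agreement and chance-corrected agreement (Cohen's kappa).
agreement = (df["human"] == df["ai"]).mean()
kappa = cohen_kappa_score(df["human"], df["ai"])

print(f"Raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
# Mean item score per rater, analogous to comparing mean total scores.
print(df.mean().rename("mean item score"))
```

For more than two raters (the study used five), an intraclass correlation coefficient over the rater-by-student score matrix would be a natural extension of the same idea.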

Results: AI models consistently assigned higher scores than human evaluators across all skills. For intramuscular injection, the mean total score given by AI was 28.23, while human evaluators averaged 25.25. For knot tying, AI scores averaged 16.07 versus 10.44 for humans. In basic life support, AI scores were 17.05 versus 16.48 for humans. For urinary catheterization, mean scores were similar (AI: 26.68; humans: 27.02), but showed considerable variance in individual criteria. Inter-rater consistency was higher for visually observable steps, while auditory tasks led to greater discrepancies between AI and human evaluators.

Conclusions: AI shows promise as a supplemental tool for OSCE evaluation, especially for visually based clinical skills. However, its reliability varies depending on the perceptual demands of the skill being assessed. The higher and more uniform scores given by AI suggest potential for standardization, yet refinement is needed for accurate assessment of skills requiring verbal communication or auditory cues.

Source journal: BMC Medical Education (Education, Scientific Disciplines)
CiteScore: 4.90
Self-citation rate: 11.10%
Articles published: 795
Review time: 6 months
Journal description: BMC Medical Education is an open access journal publishing original peer-reviewed research articles in relation to the training of healthcare professionals, including undergraduate, postgraduate, and continuing education. The journal has a special focus on curriculum development, evaluations of performance, assessment of training needs and evidence-based medicine.