Comparing generative artificial intelligence platforms and nursing student performance on a women's health nursing examination in Korea: a Rasch model approach.

IF 3.7, Q1 (Education, Scientific Disciplines)
Eun Jeong Ko, Tae Kyung Lee, Geum Hee Jeong
{"title":"Comparing generative artificial intelligence platforms and nursing student performance on a women's health nursing examination in Korea: a Rasch model approach.","authors":"Eun Jeong Ko, Tae Kyung Lee, Geum Hee Jeong","doi":"10.3352/jeehp.2025.22.23","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This psychometric study aimed to compare the ability parameter estimates of generative artificial intelligence (AI) platforms with those of nursing students on a 50-item women's health nursing examination at Hallym University, Korea, using the Rasch model. It also sought to estimate item difficulty parameters and evaluate AI performance across varying difficulty levels.</p><p><strong>Methods: </strong>The exam, consisting of 39 multiple-choice items and 11 true/false items, was administered to 111 fourth-year nursing students in June 2023. In December 2024, 6 generative AI platforms (GPT-4o, ChatGPT Free, Claude.ai, Clova X, Mistral.ai, Google Gemini) completed the same items. The responses were analyzed using the Rasch model to estimate the ability and difficulty parameters. Unidimensionality was verified by the Dimensionality Evaluation to Enumerate Contributing Traits (DETECT), and analyses were conducted using the R packages irtQ and TAM.</p><p><strong>Results: </strong>The items satisfied unidimensionality (DETECT=-0.16). Item difficulty parameter estimates ranged from -3.87 to 1.96 logits (mean=-0.61), with a mean difficulty index of 0.79. Examinees' ability parameter estimates ranged from -0.71 to 3.14 logits (mean=1.17). GPT-4o, ChatGPT Free, and Claude.ai outperformed the median student ability (1.09 logits), scoring 2.68, 2.34, and 2.34, respectively, while Clova X, Mistral.ai, and Google Gemini exhibited lower scores (0.20, -0.12, 0.80). The test information curve peaked below θ=0, indicating suitability for examinees with low to average ability.</p><p><strong>Conclusion: </strong>Advanced generative AI platforms approximated the performance of high-performing students, but outcomes varied. The Rasch model effectively evaluated AI competency, supporting its potential utility for future AI performance assessments in nursing education.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"23"},"PeriodicalIF":3.7000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Educational Evaluation for Health Professions","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3352/jeehp.2025.22.23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/9/5 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose: This psychometric study aimed to compare the ability parameter estimates of generative artificial intelligence (AI) platforms with those of nursing students on a 50-item women's health nursing examination at Hallym University, Korea, using the Rasch model. It also sought to estimate item difficulty parameters and evaluate AI performance across varying difficulty levels.
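For context (the formula itself is not given in the abstract), the dichotomous Rasch model on which the analysis rests takes the standard form:

```latex
% Dichotomous Rasch model: probability that examinee j answers item i correctly
P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}
```

where θ_j is the ability of examinee j (student or AI platform) and b_i is the difficulty of item i, both expressed on a common logit scale. This shared scale is what allows the AI platforms' ability estimates to be compared directly with the students'.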

Methods: The exam, consisting of 39 multiple-choice items and 11 true/false items, was administered to 111 fourth-year nursing students in June 2023. In December 2024, 6 generative AI platforms (GPT-4o, ChatGPT Free, Claude.ai, Clova X, Mistral.ai, Google Gemini) completed the same items. The responses were analyzed using the Rasch model to estimate the ability and difficulty parameters. Unidimensionality was verified by the Dimensionality Evaluation to Enumerate Contributing Traits (DETECT), and analyses were conducted using the R packages irtQ and TAM.
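A minimal sketch of the estimation step in R, assuming a 0/1-scored response matrix named `responses` (111 students plus 6 AI platforms × 50 items); the object name is hypothetical, and the authors report using the irtQ and TAM packages but do not publish their script:

```r
library(TAM)

# Fit the Rasch (1PL) model; tam.mml() applies it by default
mod <- TAM::tam.mml(responses)

# Item difficulty parameter estimates, in logits
mod$xsi

# Person ability estimates (weighted likelihood estimates), in logits
abilities <- TAM::tam.wle(mod)
head(abilities$theta)
```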

Results: The items satisfied unidimensionality (DETECT=-0.16). Item difficulty parameter estimates ranged from -3.87 to 1.96 logits (mean=-0.61), with a mean difficulty index of 0.79. Examinees' ability parameter estimates ranged from -0.71 to 3.14 logits (mean=1.17). GPT-4o, ChatGPT Free, and Claude.ai outperformed the median student ability (1.09 logits), scoring 2.68, 2.34, and 2.34 logits, respectively, while Clova X, Mistral.ai, and Google Gemini scored lower (0.20, -0.12, and 0.80 logits, respectively). The test information curve peaked below θ=0, indicating that the examination was best suited to examinees of low to average ability.
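As a quick sanity check on the reported logit values, the Rasch model converts an ability-difficulty gap into an expected probability of a correct answer via the logistic function; a brief R illustration using the abstract's own estimates:

```r
# P(correct) = exp(theta - b) / (1 + exp(theta - b)), i.e., plogis(theta - b)
rasch_p <- function(theta, b) plogis(theta - b)

rasch_p(theta = 1.09, b = -0.61)  # median student, average item:  ~0.85
rasch_p(theta = 2.68, b = -0.61)  # GPT-4o, average item:          ~0.96
rasch_p(theta = 1.09, b = 1.96)   # median student, hardest item:  ~0.30
```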

Conclusion: Advanced generative AI platforms approximated the performance of high-performing students, but outcomes varied. The Rasch model effectively evaluated AI competency, supporting its potential utility for future AI performance assessments in nursing education.

Source journal: Journal of Educational Evaluation for Health Professions · CiteScore 9.60 · Self-citation rate 9.10% · Articles per year: 32 · Review time: 5 weeks
Journal introduction: The Journal of Educational Evaluation for Health Professions aims to provide readers with state-of-the-art practical information on educational evaluation for the health professions, so as to improve the quality of undergraduate, graduate, and continuing education. It specializes in educational evaluation, including the adoption of measurement theory in medical and health education, the promotion of high-stakes examinations such as national licensing examinations, the improvement of nationwide or international education programs, computer-based testing, computerized adaptive testing, and medical and health regulatory bodies. Its field comprises a variety of professions that address public health, including but not limited to: care workers, dental hygienists, dental technicians, dentists, dietitians, emergency medical technicians, health educators, medical record technicians, medical technologists, midwives, nurses, nursing aides, occupational therapists, opticians, Oriental medical doctors, Oriental medicine dispensers, Oriental pharmacists, pharmacists, physical therapists, physicians, prosthetists and orthotists, radiological technologists, rehabilitation counselors, sanitary technicians, and speech-language therapists.