Comparing generative artificial intelligence platforms and nursing student performance on a women's health nursing examination in Korea: a Rasch model approach.
{"title":"Comparing generative artificial intelligence platforms and nursing student performance on a women's health nursing examination in Korea: a Rasch model approach.","authors":"Eun Jeong Ko, Tae Kyung Lee, Geum Hee Jeong","doi":"10.3352/jeehp.2025.22.23","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This psychometric study aimed to compare the ability parameter estimates of generative artificial intelligence (AI) platforms with those of nursing students on a 50-item women's health nursing examination at Hallym University, Korea, using the Rasch model. It also sought to estimate item difficulty parameters and evaluate AI performance across varying difficulty levels.</p><p><strong>Methods: </strong>The exam, consisting of 39 multiple-choice items and 11 true/false items, was administered to 111 fourth-year nursing students in June 2023. In December 2024, 6 generative AI platforms (GPT-4o, ChatGPT Free, Claude.ai, Clova X, Mistral.ai, Google Gemini) completed the same items. The responses were analyzed using the Rasch model to estimate the ability and difficulty parameters. Unidimensionality was verified by the Dimensionality Evaluation to Enumerate Contributing Traits (DETECT), and analyses were conducted using the R packages irtQ and TAM.</p><p><strong>Results: </strong>The items satisfied unidimensionality (DETECT=-0.16). Item difficulty parameter estimates ranged from -3.87 to 1.96 logits (mean=-0.61), with a mean difficulty index of 0.79. Examinees' ability parameter estimates ranged from -0.71 to 3.14 logits (mean=1.17). GPT-4o, ChatGPT Free, and Claude.ai outperformed the median student ability (1.09 logits), scoring 2.68, 2.34, and 2.34, respectively, while Clova X, Mistral.ai, and Google Gemini exhibited lower scores (0.20, -0.12, 0.80). The test information curve peaked below θ=0, indicating suitability for examinees with low to average ability.</p><p><strong>Conclusion: </strong>Advanced generative AI platforms approximated the performance of high-performing students, but outcomes varied. The Rasch model effectively evaluated AI competency, supporting its potential utility for future AI performance assessments in nursing education.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"23"},"PeriodicalIF":3.7000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Educational Evaluation for Health Professions","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3352/jeehp.2025.22.23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/9/5 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Abstract
Purpose: This psychometric study aimed to compare the ability parameter estimates of generative artificial intelligence (AI) platforms with those of nursing students on a 50-item women's health nursing examination at Hallym University, Korea, using the Rasch model. It also sought to estimate item difficulty parameters and evaluate AI performance across varying difficulty levels.
Methods: The exam, consisting of 39 multiple-choice items and 11 true/false items, was administered to 111 fourth-year nursing students in June 2023. In December 2024, 6 generative AI platforms (GPT-4o, ChatGPT Free, Claude.ai, Clova X, Mistral.ai, Google Gemini) answered the same items. Responses were analyzed with the Rasch model to estimate ability and difficulty parameters. Unidimensionality was verified using the Dimensionality Evaluation to Enumerate Contributing Traits (DETECT) procedure, and analyses were conducted with the R packages irtQ and TAM.
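As a rough guide to the workflow described above, the sketch below shows how a Rasch calibration of this kind can be run in R with the TAM package, one of the two packages the authors cite. The simulated response matrix and all object names are illustrative assumptions; the authors' actual code and data are not reproduced here.

```r
# Minimal Rasch calibration sketch in R with the TAM package.
# The data are simulated stand-ins, not the study's response matrix.
library(TAM)

set.seed(1)
# 111 examinees x 50 dichotomous items (0 = wrong, 1 = correct).
# In the study, the 6 AI platforms' response vectors would be added as rows.
resp <- as.data.frame(matrix(rbinom(111 * 50, 1, 0.79), nrow = 111))

# tam.mml() fits a Rasch (1PL) model by marginal maximum likelihood.
mod <- tam.mml(resp)

# Item difficulty parameter estimates in logits.
mod$xsi

# Weighted likelihood estimates (WLE) of examinee ability in logits.
abil <- tam.wle(mod)
summary(abil$theta)
```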
Results: The items satisfied unidimensionality (DETECT=-0.16). Item difficulty parameter estimates ranged from -3.87 to 1.96 logits (mean=-0.61), with a mean difficulty index of 0.79. Examinees' ability parameter estimates ranged from -0.71 to 3.14 logits (mean=1.17). GPT-4o, ChatGPT Free, and Claude.ai outperformed the median student ability (1.09 logits), scoring 2.68, 2.34, and 2.34 logits, respectively, while Clova X, Mistral.ai, and Google Gemini scored lower (0.20, -0.12, and 0.80 logits, respectively). The test information curve peaked below θ=0, indicating that the test was best suited to examinees of low to average ability.
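To interpret these logit values, note that under the dichotomous Rasch model the probability of a correct response depends only on the gap between examinee ability θ and item difficulty b:

P(X = 1 | θ, b) = exp(θ − b) / (1 + exp(θ − b)).

As an illustrative back-of-the-envelope check (not a figure reported in the paper), a student at the median ability (θ = 1.09) facing an item of average difficulty (b = −0.61) has θ − b = 1.70, so P ≈ exp(1.70)/(1 + exp(1.70)) ≈ 0.85, i.e., about an 85% chance of answering correctly, broadly consistent with the reported mean difficulty index of 0.79.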
Conclusion: Advanced generative AI platforms approximated the performance of high-performing students, but outcomes varied. The Rasch model effectively evaluated AI competency, supporting its potential utility for future AI performance assessments in nursing education.
Journal Introduction:
Journal of Educational Evaluation for Health Professions aims to provide readers with state-of-the-art practical information on educational evaluation for the health professions, so as to raise the quality of undergraduate, graduate, and continuing education. It specializes in educational evaluation, including the application of measurement theory to health professions education, the promotion of high-stakes examinations such as national licensing examinations, the improvement of nationwide and international education programs, computer-based testing, computerized adaptive testing, and health regulatory bodies. Its scope covers the professions that address public health, including but not limited to: care workers, dental hygienists, dental technicians, dentists, dietitians, emergency medical technicians, health educators, medical record technicians, medical technologists, midwives, nurses, nursing aides, occupational therapists, opticians, oriental medical doctors, oriental medicine dispensers, oriental pharmacists, pharmacists, physical therapists, physicians, prosthetists and orthotists, radiological technologists, rehabilitation counselors, sanitary technicians, and speech-language therapists.