Artificial intelligence in radiology examinations: a psychometric comparison of question generation methods.

IF 1.7 · CAS Q4 (Medicine) · JCR Q3 · RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Emre Emekli, Betül Nalan Karahan
{"title":"Artificial intelligence in radiology examinations: a psychometric comparison of question generation methods.","authors":"Emre Emekli, Betül Nalan Karahan","doi":"10.4274/dir.2025.253407","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study aimed to evaluate the usability of artificial intelligence (AI)-based question generation methods-Chat Generative Pre-trained Transformer (ChatGPT)-4o (a non-template-based large language model) and a template-based automatic item generation (AIG) method-in the context of radiology education. The primary objective was to compare the psychometric properties, perceived quality, and educational applicability of generated multiple-choice questions (MCQs) with those written by a faculty member.</p><p><strong>Methods: </strong>Fifth-year medical students who participated in the radiology clerkship at Eskişehir Osmangazi University were invited to take a voluntary 15-question examination covering musculoskeletal and rheumatologic imaging. The examination included five MCQs from each of three sources: a radiologist educator, ChatGPT-4o, and the template-based AIG method. Student responses were evaluated in terms of difficulty and discrimination indices. Following the examination, students rated each question using a Likert scale based on clarity, difficulty, plausibility of distractors, and alignment with learning goals. Correlations between students' examination performance and their theoretical/practical radiology grades were analyzed using Pearson's correlation method.</p><p><strong>Results: </strong>A total of 115 students participated. Faculty-written questions had the highest mean correct response rate (2.91 ± 1.34), followed by template-based AIG (2.32 ± 1.66) and ChatGPT-4o (2.3 ± 1.14) questions (<i>P</i> < 0.001). The mean difficulty index was 0.58 for faculty, and 0.46 for both template- based AIG and ChatGPT-4o. Discrimination indices were acceptable (≥0.2) or very good (≥0.4) for template-based AIG questions. In contrast, four of the ChatGPT-generated questions were acceptable, and three were very good. Student evaluations of questions and the overall examination were favorable, particularly regarding question clarity and content alignment. Examination scores showed a weak correlation with practical examination performance (<i>P</i> = 0.041), but not with theoretical grades (<i>P</i> = 0.652).</p><p><strong>Conclusion: </strong>Both the ChatGPT-4o and template-based AIG methods produced MCQs with acceptable psychometric properties. While faculty-written questions were most effective overall, AI-generated questions- especially those from the template-based AIG method-showed strong potential for use in radiology education. However, the small number of items per method and the single-institution context limit the robustness and generalizability of the findings. These results should be regarded as exploratory, and further validation in larger, multicenter studies is required.</p><p><strong>Clinical significance: </strong>AI-based question generation may potentially support educators by enhancing efficiency and consistency in assessment item creation. 
These methods may complement traditional approaches to help scale up high-quality MCQ development in medical education, particularly in resource-limited settings; however, they should be applied with caution and expert oversight until further evidence is available, especially given the preliminary nature of the current findings.</p>","PeriodicalId":11341,"journal":{"name":"Diagnostic and interventional radiology","volume":" ","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Diagnostic and interventional radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.4274/dir.2025.253407","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: This study aimed to evaluate the usability of artificial intelligence (AI)-based question generation methods, Chat Generative Pre-trained Transformer (ChatGPT)-4o (a non-template-based large language model) and a template-based automatic item generation (AIG) method, in the context of radiology education. The primary objective was to compare the psychometric properties, perceived quality, and educational applicability of the generated multiple-choice questions (MCQs) with those written by a faculty member.

Methods: Fifth-year medical students who participated in the radiology clerkship at Eskişehir Osmangazi University were invited to take a voluntary 15-question examination covering musculoskeletal and rheumatologic imaging. The examination included five MCQs from each of three sources: a radiologist educator, ChatGPT-4o, and the template-based AIG method. Student responses were evaluated in terms of difficulty and discrimination indices. Following the examination, students rated each question using a Likert scale based on clarity, difficulty, plausibility of distractors, and alignment with learning goals. Correlations between students' examination performance and their theoretical/practical radiology grades were analyzed using Pearson's correlation method.
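The abstract does not state which formulations were used for these indices; under the standard classical test theory definitions (an assumption here, not a detail given by the authors), the difficulty and discrimination of item $j$ are typically computed as

$$
p_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij},
\qquad
D_j = p_j^{\mathrm{upper}} - p_j^{\mathrm{lower}},
$$

where $x_{ij} \in \{0, 1\}$ indicates whether student $i$ answered item $j$ correctly, $N$ is the number of examinees, and $p_j^{\mathrm{upper}}$ and $p_j^{\mathrm{lower}}$ are the item's proportions correct within the highest- and lowest-scoring subgroups (commonly the top and bottom 27% of students ranked by total score).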

Results: A total of 115 students participated. Faculty-written questions had the highest mean number of correct responses (2.91 ± 1.34 out of five), followed by the template-based AIG (2.32 ± 1.66) and ChatGPT-4o (2.30 ± 1.14) questions (P < 0.001). The mean difficulty index was 0.58 for faculty-written questions and 0.46 for both the template-based AIG and ChatGPT-4o questions. Discrimination indices were acceptable (≥0.2) or very good (≥0.4) for the template-based AIG questions; among the ChatGPT-generated questions, four were acceptable and three were very good. Student evaluations of the individual questions and the overall examination were favorable, particularly regarding question clarity and content alignment. Examination scores showed a weak correlation with practical examination performance (P = 0.041) but not with theoretical grades (P = 0.652).
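For illustration only (the study's analysis code is not described in the abstract), a minimal Python sketch of this kind of item analysis, using the ≥0.2 "acceptable" and ≥0.4 "very good" discrimination cut-offs quoted above and simulated response data in place of the real examination results, might look like this:

```python
# Minimal classical item-analysis sketch (illustrative; not the authors' code).
# Assumes a (n_students, n_items) matrix of 0/1 responses; all data below are simulated.
import numpy as np
from scipy.stats import pearsonr


def item_analysis(responses: np.ndarray, group_frac: float = 0.27):
    """Return per-item difficulty and discrimination indices."""
    n_students = responses.shape[0]
    difficulty = responses.mean(axis=0)  # proportion of students answering each item correctly

    # Upper-lower group method: compare the top and bottom `group_frac` of students
    # ranked by total score.
    k = max(1, int(round(group_frac * n_students)))
    order = np.argsort(responses.sum(axis=1))
    discrimination = responses[order[-k:]].mean(axis=0) - responses[order[:k]].mean(axis=0)
    return difficulty, discrimination


def label(d: float) -> str:
    """Apply the cut-offs reported in the abstract: >=0.4 very good, >=0.2 acceptable."""
    return "very good" if d >= 0.4 else "acceptable" if d >= 0.2 else "poor"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    responses = (rng.random((115, 15)) < 0.5).astype(int)  # 115 students, 15 items (simulated)
    difficulty, discrimination = item_analysis(responses)
    for j, (p, d) in enumerate(zip(difficulty, discrimination), start=1):
        print(f"item {j:2d}: difficulty={p:.2f}, discrimination={d:.2f} ({label(d)})")

    # Pearson correlation between examination totals and a hypothetical practical grade.
    practical = rng.normal(70, 10, size=115)
    r, p_value = pearsonr(responses.sum(axis=1), practical)
    print(f"r = {r:.2f}, P = {p_value:.3f}")
```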

Conclusion: Both the ChatGPT-4o and template-based AIG methods produced MCQs with acceptable psychometric properties. While faculty-written questions were the most effective overall, AI-generated questions, especially those from the template-based AIG method, showed strong potential for use in radiology education. However, the small number of items per method and the single-institution context limit the robustness and generalizability of the findings. These results should be regarded as exploratory, and further validation in larger, multicenter studies is required.

Clinical significance: AI-based question generation may support educators by enhancing efficiency and consistency in assessment item creation. These methods may complement traditional approaches to help scale up high-quality MCQ development in medical education, particularly in resource-limited settings; however, they should be applied with caution and expert oversight until further evidence is available, especially given the preliminary nature of the current findings.

Source journal: Diagnostic and Interventional Radiology (Medicine - Radiology, Nuclear Medicine and Imaging); self-citation rate 4.80%.

About the journal: Diagnostic and Interventional Radiology (Diagn Interv Radiol) is the open-access, online-only official publication of the Turkish Society of Radiology. It is published bimonthly, and the journal's publication language is English. The journal is a medium for original articles, reviews, pictorial essays, and technical notes related to all fields of diagnostic and interventional radiology.