Alexa, write my exam: ChatGPT for MCQ creation

IF 4.9 · Zone 1 (Education) · Q1 EDUCATION, SCIENTIFIC DISCIPLINES

Medical Education · Pub Date: 2024-08-30 · DOI: 10.1111/medu.15496

Stephen D. Schneid, Chris Armour, Sean Evans, Katharina Brandl

Medical Education, volume 58, issue 11, pages 1373-1374. Journal Article. Full text: https://onlinelibrary.wiley.com/doi/10.1111/medu.15496

Citations: 0

Abstract

Writing high-quality exam questions requires substantial faculty development and, more importantly, diverts time from other significant educational responsibilities. Recent research has demonstrated the efficiency of ChatGPT in generating multiple-choice questions (MCQs) and its ability to pass all three United States Medical Licensing Exams.¹ Given the potential of new artificial intelligence systems like ChatGPT, this study aims to explore their use in streamlining item writing without compromising the desirable psychometric properties of assessments.

ChatGPT 3.5 was prompted to ‘write 25 MCQs with clinical vignette in UMSLE Step 1 style on the pharmacology of antibiotics, antivirals and antiparasitic drugs addressing their indications, mechanism of action, adverse effects and contraindications’. Faculty reviewed all questions for accuracy and made minor modifications. For questions that did not align with the courses' learning objectives, ChatGPT was prompted to generate alternatives, such as ‘another question on the Pharmacology of HIV drugs’. Additionally, 25 MCQs were created without the help of ChatGPT. ChatGPT-assisted question writing took approximately 1 hour (with adjustments and corrections) compared to 9 hours without the help of ChatGPT.

Seventy-one second-year pharmacy students were assessed in Spring 2023 with a 50-item exam consisting of 25 ChatGPT-constructed and 25 faculty-generated MCQs. We compared the difficulty and psychometric characteristics of the ChatGPT-assisted and non-assisted questions using descriptive statistics, Student's t-tests and the Mann–Whitney test.
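The comparison described above can be sketched in a few lines of Python. This is only an illustration of the two named tests, not the authors' analysis: the per-item values below are invented toy numbers, the function names `student_t` and `mann_whitney_u` are hypothetical, and only the test statistics are computed (no p-values).

```python
from math import sqrt
from statistics import mean, variance


def student_t(x, y):
    """Pooled-variance (Student's) two-sample t statistic."""
    n1, n2 = len(x), len(y)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / n1 + 1 / n2))


def mann_whitney_u(x, y):
    """Mann-Whitney U for sample x: pairs where x beats y, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


# Toy per-item values (invented for illustration, NOT the study's data).
group_a = [60, 75, 80, 90]
group_b = [50, 55, 70, 65]

t_stat = student_t(group_a, group_b)
u_stat = mann_whitney_u(group_a, group_b)
print(f"t = {t_stat:.3f}, U = {u_stat}")
```

The t-test compares group means under an equal-variance assumption, while the Mann–Whitney statistic depends only on the ordering of the values, which makes it the more cautious choice when item-level metrics are skewed.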

Students' performance on MCQs generated by ChatGPT was not significantly different to that on faculty-generated items for the average scores (76.44%, SD = 16.71 for ChatGPT vs. 82.52%, SD = 10.90 for faculty), discrimination index (0.29, SD = 0.15 for ChatGPT vs. 0.25, SD = 0.17 for faculty), and the point-biserial correlation (0.31, SD = 0.13 for ChatGPT vs. 0.28, SD = 0.15 for faculty). Students took longer on average to answer ChatGPT-generated questions compared to faculty-generated questions (71 seconds, SD = 22 for ChatGPT vs. 58 seconds, SD = 25 for faculty, p < 0.05), likely due to the prevalence of ‘window dressing’, a question flaw identified in 40% of the ChatGPT-generated questions.
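For readers unfamiliar with the item statistics reported above, the following is a minimal sketch of how item difficulty, the discrimination index and the point-biserial correlation are commonly computed from a matrix of 0/1 responses. It makes several assumptions that the abstract does not confirm: the discrimination index uses upper and lower 27% groups, the point-biserial is uncorrected (the total score includes the item itself), and the response matrix is invented toy data.

```python
from statistics import mean, pstdev


def item_difficulty(item):
    """Proportion of students who answered the item correctly."""
    return mean(item)


def discrimination_index(item, totals, fraction=0.27):
    """Proportion correct in the top-scoring group minus the bottom group."""
    n = len(totals)
    k = max(1, round(n * fraction))  # group size (27% is an assumed convention)
    order = sorted(range(n), key=lambda i: totals[i])
    low, high = order[:k], order[-k:]
    return mean(item[i] for i in high) - mean(item[i] for i in low)


def point_biserial(item, totals):
    """Correlation between a 0/1 item score and the total exam score."""
    p = mean(item)
    sd = pstdev(totals)
    if p in (0, 1) or sd == 0:
        return float("nan")  # undefined when there is no variation
    m1 = mean(t for x, t in zip(item, totals) if x == 1)
    m0 = mean(t for x, t in zip(item, totals) if x == 0)
    return (m1 - m0) / sd * (p * (1 - p)) ** 0.5


# Toy data: rows are students, columns are items (invented, NOT study data).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
totals = [sum(row) for row in responses]

for j in range(len(responses[0])):
    item = [row[j] for row in responses]
    print(j, item_difficulty(item), discrimination_index(item, totals),
          point_biserial(item, totals))
```

Higher values of both the discrimination index and the point-biserial indicate that stronger students were more likely to answer the item correctly, which is why the two measures are usually reported together.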

We learned that while ChatGPT can effectively generate high-quality MCQs, saving time in the process, careful review by content experts is necessary to ensure the quality of the questions, particularly to identify and correct ‘window dressing’ flaws commonly found in ChatGPT-generated items.

We will present this data at upcoming faculty development sessions to promote the adoption of ChatGPT for generating exam questions. By presenting robust data demonstrating ChatGPT's efficacy, we believe that more faculty will integrate this tool into their question-writing processes. Faculty will also be alerted to potential question flaws and prepared to address them.

Additionally, recognising that students often desire more practice questions, we discovered that they are generally unfamiliar with this method. We plan to empower students to use ChatGPT to assist with their studies, while concurrently training faculty to become more adept at using ChatGPT to generate both practice and test items.

The authors declare that they have no conflict of interest.


Source journal

Medical Education (Medicine: Health Care)

CiteScore

8.40

Self-citation rate

10.00%

Articles published

279

Review time

4-8 weeks

Journal description: Medical Education seeks to be the pre-eminent journal in the field of education for health care professionals, and publishes material of the highest quality, reflecting worldwide or provocative issues and perspectives. The journal welcomes high-quality papers on all aspects of health professional education, including undergraduate education, postgraduate training, continuing professional development and interprofessional education.