Examining the Role of Artificial Intelligence in Assessment: A Comparative Study of ChatGPT and Educator-Generated Multiple-Choice Questions in a Dental Exam.

IF 1.9 | CAS Zone 4 (Education) | Q3 DENTISTRY, ORAL SURGERY & MEDICINE
Nezaket Ezgi Özer, Yusuf Balcı, Gaye Bölükbaşı, Betul İlhan, Pelin Güneri
{"title":"检查人工智能在评估中的作用:牙科考试中ChatGPT和教育者生成的多项选择题的比较研究。","authors":"Nezaket Ezgi Özer, Yusuf Balcı, Gaye Bölükbaşı, Betul İlhan, Pelin Güneri","doi":"10.1111/eje.70034","DOIUrl":null,"url":null,"abstract":"<p><strong>Aim: </strong>To compare the item difficulty and discriminative index of multiple-choice questions (MCQs) generated by ChatGPT with those created by dental educators, based on the performance of dental students in a real exam setting.</p><p><strong>Materials and methods: </strong>A total of 40 MCQs-20 generated by ChatGPT 4.0 and 20 by dental educators-were developed based on the Oral Diagnosis and Radiology course content. An independent, blinded panel of three educators assessed all MCQs for accuracy, relevance and clarity. Fifth-year dental students participated in an onsite and online exam featuring these questions. Item difficulty and discriminative indices were calculated using classical test theory and point-biserial correlation. Statistical analysis was conducted with the Shapiro-Wilk test, paired sample t-test and independent t-test, with significance set at p < 0.05.</p><p><strong>Results: </strong>Educators created 20 valid MCQs in 2.5 h, with minor revisions needed for three questions. ChatGPT generated 36 MCQs in 30 min; 20 were accepted, while 44% were excluded due to poor distractors, repetition, bias, or factual errors. Eighty fifth-year dental students completed the exam. The mean difficulty index was 0.41 ± 0.19 for educator-generated questions and 0.42 ± 0.15 for ChatGPT-generated questions, with no statistically significant difference (p = 0.773). Similarly, the mean discriminative index was 0.30 ± 0.16 for educator-generated questions and 0.32 ± 0.16 for ChatGPT-generated questions, also showing no significant difference (p = 0.578). Notably, 60% (n = 12) of ChatGPT-generated and 50% (n = 10) of educator-generated questions met the criteria for 'good quality', demonstrating balanced difficulty and strong discriminative performance.</p><p><strong>Conclusion: </strong>ChatGPT-generated MCQs performed comparably to educator-created questions in terms of difficulty and discriminative power, highlighting their potential to support assessment design. However, it is important to note that a substantial portion of the initial ChatGPT-generated MCQs were excluded by the independent panel due to issues related to clarity, accuracy, or distractor quality. To avoid overreliance, particularly among faculty who may lack experience in question development or awareness of AI limitations, expert review is essential before use. 
Future studies should investigate AI's ability to generate complex question formats and its long-term impact on learning.</p>","PeriodicalId":50488,"journal":{"name":"European Journal of Dental Education","volume":" ","pages":""},"PeriodicalIF":1.9000,"publicationDate":"2025-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Examining the Role of Artificial Intelligence in Assessment: A Comparative Study of ChatGPT and Educator-Generated Multiple-Choice Questions in a Dental Exam.\",\"authors\":\"Nezaket Ezgi Özer, Yusuf Balcı, Gaye Bölükbaşı, Betul İlhan, Pelin Güneri\",\"doi\":\"10.1111/eje.70034\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Aim: </strong>To compare the item difficulty and discriminative index of multiple-choice questions (MCQs) generated by ChatGPT with those created by dental educators, based on the performance of dental students in a real exam setting.</p><p><strong>Materials and methods: </strong>A total of 40 MCQs-20 generated by ChatGPT 4.0 and 20 by dental educators-were developed based on the Oral Diagnosis and Radiology course content. An independent, blinded panel of three educators assessed all MCQs for accuracy, relevance and clarity. Fifth-year dental students participated in an onsite and online exam featuring these questions. Item difficulty and discriminative indices were calculated using classical test theory and point-biserial correlation. Statistical analysis was conducted with the Shapiro-Wilk test, paired sample t-test and independent t-test, with significance set at p < 0.05.</p><p><strong>Results: </strong>Educators created 20 valid MCQs in 2.5 h, with minor revisions needed for three questions. ChatGPT generated 36 MCQs in 30 min; 20 were accepted, while 44% were excluded due to poor distractors, repetition, bias, or factual errors. Eighty fifth-year dental students completed the exam. The mean difficulty index was 0.41 ± 0.19 for educator-generated questions and 0.42 ± 0.15 for ChatGPT-generated questions, with no statistically significant difference (p = 0.773). Similarly, the mean discriminative index was 0.30 ± 0.16 for educator-generated questions and 0.32 ± 0.16 for ChatGPT-generated questions, also showing no significant difference (p = 0.578). Notably, 60% (n = 12) of ChatGPT-generated and 50% (n = 10) of educator-generated questions met the criteria for 'good quality', demonstrating balanced difficulty and strong discriminative performance.</p><p><strong>Conclusion: </strong>ChatGPT-generated MCQs performed comparably to educator-created questions in terms of difficulty and discriminative power, highlighting their potential to support assessment design. However, it is important to note that a substantial portion of the initial ChatGPT-generated MCQs were excluded by the independent panel due to issues related to clarity, accuracy, or distractor quality. To avoid overreliance, particularly among faculty who may lack experience in question development or awareness of AI limitations, expert review is essential before use. 
Future studies should investigate AI's ability to generate complex question formats and its long-term impact on learning.</p>\",\"PeriodicalId\":50488,\"journal\":{\"name\":\"European Journal of Dental Education\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2025-08-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"European Journal of Dental Education\",\"FirstCategoryId\":\"95\",\"ListUrlMain\":\"https://doi.org/10.1111/eje.70034\",\"RegionNum\":4,\"RegionCategory\":\"教育学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Dental Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1111/eje.70034","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract


Aim: To compare the item difficulty and discriminative index of multiple-choice questions (MCQs) generated by ChatGPT with those created by dental educators, based on the performance of dental students in a real exam setting.

Materials and methods: A total of 40 MCQs (20 generated by ChatGPT 4.0 and 20 by dental educators) were developed based on the Oral Diagnosis and Radiology course content. An independent, blinded panel of three educators assessed all MCQs for accuracy, relevance and clarity. Fifth-year dental students participated in an onsite and online exam featuring these questions. Item difficulty and discriminative indices were calculated using classical test theory and point-biserial correlation. Statistical analysis was conducted with the Shapiro-Wilk test, paired sample t-test and independent t-test, with significance set at p < 0.05.
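
The abstract does not include the analysis code; as a rough sketch of the procedure it describes, the Python snippet below computes the classical-test-theory difficulty index (proportion of correct responses per item) and the point-biserial discriminative index from a scored response matrix, then compares the two 20-item sets with an independent t-test. The response data and variable names are invented for illustration and are not the authors' data.

```python
# Minimal sketch of the analysis described in the abstract (toy data, not the authors' code).
import numpy as np
from scipy import stats

# responses: students x items, 1 = correct, 0 = incorrect (random toy matrix, 80 students x 40 items)
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(80, 40))
total_scores = responses.sum(axis=1)

# Item difficulty index = proportion of examinees answering the item correctly
difficulty = responses.mean(axis=0)

# Discriminative index = point-biserial correlation between each item score (0/1)
# and the total test score (uncorrected; the item itself is included in the total)
discrimination = np.array([
    stats.pointbiserialr(responses[:, i], total_scores)[0]
    for i in range(responses.shape[1])
])

# Compare the two question sets (here: first 20 items vs. last 20 items)
educ_diff, gpt_diff = difficulty[:20], difficulty[20:]
t_stat, p_value = stats.ttest_ind(educ_diff, gpt_diff)
print(f"difficulty: {educ_diff.mean():.2f} vs {gpt_diff.mean():.2f}, p = {p_value:.3f}")
```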

Results: Educators created 20 valid MCQs in 2.5 h, with minor revisions needed for three questions. ChatGPT generated 36 MCQs in 30 min; 20 were accepted, while 44% were excluded due to poor distractors, repetition, bias, or factual errors. Eighty fifth-year dental students completed the exam. The mean difficulty index was 0.41 ± 0.19 for educator-generated questions and 0.42 ± 0.15 for ChatGPT-generated questions, with no statistically significant difference (p = 0.773). Similarly, the mean discriminative index was 0.30 ± 0.16 for educator-generated questions and 0.32 ± 0.16 for ChatGPT-generated questions, also showing no significant difference (p = 0.578). Notably, 60% (n = 12) of ChatGPT-generated and 50% (n = 10) of educator-generated questions met the criteria for 'good quality', demonstrating balanced difficulty and strong discriminative performance.
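
The cut-offs behind 'good quality' are not stated in the abstract; a common classical-test-theory convention treats a difficulty index of roughly 0.30-0.70 as balanced and a discriminative index of at least 0.30 as strong. A minimal sketch under those assumed thresholds:

```python
# Hypothetical 'good quality' check using commonly cited classical-test-theory
# cut-offs (assumed here, not taken from the paper): difficulty 0.30-0.70 and
# discrimination >= 0.30.
def is_good_quality(difficulty: float, discrimination: float) -> bool:
    return 0.30 <= difficulty <= 0.70 and discrimination >= 0.30

# Toy (difficulty, discrimination) pairs for illustration only
items = [(0.41, 0.30), (0.42, 0.32), (0.85, 0.12)]
good = [pair for pair in items if is_good_quality(*pair)]
print(f"{len(good)} of {len(items)} items meet the assumed criteria")
```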

Conclusion: ChatGPT-generated MCQs performed comparably to educator-created questions in terms of difficulty and discriminative power, highlighting their potential to support assessment design. However, it is important to note that a substantial portion of the initial ChatGPT-generated MCQs were excluded by the independent panel due to issues related to clarity, accuracy, or distractor quality. To avoid overreliance, particularly among faculty who may lack experience in question development or awareness of AI limitations, expert review is essential before use. Future studies should investigate AI's ability to generate complex question formats and its long-term impact on learning.

Source journal: European Journal of Dental Education
CiteScore: 4.10
Self-citation rate: 16.70%
Annual publications: 127
Review time: 6-12 weeks
Journal description: The aim of the European Journal of Dental Education is to publish original, topical and review articles of the highest quality in the field of dental education. The Journal seeks to disseminate widely the latest information on curriculum development, teaching methodologies, assessment techniques and quality assurance in the fields of dental undergraduate and postgraduate education and dental auxiliary personnel training. The scope includes the dental educational aspects of the basic medical sciences, the behavioural sciences, the interface with medical education, information technology and distance learning, and educational audit. Papers embodying the results of high-quality educational research of relevance to dentistry are particularly encouraged, as are evidence-based reports of novel and established educational programmes and their outcomes.