Nezaket Ezgi Özer, Yusuf Balcı, Gaye Bölükbaşı, Betul İlhan, Pelin Güneri
{"title":"检查人工智能在评估中的作用:牙科考试中ChatGPT和教育者生成的多项选择题的比较研究。","authors":"Nezaket Ezgi Özer, Yusuf Balcı, Gaye Bölükbaşı, Betul İlhan, Pelin Güneri","doi":"10.1111/eje.70034","DOIUrl":null,"url":null,"abstract":"<p><strong>Aim: </strong>To compare the item difficulty and discriminative index of multiple-choice questions (MCQs) generated by ChatGPT with those created by dental educators, based on the performance of dental students in a real exam setting.</p><p><strong>Materials and methods: </strong>A total of 40 MCQs-20 generated by ChatGPT 4.0 and 20 by dental educators-were developed based on the Oral Diagnosis and Radiology course content. An independent, blinded panel of three educators assessed all MCQs for accuracy, relevance and clarity. Fifth-year dental students participated in an onsite and online exam featuring these questions. Item difficulty and discriminative indices were calculated using classical test theory and point-biserial correlation. Statistical analysis was conducted with the Shapiro-Wilk test, paired sample t-test and independent t-test, with significance set at p < 0.05.</p><p><strong>Results: </strong>Educators created 20 valid MCQs in 2.5 h, with minor revisions needed for three questions. ChatGPT generated 36 MCQs in 30 min; 20 were accepted, while 44% were excluded due to poor distractors, repetition, bias, or factual errors. Eighty fifth-year dental students completed the exam. The mean difficulty index was 0.41 ± 0.19 for educator-generated questions and 0.42 ± 0.15 for ChatGPT-generated questions, with no statistically significant difference (p = 0.773). Similarly, the mean discriminative index was 0.30 ± 0.16 for educator-generated questions and 0.32 ± 0.16 for ChatGPT-generated questions, also showing no significant difference (p = 0.578). Notably, 60% (n = 12) of ChatGPT-generated and 50% (n = 10) of educator-generated questions met the criteria for 'good quality', demonstrating balanced difficulty and strong discriminative performance.</p><p><strong>Conclusion: </strong>ChatGPT-generated MCQs performed comparably to educator-created questions in terms of difficulty and discriminative power, highlighting their potential to support assessment design. However, it is important to note that a substantial portion of the initial ChatGPT-generated MCQs were excluded by the independent panel due to issues related to clarity, accuracy, or distractor quality. To avoid overreliance, particularly among faculty who may lack experience in question development or awareness of AI limitations, expert review is essential before use. 
Future studies should investigate AI's ability to generate complex question formats and its long-term impact on learning.</p>","PeriodicalId":50488,"journal":{"name":"European Journal of Dental Education","volume":" ","pages":""},"PeriodicalIF":1.9000,"publicationDate":"2025-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Examining the Role of Artificial Intelligence in Assessment: A Comparative Study of ChatGPT and Educator-Generated Multiple-Choice Questions in a Dental Exam.\",\"authors\":\"Nezaket Ezgi Özer, Yusuf Balcı, Gaye Bölükbaşı, Betul İlhan, Pelin Güneri\",\"doi\":\"10.1111/eje.70034\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Aim: </strong>To compare the item difficulty and discriminative index of multiple-choice questions (MCQs) generated by ChatGPT with those created by dental educators, based on the performance of dental students in a real exam setting.</p><p><strong>Materials and methods: </strong>A total of 40 MCQs-20 generated by ChatGPT 4.0 and 20 by dental educators-were developed based on the Oral Diagnosis and Radiology course content. An independent, blinded panel of three educators assessed all MCQs for accuracy, relevance and clarity. Fifth-year dental students participated in an onsite and online exam featuring these questions. Item difficulty and discriminative indices were calculated using classical test theory and point-biserial correlation. Statistical analysis was conducted with the Shapiro-Wilk test, paired sample t-test and independent t-test, with significance set at p < 0.05.</p><p><strong>Results: </strong>Educators created 20 valid MCQs in 2.5 h, with minor revisions needed for three questions. ChatGPT generated 36 MCQs in 30 min; 20 were accepted, while 44% were excluded due to poor distractors, repetition, bias, or factual errors. Eighty fifth-year dental students completed the exam. The mean difficulty index was 0.41 ± 0.19 for educator-generated questions and 0.42 ± 0.15 for ChatGPT-generated questions, with no statistically significant difference (p = 0.773). Similarly, the mean discriminative index was 0.30 ± 0.16 for educator-generated questions and 0.32 ± 0.16 for ChatGPT-generated questions, also showing no significant difference (p = 0.578). Notably, 60% (n = 12) of ChatGPT-generated and 50% (n = 10) of educator-generated questions met the criteria for 'good quality', demonstrating balanced difficulty and strong discriminative performance.</p><p><strong>Conclusion: </strong>ChatGPT-generated MCQs performed comparably to educator-created questions in terms of difficulty and discriminative power, highlighting their potential to support assessment design. However, it is important to note that a substantial portion of the initial ChatGPT-generated MCQs were excluded by the independent panel due to issues related to clarity, accuracy, or distractor quality. To avoid overreliance, particularly among faculty who may lack experience in question development or awareness of AI limitations, expert review is essential before use. 
Future studies should investigate AI's ability to generate complex question formats and its long-term impact on learning.</p>\",\"PeriodicalId\":50488,\"journal\":{\"name\":\"European Journal of Dental Education\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2025-08-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"European Journal of Dental Education\",\"FirstCategoryId\":\"95\",\"ListUrlMain\":\"https://doi.org/10.1111/eje.70034\",\"RegionNum\":4,\"RegionCategory\":\"教育学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Dental Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1111/eje.70034","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Examining the Role of Artificial Intelligence in Assessment: A Comparative Study of ChatGPT and Educator-Generated Multiple-Choice Questions in a Dental Exam.
Aim: To compare the item difficulty and discriminative index of multiple-choice questions (MCQs) generated by ChatGPT with those created by dental educators, based on the performance of dental students in a real exam setting.
Materials and methods: A total of 40 MCQs (20 generated by ChatGPT 4.0 and 20 by dental educators) were developed based on the Oral Diagnosis and Radiology course content. An independent, blinded panel of three educators assessed all MCQs for accuracy, relevance and clarity. Fifth-year dental students participated in an on-site and online exam featuring these questions. Item difficulty and discriminative indices were calculated using classical test theory and point-biserial correlation. Statistical analysis was conducted with the Shapiro-Wilk test, the paired-samples t-test and the independent-samples t-test, with significance set at p < 0.05.
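As a rough illustration of this item analysis, the following Python sketch computes the difficulty index (proportion of correct answers) and a corrected point-biserial discrimination index from a binary response matrix, then runs the tests named above. The toy data, the assumed ordering of items (first 20 educator-written, last 20 ChatGPT-generated) and all variable names are assumptions for illustration, not the study's actual data or code.

import numpy as np
from scipy import stats

# Toy response matrix: 80 students x 40 MCQs, 1 = correct, 0 = incorrect (simulated).
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(80, 40))
educator, chatgpt = responses[:, :20], responses[:, 20:]  # assumed item ordering

# Item difficulty index (classical test theory): proportion of students answering correctly.
difficulty = responses.mean(axis=0)

# Discriminative index: point-biserial correlation between each item score and the
# total score on the remaining items (corrected item-total correlation).
total = responses.sum(axis=1)
discrimination = np.array([
    stats.pointbiserialr(responses[:, i], total - responses[:, i])[0]
    for i in range(responses.shape[1])
])

# Analyses mirroring the abstract: Shapiro-Wilk for normality, an independent-samples
# t-test comparing the indices of the two question sets, and a paired-samples t-test
# comparing each student's subscores on the two sets.
print(stats.shapiro(difficulty[:20]), stats.shapiro(difficulty[20:]))
print(stats.ttest_ind(difficulty[:20], difficulty[20:]))
print(stats.ttest_rel(educator.sum(axis=1), chatgpt.sum(axis=1)))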
Results: Educators created 20 valid MCQs in 2.5 h, with minor revisions needed for three questions. ChatGPT generated 36 MCQs in 30 min; 20 were accepted, while the remaining 16 (44%) were excluded due to poor distractors, repetition, bias, or factual errors. Eighty fifth-year dental students completed the exam. The mean difficulty index was 0.41 ± 0.19 for educator-generated questions and 0.42 ± 0.15 for ChatGPT-generated questions, with no statistically significant difference (p = 0.773). Similarly, the mean discriminative index was 0.30 ± 0.16 for educator-generated questions and 0.32 ± 0.16 for ChatGPT-generated questions, also showing no significant difference (p = 0.578). Notably, 60% (n = 12) of ChatGPT-generated and 50% (n = 10) of educator-generated questions met the criteria for 'good quality', demonstrating balanced difficulty and strong discriminative performance.
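The abstract does not specify the cut-offs behind the 'good quality' label. The short sketch below applies thresholds that are common in classical test theory (difficulty between 0.30 and 0.70, discrimination of at least 0.30) purely as an assumption, to show how such a classification could be coded.

def is_good_quality(difficulty: float, discrimination: float) -> bool:
    # Assumed cut-offs: balanced difficulty 0.30-0.70 and discrimination >= 0.30;
    # these are illustrative, not thresholds reported in the study.
    return 0.30 <= difficulty <= 0.70 and discrimination >= 0.30

# Example: an item answered correctly by 42% of students, with a point-biserial
# correlation of 0.32, would be flagged as 'good quality' under these cut-offs.
print(is_good_quality(0.42, 0.32))  # True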
Conclusion: ChatGPT-generated MCQs performed comparably to educator-created questions in terms of difficulty and discriminative power, highlighting their potential to support assessment design. However, a substantial portion of the initial ChatGPT-generated MCQs was excluded by the independent panel due to issues with clarity, accuracy or distractor quality. To avoid overreliance, particularly among faculty who may lack experience in question development or awareness of AI limitations, expert review is essential before use. Future studies should investigate AI's ability to generate complex question formats and its long-term impact on learning.
Journal introduction:
The aim of the European Journal of Dental Education is to publish original topical and review articles of the highest quality in the field of Dental Education. The Journal seeks to disseminate widely the latest information on curriculum development, teaching methodologies, assessment techniques and quality assurance in the fields of dental undergraduate and postgraduate education and dental auxiliary personnel training. The scope includes the dental educational aspects of the basic medical sciences, the behavioural sciences, the interface with medical education, information technology and distance learning, and educational audit. Papers embodying the results of high-quality educational research of relevance to dentistry are particularly encouraged, as are evidence-based reports of novel and established educational programmes and their outcomes.