{"title":"Faculty versus artificial intelligence chatbot: a comparative analysis of multiple-choice question quality in physiology.","authors":"Anup Kumar D Dhanvijay, Amita Kumari, Mohammed Jaffer Pinjar, Anita Kumari, Abhimanyu Ganguly, Ankita Priya, Ayesha Juhi, Pratima Gupta, Himel Mondal","doi":"10.1152/advan.00197.2025","DOIUrl":null,"url":null,"abstract":"<p><p><b>Background:</b> Multiple-choice questions (MCQs) are widely used for assessment in medical education. While human-generated MCQs benefit from pedagogical insight, creating high-quality items is time-intensive. With the advent of artificial intelligence (AI), tools like DeepSeek R1 offer potential for automated MCQ generation, though their educational validity remains uncertain. With this background, this study compared the psychometric quality of Physiology MCQs generated by faculty and an AI chatbot. <b>Methods:</b> A total of 200 MCQs were developed following the standard syllabus and question design guidelines - 100 by Physiology faculty and 100 by the AI chatbot DeepSeek R1. Fifty questions from each group were randomly selected and administered to undergraduate medical students in 2 hours. Item analysis was conducted post-assessment using difficulty index (DIFI), discrimination index (DI), and non-functional distractors (NFDs). Statistical comparisons were made using t-tests or non-parametric equivalents, with significance at p <0.05. <b>Results:</b> Chatbot-generated MCQs had a significantly higher DIFI (0.64 ± 0.22) than faculty MCQs (0.47 ± 0.19, p <.0001). No significant difference in DI was found between the groups (p = .17). Faculty MCQs had significantly fewer NFDs (median 0) compared to chatbot MCQs (median 1, p = .0063). <b>Conclusion:</b> AI-generated MCQs demonstrated comparable discrimination ability but were generally easier and contained more ineffective distractors. While chatbots show promise in MCQ generation, further refinement is needed to improve distractor quality and item difficulty. AI can complement but not yet replace human expertise in assessment design.</p>","PeriodicalId":50852,"journal":{"name":"Advances in Physiology Education","volume":" ","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Physiology Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1152/advan.00197.2025","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Citations: 0
Abstract
Background: Multiple-choice questions (MCQs) are widely used for assessment in medical education. While human-generated MCQs benefit from pedagogical insight, creating high-quality items is time-intensive. With the advent of artificial intelligence (AI), tools such as DeepSeek R1 offer potential for automated MCQ generation, though their educational validity remains uncertain. Against this background, this study compared the psychometric quality of physiology MCQs generated by faculty and by an AI chatbot. Methods: A total of 200 MCQs were developed following the standard syllabus and question design guidelines: 100 by physiology faculty and 100 by the AI chatbot DeepSeek R1. Fifty questions from each group were randomly selected and administered to undergraduate medical students in a 2-hour assessment. Item analysis was conducted after the assessment using the difficulty index (DIFI), discrimination index (DI), and number of non-functional distractors (NFDs). Statistical comparisons were made using t-tests or non-parametric equivalents, with significance set at p < 0.05. Results: Chatbot-generated MCQs had a significantly higher DIFI (0.64 ± 0.22) than faculty MCQs (0.47 ± 0.19, p < 0.0001). No significant difference in DI was found between the groups (p = 0.17). Faculty MCQs had significantly fewer NFDs (median 0) than chatbot MCQs (median 1, p = 0.0063). Conclusion: AI-generated MCQs demonstrated comparable discrimination ability but were generally easier and contained more ineffective distractors. While chatbots show promise in MCQ generation, further refinement is needed to improve distractor quality and item difficulty. AI can complement but not yet replace human expertise in assessment design.
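The item-analysis metrics named in the abstract (DIFI, DI, NFDs) follow standard psychometric definitions. The sketch below is a minimal, illustrative implementation of those conventional formulas; it is not the authors' analysis code, and the 27% upper/lower split for the discrimination index and the 5% cutoff for non-functional distractors are commonly used conventions assumed here.

```python
# Illustrative item-analysis sketch (assumed conventions, not the study's actual code).
from typing import Dict, List


def difficulty_index(correct_flags: List[bool]) -> float:
    """DIFI: proportion of examinees who answered the item correctly (0-1)."""
    return sum(correct_flags) / len(correct_flags)


def discrimination_index(total_scores: List[int],
                         correct_flags: List[bool],
                         frac: float = 0.27) -> float:
    """DI: difference in the item's correct-response rate between the top and
    bottom `frac` of examinees ranked by total test score (27% split assumed)."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i], reverse=True)
    n = max(1, round(frac * len(total_scores)))
    upper = sum(correct_flags[i] for i in order[:n])
    lower = sum(correct_flags[i] for i in order[-n:])
    return (upper - lower) / n


def nonfunctional_distractors(option_counts: Dict[str, int],
                              correct_option: str,
                              cutoff: float = 0.05) -> int:
    """NFD: number of incorrect options selected by fewer than `cutoff` (5%) of examinees."""
    total = sum(option_counts.values())
    return sum(1 for opt, count in option_counts.items()
               if opt != correct_option and count / total < cutoff)


if __name__ == "__main__":
    # Hypothetical responses from 10 students to a single item.
    scores = [42, 38, 35, 33, 30, 28, 25, 22, 20, 15]   # total test scores
    flags = [True, True, True, True, False, True, False, False, False, False]
    picks = {"A": 5, "B": 3, "C": 2, "D": 0}             # option counts; "A" is the key

    print(difficulty_index(flags))                # 0.5
    print(discrimination_index(scores, flags))    # 1.0 with the 27% split
    print(nonfunctional_distractors(picks, "A"))  # 1 (option D chosen by <5%)
```

Group-level comparisons such as those reported (t-test or a non-parametric equivalent like the Mann-Whitney U test) could then be run on the per-item DIFI, DI, and NFD values for the faculty and chatbot question sets.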
Journal Description:
Advances in Physiology Education promotes and disseminates educational scholarship in order to enhance teaching and learning of physiology, neuroscience and pathophysiology. The journal publishes peer-reviewed descriptions of innovations that improve teaching in the classroom and laboratory, essays on education, and review articles based on our current understanding of physiological mechanisms. Submissions that evaluate new technologies for teaching and research, and educational pedagogy, are especially welcome. The audience for the journal includes educators at all levels: K–12, undergraduate, graduate, and professional programs.