Can Autograding of Student-Generated Questions Quality by ChatGPT Match Human Experts?

IF 2.9 · CAS Tier 3 (Education) · JCR Q2 (Computer Science, Interdisciplinary Applications)
Kangkang Li;Qian Yang;Xianmin Yang
{"title":"Can Autograding of Student-Generated Questions Quality by ChatGPT Match Human Experts?","authors":"Kangkang Li;Qian Yang;Xianmin Yang","doi":"10.1109/TLT.2024.3394807","DOIUrl":null,"url":null,"abstract":"The student-generated question (SGQ) strategy is an effective instructional strategy for developing students' higher order cognitive and critical thinking. However, assessing the quality of SGQs is time consuming and domain experts intensive. Previous automatic evaluation work focused on surface-level features of questions. To overcome this limitation, the state-of-the-art language models GPT-3.5 and GPT-4.0 were used to evaluate 1084 SGQs for topic relevance, clarity of expression, answerability, challenging, and cognitive level. Results showed that GPT-4.0 exhibits superior grading consistency with experts compared to GPT-3.5 in terms of topic relevance, clarity of expression, answerability, and difficulty level. GPT-3.5 and GPT-4.0 had low consistency with experts in terms of cognitive level. Over three rounds of testing, GPT-4.0 demonstrated higher stability in autograding when contrasted with GPT-3.5. In addition, to validate the effectiveness of GPT in evaluating SGQs from different domains and subjects, we have done the same experiment on a part of LearningQ dataset. We also discussed the attitudes of teachers and students toward automatic grading by GPT models. The findings underscore the potential of GPT-4.0 to assist teachers in evaluating the quality of SGQs. Nevertheless, the cognitive level assessment of SGQs still needs manual examination by teachers.","PeriodicalId":49191,"journal":{"name":"IEEE Transactions on Learning Technologies","volume":"17 ","pages":"1600-1610"},"PeriodicalIF":2.9000,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Learning Technologies","FirstCategoryId":"95","ListUrlMain":"https://ieeexplore.ieee.org/document/10510637/","RegionNum":3,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

The student-generated question (SGQ) strategy is an effective instructional strategy for developing students' higher order cognitive and critical thinking skills. However, assessing the quality of SGQs is time consuming and requires intensive effort from domain experts. Previous automatic evaluation work focused on surface-level features of questions. To overcome this limitation, the state-of-the-art language models GPT-3.5 and GPT-4.0 were used to evaluate 1084 SGQs for topic relevance, clarity of expression, answerability, difficulty, and cognitive level. Results showed that GPT-4.0 exhibited higher grading consistency with experts than GPT-3.5 on topic relevance, clarity of expression, answerability, and difficulty level. Both GPT-3.5 and GPT-4.0 had low consistency with experts on cognitive level. Over three rounds of testing, GPT-4.0 demonstrated higher autograding stability than GPT-3.5. In addition, to validate the effectiveness of GPT in evaluating SGQs from different domains and subjects, we ran the same experiment on a subset of the LearningQ dataset. We also discuss the attitudes of teachers and students toward automatic grading by GPT models. The findings underscore the potential of GPT-4.0 to assist teachers in evaluating the quality of SGQs. Nevertheless, assessing the cognitive level of SGQs still requires manual examination by teachers.
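The abstract does not reproduce the authors' prompts or code, so the following is only a minimal sketch of how rubric-based SGQ autograding with a chat-completion API could look. The model name, rubric wording, JSON schema, and the grade_sgq helper are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of rubric-based SGQ autograding via a chat-completion API.
# The rubric wording, model name, and output schema below are assumptions for
# illustration; the paper does not publish its exact prompts.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Rate the student-generated question on a 1-5 scale for each dimension: "
    "topic_relevance, clarity, answerability, difficulty. "
    "Also classify its cognitive_level using Bloom's taxonomy "
    "(remember, understand, apply, analyze, evaluate, create). "
    "Respond with only a JSON object using exactly those keys."
)

def grade_sgq(question: str, topic: str, model: str = "gpt-4") -> dict:
    """Ask the model to grade one question against the rubric."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variance, cf. the paper's stability tests
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Topic: {topic}\nQuestion: {question}"},
        ],
    )
    # For a sketch we assume the model complies and returns valid JSON.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    print(grade_sgq(
        question="Why does increasing the training set size reduce overfitting?",
        topic="machine learning fundamentals",
    ))
```

Consistency with expert grades, as studied in the paper, could then be quantified over a batch of such outputs with standard agreement statistics, e.g., sklearn.metrics.cohen_kappa_score on the per-dimension scores.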
Source Journal
IEEE Transactions on Learning Technologies
CiteScore: 7.50
Self-citation rate: 5.40%
Articles published per year: 82
Review time: >12 weeks
Journal description: The IEEE Transactions on Learning Technologies covers all advances in learning technologies and their applications, including but not limited to the following topics: innovative online learning systems; intelligent tutors; educational games; simulation systems for education and training; collaborative learning tools; learning with mobile devices; wearable devices and interfaces for learning; personalized and adaptive learning systems; tools for formative and summative assessment; tools for learning analytics and educational data mining; ontologies for learning systems; standards and web services that support learning; authoring tools for learning materials; computer support for peer tutoring; learning via computer-mediated inquiry, field, and lab work; social learning techniques; social networks and infrastructures for learning and knowledge sharing; and creation and management of learning objects.