Automating Autograding: Large Language Models as Test Suite Generators for Introductory Programming

Impact Factor: 5.1 · JCR Q1 (Education & Educational Research) · CAS Zone 2 (Education)
Umar Alkafaween, Ibrahim Albluwi, Paul Denny
{"title":"Automating Autograding: Large Language Models as Test Suite Generators for Introductory Programming","authors":"Umar Alkafaween,&nbsp;Ibrahim Albluwi,&nbsp;Paul Denny","doi":"10.1111/jcal.13100","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Background</h3>\n \n <p>Automatically graded programming assignments provide instant feedback to students and significantly reduce manual grading time for instructors. However, creating comprehensive suites of test cases for programming problems within automatic graders can be time-consuming and complex. The effort needed to define test suites may deter some instructors from creating additional problems or lead to inadequate test coverage, potentially resulting in misleading feedback on student solutions. Such limitations may reduce student access to the well-documented benefits of timely feedback when learning programming.</p>\n </section>\n \n <section>\n \n <h3> Objectives</h3>\n \n <p>We evaluate the effectiveness of using Large Language Models (LLMs), as part of a larger workflow, to automatically generate test suites for CS1-level programming problems.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>Each problem's statement and reference solution are provided to GPT-4 to produce a test suite that can be used by an autograder. We evaluate our proposed approach using a sample of 26 problems, and more than 25,000 attempted solutions to those problems, submitted by students in an introductory programming course. We compare the performance of the LLM-generated test suites against the instructor-created test suites for each problem.</p>\n </section>\n \n <section>\n \n <h3> Results and Conclusions</h3>\n \n <p>Our findings reveal that LLM-generated test suites can correctly identify most valid solutions, and for most problems are at least as comprehensive as the instructor test suites. Additionally, the LLM-generated test suites exposed ambiguities in some problem statements, underscoring their potential to improve both autograding and instructional design.</p>\n </section>\n </div>","PeriodicalId":48071,"journal":{"name":"Journal of Computer Assisted Learning","volume":"41 1","pages":""},"PeriodicalIF":5.1000,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Assisted Learning","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jcal.13100","RegionNum":2,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0

Abstract

Background

Automatically graded programming assignments provide instant feedback to students and significantly reduce manual grading time for instructors. However, creating comprehensive suites of test cases for programming problems within automatic graders can be time-consuming and complex. The effort needed to define test suites may deter some instructors from creating additional problems or lead to inadequate test coverage, potentially resulting in misleading feedback on student solutions. Such limitations may reduce student access to the well-documented benefits of timely feedback when learning programming.

Objectives

We evaluate the effectiveness of using Large Language Models (LLMs), as part of a larger workflow, to automatically generate test suites for CS1-level programming problems.

Methods

Each problem's statement and reference solution are provided to GPT-4 to produce a test suite that can be used by an autograder. We evaluate our proposed approach using a sample of 26 problems, and more than 25,000 attempted solutions to those problems, submitted by students in an introductory programming course. We compare the performance of the LLM-generated test suites against the instructor-created test suites for each problem.
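
The abstract does not reproduce the prompt or tooling used in this workflow. As a rough illustration only, a minimal sketch of the generation step might look like the following, assuming the OpenAI Python client (openai>=1.0), a hypothetical `Problem` structure, and a simple JSON list of stdin/stdout pairs as the test-suite format the autograder consumes; none of these details come from the paper.

```python
# Minimal sketch (not the paper's code) of the generation step described
# above: send the problem statement and reference solution to GPT-4 and
# ask for autograder test cases. The Problem structure, prompt wording,
# and {"input", "expected"} test-case format are assumptions.
import json
from dataclasses import dataclass

from openai import OpenAI


@dataclass
class Problem:
    statement: str           # problem description shown to students
    reference_solution: str  # instructor's known-correct solution


PROMPT = """You are writing autograder test cases for a CS1 problem.

Problem statement:
{statement}

Reference solution:
{solution}

Return only a JSON array of test cases, each an object with
"input" (text fed to stdin) and "expected" (exact stdout).
Cover typical values, boundaries, and edge cases."""


def generate_test_suite(problem: Problem, model: str = "gpt-4") -> list[dict]:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(
            statement=problem.statement,
            solution=problem.reference_solution,
        )}],
    )
    # A real pipeline would validate each case by running it against the
    # reference solution before trusting the expected outputs.
    return json.loads(response.choices[0].message.content)
```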

Results and Conclusions

Our findings reveal that LLM-generated test suites can correctly identify most valid solutions, and for most problems are at least as comprehensive as the instructor test suites. Additionally, the LLM-generated test suites exposed ambiguities in some problem statements, underscoring their potential to improve both autograding and instructional design.
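
One way to make the reported comparison concrete is to check each student submission against both suites and tally where the verdicts diverge. The toy harness below illustrates this under assumptions not taken from the paper: Python submissions graded on stdin/stdout, and the JSON test-case format sketched above.

```python
# Toy harness (an illustrative assumption, not the paper's code) for
# comparing an LLM-generated suite with an instructor suite over a corpus
# of student submissions, using {"input", "expected"} test cases.
import subprocess
import sys


def run_submission(code: str, stdin_text: str, timeout: float = 5.0) -> str:
    """Run one Python submission on one input and capture its stdout.
    A bare-bones stand-in for a real sandboxed autograder runner."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


def passes(code: str, suite: list[dict]) -> bool:
    """A submission passes a suite iff every expected output matches."""
    return all(
        run_submission(code, case["input"]).strip() == case["expected"].strip()
        for case in suite
    )


def compare_suites(submissions: list[str], llm_suite: list[dict],
                   instructor_suite: list[dict]) -> dict:
    """Cross-tabulate verdicts to see where the two suites disagree."""
    tally = {"agree": 0, "only_llm_fails": 0, "only_instructor_fails": 0}
    for code in submissions:
        llm_ok = passes(code, llm_suite)
        inst_ok = passes(code, instructor_suite)
        if llm_ok == inst_ok:
            tally["agree"] += 1
        elif inst_ok:   # instructor suite accepts, LLM suite rejects
            tally["only_llm_fails"] += 1
        else:           # LLM suite accepts, instructor suite rejects
            tally["only_instructor_fails"] += 1
    return tally
```

In this framing, the suite that fails submissions the other accepts is the stricter of the two; a comprehensiveness comparison like the one the abstract describes amounts to examining these disagreement counts on known-correct and known-incorrect solutions.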

Source Journal
Journal of Computer Assisted Learning (Education & Educational Research)
CiteScore: 9.70 · Self-citation rate: 6.00% · Articles per year: 116
Journal Description

The Journal of Computer Assisted Learning is an international peer-reviewed journal which covers the whole range of uses of information and communication technology to support learning and knowledge exchange. It aims to provide a medium for communication among researchers as well as a channel linking researchers, practitioners, and policy makers. JCAL is also a rich source of material for master's and PhD students in areas such as educational psychology, the learning sciences, instructional technology, instructional design, collaborative learning, intelligent learning systems, learning analytics, open, distance and networked learning, and educational evaluation and assessment. This is the case for formal (e.g., schools), non-formal (e.g., workplace learning), and informal (e.g., museums and libraries) learning situations and environments. Volumes often include one Special Issue, which provides readers with a broad and in-depth perspective on a specific topic.

First published in 1985, JCAL continues to have the aim of making the outcomes of contemporary research and experience accessible. During this period there have been major technological advances offering new opportunities and approaches in the use of a wide range of technologies to support learning and knowledge transfer more generally. There is currently much emphasis on the use of network functionality and the challenges its appropriate use poses to teachers/tutors working with students locally and at a distance.

JCAL welcomes:
- Empirical reports, single studies or programmatic series of studies on the use of computers and information technologies in learning and assessment
- Critical and original meta-reviews of literature on the use of computers for learning
- Empirical studies on the design and development of innovative technology-based systems for learning
- Conceptual articles on issues relating to the Aims and Scope