{"title":"Evaluating the Test Adequacy of Benchmarks for LLMs on Code Generation","authors":"Xiangyue Liu, Xiaobing Sun, Lili Bo, Yufei Hu, Xinwei Liu, Zhenlei Ye","doi":"10.1002/smr.70034","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>Code generation for users' intent has become increasingly prevalent with the large language models (LLMs). To automatically evaluate the effectiveness of these models, multiple execution-based benchmarks are proposed, including specially crafted tasks, accompanied by some test cases and a ground truth solution. LLMs are regarded as well-performed in code generation tasks if they can pass the test cases corresponding to most tasks in these benchmarks. However, it is unknown whether the test cases have sufficient test adequacy and whether the test adequacy can affect the evaluation. In this paper, we conducted an empirical study to evaluate the test adequacy of the execution-based benchmarks and to explore their effects during evaluation for LLMs. Based on the evaluation of the widely used benchmarks, HumanEval, MBPP, and two enhanced benchmarks HumanEval+ and MBPP+, we obtained the following results: (1) All the evaluated benchmarks have high statement coverage (above 99.16%), low branch coverage (74.39%) and low mutation score (87.69%). Especially for the tasks with higher cyclomatic complexities in the HumanEval and MBPP, the mutation score of test cases is lower. (2) No significant correlation exists between test adequacy (statement coverage, branch coverage and mutation score) of benchmarks and evaluating results on LLMs at the individual task level. (3) There is a significant positive correlation between mutation score-based evaluation and another execution-based evaluation metric (<span></span><math>\n <semantics>\n <mrow>\n <mi>A</mi>\n <mi>v</mi>\n <mi>g</mi>\n <mi>P</mi>\n <mi>a</mi>\n <mi>s</mi>\n <mi>s</mi>\n <mi>R</mi>\n <mi>a</mi>\n <mi>t</mi>\n <mi>i</mi>\n <mi>o</mi>\n </mrow>\n <annotation>$$ AvgPassRatio $$</annotation>\n </semantics></math>) on LLMs at the individual task level. (4) The existing test case augmentation techniques have limited improvement in the coverage of test cases in the benchmark, while significantly improving the mutation score by approximately 34.60% and also can bring a more rigorous evaluation to LLMs on code generation. (5) The LLM-based test case generation technique (EvalPlus) performs better than the traditional search-based technique (Pynguin) in improving the benchmarks' test quality and evaluation ability of code generation.</p>\n </div>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 7","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Software-Evolution and Process","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/smr.70034","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Abstract
Code generation from users' intent has become increasingly prevalent with the rise of large language models (LLMs). To automatically evaluate the effectiveness of these models, multiple execution-based benchmarks have been proposed, each consisting of specially crafted tasks accompanied by test cases and a ground-truth solution. LLMs are regarded as performing well on code generation if they pass the test cases for most tasks in these benchmarks. However, it is unknown whether these test cases have sufficient test adequacy and whether test adequacy affects the evaluation. In this paper, we conduct an empirical study to evaluate the test adequacy of execution-based benchmarks and to explore its effects on the evaluation of LLMs. Based on the evaluation of the widely used benchmarks HumanEval and MBPP and two enhanced benchmarks, HumanEval+ and MBPP+, we obtained the following results: (1) All the evaluated benchmarks have high statement coverage (above 99.16%) but low branch coverage (74.39%) and low mutation score (87.69%). In particular, for tasks with higher cyclomatic complexity in HumanEval and MBPP, the mutation score of the test cases is lower. (2) No significant correlation exists between the test adequacy of benchmarks (statement coverage, branch coverage, and mutation score) and the evaluation results on LLMs at the individual task level. (3) There is a significant positive correlation between mutation score-based evaluation and another execution-based evaluation metric (AvgPassRatio) on LLMs at the individual task level. (4) Existing test case augmentation techniques provide limited improvement in the coverage of benchmark test cases, while significantly improving the mutation score by approximately 34.60%, and can also bring a more rigorous evaluation of LLMs on code generation. (5) The LLM-based test case generation technique (EvalPlus) outperforms the traditional search-based technique (Pynguin) in improving the benchmarks' test quality and their ability to evaluate code generation.
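
To make the two metrics at the heart of these findings concrete, below is a minimal Python sketch of how a mutation score and an AvgPassRatio might be computed for a single benchmark task. The `Task` structure, function names, and toy example are hypothetical illustrations and not the paper's actual tooling; AvgPassRatio is taken here in its common reading as the mean fraction of a task's test cases that each generated solution passes, and mutation score as the fraction of mutants killed by at least one test case.

```python
# Minimal sketch of the two metrics discussed above (hypothetical structures;
# a real pipeline would run mutants and candidate solutions in a sandboxed
# harness rather than calling them directly).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    tests: List[Callable[[Callable], bool]]  # each test takes a solution, returns pass/fail
    mutants: List[Callable]                  # mutated versions of the ground-truth solution


def mutation_score(task: Task) -> float:
    """Fraction of mutants 'killed', i.e., failing at least one test case."""
    if not task.mutants:
        return 1.0
    killed = sum(
        any(not test(mutant) for test in task.tests)
        for mutant in task.mutants
    )
    return killed / len(task.mutants)


def avg_pass_ratio(task: Task, candidates: List[Callable]) -> float:
    """AvgPassRatio: mean fraction of test cases each generated candidate passes."""
    if not candidates or not task.tests:
        return 0.0
    ratios = [
        sum(test(cand) for test in task.tests) / len(task.tests)
        for cand in candidates
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Toy task: absolute value, with a weak test suite that never exercises x < 0.
    def ref_abs(x): return x if x >= 0 else -x
    def mutant_identity(x): return x  # wrong for negative inputs

    weak_tests = [lambda f: f(3) == 3, lambda f: f(0) == 0]
    task = Task(tests=weak_tests, mutants=[mutant_identity])

    print(mutation_score(task))                              # 0.0 -> mutant survives
    print(avg_pass_ratio(task, [ref_abs, mutant_identity]))  # 1.0 -> both candidates "pass"
```

The toy example mirrors the paper's central concern: a weak test suite lets an incorrect candidate reach a perfect pass ratio even though the mutation score stays at zero, which is why adequacy metrics beyond plain pass rates matter when benchmarking LLM-generated code.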