{"title":"Evaluating the Test Adequacy of Benchmarks for LLMs on Code Generation","authors":"Xiangyue Liu, Xiaobing Sun, Lili Bo, Yufei Hu, Xinwei Liu, Zhenlei Ye","doi":"10.1002/smr.70034","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>Code generation for users' intent has become increasingly prevalent with the large language models (LLMs). To automatically evaluate the effectiveness of these models, multiple execution-based benchmarks are proposed, including specially crafted tasks, accompanied by some test cases and a ground truth solution. LLMs are regarded as well-performed in code generation tasks if they can pass the test cases corresponding to most tasks in these benchmarks. However, it is unknown whether the test cases have sufficient test adequacy and whether the test adequacy can affect the evaluation. In this paper, we conducted an empirical study to evaluate the test adequacy of the execution-based benchmarks and to explore their effects during evaluation for LLMs. Based on the evaluation of the widely used benchmarks, HumanEval, MBPP, and two enhanced benchmarks HumanEval+ and MBPP+, we obtained the following results: (1) All the evaluated benchmarks have high statement coverage (above 99.16%), low branch coverage (74.39%) and low mutation score (87.69%). Especially for the tasks with higher cyclomatic complexities in the HumanEval and MBPP, the mutation score of test cases is lower. (2) No significant correlation exists between test adequacy (statement coverage, branch coverage and mutation score) of benchmarks and evaluating results on LLMs at the individual task level. (3) There is a significant positive correlation between mutation score-based evaluation and another execution-based evaluation metric (<span></span><math>\n <semantics>\n <mrow>\n <mi>A</mi>\n <mi>v</mi>\n <mi>g</mi>\n <mi>P</mi>\n <mi>a</mi>\n <mi>s</mi>\n <mi>s</mi>\n <mi>R</mi>\n <mi>a</mi>\n <mi>t</mi>\n <mi>i</mi>\n <mi>o</mi>\n </mrow>\n <annotation>$$ AvgPassRatio $$</annotation>\n </semantics></math>) on LLMs at the individual task level. (4) The existing test case augmentation techniques have limited improvement in the coverage of test cases in the benchmark, while significantly improving the mutation score by approximately 34.60% and also can bring a more rigorous evaluation to LLMs on code generation. (5) The LLM-based test case generation technique (EvalPlus) performs better than the traditional search-based technique (Pynguin) in improving the benchmarks' test quality and evaluation ability of code generation.</p>\n </div>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 7","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Software-Evolution and Process","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/smr.70034","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Abstract
Code generation from users' intent has become increasingly prevalent with the rise of large language models (LLMs). To automatically evaluate the effectiveness of these models, multiple execution-based benchmarks have been proposed, each consisting of specially crafted tasks accompanied by test cases and a ground-truth solution. LLMs are regarded as performing well on code generation if they pass the test cases for most tasks in these benchmarks. However, it is unknown whether these test cases have sufficient test adequacy and whether test adequacy affects the evaluation. In this paper, we conduct an empirical study to evaluate the test adequacy of execution-based benchmarks and to explore its effects on the evaluation of LLMs. Based on the evaluation of the widely used benchmarks HumanEval and MBPP and two enhanced benchmarks, HumanEval+ and MBPP+, we obtained the following results: (1) All the evaluated benchmarks have high statement coverage (above 99.16%) but low branch coverage (74.39%) and low mutation score (87.69%). In particular, for tasks with higher cyclomatic complexity in HumanEval and MBPP, the mutation score of the test cases is lower. (2) No significant correlation exists between the test adequacy of benchmarks (statement coverage, branch coverage, and mutation score) and the evaluation results on LLMs at the individual task level. (3) There is a significant positive correlation between mutation score-based evaluation and another execution-based evaluation metric (AvgPassRatio) on LLMs at the individual task level. (4) Existing test case augmentation techniques provide limited improvement in the coverage of benchmark test cases, while significantly improving the mutation score by approximately 34.60%, and can also bring a more rigorous evaluation of LLMs on code generation. (5) The LLM-based test case generation technique (EvalPlus) outperforms the traditional search-based technique (Pynguin) in improving the benchmarks' test quality and their ability to evaluate code generation.
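
To make the two metrics at the heart of these findings concrete, below is a minimal Python sketch of how a mutation score and an AvgPassRatio might be computed for a single benchmark task. The `Task` structure, function names, and toy example are hypothetical illustrations and not the paper's actual tooling; AvgPassRatio is taken here in its common reading as the mean fraction of a task's test cases that each generated solution passes, and mutation score as the fraction of mutants killed by at least one test case.

```python
# Minimal sketch of the two metrics discussed above (hypothetical structures;
# a real pipeline would run mutants and candidate solutions in a sandboxed
# harness rather than calling them directly).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    tests: List[Callable[[Callable], bool]]  # each test takes a solution, returns pass/fail
    mutants: List[Callable]                  # mutated versions of the ground-truth solution


def mutation_score(task: Task) -> float:
    """Fraction of mutants 'killed', i.e., failing at least one test case."""
    if not task.mutants:
        return 1.0
    killed = sum(
        any(not test(mutant) for test in task.tests)
        for mutant in task.mutants
    )
    return killed / len(task.mutants)


def avg_pass_ratio(task: Task, candidates: List[Callable]) -> float:
    """AvgPassRatio: mean fraction of test cases each generated candidate passes."""
    if not candidates or not task.tests:
        return 0.0
    ratios = [
        sum(test(cand) for test in task.tests) / len(task.tests)
        for cand in candidates
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Toy task: absolute value, with a weak test suite that never exercises x < 0.
    def ref_abs(x): return x if x >= 0 else -x
    def mutant_identity(x): return x  # wrong for negative inputs

    weak_tests = [lambda f: f(3) == 3, lambda f: f(0) == 0]
    task = Task(tests=weak_tests, mutants=[mutant_identity])

    print(mutation_score(task))                              # 0.0 -> mutant survives
    print(avg_pass_ratio(task, [ref_abs, mutant_identity]))  # 1.0 -> both candidates "pass"
```

The toy example mirrors the paper's central concern: a weak test suite lets an incorrect candidate reach a perfect pass ratio even though the mutation score stays at zero, which is why adequacy metrics beyond plain pass rates matter when benchmarking LLM-generated code.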