Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, Jianling Sun
{"title":"B4:通过可信测试实现对可信代码解决方案的最佳评估","authors":"Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, Jianling Sun","doi":"arxiv-2409.08692","DOIUrl":null,"url":null,"abstract":"Selecting the best code solution from multiple generated ones is an essential\ntask in code generation, which can be achieved by using some reliable\nvalidators (e.g., developer-written test cases) for assistance. Since reliable\ntest cases are not always available and can be expensive to build in practice,\nresearchers propose to automatically generate test cases to assess code\nsolutions. However, when both code solutions and test cases are plausible and\nnot reliable, selecting the best solution becomes challenging. Although some\nheuristic strategies have been proposed to tackle this problem, they lack a\nstrong theoretical guarantee and it is still an open question whether an\noptimal selection strategy exists. Our work contributes in two ways. First, we\nshow that within a Bayesian framework, the optimal selection strategy can be\ndefined based on the posterior probability of the observed passing states\nbetween solutions and tests. The problem of identifying the best solution is\nthen framed as an integer programming problem. Second, we propose an efficient\napproach for approximating this optimal (yet uncomputable) strategy, where the\napproximation error is bounded by the correctness of prior knowledge. We then\nincorporate effective prior knowledge to tailor code generation tasks. Both\ntheoretical and empirical studies confirm that existing heuristics are limited\nin selecting the best solutions with plausible test cases. Our proposed\napproximated optimal strategy B4 significantly surpasses existing heuristics in\nselecting code solutions generated by large language models (LLMs) with\nLLM-generated tests, achieving a relative performance improvement by up to 50%\nover the strongest heuristic and 246% over the random selection in the most\nchallenging scenarios. Our code is publicly available at\nhttps://github.com/ZJU-CTAG/B4.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests\",\"authors\":\"Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, Jianling Sun\",\"doi\":\"arxiv-2409.08692\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Selecting the best code solution from multiple generated ones is an essential\\ntask in code generation, which can be achieved by using some reliable\\nvalidators (e.g., developer-written test cases) for assistance. Since reliable\\ntest cases are not always available and can be expensive to build in practice,\\nresearchers propose to automatically generate test cases to assess code\\nsolutions. However, when both code solutions and test cases are plausible and\\nnot reliable, selecting the best solution becomes challenging. Although some\\nheuristic strategies have been proposed to tackle this problem, they lack a\\nstrong theoretical guarantee and it is still an open question whether an\\noptimal selection strategy exists. Our work contributes in two ways. 
First, we\\nshow that within a Bayesian framework, the optimal selection strategy can be\\ndefined based on the posterior probability of the observed passing states\\nbetween solutions and tests. The problem of identifying the best solution is\\nthen framed as an integer programming problem. Second, we propose an efficient\\napproach for approximating this optimal (yet uncomputable) strategy, where the\\napproximation error is bounded by the correctness of prior knowledge. We then\\nincorporate effective prior knowledge to tailor code generation tasks. Both\\ntheoretical and empirical studies confirm that existing heuristics are limited\\nin selecting the best solutions with plausible test cases. Our proposed\\napproximated optimal strategy B4 significantly surpasses existing heuristics in\\nselecting code solutions generated by large language models (LLMs) with\\nLLM-generated tests, achieving a relative performance improvement by up to 50%\\nover the strongest heuristic and 246% over the random selection in the most\\nchallenging scenarios. Our code is publicly available at\\nhttps://github.com/ZJU-CTAG/B4.\",\"PeriodicalId\":501278,\"journal\":{\"name\":\"arXiv - CS - Software Engineering\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Software Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.08692\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08692","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests
Selecting the best code solution from multiple generated ones is an essential
task in code generation, which can be aided by reliable validators (e.g.,
developer-written test cases). Since reliable test cases are not always
available and can be expensive to build in practice, researchers have proposed
automatically generating test cases to assess code solutions. However, when
both code solutions and test cases are plausible but not reliable, selecting
the best solution becomes challenging. Although some heuristic strategies have
been proposed to tackle this problem, they lack strong theoretical guarantees,
and whether an optimal selection strategy exists remains an open question. Our
work contributes in two ways. First, we
show that within a Bayesian framework, the optimal selection strategy can be
defined based on the posterior probability of the observed passing states
between solutions and tests. The problem of identifying the best solution is
then framed as an integer programming problem. Second, we propose an efficient
approach for approximating this optimal (yet uncomputable) strategy, where the
approximation error is bounded by the correctness of prior knowledge. We then
incorporate effective prior knowledge tailored to code generation tasks. Both
theoretical and empirical studies confirm that existing heuristics are limited
in selecting the best solutions with plausible test cases. Our proposed
approximate optimal strategy, B4, significantly surpasses existing heuristics in
selecting code solutions generated by large language models (LLMs) with
LLM-generated tests, achieving a relative performance improvement of up to 50%
over the strongest heuristic and 246% over random selection in the most
challenging scenarios. Our code is publicly available at
https://github.com/ZJU-CTAG/B4.
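
To make the core idea concrete, below is a minimal, illustrative Python sketch of
scoring candidate solutions by the posterior probability of the observed passing
states between solutions and tests. This is not the paper's B4 algorithm: the
uniform priors (p_sol, p_test), the Bernoulli noise model, and the brute-force
enumeration over correctness assignments are hypothetical simplifications chosen
for readability.

```python
# Illustrative sketch only (not the paper's B4): rank candidate solutions by
# the marginal posterior probability that each is correct, given the observed
# binary passing matrix between solutions and generated tests. The priors and
# noise model below are hypothetical choices for the sake of the example.
from itertools import product

def marginal_posteriors(passes, p_sol=0.5, p_test=0.5, noise=0.05):
    """passes[i][j] is True iff solution i passes test j.

    Enumerates every correctness assignment over solutions and tests
    (exponential cost, so toy sizes only). Under an assignment, a correct
    solution passes a correct test with probability 1 - noise; any other
    (solution, test) pair passes with probability noise.
    """
    n, m = len(passes), len(passes[0])
    scores = [0.0] * n   # unnormalized posterior mass where solution i is correct
    total = 0.0          # normalizing constant: P(observed passing matrix)
    for sols in product([0, 1], repeat=n):
        for tests in product([0, 1], repeat=m):
            # Prior probability of this correctness assignment.
            prob = 1.0
            for s in sols:
                prob *= p_sol if s else 1.0 - p_sol
            for t in tests:
                prob *= p_test if t else 1.0 - p_test
            # Likelihood of the observed passing states under the assignment.
            for i in range(n):
                for j in range(m):
                    p_pass = 1.0 - noise if (sols[i] and tests[j]) else noise
                    prob *= p_pass if passes[i][j] else 1.0 - p_pass
            total += prob
            for i in range(n):
                if sols[i]:
                    scores[i] += prob
    return [s / total for s in scores]

# Toy example: solutions 0 and 1 agree on tests 0-1; solution 2 is an outlier.
passes = [
    [True,  True,  False],
    [True,  True,  False],
    [False, False, True],
]
post = marginal_posteriors(passes)
best = max(range(len(post)), key=lambda i: post[i])
print(post, "-> select solution", best)
```

The enumeration is exponential in the number of solutions and tests, which
illustrates why the exact optimal strategy is uncomputable in practice and why
B4 approximates it, with the approximation error bounded by the correctness of
the prior knowledge.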