arXiv - CS - Software Engineering: Latest Articles

Python Symbolic Execution with LLM-powered Code Generation
arXiv - CS - Software Engineering Pub Date: 2024-09-14 DOI: arxiv-2409.09271
Wenhan Wang, Kaibo Liu, An Ran Chen, Ge Li, Zhi Jin, Gang Huang, Lei Ma
{"title":"Python Symbolic Execution with LLM-powered Code Generation","authors":"Wenhan Wang, Kaibo Liu, An Ran Chen, Ge Li, Zhi Jin, Gang Huang, Lei Ma","doi":"arxiv-2409.09271","DOIUrl":"https://doi.org/arxiv-2409.09271","url":null,"abstract":"Symbolic execution is a key technology in software testing, which generates\u0000test cases by collecting symbolic path constraints and then solving constraints\u0000with SMT solvers. Symbolic execution has been proven helpful in generating\u0000high-coverage test cases, but its limitations, e.g., the difficulties in\u0000solving path constraints, prevent it from broader usage in software testing.\u0000Moreover, symbolic execution has encountered many difficulties when applied to\u0000dynamically typed languages like Python, because it is extremely challenging to\u0000translate the flexible Python grammar into rigid solvers. To overcome the main challenges of applying symbolic execution in Python, we\u0000proposed an LLM-empowered agent, LLM-Sym, that automatically calls an SMT\u0000solver, Z3, to solve execution path constraints. Based on an introductory-level\u0000symbolic execution engine, our LLM agent can extend it to supporting programs\u0000with complex data type `list'. The core contribution of LLM-Sym is translating\u0000complex Python path constraints into Z3 code. To enable accurate path-to-Z3\u0000translation, we design a multiple-step code generation pipeline including type\u0000inference, retrieval and self-refine. Our experiments demonstrate that LLM-Sym\u0000is capable of solving path constraints on Leetcode problems with complicated\u0000control flows and list data structures, which is impossible for the backbone\u0000symbolic execution engine. Our approach paves the way for the combination of\u0000the generation ability of LLMs with the reasoning ability of symbolic solvers,\u0000and opens up new opportunities in LLM-augmented test case generation.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"194 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
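Editorial note: the pipeline's central step, rendering a Python path constraint as Z3 code, is easy to picture with Z3's Python bindings. Below is a minimal hand-written sketch of the kind of solver code such a translation would produce for a hypothetical list-indexing branch; it is illustrative only, not LLM-Sym's actual output.

```python
# Minimal sketch: a Python path constraint rendered as Z3 (z3-solver package).
# Hypothetical path: taken when xs[0] + xs[1] > 10 and xs[0] < xs[1].
from z3 import Ints, Solver, sat

x0, x1 = Ints("x0 x1")  # symbolic stand-ins for list elements xs[0], xs[1]

s = Solver()
s.add(x0 + x1 > 10)  # first branch condition on the path
s.add(x0 < x1)       # second branch condition on the path

if s.check() == sat:
    m = s.model()
    # A concrete input [xs[0], xs[1]] that drives execution down this path
    print([m[x0].as_long(), m[x1].as_long()])
```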
Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs
arXiv - CS - Software Engineering Pub Date: 2024-09-14 DOI: arxiv-2409.09368
Jian Zhao, Shenao Wang, Yanjie Zhao, Xinyi Hou, Kailong Wang, Peiming Gao, Yuanchao Zhang, Chen Wei, Haoyu Wang
{"title":"Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs","authors":"Jian Zhao, Shenao Wang, Yanjie Zhao, Xinyi Hou, Kailong Wang, Peiming Gao, Yuanchao Zhang, Chen Wei, Haoyu Wang","doi":"arxiv-2409.09368","DOIUrl":"https://doi.org/arxiv-2409.09368","url":null,"abstract":"The proliferation of pre-trained models (PTMs) and datasets has led to the\u0000emergence of centralized model hubs like Hugging Face, which facilitate\u0000collaborative development and reuse. However, recent security reports have\u0000uncovered vulnerabilities and instances of malicious attacks within these\u0000platforms, highlighting growing security concerns. This paper presents the\u0000first systematic study of malicious code poisoning attacks on pre-trained model\u0000hubs, focusing on the Hugging Face platform. We conduct a comprehensive threat\u0000analysis, develop a taxonomy of model formats, and perform root cause analysis\u0000of vulnerable formats. While existing tools like Fickling and ModelScan offer\u0000some protection, they face limitations in semantic-level analysis and\u0000comprehensive threat detection. To address these challenges, we propose MalHug,\u0000an end-to-end pipeline tailored for Hugging Face that combines dataset loading\u0000script extraction, model deserialization, in-depth taint analysis, and\u0000heuristic pattern matching to detect and classify malicious code poisoning\u0000attacks in datasets and models. In collaboration with Ant Group, a leading\u0000financial technology company, we have implemented and deployed MalHug on a\u0000mirrored Hugging Face instance within their infrastructure, where it has been\u0000operational for over three months. During this period, MalHug has monitored\u0000more than 705K models and 176K datasets, uncovering 91 malicious models and 9\u0000malicious dataset loading scripts. These findings reveal a range of security\u0000threats, including reverse shell, browser credential theft, and system\u0000reconnaissance. This work not only bridges a critical gap in understanding the\u0000security of the PTM supply chain but also provides a practical, industry-tested\u0000solution for enhancing the security of pre-trained model hubs.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
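Editorial note: much of the model-format risk surveyed here stems from Python pickle deserialization, where merely loading a model can execute attacker-chosen code. As a point of reference, the sketch below screens a pickle's opcode stream for dangerous imports with the standard pickletools module; the denylist and file path are illustrative assumptions, and MalHug's actual pipeline performs much deeper taint analysis.

```python
# Naive pickle screening: flag GLOBAL opcodes that import dangerous callables.
# A toy version of static model scanning; real tools go much further.
import pickletools

SUSPICIOUS = {"os system", "posix system", "subprocess Popen",
              "builtins exec", "builtins eval"}  # illustrative denylist

def scan_pickle(path: str) -> list[str]:
    """Return suspicious global references found in a pickle stream."""
    findings = []
    with open(path, "rb") as f:
        for opcode, arg, _pos in pickletools.genops(f):
            # GLOBAL carries "module name" as its decoded argument; STACK_GLOBAL
            # takes operands from the stack and would need stack simulation.
            if opcode.name == "GLOBAL" and str(arg) in SUSPICIOUS:
                findings.append(str(arg))
    return findings

print(scan_pickle("suspect_model.pkl"))  # hypothetical path
```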
Generating API Parameter Security Rules with LLM for API Misuse Detection
arXiv - CS - Software Engineering Pub Date: 2024-09-14 DOI: arxiv-2409.09288
Jinghua Liu, Yi Yang, Kai Chen, Miaoqian Lin
{"title":"Generating API Parameter Security Rules with LLM for API Misuse Detection","authors":"Jinghua Liu, Yi Yang, Kai Chen, Miaoqian Lin","doi":"arxiv-2409.09288","DOIUrl":"https://doi.org/arxiv-2409.09288","url":null,"abstract":"In this paper, we present a new framework, named GPTAid, for automatic APSRs\u0000generation by analyzing API source code with LLM and detecting API misuse\u0000caused by incorrect parameter use. To validate the correctness of the\u0000LLM-generated APSRs, we propose an execution feedback-checking approach based\u0000on the observation that security-critical API misuse is often caused by APSRs\u0000violations, and most of them result in runtime errors. Specifically, GPTAid\u0000first uses LLM to generate raw APSRs and the Right calling code, and then\u0000generates Violation code for each raw APSR by modifying the Right calling code\u0000using LLM. Subsequently, GPTAid performs dynamic execution on each piece of\u0000Violation code and further filters out the incorrect APSRs based on runtime\u0000errors. To further generate concrete APSRs, GPTAid employs a code differential\u0000analysis to refine the filtered ones. Particularly, as the programming language\u0000is more precise than natural language, GPTAid identifies the key operations\u0000within Violation code by differential analysis, and then generates the\u0000corresponding concrete APSR based on the aforementioned operations. These\u0000concrete APSRs could be precisely interpreted into applicable detection code,\u0000which proven to be effective in API misuse detection. Implementing on the\u0000dataset containing 200 randomly selected APIs from eight popular libraries,\u0000GPTAid achieves a precision of 92.3%. Moreover, it generates 6 times more APSRs\u0000than state-of-the-art detectors on a comparison dataset of previously reported\u0000bugs and APSRs. We further evaluated GPTAid on 47 applications, 210 unknown\u0000security bugs were found potentially resulting in severe security issues (e.g.,\u0000system crashes), 150 of which have been confirmed by developers after our\u0000reports.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
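Editorial note: the execution-feedback check rests on a simple observation: if a candidate rule is real, code violating it should fail at runtime. A minimal sketch of that filtering loop follows; the harness and the rule/snippet pair are illustrative stand-ins, not GPTAid's implementation.

```python
# Execution-feedback filtering (illustrative): keep a candidate rule only if
# the generated code violating it actually fails at runtime.
import os, subprocess, sys, tempfile

def violation_crashes(snippet: str, timeout: float = 5.0) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(snippet)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode != 0  # runtime error supports the rule
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

candidates = {  # rule -> violation snippet; both are hypothetical LLM outputs
    "struct.pack arguments must match the format string":
        "import struct\nstruct.pack('i', 'not-an-int')",
}
kept = {rule for rule, code in candidates.items() if violation_crashes(code)}
print(kept)
```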
Agents in Software Engineering: Survey, Landscape, and Vision
arXiv - CS - Software Engineering Pub Date: 2024-09-13 DOI: arxiv-2409.09030
Yanxian Huang, Wanjun Zhong, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng, Yanlin Wang
{"title":"Agents in Software Engineering: Survey, Landscape, and Vision","authors":"Yanxian Huang, Wanjun Zhong, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng, Yanlin Wang","doi":"arxiv-2409.09030","DOIUrl":"https://doi.org/arxiv-2409.09030","url":null,"abstract":"In recent years, Large Language Models (LLMs) have achieved remarkable\u0000success and have been widely used in various downstream tasks, especially in\u0000the tasks of the software engineering (SE) field. We find that many studies\u0000combining LLMs with SE have employed the concept of agents either explicitly or\u0000implicitly. However, there is a lack of an in-depth survey to sort out the\u0000development context of existing works, analyze how existing works combine the\u0000LLM-based agent technologies to optimize various tasks, and clarify the\u0000framework of LLM-based agents in SE. In this paper, we conduct the first survey\u0000of the studies on combining LLM-based agents with SE and present a framework of\u0000LLM-based agents in SE which includes three key modules: perception, memory,\u0000and action. We also summarize the current challenges in combining the two\u0000fields and propose future opportunities in response to existing challenges. We\u0000maintain a GitHub repository of the related papers at:\u0000https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
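Editorial note: the survey's three-module framework (perception, memory, action) maps naturally onto a small interface. The skeleton below is an illustrative rendering of that framing, with an assumed llm-returns-a-tool-name interface; it is not code from the survey or any surveyed system.

```python
# Illustrative skeleton of the perception-memory-action framing of LLM-based
# agents in SE. All interfaces here are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class Memory:
    episodes: list[str] = field(default_factory=list)  # past task/result traces

    def recall(self, k: int = 3) -> list[str]:
        return self.episodes[-k:]

class SEAgent:
    def __init__(self, llm, tools):
        self.llm = llm      # callable: prompt -> tool name (assumed interface)
        self.tools = tools  # name -> callable, e.g. {"run_tests": ..., "edit": ...}
        self.memory = Memory()

    def perceive(self, repo_state: str, task: str) -> str:
        # Perception: fold raw inputs (code, issues, logs) and recalled
        # history into a model-readable prompt.
        return f"Task: {task}\nContext: {repo_state}\nHistory: {self.memory.recall()}"

    def act(self, repo_state: str, task: str) -> str:
        tool_name = self.llm(self.perceive(repo_state, task))
        result = self.tools[tool_name](task)  # Action: invoke the chosen tool
        self.memory.episodes.append(f"{task} -> {result}")
        return result
```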
An Empirical Analysis of Git Commit Logs for Potential Inconsistency in Code Clones
arXiv - CS - Software Engineering Pub Date: 2024-09-13 DOI: arxiv-2409.08555
Reishi Yokomori, Katsuro Inoue
{"title":"An Empirical Analysis of Git Commit Logs for Potential Inconsistency in Code Clones","authors":"Reishi Yokomori, Katsuro Inoue","doi":"arxiv-2409.08555","DOIUrl":"https://doi.org/arxiv-2409.08555","url":null,"abstract":"Code clones are code snippets that are identical or similar to other snippets\u0000within the same or different files. They are often created through\u0000copy-and-paste practices and modified during development and maintenance\u0000activities. Since a pair of code clones, known as a clone pair, has a possible\u0000logical coupling between them, it is expected that changes to each snippet are\u0000made simultaneously (co-changed) and consistently. There is extensive research\u0000on code clones, including studies related to the co-change of clones; however,\u0000detailed analysis of commit logs for code clone pairs has been limited. In this paper, we investigate the commit logs of code snippets from clone\u0000pairs, using the git-log command to extract changes to cloned code snippets. We\u0000analyzed 45 repositories owned by the Apache Software Foundation on GitHub and\u0000addressed three research questions regarding commit frequency, co-change ratio,\u0000and commit patterns. Our findings indicate that (1) on average, clone snippets\u0000are changed infrequently, typically only two or three times throughout their\u0000lifetime, (2) the ratio of co-changes is about half of all clone changes, with\u000010-20% of co-changed commits being concerning (potentially inconsistent), and\u0000(3) 35-65% of all clone pairs being classified as concerning clone pairs\u0000(potentially inconsistent clone pairs). These results suggest the need for a\u0000consistent management system through the commit timeline of clones.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"85 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
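Editorial note: the git-log-based extraction described above can be approximated with git's built-in line-range tracing. A minimal sketch, assuming a local clone and known snippet line ranges (repository path, file paths, and ranges are placeholders):

```python
# Trace a clone snippet's change history with `git log -L`, then estimate the
# co-change ratio of a clone pair.
import re
import subprocess

HASH = re.compile(r"^[0-9a-f]{40}$")

def snippet_commits(repo: str, path: str, start: int, end: int) -> set[str]:
    out = subprocess.run(
        ["git", "-C", repo, "log", f"-L{start},{end}:{path}", "--format=%H"],
        capture_output=True, text=True, check=True).stdout
    # -L emits patches too, so keep only the bare commit-hash lines.
    return {line for line in out.splitlines() if HASH.match(line)}

a = snippet_commits("commons-io", "src/main/java/IOUtils.java", 120, 160)
b = snippet_commits("commons-io", "src/main/java/FileUtils.java", 300, 340)
print(len(a & b) / max(len(a | b), 1))  # rough co-change ratio for the pair
```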
Learning Graph-based Patch Representations for Identifying and Assessing Silent Vulnerability Fixes
arXiv - CS - Software Engineering Pub Date: 2024-09-13 DOI: arxiv-2409.08512
Mei Han, Lulu Wang, Jianming Chang, Bixin Li, Chunguang Zhang
{"title":"Learning Graph-based Patch Representations for Identifying and Assessing Silent Vulnerability Fixes","authors":"Mei Han, Lulu Wang, Jianming Chang, Bixin Li, Chunguang Zhang","doi":"arxiv-2409.08512","DOIUrl":"https://doi.org/arxiv-2409.08512","url":null,"abstract":"Software projects are dependent on many third-party libraries, therefore\u0000high-risk vulnerabilities can propagate through the dependency chain to\u0000downstream projects. Owing to the subjective nature of patch management,\u0000software vendors commonly fix vulnerabilities silently. Silent vulnerability\u0000fixes cause downstream software to be unaware of urgent security issues in a\u0000timely manner, posing a security risk to the software. Presently, most of the\u0000existing works for vulnerability fix identification only consider the changed\u0000code as a sequential textual sequence, ignoring the structural information of\u0000the code. In this paper, we propose GRAPE, a GRAph-based Patch rEpresentation\u0000that aims to 1) provide a unified framework for getting vulnerability fix\u0000patches representation; and 2) enhance the understanding of the intent and\u0000potential impact of patches by extracting structural information of the code.\u0000GRAPE employs a novel joint graph structure (MCPG) to represent the syntactic\u0000and semantic information of fix patches and embeds both nodes and edges.\u0000Subsequently, a carefully designed graph convolutional neural network (NE-GCN)\u0000is utilized to fully learn structural features by leveraging the attributes of\u0000the nodes and edges. Moreover, we construct a dataset containing 2251 silent\u0000fixes. For the experimental section, we evaluated patch representation on three\u0000tasks, including vulnerability fix identification, vulnerability types\u0000classification, and vulnerability severity classification. Experimental results\u0000indicate that, in comparison to baseline methods, GRAPE can more effectively\u0000reduce false positives and omissions of vulnerability fixes identification and\u0000provide accurate vulnerability assessments.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
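Editorial note: GRAPE's encoder is a graph convolutional network over patch graphs. For orientation, a single standard GCN layer (Kipf-and-Welling style) fits in a few lines of numpy; the paper's NE-GCN additionally embeds edge attributes, which this minimal layer omits.

```python
# One standard GCN layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).
# A reference point for graph-based patch encoders, not the paper's NE-GCN.
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Tiny patch graph: 3 nodes (e.g., AST statements), 4-dim node features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.random.randn(3, 4)
W = np.random.randn(4, 8)
print(gcn_layer(A, H, W).shape)  # (3, 8)
```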
Are Existing Road Design Guidelines Suitable for Autonomous Vehicles?
arXiv - CS - Software Engineering Pub Date: 2024-09-13 DOI: arxiv-2409.10562
Yang Sun, Christopher M. Poskitt, Jun Sun
{"title":"Are Existing Road Design Guidelines Suitable for Autonomous Vehicles?","authors":"Yang Sun, Christopher M. Poskitt, Jun Sun","doi":"arxiv-2409.10562","DOIUrl":"https://doi.org/arxiv-2409.10562","url":null,"abstract":"The emergence of Autonomous Vehicles (AVs) has spurred research into testing\u0000the resilience of their perception systems, i.e. to ensure they are not\u0000susceptible to making critical misjudgements. It is important that they are\u0000tested not only with respect to other vehicles on the road, but also those\u0000objects placed on the roadside. Trash bins, billboards, and greenery are all\u0000examples of such objects, typically placed according to guidelines that were\u0000developed for the human visual system, and which may not align perfectly with\u0000the needs of AVs. Existing tests, however, usually focus on adversarial objects\u0000with conspicuous shapes/patches, that are ultimately unrealistic given their\u0000unnatural appearances and the need for white box knowledge. In this work, we\u0000introduce a black box attack on the perception systems of AVs, in which the\u0000objective is to create realistic adversarial scenarios (i.e. satisfying road\u0000design guidelines) by manipulating the positions of common roadside objects,\u0000and without resorting to `unnatural' adversarial patches. In particular, we\u0000propose TrashFuzz , a fuzzing algorithm to find scenarios in which the\u0000placement of these objects leads to substantial misperceptions by the AV --\u0000such as mistaking a traffic light's colour -- with overall the goal of causing\u0000it to violate traffic laws. To ensure the realism of these scenarios, they must\u0000satisfy several rules encoding regulatory guidelines about the placement of\u0000objects on public streets. We implemented and evaluated these attacks for the\u0000Apollo, finding that TrashFuzz induced it into violating 15 out of 24 different\u0000traffic laws.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"85 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
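Editorial note: at heart, the scenario search is a guideline-constrained fuzz loop over roadside-object placements. The toy random search below illustrates that shape only; the guideline bound and the scoring stub are assumptions, and TrashFuzz's actual algorithm and simulator integration are not reproduced here.

```python
# Toy guideline-constrained placement search (random search, not TrashFuzz).
import random

OBJECTS = ["trash_bin", "billboard", "shrub"]
LATERAL = (0.5, 3.0)  # assumed guideline: allowed lateral offset from lane (m)

def random_scenario(n: int = 3):
    return [(random.choice(OBJECTS),
             random.uniform(*LATERAL),    # lateral offset, guideline-bounded
             random.uniform(0.0, 100.0))  # longitudinal position along road
            for _ in range(n)]

def misperception_score(scenario) -> float:
    # Stub: in practice this would run the AV stack in simulation and measure
    # perception error (e.g., a misread traffic-light colour).
    return random.random()

best, best_score = None, -1.0
for _ in range(1000):  # fuzzing budget
    scenario = random_scenario()
    score = misperception_score(scenario)
    if score > best_score:
        best, best_score = scenario, score
print(best_score, best)
```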
Towards Modified Condition/Decision Coverage of Rust
arXiv - CS - Software Engineering Pub Date: 2024-09-13 DOI: arxiv-2409.08708
Wanja Zaeske, Pietro Albini, Florian Gilcher, Umut Durak
{"title":"Towards Modified Condition/Decision Coverage of Rust","authors":"Wanja Zaeske, Pietro Albini, Florian Gilcher, Umut Durak","doi":"arxiv-2409.08708","DOIUrl":"https://doi.org/arxiv-2409.08708","url":null,"abstract":"Testing is an essential tool to assure software, especially so in\u0000safety-critical applications. To quantify how thoroughly a software item has\u0000been tested, a test coverage metric is required. Maybe the strictest such\u0000metric known in the safety critical systems is Modified Condition/Decision\u0000Coverage (MC/DC), which DO-178C prescribes for the highest software assurance\u0000level in aviation. In the past, ambiguities in the interpretation of MC/DC have\u0000been resolved already, i. e. in CAST-10. However, some central features of the\u0000Rust programming language necessitate further clarification. This work\u0000investigates aforementioned features, in particular pattern matching, providing\u0000a consistent view on how to apply MC/DC to Rust. Hence, this paper informs the\u0000implementation of Rust MC/DC tools, paving the road towards Rust in\u0000high-assurance applications.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
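Editorial note: MC/DC requires each condition to independently affect the decision's outcome, i.e., for every condition there must be a pair of tests that toggles only that condition and flips the decision. A small brute-force enumeration makes the criterion concrete; this is a generic illustration of MC/DC itself, not the paper's treatment of Rust pattern matching.

```python
# Brute-force MC/DC independence pairs for the decision: a and (b or c).
from itertools import product

def decision(a, b, c):
    return a and (b or c)

conds = ["a", "b", "c"]
vectors = list(product([False, True], repeat=3))

for i, name in enumerate(conds):
    pairs = [(u, v) for u in vectors for v in vectors
             if u[i] != v[i]                                   # only cond i toggles
             and all(u[j] == v[j] for j in range(3) if j != i)
             and decision(*u) != decision(*v)]                 # outcome flips
    print(name, pairs[0])  # one independence pair per condition suffices
```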
Diagnosis via Proofs of Unsatisfiability for First-Order Logic with Relational Objects
arXiv - CS - Software Engineering Pub Date: 2024-09-13 DOI: arxiv-2409.09223
Nick Feng, Lina Marsso, Marsha Chechik
{"title":"Diagnosis via Proofs of Unsatisfiability for First-Order Logic with Relational Objects","authors":"Nick Feng, Lina Marsso, Marsha Chechik","doi":"arxiv-2409.09223","DOIUrl":"https://doi.org/arxiv-2409.09223","url":null,"abstract":"Satisfiability-based automated reasoning is an approach that is being\u0000successfully used in software engineering to validate complex software,\u0000including for safety-critical systems. Such reasoning underlies many validation\u0000activities, from requirements analysis to design consistency to test coverage.\u0000While generally effective, the back-end constraint solvers are often complex\u0000and inevitably error-prone, which threatens the soundness of their application.\u0000Thus, such solvers need to be validated, which includes checking correctness\u0000and explaining (un)satisfiability results returned by them. In this work, we\u0000consider satisfiability analysis based on First-Order Logic with relational\u0000objects (FOL*) which has been shown to be effective for reasoning about time-\u0000and data-sensitive early system designs. We tackle the challenge of validating\u0000the correctness of FOL* unsatisfiability results and deriving diagnoses to\u0000explain the causes of the unsatisfiability. Inspired by the concept of proofs\u0000of UNSAT from SAT/SMT solvers, we define a proof format and proof rules to\u0000track the solvers' reasoning steps as sequences of derivations towards UNSAT.\u0000We also propose an algorithm to verify the correctness of FOL* proofs while\u0000filtering unnecessary derivations and develop a proof-based diagnosis to\u0000explain the cause of unsatisfiability. We implemented the proposed proof\u0000support on top of the state-of-the-art FOL* satisfiability checker to generate\u0000proofs of UNSAT and validated our approach by applying the proof-based\u0000diagnoses to explain the causes of well-formedness issues of normative\u0000requirements of software systems.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
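Editorial note: the proof-based diagnosis parallels how SMT solvers explain unsatisfiability. As an accessible analogue (plain Z3 rather than the authors' FOL* checker), the snippet below tracks labelled assertions and extracts an unsat core naming the conflicting requirements:

```python
# Explaining UNSAT via tracked assertions and an unsat core (a Z3 analogue
# of proof-based diagnosis; not the authors' FOL* tool).
from z3 import Int, Solver, unsat

x = Int("x")
s = Solver()
s.set(unsat_core=True)
s.assert_and_track(x > 10, "req_lower_bound")
s.assert_and_track(x < 5,  "req_upper_bound")
s.assert_and_track(x != 0, "req_nonzero")

if s.check() == unsat:
    # The core names the minimal conflicting requirements:
    # [req_lower_bound, req_upper_bound]; req_nonzero is irrelevant.
    print(s.unsat_core())
```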
B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests
arXiv - CS - Software Engineering Pub Date: 2024-09-13 DOI: arxiv-2409.08692
Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, Jianling Sun
{"title":"B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests","authors":"Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, Jianling Sun","doi":"arxiv-2409.08692","DOIUrl":"https://doi.org/arxiv-2409.08692","url":null,"abstract":"Selecting the best code solution from multiple generated ones is an essential\u0000task in code generation, which can be achieved by using some reliable\u0000validators (e.g., developer-written test cases) for assistance. Since reliable\u0000test cases are not always available and can be expensive to build in practice,\u0000researchers propose to automatically generate test cases to assess code\u0000solutions. However, when both code solutions and test cases are plausible and\u0000not reliable, selecting the best solution becomes challenging. Although some\u0000heuristic strategies have been proposed to tackle this problem, they lack a\u0000strong theoretical guarantee and it is still an open question whether an\u0000optimal selection strategy exists. Our work contributes in two ways. First, we\u0000show that within a Bayesian framework, the optimal selection strategy can be\u0000defined based on the posterior probability of the observed passing states\u0000between solutions and tests. The problem of identifying the best solution is\u0000then framed as an integer programming problem. Second, we propose an efficient\u0000approach for approximating this optimal (yet uncomputable) strategy, where the\u0000approximation error is bounded by the correctness of prior knowledge. We then\u0000incorporate effective prior knowledge to tailor code generation tasks. Both\u0000theoretical and empirical studies confirm that existing heuristics are limited\u0000in selecting the best solutions with plausible test cases. Our proposed\u0000approximated optimal strategy B4 significantly surpasses existing heuristics in\u0000selecting code solutions generated by large language models (LLMs) with\u0000LLM-generated tests, achieving a relative performance improvement by up to 50%\u0000over the strongest heuristic and 246% over the random selection in the most\u0000challenging scenarios. Our code is publicly available at\u0000https://github.com/ZJU-CTAG/B4.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
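Editorial note: to ground the selection problem, consider the observable data: a pass matrix between candidate solutions and generated tests. The sketch below scores solutions with a naive agreement-weighted heuristic of the kind the paper shows to be suboptimal; B4's Bayesian strategy and integer-programming formulation are not reproduced here.

```python
# Naive consensus heuristic over a solutions-by-tests pass matrix
# (a baseline in the spirit of prior heuristics, not B4's Bayesian strategy).
import numpy as np

# pass_matrix[i, j] = 1 if solution i passes generated test j (toy data)
pass_matrix = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
])

# Group solutions by identical passing signatures; weight a signature by
# (cluster size) x (tests passed): sample agreement plus test evidence.
sigs = [tuple(row) for row in pass_matrix]
score = {s: sigs.count(s) * sum(s) for s in set(sigs)}
best_sig = max(score, key=score.get)
best = [i for i, s in enumerate(sigs) if s == best_sig]
print(best)  # indices of solutions in the top-ranked cluster -> [0, 1]
```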