{"title":"Context-aware Code Segmentation for C-to-Rust Translation using Large Language Models","authors":"Momoko Shiraishi, Takahiro Shinagawa","doi":"arxiv-2409.10506","DOIUrl":"https://doi.org/arxiv-2409.10506","url":null,"abstract":"There is strong motivation to translate C code into Rust code due to the\u0000continuing threat of memory safety vulnerabilities in existing C programs and\u0000the significant attention paid to Rust as an alternative to the C language.\u0000While large language models (LLMs) show promise for automating this translation\u0000by generating more natural and safer code than rule-based methods, previous\u0000studies have shown that LLM-generated Rust code often fails to compile, even\u0000for relatively small C programs, due to significant differences between the two\u0000languages and context window limitations. We propose an LLM-based translation\u0000scheme that improves the success rate of translating large-scale C code into\u0000compilable Rust code. Our approach involves three key techniques: (1)\u0000pre-processing the C code to better align its structure and expressions with\u0000Rust, (2) segmenting the code into optimally sized translation units to avoid\u0000exceeding the LLM's context window limits, and (3) iteratively compiling and\u0000repairing errors while maintaining consistency between translation units using\u0000context-supplementing prompts. Compilation success is an essential first step\u0000in achieving functional equivalence, as only compilable code can be further\u0000tested. In experiments with 20 benchmark C programs, including those exceeding\u00004 kilo lines of code, we successfully translated all programs into compilable\u0000Rust code without losing corresponding parts of the original code.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NaviQAte: Functionality-Guided Web Application Navigation","authors":"Mobina Shahbandeh, Parsa Alian, Noor Nashid, Ali Mesbah","doi":"arxiv-2409.10741","DOIUrl":"https://doi.org/arxiv-2409.10741","url":null,"abstract":"End-to-end web testing is challenging due to the need to explore diverse web\u0000application functionalities. Current state-of-the-art methods, such as\u0000WebCanvas, are not designed for broad functionality exploration; they rely on\u0000specific, detailed task descriptions, limiting their adaptability in dynamic\u0000web environments. We introduce NaviQAte, which frames web application\u0000exploration as a question-and-answer task, generating action sequences for\u0000functionalities without requiring detailed parameters. Our three-phase approach\u0000utilizes advanced large language models like GPT-4o for complex decision-making\u0000and cost-effective models, such as GPT-4o mini, for simpler tasks. NaviQAte\u0000focuses on functionality-guided web application navigation, integrating\u0000multi-modal inputs such as text and images to enhance contextual understanding.\u0000Evaluations on the Mind2Web-Live and Mind2Web-Live-Abstracted datasets show\u0000that NaviQAte achieves a 44.23% success rate in user task navigation and a\u000038.46% success rate in functionality navigation, representing a 15% and 33%\u0000improvement over WebCanvas. These results underscore the effectiveness of our\u0000approach in advancing automated web application testing.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing","authors":"Ana Nunez, Nafis Tanveer Islam, Sumit Kumar Jha, Peyman Najafirad","doi":"arxiv-2409.10737","DOIUrl":"https://doi.org/arxiv-2409.10737","url":null,"abstract":"Recent advancements in automatic code generation using large language models\u0000(LLMs) have brought us closer to fully automated secure software development.\u0000However, existing approaches often rely on a single agent for code generation,\u0000which struggles to produce secure, vulnerability-free code. Traditional program\u0000synthesis with LLMs has primarily focused on functional correctness, often\u0000neglecting critical dynamic security implications that happen during runtime.\u0000To address these challenges, we propose AutoSafeCoder, a multi-agent framework\u0000that leverages LLM-driven agents for code generation, vulnerability analysis,\u0000and security enhancement through continuous collaboration. The framework\u0000consists of three agents: a Coding Agent responsible for code generation, a\u0000Static Analyzer Agent identifying vulnerabilities, and a Fuzzing Agent\u0000performing dynamic testing using a mutation-based fuzzing approach to detect\u0000runtime errors. Our contribution focuses on ensuring the safety of multi-agent\u0000code generation by integrating dynamic and static testing in an iterative\u0000process during code generation by LLM that improves security. Experiments using\u0000the SecurityEval dataset demonstrate a 13% reduction in code vulnerabilities\u0000compared to baseline LLMs, with no compromise in functionality.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"99 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation","authors":"Qingyao Li, Wei Xia, Kounianhua Du, Xinyi Dai, Ruiming Tang, Yasheng Wang, Yong Yu, Weinan Zhang","doi":"arxiv-2409.09584","DOIUrl":"https://doi.org/arxiv-2409.09584","url":null,"abstract":"LLM agents enhanced by tree search algorithms have yielded notable\u0000performances in code generation. However, current search algorithms in this\u0000domain suffer from low search quality due to several reasons: 1) Ineffective\u0000design of the search space for the high-reasoning demands of code generation\u0000tasks, 2) Inadequate integration of code feedback with the search algorithm,\u0000and 3) Poor handling of negative feedback during the search, leading to reduced\u0000search efficiency and quality. To address these challenges, we propose to\u0000search for the reasoning process of the code and use the detailed feedback of\u0000code execution to refine erroneous thoughts during the search. In this paper,\u0000we introduce RethinkMCTS, which employs the Monte Carlo Tree Search (MCTS)\u0000algorithm to conduct thought-level searches before generating code, thereby\u0000exploring a wider range of strategies. More importantly, we construct verbal\u0000feedback from fine-grained code execution feedback to refine erroneous thoughts\u0000during the search. This ensures that the search progresses along the correct\u0000reasoning paths, thus improving the overall search quality of the tree by\u0000leveraging execution feedback. Through extensive experiments, we demonstrate\u0000that RethinkMCTS outperforms previous search-based and feedback-based code\u0000generation baselines. On the HumanEval dataset, it improves the pass@1 of\u0000GPT-3.5-turbo from 70.12 to 89.02 and GPT-4o-mini from 87.20 to 94.51. It\u0000effectively conducts more thorough exploration through thought-level searches\u0000and enhances the search quality of the entire tree by incorporating rethink\u0000operation.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"211 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ContractTinker: LLM-Empowered Vulnerability Repair for Real-World Smart Contracts","authors":"Che Wang, Jiashuo Zhang, Jianbo Gao, Libin Xia, Zhi Guan, Zhong Chen","doi":"arxiv-2409.09661","DOIUrl":"https://doi.org/arxiv-2409.09661","url":null,"abstract":"Smart contracts are susceptible to being exploited by attackers, especially\u0000when facing real-world vulnerabilities. To mitigate this risk, developers often\u0000rely on third-party audit services to identify potential vulnerabilities before\u0000project deployment. Nevertheless, repairing the identified vulnerabilities is\u0000still complex and labor-intensive, particularly for developers lacking security\u0000expertise. Moreover, existing pattern-based repair tools mostly fail to address\u0000real-world vulnerabilities due to their lack of high-level semantic\u0000understanding. To fill this gap, we propose ContractTinker, a Large Language\u0000Models (LLMs)-empowered tool for real-world vulnerability repair. The key\u0000insight is our adoption of the Chain-of-Thought approach to break down the\u0000entire generation task into sub-tasks. Additionally, to reduce hallucination,\u0000we integrate program static analysis to guide the LLM. We evaluate\u0000ContractTinker on 48 high-risk vulnerabilities. The experimental results show\u0000that among the patches generated by ContractTinker, 23 (48%) are valid patches\u0000that fix the vulnerabilities, while 10 (21%) require only minor modifications.\u0000A video of ContractTinker is available at https://youtu.be/HWFVi-YHcPE.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Large Language Models for Predicting Cost and Duration in Software Engineering Projects","authors":"Justin Carpenter, Chia-Ying Wu, Nasir U. Eisty","doi":"arxiv-2409.09617","DOIUrl":"https://doi.org/arxiv-2409.09617","url":null,"abstract":"Accurate estimation of project costs and durations remains a pivotal\u0000challenge in software engineering, directly impacting budgeting and resource\u0000management. Traditional estimation techniques, although widely utilized, often\u0000fall short due to their complexity and the dynamic nature of software\u0000development projects. This study introduces an innovative approach using Large\u0000Language Models (LLMs) to enhance the accuracy and usability of project cost\u0000predictions. We explore the efficacy of LLMs against traditional methods and\u0000contemporary machine learning techniques, focusing on their potential to\u0000simplify the estimation process and provide higher accuracy. Our research is\u0000structured around critical inquiries into whether LLMs can outperform existing\u0000models, the ease of their integration into current practices, outperform\u0000traditional estimation, and why traditional methods still prevail in industry\u0000settings. By applying LLMs to a range of real-world datasets and comparing\u0000their performance to both state-of-the-art and conventional methods, this study\u0000aims to demonstrate that LLMs not only yield more accurate estimates but also\u0000offer a user-friendly alternative to complex predictive models, potentially\u0000transforming project management strategies within the software industry.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"118 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overcoming linguistic barriers in code assistants: creating a QLoRA adapter to improve support for Russian-language code writing instructions","authors":"C. B. Pronin, A. V. Volosova, A. V. Ostroukh, Yu. N. Strogov","doi":"arxiv-2409.09353","DOIUrl":"https://doi.org/arxiv-2409.09353","url":null,"abstract":"In this paper, an approach to training and evaluating an adapter model for\u0000the popular language model \"zephyr-7b-beta\" is described. The adapter was\u0000developed to improve the performance of the base model in tasks related to\u0000programming and understanding the Russian language. Considering the high\u0000quality of the original model in tasks in the English language, the goal of the\u0000research was to expand its linguistic and technical spectrum. The proposed\u0000adapter was trained using a large and diverse dataset, including\u0000question-answer pairs related to programming, as well code-related texts in\u0000Russian language. The applied training methodology ensures an improvement in\u0000the model's quality of answers in understanding and generating Python code\u0000based on Russian instructions. We evaluated the performance of the base model\u0000with the installed adapter using various metrics, comparing it to the base\u0000model as well as other state-of-the-art models in this field. The obtained\u0000results showed significant improvement, both in tasks related to writing Python\u0000code and in processing the Russian language, confirming the effectiveness of\u0000the proposed adapter.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing","authors":"Chenyang Yang, Yining Hong, Grace A. Lewis, Tongshuang Wu, Christian Kästner","doi":"arxiv-2409.09261","DOIUrl":"https://doi.org/arxiv-2409.09261","url":null,"abstract":"Machine learning models make mistakes, yet sometimes it is difficult to\u0000identify the systematic problems behind the mistakes. Practitioners engage in\u0000various activities, including error analysis, testing, auditing, and\u0000red-teaming, to form hypotheses of what can go (or has gone) wrong with their\u0000models. To validate these hypotheses, practitioners employ data slicing to\u0000identify relevant examples. However, traditional data slicing is limited by\u0000available features and programmatic slicing functions. In this work, we propose\u0000SemSlicer, a framework that supports semantic data slicing, which identifies a\u0000semantically coherent slice, without the need for existing features. SemSlicer\u0000uses Large Language Models to annotate datasets and generate slices from any\u0000user-defined slicing criteria. We show that SemSlicer generates accurate slices\u0000with low cost, allows flexible trade-offs between different design dimensions,\u0000reliably identifies under-performing data slices, and helps practitioners\u0000identify useful data slices that reflect systematic problems.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Robust Detection of Open Source Software Supply Chain Poisoning Attacks in Industry Environments","authors":"Xinyi Zheng, Chen Wei, Shenao Wang, Yanjie Zhao, Peiming Gao, Yuanchao Zhang, Kailong Wang, Haoyu Wang","doi":"arxiv-2409.09356","DOIUrl":"https://doi.org/arxiv-2409.09356","url":null,"abstract":"The exponential growth of open-source package ecosystems, particularly NPM\u0000and PyPI, has led to an alarming increase in software supply chain poisoning\u0000attacks. Existing static analysis methods struggle with high false positive\u0000rates and are easily thwarted by obfuscation and dynamic code execution\u0000techniques. While dynamic analysis approaches offer improvements, they often\u0000suffer from capturing non-package behaviors and employing simplistic testing\u0000strategies that fail to trigger sophisticated malicious behaviors. To address\u0000these challenges, we present OSCAR, a robust dynamic code poisoning detection\u0000pipeline for NPM and PyPI ecosystems. OSCAR fully executes packages in a\u0000sandbox environment, employs fuzz testing on exported functions and classes,\u0000and implements aspect-based behavior monitoring with tailored API hook points.\u0000We evaluate OSCAR against six existing tools using a comprehensive benchmark\u0000dataset of real-world malicious and benign packages. OSCAR achieves an F1 score\u0000of 0.95 in NPM and 0.91 in PyPI, confirming that OSCAR is as effective as the\u0000current state-of-the-art technologies. Furthermore, for benign packages\u0000exhibiting characteristics typical of malicious packages, OSCAR reduces the\u0000false positive rate by an average of 32.06% in NPM (from 34.63% to 2.57%) and\u000039.87% in PyPI (from 41.10% to 1.23%), compared to other tools, significantly\u0000reducing the workload of manual reviews in real-world deployments. In\u0000cooperation with Ant Group, a leading financial technology company, we have\u0000deployed OSCAR on its NPM and PyPI mirrors since January 2023, identifying\u000010,404 malicious NPM packages and 1,235 malicious PyPI packages over 18 months.\u0000This work not only bridges the gap between academic research and industrial\u0000application in code poisoning detection but also provides a robust and\u0000practical solution that has been thoroughly tested in a real-world industrial\u0000setting.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking the Influence of Source Code on Test Case Generation","authors":"Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui","doi":"arxiv-2409.09464","DOIUrl":"https://doi.org/arxiv-2409.09464","url":null,"abstract":"Large language models (LLMs) have been widely applied to assist test\u0000generation with the source code under test provided as the context. This paper\u0000aims to answer the question: If the source code under test is incorrect, will\u0000LLMs be misguided when generating tests? The effectiveness of test cases is\u0000measured by their accuracy, coverage, and bug detection effectiveness. Our\u0000evaluation results with five open- and six closed-source LLMs on four datasets\u0000demonstrate that incorrect code can significantly mislead LLMs in generating\u0000correct, high-coverage, and bug-revealing tests. For instance, in the HumanEval\u0000dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions\u0000and correct code, but only 57.12% when given task descriptions and incorrect\u0000code. For the APPS dataset, prompts with correct code yield tests that detect\u000039.85% of the bugs, while prompts with incorrect code detect only 19.61%. These\u0000findings have important implications for the deployment of LLM-based testing:\u0000using it on mature code may help protect against future regression, but on\u0000early-stage immature code, it may simply bake in errors. Our findings also\u0000underscore the need for further research to improve LLMs resilience against\u0000incorrect code in generating reliable and bug-revealing tests.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}