Title: Relevance of Log Mining and Analytics Papers to IEEE Transactions on Software Engineering
Authors: Massimiliano Di Penta; Domenico Bianculli; Michael R. Lyu; Sebastian Uchitel; Andy Zaidman
DOI: 10.1109/TSE.2025.3591380
Journal: IEEE Transactions on Software Engineering, vol. 51, no. 8, pp. 2211-2212
Published: 2025-08-18
Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11126987

{"title":"Enhancing Protocol Fuzzing via Diverse Seed Corpus Generation","authors":"Zhengxiong Luo;Qingpeng Du;Yujue Wang;Abhik Roychoudhury;Yu Jiang","doi":"10.1109/TSE.2025.3595396","DOIUrl":"10.1109/TSE.2025.3595396","url":null,"abstract":"Protocol fuzzing is an effective technique for discovering vulnerabilities in protocol implementations. Although much progress has been made in optimizing input mutation, the initial seed inputs, which serve as the starting point for fuzzing, are still a critical factor in determining the effectiveness of subsequent fuzzing. Existing methods for seed corpus preparation mainly rely on captured network traffic, which suffers from limited diversity due to the biased message distributions present in real-world traffic. Protocol specifications encompass detailed information on diverse messages and thus provide a more comprehensive way for seed corpus preparation. However, these specifications are voluminous and not directly machine-readable. To address this challenge, we introduce PSG, which enhances protocol fuzzing by leveraging large language models (LLMs) to analyze protocol specifications for generating a high-quality seed corpus. First, PSG systematically reorganizes the protocol specification metadata into a structured knowledge base for effective LLM augmentation. Then, PSG employs a grammar-free method to generate target protocol messages and incorporates an iterative refinement process for better accuracy and efficiency. Our evaluation on 7 widely-used protocols and 13 implementations demonstrates that PSG can effectively generate diverse, protocol-compliant message inputs. Moreover, the generated seed corpus significantly improves the performance of state-of-the-art black-box and grey-box protocol fuzzers, achieving higher branch coverage and discovering more zero-day bugs.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 9","pages":"2693-2709"},"PeriodicalIF":5.6,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144778236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Improving Co-Decoding Based Security Hardening of Code LLMs Leveraging Knowledge Distillation
Authors: Dong Li; Shanfu Shu; Meng Yan; Zhongxin Liu; Chao Liu; Xiaohong Zhang; David Lo
DOI: 10.1109/TSE.2025.3591791
Journal: IEEE Transactions on Software Engineering, vol. 51, no. 9, pp. 2634-2650
Published: 2025-08-01
Abstract: Large language models (LLMs) have been widely adopted by developers in software development. However, the massive pretraining code data is not rigorously filtered, allowing LLMs to learn unsafe coding patterns, and several prior studies have shown that code LLMs tend to generate code with potential vulnerabilities. The widespread adoption of intelligent programming assistants therefore poses a significant threat to the software development process. Existing approaches to mitigating this risk primarily construct secure, vulnerability-free training data and then retrain or fine-tune the models. However, this effort is resource-intensive and requires significant manual supervision; when a model is very large (e.g., more than 1 billion parameters), or when multiple models of the same scale share the same optimization need (e.g., avoiding vulnerable code), such retraining becomes unaffordable. To address this challenge, our previous work proposed CoSec, an approach that improves the security of code LLMs of different sizes by using an independent, very small security model as a decoding navigator. Despite CoSec's strong performance, there is still room for improvement in 1) its ability to maintain the functional correctness of hardened targets and 2) the security of the generated code. To address these issues, we propose CoSec+, a hardening framework consisting of three phases: 1) Functional Correctness Alignment, which improves the functional correctness of the security base with knowledge distillation; 2) Security Training, which yields an independent but much smaller security model; and 3) Co-decoding, in which the security model reasons about the next token jointly with the target model. Because a well-trained security model places higher confidence in secure and correct tokens, it guides the target base model to generate more secure code while also improving its functional correctness. We conducted extensive experiments on several code LLMs (i.e., CodeGen, StarCoderBase, DeepSeekCoder, and Qwen2.5-Coder), and the results show that our approach effectively improves both the functional correctness and the security of the models. CoSec+ delivers a 0.8% to 37.7% security improvement across models of various parameter sizes and families; moreover, it preserves the functional correctness of the target base models, achieving functional-correctness gains of 0.7% to 51.1% for most of those models.

Title: Translating to a Low-Resource Language with Compiler Feedback: A Case Study on Cangjie
Authors: Jun Wang; Chenghao Su; Yijie Ou; Yanhui Li; Jialiang Tan; Lin Chen; Yuming Zhou
DOI: 10.1109/TSE.2025.3594908
Journal: IEEE Transactions on Software Engineering, vol. 51, no. 9, pp. 2671-2692
Published: 2025-08-01
Abstract: In the rapidly advancing field of software development, the demand for practical code translation tools has surged, driven by the need for interoperability across different programming environments. Existing learning-based approaches often struggle with low-resource programming languages that lack sufficient parallel code corpora for training. To address these limitations, we propose a novel training framework that begins with monolingual seed corpora, generates parallel datasets via back-translation, and incorporates compiler feedback to optimize the translation model. As a case study, we apply our method to train a code translation model for Cangjie, a newly introduced low-resource programming language. We also construct a parallel test dataset and test cases for Java-to-Cangjie translation to evaluate the effectiveness of our approach. Experimental results demonstrate that compiler feedback greatly enhances the syntactic correctness, semantic accuracy, and test pass rates of the translated Cangjie code. These findings highlight the potential of our method to support code translation in low-resource settings, expanding the capabilities of learning-based models for programming languages with limited data availability.

Title: Large Language Models-Aided Program Debloating
Authors: Bo Lin; Shangwen Wang; Yihao Qin; Liqian Chen; Xiaoguang Mao
DOI: 10.1109/TSE.2025.3594673
Journal: IEEE Transactions on Software Engineering, vol. 51, no. 9, pp. 2651-2670
Published: 2025-08-01
Abstract: As software grows in complexity to accommodate diverse features and platforms, software bloat has emerged as a significant challenge, adversely affecting performance and security. Existing approaches, however, inadequately address the dual objectives of debloating: maintaining functionality by preserving essential features and enhancing security by reducing security issues. In particular, current software debloating techniques often rely on input-based analysis, using user inputs as proxies for the specifications of desired features, and they frequently overfit the provided inputs, leading to functionality loss and potential security vulnerabilities. To address these limitations, we propose LEADER, a program debloating framework enhanced by large language models (LLMs), which leverages their semantic understanding, generative capabilities, and decision-making strengths. LEADER consists of two main modules: (1) a documentation-guided test augmentation module designed to preserve functionality, which leverages LLMs to comprehend program documentation and generate enough tests to cover the desired features comprehensively, and (2) a multi-advisor-aided program debloating module that employs a neuro-symbolic pipeline to keep security concerns visible during debloating; this module combines debloating and security advisors for analysis and employs an LLM as a decision-maker to eliminate undesired code securely. Extensive evaluations on widely used benchmarks demonstrate the efficacy of LEADER: it achieves a 95.5% test-case pass rate and reduces program size by 42.5%. Notably, it reduces the vulnerabilities introduced during debloating by 79.1% and decreases pre-existing vulnerabilities by 16.5% more than CovA does. These results show that LEADER surpasses the state-of-the-art tool CovA in both functionality and security, and they underscore its potential to set a new standard in program debloating by effectively balancing the two.

Title: LUK: Empowering Log Understanding with Expert Knowledge from Large Language Models
Authors: Lipeng Ma; Weidong Yang; Sihang Jiang; Ben Fei; Mingjie Zhou; Shuhao Li; Mingyu Zhao; Bo Xu; Yanghua Xiao
DOI: 10.1109/TSE.2025.3594046
Journal: IEEE Transactions on Software Engineering
Published: 2025-07-31

Title: How to Save My Gas Fees: Understanding and Detecting Real-World Gas Issues in Solidity Programs
Authors: Mengting He; Shihao Xia; Boqin Qin; Nobuko Yoshida; Tingting Yu; Yiying Zhang; Linhai Song
DOI: 10.1109/TSE.2025.3593930
Journal: IEEE Transactions on Software Engineering, vol. 51, no. 9, pp. 2617-2633
Published: 2025-07-31
Abstract: The execution of smart contracts on Ethereum, a public blockchain system, incurs a fee, called the gas fee, for its computation and data storage. When programmers develop smart contracts (e.g., in the Solidity programming language), they may unknowingly write code snippets that incur unnecessarily high gas fees. These issues, which we call gas wastes, can lead to significant monetary losses for users. This paper helps Ethereum users reduce their gas fees in two key steps. First, we conduct an empirical study of gas wastes in open-source Solidity programs and Ethereum transaction traces. Second, to validate our study findings, we develop a static tool, PeCatch, to effectively detect gas wastes in Solidity programs, and we manually examine the Solidity compiler's code to pinpoint implementation errors that cause gas wastes. Overall, we present 11 insights and four suggestions that can foster future tool development and programmer awareness; fixing the bugs we detected can save $0.76 million in gas fees daily.
