{"title":"The Impact of Prompt Programming on Function-Level Code Generation","authors":"Ranim Khojah;Francisco Gomes de Oliveira Neto;Mazen Mohamad;Philipp Leitner","doi":"10.1109/TSE.2025.3587794","DOIUrl":"10.1109/TSE.2025.3587794","url":null,"abstract":"Large Language Models (LLMs) are increasingly used by software engineers for code generation. However, limitations of LLMs such as irrelevant or incorrect code have highlighted the need for prompt programming (or prompt engineering) where engineers apply specific prompt techniques (e.g., chain-of-thought or input-output examples) to improve the generated code. While some prompt techniques have been studied, the impact of different techniques — and their interactions — on code generation is still not fully understood. In this study, we introduce CodePromptEval, a dataset of 7072 prompts designed to evaluate five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages) and their effect on the correctness, similarity, and quality of complete functions generated by three LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt techniques significantly influence the generated code, combining multiple techniques does not necessarily improve the outcome. Additionally, we observed a trade-off between correctness and quality when using prompt techniques. Our dataset and replication package enable future research on improving LLM-generated code and evaluating new prompt techniques.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2381-2395"},"PeriodicalIF":5.6,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11077752","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144603464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Practitioners’ Expectations on Log Anomaly Detection","authors":"Xiaoxue Ma;Yishu Li;Jacky Keung;Xiao Yu;Huiqi Zou;Zhen Yang;Federica Sarro;Earl T. Barr","doi":"10.1109/TSE.2025.3586700","DOIUrl":"10.1109/TSE.2025.3586700","url":null,"abstract":"Log anomaly detection has become a common practice for software engineers to analyze software system behavior. Despite significant research efforts in log anomaly detection over the past decade, it remains unclear what are practitioners’ expectations on log anomaly detection and whether current research meets their needs. To fill this gap, we conduct an empirical study, surveying 312 practitioners from 36 countries about their expectations on log anomaly detection. In particular, we investigate various factors influencing practitioners’ willingness to adopt log anomaly detection tools. We then perform a literature review on log anomaly detection, focusing on publications in premier venues from 2015 to 2025, to compare practitioners’ needs with the current state of research. Based on this comparison, we highlight the directions for researchers to focus on to develop log anomaly detection techniques that better meet practitioners’ expectations.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 9","pages":"2455-2471"},"PeriodicalIF":5.6,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144594132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Open Source, Hidden Costs: A Systematic Literature Review on OSS License Management","authors":"Boyuan Li;Chengwei Liu;Lingling Fan;Sen Chen;Zhenlin Zhang;Zheli Liu","doi":"10.1109/TSE.2025.3586411","DOIUrl":"10.1109/TSE.2025.3586411","url":null,"abstract":"Integrating third-party software components is a common practice in modern software development, offering significant advantages in terms of efficiency and innovation. However, this practice is fraught with risks related to software licensing. A lack of understanding may lead to disputes, which can pose serious legal and operational challenges. To these ends, both academia and industry have conducted various investigations and proposed solutions and tools to deal with these challenges. However, significant limitations still remain. Moreover, the rapid evolution of open-source software (OSS) licenses, as well as the rapidly incorporated generative software engineering techniques, such as large language models for code (CodeLLMs), are placing greater demands on the systematic management of software license risks. To unveil the severe challenges and explore possible future directions, we conduct the first systematic literature review (SLR) on 80 carefully selected OSS license-related papers, classifying existing research into three key categories, i.e., license identification, license risk assessment, and license risk mitigation. Based on these, we discuss challenges in existing solutions, conclude the opportunities to shed light on future research directions and offer practical recommendations for practitioners. We hope this thorough review will help bridge the gaps between academia and industry and accelerate the ecosystem-wide governance of legitimate software risks within the software engineering community.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 9","pages":"2432-2454"},"PeriodicalIF":5.6,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144578245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Effectiveness of LLM-as-a-Judge for Code Generation and Summarization","authors":"Giuseppe Crupi;Rosalia Tufano;Alejandro Velasco;Antonio Mastropaolo;Denys Poshyvanyk;Gabriele Bavota","doi":"10.1109/TSE.2025.3586082","DOIUrl":"10.1109/TSE.2025.3586082","url":null,"abstract":"Large Language Models (LLMs) have been recently exploited as judges for complex natural language processing tasks, such as Q&A (Question & Answer). The basic idea is to delegate to an LLM the assessment of the “quality” of the output provided by an automated technique (often another LLM) for tasks for which: (i) quantitative metrics would only tell part of the story, and; (ii) a large-scale human-based evaluation would be too expensive. LLMs-as-a-judge, if proven effective for a specific task, can also unlock new possibilities for automation, with several LLMs proposing a solution for a given instance of the task (<i>e.g.,</i> an answer to a question) and others judging and deciding what is the best output to show the user. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely <i>code generation</i> and <i>code summarization</i>. The rationale for choosing these tasks is two-fold. First, quantitative metrics are usually not enough for the assessment of code summarizers/generators. For example, it is well documented that metrics such as BLEU are quite weak proxies for the quality of the generated summaries. Second, even state-of-the-art techniques still struggle with handling complex instances of these tasks (<i>e.g.,</i> summarizing a quite long / complex function), making them good candidates for benefiting from more advanced solutions envisioning collaboration among LLMs. For <i>code generation</i>, we check whether eight LLMs are able to judge the correctness of 1,405 Java methods and 1,281 Python functions generated by the same LLMs or implemented by humans. For <i>code summarization</i>, we compare the judgment of five LLMs to those provided by ninehumans for <inline-formula><tex-math>$sim$</tex-math></inline-formula> 1.2k summaries, related to both Java and Python functions. Our findings show that GPT-4-turbo is the best LLM in terms of judging capabilities for both tasks, with “smaller” LLMs featuring tens of billions parameters not being able to cope with judging tasks. However, even the best-performing LLM frequently misjudges the correctness of the code and summary quality.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2329-2345"},"PeriodicalIF":5.6,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144565964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TransferFuzz-Pro: Large Language Model Driven Code Debugging Technology for Verifying Propagated Vulnerability","authors":"Siyuan Li;Kaiyu Xie;Yuekang Li;Hong Li;Yimo Ren;Limin Sun;Hongsong Zhu","doi":"10.1109/TSE.2025.3584774","DOIUrl":"10.1109/TSE.2025.3584774","url":null,"abstract":"Code reuse in software development frequently facilitates the spread of vulnerabilities, leading to imprecise scopes of affected software in CVE reports. Traditional methods focus primarily on detecting reused vulnerability code in target software but lack the ability to confirm whether these vulnerabilities can be triggered in new software contexts. In previous work, we introduced the TransferFuzz framework to address this gap by using historical trace-based fuzzing. However, its effectiveness is constrained by the need for manual intervention and reliance on source code instrumentation. To overcome these limitations, we propose TransferFuzz-Pro, a novel framework that integrates Large Language Model (LLM)-driven code debugging technology. By leveraging LLM for automated, human-like debugging and Proof-of-Concept (PoC) generation, combined with binary-level instrumentation, TransferFuzz-Pro extends verification capabilities to a wider range of targets. Our evaluation shows that TransferFuzz-Pro is significantly faster and can automatically validate vulnerabilities that were previously unverifiable using conventional methods. Notably, it expands the number of affected software instances for 15 CVE-listed vulnerabilities from 15 to 53 and successfully generates PoCs for various Linux distributions. These results demonstrate that TransferFuzz-Pro effectively verifies vulnerabilities introduced by code reuse in target software and automatically generation PoCs.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2396-2411"},"PeriodicalIF":5.6,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144565937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COTE: Predicting Code-to-Test Co-Evolution by Integrating Link Analysis and Pre-Trained Language Model Techniques","authors":"Yuyong Liu;Zhifei Chen;Lin Chen;Yanhui Li;Xuansong Li;Wei Song","doi":"10.1109/TSE.2025.3583027","DOIUrl":"10.1109/TSE.2025.3583027","url":null,"abstract":"Tests, as an essential artifact, should co-evolve with the production code to ensure that the associated production code satisfies specification. However, developers often postpone or even forget to update tests, making the tests outdated and lag behind the code. To predict which tests need to be updated when production code is changed, it is challenging to identify all related tests and determine their change probabilities due to complex change scenarios. This paper fills the gap and proposes a hybrid approach named COTE to predict code-to-test co-evolution. We first compute the linked test candidates based on different code-to-test dependencies. After that, we identify common co-change patterns by building a method-level dependence graph. For the remaining ambiguous patterns, we leverage a pre-trained language model which captures the semantic features of code and the change reasons contained in commit messages to judge one test’s likelihood of being updated. Experiments on our datasets consisting of 6,314 samples extracted from 5,000 Java projects show that COTE outperforms state-of-the-art approaches, achieving a precision of 89.0% and a recall of 71.6%. This work can help practitioners reduce test maintenance costs and improve software quality.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2232-2253"},"PeriodicalIF":5.6,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144503609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OneMoreTest: A Learning-Based Approach to Generating and Selecting Fault-Revealing Unit Tests","authors":"Wei Wei;Yanjie Jiang;Yahui Li;Lu Zhang;Hui Liu","doi":"10.1109/TSE.2025.3581556","DOIUrl":"10.1109/TSE.2025.3581556","url":null,"abstract":"Developers often manually design a few unit tests for a given method under development. After passing such manually designed tests, however, they usually have to turn to automated test case generation tools like EvoSuite and Randoop for more thorough testing. Although the automatically generated tests may achieve a high coverage, they rarely identify hard-to-detect defects automatically because of the well-known test oracle problem: It is challenging to tell whether the output is correct or incorrect without explicit test oracle (expected output). Consequently, developers should manually select and verify a few suspicious test cases to identify hard-to-detect defects. To this end, in this paper, we propose a novel approach, called <i>OneMoreTest</i>, to generating and selecting the most suspicious tests for manual verification. Based on a manually designed passed test, <i>OneMoreTest</i> automatically generates millions of input-output pairs for the method under test (MUT) with mutation-based fuzzing. It then trains an automatically generated neural network to simulate the MUT’s behavior. For new tests automatically generated for the same MUT, <i>OneMoreTest</i> suggests developers with the top <inline-formula><tex-math>$k$</tex-math></inline-formula> most suspicious tests that have the greatest distances between their actual output and estimated output (i.e., network’s output). Our evaluation on real-world faulty methods suggests that <i>OneMoreTest</i> is accurate. On 70.79% of the involved 178 real-world faulty methods, we can identify the defects by manually verifying only a SINGLE test for each of the methods according to <i>OneMoreTest</i>’s suggestions. Compared against the state of the art, <i>OneMoreTest</i> improved the precision from 46.63% to 72.62%, and recall from 46.63% to 70.79%.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2346-2365"},"PeriodicalIF":5.6,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144488766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enriching Mutation Testing With Innovative Method Invocation Mutation: Filling the Crucial Missing Piece of the Puzzle","authors":"Peng Zhang;Zeyu Lu;Yang Wang;Yibiao Yang;Yuming Zhou;Mike Papadakis","doi":"10.1109/TSE.2025.3573751","DOIUrl":"10.1109/TSE.2025.3573751","url":null,"abstract":"Mutation testing aims to simulate real-world defects, but existing tools often struggle to replicate method invocation defects accurately. To address this, we propose MIN (Method INvocation mutator), which uses a mapping strategy to pair method names with corresponding values, ensuring that methods share argument and return types. This method enhances the feasibility and realism of mutants by considering factors such as library methods, access control, inheritance, and static methods. Experimental results show that integrating MIN into Major (a popular mutation tool) improves semantic similarity to real defects by 11%, increases mutant set diversity to 97.5%, and reduces undetected faults by 38.5%. Furthermore, MIN’s performance rivals that of state-of-the-art machine learning-based mutators like CodeBERT, with a 10x speed advantage over CodeBERT and 4x over DeepMutation in generating compilable mutants. These findings demonstrate that MIN can significantly enhance defect simulation and improve the efficiency of mutation testing.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 7","pages":"2125-2143"},"PeriodicalIF":6.5,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144328538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting Generalizable Fairness With Mahalanobis Distances Guided Boltzmann Exploratory Testing","authors":"Kaixiang Dong;Peng Wu;Yanting Chen","doi":"10.1109/TSE.2025.3581402","DOIUrl":"10.1109/TSE.2025.3581402","url":null,"abstract":"Although machine learning models have been remarkably effective for decision-making tasks such as employment, insurance, and criminal justice, it remains urgent yet challenging to ensure model predictions are reliable and socially fair. This amounts to detecting and repairing potential discriminatory defects of machine learning models extensively with authentic testing data. In this paper, we propose a novel Mahalanobis distance guided Adaptive Exploratory Fairness Testing (MAEFT) approach, which searches for individual discriminatory instances (IDIs) through deep reinforcement learning with an adaptive extension of Boltzmann exploration, and significantly reduces overestimation. MAEFT uses Mahalanobis distances to guide the search with realistic correlations between input features. Thus, through learning a more accurate state-action value approximation, MAEFT can touch a much wider valid input space, reducing sharply the number of duplicate instances visited, and identify more unique tests and IDIs calibrated for the realistic feature correlations. Compared with state-of-the-art black-box and white-box fairness testing methods, our approach generates on average 4.65%-161.66% more unique tests and identifies 154.60%-634.80% more IDIs, with a performance speed-up of 12.54%-1313.47%. Moreover, the IDIs identified by MAEFT can be well exploited to repair the original models through retraining. These IDIs lead to, on average, a 59.15% boost in model fairness, 15.94%-48.73% higher than those identified by the state-of-the-art fairness testing methods. The models retrained with MAEFT also exhibit 37.66%-46.81% stronger generalization ability than those retrained with the state-of-the-art fairness testing methods.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2213-2231"},"PeriodicalIF":5.6,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144328539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair","authors":"André Silva;Sen Fang;Martin Monperrus","doi":"10.1109/TSE.2025.3581062","DOIUrl":"10.1109/TSE.2025.3581062","url":null,"abstract":"Automated Program Repair (APR) has evolved significantly with the advent of Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent avenue of research, with many dimensions which have not been explored. Existing work mostly fine-tune LLMs with naive code representations and does not scale to frontier models. To address this problem, we propose RepairLLaMA, a novel program repair approach that 1) identifies optimal code representations for APR with fine-tuned models, and 2) pioneers state-of-the-art parameter-efficient fine-tuning technique (PEFT) for program repair. This results in RepairLLaMA producing a highly effective ‘program repair adapter’ for fixing bugs with AI. Our experiments demonstrate the validity of both concepts. First, fine-tuning adapters with program repair specific code representations enables the model to use meaningful repair signals and produce better patches. Second, parameter-efficient fine-tuning helps fine-tuning to converge and clearly contributes to the effectiveness of RepairLLaMA in fixing bugs outside the fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 144 Defects4J v2, 109 HumanEval-Java, and 20 GitBug-Java bugs, outperforming all baselines.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2366-2380"},"PeriodicalIF":5.6,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11039501","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144319900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}