{"title":"OneMoreTest: A Learning-Based Approach to Generating and Selecting Fault-Revealing Unit Tests","authors":"Wei Wei;Yanjie Jiang;Yahui Li;Lu Zhang;Hui Liu","doi":"10.1109/TSE.2025.3581556","DOIUrl":"10.1109/TSE.2025.3581556","url":null,"abstract":"Developers often manually design a few unit tests for a given method under development. After passing such manually designed tests, however, they usually have to turn to automated test case generation tools like EvoSuite and Randoop for more thorough testing. Although the automatically generated tests may achieve a high coverage, they rarely identify hard-to-detect defects automatically because of the well-known test oracle problem: It is challenging to tell whether the output is correct or incorrect without explicit test oracle (expected output). Consequently, developers should manually select and verify a few suspicious test cases to identify hard-to-detect defects. To this end, in this paper, we propose a novel approach, called <i>OneMoreTest</i>, to generating and selecting the most suspicious tests for manual verification. Based on a manually designed passed test, <i>OneMoreTest</i> automatically generates millions of input-output pairs for the method under test (MUT) with mutation-based fuzzing. It then trains an automatically generated neural network to simulate the MUT’s behavior. For new tests automatically generated for the same MUT, <i>OneMoreTest</i> suggests developers with the top <inline-formula><tex-math>$k$</tex-math></inline-formula> most suspicious tests that have the greatest distances between their actual output and estimated output (i.e., network’s output). Our evaluation on real-world faulty methods suggests that <i>OneMoreTest</i> is accurate. On 70.79% of the involved 178 real-world faulty methods, we can identify the defects by manually verifying only a SINGLE test for each of the methods according to <i>OneMoreTest</i>’s suggestions. Compared against the state of the art, <i>OneMoreTest</i> improved the precision from 46.63% to 72.62%, and recall from 46.63% to 70.79%.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2346-2365"},"PeriodicalIF":5.6,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144488766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enriching Mutation Testing With Innovative Method Invocation Mutation: Filling the Crucial Missing Piece of the Puzzle","authors":"Peng Zhang;Zeyu Lu;Yang Wang;Yibiao Yang;Yuming Zhou;Mike Papadakis","doi":"10.1109/TSE.2025.3573751","DOIUrl":"10.1109/TSE.2025.3573751","url":null,"abstract":"Mutation testing aims to simulate real-world defects, but existing tools often struggle to replicate method invocation defects accurately. To address this, we propose MIN (Method INvocation mutator), which uses a mapping strategy to pair method names with corresponding values, ensuring that methods share argument and return types. This method enhances the feasibility and realism of mutants by considering factors such as library methods, access control, inheritance, and static methods. Experimental results show that integrating MIN into Major (a popular mutation tool) improves semantic similarity to real defects by 11%, increases mutant set diversity to 97.5%, and reduces undetected faults by 38.5%. Furthermore, MIN’s performance rivals that of state-of-the-art machine learning-based mutators like CodeBERT, with a 10x speed advantage over CodeBERT and 4x over DeepMutation in generating compilable mutants. These findings demonstrate that MIN can significantly enhance defect simulation and improve the efficiency of mutation testing.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 7","pages":"2125-2143"},"PeriodicalIF":6.5,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144328538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting Generalizable Fairness With Mahalanobis Distances Guided Boltzmann Exploratory Testing","authors":"Kaixiang Dong;Peng Wu;Yanting Chen","doi":"10.1109/TSE.2025.3581402","DOIUrl":"10.1109/TSE.2025.3581402","url":null,"abstract":"Although machine learning models have been remarkably effective for decision-making tasks such as employment, insurance, and criminal justice, it remains urgent yet challenging to ensure model predictions are reliable and socially fair. This amounts to detecting and repairing potential discriminatory defects of machine learning models extensively with authentic testing data. In this paper, we propose a novel Mahalanobis distance guided Adaptive Exploratory Fairness Testing (MAEFT) approach, which searches for individual discriminatory instances (IDIs) through deep reinforcement learning with an adaptive extension of Boltzmann exploration, and significantly reduces overestimation. MAEFT uses Mahalanobis distances to guide the search with realistic correlations between input features. Thus, through learning a more accurate state-action value approximation, MAEFT can touch a much wider valid input space, reducing sharply the number of duplicate instances visited, and identify more unique tests and IDIs calibrated for the realistic feature correlations. Compared with state-of-the-art black-box and white-box fairness testing methods, our approach generates on average 4.65%-161.66% more unique tests and identifies 154.60%-634.80% more IDIs, with a performance speed-up of 12.54%-1313.47%. Moreover, the IDIs identified by MAEFT can be well exploited to repair the original models through retraining. These IDIs lead to, on average, a 59.15% boost in model fairness, 15.94%-48.73% higher than those identified by the state-of-the-art fairness testing methods. The models retrained with MAEFT also exhibit 37.66%-46.81% stronger generalization ability than those retrained with the state-of-the-art fairness testing methods.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2213-2231"},"PeriodicalIF":5.6,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144328539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Malo in the Code Jungle: Explainable Fault Localization for Decentralized Applications","authors":"Hui Zhang;Jiajing Wu;Zhiying Wu;Zhe Chen;Dan Lin;Jiachi Chen;Yuren Zhou;Zibin Zheng","doi":"10.1109/TSE.2025.3578816","DOIUrl":"10.1109/TSE.2025.3578816","url":null,"abstract":"Decentralized applications (DApps) have long been sitting ducks for hackers due to their valuable cryptocurrency assets, exposing them to various security risks. When a DApp is attacked, promptly identifying faults is crucial to minimizing financial losses and ensuring effective fault repair. However, existing fault localization methods, which mostly rely on code coverage, often fall short for DApps, particularly when dealing with only one fault case. Furthermore, according to a prior survey, most developers expect fault localization tools to provide reasonable explanations. In this paper, we present Malo, a <underline>m</u>ethod for DApp-specific expl<underline>ai</u>nable fault <underline>lo</u>calization. It identifies fault functions through <italic>suspicious token transfer-guided analysis</i>, and then employs Large Language Models (LLMs) to generate explanations for these identified fault functions. Specifically, Malo examines function call traces and source codes of fault cases to acquire <italic>internal knowledge</i>, and also retrieves relevant project documents from the Web to obtain <italic>external knowledge</i>. By integrating internal and external knowledge, Malo generates reasonable explanations for faults in DApps. Our evaluation on a dataset of 68 real-world DApp faults demonstrates that Malo can locate 62% of faults within the Top-5, 9% higher than the state-of-the-art method. The experiment results also demonstrate a remarkable alignment accuracy of 71% between the explanations generated by Malo and the ground truth. In addition, we conduct a user study, which confirms that explanations generated by Malo can aid developers in comprehending the root cause of faults. Our code and dataset are available online: <uri>https://github.com/SodalimeZero/Malo_Code.git</uri>.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 7","pages":"2197-2210"},"PeriodicalIF":6.5,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144288475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MBL-CPDP: A Multi-Objective Bilevel Method for Cross-Project Defect Prediction","authors":"Jiaxin Chen;Jinliang Ding;Kay Chen Tan;Jiancheng Qian;Ke Li","doi":"10.1109/TSE.2025.3577808","DOIUrl":"10.1109/TSE.2025.3577808","url":null,"abstract":"Cross-project defect prediction (CPDP) leverages machine learning (ML) techniques to proactively identify software defects, especially where project-specific data is scarce. However, existing CPDP approaches suffer from three critical limitations: ineffective exploration of high-dimensional parameter spaces, poor adaptability across diverse projects with heterogeneous data distributions, and inadequate handling of feature redundancy and distribution discrepancies between source and target projects. To address these challenges, we formulate CPDP as a multi-objective bilevel optimization (MBLO) method, dubbed <monospace>MBL-CPDP</monospace>. Our approach comprises two nested problems: the upper-level, a multi-objective combinatorial optimization problem, enhances robustness by optimizing ML pipelines that integrate feature selection, transfer learning, and classification techniques, while the lower-level problem fine-tunes their hyperparameters. Unlike traditional methods that employ fragmented optimization strategies or single-objective approaches that introduce bias, <monospace>MBL-CPDP</monospace> provides a holistic, end-to-end optimization framework. Additionally, we propose an ensemble learning method to better capture cross-project distribution differences and improve generalization across diverse datasets. An MBLO algorithm is then presented to effectively solve the formulated MBLO problem. To evaluate <monospace>MBL-CPDP</monospace>’s performance, we compare it with five automated ML tools and 50 CPDP techniques across 20 projects. Extensive empirical results show that <monospace>MBL-CPDP</monospace> outperforms the comparison methods, demonstrating its superior adaptability and comprehensive performance evaluation capability.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2305-2328"},"PeriodicalIF":5.6,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144260108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GNNContext: GNN-based Code Context Prediction for Programming Tasks","authors":"Xiaoye Zheng;Zhiyuan Wan;Shun Liu;Kaiwen Yang;David Lo;Xiaohu Yang","doi":"10.1109/TSE.2025.3578390","DOIUrl":"10.1109/TSE.2025.3578390","url":null,"abstract":"A code context model comprises source code elements and their relations relevant to a programming task. The capture and use of code context models in software tools can benefit software development practices, such as code navigation and search. Prior research has explored approaches that leverage either the structural information of code or interaction histories of developers with integrated development environments to automate the construction of code context models. However, these approaches primarily capture shallow syntactic and lexical features of code elements, with limited ability to capture contextual and structural dependencies among neighboring code elements. In this paper, we propose <sc>GNNContext</small>, a novel approach for predicting code context models based on Graph Neural Networks. Our approach leverages code representation learning models to capture both the syntactic and semantic features of code elements, while employing Graph Neural Networks to learn the structural and contextual information among neighboring code elements in the code context models. To evaluate the effectiveness of our approach, we apply it to a dataset comprising 3,879 code context models that we derive from three Eclipse open-source projects. The evaluation results demonstrate that our proposed approach <sc>GNNContext</small> can significantly outperform the state-of-the-art baseline for code context prediction, achieving average improvements of 62.79%, 56.60%, 73.50% and 81.89% in mean reciprocal rank, top- 1, top-3, and top-5 recall rates, respectively, across predictions of varying steps. Moreover, our approach demonstrates robust performance in a cross-project evaluation setting.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2268-2284"},"PeriodicalIF":5.6,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144259974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why Do Machine Learning Notebooks Crash? An Empirical Study on Public Python Jupyter Notebooks","authors":"Yiran Wang;Willem Meijer;José Antonio Hernández López;Ulf Nilsson;Dániel Varró","doi":"10.1109/TSE.2025.3574500","DOIUrl":"10.1109/TSE.2025.3574500","url":null,"abstract":"Jupyter notebooks have become central in data science, integrating code, text and output in a flexible environment. With the rise of machine learning (ML), notebooks are increasingly used for prototyping and data analysis. However, due to their dependence on complex ML libraries and the flexible notebook semantics that allow cells to be run in any order, notebooks are susceptible to software bugs that may lead to program crashes. This paper presents a comprehensive empirical study focusing on crashes in publicly available Python ML notebooks. We collect 64,031 notebooks containing 92,542 crashes from GitHub and Kaggle, and manually analyze a sample of 746 crashes across various aspects, including crash types and root causes. Our analysis identifies unique ML-specific crash types, such as tensor shape mismatches and dataset value errors that violate API constraints. Additionally, we highlight unique root causes tied to notebook semantics, including out-of-order execution and residual errors from previous cells, which have been largely overlooked in prior research. Furthermore, we identify the most error-prone ML libraries, and analyze crash distribution across ML pipeline stages. We find that over 40% of crashes stem from API misuse and notebook-specific issues. Crashes frequently occur when using ML libraries like TensorFlow/Keras and Torch. Additionally, over 70% of the crashes occur during data preparation, model training, and evaluation or prediction stages of the ML pipeline, while data visualization errors tend to be unique to ML notebooks.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 7","pages":"2181-2196"},"PeriodicalIF":6.5,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11022755","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144210851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Question Selection for Multimodal Code Search Synthesis Using Probabilistic Version Spaces","authors":"Jiarong Wu;Yanyan Jiang;Lili Wei;Congying Xu;Shing-Chi Cheung;Chang Xu","doi":"10.1109/TSE.2025.3565387","DOIUrl":"10.1109/TSE.2025.3565387","url":null,"abstract":"Searching the occurrences of specific code patterns (code search) is a common task in software engineering, and programming by example (PBE) techniques have been applied to ease customizing code patterns. However, previous PBE tools only synthesize programs meeting the input-output examples, which may not always align with the user intent. To bridge this gap, this paper proposes <sc>Excalibur</small>, a multi-modal (example and natural language description) and interactive synthesizer for code search. <sc>Excalibur</small> ensures that the generated programs are correct for the provided examples (soundness) and include the user-intended program (bounded completeness). Furthermore, <sc>Excalibur</small> helps the user identify the user-intended program through question-answer interaction. To minimize the required interaction efforts, question selection is crucial. To improve question selection for code search, we propose probabilistic version spaces (ProbVS), in which the user-intended program’s probability is high and others are low. ProbVS combines traditional version spaces for compactly representing extensive programs and large language models (on the user-provided natural language description) for adjusting programs’ probabilities to align with users’ intents. Extensive experiments on a benchmark of 44 tasks demonstrated the effectiveness of <sc>Excalibur</small> and ProbVS and demystified how ProbVS affects probability distributions and how the configurable parameters affect ProbVS.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 6","pages":"1724-1744"},"PeriodicalIF":6.5,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143889991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepVec: State-Vector Aware Test Case Selection for Enhancing Recurrent Neural Network","authors":"Zhonghao Jiang;Meng Yan;Li Huang;Weifeng Sun;Chao Liu;Song Sun;David Lo","doi":"10.1109/TSE.2025.3565037","DOIUrl":"10.1109/TSE.2025.3565037","url":null,"abstract":"Deep Neural Networks (DNN) have realized significant achievements across various application domains. There is no doubt that testing and enhancing a pre-trained DNN that has been deployed in an application scenario is crucial, because it can reduce the failures of the DNN. DNN-driven software testing and enhancement require large amounts of labeled data. The high cost and inefficiency caused by the large volume of data of manual labeling, and the time consumption of testing all cases in real scenarios are unacceptable. Therefore, test case selection technologies are proposed to reduce the time cost by selecting and only labeling representative test cases without compromising testing performance. Test case selection based on neuron coverage (NC) or uncertainty metrics has achieved significant success in Convolutional Neural Networks (CNN) testing. However, it is challenging to transfer these methods to Recurrent Neural Networks (RNN), which excel at text tasks, due to the mismatch in model output formats and the reliance on image-specific characteristics. What’s more, balancing the execution cost and performance of the algorithm is also indispensable. In this paper, we propose a state-vector aware test case selection method for RNN models, namely DeepVec, which reduces the cost of data labeling and saves computing resources and balances the execution cost and performance. DeepVec selects data using uncertainty metric based on the norm of the output vector at each time step (i.e., state-vector), and similarity metric based on the direction angle of the state-vector. Because test cases with smaller state-vector norms often possess greater information entropy and similar changes of state-vector direction angle indicate similar RNN internal states. These metrics can be calculated with just a single inference, which gives it strong bug detection and model improvement capabilities. We evaluate DeepVec on five popular datasets, containing images and texts as well as commonly used 3 RNN classification models, and compare it with NC-based, uncertainty-based, and other black-box methods. Experimental results demonstrate that DeepVec achieves an average relative improvement of 12.5%-118.22% over baseline methods in selecting fault-revealing test cases with time costs reduced to only 1% to 1‱. At the same time, we find that the absolute accuracy improvement after retraining outperforms baseline methods by 0.29%-24.01% when selecting 15% data to retrain.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 6","pages":"1702-1723"},"PeriodicalIF":6.5,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLMorpheus: Mutation Testing Using Large Language Models","authors":"Frank Tip;Jonathan Bell;Max Schäfer","doi":"10.1109/TSE.2025.3562025","DOIUrl":"10.1109/TSE.2025.3562025","url":null,"abstract":"In mutation testing, the quality of a test suite is evaluated by introducing faults into a program and determining whether the program’s tests detect them. Most existing approaches for mutation testing involve the application of a fixed set of mutation operators, e.g., replacing a “+” with a “-”, or removing a function’s body. However, certain types of real-world bugs cannot easily be simulated by such approaches, limiting their effectiveness. This paper presents a technique for mutation testing where placeholders are introduced at designated locations in a program’s source code and where a Large Language Model (LLM) is prompted to ask what they could be replaced with. The technique is implemented in <italic>LLMorpheus</i>, a mutation testing tool for JavaScript, and evaluated on 13 subject packages, considering several variations on the prompting strategy, and using several LLMs. We find <italic>LLMorpheus</i> to be capable of producing mutants that resemble existing bugs that cannot be produced by <italic>StrykerJS</i>, a state-of-the-art mutation testing tool. Moreover, we report on the running time, cost, and number of mutants produced by <italic>LLMorpheus</i>, demonstrating its practicality.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 6","pages":"1645-1665"},"PeriodicalIF":6.5,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143876139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}