{"title":"Question Selection for Multimodal Code Search Synthesis Using Probabilistic Version Spaces","authors":"Jiarong Wu;Yanyan Jiang;Lili Wei;Congying Xu;Shing-Chi Cheung;Chang Xu","doi":"10.1109/TSE.2025.3565387","DOIUrl":"10.1109/TSE.2025.3565387","url":null,"abstract":"Searching the occurrences of specific code patterns (code search) is a common task in software engineering, and programming by example (PBE) techniques have been applied to ease customizing code patterns. However, previous PBE tools only synthesize programs meeting the input-output examples, which may not always align with the user intent. To bridge this gap, this paper proposes <sc>Excalibur</small>, a multi-modal (example and natural language description) and interactive synthesizer for code search. <sc>Excalibur</small> ensures that the generated programs are correct for the provided examples (soundness) and include the user-intended program (bounded completeness). Furthermore, <sc>Excalibur</small> helps the user identify the user-intended program through question-answer interaction. To minimize the required interaction efforts, question selection is crucial. To improve question selection for code search, we propose probabilistic version spaces (ProbVS), in which the user-intended program’s probability is high and others are low. ProbVS combines traditional version spaces for compactly representing extensive programs and large language models (on the user-provided natural language description) for adjusting programs’ probabilities to align with users’ intents. Extensive experiments on a benchmark of 44 tasks demonstrated the effectiveness of <sc>Excalibur</small> and ProbVS and demystified how ProbVS affects probability distributions and how the configurable parameters affect ProbVS.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 6","pages":"1724-1744"},"PeriodicalIF":6.5,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143889991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepVec: State-Vector Aware Test Case Selection for Enhancing Recurrent Neural Network","authors":"Zhonghao Jiang;Meng Yan;Li Huang;Weifeng Sun;Chao Liu;Song Sun;David Lo","doi":"10.1109/TSE.2025.3565037","DOIUrl":"10.1109/TSE.2025.3565037","url":null,"abstract":"Deep Neural Networks (DNN) have realized significant achievements across various application domains. There is no doubt that testing and enhancing a pre-trained DNN that has been deployed in an application scenario is crucial, because it can reduce the failures of the DNN. DNN-driven software testing and enhancement require large amounts of labeled data. The high cost and inefficiency caused by the large volume of data of manual labeling, and the time consumption of testing all cases in real scenarios are unacceptable. Therefore, test case selection technologies are proposed to reduce the time cost by selecting and only labeling representative test cases without compromising testing performance. Test case selection based on neuron coverage (NC) or uncertainty metrics has achieved significant success in Convolutional Neural Networks (CNN) testing. However, it is challenging to transfer these methods to Recurrent Neural Networks (RNN), which excel at text tasks, due to the mismatch in model output formats and the reliance on image-specific characteristics. What’s more, balancing the execution cost and performance of the algorithm is also indispensable. In this paper, we propose a state-vector aware test case selection method for RNN models, namely DeepVec, which reduces the cost of data labeling and saves computing resources and balances the execution cost and performance. DeepVec selects data using uncertainty metric based on the norm of the output vector at each time step (i.e., state-vector), and similarity metric based on the direction angle of the state-vector. Because test cases with smaller state-vector norms often possess greater information entropy and similar changes of state-vector direction angle indicate similar RNN internal states. These metrics can be calculated with just a single inference, which gives it strong bug detection and model improvement capabilities. We evaluate DeepVec on five popular datasets, containing images and texts as well as commonly used 3 RNN classification models, and compare it with NC-based, uncertainty-based, and other black-box methods. Experimental results demonstrate that DeepVec achieves an average relative improvement of 12.5%-118.22% over baseline methods in selecting fault-revealing test cases with time costs reduced to only 1% to 1‱. At the same time, we find that the absolute accuracy improvement after retraining outperforms baseline methods by 0.29%-24.01% when selecting 15% data to retrain.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 6","pages":"1702-1723"},"PeriodicalIF":6.5,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLMorpheus: Mutation Testing Using Large Language Models","authors":"Frank Tip;Jonathan Bell;Max Schäfer","doi":"10.1109/TSE.2025.3562025","DOIUrl":"10.1109/TSE.2025.3562025","url":null,"abstract":"In mutation testing, the quality of a test suite is evaluated by introducing faults into a program and determining whether the program’s tests detect them. Most existing approaches for mutation testing involve the application of a fixed set of mutation operators, e.g., replacing a “+” with a “-”, or removing a function’s body. However, certain types of real-world bugs cannot easily be simulated by such approaches, limiting their effectiveness. This paper presents a technique for mutation testing where placeholders are introduced at designated locations in a program’s source code and where a Large Language Model (LLM) is prompted to ask what they could be replaced with. The technique is implemented in <italic>LLMorpheus</i>, a mutation testing tool for JavaScript, and evaluated on 13 subject packages, considering several variations on the prompting strategy, and using several LLMs. We find <italic>LLMorpheus</i> to be capable of producing mutants that resemble existing bugs that cannot be produced by <italic>StrykerJS</i>, a state-of-the-art mutation testing tool. Moreover, we report on the running time, cost, and number of mutants produced by <italic>LLMorpheus</i>, demonstrating its practicality.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 6","pages":"1645-1665"},"PeriodicalIF":6.5,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143876139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Applicability of Code Language Models to Scientific Computing Programs","authors":"Qianhui Zhao;Fang Liu;Xiao Long;Chengru Wu;Li Zhang","doi":"10.1109/TSE.2025.3564599","DOIUrl":"10.1109/TSE.2025.3564599","url":null,"abstract":"Scientific Computing Programming Languages (SCPLs), like MATLAB and R, are popular and widely used for computational mathematics. In recent years, pre-trained code language models (CLMs) have automated many code-related tasks, covering various general programming languages. SCPLs share many similarities with general programming languages, including similar syntactic structures and the semantics of identifiers. Despite the similarities, there exist many differences between them. For example, lots of numerical operations and dedicated libraries exist in SCPLs. However, there has been little comprehensive work analyzing CLMs’ capabilities in the understanding and generation of pragmatic scientific computing programs. To this end, we investigate the applicability of code language models for the SCPL analysis, especially focusing on real-world code in open-source repositories. We first create a benchmark that contains programs and documentation from three widely used scientific computing programming languages, then perform an adequate evaluation of existing advanced code language models on both code understanding and generation tasks using the new benchmark, and study the relations of different training strategies, model types, and model sizes to the performance of different tasks and languages. Evaluation results confirm that, compared to general programming languages, SCPLs are more challenging to understand, and especially to generate, but the use of code language models is nevertheless feasible, and the knowledge obtained from the general languages can be transferred to SCPL analysis. A deeper analysis reveals additional challenges in generating code that incorporates API calls relevant to computational mathematics. We believe that our findings can provide guidance on improving tooling and analyses for the scientific programming languages, and also inspire and motivate researchers to improve the robustness of existing code language models.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 6","pages":"1685-1701"},"PeriodicalIF":6.5,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143876140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Testing CPS With Design Assumptions-Based Metamorphic Relations and Genetic Programming","authors":"Claudio Mandrioli;Seung Yeob Shin;Domenico Bianculli;Lionel Briand","doi":"10.1109/TSE.2025.3563121","DOIUrl":"10.1109/TSE.2025.3563121","url":null,"abstract":"Cyber-Physical Systems (CPSs) software is used to enforce desired behaviours on physical systems. To test the interaction between the CPS software and the system’s physics, engineers provide traces of desired physical states and observe traces of the actual physical states. CPS requirements describe how closely the actual physical traces should track the desired traces. These requirements are typically defined for specific, simple input traces such as step or ramp sequences, and thus are not applicable to arbitrary inputs. This limits the availability of oracles for CPSs. Our recent work proposes an approach to testing CPSs using control-theoretical design assumptions instead of requirements. This approach circumvents the oracle problem by leveraging the control-theoretical guarantees that are provided when the design assumptions are satisfied. To address the test case generation and oracle problems, researchers have proposed metamorphic testing, which is based on the study of relations across tests, i.e., metamorphic relations (MRs). In this work, we define MRs based on the design assumptions and explore combinations of these MRs using genetic programming to generate CPS test cases. This enables the generation of CPS input traces with potentially arbitrary shapes, together with associated expected output traces. We use the deviation from the expected output traces to guide the generation of input traces that falsify the MRs. Our experiment results show that the MR-falsification provides engineers with new information, helping them identify passed and failed test cases. Furthermore, we show that the generation of traces that falsify the MRs is a non-trivial problem, which cannot be addressed with a random generation approach but is successfully addressed by our approach based on genetic search.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 6","pages":"1666-1684"},"PeriodicalIF":6.5,"publicationDate":"2025-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10976605","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143873070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Systematic Study on Real-World Android App Bundles","authors":"Yutian Tang;Xiapu Luo;Yuming Zhou","doi":"10.1109/TSE.2025.3560026","DOIUrl":"10.1109/TSE.2025.3560026","url":null,"abstract":"Android app developers currently mainly attempt to merge all functions into one app to fit different types of devices. However, this “one-size-fits-all” strategy can introduce various problems to both developers and end-users, such as slower download speed, and a larger attack surface. To resolve this issue, Google promotes the App Bundle framework and requires all new apps must adopt this framework after August 2021. The app bundle framework allows developers to organize their apps in modules. As a new framework, building an app bundle can be time-consuming and error-prone for developers. To fill this gap, in this paper, we discuss how developers build app bundles in practice. By investing in over 200,000 apps from Google Play, we find that 30% of apps have already adopted app bundles. The adoption ratio of large-size apps is even higher than 90%. We also find hands-on programming practices for building feature modules and dynamic assets in app bundles. This study also finds 12 common design practices, which assist developers in building app bundles.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 5","pages":"1615-1628"},"PeriodicalIF":6.5,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143822716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Retrieval-Augmented Fine-Tuning for Improving Retrieve-and-Edit Based Assertion Generation","authors":"Hongyan Li;Weifeng Sun;Meng Yan;Ling Xu;Qiang Li;Xiaohong Zhang;Hongyu Zhang","doi":"10.1109/TSE.2025.3558403","DOIUrl":"10.1109/TSE.2025.3558403","url":null,"abstract":"Unit Testing is crucial in software development and maintenance, aiming to verify that the implemented functionality is consistent with the expected functionality. A unit test is composed of two parts: a test prefix, which drives the unit under test to a particular state, and a test assertion, which determines what the expected behavior is under that state. To reduce the effort of conducting unit tests manually, Yu et al. proposed an integrated approach (<i>integration</i> for short), combining information retrieval with a deep learning-based approach to generate assertions for test prefixes, and obtained promising results. In our previous work, we found that the overall performance of <i>integration</i> is mainly due to its success in retrieving assertions. Moreover, <i>integration</i> is limited to specific types of edit operations and struggles to understand the semantic differences between the retrieved focal-test (<i>focal-test</i> includes a test prefix and a unit under test) and the input focal-test. Based on these insights, we then proposed a retrieve-and-edit approach named <small>EditAS</small> to learn the assertion edit patterns to improve the effectiveness of assertion generation in our prior study. Despite being promising, we find that the effectiveness of <small>EditAS</small> can be further improved. Our analysis shows that: ① The editing ability of <small>EditAS</small> still has ample room for improvement. Its performance degrades as the edit distance between the retrieval assertion and ground truth increases. Specifically, the average accuracy of <small>EditAS</small> is <inline-formula><tex-math>$12.38%$</tex-math></inline-formula> when the edit distance is greater than 5. ② <small>EditAS</small> lacks a fine-grained semantic understanding of both the retrieved focal-test and the input focal-test themselves, which leads to many inaccurate token modifications. In particular, an average of 25.57% of the incorrectly generated assertions that need to be modified are not modified, and an average of 6.45% of the assertions that match the ground truth are still modified. Thanks to pre-trained models employing pre-training paradigms on large-scale data, they tend to have good semantic comprehension and code generation abilities. In light of this, we propose <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula>, which improves retrieval-and-edit based assertion generation through retrieval-augmented fine-tuning. Specifically, <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> first retrieves a similar focal-test from a predefined corpus and treats its assertion as a prototype. Then, <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> uses a pre-trained model, CodeT5, to learn the semantics of the input and similar focal-tests as well as assertion editing patterns to automatically edit the prototype. 
We first evaluate the <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> for i","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 5","pages":"1591-1614"},"PeriodicalIF":6.5,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143797837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
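A schematic of the retrieve-and-edit flow described in this abstract: retrieve the most similar focal-test from a corpus, treat its assertion as a prototype, and let a fine-tuned seq2seq model (CodeT5 in the paper) edit the prototype conditioned on both focal-tests. The Jaccard retriever and the prompt layout are assumptions for illustration; the paper's retrieval and fine-tuning setup may differ.

```python
# Sketch of retrieval-augmented assertion generation. edit_model stands in
# for a fine-tuned seq2seq model; names and prompt format are hypothetical.
from typing import Callable, List, Tuple

def jaccard(a: str, b: str) -> float:
    """Token-set similarity between two code snippets."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / (len(ta | tb) or 1)

def retrieve(focal_test: str,
             corpus: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the (focal_test, assertion) pair most similar to the input."""
    return max(corpus, key=lambda pair: jaccard(focal_test, pair[0]))

def generate_assertion(focal_test: str, corpus: List[Tuple[str, str]],
                       edit_model: Callable[[str], str]) -> str:
    """Concatenate the input, the retrieved focal-test, and the prototype
    assertion so the fine-tuned model can copy what fits and edit the rest."""
    sim_ft, prototype = retrieve(focal_test, corpus)
    prompt = f"INPUT: {focal_test}\nSIMILAR: {sim_ft}\nPROTOTYPE: {prototype}"
    return edit_model(prompt)
```

Feeding both focal-tests to the model, rather than only edit operations, is what lets it notice fine-grained semantic differences and avoid the unnecessary token modifications the abstract reports.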