{"title":"BallPri: test cases prioritization for deep neuron networks via tolerant ball in variable space","authors":"Chengyu Jia, Jinyin Chen, Xiaohao Li, Haibin Zheng, Luxin Zhang","doi":"10.1007/s10515-025-00498-5","DOIUrl":"10.1007/s10515-025-00498-5","url":null,"abstract":"<div><p>Deep neural networks (DNNs) have gained widespread adoption in various applications, including some safety-critical domains such as autonomous driving. However, despite their impressive capabilities and outstanding performance, DNNs could also exhibit incorrect behaviors that may lead to serious accidents. As a result, it requires security assurance urgently when applied to safety-critical applications. Deep testing has been developed as an effective technique for detecting incorrectness in DNN behaviors and improving their robustness when necessary, but it needs a large amount of labeled test cases that are expensive to obtain due to the labor-intensive data labeling process. Test case prioritization has been proposed to identify more error-exposed test cases earlier in advance, and several techniques such as DeepGini and PRIMA have been developed that achieve effective and efficient prioritization for classification tasks. However, these methods still face challenges such as unreliable validity, limited application scenarios, and high time complexity. To tackle these issues, we present a novel test prioritization method <i>BallPri</i> by using tolerant ball in variable space for DNNs. It extracts tolerant ball of different test cases and use minimum non-parametric likelihood ratio (MinLR) to further enlarge the difference of distribution in variable space, to achieve effective and general test cases prioritizing. Extensive experiments on benchmark datasets and models validate that <i>BallPri</i> outperforms the state-of-the-art methods in three key aspects: (1) <i>Effective</i>—it leverages tolerant ball in variable space to identify malicious bug-revealing inputs. <i>BallPri</i> significantly improves 47.83% prioritization effectiveness and 37.27% prioritization efficiency on average compared with baselines. (2) <i>Extensible</i>—it can be applied to various tasks, data and models. We verify the superiority of <i>BallPri</i> on classification and regression task, convolutional neural network and recurrent neural network model, image, text and speech dataset. (3) <i>Efficient</i>—it achieves a low time complexity compared with existing methods. We further evaluate <i>BallPri</i> against potential adaptive attacks and provide guidance for its accuracy and robustness. The open-source code of <i>BallPri</i> could be downloaded at https://github.com/lixiaohaao/BallPri.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143554080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bash command comment generation via multi-scale heterogeneous feature fusion","authors":"Junsan Zhang, Yang Zhu, Ao Lu, Yudie Yan, Yao Wan","doi":"10.1007/s10515-025-00494-9","DOIUrl":"10.1007/s10515-025-00494-9","url":null,"abstract":"<div><p>Automatic generation of Bash command comments is crucial for understanding and updating commands in software maintenance. Existing mainstream methods mainly focus on learning from the sequential text of Bash commands and combining retrieval-enhanced techniques to generate comments. However, these methods overlook the syntactic structure of Bash commands, thereby limiting the quality and accuracy of generated comments. This paper proposes a heterogeneous Bash comment generation framework named HBCom, which is aimed at deeply exploring the semantic information of Bash commands from command token sequences and syntactic structures to generate more accurate and natural command comments. The core of HBCom lies in constructing a Heterogeneous Information Graph (HIG) based on an Abstract Syntax Tree, which integrates the syntactic structure of Bash commands with the code sequence through six types of edges, providing a solid information basis for subsequent comment generation. In addition, we propose a heterogeneous and multi-scale graph neural network to capture various relationships in HIGs. Subsequently, we utilize a Transformer decoder, combined with a copy mechanism based on multi-head attention, to decode and fuse the HIG and Bash command tokens features, ultimately generating high-quality comments. We conduct extensive experiments on Bash dataset, demonstrating that HBCom outperforms compared baseline models in BLEU, ROUGE-L, and METEOR metrics. Furthermore, human evaluations confirm HBCom’s effectiveness in real-world application scenarios.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143554026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ashis Kumar Mandal, Md Nadim, Chanchal K. Roy, Banani Roy, Kevin A. Schneider
{"title":"Quantum software engineering and potential of quantum computing in software engineering research: a review","authors":"Ashis Kumar Mandal, Md Nadim, Chanchal K. Roy, Banani Roy, Kevin A. Schneider","doi":"10.1007/s10515-025-00493-w","DOIUrl":"10.1007/s10515-025-00493-w","url":null,"abstract":"<div><p>Research in software engineering is essential for improving software development practices, leading to reliable and secure software. Leveraging the principles of quantum physics, quantum computing has emerged as a new computational paradigm that offers significant advantages over classical computing. As quantum computing progresses rapidly, its potential applications across various fields are becoming apparent. In software engineering, many tasks involve complex computations where quantum computers can greatly speed up the development process, leading to faster and more efficient solutions. With the growing use of quantum-based applications in different fields, Quantum Software Engineering (QSE) has emerged as a discipline focused on designing, developing, and optimizing quantum software for diverse applications. This paper aims to review the role of quantum computing in software engineering research and the latest developments in QSE. To our knowledge, this is the first comprehensive review on this topic. We begin by introducing quantum computing, exploring its fundamental concepts, and discussing its potential applications in software engineering. We also examine various QSE techniques that expedite software development. Finally, we discuss the opportunities and challenges in quantum-driven software engineering and QSE. Our study reveals that quantum machine learning and quantum optimization have substantial potential to address classical software engineering tasks, though this area is still limited. Current QSE tools and techniques lack robustness and maturity, indicating a need for more focus. One of the main challenges is that quantum computing has yet to reach its full potential.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143529919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bo Liu, Yanjie Jiang, Yuxia Zhang, Nan Niu, Guangjie Li, Hui Liu
{"title":"Exploring the potential of general purpose LLMs in automated software refactoring: an empirical study","authors":"Bo Liu, Yanjie Jiang, Yuxia Zhang, Nan Niu, Guangjie Li, Hui Liu","doi":"10.1007/s10515-025-00500-0","DOIUrl":"10.1007/s10515-025-00500-0","url":null,"abstract":"<div><p>Software refactoring is an essential activity for improving the readability, maintainability, and reusability of software projects. To this end, a large number of automated or semi-automated approaches/tools have been proposed to locate poorly designed code, recommend refactoring solutions, and conduct specified refactorings. However, even equipped with such tools, it remains challenging for developers to decide where and what kind of refactorings should be applied. Recent advances in deep learning techniques, especially in large language models (LLMs), make it potentially feasible to automatically refactor source code with LLMs. However, it remains unclear how well LLMs perform compared to human experts in conducting refactorings automatically and accurately. To fill this gap, in this paper, we conduct an empirical study to investigate the potential of LLMs in automated software refactoring, focusing on the identification of refactoring opportunities and the recommendation of refactoring solutions. We first construct a high-quality refactoring dataset comprising 180 real-world refactorings from 20 projects, and conduct the empirical study on the dataset. With the to-be-refactored Java documents as input, ChatGPT and Gemini identified only 28 and 7 respectively out of the 180 refactoring opportunities. The evaluation results suggested that the performance of LLMs in identifying refactoring opportunities is generally low and remains an open problem. However, explaining the expected refactoring subcategories and narrowing the search space in the prompts substantially increased the success rate of ChatGPT from 15.6 to 86.7%. Concerning the recommendation of refactoring solutions, ChatGPT recommended 176 refactoring solutions for the 180 refactorings, and 63.6% of the recommended solutions were comparable to (even better than) those constructed by human experts. However, 13 out of the 176 solutions suggested by ChatGPT and 9 out of the 137 solutions suggested by Gemini were unsafe in that they either changed the functionality of the source code or introduced syntax errors, which indicate the risk of LLM-based refactoring.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143527611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CodeDoctor: multi-category code review comment generation","authors":"Yingling Li, Yuhan Wu, Zi’ao Wang, Lei Huang, Junjie Wang, Jianping Li, Minying Huang","doi":"10.1007/s10515-025-00491-y","DOIUrl":"10.1007/s10515-025-00491-y","url":null,"abstract":"<div><p>Code review is an effective software quality assurance activity. However, this process is labor-intensive and time-consuming, requiring reviewers to carefully review under various categories (e.g., function, refactoring, documentation, etc) to generate review comments. Several approaches have been proposed for automatic review comment generation, although they can generate review comments, they hardly cover all manual review comments. Because most of these approaches simply utilize the information of submitted code and review comments, not fully modeling the features of code review (i.e., ignoring review category, the association of issue snippets and review comments). In this paper, we propose CodeDoctor, an automatic review comment generator with data augmentation and category-aware encoder-decoder to generate multi-category review comments. It consists of three main phases: (1) Data augmentation phase, which classifies review comments and builds review exemplars (i.e., the pairs of issue snippet and its comment) to augment review data by using a large language model (LLM) with prompt engineering and feedback loops; (2) Encoder phase, which encodes the inputs (i.e., review category, diff code and review exemplar) into semantic and token representations; (3) Decoder phase, which designs a category-focused decoder to capture the most relevant information of given category for multi-category review comment generation. Evaluations with five commonly-used and state-of-the-art baselines on two datasets show that CodeDoctor outperforms all baselines, with 1770% higher average BLEU-4, 111% higher average ROUGE-L and 49% higher average F1 than the best baseline. Furthermore, a human evaluation also confirms the significant potential of applying CodeDoctor in practical usage. Our approach can relieve the burden of reviewers by automatically generating multi-category review comments, and helps developers better detect code issues as early as possible, thereby facilitating software development.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143513226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bmco-o: a smart code smell detection method based on co-occurrences","authors":"Feiqiao Mao, Kaihang Zhong, Long Cheng","doi":"10.1007/s10515-025-00486-9","DOIUrl":"10.1007/s10515-025-00486-9","url":null,"abstract":"<div><p>Code smell detection is a task aimed at identifying sub-optimal programming structures within code entities that may indicate problems requiring attention. It plays a crucial role in improving software quality. Numerous automatic or semi-automatic methods for code smell detection have been proposed. However, these methods are constrained by the manual setting of detection rules and thresholds, leading to subjective determinations, or they require large-scale labeled datasets for model training. In addition, they exhibit poor detection performance across different projects. Related studies have revealed the existence of co-occurrences among different types of code smells. Therefore, we propose a smart code smell detection method based on code smell co-occurrences, termed BMCo-O. The key insight is that code smell co-occurrences can assist in improving code smell detection. We introduce and utilize <i>code smell co-occurrence impact factor set</i>, a <i> code smell pre-filter mechanism</i>, and a <i>possibility mechanism</i>, which enable BMCo-O to demonstrate outstanding detection performance. To reduce manual intervention, we propose an <i>adaptive detection mechanism</i> that automatically adjusts parameters to detect different types of code smell in various software projects. As an initial attempt, we applied the proposed method to seven classical high-criticality code smells: Message Chain, Feature Envy, Spaghetti Code, Large Class, Complex Class, Refused Bequest, and Long Method. The evaluation results on benchmarks composed of open source software projects demonstrated that BMCo-O significantly outperforms the well-known and widely used methods in detecting these seven classical code smells, especially in F1, with improvements of 137%, 155%, 23%, 195%, 364%, 552% and 35%, respectively. To further verify its effectiveness in actual detection across different software projects, we also implemented a prototype of a new code smell detector using BMCo-O.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ahmad Bilal, Hamid Turab Mirza, Adnan Ahmad, Ibrar Hussain, Ahmad Salman Khan
{"title":"Unveiling functional aspects in google play education app titles and descriptions influencing app success","authors":"Ahmad Bilal, Hamid Turab Mirza, Adnan Ahmad, Ibrar Hussain, Ahmad Salman Khan","doi":"10.1007/s10515-025-00497-6","DOIUrl":"10.1007/s10515-025-00497-6","url":null,"abstract":"<div><p>Users search for applications on the online application store by inputting functional terms, such as “automated assignment solver”, “English translator” and “free VPN”. In response, the application store recommends a list of applications whose titles and descriptions closely match the user’s search terms. Acknowledging this, application developers incorporate trending and frequently searched functional terms into their application titles and descriptions to make them compelling and to enhance the visibility of their products in user searches, thereby increasing the likelihood of application success. However, traditional literature analyzing mobile application titles and descriptions to determine their impact on application success is scarce and may also lack data-analytical approaches. Moreover, the definition of application success provided by existing literature may be flawed, as it solely relies on higher downloads or positive numeric ratings, neglecting the crucial factor of time. This research proposes a Machine Learning-inspired framework to extract functional (aspects) themes from titles and descriptions of Google Play Education applications, influencing their success. It also formulates an enhanced definition of application success that considers downloads and ratings over a specific time period, and also integrates the user sentiment when defining application success. According to the findings of this research, themes of Math and Homework Support, Learning and Practice, Live Assistance and Tutoring, and Instant Solutions and Tools are highly correlated with success within the Education category of the Google Play store. Developers can enhance the visibility and appeal of their applications in user search results by incorporating these themes into their application titles and descriptions, ultimately leading to higher likelihood of success.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143455530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kashumi Madampe, John Grundy, Minh Nguyen, Ellen Welstead-Cloud, Vinh Tuan Huynh, Linh Doan, William Lay, Sayed Hashim
{"title":"EmoReflex: an AI-powered emotion-centric developer insights platform","authors":"Kashumi Madampe, John Grundy, Minh Nguyen, Ellen Welstead-Cloud, Vinh Tuan Huynh, Linh Doan, William Lay, Sayed Hashim","doi":"10.1007/s10515-025-00488-7","DOIUrl":"10.1007/s10515-025-00488-7","url":null,"abstract":"<div><p>There has been great interest in better understanding software engineer emotions during development. But how to do this? We built a prototype AI-powered Emotion-centric Developer Insights Platform, EmoReflex, to support developers to report and reflect how they feel when working on various tasks across different metrics. It also assists their managers to get insights into their team’s emotional health, and provides them with recommendations to guide them handle the team’s emotional wellbeing. We present our tool prototype and evaluation results generated by a user study conducted with two user groups consisting of twenty developers and twenty managers. We present some design implications derived from our user study that can be used to inform design decisions in emotion-centric software development tools.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-025-00488-7.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143455444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MP: motion program synthesis with machine learning interpretability and knowledge graph analogy","authors":"Cheng-Hao Cai","doi":"10.1007/s10515-025-00495-8","DOIUrl":"10.1007/s10515-025-00495-8","url":null,"abstract":"<div><p>The advancement of physics-based engines has led to the popularity of virtual reality. To achieve a more realistic and immersive user experience, the behaviours of objects in virtual scenes are expected to conform to real-world physical laws accurately. This increases the workload and development time for developers. To facilitate development on physics-based engines, this paper proposes MP that is a motion program synthesis approach based on machine learning and analogical reasoning. MP follows the paradigm of test-driven development, where programs are generated to fit test cases of motions subject to multiple environmental factors such as gravity and airflows. To reduce the search space of code generation, regression models are used to find variables that cause significant influences to motions, while analogical reasoning on knowledge graphs is used to find operators that work for the found variables. Besides, constraint solving is used to probabilistically estimate the values of constants in motion programs. Experimental results have demonstrated that MP is efficient in various motion program generation tasks, with random forest regressors achieving low data and time requirements.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-025-00495-8.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143438677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLM-enhanced evolutionary test generation for untyped languages","authors":"Ruofan Yang, Xianghua Xu, Ran Wang","doi":"10.1007/s10515-025-00496-7","DOIUrl":"10.1007/s10515-025-00496-7","url":null,"abstract":"<div><p>Dynamic programming languages, such as Python, are widely used for their flexibility and support for rapid development. However, the absence of explicit parameter type declarations poses significant challenges in generating automated test cases. This often leads to random assignment of parameter types, increasing the search space and reducing testing efficiency. Current evolutionary algorithms, which rely heavily on random mutations, struggle to handle specific data types and frequently fall into local optima, making it difficult to generate high-quality test cases. Moreover, the resulting test suites often contain errors, preventing immediate usage in real-world applications. To address these challenges, this paper proposes the use of large language models to enhance test case generation for dynamic programming languages. Our method involves three key steps: analyzing parameter types to narrow the search space, introducing meaningful data during mutations to increase test case relevance, and using large language models to automatically repair errors in the generated test suites. Experimental results demonstrate a 16% improvement in test coverage, faster evolutionary cycles, and an increase in the number of executable test suites. These findings highlight the potential of large language models in improving both the efficiency and reliability of test case generation for dynamic programming languages.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143423179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}