{"title":"Pipe-DBT: enhancing dynamic binary translation simulators to support pipeline-level simulation","authors":"Tiancheng Tang, Yi Man, Xinbing Zhou, Duqing Wang","doi":"10.1007/s10515-025-00506-8","DOIUrl":"10.1007/s10515-025-00506-8","url":null,"abstract":"<div><p>In response to the lack of pipeline behavior modeling in Instruction-Set Simulators (ISS) and the performance limitations of Cycle-Accurate Simulators (CAS), this paper proposes Pipe-DBT, a pipeline simulation framework based on Dynamic Binary Translation (DBT). This method achieves a balance between accuracy and efficiency through two key techniques: (1) the design of a pipeline state descriptor called Pipsdep, which abstracts data hazards and resource contentions in the form of formal rules about resource occupancy and read/write behaviors, thereby avoiding low-level hardware details; (2) the introduction of a coroutine-based instruction execution flow partitioning mechanism that employs dynamic suspension/resumption to realize cycle-accurate scheduling in multi-stage pipelines. Implemented on QEMU, Pipe-DBT supports variable-length pipelines, a Very Long Instruction Word (VLIW) architecture with four-issue capability, and pipeline forwarding. Under typical DSP workloads, it achieves a simulation speed of 400–1100 KIPS, representing a 2.3<span>(times)</span> improvement over Gem5 in cycle-accurate mode. Experimental results show that only modular extensions to the host DBT framework are required to accommodate heterogeneous pipeline microarchitectures, thereby providing a high-throughput simulation infrastructure for processor design. To the best of our knowledge, this is the first pipeline-level simulation model implemented on a DBT simulator.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 2","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-025-00506-8.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An empirical study on the code naturalness modeling capability for LLMs in automated patch correctness assessment","authors":"Yuning Li, Wenkang Zhong, Zongwen Shen, Chuanyi Li, Xiang Chen, Jidong Ge, Bin Luo","doi":"10.1007/s10515-025-00502-y","DOIUrl":"10.1007/s10515-025-00502-y","url":null,"abstract":"<div><p>Just like natural language, code can exhibit naturalness. This property manifests in highly repetitive patterns within specific contexts. Code naturalness can be captured by language models and then applied to various software engineering tasks (such as fault localization and program repair). Recently, Large Language Models (LLMs) based on Transformers have become advantageous tools for modeling code naturalness. However, existing work lacks systematic studies on the code naturalness modeling capability for LLMs. To bridge this gap, this paper explores the code naturalness modeling capability for LLMs, starting with the task of automated patch correctness assessment. Specifically, we investigate whether LLMs with different architectures and scales, under varying context window sizes, (1) can identify buggy code from common code based on naturalness and consider fixed code more natural than buggy code, and (2) can distinguish different degrees of repairs (i.e., complete repairs and incomplete repairs) from automated tools. Then, we propose metrics to assess the above two capabilities of the models. Experimental results indicate that models with different architectures and scales have the code naturalness modeling capability, even models not specifically pre-trained on code. Additionally, smaller models do not necessarily exhibit weaker modeling capability compared to larger models. We also find more contextual information only provides limited benefits. Based on experimental findings, we select the best performing model that has 220 M parameters to develop an Entropy-based Automated Patch Correctness Assessment (E-APCA) approach by calculating code naturalness. On the large-scale dataset PraPatch, E-APCA surpasses traditional methods by over 20% across various evaluation metrics. Compared to the latest APCA method Entropy-delta based on a 6.7B LLM, E-APCA achieves a 17.32% higher correct patch recall and a 6.83% higher F1 score, while the reasoning time is less than 7% of that required by Entropy-delta.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 2","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143761732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ladle: a method for unsupervised anomaly detection across log types","authors":"Juha Mylläri, Tatu Aalto, Jukka K. Nurminen","doi":"10.1007/s10515-025-00504-w","DOIUrl":"10.1007/s10515-025-00504-w","url":null,"abstract":"<div><p>Log files can help detect and diagnose erroneous software behaviour, but their utility is limited by the ability of users and developers to sift through large amounts of text. Unsupervised machine learning tools have been developed to automatically find anomalies in logs, but they are usually not designed for situations where a large number of log streams or log files, each with its own characteristics, need to be analyzed and their anomaly scores compared. We propose Ladle, an accurate unsupervised anomaly detection and localization method that can simultaneously learn the characteristics of hundreds of log types and determine which log entries are the most anomalous across these log types. Ladle uses a sentence transformer (a large language model) to embed short overlapping segments of log files and compares new, potentially anomalous, log segments against a collection of reference data. The result of the comparison is re-centered by subtracting a baseline score indicating how much variation tends to occur in each log type, making anomaly scores comparable across log types. Ladle is designed to adapt to data drift and is updated by adding new reference data without the need to retrain the sentence transformer. We demonstrate the accuracy of Ladle on a real-world dataset consisting of logs produced by an endpoint protection platform test suite. We also compare Ladle’s performance on the dataset to that of a state-of-the-art method for single-log anomaly detection, showing that the latter is inadequate for the multi-log task.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 2","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-025-00504-w.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143676513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Requirement falsification for cyber-physical systems using generative models","authors":"Jarkko Peltomäki, Ivan Porres","doi":"10.1007/s10515-025-00503-x","DOIUrl":"10.1007/s10515-025-00503-x","url":null,"abstract":"<div><p>We present the OGAN algorithm for automatic requirement falsification of cyber-physical systems. System inputs and outputs are represented as piecewise constant signals over time while requirements are expressed in signal temporal logic. OGAN can find inputs that are counterexamples for the correctness of a system revealing design, software, or hardware defects before the system is taken into operation. The OGAN algorithm works by training a generative machine learning model to produce such counterexamples. It executes tests offline and does not require any previous model of the system under test. We evaluate OGAN using the ARCH-COMP benchmark problems, and the experimental results show that generative models are a viable method for requirement falsification. OGAN can be applied to new systems with little effort, has few requirements for the system under test, and exhibits state-of-the-art CPS falsification efficiency and effectiveness.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 2","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-025-00503-x.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143676544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tab: template-aware bug report title generation via two-phase fine-tuned models","authors":"Xiao Liu, Yinkang Xu, Weifeng Sun, Naiqi Huang, Song Sun, Qiang Li, Dan Yang, Meng Yan","doi":"10.1007/s10515-025-00505-9","DOIUrl":"10.1007/s10515-025-00505-9","url":null,"abstract":"<div><p>Bug reports play a critical role in the software development lifecycle by helping developers identify and resolve defects efficiently. However, the quality of bug report titles, particularly in open-source communities, can vary significantly, which complicates the bug triage and resolution processes. Existing approaches, such as iTAPE, treat title generation as a one-sentence summarization task using sequence-to-sequence models. While these methods show promise, they face two major limitations: (1) they do not consider the distinct components of bug reports, treating the entire report as a homogeneous input, and (2) they struggle to handle the variability between template-based and non-template-based reports, often resulting in suboptimal titles. To address these limitations, we propose <span>TAB</span>, a hybrid framework that combines a <i>Document Component Analyzer</i> based on a pre-trained BERT model and a <i>Title Generation Model</i> based on CodeT5. <span>TAB</span> addresses the first limitation by segmenting bug reports into four components-<i>Description</i>, <i>Reproduction</i>, <i>Expected Behavior</i>, and <i>Others</i>-to ensure better alignment between input and output. For the second limitation, <span>TAB</span> uses a divergent approach: for template-based reports, titles are generated directly, while for non-template reports, DCA extracts key components to improve title relevance and clarity. We evaluate <span>TAB</span> on both template-based and non-template-based bug reports, demonstrating that it significantly outperforms existing methods. Specifically, <span>TAB</span> achieves average improvements of 170.4–389.5% in METEOR, 67.8–190.0% in ROUGE-L, and 65.7–124.5% in chrF(AF) compared to baseline approaches on template-based reports. Additionally, on non-template-based reports, <span>TAB</span> shows an average improvement of 64% in METEOR, 3.6% in ROUGE-L, and 14.8% in chrF(AF) over the state-of-the-art. These results confirm the robustness of <span>TAB</span> in generating high-quality titles across diverse bug report formats.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 2","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143668190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reinforcement learning for mutation operator selection in automated program repair","authors":"Carol Hanna, Aymeric Blot, Justyna Petke","doi":"10.1007/s10515-025-00501-z","DOIUrl":"10.1007/s10515-025-00501-z","url":null,"abstract":"<div><p>Automated program repair techniques aim to aid software developers with the challenging task of fixing bugs. In heuristic-based program repair, a search space of mutated program variants is explored to find potential patches for bugs. Most commonly, every selection of a mutation operator during search is performed uniformly at random, which can generate many buggy, even uncompilable programs. Our goal is to reduce the generation of variants that do not compile or break intended functionality which waste considerable resources. In this paper, we investigate the feasibility of a reinforcement learning-based approach for the selection of mutation operators in heuristic-based program repair. Our proposed approach is programming language, granularity-level, and search strategy agnostic and allows for easy augmentation into existing heuristic-based repair tools. We conducted an extensive empirical evaluation of four operator selection techniques, two reward types, two credit assignment strategies, two integration methods, and three sets of mutation operators using 30,080 independent repair attempts. We evaluated our approach on 353 real-world bugs from the Defects4J benchmark. The reinforcement learning-based mutation operator selection results in a higher number of test-passing variants, but does not exhibit a noticeable improvement in the number of bugs patched in comparison with the baseline, uniform random selection. While reinforcement learning has been previously shown to be successful in improving the search of evolutionary algorithms, often used in heuristic-based program repair, it has yet to demonstrate such improvements when applied to this area of research.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 2","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-025-00501-z.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A study on prompt design, advantages and limitations of ChatGPT for deep learning program repair","authors":"Jialun Cao, Meiziniu Li, Ming Wen, Shing-Chi Cheung","doi":"10.1007/s10515-025-00492-x","DOIUrl":"10.1007/s10515-025-00492-x","url":null,"abstract":"<div><p>The emergence of large language models (LLMs) such as ChatGPT has revolutionized many fields. In particular, recent advances in LLMs have triggered various studies examining the use of these models for software development tasks, such as program repair, code understanding, and code generation. Prior studies have shown the capability of ChatGPT in repairing conventional programs. However, debugging deep learning (DL) programs poses unique challenges since the decision logic is not directly encoded in the source code. This requires LLMs to not only parse the source code syntactically but also understand the intention of DL programs. Therefore, ChatGPT’s capability in repairing DL programs remains unknown. To fill this gap, our study aims to answer three research questions: (1) Can ChatGPT debug DL programs effectively? (2) How can ChatGPT’s repair performance be improved by prompting? (3) In which way can dialogue help facilitate the repair? Our study analyzes the typical information that is useful for prompt design and suggests enhanced prompt templates that are more efficient for repairing DL programs. On top of them, we summarize the dual perspectives (i.e., advantages and disadvantages) of ChatGPT’s ability, such as its handling of API misuse and recommendation, and its shortcomings in identifying default parameters. Our findings indicate that ChatGPT has the potential to repair DL programs effectively and that prompt engineering and dialogue can further improve its performance by providing more code intention. We also identified the key intentions that can enhance ChatGPT’s program repairing capability.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-025-00492-x.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143564412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BallPri: test cases prioritization for deep neuron networks via tolerant ball in variable space","authors":"Chengyu Jia, Jinyin Chen, Xiaohao Li, Haibin Zheng, Luxin Zhang","doi":"10.1007/s10515-025-00498-5","DOIUrl":"10.1007/s10515-025-00498-5","url":null,"abstract":"<div><p>Deep neural networks (DNNs) have gained widespread adoption in various applications, including some safety-critical domains such as autonomous driving. However, despite their impressive capabilities and outstanding performance, DNNs could also exhibit incorrect behaviors that may lead to serious accidents. As a result, it requires security assurance urgently when applied to safety-critical applications. Deep testing has been developed as an effective technique for detecting incorrectness in DNN behaviors and improving their robustness when necessary, but it needs a large amount of labeled test cases that are expensive to obtain due to the labor-intensive data labeling process. Test case prioritization has been proposed to identify more error-exposed test cases earlier in advance, and several techniques such as DeepGini and PRIMA have been developed that achieve effective and efficient prioritization for classification tasks. However, these methods still face challenges such as unreliable validity, limited application scenarios, and high time complexity. To tackle these issues, we present a novel test prioritization method <i>BallPri</i> by using tolerant ball in variable space for DNNs. It extracts tolerant ball of different test cases and use minimum non-parametric likelihood ratio (MinLR) to further enlarge the difference of distribution in variable space, to achieve effective and general test cases prioritizing. Extensive experiments on benchmark datasets and models validate that <i>BallPri</i> outperforms the state-of-the-art methods in three key aspects: (1) <i>Effective</i>—it leverages tolerant ball in variable space to identify malicious bug-revealing inputs. <i>BallPri</i> significantly improves 47.83% prioritization effectiveness and 37.27% prioritization efficiency on average compared with baselines. (2) <i>Extensible</i>—it can be applied to various tasks, data and models. We verify the superiority of <i>BallPri</i> on classification and regression task, convolutional neural network and recurrent neural network model, image, text and speech dataset. (3) <i>Efficient</i>—it achieves a low time complexity compared with existing methods. We further evaluate <i>BallPri</i> against potential adaptive attacks and provide guidance for its accuracy and robustness. The open-source code of <i>BallPri</i> could be downloaded at https://github.com/lixiaohaao/BallPri.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143554080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bash command comment generation via multi-scale heterogeneous feature fusion","authors":"Junsan Zhang, Yang Zhu, Ao Lu, Yudie Yan, Yao Wan","doi":"10.1007/s10515-025-00494-9","DOIUrl":"10.1007/s10515-025-00494-9","url":null,"abstract":"<div><p>Automatic generation of Bash command comments is crucial for understanding and updating commands in software maintenance. Existing mainstream methods mainly focus on learning from the sequential text of Bash commands and combining retrieval-enhanced techniques to generate comments. However, these methods overlook the syntactic structure of Bash commands, thereby limiting the quality and accuracy of generated comments. This paper proposes a heterogeneous Bash comment generation framework named HBCom, which is aimed at deeply exploring the semantic information of Bash commands from command token sequences and syntactic structures to generate more accurate and natural command comments. The core of HBCom lies in constructing a Heterogeneous Information Graph (HIG) based on an Abstract Syntax Tree, which integrates the syntactic structure of Bash commands with the code sequence through six types of edges, providing a solid information basis for subsequent comment generation. In addition, we propose a heterogeneous and multi-scale graph neural network to capture various relationships in HIGs. Subsequently, we utilize a Transformer decoder, combined with a copy mechanism based on multi-head attention, to decode and fuse the HIG and Bash command tokens features, ultimately generating high-quality comments. We conduct extensive experiments on Bash dataset, demonstrating that HBCom outperforms compared baseline models in BLEU, ROUGE-L, and METEOR metrics. Furthermore, human evaluations confirm HBCom’s effectiveness in real-world application scenarios.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143554026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantum software engineering and potential of quantum computing in software engineering research: a review","authors":"Ashis Kumar Mandal, Md Nadim, Chanchal K. Roy, Banani Roy, Kevin A. Schneider","doi":"10.1007/s10515-025-00493-w","DOIUrl":"10.1007/s10515-025-00493-w","url":null,"abstract":"<div><p>Research in software engineering is essential for improving software development practices, leading to reliable and secure software. Leveraging the principles of quantum physics, quantum computing has emerged as a new computational paradigm that offers significant advantages over classical computing. As quantum computing progresses rapidly, its potential applications across various fields are becoming apparent. In software engineering, many tasks involve complex computations where quantum computers can greatly speed up the development process, leading to faster and more efficient solutions. With the growing use of quantum-based applications in different fields, Quantum Software Engineering (QSE) has emerged as a discipline focused on designing, developing, and optimizing quantum software for diverse applications. This paper aims to review the role of quantum computing in software engineering research and the latest developments in QSE. To our knowledge, this is the first comprehensive review on this topic. We begin by introducing quantum computing, exploring its fundamental concepts, and discussing its potential applications in software engineering. We also examine various QSE techniques that expedite software development. Finally, we discuss the opportunities and challenges in quantum-driven software engineering and QSE. Our study reveals that quantum machine learning and quantum optimization have substantial potential to address classical software engineering tasks, though this area is still limited. Current QSE tools and techniques lack robustness and maturity, indicating a need for more focus. One of the main challenges is that quantum computing has yet to reach its full potential.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143529919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}