{"title":"Unleashing the Potential of Coverage Representation in Deep Learning-Based Fault Localization","authors":"Jinping Wang;Yan Lei;Huan Xie;Chunyan Liu","doi":"10.1109/TSE.2026.3674478","DOIUrl":"10.1109/TSE.2026.3674478","url":null,"abstract":"Fault localization (FL) aims to identify suspicious statements in faulty programs that may lead to program failure. Coverage representation, which includes code coverage traces and test outcomes, has proven effective in FL and is widely adopted in deep learning-based fault localization (DLFL) methods. However, we find that the potential of coverage representation in DLFL methods is inadequately unleashed (<italic>i.e.,</italic> when pure coverage representation is used), since DLFL has exhibited inferior performance compared to traditional non-deep-learning-based fault localization (Non-DLFL). Thus, we conduct systematic analyses across the three key steps of the DLFL task process: data, model, and strategy. These analyses reveal that existing methodologies fail to handle three key scale variations in DLFL: the sample scale of data, the feature scale of the model, and the task scale of the strategy. To address these challenges, we propose <bold>Muser</bold>, a <bold><u>mu</u></bold>lti-<bold><u>s</u></bold>cale-awar<bold><u>e</u></bold> deep lea<bold><u>r</u></bold>ning-based fault localization method built on the data-model-strategy framework to unleash the potential of coverage representation in DLFL.
Specifically, <bold>Muser</bold> addresses the three scale variations through a hierarchical design: for the sample scale, <bold>Muser</bold> employs a sample-scale-aware data augmentation method that dynamically selects optimal methods based on the sample size to mitigate class imbalance; for the feature scale, <bold>Muser</bold> uses a proposed feature-scale-aware adaptive LSTM backbone with model sizing adjustments to handle varying feature dimensionality in coverage representations; for the task scale, <bold>Muser</bold> further employs task-scale-aware modeling strategies to enhance robustness across diverse fault localization scenarios, thereby systematically improving model adaptability and performance. We conduct large-scale experiments to evaluate <bold>Muser</bold>, and the results show that <bold>Muser</bold> significantly improves pure coverage-based DLFL performance, <italic>e.g.,</italic> the data augmentation method and the backbone neural network outperform the state-of-the-art (SOTA) baselines (<italic>i.e.,</italic> PRAM and RNN-FL) by 12.41% and 94.44% on average in the Top-1 metric, respectively.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 4","pages":"1592-1616"},"PeriodicalIF":5.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147471005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-Effective Adversarial Attacks Against Code LLM With Model Attention","authors":"Weifeng Sun;Naiqi Huang;Meng Yan;Li Huang;Zhongxin Liu;Xiao Liu;David Lo","doi":"10.1109/TSE.2026.3663143","DOIUrl":"10.1109/TSE.2026.3663143","url":null,"abstract":"Code LLMs (CLLMs) are vulnerable to adversarial attacks, where semantically identical code mutations mislead models into incorrect predictions. To address this, adversarial training has been proposed, retraining models with adversarial examples generated by attack methods. Among various attack approaches, black-box methods have attracted increasing attention due to their flexibility and applicability. However, existing black-box attack methods face two key challenges: 1) vast mutation spaces limit attack efficiency and effectiveness, and 2) resource-intensive model queries constrain scalability. These challenges hinder the practicality of black-box attacks, especially under resource constraints, prompting the critical question: <italic>Can we enhance the efficiency of existing attack methods without compromising their effectiveness?</italic> To answer this, we conduct an empirical study using Explainable AI (XAI) techniques to investigate differences between adversarial and non-adversarial (failure) examples. After analyzing state-of-the-art attack methods against two CLLMs, we introduce the concept of <italic>model attention deviation</italic>, which quantifies differences in the model’s focus between unmutated (original) and mutated code. Our findings reveal that adversarial examples exhibit significant attention deviations, with the direction of deviation critically affecting attack success.
Building on these insights, we propose <sc>AdvSel</sc>, an efficient adversarial attack framework comprising two proxy components: the Attention Proxy Model (APM), which quickly estimates attention deviations to filter unpromising mutations, and the Deviation Direction Proxy Model (DDPM), which assesses whether attention shifts lead toward incorrect predictions. By integrating these proxy models with existing attack methods, <sc>AdvSel</sc> effectively prioritizes promising mutations, significantly improving attack efficiency. Experimental evaluations across five CLLMs, four downstream tasks, and three attack methods demonstrate that <sc>AdvSel</sc> maintains comparable attack success rates (a slight ASR reduction of 0.62%–0.70%) while significantly reducing model queries (by 34.98%–42.91%) and runtime (by 20.84%–21.45%). Under resource constraints, <sc>AdvSel</sc> consistently outperforms baselines, highlighting its practical advantage in cost-effective adversarial evaluation.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 4","pages":"1371-1390"},"PeriodicalIF":5.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146204900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CVH-REC: A Novel Method for Web API Recommendation Based on Cross-View HGNNs","authors":"Shanquan Gao;Yihui Wang;Zhenwei Ou","doi":"10.1109/TSE.2026.3666184","DOIUrl":"10.1109/TSE.2026.3666184","url":null,"abstract":"With the advancement of service computing technology, software developers tend to consume one or more web APIs, a practice that helps them avoid reinventing the wheel. These web APIs provide services or data on the Internet; developers can use them to create feature-rich mashups. Against this backdrop, the number of web APIs across various platforms is growing rapidly, making it increasingly challenging to identify suitable ones for upcoming mashup creation. Consequently, web API recommendation has emerged as an effective means of facilitating web API discovery. We previously proposed a web API recommendation method called R2API, which constructs the interactions between mashups and web APIs, as well as their tag usage records, into multiple homogeneous hypergraphs and then adopts HGNNs with multi-task learning to learn entity vectors for the recommendation task. While its results are encouraging, R2API is limited by its use of simple homogeneous hypergraphs to describe entities, which fails to characterize them comprehensively and accurately. To further enhance recommendation performance, this work proposes a novel cross-view HGNNs-based web API recommendation method, namely CVH-REC. First, CVH-REC models the interactions between mashups and web APIs, as well as their tag usage records, as a multi-view knowledge graph to characterize entities more comprehensively and accurately. This knowledge graph comprises a global main hypergraph and four sub-hypergraphs from local views. Second, CVH-REC adopts a contrastive learning and multi-task learning framework to drive multiple HGNNs in jointly learning entity vectors.
Third, CVH-REC leverages the SBERT model to derive the semantic vector from the mashup requirement and transfers it into the vector space of the knowledge graph with an MLP. This process enables the generation of a higher-quality requirement vector. By comparing the vector of the mashup requirement with those of web APIs, CVH-REC generates a recommendation list for mashup creation. Extensive experiments on a real-world dataset demonstrate that the proposed method outperforms baseline methods.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 4","pages":"1446-1461"},"PeriodicalIF":5.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Agentic Recommender Systems: A Systematic Literature Review","authors":"Ivens da Silva Portugal;Paulo Alencar;Donald Cowan","doi":"10.1109/TSE.2026.3657900","DOIUrl":"10.1109/TSE.2026.3657900","url":null,"abstract":"Recommender systems (RSs) are software systems that use machine learning techniques to suggest items, such as movies, products, or routes, to users based on input provided over time. These systems have been widely used by companies such as Amazon, Google, and Netflix, which rely heavily on recommendations as a core element of their interaction with users. In the case of software engineering, RSs are used to recommend, for example, software tasks, developers, APIs, libraries, and bug fixes. More recently, advancements in natural language processing and large language models (LLMs) have led to the creation of LLM agents capable of generating plans, interacting with users and the environment, receiving feedback, and using tools. These developments have resulted in the use of LLM agents in RSs, giving rise to multi-agent RSs referred to as agentic recommender systems (ARSs). However, agentic RSs are not yet well characterized in terms of the basic elements of system modeling and design. This systematic literature review characterizes agentic recommender systems with respect to agents, roles, relationships, prompts, integration, use cases, strategies, and evaluation methods. An analysis of published studies yields insights, such as (i) the identification of 13 distinct types of agents, (ii) the popularity of GPT models, (iii) a framework for ARSs, (iv) a generalized prompt, (v) the use of persona, cues, output format, and zero-shot prompting, (vi) the frequent use of the Amazon, Yelp, and MovieLens datasets, (vii) the use of nDCG@k and Recall@k as common evaluation metrics, and (viii) the identification of many research gaps.
This systematic literature review aims to support the continued research and development of agentic recommender systems.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 4","pages":"1234-1264"},"PeriodicalIF":5.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11363682","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146056291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Function to Repository: Towards Repository-Level Evaluation of Software Vulnerability Detection","authors":"Xin-Cheng Wen;Xinchen Wang;Yujia Chen;Ruida Hu;David Lo;Cuiyun Gao","doi":"10.1109/TSE.2026.3662145","DOIUrl":"10.1109/TSE.2026.3662145","url":null,"abstract":"Deep Learning (DL)-based methods have proven to be effective for software vulnerability detection, with a potential for substantial productivity enhancements in detecting vulnerabilities. Current methods mainly focus on detecting single functions (i.e., intra-procedural vulnerabilities), ignoring the more complex inter-procedural vulnerability detection scenarios that arise in practice. For example, developers routinely engage with program analysis to detect vulnerabilities that span multiple functions within repositories. In addition, the widely-used benchmark datasets generally contain only intra-procedural vulnerabilities, leaving the assessment of inter-procedural vulnerability detection capabilities unexplored. To mitigate these issues, we propose a holistic multi-level evaluation system, named <bold>VulEval</bold>, aiming at evaluating the detection performance on inter- and intra-procedural vulnerabilities simultaneously. Specifically, VulEval consists of three interconnected evaluation tasks: <bold>(1) Function-Level Vulnerability Detection</bold>, aiming at detecting intra-procedural vulnerabilities given a code snippet; <bold>(2) Vulnerability-Related Dependency Prediction</bold>, aiming at retrieving vulnerability-related dependencies from call graphs to provide developers with explanations about the vulnerabilities; and <bold>(3) Repository-Level Vulnerability Detection</bold>, aiming at detecting inter-procedural vulnerabilities by combining them with the dependencies identified in the second task. VulEval also includes a large-scale dataset, with a total of 4,196 CVE entries, 232,239 functions, and 4,699 corresponding repository-level source code samples in the C/C++ programming languages.
By evaluating 19 vulnerability detection methods on data split randomly and by time, respectively, we observe that the repository-level vulnerability detection framework outperforms the corresponding function-level methods, with an average increase of 7.43% in precision, 3.38% in recall, 4.91% in F1 score, and 5.24% in MCC (except for PILOT). This indicates that incorporating vulnerability-related dependencies facilitates vulnerability detection. Our experimental results also demonstrate that the performance of program-analysis- and prompt-based methods is not affected when splitting the data by time. In addition, our findings indicate that the split setting, retrieval techniques, and vulnerability types have substantial impacts on the performance of repository-level vulnerability detection. We conclude with our insights and takeaways for researchers and developers regarding software vulnerability detection in practice.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 4","pages":"1315-1331"},"PeriodicalIF":5.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147279271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visibility of Domain Elements in the Elicitation Process Interviews: A Family of Empirical Studies","authors":"Alejandrina M. Aranda;Oscar Dieste;Jose Ignacio Panach;Natalia Juristo","doi":"10.1109/TSE.2026.3662599","DOIUrl":"10.1109/TSE.2026.3662599","url":null,"abstract":"<italic>Background</italic>: Various factors determine analyst effectiveness during elicitation. While the literature suggests that elicitation technique and time are influential factors, other attributes could also play a role. <italic>Aim</italic>: Determine aspects that may have an influence on analysts’ ability to identify certain elements of the problem domain. <italic>Methodology</italic>: We conducted 14 quasi-experiments, interviewing 134 subjects about two problem domains. For each problem domain, we calculated whether the experimental subjects identified the problem domain elements (concepts, processes, and requirements), i.e., the degree to which these domain elements were visible. <italic>Results</italic>: Domain element visibility does not appear to be related to either analyst experience or analyst-client interaction. Domain element visibility varies depending on how analysts express the information they have elicited. When analysts are directly asked about the knowledge they acquired during elicitation, visibility increases substantially compared to when they describe the information in a written report. <italic>Conclusions</italic>: Further research is required to replicate our results. However, the finding that analysts have difficulty reporting the information they have acquired is useful for identifying alternatives for improving the documentation of elicitation results.
We found evidence that other issues, like domain complexity, the relative importance of different elements within the domain, and the interview script, also seem influential.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 4","pages":"1332-1351"},"PeriodicalIF":5.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11389211","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146161445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Secure Program Partitioning for Smart Contracts With LLM’s In-Context Learning","authors":"Ye Liu;Yuqing Niu;Chengyan Ma;Ruidong Han;Wei Ma;Yi Li;Debin Gao;David Lo","doi":"10.1109/TSE.2026.3668858","DOIUrl":"10.1109/TSE.2026.3668858","url":null,"abstract":"Smart contracts are highly susceptible to manipulation attacks due to the leakage of sensitive information. Addressing manipulation vulnerabilities is particularly challenging because they stem from inherent data confidentiality issues rather than straightforward implementation bugs. To tackle this by preventing sensitive information leakage, we present <sc>PartitionGPT</sc>, the first LLM-driven approach that combines static analysis with the in-context learning capabilities of large language models (LLMs) to partition smart contracts into critical (privileged) and normal codebases, guided by a few annotated sensitive data variables. We evaluated <sc>PartitionGPT</sc> on 18 annotated smart contracts containing 99 sensitive functions. The results demonstrate that <sc>PartitionGPT</sc> successfully generates <italic>compilable</italic> and <italic>verified</italic> partitions, achieving a precision of 80% while reducing code size by more than 26% compared to a function-level partitioning approach.
Furthermore, we evaluated <sc>PartitionGPT</sc> on nine real-world manipulation attacks that led to a total loss of 25 million dollars; <sc>PartitionGPT</sc> effectively prevents eight of them, highlighting its potential for broad applicability and the necessity of secure program partitioning during smart contract development to mitigate manipulation vulnerabilities.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 4","pages":"1549-1567"},"PeriodicalIF":5.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147350745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dynamic Test Oracle for Quantum Programs With Separable Output States","authors":"Yuechen Li;Kai-Yuan Cai;Beibei Yin","doi":"10.1109/TSE.2026.3670211","DOIUrl":"10.1109/TSE.2026.3670211","url":null,"abstract":"As quantum software engineering advances, testing techniques are required to assess the quality of quantum programs (QPs). In the test process, the test oracle is vital for determining whether a test result indicates a success or a failure. Most related works directly measure the output states and acquire the corresponding test results by comparing the output distribution with the expected one. While attention has been paid to fault detection capability, guarantees of the correctness of the produced test results remain limited. Unlike classical programs (CPs), the output quantum states of QPs must be transformed into probabilistic classical outcomes through quantum measurement. This additional measurement operation can cause a test oracle to yield wrong test results. Especially for high-dimensional output spaces, numerous measurement outcomes are required to capture the distribution characteristics, threatening the effectiveness and cost-efficiency of test oracles. Hence, this paper proposes DOSS, a novel specified test oracle that employs a dynamic scheme to integrate a quantum algorithm (i.e., the swap test) with the direct measurement mode. This innovative approach enables the validation of individual outputs, rather than their distribution, during the testing phase. To keep costs acceptable, DOSS decomposes the fully or partially separable output states to lower the dimensionality and simplify the quantum circuit for testing. Empirical studies demonstrate that DOSS generally gives more correct test results than baselines, and maintains reasonable cost on an ideal simulator.
In addition, DOSS’s effectiveness under quantum noise is validated on three noisy simulators.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 4","pages":"1568-1591"},"PeriodicalIF":5.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147361020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting Compiler Fault Localization: Getting the Best of Both Worlds by Fusing Dynamic and Historical Data","authors":"Qingyang Li;Yibiao Yang;Maolin Sun;Jiangchang Wu;Qingkai Shi;Yuming Zhou;Baowen Xu","doi":"10.1109/TSE.2026.3666208","DOIUrl":"10.1109/TSE.2026.3666208","url":null,"abstract":"Compilers are prone to bugs that can have severe consequences for downstream applications. Accurately identifying and localizing compiler faults poses unique challenges due to the inherent complexity and large scale of modern compiler infrastructures. Existing studies have proposed various techniques to construct passing and failing executions by generating witness test programs from bug-inducing test cases or by producing adversarial compilation configurations for the same test program. These executions are then leveraged to apply spectrum-based fault localization (SBFL) techniques for isolating compiler faults, yielding promising results. Recently, Yang et al. revisited SBFL-based techniques and showed that a simple yet widely adopted debugging practice—treating files modified in bug-inducing commits (BICs) as potential fault candidates—can surprisingly outperform SBFL-based techniques on the most critical localization metrics. They further demonstrated that BIC-based and SBFL-based techniques are highly complementary, as they tend to localize different subsets of compiler faults. Consequently, effectively integrating these two sources of information to improve compiler fault localization remains an open and largely unexplored challenge. To address this problem, we propose <sc>DualTrack</sc>, a hybrid approach that integrates dynamic execution information from SBFL with historical information derived from BICs. <sc>DualTrack</sc> employs a two-layer framework that first prioritizes files modified in bug-inducing commits and then refines their rankings using suspiciousness scores computed by SBFL formulas.
An evaluation on 120 real-world compiler bugs from GCC and LLVM shows that <sc>DualTrack</sc> successfully identifies 52% of faulty files at the Top-1 rank, demonstrating a substantial improvement over existing state-of-the-art compiler fault localization techniques.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 4","pages":"1462-1477"},"PeriodicalIF":5.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146231067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Studying and Improving the Soundness of Input-Based Feature-Oriented Debloating","authors":"Jiahao Yuan;Weinuo Leng;Xuan Wei;Qi Xin;Xiaoyuan Xie;Jifeng Xuan","doi":"10.1109/TSE.2026.3660377","DOIUrl":"10.1109/TSE.2026.3660377","url":null,"abstract":"The paper presents (1) a systematic study of the soundness of feature-oriented debloating techniques that use inputs for feature specification and (2) <sc>BlockAug</sc>, a new blocking method we propose for soundness improvement. Feature-oriented debloating aims to eliminate code bloat corresponding to unneeded program features. Many of these techniques rely on a usage profile—typically a set of inputs—to specify the features that should be preserved. They tend to produce programs overfitted to the provided inputs, introducing soundness issues in the form of bugs and vulnerabilities that threaten program correctness and security. However, no prior work has systematically studied the soundness of existing input-based debloating techniques or analyzed the types and causes of the soundness issues they introduce. To fill this gap, we applied 7 input-based techniques to 18 programs from two existing debloating benchmarks and used three fuzzers with multiple sanitizers to detect soundness issues. Our results show that current techniques are highly unsound, as they can introduce a number of issues leading to program crashes. A key cause of the issues is the inappropriate deletion of soundness-related code, such as conditional checks for invalid cases, whose removal can lead to unexpected program states and unconditioned execution. To improve the soundness of such input-based debloating, we propose <sc>BlockAug</sc>, a blocking method applicable to coverage-based techniques. The core idea is to identify each branch deleted by coverage-based code pruning and, instead of leaving the branch empty, augment it to block any execution from passing through it and causing problems.
To assess the effectiveness of <sc>BlockAug</sc>, we used it to augment the debloated programs generated by four coverage-based techniques and evaluated their soundness and generality. We found that <sc>BlockAug</sc> can significantly improve soundness, at the cost of slightly increasing the program size. Although <sc>BlockAug</sc> can alter program semantics, it does not significantly reduce generality; empirically, it largely preserves the program’s ability to handle feature-related inputs not seen during debloating. Moreover, <sc>BlockAug</sc> can forbid the unexpected execution of inputs the program should not have processed, thereby improving program trustworthiness.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 4","pages":"1282-1300"},"PeriodicalIF":5.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146101477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}