"Automata Learning versus Process Mining: The Case for User Journeys"
Paul Kobialka, Andrea Pferscher, Bernhard K. Aichernig, Einar Broch Johnsen, Silvia Lizeth Tapia Tarifa
IEEE Transactions on Software Engineering. DOI: 10.1109/tse.2026.3679253. Published: 2026-04-03.

"A Cross-Language Approach to Recommending Method Names According to Functional Descriptions"
Waseem Akram, Yanjie Jiang, Haris Ali Khan, Furqan Jalil, Hui Liu
IEEE Transactions on Software Engineering. DOI: 10.1109/tse.2026.3679551. Published: 2026-04-02.

"Towards Refining Developer Questions Using LLM-Based Named Entity Recognition for Developer Chatroom Conversations"
Pouya Fathollahzadeh, Mariam El Mezouar, Hao Li, Ying Zou, Ahmed E. Hassan
IEEE Transactions on Software Engineering, vol. 52, no. 4, pp. 1391-1406. DOI: 10.1109/TSE.2026.3663599. Published: 2026-04-01.

Abstract: In software engineering chatrooms, communication is often hindered by imprecise questions that cannot be answered. Recognizing key entities (e.g., programming languages and libraries) and user intent (e.g., learning or requesting a review) can be essential for improving question clarity and facilitating better exchanges. However, existing research using natural language processing techniques often overlooks these software-specific nuances. In this paper, we introduce SENIR (SoftwarE-specific Named entity recognition, Intent detection, and Resolution classification), a labelling approach that leverages a Large Language Model to annotate entities, intents, and resolution status in developer chatroom conversations. To offer quantitative guidance for improving question clarity and resolvability, we build a resolution prediction model that leverages SENIR's entity and intent labels along with additional predictive features. We evaluate SENIR on the DISCO dataset using a subset of annotated chatroom dialogues. SENIR achieves an 86% F-score for entity recognition, a 71% F-score for intent detection, and an 89% F-score for resolution status classification. Furthermore, our resolution prediction model, tested with various sampling strategies (random undersampling and oversampling with SMOTE) and evaluation methods (5-fold cross-validation, 10-fold cross-validation, and bootstrapping), demonstrates AUC values ranging from 0.7 to 0.8. Key factors influencing resolution include positive sentiment and entities such as Programming Language and User Variable across multiple intents, while diagnostic entities (e.g., Error Name) are more relevant in error-related questions. Moreover, resolution rates vary significantly by intent: questions about API Usage and API Change achieve higher resolution rates, whereas Discrepancy and Review have lower resolution rates. A chi-square analysis confirms the statistical significance of these differences.

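The chi-square analysis the abstract closes with can be illustrated with a small self-contained sketch. The intent labels match those named in the abstract, but the counts below are invented for demonstration; the paper's actual data is not shown here.

```python
# Hypothetical contingency table: question intent vs. resolution status.
# Counts are made up for illustration, not taken from the paper.
table = {
    "API Usage":   (80, 20),   # (resolved, unresolved)
    "API Change":  (70, 30),
    "Discrepancy": (40, 60),
    "Review":      (35, 65),
}

def chi_square(table):
    """Pearson chi-square statistic for a dict of row tuples."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

stat = chi_square(table)
df = (len(table) - 1) * 1            # (rows - 1) * (cols - 1)
CRITICAL_0_05_DF3 = 7.815            # chi-square critical value, alpha=0.05, df=3
print(f"chi2={stat:.2f}, df={df}, significant={stat > CRITICAL_0_05_DF3}")
```

With these illustrative counts the statistic far exceeds the critical value, mirroring the kind of significant intent-dependent difference the abstract reports.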
"Are Decoder-Only Large Language Models the Silver Bullet for Code Search?"
Yuxuan Chen, Mingwei Liu, Guangsheng Ou, Anji Li, Dekun Dai, Yanlin Wang, Zibin Zheng
IEEE Transactions on Software Engineering, vol. 52, no. 4, pp. 1215-1233. DOI: 10.1109/TSE.2026.3657353. Published: 2026-04-01.

Abstract: Code search is essential for code reuse, allowing developers to efficiently locate relevant code snippets. The advent of powerful decoder-only Large Language Models (LLMs) has revolutionized many code intelligence tasks. However, their effectiveness for the retrieval-based task of code search, particularly compared to established encoder-based models, remains underexplored. This paper addresses this gap by presenting a large-scale systematic evaluation of eleven decoder-only LLMs, analyzing their performance across zero-shot and fine-tuned settings. Our results show that fine-tuned decoder-only models, particularly CodeGemma, significantly outperform encoder-only models like UniXcoder, achieving a 40.4% higher Mean Average Precision (MAP) on the CoSQA+ benchmark. Our analysis further reveals two crucial nuances for practitioners: first, the relationship between model size and performance is non-monotonic, with mid-sized models often outperforming larger variants; second, the composition of the training data is critical, as a multilingual dataset enhances generalization while a small amount of data from a specific language can act as noise and interfere with model effectiveness. These findings offer a comprehensive guide to selecting and optimizing modern LLMs for code search.

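Mean Average Precision, the metric the abstract reports, is easy to state precisely in code. This is a minimal sketch of the standard definition; the rankings and relevance sets below are invented, not drawn from the CoSQA+ benchmark.

```python
# Standard Mean Average Precision (MAP) over a set of queries.
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k taken at each relevant hit."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(results):
    """results: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)

# Invented example: two queries with hand-picked rankings.
queries = [
    (["a", "b", "c", "d"], {"a", "c"}),   # relevant docs hit at ranks 1 and 3
    (["x", "y", "z"],      {"y"}),        # relevant doc hit at rank 2
]
print(mean_average_precision(queries))    # (AP=0.8333 + AP=0.5) / 2
```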
"Diagnosing Violations of State-Based Specifications in iCFTL"
Cristina Stratan, Claudio Mandrioli, Domenico Bianculli
IEEE Transactions on Software Engineering, vol. 52, no. 4, pp. 1495-1514. DOI: 10.1109/TSE.2026.3667445. Published: 2026-04-01. Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11408937

Abstract: As modern software systems grow in complexity and operate in dynamic environments, runtime analysis techniques become an increasingly critical part of the verification and validation process. Runtime verification monitors the runtime system behaviour by checking whether an execution trace (a sequence of recorded events) satisfies a given specification, yielding a Boolean or quantitative verdict. However, when a specification is violated, such a verdict is often insufficient to understand why the violation happened. To fill this gap, diagnostics approaches aim to produce more informative verdicts. In this paper, we address the problem of generating informative verdicts for violated Inter-procedural Control-Flow Temporal Logic (iCFTL) specifications that express constraints over program variable values. We propose a diagnostic approach based on backward data-flow analysis to statically determine the relevant statements contributing to the specification violation. Using this analysis, we instrument the program to produce enriched execution traces. Using the enriched execution traces, we perform the runtime analysis and identify the statements whose execution led to the specification violation. We implemented our approach in a prototype tool, iCFTLdiagnostics, and evaluated it on 112 specifications across 10 software projects. Our tool achieves 90% precision in identifying relevant statements for 100 of the 112 specifications. It reduces the number of lines that have to be inspected for diagnosing a violation by at least 90%. In terms of computational cost, our experiments show that iCFTLdiagnostics generates a diagnosis within 7 minutes and requires no more than 25 MB of memory. The instrumentation required to support diagnostics incurs an execution time overhead of less than 30% and a memory overhead below 20%.

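The backward data-flow step the abstract describes can be sketched as a toy backward slice: starting from the variable constrained by the violated specification, walk the program backwards and collect every statement whose definition can reach it. The four-statement program and variable names below are invented for illustration and are unrelated to iCFTLdiagnostics itself.

```python
# Toy program: each statement defines one variable and uses a set of others.
program = [
    ("s1", {"def": "a", "use": set()}),
    ("s2", {"def": "b", "use": {"a"}}),
    ("s3", {"def": "c", "use": set()}),   # irrelevant to the violation
    ("s4", {"def": "x", "use": {"b"}}),   # the specification constrains x
]

def backward_slice(program, target_var):
    """Collect statements whose definitions transitively feed target_var."""
    relevant, wanted = [], {target_var}
    for stmt_id, info in reversed(program):
        if info["def"] in wanted:
            relevant.append(stmt_id)
            wanted |= info["use"]          # definitions feeding this statement
    return list(reversed(relevant))

print(backward_slice(program, "x"))        # s3 is correctly excluded
```

The slice keeps s1, s2, and s4 but drops s3, which is the kind of inspection-effort reduction (here, one statement out of four) that the abstract quantifies at 90% or more on real projects.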
"A Multivocal Literature Review on the Effectiveness of Security Threat Modeling"
Anh-Duy Tran, Stef Verreydt, Koen Yskout, Wouter Joosen
IEEE Transactions on Software Engineering, vol. 52, no. 4, pp. 1352-1370. DOI: 10.1109/TSE.2026.3662996. Published: 2026-04-01.

Abstract: The growing need for integrating security throughout the software development lifecycle leads to the adoption of various security activities. Threat modeling is widely recognized as a process that helps assess security issues, especially architectural flaws due to insecure design, thereby supporting the security-by-design mindset. While many research and industry sources advocate for threat modeling, others highlight issues such as the lack of motivation, its time-consuming nature, and practical difficulties, leading to questions about its overall effectiveness. In this study, we conduct a comprehensive multivocal literature review to systematically examine the empirical evidence for the effectiveness of threat modeling. In short, by analyzing 109 sources from both white and gray literature, we did not encounter any direct, causal evidence (e.g., a controlled experiment) for the effectiveness of threat modeling as a technique to improve the security of a software application. This absence of causal evidence should not be interpreted as evidence that threat modeling is ineffective, though. The existing literature does describe several benefits and challenges related to threat modeling, as well as suggestions for improving the effectiveness of threat modeling activities. Studies on threat modeling often concentrate on benefits such as improved performance, effectiveness, efficiency, and usability of specific tools and methods. Recurring challenges, on the other hand, include a perceived lack of benefits, tool limitations, usability issues, and difficulties integrating threat modeling into the secure software development lifecycle. Suggestions for improvements include providing clear checklists or guidance, defining a clear scope, and involving different stakeholders during threat modeling activities. Based on this review of the literature, researchers are invited to conduct rigorous empirical studies to address the underexplored aspects of threat modeling, thereby strengthening its evidence base and increasing its impact in the real world.

"Adaptive Spectral Clustering and Structural Alignment for Cross-Project Defect Prediction"
Dipender Singh, Sandeep Kumar
IEEE Transactions on Software Engineering, vol. 52, no. 4, pp. 1478-1494. DOI: 10.1109/TSE.2026.3667334. Published: 2026-04-01.

Abstract: Cross-project defect prediction (CPDP) aims to identify defect-prone modules in a target project by leveraging data from external source projects. Over time, research has shifted from single-source to multi-source CPDP to increase data diversity and capture broader defect patterns. However, naively combining multiple sources can introduce substantial data divergence and degrade performance. A key limitation is that many methods overlook intra-project heterogeneity, treating each source project as uniform. Moreover, data selection typically relies on independent metric matching, which ignores structural relationships among software metrics. To address these limitations, we propose adaptive spectral clustering and structural alignment (ASCSA), a unified framework combining the two techniques. First, adaptive spectral clustering partitions each source project into coherent clusters, mitigating intra-project heterogeneity. Second, a structural alignment method selects source clusters that preserve higher-order metric relationships with the target, avoiding misleading matches based solely on marginal distributions. Finally, a deep neural network with maximum mean discrepancy (MMD) loss minimizes residual distribution gaps to enable effective knowledge transfer. In this study, we conduct comprehensive experiments on 25 widely used software projects. The results show that ASCSA consistently outperforms state-of-the-art methods, achieving improvements in AUC ranging from 5.26% to 23.08%, and in MCC from 15.62% to 68.18%. These findings highlight the effectiveness of jointly addressing intra-project heterogeneity and structural alignment for reliable cross-project defect prediction.

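The MMD loss the abstract mentions measures the distance between two sample distributions in a kernel feature space. A minimal sketch of the squared MMD with an RBF kernel follows; the one-dimensional sample values are invented and stand in for real defect-prediction features, and the gamma parameter is an assumption.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF (Gaussian) kernel between two scalar feature values."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimator of squared maximum mean discrepancy."""
    exx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (len(xs) ** 2)
    eyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (len(ys) ** 2)
    exy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return exx + eyy - 2 * exy

source = [0.0, 0.1, 0.2]
near   = [0.05, 0.15, 0.25]   # similar distribution -> small MMD
far    = [3.0, 3.1, 3.2]      # shifted distribution -> large MMD
print(mmd_squared(source, near), mmd_squared(source, far))
```

Minimizing this quantity as a training loss, as ASCSA reportedly does, pulls the source and target feature distributions together.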
"CodeS+: Towards Assessing the Generalization Ability of Code Models Under Distribution Shift"
Ziyue Shi, Junjie Wang, Yuejun Guo, Xiaofei Xie, Qiang Hu, Maxime Cordy, Sen Chen, Mike Papadakis, Yves Le Traon, Yongqiang Lyu
IEEE Transactions on Software Engineering, vol. 52, no. 4, pp. 1515-1530. DOI: 10.1109/TSE.2026.3668096. Published: 2026-04-01.

Abstract: Distribution shift poses a significant challenge for deep learning (DL) models in source code analysis, where test data often follows different distributions from training data, leading to unexpected performance degradation and hindering the practical usage of code models. While our previous work CodeS introduced the first benchmark for studying distribution shift in source code analysis, it has limitations in covering more fine-grained types of real-world distribution shifts and lacks a study of the effectiveness of shift mitigation strategies. In this paper, we present CodeS+, an enhanced benchmark that addresses these limitations through two key contributions: (1) expanded shift types, that is, more fine-grained distribution shifts introduced by differences in program element complexity (e.g., the number of nodes in control flow graphs); and (2) an investigation of the usefulness of fine-tuning-based shift mitigation techniques, such as Core-Set. Comprehensive experiments on different pre-trained code models demonstrated that code models significantly suffer from distribution shift, out-of-distribution detectors from other domains (e.g., computer vision) do not generalize to source code, and existing fine-tuning-based shift mitigation techniques have limited benefits in enhancing the generalization ability of code models. Our findings highlight the need to pay more attention to OOD issues for code models.

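One detector family in the category the abstract describes as "out-of-distribution detectors from other domains" is the classic maximum-softmax-probability (MSP) baseline from computer vision. The sketch below only shows the mechanism; the logits and the threshold are invented, and nothing here implies MSP is the specific detector CodeS+ evaluates.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Higher score -> model is confident -> sample treated as in-distribution."""
    return max(softmax(logits))

in_dist  = [6.0, 0.5, 0.2]   # peaked logits: confident prediction
out_dist = [1.1, 1.0, 0.9]   # flat logits: uncertain prediction

threshold = 0.6              # in practice tuned on held-out data
print(msp_score(in_dist) > threshold, msp_score(out_dist) > threshold)
```

The benchmark's finding is precisely that confidence-based detectors like this, which work reasonably well on images, fail to separate in- and out-of-distribution source code.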
"On the Need to Rethink Trust in AI Assistants for Software Development: A Critical Review"
Sebastian Baltes, Timo Speith, Brenda Chiteri, Seyedmoein Mohsenimofidi, Shalini Chakraborty, Daniel Buschek
IEEE Transactions on Software Engineering, vol. 52, no. 4, pp. 1265-1281. DOI: 10.1109/TSE.2026.3659804. Published: 2026-04-01.

Abstract: Trust is a fundamental concept in human decision-making and collaboration that has long been studied in philosophy and psychology. However, software engineering (SE) articles often use the term trust informally; providing an explicit definition or embedding results in established trust models is rare. In SE research on AI assistants, this practice culminates in equating trust with the likelihood of accepting generated content, which, in isolation, does not capture the full conceptual complexity of trust. Without a common definition, true secondary research on trust is impossible. The objectives of our research were: (1) to present the psychological and philosophical foundations of human trust, (2) to systematically study how trust is conceptualized in SE and the related disciplines human-computer interaction and information systems, and (3) to discuss limitations of equating trust with content acceptance, outlining how SE research can adopt existing trust models to overcome the widespread informal use of the term trust. We conducted a literature review across disciplines and a critical review of recent SE articles with a focus on trust conceptualizations. We found that trust is rarely defined or conceptualized in SE articles. Related disciplines commonly embed their methodology and results in established trust models, clearly distinguishing, for example, between initial trust and trust formation and between appropriate and inappropriate trust. On a meta-scientific level, other disciplines even discuss whether and when trust can be applied to AI assistants at all. Our study reveals a significant maturity gap in trust research in SE compared to other disciplines. We provide concrete recommendations on how SE researchers can adopt established trust models and instruments to study trust in AI assistants beyond the acceptance of generated software artifacts.