{"title":"Experimental investigation of memory-related software aging in LLM systems","authors":"César Santos , Fumio Machida , Ermeson Andrade","doi":"10.1016/j.jss.2025.112653","DOIUrl":"10.1016/j.jss.2025.112653","url":null,"abstract":"<div><div>Large Language Models (LLMs) have been increasingly adopted in a wide range of applications, many of which require long-running inference processes. However, these systems may be subject to software aging phenomena, leading to progressive performance degradation and potential failures. In this work, we experimentally investigate memory-related software aging in LLM inference. We performed 48-hour experiments with three open-source models (Pythia, OPT, and GPT-Neo) under low, medium, and high workloads, monitoring memory consumption at both system and process levels. Using the Mann–Kendall test and Sen’s slope estimator, we observed monotonic growth in RAM usage across all models on Central Processing Units (CPUs), with OPT presenting the steepest slopes. Process-level analysis further revealed that LLM processes were the primary contributors to memory growth, along with background services. Additionally, we conducted identical experiments on Graphics Processing Units (GPUs). Unlike the experiments without a GPU, GPU-based experiments revealed bounded oscillations and abrupt resets likely due to driver-level memory management, while host RAM and process-level monitoring still revealed clear symptoms of aging. 
These findings demonstrate that software aging manifests differently across execution environments, reinforcing the need for environment-specific monitoring approaches.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112653"},"PeriodicalIF":4.1,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145266431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel framework for automated testing of ill-defined human–robot interaction environments","authors":"Aitor Aguirre-Ortuzar , Íñigo Elguea-Aguinaco , Nestor Arana-Arexolaleiba , Leire Etxeberria-Elorza , Joseba A. Agirre-Bastegieta","doi":"10.1016/j.jss.2025.112654","DOIUrl":"10.1016/j.jss.2025.112654","url":null,"abstract":"<div><div>As automated systems advance in complexity, comprehensive testing becomes crucial, particularly for human–robot interaction (HRI) environments where human unpredictability creates ill-defined testing domains that challenge conventional software testing approaches. In interactive robotics, evaluation criteria extend beyond performance to include critical safety considerations. This paper introduces a novel automated testing framework combining runtime monitoring with constraint-based techniques for HRI environments. The framework employs a three-level cognitive oracle architecture – observation, interpretation, and diagnosis – that automatically evaluates the correctness of human and robot actions without requiring expert human intervention. The approach uses constraint-based modeling to handle the non-deterministic nature of HRI scenarios while ensuring safety compliance. 
Validation through five test cases in a refrigerator disassembly simulation demonstrates the framework’s effectiveness in detecting safety violations and procedural errors under environmental uncertainties.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112654"},"PeriodicalIF":4.1,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145266430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automating software size measurement with language models: Insights from industrial case studies","authors":"Hüseyin Ünlü, Samet Tenekeci, Dhia Eddine Kennouche, Onur Demirörs","doi":"10.1016/j.jss.2025.112638","DOIUrl":"10.1016/j.jss.2025.112638","url":null,"abstract":"<div><div>Objective software size measurement is critical for accurate effort estimation, yet many organizations avoid it due to high costs, required expertise, and time-consuming manual effort. This often leads to vague predictions, poor planning, and project overruns. To address this challenge, we investigate the use of pre-trained language models — BERT and SE-BERT — to automate size measurement based on textual requirements using COSMIC and MicroM methods. We constructed one heterogeneous dataset and two industrial datasets, each manually measured by experienced analysts. Models were evaluated in three settings: (i) generic model evaluation, where the models are trained and tested on heterogeneous data, (ii) internal evaluation, where the models are trained and tested on organization-specific data, and (iii) external evaluation, where generic models were tested on organization-specific data. Results show that organization-specific models significantly outperform generic models, indicating that aligning training data with the target organization’s requirement style is critical for accuracy. SE-BERT, a domain-adapted variant of BERT, improves performance, particularly in low-resource settings. 
These findings highlight the practical potential of tailoring training data for broader adoption and cost-effective software size measurement in industrial contexts.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112638"},"PeriodicalIF":4.1,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145266436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing emotion detection in software engineering using a residual multi-embedding fusion network","authors":"Rim Mahouachi","doi":"10.1016/j.jss.2025.112651","DOIUrl":"10.1016/j.jss.2025.112651","url":null,"abstract":"<div><div>Emotions play a crucial role in the development of software, particularly in team dynamics, productivity, and decision making. Developer communications — such as bug reports, code reviews, and online discussions — often include emotional signals. But part of the difficulty in identifying these feelings lies in the technicality and informality of the words, and in the utter scarcity of even critical but rare emotions like fear and surprise. This study aims to improve the detection of both common and minority emotions in software engineering texts, with a focus on better identifying underrepresented classes. We introduce R-MEFN, Residual Multi-Embedding Fusion Network, a network model that employs multiple types of contextual word embeddings to represent the text. Residual connections serve to keep signals of subtle emotionality, especially ones associated with emotions that are infrequent. Cross-validation is performed to choose the best combination of embeddings to be fused. We evaluate R-MEFN on two real-world datasets (StackOverflow and Jira), comparing it to other prior approaches on these benchmarks, as well as to single-embedding and combined-embedding baselines. R-MEFN outperforms other methods that have been evaluated in the same benchmarks for multilabel emotion detection, showing particular improvements on rare classes while keeping a good performance on frequent emotions. Also, it outperforms all single-embedding baselines, as well as all combined-embedding baselines, where embeddings from multiple sources are simply concatenated, showing the strength of the fusion approach. 
The cross-validated integration of contextual embeddings allows R-MEFN to produce more balanced and expressive representations across all emotion categories. These findings show the effectiveness of using multiple contextual embeddings and residual learning for addressing class imbalance in emotion detection. We see R-MEFN as a useful starting point towards creating emotion-aware tools that allow software teams to track emotional dynamics in a project and identify hidden risks.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112651"},"PeriodicalIF":4.1,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145266435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"My feature model has changed... What should I do with my tests?","authors":"Andrea Bombarda, Silvia Bonfanti, Angelo Gargantini","doi":"10.1016/j.jss.2025.112645","DOIUrl":"10.1016/j.jss.2025.112645","url":null,"abstract":"<div><div>Software Product Lines (SPLs) evolve over time, driven by changing requirements and advancements in technology. While much research has been dedicated to the evolution of feature models (FMs), less focus has been put on how associated artifacts, such as test cases, should adapt to these changes. Test cases, derived as valid products from an FM, play a critical role in ensuring the correctness of an SPL. However, when an FM evolves, the original test suite may become outdated, requiring either regeneration from scratch or repair of existing test cases to align with the updated FM. In this paper, we address the challenge of evolving test suites upon FM evolution. We introduce novel definitions of test suite dissimilarity and specificity. We use these metrics to evaluate three test generation strategies: GFS (generating a new suite from scratch), GFE (repairing and reusing an existing suite), and SPECGEN (maximizing specific tests for the FM evolution). Additionally, we introduce a set of mutations to simulate FM evolution and obtain additional FMs. By using mutants, we conduct our analyses and evaluate the mutation score of test generation strategies. 
Our experiments, conducted on a set of FMs taken from the literature and on more than 3,200 FMs artificially generated with mutations, reveal that GFE often produces the smallest test suites with high mutation scores, while SPECGEN excels in specificity, particularly for mutations expanding the set of valid products.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112645"},"PeriodicalIF":4.1,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145266433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A case study of gender and online team communication in Software Engineering Education","authors":"Rita Garcia , Christoph Treude","doi":"10.1016/j.jss.2025.112644","DOIUrl":"10.1016/j.jss.2025.112644","url":null,"abstract":"<div><div>Collaboration is crucial in Software Engineering (SE), yet factors like gender bias can shape team dynamics and behaviours. This descriptive case study examines an eight-week project involving 39 SE students across eight teams contributing to GitHub projects. Focusing on gender, we used a mixed-methods approach to analyse Slack communications, identifying gender differences in how students respond to initiated communications and comparing how students’ communications influenced other aspects of students’ performance, including learning gains. We found higher help-seeking and leadership behaviours in the all-woman team involved in this case study, while men responded more slowly. Although communication did not directly affect final grades, we identified statistical significance in the correlation between communication and students’ understanding of software development. With this case study showing that some students put more effort into collaboration, future work can investigate diversity and inclusion training to balance these efforts. In addition, we observed a link between team engagement and a higher understanding of software development, highlighting the potential for teaching strategies that promote help-seeking. 
These findings could guide future research by integrating intersectionality to address the challenges that SE students face when using communication platforms, thereby fostering more equitable collaboration in SE Education.</div><div><em>Editor’s note: Open Science material was validated by the Journal of Systems and Software Open Science Board</em>.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112644"},"PeriodicalIF":4.1,"publicationDate":"2025-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145266434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BugMentor: Generating answers to follow-up questions from software bug reports using structured information retrieval and neural text generation","authors":"Usmi Mukherjee, Mohammad Masudur Rahman","doi":"10.1016/j.jss.2025.112636","DOIUrl":"10.1016/j.jss.2025.112636","url":null,"abstract":"<div><div>Software bug reports often lack crucial information (e.g., steps to reproduce), which makes bug resolution challenging. Developers thus ask follow-up questions to capture additional information. However, according to existing evidence, bug reporters often face difficulties answering them, which leads to the premature closing of bug reports without any resolution. Recent studies suggest follow-up questions to support the developers, but answering the follow-up questions still remains a major challenge. In this paper, we propose BugMentor, a novel approach that combines structured information retrieval and neural text generation (e.g., Mistral) to generate appropriate answers to the follow-up questions. Our technique identifies the past relevant bug reports to a given bug report, captures contextual information, and then leverages it to generate the answers. We evaluate our generated answers against the ground truth answers using four appropriate metrics, including BLEU Score and Semantic Similarity. We achieve a BLEU Score of up to 72 and Semantic Similarity of up to 92 indicating that our technique can generate understandable and good answers to the follow-up questions according to Google’s AutoML Translation documentation. Our technique also outperforms four existing baselines with a statistically significant margin. 
We also conduct a developer study involving 23 participants where the answers from our technique were found to be more accurate, more precise, more concise and more useful.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112636"},"PeriodicalIF":4.1,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145266432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"“Digital Camouflage”: The LLVM challenge in LLM-based malware detection","authors":"Ekin Böke , Simon Torka","doi":"10.1016/j.jss.2025.112646","DOIUrl":"10.1016/j.jss.2025.112646","url":null,"abstract":"<div><div>Large Language Models (LLMs) have emerged as promising tools for malware detection by analyzing code semantics, identifying vulnerabilities, and adapting to evolving threats. However, their reliability under adversarial compiler-level obfuscation is yet to be discovered. In this study, we empirically evaluate the robustness of three state-of-the-art LLMs: ChatGPT-4o, Gemini Flash 2.5, and Claude Sonnet 4 against compiler-level obfuscation techniques implemented via the LLVM infrastructure. These include control flow flattening, bogus control flow injection, instruction substitution, and split basic blocks, which are widely used to evade detection while preserving malicious behavior. We perform a structured evaluation on 40 C functions (20 vulnerable, 20 secure) sourced from the Devign dataset and obfuscated using LLVM passes. Our results show that these models often fail to correctly classify obfuscated code, with precision, recall, and F1-score dropping significantly after transformation. This reveals a critical limitation: LLMs, despite their language understanding capabilities, can be easily misled by compiler-based obfuscation strategies. To promote reproducibility, we release all evaluation scripts, prompts, and obfuscated code samples in a public repository. 
We also discuss the implications of these findings for adversarial threat modeling, and outline future directions such as software watermarking, compiler-aware defenses, and obfuscation-resilient model design.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112646"},"PeriodicalIF":4.1,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145219801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling the composition of analysis components and automatic constraint checking for semantic soundness","authors":"Bahareh Taghavi , Sebastian Weber , Adrian Marin , Bernhard Rumpe , Sebastian Stüber , Jörg Henß , Thomas Weber , Robert Heinrich","doi":"10.1016/j.jss.2025.112637","DOIUrl":"10.1016/j.jss.2025.112637","url":null,"abstract":"<div><div>Component-based software architecture enables software architects to design complex systems by composing components that interact through well-defined, syntactically specified interfaces. A special kind of component we investigated in our previous work is the analysis component. Analysis components support the evaluation and prediction of a system’s functional and non-functional properties. Evaluating these properties early in the development process helps optimize system performance and ensure compliance with requirements. While approaches for modeling and analyzing such systems, such as the Palladio approach, support syntactic validation of the composition, they often lack mechanisms to ensure the semantic soundness of compositions. In this paper, we present a model transformation approach to help architects ensure that system models are semantically sound and behave as expected. This approach enables the transformation of Palladio models into MontiArc models, allowing architects to enrich their system representations with semantic constraints and validate these constraints with the MontiArc workbench. This ensures that component interactions are consistent with both structural composition and intended semantics. We evaluate our approach through two different case studies. From these case studies, we derived several scenarios with varying constraints and states to assess the accuracy and performance of our approach. To evaluate accuracy, we examined our approach’s ability to check semantic constraints and detect violations. We observed high accuracy across the case studies. 
For performance, we analyzed time complexity across different constraint types. The approach performed well on arithmetic constraints, with its effectiveness decreasing on more complex string-centered constraints.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112637"},"PeriodicalIF":4.1,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145219805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The life of software features: An exploratory case study of 189 feature requests in Marlin","authors":"Aron van der Hofstad , Loek Cleophas , Clemens Dubslaff , Jacob Krüger","doi":"10.1016/j.jss.2025.112647","DOIUrl":"10.1016/j.jss.2025.112647","url":null,"abstract":"<div><div>Features are a widely established notion to organize the functionalities of a software system. For instance, features are used to define variability and commonalities in product lines; feature-driven development is an agile development methodology; and social-coding platforms have explicit support for feature requests. Despite the importance of features, we are not aware of extensive research on their life cycles: how and for what reasons do developers evolve features? As a result, we lack an understanding of how features come to be, how they are evolved, or why they may be removed. To narrow this research gap, we have performed an exploratory case study on the evolution of 189 feature requests of the Marlin 3D-printer firmware. We identified the code introducing a feature and traced all commits touching that code or the feature, resulting in a collection of 1,940 unique commits spanning five years of evolution. We have manually inspected all of these commits to classify their intentions with respect to the features they change, and created process graphs of the features’ life cycles based on these intentions to understand the evolution of features. Our results contribute a first overview and detailed examples of evolving features beyond code metrics, showcasing that features are primarily refactored, exhibit interdependent evolution, and are rarely removed. 
Serving as a starting point, these contributions can support practitioners in managing features and guide researchers in understanding feature evolution as well as in scoping future studies on this matter.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112647"},"PeriodicalIF":4.1,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}