{"title":"MLOps pipeline generation for reinforcement learning: A low-code approach using large language models","authors":"Stephen John Warnett , Evangelos Ntentos , Uwe Zdun","doi":"10.1016/j.jss.2025.112760","DOIUrl":"10.1016/j.jss.2025.112760","url":null,"abstract":"<div><div>MLOps (Machine Learning Operations) and its application to Reinforcement Learning (RL) involve various challenges when integrating Machine Learning and RL models into production systems, entailing considerable expertise and manual effort, which can be error-prone and obstruct scalability and rapid deployment. We propose a new approach to address these challenges in generating MLOps pipelines. We present a low-code, template-based approach leveraging Large Language Models (LLMs) to automate RL pipeline generation, validation and deployment. In our approach, the Pipes and Filters pattern allows for the fine-grained generation of MLOps pipeline configuration files. Built-in error detection and correction help maintain high-quality output standards.</div><div>To empirically evaluate our solution, we assess the correctness of pipelines generated with seven LLMs for three open-source RL projects. Our initial approach achieved an average error rate of 0.187 across all seven LLMs. OpenAI GPT-4o performed the best with an error rate of just 0.09, followed by Qwen2.5 Coder with an error rate of 0.15. We implemented a single round of improvements to our implementation and low-code template. We reevaluated our solution on the best-performing LLM from the initial evaluation, achieving perfect results with an overall error rate of zero for OpenAI GPT-4o. Our findings indicate that pipelines generated by our approach have low error rates, potentially enabling rapid scaling and deployment of reliable MLOps for RL pipelines, particularly for practitioners lacking advanced software engineering or DevOps skills. 
Our approach contributes towards demonstrating increased reliability and trustworthiness in LLM-based solutions, despite the uncertainty hitherto associated with LLMs.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112760"},"PeriodicalIF":4.1,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
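The Pipes and Filters pattern mentioned in the abstract can be pictured as a chain of small, composable transformation steps with error checks between stages. This is a minimal, hypothetical sketch of the pattern itself, not the authors' tool; the stage names and the `PipelineConfig` structure are invented for illustration.

```python
# Hypothetical sketch of a Pipes and Filters chain assembling an MLOps
# pipeline configuration, with built-in error detection between stages.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PipelineConfig:
    stages: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)

Filter = Callable[[PipelineConfig], PipelineConfig]

def make_stage_filter(stage: str) -> Filter:
    """Each filter appends one pipeline stage and checks a simple invariant."""
    def f(cfg: PipelineConfig) -> PipelineConfig:
        if stage in cfg.stages:
            cfg.errors.append(f"duplicate stage: {stage}")
        else:
            cfg.stages.append(stage)
        return cfg
    return f

def run_pipeline(filters: List[Filter]) -> PipelineConfig:
    """Pass the config through each filter in turn, collecting errors."""
    cfg = PipelineConfig()
    for f in filters:
        cfg = f(cfg)
    return cfg

filters = [make_stage_filter(s) for s in ("train", "evaluate", "deploy")]
result = run_pipeline(filters)
```

Because each filter validates before transforming, an LLM-generated stage list can be checked incrementally rather than only once at the end.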
{"title":"Test case specification techniques and system testing tools in the automotive industry: A review","authors":"Denesa Zyberaj , Pascal Hirmer , Marco Aiello , Stefan Wagner","doi":"10.1016/j.jss.2025.112764","DOIUrl":"10.1016/j.jss.2025.112764","url":null,"abstract":"<div><div>The automotive domain is shifting to software-centric development to meet regulation, market pressure, and feature velocity. This shift increases embedded systems’ complexity and strains testing capacity. Despite relevant standards, a coherent system-testing methodology that spans heterogeneous, legacy-constrained toolchains remains elusive, and practice often depends on individual expertise rather than a systematic strategy. We derive challenges and requirements from a systematic literature review (SLR), complemented by industry experience and practice. We map them to test case specification techniques and testing tools, evaluating their suitability for automotive testing using PRISMA. Our contribution is a curated catalog that supports technique/tool selection and can inform future testing frameworks and improvements. We synthesize nine recurring challenge areas across the life cycle, such as requirements quality and traceability, variability management, and toolchain fragmentation. 
We then provide a prioritized criteria catalog that recommends model-based planning, interoperable and traceable toolchains, requirements uplift, pragmatic automation and virtualization, targeted AI and formal methods, actionable metrics, and lightweight organizational practices.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112764"},"PeriodicalIF":4.1,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A self-sustainable service assembly for decentralized computing environments","authors":"Mauro Caporuscio , Mirko D’Angelo , Vincenzo Grassi , Raffaela Mirandola , Francesca Ricci","doi":"10.1016/j.jss.2025.112755","DOIUrl":"10.1016/j.jss.2025.112755","url":null,"abstract":"<div><div>The landscape of modern computing systems is shifting towards architectures built by combining available services under the “everything as a service” paradigm. These architectures are deployed on distributed cloud-edge infrastructures, aiming to provide innovative services to a wide range of users. However, it is crucial for these systems to address environmental sustainability concerns. This poses challenges in operating such systems in open, dynamic, and uncertain environments while minimizing their energy consumption. To tackle these challenges, we propose a decentralized service assembly approach that ensures the assembly is energetically self-sustainable by relying on locally harvested and stored energy. In our contribution, we introduce a general service selection template that enables the derivation of different selection policies. These policies guide the construction and maintenance of the service assembly. 
To evaluate their effectiveness in meeting the sustainability requirements, we conduct a comprehensive set of simulation experiments, providing valuable insights.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112755"},"PeriodicalIF":4.1,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
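The service selection template described above can be illustrated with a minimal sketch in which a node admits only providers whose estimated energy cost fits its locally stored budget and ranks them with a pluggable scoring policy. All names and numbers here are invented for illustration; the paper's actual template and derived policies may differ.

```python
# Hypothetical sketch of a service selection template: a node chooses, among
# candidate providers, one whose estimated energy cost stays within the
# node's locally harvested/stored energy budget. The scoring policy is
# pluggable, so different selection policies can be derived from one template.
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float, float]  # (provider_id, quality, energy_cost)

def select_service(candidates: List[Candidate],
                   budget: float,
                   score: Callable[[float, float], float]) -> Optional[str]:
    # Filter out choices that would overdraw the local energy budget.
    feasible = [c for c in candidates if c[2] <= budget]
    if not feasible:
        return None  # no energetically self-sustainable choice
    best = max(feasible, key=lambda c: score(c[1], c[2]))
    return best[0]

# One possible derived policy: quality delivered per unit of energy spent.
efficiency = lambda quality, energy: quality / energy

choice = select_service(
    [("a", 0.9, 5.0), ("b", 0.8, 2.0), ("c", 0.95, 9.0)],
    budget=6.0,
    score=efficiency,
)
```

Here provider "c" is excluded as unaffordable, and "b" wins on energy efficiency despite its lower raw quality.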
{"title":"AI in GUI-based testing: A survey of techniques, tools, and perceived advantages and limitations","authors":"Domenico Amalfitano , Riccardo Coppola , Damiano Distante , Filippo Ricca","doi":"10.1016/j.jss.2025.112751","DOIUrl":"10.1016/j.jss.2025.112751","url":null,"abstract":"<div><div><em>Background:</em> The adoption of Artificial Intelligence (AI) techniques in Software Testing (ST) has grown rapidly, particularly in response to the increasing complexity of modern systems. In GUI-based testing, AI is often cited as a promising means to automate repetitive tasks and improve testing efficiency. However, the actual use of AI in this domain remains underexplored through systematic empirical investigation.</div><div><em>Objective:</em> This study aims to analyze how AI is adopted in GUI-based testing, identifying the techniques and tools employed, the testing activities they support, and the perceived benefits and limitations.</div><div><em>Method:</em> We conducted a large-scale survey involving 107 participants from both academia and industry. The survey focuses on three core testing activities: test case definition, test oracle design, and test case optimization. It extends a prior study based on interviews with 45 industry practitioners.</div><div><em>Results:</em>Findings show that AI is primarily used to support test case definition, with techniques such as Natural Language Processing, Optimization, and Large Language Models (LLMs) being the most common. 
AI also provides support in test oracle design, where image processing and knowledge representation play key roles, and in test suite optimization, through the use of supervised learning, reinforcement learning, and search-based techniques.</div><div><em>Conclusion:</em> The paper identifies ongoing challenges and outlines future directions, including the need for transparent AI tools, guidelines for LLM integration, and the deployment of a continuously open survey to monitor trends in AI adoption over time.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112751"},"PeriodicalIF":4.1,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The pragmatics of hybridity: A grounded theory of method integration in software engineering projects","authors":"Godfried B Adaba","doi":"10.1016/j.jss.2026.112790","DOIUrl":"10.1016/j.jss.2026.112790","url":null,"abstract":"<div><div>Hybrid project management is becoming a dominant delivery mode in software engineering, yet the mechanisms through which organisations enact and sustain hybrid practices remain insufficiently theorised. Existing accounts often imply a linear or prescriptive integration of governance and agile methods, overlooking the negotiated, context-dependent nature of hybrid work. This study advances a process-based explanation of hybrid delivery by developing the grounded theory of contingent hybridity, derived through a Constructivist Grounded Theory (CGT) study within a multinational IT firm. Drawing on interviews, observations, and project artefacts, the findings show that hybridisation is not simply the coexistence of plan-driven project governance and agile routines, but an emergent socio-technical process shaped by practitioners’ interpretive work and situated adaptation. Four interdependent mechanisms structure this process: structural anchoring, through which governance frameworks provide stability and legitimacy; adaptive enactment, whereby agile practices are tailored and embedded within formal controls; boundary work, involving translators and hybrid ceremonies that reconcile divergent organisational logics; and role hybridisation, in which practitioners fluidly shift between control-oriented and delivery-focused responsibilities. The analysis demonstrates that hybrid practices vary across roles and project phases, with effective integration depending less on adherence to prescribed templates and more on ongoing, context-sensitive negotiation. 
These insights refine theoretical understandings of hybrid project management by moving beyond static typologies toward a dynamic, practice-centred perspective and offer actionable guidance for organisations seeking to balance agility and control in complex, regulated environments.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112790"},"PeriodicalIF":4.1,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146038340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ISRLNN: A software defect prediction method based on instance similarity reverse loss","authors":"Yu Tang , Ye Du , Jian-Bo Gao , Ang Li , Ming-Song Yang","doi":"10.1016/j.jss.2025.112766","DOIUrl":"10.1016/j.jss.2025.112766","url":null,"abstract":"<div><div>Software defect prediction is a crucial technique for ensuring software reliability. However, software defect datasets often exhibit complex feature dependencies and traditional feature engineering methods have limitations in capturing non-linear relationships between these features.As deep learning can effectively capture these complex relationships, they have the potential to overcome the shortcomings of traditional feature engineering techniques. In this paper, we propose the concept of instance image and transform the software defect prediction problem into an image classification task based on instance images, thus fully leveraging the feature extraction capabilities of deep learning. Additionally, to address the limitations of existing binary cross-entropy loss functions in classification models that they cannot account for instance importance differences, we also design an instance similarity reverse loss function. We first design a method to measure instance similarity and dynamically adjust the instance weights during loss calculation based on this similarity. Next, we use normalized instance similarity loss as the active loss in the active-passive loss framework. Finally, we construct a software defect prediction method based on the <u>I</u>nstance <u>S</u>imilarity <u>R</u>everse <u>L</u>oss (ISRL). 
The experimental results show that the proposed method improves performance by 5% to 8% compared to existing works.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112766"},"PeriodicalIF":4.1,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
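The abstract does not give the exact form of the instance similarity reverse loss, so the following is only a guessed sketch of the general idea: weight each instance's binary cross-entropy term by one minus its similarity to a class centroid, so that atypical instances contribute more. The function names and the choice of cosine similarity are assumptions, not the paper's definitions.

```python
# Guessed sketch of a similarity-weighted BCE loss (the paper's actual
# "instance similarity reverse loss" may be defined differently).
import math
from typing import List

def bce(p: float, y: int) -> float:
    """Standard binary cross-entropy for one instance."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similarity_weighted_loss(probs, labels, feats, centroid):
    """Weight each instance's BCE by (1 - similarity to a class centroid),
    so atypical instances contribute more (a guessed reading of the
    'reverse' in instance similarity reverse loss)."""
    weights = [1.0 - cosine_similarity(f, centroid) for f in feats]
    total_w = sum(weights) or 1.0
    return sum(w * bce(p, y)
               for w, p, y in zip(weights, probs, labels)) / total_w
```

An instance identical to the centroid gets weight zero, while an orthogonal (atypical) instance gets full weight.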
{"title":"Exploring challenges in test mocking: Developer questions and insights from StackOverflow","authors":"Mumtahina Ahmed , Md Nahidul Islam Opu , Chanchal Roy , Sujana Islam Suhi , Shaiful Chowdhury","doi":"10.1016/j.jss.2025.112748","DOIUrl":"10.1016/j.jss.2025.112748","url":null,"abstract":"<div><div>Mocking is a common unit testing technique that is used to simplify tests, reduce flakiness, and improve coverage by replacing real dependencies with simplified implementations. Despite its widespread use in Open Source Software (OSS) projects, there is limited understanding of how and why developers use mocks and the challenges they face. In this study, we have analyzed 25,302 questions related to <em>Mocking</em> on StackOverflow to identify the challenges faced by developers. We have used Latent Dirichlet Allocation (LDA) for topic modeling, identified 30 key topics, and grouped the topics into five key categories. Consequently, we analyzed the annual and relative probabilities of each category to understand the evolution of mocking-related discussions. Trend analysis reveals that categories such as <em>Mocking Techniques</em> and <em>External Services</em> have remained consistently dominant, highlighting evolving developer priorities and ongoing technical challenges. While the questions on <em>Theoretical</em> category declined after 2010, posts regarding <em>Error Handling</em> grew notably from 2009.</div><div>Our findings also show an inverse relationship between a topic’s popularity and its difficulty. Popular topics like <em>Framework Selection</em> tend to have lower difficulty and faster resolution times, while complex topics like <em>HTTP Requests and Responses</em> are more likely to remain unanswered and take longer to resolve. 
Additionally, we evaluated questions based on their answer status (successful, ordinary, or unsuccessful) and found that topics such as <em>Framework Selection</em> have higher success rates, whereas tool setup and Android-related issues are more often unresolved. A classification of questions into <em>How, Why, What</em>, and <em>Other</em> revealed that over 64% are <em>How</em> questions, particularly in practical domains like file access, APIs, and databases, indicating a strong need for implementation guidance. <em>Why</em> questions are more prevalent in error-handling contexts, reflecting conceptual challenges in debugging, while <em>What</em> questions are rare and mostly tied to theoretical discussions. These insights offer valuable guidance for improving developer support, tooling, and educational content in the context of mocking and unit testing.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112748"},"PeriodicalIF":4.1,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
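The How/Why/What/Other question classification mentioned in this abstract can be approximated with a simple keyword heuristic; the study's actual classification procedure is not described here, so this sketch is purely illustrative.

```python
# Hypothetical keyword heuristic for sorting StackOverflow question titles
# into the How / Why / What / Other categories discussed in the study.
# The paper's real classification method may be more sophisticated.
def classify_question(title: str) -> str:
    t = title.lower()
    for kind, cues in (("How", ("how ", "how?")),
                       ("Why", ("why ", "why?")),
                       ("What", ("what ", "what?"))):
        # Match the cue either at the start of the title or mid-sentence.
        if any(t.startswith(c) or f" {c}" in t for c in cues):
            return kind
    return "Other"
```

For example, "How to mock an HTTP response?" lands in How, while a title with no interrogative cue falls through to Other.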
{"title":"Mitigation strategies for confidentiality violations in software architecture using ranked feature importance","authors":"Nils Niehues , Sebastian Hahner , Robert Heinrich","doi":"10.1016/j.jss.2025.112761","DOIUrl":"10.1016/j.jss.2025.112761","url":null,"abstract":"<div><div>A quality attribute like confidentiality is critical to trustworthy software but unfortunately, very challenging to ensure. This is because modern software systems are complex and interconnected. Architecture-based confidentiality analysis enables the early detection of violations, helping to mitigate risks before deployment. However, uncertainty in software systems and their environments complicates precise and comprehensive architectural analysis. Additionally, the complexity of software models and the exponential growth of uncertainty scenarios pose significant challenges for automated mitigation, often leaving software architects to resolve confidentiality violations manually, a process that is both time-intensive and error-prone.</div><div>In this paper, we extend our machine-learning-based approach to mitigate confidentiality violations. Specifically, we introduce a novel mitigation strategy inspired by TCP Congestion Control, as well as a strategy that capitalizes on clustering techniques to dynamically adjust batch sizes. Our evaluation on three real-world software architectures demonstrates that our extended approach can mitigate confidentiality violations while outperforming the state-of-the-art. Whereas previously the upper limit was 60 times runtime reduction, now we achieve 2298 times reduction, with the median being an elevenfold reduction. 
Our statistical analysis confirms that the added TCP-inspired strategy is significantly cheaper than the state-of-the-art baseline (Friedman test <span><math><mrow><mi>p</mi><mo>=</mo><mo>.</mo><mn>025</mn></mrow></math></span> and Nemenyi post hoc test <span><math><mrow><mi>p</mi><mo>=</mo><mo>.</mo><mn>039</mn></mrow></math></span>), while also having a strong practical impact (Kendall’s W <span><math><mrow><mo>=</mo><mn>0.721</mn></mrow></math></span>). This extended work deepens our understanding of the nature of uncertainty and of the techniques best suited to mitigating the violations it causes. It takes us one step closer to designing more trustworthy systems.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112761"},"PeriodicalIF":4.1,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
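The Kendall's W effect size reported in this abstract measures agreement among raters' rankings and can be computed directly from rank sums via the standard formula W = 12S / (m²(n³ − n)). The rankings below are illustrative values, not the paper's data.

```python
# Kendall's coefficient of concordance W, computed from first principles.
def kendalls_w(rankings):
    """W for m raters each ranking the same n items (no ties).
    rankings: list of m lists, each a permutation of ranks 1..n.
    Returns a value in [0, 1]: 0 = no agreement, 1 = perfect agreement.
    """
    m = len(rankings)
    n = len(rankings[0])
    # Column sums: total rank each item received across raters.
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    # S is the sum of squared deviations of the rank sums from their mean.
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

Perfect agreement between two raters yields W = 1, and exactly opposite rankings yield W = 0; intermediate values such as the paper's 0.721 indicate strong but imperfect concordance.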
{"title":"Development of an automatic class diagram generator using an AI-based GRU classification model and 5W1H heuristic rules","authors":"Seungmo Jung, Woojin Lee","doi":"10.1016/j.jss.2026.112780","DOIUrl":"10.1016/j.jss.2026.112780","url":null,"abstract":"<div><div>In software development, software requirements and class diagrams are core components that are closely related to each other. Software requirements specify the system's functionality in natural language, while class diagrams are created using CASE tools to visually represent the system's structure and behavior based on these requirements. Although software requirements and class diagrams are complementary, ensuring consistency between them is challenging due to the ambiguity and vagueness inherent in natural language. To address this issue, research on automatically transforming natural language into class diagrams is actively being conducted; however, most of these studies focus on requirements written in English. In addition, existing research primarily emphasizes the grammatical structure of natural language requirements, which limits their ability to reflect the conceptual structures of specific domains. To overcome these limitations, this paper proposes a method for developing an automatic class diagram generator that utilizes AI-based GRU classification model and 5W1H-based heuristic rules. The proposed class diagram generator extracts element and class model information from software requirements written in Korean and visualizes class diagrams based on a model interface language. For elements that can be directly extracted from natural language requirements, 5W1H-based heuristic rules considering linguistic characteristics are applied, while domain-specific elements requiring domain knowledge are extracted using an AI-based GRU classification model. 
Furthermore, when comparing the class diagrams generated by the proposed tool with those manually created by developers, the tool demonstrated high performance in terms of precision, recall, and F1-score.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112780"},"PeriodicalIF":4.1,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
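The 5W1H-based heuristic rules can be pictured as a mapping from role-annotated requirement fragments to class-model elements. The role-to-element mapping below is a hypothetical simplification; the actual tool targets Korean requirements and uses much richer linguistic analysis.

```python
# Hypothetical sketch of 5W1H-style heuristic rules: a requirement is
# annotated with Who/What/How roles, and each rule maps a role to a
# candidate class-model element. The mapping itself is illustrative only.
ROLE_TO_ELEMENT = {
    "who":  "class",      # the acting entity becomes a candidate class
    "what": "operation",  # the action becomes a candidate operation
    "how":  "attribute",  # the means/detail becomes a candidate attribute
}

def apply_rules(annotated: dict) -> dict:
    """Map a role-annotated requirement to class-model fragments."""
    model = {"class": [], "operation": [], "attribute": []}
    for role, text in annotated.items():
        element = ROLE_TO_ELEMENT.get(role)
        if element:
            model[element].append(text)
    return model

fragment = apply_rules(
    {"who": "Librarian", "what": "registers a book", "how": "by ISBN"}
)
```

Fragments produced this way could then be merged across requirements into classes, operations, and attributes before diagram rendering.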
{"title":"Software refactoring research with large language models: A systematic literature review","authors":"Sofia Martinez , Luo Xu , Mariam Elnaggar, Eman Abdullah Alomar","doi":"10.1016/j.jss.2025.112762","DOIUrl":"10.1016/j.jss.2025.112762","url":null,"abstract":"<div><div>Background: Code refactoring is the improvement of code internally without changing the external functionalities of the program. Due to its exhaustive nature, developers often avoid manually refactoring code. Researchers have since looked into utilizing Large Language Models (LLMs) to automate the task of refactoring.</div><div>Aim and Method: Despite the promising results, there is a lack of clear understanding of LLMs’ effectiveness in automated refactoring. In order to address this issue, we conducted a Systematic Literature Review (SLR) of 50 primary studies. We categorized the studies into different refactoring methods studied, prompt engineering and techniques conducted, LLM tools used, languages used, and datasets used. We touched upon the benchmarks each studies had used, how accurate LLM-generated refactorings are, and the challenges that this field faces currently.</div><div>Result: From our literature review we found that: (i) There are various tools that different studies use to enhance and study LLM-driving refactoring, with tools that were used to detect code smells, generate code bases, and compare refactoring outcomes. (ii) Various datasets were collected from multiple open-source projects in multiple programming languages for analysis. These platforms included GitHub, Apache, and F-Droid, with the most popular language collected and analyzed being Java. (iii) One-Shot, Few-Shot, Context-Specific, and Chain-of-Thought prompting methods have been shown to be the most effective depending on the language used. In some instances, being capable of reducing code smell by up to 89%. 
(iv) The definition of “Accuracy” varies significantly across the literature surveyed, as it depends on the context of each study, highlighting the need for a standardized measure of accuracy. (v) The most frequently mentioned code smells were Large Class and Long Method, though many studies did not specify any, while the most frequently applied refactoring type was Extract Method, for which LLMs show promising results. (vi) LLMs often generate erroneous code, struggle with more complex refactorings, and frequently misunderstand or miss developers’ refactoring requests.</div><div>Conclusion: Our study serves as a collection of knowledge on LLM-based refactoring drawn from the surveyed studies and highlights issues that researchers often miss. We hope our findings will guide the future development of LLM-driven refactoring.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112762"},"PeriodicalIF":4.1,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
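Since Extract Method is identified in this abstract as the most often applied refactoring type, a minimal before/after illustration may be useful. The example is generic Python invented for illustration, not drawn from any surveyed study.

```python
# Minimal illustration of the Extract Method refactoring: a cohesive block
# inside a longer function is moved into its own, well-named function.

# Before: validation logic is inlined in the processing function.
def process_order_before(order):
    if not order.get("items"):
        raise ValueError("empty order")
    if order.get("total", 0) < 0:
        raise ValueError("negative total")
    return {"status": "accepted", "total": order["total"]}

# After: the validation block is extracted into validate_order.
def validate_order(order):
    if not order.get("items"):
        raise ValueError("empty order")
    if order.get("total", 0) < 0:
        raise ValueError("negative total")

def process_order_after(order):
    validate_order(order)  # behavior is unchanged, intent is clearer
    return {"status": "accepted", "total": order["total"]}
```

The external behavior is identical before and after, which is exactly the property that makes refactoring correctness checkable and thus attractive for LLM automation.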