{"title":"On the use of machine learning for failure prediction after collective changes in automated continuous integration testing","authors":"Ömer Özdemir , Reyhan Aydoğan , Hasan Sözer","doi":"10.1016/j.jss.2026.112791","DOIUrl":"10.1016/j.jss.2026.112791","url":null,"abstract":"<div><div>Continuous Integration (CI) is a development practice where developers regularly merge their code changes into a central repository, enabling simultaneous collaboration across a shared codebase. This frequent integration and automated building process in CI helps to detect and resolve conflicts or errors early in development. However, in large-scale systems, the build process can be costly. Each build incurs expenses, while skipping builds can increase the risk of undetected failures. Accurate predictions can help to identify builds that can be safely skipped to reduce CI costs. This paper presents an empirical study within an industrial setting, investigating the use of machine learning techniques to predict build failures after a set of collective changes. Unlike many existing works that apply random data splitting, our results show that chronological (time-based) splitting offers a more realistic and reliable assessment of model performance in CI environments. We evaluate various models and feature combinations on a dataset derived from real-world industrial projects. We observe high precision but low recall in predicting failed builds, allowing hundreds of successful builds to be correctly skipped, with around a dozen failures potentially being missed. 
Our analysis shows that this yields substantial time savings of approximately 2.5 h per build on average, while missed failures necessarily result in delayed failure detection, whose practical impact depends on application criticality and operational context.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112791"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
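The chronological (time-based) splitting this abstract advocates can be illustrated with a minimal sketch: train only on builds that precede the evaluation window, so the model never sees "future" builds. The `timestamp` field name is an illustrative assumption, not the paper's schema.

```python
def chronological_split(builds, train_frac=0.8):
    """Time-based split for CI build records: earlier builds become the
    training set, later builds the test set (no leakage from the future)."""
    ordered = sorted(builds, key=lambda b: b["timestamp"])  # assumed field name
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```

With random splitting, a model can be trained on builds that occurred after its test builds, inflating measured performance; the sketch above rules that out by construction.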
{"title":"Investigating the potential of using worked examples to help resolve issues in a GitHub project","authors":"João Vitor Souza Rocha , Igor S. Wiese , Ivanilton Polato , Marco Aurélio Graciotto Silva , Reginaldo Ré , Igor Steinmacher , Walter T. Nakamura","doi":"10.1016/j.jss.2026.112810","DOIUrl":"10.1016/j.jss.2026.112810","url":null,"abstract":"<div><div>The growing popularity of Open-Source Software projects has raised questions about the challenges novice and inexperienced developers face, especially on code contribution platforms like GitHub. This study investigates the effects of using Worked Examples (WEs) to support these developers in solving coding tasks, using eye-tracking and cognitive effort analysis. The research involved 20 undergraduate students analyzing issues from the JabRef repository, with recommendations of high and low-similarity examples provided by a bot. The findings suggest that highly similar WEs effectively guided participants by helping identify relevant directories, files, and code snippets, serving as starting points for task resolution. However, challenges emerged, such as difficulties locating useful information and risks of false proximity between seemingly similar issues. These results highlight the need for improved recommendation strategies beyond textual similarity, incorporating structural elements such as file and method names, while reducing cognitive load through better presentation of relevant information. 
This work lays the groundwork for exploring WEs in Open-Source Software projects and opens avenues for further research, including validating findings in other repositories and understanding behavioral patterns in using WEs.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112810"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
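A minimal sketch of the kind of purely textual similarity the abstract argues recommenders should move beyond (token-set Jaccard over issue descriptions; this is an illustration, not the bot's actual metric):

```python
def jaccard_similarity(issue_a, issue_b):
    """Token-set Jaccard similarity between two issue descriptions:
    |intersection| / |union| of their lowercase word sets."""
    ta, tb = set(issue_a.lower().split()), set(issue_b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

A purely lexical score like this is exactly what can produce the "false proximity" the study observed: two issues sharing vocabulary may differ in the files and methods they touch.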
{"title":"Using development environment as code for enhancing developer experience: An action design research study","authors":"Hadi Ghanbari, Tarmo Terimaa, Kari Koskinen","doi":"10.1016/j.jss.2026.112803","DOIUrl":"10.1016/j.jss.2026.112803","url":null,"abstract":"<div><div>Setting up and configuring local development environments is often laborious, time-consuming, and error-prone, which can negatively impact developer experience (DX). Despite a growing body of literature on DX, the existing studies remain silent about the impact of setting up and configuring local development tools and environments on DX. Against this backdrop, we conducted an Action Design Research (ADR) study in a Finnish software company to examine this impact and propose a solution for enhancing the experience. In addition to explaining how DX is shaped when setting up and configuring development environments, we propose the concept of Development Environment as Code (DEaC) as a solution for mitigating some of the issues associated with the process. Based on the lessons learned from the project, we offer four theoretically informed and empirically grounded design principles (DPs) for creating other instances of the artefact with the specific goal of improving DX related to managing their development environments. 
The paper concludes with recommendations for organisations to consider when adopting DEaC.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112803"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Many hands make light work: An LLM-based multi-agent system for detecting malicious PyPI packages","authors":"Muhammad Umar Zeshan, Motunrayo Ibiyo, Claudio Di Sipio, Phuong T. Nguyen, Davide Di Ruscio","doi":"10.1016/j.jss.2026.112792","DOIUrl":"10.1016/j.jss.2026.112792","url":null,"abstract":"<div><div>Malicious code in open-source repositories such as PyPI poses a growing threat to software supply chains. Traditional rule-based tools often overlook the semantic patterns in source code that are crucial for identifying adversarial components. Large language models (LLMs) show promise for software analysis, yet their use in interpretable and modular security pipelines remains limited.</div><div>This paper presents <span>LAMPS</span>, a multi-agent system that employs collaborative LLMs to detect malicious PyPI packages. The system consists of four role-specific agents for <em>package retrieval, file extraction, classification</em>, and <em>verdict aggregation</em>, coordinated through the CrewAI framework. A prototype combines a fine-tuned CodeBERT model for classification with LLaMA 3 agents for contextual reasoning. <span>LAMPS</span> has been evaluated on two complementary datasets: D<sub>1</sub>, a balanced collection of 6000 <span>setup.py</span> files, and D<sub>2</sub>, a realistic multi-file dataset with 1296 files and natural class imbalance. On D<sub>1</sub>, <span>LAMPS</span> achieves 97.7% accuracy, surpassing <span>MPHunter</span> and TF-IDF stacking models, two state-of-the-art approaches. On D<sub>2</sub>, it reaches 99.5% accuracy and 99.5% balanced accuracy, outperforming RAG-based approaches and fine-tuned single-agent baselines. McNemar’s test confirmed these improvements as highly significant. 
The results demonstrate the feasibility of distributed LLM reasoning for malicious code detection and highlight the benefits of modular multi-agent designs in software supply chain security.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112792"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146039476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
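The McNemar's test mentioned in the abstract compares two classifiers on paired predictions; a minimal sketch of the test statistic with continuity correction (variable names are mine, not the paper's):

```python
def mcnemar_chi2(b, c):
    """McNemar's chi-squared statistic with continuity correction.
    b: cases classifier A got right and classifier B got wrong;
    c: the reverse. Only these discordant pairs enter the statistic."""
    return (abs(b - c) - 1) ** 2 / (b + c)
```

Large values (e.g. above 3.84 for one degree of freedom at the 5% level) indicate the two classifiers' error patterns differ significantly.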
{"title":"Structured framework for joint software quality and environmental sustainability assessment: A case study in sports websites","authors":"José A. García-Berná , Ganeshkumar Pandiarajan , Sofia Ouhbi , Jennifer Gross , Joaquín Nicolás","doi":"10.1016/j.jss.2026.112798","DOIUrl":"10.1016/j.jss.2026.112798","url":null,"abstract":"<div><div>Research into the environmental impact of software has increased in recent years, although its assessment remains complex due to the diversity of existing technologies. This paper proposes a structured and configurable framework that enables joint assessment of environmental sustainability and a particular software quality dimension or attribute. The framework builds upon established practices in software quality evaluation and considers statistical analysis to empirically explore the interplay between both dimensions. The paper contributes by proposing a methodological approach and illustrating its application through data analysis. The framework was configured to assess environmental sustainability and usability in a set of sports news aggregators, correlating heuristic-based usability metrics with environmental sustainability indicators derived from website performance data. The results revealed notable negative correlations between heuristic H5 (error prevention) and multimedia metrics (<span><math><mrow><mi>r</mi><mo>=</mo><mo>−</mo><mn>0.86</mn></mrow></math></span>), and between H4 (consistency and standards) and operation-related sustainability metrics (<span><math><mrow><mi>r</mi><mo>=</mo><mo>−</mo><mn>0.70</mn></mrow></math></span>). These findings provide insights into how usability factors may influence environmental sustainability and suggest directions for improvement. 
Overall, the paper provides preliminary insights into environmentally aware software quality assessment, through both the design of the framework and the analytical results obtained.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112798"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
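The reported r values are Pearson correlation coefficients; a minimal sketch of the computation (an illustration only; the paper correlates heuristic-based usability scores with sustainability indicators):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples:
    covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

Values near -1, like the reported r = -0.86, indicate a strong inverse relationship between the paired metrics.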
{"title":"SysPro: Reproducing system-level concurrency bugs from bug reports","authors":"Tarannum Shaila Zaman , Chadni Islam , Jiangfan Shi , Zihan Shi , Fiona Xian , Tingting Yu","doi":"10.1016/j.jss.2026.112785","DOIUrl":"10.1016/j.jss.2026.112785","url":null,"abstract":"<div><div>Reproducing system-level concurrency bugs requires both input data and the precise interleaving order of system calls. This process is challenging because such bugs are non-deterministic, and bug reports often lack the detailed information needed. Additionally, the unstructured nature of reports written in natural language makes it difficult to extract necessary details. Existing tools are inadequate to reproduce these bugs due to their inability to manage the specific interleaving at the system call level. To address these challenges, we propose SysPro, a novel approach that automatically extracts relevant system call names from bug reports and identifies their locations in the source code. It generates input data by utilizing information retrieval, regular expression matching, and the category-partition method. This extracted input and interleaving data are then used to reproduce bugs through dynamic source code instrumentation. 
Our empirical study on real-world benchmarks demonstrates that SysPro is both effective and efficient at localizing and reproducing system-level concurrency bugs from bug reports.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112785"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
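The regular-expression matching step can be sketched minimally as below; the syscall vocabulary and pattern are my assumptions for illustration, not SysPro's actual extraction rules.

```python
import re

# Illustrative subset of system-call names; a real tool's vocabulary is larger.
SYSCALL_RE = re.compile(r"\b(open|read|write|close|fsync|mmap|unlink)\(")

def extract_syscalls(report_text):
    """Pull candidate system-call names out of a free-text bug report,
    matching only name-followed-by-parenthesis occurrences."""
    return sorted(set(SYSCALL_RE.findall(report_text)))
```

The extracted names would then be located in the source code and combined with generated inputs to drive the interleaving during replay.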
{"title":"Ache-Fuzz: Constraint-aware fuzzing for vulnerability discovery in distributed deep learning frameworks","authors":"Zhao Zhang , Senlin Luo , Liyuan Liu , Limin Pan","doi":"10.1016/j.jss.2026.112796","DOIUrl":"10.1016/j.jss.2026.112796","url":null,"abstract":"<div><div>Ensuring the reliability and security of deep learning (DL) libraries is essential for the robustness of modern AI systems and large-scale intelligent computing infrastructures. However, the complexity of API semantics and the diversity of parameter constraints make it challenging to generate comprehensive and effective test cases. This paper presents Ache-Fuzz, a fuzzing-based automated testing framework designed to enhance vulnerability discovery in DL libraries such as TensorFlow. Ache-Fuzz integrates constraint-aware test generation with a hierarchical mutation strategy to construct diverse and valid API inputs. It extracts parameter constraint patterns from official API documentation to model structural and attribute dependencies, while the hierarchical mutation mechanism systematically strengthens boundary condition coverage and promotes broader exploration of API functionalities. Experimental evaluation on three versions of TensorFlow shows that Ache-Fuzz achieves over 25% API coverage and identifies 38 previously unknown vulnerabilities, 15 of which have been assigned CVE identifiers. 
These results demonstrate that Ache-Fuzz offers a scalable and effective approach for improving the robustness and security of large-scale AI software systems.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112796"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
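The emphasis on boundary-condition coverage can be illustrated with a classic boundary-value sketch for an integer parameter constrained to a range (my illustration, not Ache-Fuzz's generator):

```python
def boundary_candidates(lo, hi):
    """Classic boundary-value candidates for an integer parameter
    constrained to [lo, hi]: just outside, on, and just inside each bound."""
    return [lo - 1, lo, lo + 1, hi - 1, hi, hi + 1]
```

Constraint-aware generation would derive `lo` and `hi` from documented parameter constraints, then mutate around these candidates rather than sampling the range uniformly.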
{"title":"Reference architecture for autonomy and adaptivity in satellites","authors":"Francesco Basciani , Luciana Rebelo , Patrizio Pelliccione","doi":"10.1016/j.jss.2026.112802","DOIUrl":"10.1016/j.jss.2026.112802","url":null,"abstract":"<div><div><em>Background.</em> Software plays a growing role in the space domain. Autonomy and adaptivity are central for managing mission complexity, handling communication delays, improving efficiency, and enabling operations without ground intervention. <em>Objective.</em> This paper presents a reference architecture (RA) for autonomous and adaptive satellite systems. <em>Method.</em> The RA is grounded in a systematic literature review and validated with industry experts. Following ISO/IEC/IEEE 42010, we identify stakeholders and concerns, define architectural decisions, and present components and connectors. <em>Results.</em> The architecture organizes five functional components along the MAPE-K loop and separates application-level and operational-level roles. We instantiate the RA on the NIMBUS platform, a satellite that Thales Alenia Space Italy developed for the IRIDE constellation. This treatment implementation confirms the feasibility of mapping abstract responsibilities onto real onboard software and subsystems. <em>Conclusion.</em> The proposed RA provides a structured foundation for designing autonomous and adaptive space systems. Its successful instantiation on NIMBUS shows its suitability for partitioned, safety-critical platforms. 
Future work includes support for learning modules, constellation-level planning, and secure reconfiguration.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112802"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
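The MAPE-K loop the architecture is organized around can be sketched minimally; the threshold and action names below are invented placeholders, not NIMBUS behavior.

```python
class MapeKLoop:
    """Minimal Monitor-Analyze-Plan-Execute loop over shared Knowledge (K)."""

    def __init__(self, threshold=100.0):
        self.knowledge = {}            # shared knowledge base (K)
        self.threshold = threshold     # invented anomaly threshold

    def monitor(self, readings):
        self.knowledge["readings"] = readings

    def analyze(self):
        self.knowledge["anomaly"] = any(
            v > self.threshold for v in self.knowledge["readings"].values())

    def plan(self):
        self.knowledge["action"] = (
            "safe_mode" if self.knowledge["anomaly"] else "nominal")

    def execute(self):
        return self.knowledge["action"]

    def step(self, readings):
        self.monitor(readings)
        self.analyze()
        self.plan()
        return self.execute()
```

In the reference architecture, each phase maps to a functional component, with application-level and operational-level responsibilities kept separate.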
{"title":"ARFT-Transformer: Modeling metric dependencies for cross-project aging-related bug prediction","authors":"Shuning Ge , Fangyun Qin , Xiaohui Wan , Yang Liu , Qian Dai , Zheng Zheng","doi":"10.1016/j.jss.2026.112795","DOIUrl":"10.1016/j.jss.2026.112795","url":null,"abstract":"<div><div>Software systems that run for long periods often suffer from software aging, which is typically caused by Aging-Related Bugs (ARBs). To mitigate the risk of ARBs early in the development phase, ARB prediction has been introduced into software aging research. However, due to the difficulty of collecting ARBs, within-project ARB prediction faces the challenge of data scarcity, leading to the proposal of cross-project ARB prediction. This task faces two major challenges: 1) domain adaptation issue caused by distribution difference between source and target projects; and 2) severe class imbalance between ARB-prone and ARB-free samples. Although various methods have been proposed for cross-project ARB prediction, existing approaches treat the input metrics independently and often neglect the rich inter-metric dependencies, which can lead to overlapping information and misjudgment of metric importance, potentially affecting the model’s performance. Moreover, they typically use cross-entropy as the loss function during training, which cannot distinguish the difficulty of sample classification. To overcome these limitations, we propose ARFT-Transformer, a transformer-based cross-project ARB prediction framework that introduces a metric-level multi-head attention mechanism to capture metric interactions and incorporates Focal Loss function to effectively handle class imbalance. 
Experiments conducted on three large-scale open-source projects demonstrate that ARFT-Transformer on average outperforms state-of-the-art cross-project ARB prediction methods in both single-source and multi-source cases, achieving up to 29.54% and 19.92% improvements in the Balance metric, respectively.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112795"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
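The Focal Loss adopted in the framework down-weights easy examples relative to cross-entropy, focusing training on hard, minority-class samples; a minimal binary sketch (the gamma and alpha defaults are common choices, not necessarily the paper's settings):

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.
    p: predicted probability of the positive (ARB-prone) class; y: 0/1 label.
    gamma > 0 shrinks the loss on well-classified examples; alpha weights
    the positive class. With gamma = 0 it reduces to weighted cross-entropy."""
    pt = p if y == 1 else 1.0 - p          # probability of the true class
    weight = alpha if y == 1 else 1.0 - alpha
    return -weight * (1.0 - pt) ** gamma * math.log(pt)
```

Because the (1 - pt)^gamma factor vanishes as pt approaches 1, the many easy ARB-free samples contribute little, countering the class imbalance the abstract describes.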
{"title":"A novel seed scheduling scheme using Thompson sampling for coverage-guided greybox fuzzing","authors":"Wen Zhang , Jinfu Chen , Saihua Cai , Kun Wang , Yisong Liu , Haotong Ding","doi":"10.1016/j.jss.2026.112794","DOIUrl":"10.1016/j.jss.2026.112794","url":null,"abstract":"<div><div>Coverage-guided Greybox Fuzzing (CGF) aims to maximize code area exploration within limited time, achieving higher code coverage. Current methods generally estimate seed potential through attributes like execution speed and size, but often ignore the distribution of explored program space and seed category potential in detecting new coverage, resulting in unbalanced code area exploration and limited detection of complex code. This paper proposes TMS-Fuzz, a new fuzzing seed scheduling method that balances code area exploration by distinguishing execution coverage features of seed inputs. By computing the path similarity between the execution coverage of different seed inputs, TMS-Fuzz dynamically and adaptively clusters them. Additionally, to improve the return on investment (ROI) of fuzzing, TMS-Fuzz uses a customized Thompson sampling algorithm to statistically select a seed group with the highest ROI, meaning the mutations of seeds in this group are most likely to discover new unique paths and crashes. Finally, TMS-Fuzz performs fuzzing on the target program by mutating the seed files in the selected seed group. 
Evaluations on eight real-world programs, compared with state-of-the-art open-source fuzzers, show that TMS-Fuzz improves edge coverage and crash detection capabilities in real programs.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112794"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
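The customized Thompson sampling is not detailed in the abstract; a generic Beta-Bernoulli sketch of sampling-based group selection is given below (the reward signal, priors, and naming are my assumptions, not TMS-Fuzz's algorithm):

```python
import random

class ThompsonSeedScheduler:
    """Beta-Bernoulli Thompson sampling over clustered seed groups:
    draw a plausible success rate per group from its Beta posterior,
    fuzz the best-looking group, then update with the observed outcome."""

    def __init__(self, n_groups):
        self.alpha = [1.0] * n_groups  # prior successes + 1 (new coverage found)
        self.beta = [1.0] * n_groups   # prior failures + 1

    def select_group(self):
        draws = [random.betavariate(a, b)
                 for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, group, found_new_coverage):
        if found_new_coverage:
            self.alpha[group] += 1.0
        else:
            self.beta[group] += 1.0
```

Sampling from the posterior, rather than always picking the empirically best group, keeps under-explored seed groups in play, which matches the paper's goal of balanced code-area exploration.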