{"title":"Key Success Factors of Cybersecurity Awareness in Distributed Teams","authors":"Shabbab Ali Algamdi, Abdul Wahid Khan, Jamshid Ahmad, Moulay Ibrahim El-Khalil Ghembaza","doi":"10.1002/smr.70056","DOIUrl":"https://doi.org/10.1002/smr.70056","url":null,"abstract":"<div>\u0000 \u0000 <p>Strong cybersecurity procedures are now more important than ever due to the increased reliance on remote workers. Given the dynamic nature of cyber threats and the necessity of preventative actions, this paper highlights the vital significance of thorough cybersecurity awareness training for remote workers. A customized cybersecurity awareness training model can improve distributed team preparedness and decrease cyberattacks. Organizations should institute regular security awareness programs to educate distributed teams on emerging cyber threats. Vendor businesses should prioritize security education to prevent cyberattacks and protect sensitive data. Our proposed model aims to improve distributed team members' preparedness against cyber threats, enabling organizations to safeguard remote work settings effectively. Our systematic literature review identified key cybersecurity factors, synthesized into 12 groups, including “Unified Governance Framework” and “Secure Mind Initiative.”</p>\u0000 </div>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145146327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Merge Conflict Prediction Using Feature Selection and Stacking Heterogeneous Ensembles: An Empirical Investigation","authors":"Reem Alfayez, Amal Alazba","doi":"10.1002/smr.70047","DOIUrl":"https://doi.org/10.1002/smr.70047","url":null,"abstract":"<div>\u0000 \u0000 <p>Merge conflicts arise when multiple developers simultaneously modify the same part of a codebase and attempt to merge their changes. These conflicts occur because the version control system (VCS) cannot automatically determine which changes should take precedence. Resolving such conflicts involves manually reviewing the conflicting changes and deciding how to integrate them to maintain a functional and coherent codebase. This process is often time-consuming, complex, and prone to errors. Consequently, the software engineering community has focused on predicting merge conflicts to warn developers early and allow them to address conflicts before they escalate. Despite several efforts to predict merge conflicts, no perfect solution has been identified. Fortunately, many machine learning techniques have demonstrated potential in improving prediction performance across various contexts. This study aims to empirically investigate the effectiveness of stacking heterogeneous ensembles in enhancing merge conflict prediction performance. We empirically compared the prediction performance of the following individual models: decision trees (DT); support vector machine (SVM) with a linear kernel; naive Bayes (NB) with Bernoulli, Gaussian, and Multinomial variants; logistic regression (LR); multilayer perceptron (MLP); stochastic gradient descent (SGD); and k-nearest neighbors (KNN). Additionally, we evaluated three heterogeneous stacking ensembles: Stack-DT, Stack-SVM, and Stack-LR, which were constructed using the aforementioned individual models as base models. We utilized gain ratio (GR) to identify the most important technical and social features for predicting merge conflicts and assessed the impact of using only these important features on the performance of both individual and stacking models. The study revealed variability in the performance of individual models, with DT demonstrating the best predictive performance among them. Heterogeneous stacking ensembles demonstrated potential to enhance merge conflict prediction, with Stack-SVM emerging as the top-performing model. GR analysis highlighted the importance of both social and technical features in predicting merge conflicts. However, using only the most important features identified by GR led to a decline in the performance of most models compared to using all features. Heterogeneous stacking ensembles significantly improve prediction performance over individual models. Both social and technical features are important in predicting merge conflicts, and utilizing the full set of features instead of only the most important ones generally yields better results.</p>\u0000 </div>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145172015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cluster Analysis of Security Threats in Web Applications: A Multiphase SDLC Analysis","authors":"Shah Nawaz, Muhammad Yaseen, Gohar Rahman, Jasim Saeed","doi":"10.1002/smr.70055","DOIUrl":"https://doi.org/10.1002/smr.70055","url":null,"abstract":"<div>\u0000 \u0000 <p>Security threats in web applications have increasingly become a major concern, particularly as modern web systems grow more complex and interconnected. Addressing these security challenges requires a comprehensive understanding of how threats are distributed across different phases of the software development life cycle (SDLC) and how various threat categories map to specific SDLC stages. Despite significant research into software security, a systematic and structured review focusing on the hierarchical relationships between SDLC phases, security threat categories, and specific threats remains scarce. This paper aims to fill this gap by conducting a clustering-based systematic review of security threats in web applications. Using data from existing literature on software security threats, we applied hierarchical clustering, K-means analysis, and co-occurrence mapping to identify relationships between SDLC phases (Level 1), security threat categories (Level 2), and specific security threats (Level 3). The findings show that the development phase presents the highest risk, more so to threats like weaknesses in architectural security design and input validation issues. Using clustering techniques, we showed how some of the threats appeared in more than one SDLC stage and classified them within the categories of threats most closely associated with the SDLC stage. Taking into account these factors, we propose recommendations for software development process stakeholders allowing for the implementation of more consistent strategies of threat mitigation through the entire SDLC. Considering these observations, it can be concluded that there is an acute deficiency in development for globalization of software security measures towards web applications to control future security threats.</p>\u0000 </div>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145111066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Data to Knowledge: Mining Linux Vulnerability Characteristics and Evolution With Knowledge Graphs","authors":"Shiyu Weng, Xiaoxue Wu, Tianci Li, Chen Yao, Wenjing Shan, Xiaobing Sun","doi":"10.1002/smr.70053","DOIUrl":"https://doi.org/10.1002/smr.70053","url":null,"abstract":"<div>\u0000 \u0000 <p>An operating system is the essence of software, serving as the foundation for the operation of various application software. The security of the operating system is crucial for national informatization construction. Data indicate that many cybersecurity incidents result from exploiting security vulnerabilities in the operating system. Linux is currently the most widely used open-source operating system, with thousands of Common Vulnerabilities and Exposures (CVEs) related to Linux systems reported each year. Therefore, research and prevention of vulnerabilities in the Linux system are particularly important. To gain a better understanding of the characteristics of Linux system vulnerabilities, this paper leverages knowledge in the field of software security to analyze nearly 10,000 historical vulnerability data in two core systems of Linux: Linux Kernel and Debian Linux. The study explores the evolutionary patterns of vulnerability characteristics. Specific research contents include the following: (1) data collection and cleaning of vulnerability data in Linux Kernel and Debian Linux systems; (2) cross-statistical analysis of structured data features in vulnerability reports; (3) unstructured data characteristics mining in vulnerability reports based on domain knowledge; (4) analysis of the evolution of vulnerability characteristics. This paper provides empirical lessons and guidance for Linux system vulnerabilities to assist practitioners and researchers in better preventing and detecting vulnerabilities in Linux and Linux-based systems.</p>\u0000 </div>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145101229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UCLP: Unsupervised Classification of Key Aspects in Vulnerability Descriptions Through Label Profile","authors":"Linyi Han, Hang Li, Xiaowang Zhang, Youmeng Li, Zhiyong Feng","doi":"10.1002/smr.70052","DOIUrl":"https://doi.org/10.1002/smr.70052","url":null,"abstract":"<div>\u0000 \u0000 <p>Textual vulnerability descriptions (TVDs) in repositories like NVD and IBM X-Force Exchange are essential for security engineers managing vulnerabilities. Engineers typically search for key aspects in TVDs using specific phrases, but with multiple expressions for each aspect, retrieving all relevant records is challenging. We propose a label-based retrieval framework that classifies key aspects and retrieves TVDs by their broader categories. Given the large data volume, manual labeling is infeasible, making unsupervised classification critical. However, short labels and repeated words diminish semantic clarity, affecting classification accuracy. We introduce Unsupervised Classification through Label Profile (UCLP), which expands label semantics through label profiles inspired by recommendation systems. We construct profiles using neural network weights and apply TF-IDF to calculate similarities, smoothing distributions with an arctangent function. Results show that UCLP significantly outperforms four benchmarks, raising accuracy from 68.3% to 78.9% and improving three real-world applications.</p>\u0000 </div>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145038309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UFR-OSFA: Unified Feature Representation and Oppositional Structure Feature Alignment for Mixed-Project Heterogeneous Defect Prediction","authors":"Yifan Zou, Huiqiang Wang, Hongwu Lv, Shuai Zhao","doi":"10.1002/smr.70049","DOIUrl":"https://doi.org/10.1002/smr.70049","url":null,"abstract":"<div>\u0000 \u0000 <p>Heterogeneous defect prediction (HDP) plays a crucial role in software engineering by enabling the early detection of software defects across projects with heterogeneous feature spaces. Recently, some mixed-project HDP (MP-HDP) methods have been proposed, which have demonstrated modest improvements in HDP performance. Nevertheless, existing MP-HDP approaches fail to address feature redundancy and distribution inconsistency simultaneously. To overcome these limitations, this paper proposes a novel MP-HDP approach, UFR-OSFA, based on unified feature representation and oppositional structural feature alignment. Concretely, UFR-OSFA first unifies these features by reducing the distribution differences between source and target projects through matching common features and the Hungarian algorithm based on the Kolmogorov–Smirnov (KS) test. Subsequently, utilizing a generator and two classifiers with oppositional structures, UFR-OSFA separates the features of the source project and clusters those of the target project, addressing the issue of conditional distribution mismatch and enhancing the model's generalization ability in the target project. Extensive experiments on 23 projects from five datasets demonstrate that the proposed approach performs better or comparably to baseline methods.</p>\u0000 </div>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145037623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CyberESP: An Integrated Cybersecurity Framework for SMEs","authors":"Jose A. Calvo-Manzano, Tomás San Feliu, Ángel Herranz, Julio Mariño, Lars-Åke Fredlund, Ana M. Moreno","doi":"10.1002/smr.70050","DOIUrl":"https://doi.org/10.1002/smr.70050","url":null,"abstract":"<p>Cybersecurity is a critical global concern, particularly for small- and medium-sized enterprises (SMEs) with limited resources and expertise. The authors are developing CyberESP, a tailored cybersecurity framework supported by a semi-automated tool to ensure Spanish SMEs' cybersecurity management. Following the Design Science Research (DSR) methodology and grounded in international standards, the authors identified six requirements to be satisfied by a cybersecurity framework for SMEs, which should support the identification of assets, vulnerabilities, threats, and risks. This paper presents the first part of the CyberESP framework dealing with asset management, particularly their identification and analysis of dimensions and cost. A prototype supporting these activities was developed and validated through a case study in a retail SME, showing the solution's potential and identifying particular improvements. The paper also addresses threats to validity and limitations, noting the framework's focus on hardware, software, and networks. Future work includes vulnerability management and will explore the use of cloud and IoT deployment, positioning CyberESP as a practical solution to enhance SMEs' cybersecurity resilience.</p>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/smr.70050","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145022283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Engineering MLOps Pipelines With Data Quality: A Case Study on Tabular Datasets in Kaggle","authors":"Matteo Pancini, Matteo Camilli, Giovanni Quattrocchi, Damian Andrew Tamburri","doi":"10.1002/smr.70044","DOIUrl":"https://doi.org/10.1002/smr.70044","url":null,"abstract":"<p>Ensuring high-quality data is crucial for the successful deployment of machine learning models, thereby sustaining the operational pipelines around such models. However, a significant number of practitioners do not currently use data quality checks or measurements as gateways for their model construction and operationalization, indicating a need for greater awareness and adoption of these tools. In this study, we propose an automated approach for automating the process of architecting machine learning pipelines by means of (semi-)automated data quality checks. We focus on tabular data as a representative of the most widely used structured data formats in said pipelines. Our work is based on a subset of metrics that are particularly relevant in MLOps pipelines, stemming from our engagement with expert practitioners in machine learning operations (MLOps). We selected Deepchecks, a well-known tool for conducting data quality checks, from a cohort of similar tools to evaluate the quality of datasets collected from Kaggle, a widely used platform for machine learning competitions and data science projects. We also analyze the main features used by Kaggle to rank their datasets and used these features to validate the relevance of our approach. Our approach shows the potential for automated data quality checks to improve the efficiency and effectiveness of MLOps pipelines and their operation, by decreasing the risk of introducing errors and biases into machine learning models in production.</p>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/smr.70044","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145012976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Multi-Class Socio-Technical Congruence: Assessing Coordination in Collaborative Software Development Settings","authors":"Roshan Namal Rajapakse, Claudia Szabo","doi":"10.1002/smr.70040","DOIUrl":"https://doi.org/10.1002/smr.70040","url":null,"abstract":"<p>Effective coordination between contributors with different functional roles is fundamental for the success of collaboration-centric software development paradigms such as DevSecOps. However, quantitatively assessing coordination in such settings has received limited attention. We introduce multi-class socio-technical congruence (<span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mi>M</mi>\u0000 <mi>C</mi>\u0000 <mtext>-</mtext>\u0000 <mi>S</mi>\u0000 <mi>T</mi>\u0000 <mi>C</mi>\u0000 </mrow>\u0000 <annotation>$$ MChbox{-} STC $$</annotation>\u0000 </semantics></math>), an extension of the widely studied socio-technical congruence (<span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mi>S</mi>\u0000 <mi>T</mi>\u0000 <mi>C</mi>\u0000 </mrow>\u0000 <annotation>$$ STC $$</annotation>\u0000 </semantics></math>) framework to address this gap. Our metric enables the assessment of coordination in a setting where contributors with different functional roles or alignments collaborate. Using a large-scale exploratory case study, we evaluated <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mi>M</mi>\u0000 <mi>C</mi>\u0000 <mtext>-</mtext>\u0000 <mi>S</mi>\u0000 <mi>T</mi>\u0000 <mi>C</mi>\u0000 </mrow>\u0000 <annotation>$$ MChbox{-} STC $$</annotation>\u0000 </semantics></math> for two classes (i.e., <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mn>2</mn>\u0000 <mi>C</mi>\u0000 <mtext>-</mtext>\u0000 <mi>S</mi>\u0000 <mi>T</mi>\u0000 <mi>C</mi>\u0000 </mrow>\u0000 <annotation>$$ 2Chbox{-} STC $$</annotation>\u0000 </semantics></math>). Specifically, we calculated <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mn>2</mn>\u0000 <mi>C</mi>\u0000 <mtext>-</mtext>\u0000 <mi>S</mi>\u0000 <mi>T</mi>\u0000 <mi>C</mi>\u0000 </mrow>\u0000 <annotation>$$ 2Chbox{-} STC $$</annotation>\u0000 </semantics></math> for 100 systematically selected projects from the <i>TravisTorrent</i> dataset, considering developers (<i>dev</i>) and security-focused developers (<i>sf-devs</i>) as the two types of contributors with different functional alignments (i.e., two classes). We hypothesized that the <i>dev</i> and <i>sf-dev</i> interaction would have a quantifiable impact on the <i>vulnerability score</i> (<span></span><math>\u0000 <semantics>\u0000 <mrow","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/smr.70040","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145012977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prioritization Method for Crowdsourced Test Report by Integrating Text and Image Information","authors":"Huijie Tu, Xiangjuan Yao, Dunwei Gong, Yan Yang","doi":"10.1002/smr.70043","DOIUrl":"https://doi.org/10.1002/smr.70043","url":null,"abstract":"<div>\u0000 \u0000 <p>Crowdsourcing testing has the advantages of efficiency, speed, and reliability, but an excessive number of test reports makes it a challenge for report reviewers to select high-quality test reports in a limited time. Test reports submitted by crowd workers often tend to be short textual descriptions with a large number of screenshots attached. Most traditional processing methods of test reports target reports that only contain text information, which cannot meet the defect detection requirements of crowdsourced test reports. In view of this, this paper proposes a prioritization method of crowdsourced test reports that integrates text and image information. First, we extract the text and image information from the test reports, based on which the defect detection abilities of the test reports are measured and the similarities between test reports are calculated. Then, a multi-stage prioritization method of the test reports is presented based on the defect detection levels and similarities of the test reports. In the first stage, based on the defect detection levels and the similarities, the test report set is sorted and clustered to obtain the sorting results of partial reports and the similar set for each sorted report; in the second stage, the similar test report set is sorted with the criteria of minimizing the similarity and maximizing the defect detection level; the sorting results of the two stages are combined to form the final priorities of test reports. To validate our approach, we conducted experiments on five crowdsourced test datasets. The results and the analysis show that our approach can detect all faults faster in a limited time. By comprehensively utilizing text and image information to prioritize test reports, better sorting results can be obtained than state-of-the-art methods.</p>\u0000 </div>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144934867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}