{"title":"The role of psychological safety in promoting software quality in agile teams","authors":"Adam Alami, Mansooreh Zahedi, Oliver Krancher","doi":"10.1007/s10664-024-10512-1","DOIUrl":"https://doi.org/10.1007/s10664-024-10512-1","url":null,"abstract":"<p>Psychological safety continues to pique the interest of scholars in a variety of disciplines of study. Recent research indicates that psychological safety fosters knowledge sharing and norm clarity and complements agile values. Although software quality remains a concern in the software industry, academics have yet to investigate whether and how psychologically safe teams provide superior results. In this study, we explore how psychological safety influences agile teams’ quality-related behaviors aimed at enhancing software quality. To widen the empirical coverage and evaluate the results, we chose a two-phase mixed-methods research design with an exploratory qualitative phase (20 interviews) followed by a quantitative phase (survey study, N = 423). Our findings show that, when psychological safety is established in agile software teams, it induces enablers of a social nature that advance the teams’ ability to pursue software quality. For example, admitting mistakes and taking initiatives equally help teams learn and invest their learning in their future decisions related to software quality. Past mistakes become points of reference for avoiding them in the future. Individuals become more willing to take initiatives aimed at enhancing quality practices and mitigating software quality issues. We contribute to our endeavor to understand the circumstances that promote software quality. Psychological safety requires organizations, their management, agile teams, and individuals to maintain and propagate safety principles. Our results also suggest that technological tools and procedures can be utilized alongside social strategies to promote software quality.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"12 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The impact of concept drift and data leakage on log level prediction models","authors":"Youssef Esseddiq Ouatiti, Mohammed Sayagh, Noureddine Kerzazi, Bram Adams, Ahmed E. Hassan","doi":"10.1007/s10664-024-10518-9","DOIUrl":"https://doi.org/10.1007/s10664-024-10518-9","url":null,"abstract":"<p>Developers insert logging statements to collect information about the execution of their systems. Along with a logging framework (e.g., Log4j), practitioners can decide which log statement to print or suppress by tagging each log line with a log level. Since picking the right log level for a new logging statement is not straightforward, machine learning models for log level prediction (LLP) were proposed by prior studies. While these models show good performances, they are still subject to the context in which they are applied, specifically to the way practitioners decide on log levels in different phases of the development history of their projects (e.g., debugging vs. testing). For example, Openstack developers interchangeably increased/decreased the verbosity of their logs across the history of the project in response to code changes (e.g., before vs after fixing a new bug). Thus, the manifestation of these changing log verbosity choices across time can lead to concept drift and data leakage issues, which we wish to quantify in this paper on LLP models. In this paper, we empirically quantify the impact of data leakage and concept drift on the performance and interpretability of LLP models in three large open-source systems. Additionally, we compare the performance and interpretability of several time-aware approaches to tackle time-related issues. We observe that both shallow and deep-learning-based models suffer from both time-related issues. We also observe that training a model on just a window of the historical data (i.e., contextual model) outperforms models that are trained on the whole historical data (i.e., all-knowing model) in the case of our shallow LLP model. Finally, we observe that contextual models exhibit a different (even contradictory) model interpretability, with a (very) weak correlation between the ranking of important features of the pairs of contextual models we compared. Our findings suggest that data leakage and concept drift should be taken into consideration for LLP models. We also invite practitioners to include the size of the historical window as an additional hyperparameter to tune a suitable contextual model instead of leveraging all-knowing models.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"16 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Trusted Smart Contracts: A Comprehensive Test Suite For Vulnerability Detection","authors":"Andrei Arusoaie, Ștefan-Claudiu Susan","doi":"10.1007/s10664-024-10509-w","DOIUrl":"https://doi.org/10.1007/s10664-024-10509-w","url":null,"abstract":"<p>The term <i>smart contract</i> was originally used to describe automated legal contracts. Nowadays, it refers to special programs that run on blockchain platforms and are popular in decentralized applications. In recent years, vulnerabilities in smart contracts caused significant financial losses. Researchers have proposed methods and tools for detecting them and have demonstrated their effectiveness using various test suites. In this paper, we aim to improve the current approach to measuring the effectiveness of vulnerability detectors in smart contracts. First, we identify several traits of existing test suites used to assess tool effectiveness. We explain how these traits limit the evaluation and comparison of vulnerability detection tools. Next, we propose a new test suite that prioritizes diversity over quantity, utilizing a comprehensive taxonomy to achieve this. Our organized test suite enables insightful evaluations and more precise comparisons among vulnerability detection tools. We demonstrate the benefits of our test suite by comparing several vulnerability detection tools using two sets of metrics. Results show that the tools we included in our comparison cover less than half of the vulnerabilities in the new test suite. Finally, based on our results, we answer several questions that we pose in the introduction of the paper about the effectiveness of the compared tools.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"17 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic title completion for Stack Overflow posts and GitHub issues","authors":"Xiang Chen, Wenlong Pei, Shaoyu Yang, Yanlin Zhou, Zichen Zhang, Jiahua Pei","doi":"10.1007/s10664-024-10513-0","DOIUrl":"https://doi.org/10.1007/s10664-024-10513-0","url":null,"abstract":"<p>Title quality is important for different software engineering communities. For example, in Stack Overflow, posts with low-quality question titles often discourage potential answerers. In GitHub, issues with low-quality titles can make it difficult for developers to grasp the core idea of the problem. In previous studies, researchers mainly focused on generating titles from scratch by analyzing the body contents, such as the post body for Stack Overflow question title generation (SOTG) and the issue body for issue title generation (ISTG). However, the quality of the generated titles is still limited by the information available in the body contents. A more effective way is to provide accurate completion suggestions when developers compose titles. Inspired by this idea, we are the first to study the problem of automatic title completion for software engineering title generation tasks and propose the approach <span>TC4SETG</span>. Specifically, we first preprocess the gathered titles to form incomplete titles (i.e., tip information provided by developers) for simulating the title completion scene. Then we construct the input by concatenating the incomplete title with the body’s content. Finally, we fine-tune the pre-trained model CodeT5 to learn the title completion patterns effectively. To evaluate the effectiveness of <span>TC4SETG</span>, we selected 189,655 high-quality posts from Stack Overflow by covering eight popular programming languages for the SOTG task and 333,563 issues in the top-200 starred repositories on GitHub for the ISTG task. Our empirical results show that compared with the approaches of generating question titles from scratch, our proposed approach <span>TC4SETG</span> is more practical in automatic and human evaluation. Our experimental results demonstrate that <span>TC4SETG</span> outperforms corresponding state-of-the-art baselines in the SOTG task by a minimum of 25.82% and in the ISTG task by at least 45.48% in terms of ROUGE-L. Therefore, our study provides a new direction for studying automatic software engineering title generation and calls for more researchers to investigate this direction in the future.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"23 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Can we spot energy regressions using developers tests?","authors":"Benjamin Danglot, Jean-Rémy Falleri, Romain Rouvoy","doi":"10.1007/s10664-023-10429-1","DOIUrl":"https://doi.org/10.1007/s10664-023-10429-1","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">\u0000<b>Context</b>\u0000</h3><p><i>Software Energy Consumption</i> is gaining more and more attention. In this paper, we tackle the problem of warning developers about the increase of SEC of their programs during <i>Continuous Integration</i> (CI).</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Objective</b>\u0000</h3><p>In this study, we investigate if the CI can leverage developers’ tests to perform <i>energy regression testing</i>. Energy regression is similar to performance regression but focuses on the energy consumption of the program instead of standard performance indicators, like execution time or memory consumption.</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Method</b>\u0000</h3><p>We perform an exploratory study of the usage of developers’ tests for energy regression testing. We first investigate if developers’ tests can be used to obtain stable SEC indicators. Then, we evaluate if comparing the SEC of developers’ tests between two versions can pinpoint energy regressions introduced by automated program mutations. Finally, we manually evaluate several real commits pinpointed by our approach.</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Results</b>\u0000</h3><p>Our study will pave the way for automated SEC regression tools that can be readily deployed inside an existing CI infrastructure to raise awareness of SEC issues among practitioners.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"61 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Refining the SZZ Algorithm with Bug Discussion Data","authors":"Pooja Rani, Fernando Petrulio, Alberto Bacchelli","doi":"10.1007/s10664-024-10511-2","DOIUrl":"https://doi.org/10.1007/s10664-024-10511-2","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>Researchers testing hypotheses related to factors leading to low-quality software often rely on historical data, specifically on details regarding when defects were introduced into a codebase of interest. The prevailing techniques to determine the introduction of defects revolve around variants of the <span>SZZ</span> algorithm. This algorithm leverages information on the lines modified during a bug-fixing commit and finds when these lines were last modified, thereby identifying bug-introducing commits.</p><h3 data-test=\"abstract-sub-heading\">Objectives</h3><p>Despite several improvements and variants, <span>SZZ</span> struggles with accuracy, especially in cases of unrelated modifications or that touch files not involved in the introduction of the bug in the version control systems (aka <i>tangled commit</i> and <i>ghost commits</i>).</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>Our research investigates whether and how incorporating content retrieved from bug discussions can address these issues by identifying the related and external files and thus improve the efficacy of the <span>SZZ</span> algorithm.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>To conduct our investigation, we take advantage of the links manually inserted by Mozilla developers in bug reports to signal which commits inserted bugs. Thus, we prepared the dataset, <i>RoTEB</i>, comprised of 12,472 bug reports. We first manually inspect a sample of 369 bug reports related to these bug-fixing or bug-introducing commits and investigate whether the files mentioned in these reports could be useful for <span>SZZ</span>. After we found evidence that the mentioned files are relevant, we augment <span>SZZ</span> with this information, using different strategies, and evaluate the resulting approach against multiple <span>SZZ</span> variations.</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>We define a taxonomy outlining the rationale behind developers’ references to diverse files in their discussions. We observe that bug discussions often mention files relevant to enhancing the <span>SZZ</span> algorithm’s efficacy. Then, we verify that integrating these file references augments the precision of <span>SZZ</span> in pinpointing bug-introducing commits. Yet, it does not markedly influence recall. These results deepen our comprehension of the usefulness of bug discussions for <span>SZZ</span>. Future work can leverage our dataset and explore other techniques to further address the problem of tangled commits and ghost commits. Data & material: https://zenodo.org/records/11484723.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"94 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Test-based patch clustering for automatically-generated patches assessment","authors":"Matias Martinez, Maria Kechagia, Anjana Perera, Justyna Petke, Federica Sarro, Aldeida Aleti","doi":"10.1007/s10664-024-10503-2","DOIUrl":"https://doi.org/10.1007/s10664-024-10503-2","url":null,"abstract":"<p>Previous studies have shown that Automated Program Repair (<span>apr</span>) techniques suffer from the overfitting problem. Overfitting happens when a patch is run and the test suite does not reveal any error, but the patch actually does not fix the underlying bug or it introduces a new defect that is not covered by the test suite. Therefore, the patches generated by <span>apr</span> tools need to be validated by human programmers, which can be very costly, and prevents <span>apr</span> tool adoption in practice. Our work aims to minimize the number of plausible patches that programmers have to review, thereby reducing the time required to find a correct patch. We introduce a novel light-weight test-based patch clustering approach called <span>xTestCluster</span>, which clusters patches based on their dynamic behavior. <span>xTestCluster</span> is applied after the patch generation phase in order to analyze the generated patches from one or more repair tools and to provide more information about those patches for facilitating patch assessment. The novelty of <span>xTestCluster</span> lies in using information from execution of newly generated test cases to cluster patches generated by multiple APR approaches. A cluster is formed of patches that fail on the same generated test cases. The output from <span>xTestCluster</span> gives developers <i>a)</i> a way of reducing the number of patches to analyze, as they can focus on analyzing a sample of patches from each cluster, <i>b)</i> additional information (new test cases and their results) attached to each patch. After analyzing 902 plausible patches from 21 Java <span>apr</span> tools, our results show that <span>xTestCluster</span> is able to reduce the number of patches to review and analyze with a median of 50%. <span>xTestCluster</span> can save a significant amount of time for developers that have to review the multitude of patches generated by <span>apr</span> tools, and provides them with new test cases that expose the differences in behavior between generated patches. Moreover, <span>xTestCluster</span> can complement other patch assessment techniques that help detect patch misclassifications.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"35 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Free open source communities sustainability: Does it make a difference in software quality?","authors":"Adam Alami, Raúl Pardo, Johan Linåker","doi":"10.1007/s10664-024-10529-6","DOIUrl":"https://doi.org/10.1007/s10664-024-10529-6","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>Free and Open Source Software (FOSS) communities’ ability to stay viable and productive over time is pivotal for society as they maintain the building blocks that digital infrastructure, products, and services depend on. Sustainability may, however, be characterized from multiple aspects, and less is known how these aspects interplay and impact community outputs, and software quality specifically.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>This study, therefore, aims to empirically explore how the different aspects of FOSS sustainability impact software quality.</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>16 sustainability metrics across four categories were sampled and applied to a set of 217 OSS projects sourced from the Apache Software Foundation Incubator program. The impact of a decline in the sustainability metrics was analyzed against eight software quality metrics using Bayesian data analysis, which incorporates probability distributions to represent the regression coefficients and intercepts.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>Findings suggest that selected sustainability metrics do not significantly affect defect density or code coverage. However, a positive impact of community age was observed on specific code quality metrics, such as risk complexity, number of very large files, and code duplication percentage. Interestingly, findings show that even when communities are experiencing sustainability, certain code quality metrics are negatively impacted.</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>Findings imply that code quality practices are not consistently linked to sustainability, and defect management and prevention may be prioritized over the former. Results suggest that growth, resulting in a more complex and large codebase, combined with a probable lack of understanding of code quality standards, may explain the degradation in certain aspects of code quality.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"165 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explaining poor performance of text-based machine learning models for vulnerability detection","authors":"Kollin Napier, Tanmay Bhowmik, Zhiqian Chen","doi":"10.1007/s10664-024-10519-8","DOIUrl":"https://doi.org/10.1007/s10664-024-10519-8","url":null,"abstract":"<p>With an increase of severity in software vulnerabilities, machine learning models are being adopted to combat this threat. Given the possibilities towards usage of such models, research in this area has introduced various approaches. Although models may differ in performance, there is an overall lack of explainability in understanding how a model learns and predicts. Furthermore, recent research suggests that models perform poorly in detecting vulnerabilities when interpreting source code as text, known as “text-based” models. To help explain this poor performance, we explore the dimensions of explainability. From recent studies on text-based models, we experiment with removal of overlapping features present in training and testing datasets, deemed “cross-cutting”. We conduct scenario experiments removing such “cross-cutting” data and reassessing model performance. Based on the results, we examine how removal of these “cross-cutting” features may affect model performance. Our results show that removal of “cross-cutting” features may provide greater performance of models in general, thus leading to explainable dimensions regarding data dependency and agnostic models. Overall, we conclude that model performance can be improved, and explainable aspects of such models can be identified via empirical analysis of the models’ performance.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"36 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141739973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Exploring the Limitations of Test Selection Techniques on Graph Neural Networks: An Empirical Study","authors":"Xueqi Dang, Yinghua Li, Wei Ma, Yuejun Guo, Qiang Hu, Mike Papadakis, Maxime Cordy, Yves Le Traon","doi":"10.1007/s10664-024-10515-y","DOIUrl":"https://doi.org/10.1007/s10664-024-10515-y","url":null,"abstract":"<p>Graph Neural Networks (GNNs) have gained prominence in various domains, such as social network analysis, recommendation systems, and drug discovery, due to their ability to model complex relationships in graph-structured data. GNNs can exhibit incorrect behavior, resulting in severe consequences. Therefore, testing is necessary and pivotal. However, labeling all test inputs for GNNs can be prohibitively costly and time-consuming, especially when dealing with large and complex graphs. In response to these challenges, test selection has emerged as a strategic approach to alleviate labeling expenses. The objective of test selection is to select a subset of tests from the complete test set. While various test selection techniques have been proposed for traditional deep neural networks (DNNs), their adaptation to GNNs presents unique challenges due to the distinctions between DNN and GNN test data. Specifically, DNN test inputs are independent of each other, whereas GNN test inputs (nodes) exhibit intricate interdependencies. Therefore, it remains unclear whether DNN test selection approaches can perform effectively on GNNs. To fill the gap, we conduct an empirical study that systematically evaluates the effectiveness of various test selection methods in the context of GNNs, focusing on three critical aspects: <b>1) Misclassification detection</b>: selecting test inputs that are more likely to be misclassified; <b>2) Accuracy estimation</b>: selecting a small set of tests to precisely estimate the accuracy of the whole testing set; <b>3) Performance enhancement</b>: selecting retraining inputs to improve the GNN accuracy. Our empirical study encompasses 7 graph datasets and 8 GNN models, evaluating 22 test selection approaches. Our study includes not only node classification datasets but also graph classification datasets. Our findings reveal that: 1) In GNN misclassification detection, confidence-based test selection methods, which perform well in DNNs, do not demonstrate the same level of effectiveness; 2) In terms of GNN accuracy estimation, clustering-based methods, while consistently performing better than random selection, provide only slight improvements; 3) Regarding selecting inputs for GNN performance improvement, test selection methods, such as confidence-based and clustering-based test selection methods, demonstrate only slight effectiveness; 4) Concerning performance enhancement, node importance-based test selection methods are not suitable, and in many cases, they even perform worse than random selection.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"92 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141739972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}