Title: An empirical study of untangling patterns of two-class dependency cycles
Authors: Qiong Feng, Shuwen Liu, Huan Ji, Xiaotian Ma, Peng Liang
Journal: Empirical Software Engineering
DOI: 10.1007/s10664-023-10438-0
Published: 2024-03-12

Abstract: Dependency cycles pose a significant challenge to software quality and maintainability. However, there is limited understanding of how practitioners resolve dependency cycles in real-world scenarios. This paper presents an empirical study investigating the recurring patterns employed by software developers to resolve dependency cycles between two classes in practice. We analyzed data from 38 open-source projects across different domains and manually inspected hundreds of cycle-untangling cases. Our findings reveal that developers tend to employ five recurring patterns to address dependency cycles. The chosen patterns are determined not only by the dependency relations between the cyclic classes but also by their design context, i.e., how the cyclic classes depend on or are depended upon by their neighboring classes. Through this empirical study, we also discovered three common counterintuitive solutions that developers usually adopt when handling cycles. These recurring patterns and common counterintuitive solutions observed in practice can serve as a taxonomy to improve developers' awareness, and as learning material for software engineering students and inexperienced developers. Our results also suggest that, in addition to considering the internal structure of dependency cycles, automatic tools need to consider the design context of cycles to provide better support for refactoring them.
{"title":"Machine learning-based test smell detection","authors":"Valeria Pontillo, Dario Amoroso d’Aragona, Fabiano Pecorelli, Dario Di Nucci, Filomena Ferrucci, Fabio Palomba","doi":"10.1007/s10664-023-10436-2","DOIUrl":"https://doi.org/10.1007/s10664-023-10436-2","url":null,"abstract":"<p>Test smells are symptoms of sub-optimal design choices adopted when developing test cases. Previous studies have proved their harmfulness for test code maintainability and effectiveness. Therefore, researchers have been proposing automated, heuristic-based techniques to detect them. However, the performance of these detectors is still limited and dependent on tunable thresholds. We design and experiment with a novel test smell detection approach based on machine learning to detect four test smells. First, we develop the largest dataset of manually-validated test smells to enable experimentation. Afterward, we train six machine learners and assess their capabilities in within- and cross-project scenarios. Finally, we compare the ML-based approach with state-of-the-art heuristic-based techniques. The key findings of the study report a negative result. The performance of the machine learning-based detector is significantly better than heuristic-based techniques, but none of the learners able to overcome an average F-Measure of 51%. We further elaborate and discuss the reasons behind this negative result through a qualitative investigation into the current issues and challenges that prevent the appropriate detection of test smells, which allowed us to catalog the next steps that the research community may pursue to improve test smell detection techniques.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"31 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140034815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the readability of test code","authors":"","doi":"10.1007/s10664-023-10390-z","DOIUrl":"https://doi.org/10.1007/s10664-023-10390-z","url":null,"abstract":"<h3>Abstract</h3> <span> <h3>Context</h3> <p>The readability of source code is key for understanding and maintaining software systems and tests. Although several studies investigate the readability of source code, there is limited research specifically on the readability of test code and related influence factors.</p> </span> <span> <h3>Objective</h3> <p>In this paper, we aim at investigating the factors that influence the readability of test code from an academic perspective based on scientific literature sources and complemented by practical views, as discussed in grey literature.</p> </span> <span> <h3>Methods</h3> <p>First, we perform a Systematic Mapping Study (SMS) with a focus on scientific literature. Second, we extend this study by reviewing grey literature sources for practical aspects on test code readability and understandability. Finally, we conduct a controlled experiment on the readability of a selected set of test cases to collect additional knowledge on influence factors discussed in practice.</p> </span> <span> <h3>Results</h3> <p>The result set of the SMS includes 19 primary studies from the scientific literature for further analysis. The grey literature search reveals 62 sources for information on test code readability. Based on an analysis of these sources, we identified a combined set of 14 factors that influence the readability of test code. 7 of these factors were found in scientific <em>and</em> grey literature, while some factors were mainly discussed in academia (2) <em>or</em> industry (5) with only limited overlap. The controlled experiment on practically relevant influence factors showed that the investigated factors have a significant impact on readability for half of the selected test cases.</p> </span> <span> <h3>Conclusion</h3> <p>Our review of scientific and grey literature showed that test code readability is of interest for academia and industry with a consensus on key influence factors. However, we also found factors only discussed by practitioners. For some of these factors we were able to confirm an impact on readability in a first experiment. Therefore, we see the need to bring together academic and industry viewpoints to achieve a common view on the readability of software test code.</p> </span>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"2016 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139980044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: When less is more: on the value of "co-training" for semi-supervised software defect predictors
Authors: Suvodeep Majumder, Joymallya Chakraborty, Tim Menzies
Journal: Empirical Software Engineering
DOI: 10.1007/s10664-023-10418-4
Published: 2024-02-24

Abstract: Labeling a module as defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE, e.g., for predicting defects, and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those using 100% of the data. That said, co-training needs to be used cautiously, since the specific co-training method must be carefully selected based on a user's goals. Also, we warn that a commonly used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions, while adding considerably to the runtime (11 hours vs. 1.8 hours). It is an open question, worthy of future work, whether these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used is available at https://github.com/ai-se/Semi-Supervised.
{"title":"Traceability and reuse mechanisms, the most important properties of model transformation languages","authors":"Stefan Höppner, Matthias Tichy","doi":"10.1007/s10664-023-10428-2","DOIUrl":"https://doi.org/10.1007/s10664-023-10428-2","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">\u0000<b>Context</b>\u0000</h3><p>Dedicated model transformation languages are claimed to provide many benefits over the use of general purpose languages for developing model transformations. However, the actual advantages and disadvantages associated with the use of model transformation languages are poorly understood empirically. There is little knowledge and even less empirical assessment about what advantages and disadvantages hold in which cases and where they originate from. In a prior interview study, we elicited expert opinions on what advantages result from what factors surrounding model transformation languages as well as a number of moderating factors that moderate the influence.</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Objective</b>\u0000</h3><p>We aim to quantitatively asses the interview results to confirm or reject the influences and moderation effects posed by different factors. We further intend to gain insights into how valuable different factors are to the discussion so that future studies can draw on these data for designing targeted and relevant studies.</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Method</b>\u0000</h3><p>We gather data on the factors and quality attributes using an online survey. To analyse the data and examine the hypothesised influences and moderations, we use universal structure modelling based on a structural equation model. Universal structure modelling produces significance values and path coefficients for each hypothesised and modelled interdependence between factors and quality attributes that can be used to confirm or reject correlation and to weigh the strength of influence present.</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Results</b>\u0000</h3><p>We analyzed 113 responses. The results show that the MTL capabilities Tracing and Reuse Mechanisms are most important overall. Though the observed effects were generally 10 times lower than anticipated. Furthermore, we found that moderation effects need to be individually assessed for each influence on a quality attribute. The moderation effects of a single moderating variable vary significantly for each influence, with the strongest effects being 1000 times higher than the weakest.</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Conclusion</b>\u0000</h3><p>The empirical assessment of MTLs is a complex topic that cannot be solved by looking at a single stand-alone factor. Our results provide clear indication that evaluation should consider transformations of different sizes and use-cases that go beyond mapping one elements attributes to another. 
Language development on the other hand should focus on providing practical, transformation specific reuse mechanisms that allow MTLs to excel in areas such as maintainability and productivity compared to GPLs.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"42 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139947054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
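
As a much simpler stand-in for the universal structure modelling used in the paper, the sketch below illustrates how a moderation effect can be expressed as a regression interaction term on synthetic data. The variable names (tracing, experience, maintainability) are hypothetical and chosen only to mirror the factor/moderator/quality-attribute structure described above.

```python
# Conceptual sketch only: the paper uses universal structure modelling on a
# structural equation model; here a moderation effect is shown in the simpler
# form of a regression interaction term. Variable names and data are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 113  # same order of magnitude as the survey's 113 responses
df = pd.DataFrame({
    "tracing": rng.normal(size=n),      # factor: tracing capability rating
    "experience": rng.normal(size=n),   # hypothesised moderator
})
# Outcome: a quality attribute whose dependence on tracing is moderated by
# experience (the 0.4 interaction term).
df["maintainability"] = (
    0.3 * df["tracing"]
    + 0.2 * df["experience"]
    + 0.4 * df["tracing"] * df["experience"]
    + rng.normal(scale=0.5, size=n)
)

# 'tracing * experience' expands to both main effects plus their interaction;
# a significant interaction coefficient indicates a moderation effect.
model = smf.ols("maintainability ~ tracing * experience", data=df).fit()
print(model.summary().tables[1])
```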

Title: An empirical study of attack-related events in DeFi projects development
Authors: Dongming Xiang, Yuanchang Lin, Liming Nie, Yaowen Zheng, Zhengzi Xu, Zuohua Ding, Yang Liu
Journal: Empirical Software Engineering
DOI: 10.1007/s10664-024-10447-7
Published: 2024-02-23

Abstract: Decentralized Finance (DeFi) offers users decentralized financial services that are closely tied to the security of their assets. If a DeFi project is attacked, it can lead to considerable losses. Unfortunately, there is a lack of research on how DeFi developers respond to attacks during the development process. This lack of knowledge makes it difficult to identify which attacks to protect against and to create a comprehensive attack response system. This paper presents an empirical study to understand the current state of developers' responses to attacks during the development process. In addition, we construct an analytical framework to help developers take preventive measures against attacks. Our research reveals that Overflow Attack-related events are the most frequent (63, i.e., 19.75% of all attack-related events), and that high-value DeFi projects tend to have more feedback and more active development activities. We observed that most attack instances (61, 85.92%) do not have corresponding attack-related development events, which can lead to a lack of trust between project teams and users if it is unclear whether a team responds to attacks. Furthermore, we noticed that after the resolution of an attack-related event, some attacks may recur even though they could have been prevented. Consequently, we suggest some future research directions and provide advice for DeFi project developers.

Title: LineFlowDP: A Deep Learning-Based Two-Phase Approach for Line-Level Defect Prediction
Authors: Fengyu Yang, Fa Zhong, Guangdong Zeng, Peng Xiao, Wei Zheng
Journal: Empirical Software Engineering
DOI: 10.1007/s10664-023-10439-z
Published: 2024-02-23

Abstract: Software defect prediction plays a key role in guiding resource allocation for software testing. However, previous defect prediction studies still have some limitations: (1) the granularity of defect prediction is still coarse, so high-risk code statements cannot be accurately located; (2) in fine-grained defect prediction, the semantic and structural information available in a single line of code is limited, and its semantic content is not sufficient to achieve semantic differentiation. To address these problems, we propose a two-phase line-level defect prediction method based on deep learning, called LineFlowDP. We first extract the program dependency graph (PDG) of the source files. The lines of code corresponding to the nodes in the PDG are extended semantically with data flow and control flow information and embedded as nodes, and the model is then trained using a relational graph convolutional network. Finally, a graph interpreter, GNNExplainer, and a social network analysis method are used to rank the lines of code in a defective file by risk. On 32 datasets from 9 projects, the experimental results show that LineFlowDP is 13%-404% more cost-effective than four state-of-the-art line-level defect prediction methods. The effectiveness of the flow information extension and the code line risk ranking method was also verified via ablation experiments.

Title: Analyzing source code vulnerabilities in the D2A dataset with ML ensembles and C-BERT
Authors: Saurabh Pujar, Yunhui Zheng, Luca Buratti, Burn Lewis, Yunchung Chen, Jim Laredo, Alessandro Morari, Edward Epstein, Tsungnan Lin, Bo Yang, Zhong Su
Journal: Empirical Software Engineering
DOI: 10.1007/s10664-023-10405-9
Published: 2024-02-22

Abstract: Static analysis tools are widely used for vulnerability detection as they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of machine learning models to learn from programming language data opens new possibilities for reducing false positives when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose Differential Dataset Analysis, or D2A, a differential-analysis-based approach to label issues reported by static analysis tools. The dataset built with this approach is called the D2A dataset. The D2A dataset is built by analyzing version pairs from multiple open-source projects. From each project, we select bug-fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that were fixed by the commit. We use D2A to generate a large labeled dataset. We then train both classic machine learning models and deep learning models for vulnerability identification using the D2A dataset. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first. To facilitate future research and contribute to the community, we make the dataset generation pipeline and the dataset publicly available. We have also created a leaderboard based on the D2A dataset, which has already attracted attention and participation from the community.

Title: Evaluating the impact of flaky simulators on testing autonomous driving systems
Authors: Mohammad Hossein Amini, Shervin Naseri, Shiva Nejati
Journal: Empirical Software Engineering
DOI: 10.1007/s10664-023-10433-5
Published: 2024-02-21

Abstract: Simulators are widely used to test Autonomous Driving Systems (ADS), but their potential flakiness can lead to inconsistent test results. We investigate test flakiness in simulation-based testing of ADS by addressing two key questions: (1) How do flaky ADS simulations impact automated testing that relies on randomized algorithms? and (2) Can machine learning (ML) effectively identify flaky ADS tests while decreasing the required number of test reruns? Our empirical results, obtained from two widely used open-source ADS simulators and five diverse ADS test setups, show that test flakiness in ADS is a common occurrence and can significantly impact the test results obtained by randomized algorithms. Further, our ML classifiers effectively identify flaky ADS tests using only a single test run, achieving F1-scores of 85%, 82%, and 96% for three different ADS test setups. Our classifiers significantly outperform our non-ML baseline, which requires executing tests at least twice, by 31%, 21%, and 13% in F1-score performance, respectively. We conclude with a discussion on the scope, implications, and limitations of our study. We provide our complete replication package in a GitHub repository (Github paper 2023).
{"title":"Studying the impact of risk assessment analytics on risk awareness and code review performance","authors":"","doi":"10.1007/s10664-024-10443-x","DOIUrl":"https://doi.org/10.1007/s10664-024-10443-x","url":null,"abstract":"<h3>Abstract</h3> <p>While code review is a critical component of modern software quality assurance, defects can still slip through the review process undetected. Previous research suggests that the main reason for this is a lack of reviewer awareness about the likelihood of defects in proposed changes; even experienced developers may struggle to evaluate the potential risks. If a change’s riskiness is underestimated, it may not receive adequate attention during review, potentially leading to defects being introduced into the codebase. In this paper, we investigate how risk assessment analytics can influence the level of awareness among developers regarding the potential risks associated with code changes; we also study how effective and efficient reviewers are at detecting defects during code review with the use of such analytics. We conduct a controlled experiment using <span>Gherald</span>, a risk assessment prototype tool that analyzes the riskiness of change sets based on historical data. Following a between-subjects experimental design, we assign participants to the treatment (i.e., with access to <span>Gherald</span>) or control group. All participants are asked to perform risk assessment and code review tasks. Through our experiment with 48 participants, we find that the use of <span>Gherald</span> is associated with statistically significant improvements (one-tailed, unpaired Mann-Whitney U test, <span> <span>(alpha )</span> </span> = 0.05) in developer awareness of riskiness of code changes and code review effectiveness. Moreover, participants in the treatment group tend to identify the known defects more quickly than those in the control group; however, the difference between the two groups is not statistically significant. Our results lead us to conclude that the adoption of a risk assessment tool has a positive impact on code review practices, which provides valuable insights for practitioners seeking to enhance their code review process and highlights the importance for further research to explore more effective and practical risk assessment approaches.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"33 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139904184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}