{"title":"Comparative analysis of real issues in open-source machine learning projects","authors":"Tuan Dung Lai, Anj Simmons, Scott Barnett, Jean-Guy Schneider, Rajesh Vasa","doi":"10.1007/s10664-024-10467-3","DOIUrl":"https://doi.org/10.1007/s10664-024-10467-3","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>In the last decade of data-driven decision-making, Machine Learning (ML) systems reign supreme. Because of the different characteristics between ML and traditional Software Engineering systems, we do not know to what extent the issue-reporting needs are different, and to what extent these differences impact the issue resolution process.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>We aim to compare the differences between ML and non-ML issues in open-source applied AI projects in terms of resolution time and size of fix. This research aims to enhance the predictability of maintenance tasks by providing valuable insights for issue reporting and task scheduling activities.</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>We collect issue reports from Github repositories of open-source ML projects using an automatic approach, filter them using ML keywords and libraries, manually categorize them using an adapted deep learning bug taxonomy, and compare resolution time and fix size for ML and non-ML issues in a controlled sample.</p><h3 data-test=\"abstract-sub-heading\">Result</h3><p>147 ML issues and 147 non-ML issues are collected for analysis. We found that ML issues take more time to resolve than non-ML issues, the median difference is 14 days. There is no significant difference in terms of size of fix between ML and non-ML issues. No significant differences are found between different ML issue categories in terms of resolution time and size of fix.</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>Our study provided evidence that the life cycle for ML issues is stretched, and thus further work is required to identify the reason. The results also highlighted the need for future work to design custom tooling to support faster resolution of ML issues.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"5 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140829393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CoRT: Transformer-based code representations with self-supervision by predicting reserved words for code smell detection","authors":"Amal Alazba, Hamoud Aljamaan, Mohammad Alshayeb","doi":"10.1007/s10664-024-10445-9","DOIUrl":"https://doi.org/10.1007/s10664-024-10445-9","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>Code smell detection is the process of identifying poorly designed and implemented code pieces. Machine learning-based approaches require enormous amounts of manually labeled data, which are costly and difficult to scale. Unsupervised semantic feature learning, or learning without manual annotation, is vital for effectively harvesting an enormous amount of available data.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>The objective of this study is to propose a new code smell detection approach that utilizes self-supervised learning to learn intermediate representations without the need for labels and then fine-tune these representations on multiple tasks.</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>We propose a Code Representation with Transformers (CoRT) to learn the semantic and structural features of the source code by training transformers to recognize masked reserved words that are applied to the code given as input. We empirically demonstrated that the defined proxy task provides a powerful method for learning semantic and structural features. We exhaustively evaluated our approach on four downstream tasks: detection of the Data Class, God Class, Feature Envy, and Long Method code smells. Moreover, we compare our results with those of two paradigms: supervised learning and a feature-based approach. Finally, we conducted a cross-project experiment to evaluate the generalizability of our method to unseen labeled data.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>The results indicate that the proposed method has a high detection performance for code smells. For instance, the detection performance of CoRT on Data Class achieved a score of F1 between 88.08–99.4, Area Under Curve (AUC) between 89.62–99.88, and Matthews Correlation Coefficient (MCC) between 75.28–98.8, while God Class achieved a value of F1 ranges from 86.32–99.03, AUC of 92.1–99.85, and MCC of 76.15–98.09. Compared with the baseline model and feature-based approach, CoRT achieved better detection performance and had a high capability to detect code smells in unseen datasets.</p><h3 data-test=\"abstract-sub-heading\">Conclusions</h3><p>The proposed method has been shown to be effective in detecting class-level, and method-level code smells.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"48 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140565389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Taxonomy of inline code comment smells","authors":"","doi":"10.1007/s10664-023-10425-5","DOIUrl":"https://doi.org/10.1007/s10664-023-10425-5","url":null,"abstract":"<h3>Abstract</h3> <p>Code comments play a vital role in source code comprehension and software maintainability. It is common for developers to write comments to explain a code snippet, and commenting code is generally considered a good practice in software engineering. However, low-quality comments can have a detrimental effect on software quality or be ineffective for code understanding. This study aims to create a taxonomy of inline code comment smells and determine how frequently each smell type occurs in software projects. We conducted a multivocal literature review to define the initial taxonomy of inline comment smells. Afterward, we manually labeled 2447 inline comments from eight open-source projects where half of them were Java, and another half were Python projects. We created a taxonomy of 11 inline code comment smell types and found out that the smells exist in both Java and Python projects with varying degrees. Moreover, we conducted an online survey with 41 software practitioners to learn their opinions on these smells and their impact on code comprehension and software maintainability. The survey respondents generally agreed with the taxonomy; however, they reported that some smell types might have a positive effect on code comprehension in certain scenarios. We also opened pull requests and issues fixing the comment smells in the sampled projects, where we got a 27% acceptance rate. We share our manually labeled dataset online and provide implications for software engineering practitioners, researchers, and educators.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"114 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140565365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ROBUST: 221 bugs in the Robot Operating System","authors":"Christopher S. Timperley, Gijs van der Hoorn, André Santos, Harshavardhan Deshpande, Andrzej Wąsowski","doi":"10.1007/s10664-024-10440-0","DOIUrl":"https://doi.org/10.1007/s10664-024-10440-0","url":null,"abstract":"<p>As robotic systems such as autonomous cars and delivery drones assume greater roles and responsibilities within society, the likelihood and impact of catastrophic software failure within those systems is increased. To aid researchers in the development of new methods to measure and assure the safety and quality of robotics software, we systematically curated a dataset of 221 bugs across 7 popular and diverse software systems implemented via the Robot Operating System (ROS). We produce historically accurate recreations of each of the 221 defective software versions in the form of Docker images, and use a grounded theory approach to examine and categorize their corresponding faults, failures, and fixes. Finally, we reflect on the implications of our findings and outline future research directions for the community.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"121 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140197558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An empirical study of untangling patterns of two-class dependency cycles","authors":"Qiong Feng, Shuwen Liu, Huan Ji, Xiaotian Ma, Peng Liang","doi":"10.1007/s10664-023-10438-0","DOIUrl":"https://doi.org/10.1007/s10664-023-10438-0","url":null,"abstract":"<p>Dependency cycles pose a significant challenge to software quality and maintainability. However, there is limited understanding of how practitioners resolve dependency cycles in real-world scenarios. This paper presents an empirical study investigating the recurring patterns employed by software developers to resolve dependency cycles between two classes in practice. We analyzed the data from 38 open-source projects across different domains and manually inspected hundreds of cycle untangling cases. Our findings reveal that developers tend to employ five recurring patterns to address dependency cycles. The chosen patterns are not only determined by dependency relations between cyclic classes, but also highly related to their design context, i.e., how cyclic classes depend on or are depended by their neighbor classes. Through this empirical study, we also discovered three common counterintuitive solutions developers usually adopted during cycles’ handling. These recurring patterns and common counterintuitive solutions observed in dependency cycles’ practice can serve as a taxonomy to improve developers’ awareness and also be used as learning materials for students in software engineering and inexperienced developers. Our results also suggest that, in addition to considering the internal structure of dependency cycles, automatic tools need to consider the design context of cycles to provide better support for refactoring dependency cycles.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"12 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140115349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine learning-based test smell detection","authors":"Valeria Pontillo, Dario Amoroso d’Aragona, Fabiano Pecorelli, Dario Di Nucci, Filomena Ferrucci, Fabio Palomba","doi":"10.1007/s10664-023-10436-2","DOIUrl":"https://doi.org/10.1007/s10664-023-10436-2","url":null,"abstract":"<p>Test smells are symptoms of sub-optimal design choices adopted when developing test cases. Previous studies have proved their harmfulness for test code maintainability and effectiveness. Therefore, researchers have been proposing automated, heuristic-based techniques to detect them. However, the performance of these detectors is still limited and dependent on tunable thresholds. We design and experiment with a novel test smell detection approach based on machine learning to detect four test smells. First, we develop the largest dataset of manually-validated test smells to enable experimentation. Afterward, we train six machine learners and assess their capabilities in within- and cross-project scenarios. Finally, we compare the ML-based approach with state-of-the-art heuristic-based techniques. The key findings of the study report a negative result. The performance of the machine learning-based detector is significantly better than heuristic-based techniques, but none of the learners able to overcome an average F-Measure of 51%. We further elaborate and discuss the reasons behind this negative result through a qualitative investigation into the current issues and challenges that prevent the appropriate detection of test smells, which allowed us to catalog the next steps that the research community may pursue to improve test smell detection techniques.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"31 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140034815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the readability of test code","authors":"","doi":"10.1007/s10664-023-10390-z","DOIUrl":"https://doi.org/10.1007/s10664-023-10390-z","url":null,"abstract":"<h3>Abstract</h3> <span> <h3>Context</h3> <p>The readability of source code is key for understanding and maintaining software systems and tests. Although several studies investigate the readability of source code, there is limited research specifically on the readability of test code and related influence factors.</p> </span> <span> <h3>Objective</h3> <p>In this paper, we aim at investigating the factors that influence the readability of test code from an academic perspective based on scientific literature sources and complemented by practical views, as discussed in grey literature.</p> </span> <span> <h3>Methods</h3> <p>First, we perform a Systematic Mapping Study (SMS) with a focus on scientific literature. Second, we extend this study by reviewing grey literature sources for practical aspects on test code readability and understandability. Finally, we conduct a controlled experiment on the readability of a selected set of test cases to collect additional knowledge on influence factors discussed in practice.</p> </span> <span> <h3>Results</h3> <p>The result set of the SMS includes 19 primary studies from the scientific literature for further analysis. The grey literature search reveals 62 sources for information on test code readability. Based on an analysis of these sources, we identified a combined set of 14 factors that influence the readability of test code. 7 of these factors were found in scientific <em>and</em> grey literature, while some factors were mainly discussed in academia (2) <em>or</em> industry (5) with only limited overlap. The controlled experiment on practically relevant influence factors showed that the investigated factors have a significant impact on readability for half of the selected test cases.</p> </span> <span> <h3>Conclusion</h3> <p>Our review of scientific and grey literature showed that test code readability is of interest for academia and industry with a consensus on key influence factors. However, we also found factors only discussed by practitioners. For some of these factors we were able to confirm an impact on readability in a first experiment. Therefore, we see the need to bring together academic and industry viewpoints to achieve a common view on the readability of software test code.</p> </span>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"2016 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139980044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"When less is more: on the value of “co-training” for semi-supervised software defect predictors","authors":"Suvodeep Majumder, Joymallya Chakraborty, Tim Menzies","doi":"10.1007/s10664-023-10418-4","DOIUrl":"https://doi.org/10.1007/s10664-023-10418-4","url":null,"abstract":"<p>Labeling a module defective or non-defective is an expensive task. Hence, there are often limits on how much-labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects– and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised “co-training methods” work significantly better than other approaches. Specifically, after labeling, just 2.5% of data, then make predictions that are competitive to those using 100% of the data. That said, co-training needs to be used cautiously since the specific choice of co-training methods needs to be carefully selected based on a user’s specific goals. Also, we warn that a commonly-used co-training method (“multi-view”– where different learners get different sets of columns) does not improve predictions (while adding too much to the run time costs 11 hours vs. 1.8 hours). It is an open question, worthy of future work, to test if these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the codes used are available at https://github.com/ai-se/Semi-Supervised.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"27 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139947097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Traceability and reuse mechanisms, the most important properties of model transformation languages","authors":"Stefan Höppner, Matthias Tichy","doi":"10.1007/s10664-023-10428-2","DOIUrl":"https://doi.org/10.1007/s10664-023-10428-2","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">\u0000<b>Context</b>\u0000</h3><p>Dedicated model transformation languages are claimed to provide many benefits over the use of general purpose languages for developing model transformations. However, the actual advantages and disadvantages associated with the use of model transformation languages are poorly understood empirically. There is little knowledge and even less empirical assessment about what advantages and disadvantages hold in which cases and where they originate from. In a prior interview study, we elicited expert opinions on what advantages result from what factors surrounding model transformation languages as well as a number of moderating factors that moderate the influence.</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Objective</b>\u0000</h3><p>We aim to quantitatively asses the interview results to confirm or reject the influences and moderation effects posed by different factors. We further intend to gain insights into how valuable different factors are to the discussion so that future studies can draw on these data for designing targeted and relevant studies.</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Method</b>\u0000</h3><p>We gather data on the factors and quality attributes using an online survey. To analyse the data and examine the hypothesised influences and moderations, we use universal structure modelling based on a structural equation model. Universal structure modelling produces significance values and path coefficients for each hypothesised and modelled interdependence between factors and quality attributes that can be used to confirm or reject correlation and to weigh the strength of influence present.</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Results</b>\u0000</h3><p>We analyzed 113 responses. The results show that the MTL capabilities Tracing and Reuse Mechanisms are most important overall. Though the observed effects were generally 10 times lower than anticipated. Furthermore, we found that moderation effects need to be individually assessed for each influence on a quality attribute. The moderation effects of a single moderating variable vary significantly for each influence, with the strongest effects being 1000 times higher than the weakest.</p><h3 data-test=\"abstract-sub-heading\">\u0000<b>Conclusion</b>\u0000</h3><p>The empirical assessment of MTLs is a complex topic that cannot be solved by looking at a single stand-alone factor. Our results provide clear indication that evaluation should consider transformations of different sizes and use-cases that go beyond mapping one elements attributes to another. 
Language development on the other hand should focus on providing practical, transformation specific reuse mechanisms that allow MTLs to excel in areas such as maintainability and productivity compared to GPLs.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"42 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139947054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
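Universal structure modelling builds on structural equation models, which estimate path coefficients and significance values for hypothesized influences. A minimal sketch of a related SEM analysis with the semopy package, using hypothetical variable names and an assumed CSV of Likert-scale survey responses; the authors' actual model relates far more factors and moderators.

```python
import pandas as pd
from semopy import Model

# Hypothetical model specification: two assumed MTL capabilities
# influencing two assumed quality attributes.
spec = """
maintainability ~ tracing + reuse_mechanisms
productivity    ~ tracing + reuse_mechanisms
"""

df = pd.read_csv("survey_responses.csv")  # assumed file of survey answers
model = Model(spec)
model.fit(df)
print(model.inspect())  # path coefficients and p-values per hypothesized edge
```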
{"title":"An empirical study of attack-related events in DeFi projects development","authors":"Dongming Xiang, Yuanchang Lin, Liming Nie, Yaowen Zheng, Zhengzi Xu, Zuohua Ding, Yang Liu","doi":"10.1007/s10664-024-10447-7","DOIUrl":"https://doi.org/10.1007/s10664-024-10447-7","url":null,"abstract":"<p>Decentralized Finance (DeFi) offers users decentralized financial services that are associated with the security of their assets. If DeFi is attacked, it could lead to considerable losses. Unfortunately, there is a lack of research on how DeFi developers respond to attacks during the development process. This lack of knowledge makes it difficult to identify which attacks to protect against and to create a comprehensive attack response system. This paper presents an empirical study to understand the current state of developers’ response to attacks during the development process. In addition, we conduct an analytical framework to help developers take preventive measures against attacks. Our research has revealed that Overflow Attack-related events are the most frequent (63, 19.75% of all attack-related events), and high-value DeFi projects tend to have more feedback and active development activities. We have observed that most of the attack instances (61, 85.92%) do not have corresponding attack-related development events, which can lead to a lack of trust between project teams and users if it is unclear whether the team responds to attacks. Furthermore, we have noticed that after the resolution of the same attack-related event, some attacks may recur, even though they could have been prevented. Consequently, we suggest some future research directions and provide some advice for DeFi project developers.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"24 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139947096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}