Nathan Cassee, Andrei Agaronian, Eleni Constantinou, Nicole Novielli, Alexander Serebrenik
{"title":"Transformers and meta-tokenization in sentiment analysis for software engineering","authors":"Nathan Cassee, Andrei Agaronian, Eleni Constantinou, Nicole Novielli, Alexander Serebrenik","doi":"10.1007/s10664-024-10468-2","DOIUrl":"https://doi.org/10.1007/s10664-024-10468-2","url":null,"abstract":"<p>Sentiment analysis has been used to study aspects of software engineering, such as issue resolution, toxicity, and self-admitted technical debt. To address the peculiarities of software engineering texts, sentiment analysis tools often consider the specific technical lingo practitioners use. To further improve the application of sentiment analysis, there have been two recommendations: Using pre-trained transformer models to classify sentiment and replacing non-natural language elements with meta-tokens. In this work, we benchmark five different sentiment analysis tools (two pre-trained transformer models and three machine learning tools) on 2 gold-standard sentiment analysis datasets. We find that pre-trained transformers outperform the best machine learning tool on only one of the two datasets, and that even on that dataset the performance difference is a few percentage points. Therefore, we recommend that software engineering researchers should not just consider predictive performance when selecting a sentiment analysis tool because the best-performing sentiment analysis tools perform very similarly to each other (within 4 percentage points). Meanwhile, we find that meta-tokenization does not improve the predictive performance of sentiment analysis tools. Both of our findings can be used by software engineering researchers who seek to apply sentiment analysis tools to software engineering data.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"8 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards graph-anonymization of software analytics data: empirical study on JIT defect prediction","authors":"Akshat Malik, Bram Adams, Ahmed Hassan","doi":"10.1007/s10664-024-10464-6","DOIUrl":"https://doi.org/10.1007/s10664-024-10464-6","url":null,"abstract":"<p>As the usage of software analytics for understanding different organizational practices becomes prevalent, it is important that data for these practices is shared across different organizations to build a common understanding of software systems and processes. Yet, organizations are hesitant to share this data and trained models with one another due to concerns around privacy, e.g., because of the risk of reverse engineering the training data of the models. To facilitate data sharing, tabular anonymization techniques like MORPH, LACE and LACE2 have been proposed to provide privacy to defect prediction data. However, said techniques treat data points as individual elements, and lose the context between different features when performing anonymization. We study the effect of four anonymization techniques, i.e., Random Add/Delete, Random Switch, k-DA and Generalization, on the privacy score and performance in six large, long-lived projects. To measure privacy, we use the IPR metric, which is a measure of the inability of an attacker to extract information about sensitive attributes from the anonymized data. We find that all four graph anonymization techniques are able to provide privacy scores higher than 65% in all the datasets, while Random Add/ Delete and Random Switch are even able to achieve privacy scores of 80% and greater in all datasets. For techniques achieving privacy scores of 65%, the AUC and Recall decreased by a median of 1.45% and 5.35%, respectively. For techniques with privacy scores 80% or greater, the AUC and Recall of privatized models decreased by a median of 6.44% and 20.29%, respectively. The state-of-the-art tabular techniques like MORPH, LACE and LACE2 provide high privacy scores (89%-99%); however, they have a higher impact on performance with a median decrease of 21.15% in AUC and 80.34% in Recall. Furthermore, since privacy scores 65% or greater are adequate for sharing, the graph anonymization techniques are able to provide more configurable results where one can make trade-offs between privacy and performance. When compared to unsupervised techniques like a JIT variant of ManualDown, the GA techniques perform comparable or significantly better for AUC, G-Mean and FPR metrics. Our work shows that graph anonymization can be an effective way of providing privacy while preserving model performance.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"88 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141190200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samuel Idowu, Osman Osman, Daniel Strüber, Thorsten Berger
{"title":"Machine learning experiment management tools: a mixed-methods empirical study","authors":"Samuel Idowu, Osman Osman, Daniel Strüber, Thorsten Berger","doi":"10.1007/s10664-024-10444-w","DOIUrl":"https://doi.org/10.1007/s10664-024-10444-w","url":null,"abstract":"<p>Machine Learning (ML) experiment management tools support ML practitioners and software engineers when building intelligent software systems. By managing large numbers of ML experiments comprising many different ML assets, they not only facilitate engineering ML models and ML-enabled systems, but also managing their evolution—for instance, tracing system behavior to concrete experiments when the model performance drifts. However, while ML experiment management tools have become increasingly popular, little is known about their effectiveness in practice, as well as their actual benefits and challenges. We present a mixed-methods empirical study of experiment management tools and the support they provide to users. First, our survey of 81 ML practitioners sought to determine the benefits and challenges of ML experiment management and of the existing tool landscape. Second, a controlled experiment with 15 student developers investigated the effectiveness of ML experiment management tools. We learned that 70% of our survey respondents perform ML experiments using specialized tools, while out of those who do not use such tools, 52% are unaware of experiment management tools or of their benefits. The controlled experiment showed that experiment management tools offer valuable support to users to systematically track and retrieve ML assets. Using ML experiment management tools reduced error rates and increased completion rates. By presenting a user’s perspective on experiment management tools, and the first controlled experiment in this area, we hope that our results foster the adoption of these tools in practice, as well as they direct tool builders and researchers to improve the tool landscape overall.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"56 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141171486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Do Agile scaling approaches make a difference? an empirical comparison of team effectiveness across popular scaling approaches","authors":"Christiaan Verwijs, Daniel Russo","doi":"10.1007/s10664-024-10481-5","DOIUrl":"https://doi.org/10.1007/s10664-024-10481-5","url":null,"abstract":"<p>With the prevalent use of Agile methodologies, organizations are grappling with the challenge of scaling development across numerous teams. This has led to the emergence of diverse scaling strategies, from complex ones such as “SAFe\", to more simplified methods e.g., “LeSS\", with some organizations devising their unique approaches. While there have been multiple studies exploring the organizational challenges associated with different scaling approaches, so far, no one has compared these strategies based on empirical data derived from a uniform measure. This makes it hard to draw robust conclusions about how different scaling approaches affect Agile team effectiveness. Thus, the objective of this study is to assess the effectiveness of Agile teams across various scaling approaches, including “SAFe\", “LeSS\", “Scrum of Scrums\", and custom methods, as well as those not using scaling. This study focuses initially on responsiveness, stakeholder concern, continuous improvement, team autonomy, management approach, and overall team effectiveness, followed by an evaluation based on stakeholder satisfaction regarding value, responsiveness, and release frequency. To achieve this, we performed a comprehensive survey involving 15,078 members of 4,013 Agile teams to measure their effectiveness, combined with satisfaction surveys from 1,841 stakeholders of 529 of those teams. We conducted a series of inferential statistical analyses, including Analysis of Variance and multiple linear regression, to identify any significant differences, while controlling for team experience and organizational size. The findings of the study revealed some significant differences, but their magnitude and effect size were considered too negligible to have practical significance. In conclusion, the choice of Agile scaling strategy does not markedly influence team effectiveness, and organizations are advised to choose a method that best aligns with their previous experiences with Agile, organizational culture, and management style.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"68 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141171483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
William Levén, Hampus Broman, Terese Besker, Richard Torkar
{"title":"The broken windows theory applies to technical debt","authors":"William Levén, Hampus Broman, Terese Besker, Richard Torkar","doi":"10.1007/s10664-024-10456-6","DOIUrl":"https://doi.org/10.1007/s10664-024-10456-6","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context:</h3><p>The term <i>technical debt</i> (TD) describes the aggregation of sub-optimal solutions that serve to impede the evolution and maintenance of a system. Some claim that the <i>broken windows theory</i> (BWT), a concept borrowed from criminology, also applies to software development projects. The theory states that the presence of indications of previous crime (such as a broken window) will increase the likelihood of further criminal activity; TD could be considered the <i>broken windows</i> of software systems.</p><h3 data-test=\"abstract-sub-heading\">Objective:</h3><p>To empirically investigate the causal relationship between the TD density of a system and the propensity of developers to introduce new TD during the extension of that system.</p><h3 data-test=\"abstract-sub-heading\">Method:</h3><p>The study used a mixed-methods research strategy consisting of a controlled experiment with an accompanying survey and follow-up interviews. The experiment had a total of 29 developers of varying experience levels completing system extension tasks in already existing systems with high or low TD density.</p><h3 data-test=\"abstract-sub-heading\">Results:</h3><p>The analysis revealed significant effects of TD level on the subjects’ tendency to re-implement (rather than reuse) functionality, choose non-descriptive variable names, and introduce other <i>code smells</i> identified by the software tool <span>SonarQube</span>, all with at least <span>(95%)</span> credible intervals.</p><h3 data-test=\"abstract-sub-heading\">Coclusions:</h3><p>Three separate significant results along with a validating qualitative result combine to form substantial evidence of the BWT’s existence in software engineering contexts. This study finds that existing TD can have a major impact on developers propensity to introduce new TD of various types during development.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"282 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141152952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matteo Biagiola, Andrea Stocco, Vincenzo Riccio, Paolo Tonella
{"title":"Two is better than one: digital siblings to improve autonomous driving testing","authors":"Matteo Biagiola, Andrea Stocco, Vincenzo Riccio, Paolo Tonella","doi":"10.1007/s10664-024-10458-4","DOIUrl":"https://doi.org/10.1007/s10664-024-10458-4","url":null,"abstract":"<p>Simulation-based testing represents an important step to ensure the reliability of autonomous driving software. In practice, when companies rely on third-party general-purpose simulators, either for in-house or outsourced testing, the generalizability of testing results to real autonomous vehicles is at stake. In this paper, we enhance simulation-based testing by introducing the notion of <i>digital siblings</i>—a multi-simulator approach that tests a given autonomous vehicle on multiple general-purpose simulators built with different technologies, that operate collectively as an ensemble in the testing process. We exemplify our approach on a case study focused on testing the lane-keeping component of an autonomous vehicle. We use two open-source simulators as digital siblings, and we empirically compare such a multi-simulator approach against a digital twin of a physical scaled autonomous vehicle on a large set of test cases. Our approach requires generating and running test cases for each individual simulator, in the form of sequences of road points. Then, test cases are migrated between simulators, using feature maps to characterize the exercised driving conditions. Finally, the joint predicted failure probability is computed, and a failure is reported only in cases of agreement among the siblings. Our empirical evaluation shows that the ensemble failure predictor by the digital siblings is superior to each individual simulator at predicting the failures of the digital twin. We discuss the findings of our case study and detail how our approach can help researchers interested in automated testing of autonomous driving software.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"2015 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141062188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Farideh Khalili, Leonardo Mariani, Ali Mohebbi, Mauro Pezzè, Valerio Terragni
{"title":"Semantic matching in GUI test reuse","authors":"Farideh Khalili, Leonardo Mariani, Ali Mohebbi, Mauro Pezzè, Valerio Terragni","doi":"10.1007/s10664-023-10406-8","DOIUrl":"https://doi.org/10.1007/s10664-023-10406-8","url":null,"abstract":"<p>Reusing test cases across apps that share similar functionalities reduces both the effort required to produce useful test cases and the time to offer reliable apps to the market. The main approaches to reuse test cases across apps combine different semantic matching and test generation algorithms to migrate test cases across <span>Android</span> apps. In this paper we define a general framework to evaluate the impact and effectiveness of different choices of semantic matching with <span>Test Reuse</span> approaches on migrating test cases across <span>Android</span> apps. We offer a thorough comparative evaluation of the many possible choices for the components of test migration processes. We propose an approach that combines the most effective choices for each component of the test migration process to obtain an effective approach. We report the results of an experimental evaluation on 8,099 GUI events from 337 test configurations. The results attest the prominent impact of semantic matching on test reuse. They indicate that sentence level perform better than word level embedding techniques. They surprisingly suggest a negligible impact of the corpus of documents used for building the word embedding model for the <span>Semantic Matching Algorithm</span>. They provide evidence that semantic matching of events of selected types perform better than semantic matching of events of all types. They show that the effectiveness of overall <span>Test Reuse</span> approach depends on the characteristics of the test suites and apps. The replication package that we make publicly available online (https://star.inf.usi.ch/#/software-data/11) allows researchers and practitioners to refine the results with additional experiments and evaluate other choices for test reuse components.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"66 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140933691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chathrie Wimalasooriya, Sherlock A. Licorish, Daniel Alencar da Costa, Stephen G. MacDonell
{"title":"Just-in-Time crash prediction for mobile apps","authors":"Chathrie Wimalasooriya, Sherlock A. Licorish, Daniel Alencar da Costa, Stephen G. MacDonell","doi":"10.1007/s10664-024-10455-7","DOIUrl":"https://doi.org/10.1007/s10664-024-10455-7","url":null,"abstract":"<p>Just-In-Time (JIT) defect prediction aims to identify defects early, at commit time. Hence, developers can take precautions to avoid defects when the code changes are still fresh in their minds. However, the utility of JIT defect prediction has not been investigated in relation to crashes of mobile apps. We therefore conducted a multi-case study employing both quantitative and qualitative analysis. In the quantitative analysis, we used machine learning techniques for prediction. We collected 113 reliability-related metrics for about 30,000 commits from 14 Android apps and selected 14 important metrics for prediction. We found that both standard JIT metrics and static analysis warnings are important for JIT prediction of mobile app crashes. We further optimized prediction performance, comparing seven state-of-the-art defect prediction techniques with hyperparameter optimization. Our results showed that Random Forest is the best performing model with an AUC-ROC of 0.83. In our qualitative analysis, we manually analysed a sample of 642 commits and identified different types of changes that are common in crash-inducing commits. We explored whether different aspects of changes can be used as metrics in JIT models to improve prediction performance. We found these metrics improve the prediction performance significantly. Hence, we suggest considering static analysis warnings <i>and</i> Android-specific metrics to adapt standard JIT defect prediction models for a mobile context to predict crashes. Finally, we provide recommendations to bridge the gap between research and practice and point to opportunities for future research.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"205 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140933426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing and revivifying function signature inference using deep learning","authors":"Yan Lin, Trisha Singhal, Debin Gao, David Lo","doi":"10.1007/s10664-024-10453-9","DOIUrl":"https://doi.org/10.1007/s10664-024-10453-9","url":null,"abstract":"<p>Function signature plays an important role in binary analysis and security enhancement, with typical examples in bug finding and control-flow integrity enforcement. However, recovery of function signatures by static binary analysis is challenging since crucial information vital for such recovery is stripped off during compilation. Although function signature recovery using deep learning (DL) is proposed in an effort to handle such challenges, the reported accuracy is low for binaries compiled with optimizations. In this paper, we first perform a systematic study to quantify the extent to which compiler optimizations (negatively) impact the accuracy of existing DL techniques based on Recurrent Neural Network (RNN) for function signature recovery. Our experiments show that the state-of-the-art DL technique has its accuracy dropped from 98.7% to 87.7% when training and testing optimized binaries. We further investigate the type of instructions that existing RNN model deems most important in inferring function signatures with the help of saliency map. The results show that existing RNN model mistakenly considers non-argument-accessing instructions to infer the number of arguments, especially when dealing with optimized binaries. Finally, we identify specific weaknesses in such existing approaches and propose an enhanced DL approach named ReSIL to incorporate compiler-optimization-specific domain knowledge into the learning process. Our experimental results show that ReSIL significantly improves the accuracy and F1 score in inferring function signatures, e.g., with accuracy in inferring the number of arguments for callees compiled with optimization flag O1 from 84.83% to 92.68%. Meanwhile, ReSIL correctly considers the argument-accessing instructions as the most important ones to perform the inferencing. We also demonstrate security implications of ReSIL in Control-Flow Integrity enforcement in stopping potential Counterfeit Object-Oriented Programming (COOP) attacks.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"2021 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140933239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ethics in AI through the practitioner’s view: a grounded theory literature review","authors":"Aastha Pant, Rashina Hoda, Chakkrit Tantithamthavorn, Burak Turhan","doi":"10.1007/s10664-024-10465-5","DOIUrl":"https://doi.org/10.1007/s10664-024-10465-5","url":null,"abstract":"<p>The term ethics is widely used, explored, and debated in the context of developing Artificial Intelligence (AI) based software systems. In recent years, numerous incidents have raised the profile of ethical issues in AI development and led to public concerns about the proliferation of AI technology in our everyday lives. But what do we know about the views and experiences of those who develop these systems – the AI practitioners? We conducted a grounded theory literature review (GTLR) of 38 primary empirical studies that included AI practitioners’ views on ethics in AI and analysed them to derive five categories: practitioner <i>awareness</i>, <i>perception</i>, <i>need</i>, <i>challenge</i>, and <i>approach</i>. These are underpinned by multiple codes and concepts that we explain with evidence from the included studies. We present a <i>taxonomy of ethics in AI from practitioners’ viewpoints</i> to assist AI practitioners in identifying and understanding the different aspects of AI ethics. The taxonomy provides a landscape view of the key aspects that concern AI practitioners when it comes to ethics in AI. We also share an agenda for future research studies and recommendations for practitioners, managers, and organisations to help in their efforts to better consider and implement ethics in AI.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"161 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140888028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}