Semantic matching in GUI test reuse
Farideh Khalili, Leonardo Mariani, Ali Mohebbi, Mauro Pezzè, Valerio Terragni
Empirical Software Engineering, published 2024-05-09. DOI: 10.1007/s10664-023-10406-8

Abstract: Reusing test cases across apps that share similar functionalities reduces both the effort required to produce useful test cases and the time needed to bring reliable apps to market. The main approaches to test reuse combine different semantic matching and test generation algorithms to migrate test cases across Android apps. In this paper we define a general framework to evaluate the impact and effectiveness of different choices of semantic matching within test reuse approaches when migrating test cases across Android apps. We offer a thorough comparative evaluation of the many possible choices for the components of the test migration process, and we propose an approach that combines the most effective choice for each component. We report the results of an experimental evaluation on 8,099 GUI events from 337 test configurations. The results attest to the prominent impact of semantic matching on test reuse. They indicate that sentence-level embedding techniques perform better than word-level ones. Surprisingly, they suggest that the corpus of documents used to build the word embedding model for the semantic matching algorithm has a negligible impact. They provide evidence that semantic matching restricted to events of selected types performs better than semantic matching of events of all types, and they show that the effectiveness of the overall test reuse approach depends on the characteristics of the test suites and apps. The replication package that we make publicly available online (https://star.inf.usi.ch/#/software-data/11) allows researchers and practitioners to refine the results with additional experiments and to evaluate other choices for test reuse components.
Just-in-Time crash prediction for mobile apps
Chathrie Wimalasooriya, Sherlock A. Licorish, Daniel Alencar da Costa, Stephen G. MacDonell
Empirical Software Engineering, published 2024-05-08. DOI: 10.1007/s10664-024-10455-7

Abstract: Just-In-Time (JIT) defect prediction aims to identify defects early, at commit time, so that developers can take precautions while the code changes are still fresh in their minds. However, the utility of JIT defect prediction has not been investigated for crashes of mobile apps. We therefore conducted a multi-case study employing both quantitative and qualitative analysis. In the quantitative analysis, we used machine learning techniques for prediction. We collected 113 reliability-related metrics for about 30,000 commits from 14 Android apps and selected 14 important metrics for prediction. We found that both standard JIT metrics and static analysis warnings are important for JIT prediction of mobile app crashes. We further optimized prediction performance by comparing seven state-of-the-art defect prediction techniques with hyperparameter optimization. Our results showed that Random Forest is the best-performing model, with an AUC-ROC of 0.83. In the qualitative analysis, we manually analysed a sample of 642 commits and identified the types of changes that are common in crash-inducing commits. We explored whether different aspects of these changes can be used as metrics in JIT models to improve prediction performance, and found that they improve it significantly. Hence, we suggest considering static analysis warnings and Android-specific metrics when adapting standard JIT defect prediction models to predict crashes in a mobile context. Finally, we provide recommendations to bridge the gap between research and practice and point to opportunities for future research.
{"title":"Analyzing and revivifying function signature inference using deep learning","authors":"Yan Lin, Trisha Singhal, Debin Gao, David Lo","doi":"10.1007/s10664-024-10453-9","DOIUrl":"https://doi.org/10.1007/s10664-024-10453-9","url":null,"abstract":"<p>Function signature plays an important role in binary analysis and security enhancement, with typical examples in bug finding and control-flow integrity enforcement. However, recovery of function signatures by static binary analysis is challenging since crucial information vital for such recovery is stripped off during compilation. Although function signature recovery using deep learning (DL) is proposed in an effort to handle such challenges, the reported accuracy is low for binaries compiled with optimizations. In this paper, we first perform a systematic study to quantify the extent to which compiler optimizations (negatively) impact the accuracy of existing DL techniques based on Recurrent Neural Network (RNN) for function signature recovery. Our experiments show that the state-of-the-art DL technique has its accuracy dropped from 98.7% to 87.7% when training and testing optimized binaries. We further investigate the type of instructions that existing RNN model deems most important in inferring function signatures with the help of saliency map. The results show that existing RNN model mistakenly considers non-argument-accessing instructions to infer the number of arguments, especially when dealing with optimized binaries. Finally, we identify specific weaknesses in such existing approaches and propose an enhanced DL approach named ReSIL to incorporate compiler-optimization-specific domain knowledge into the learning process. Our experimental results show that ReSIL significantly improves the accuracy and F1 score in inferring function signatures, e.g., with accuracy in inferring the number of arguments for callees compiled with optimization flag O1 from 84.83% to 92.68%. Meanwhile, ReSIL correctly considers the argument-accessing instructions as the most important ones to perform the inferencing. We also demonstrate security implications of ReSIL in Control-Flow Integrity enforcement in stopping potential Counterfeit Object-Oriented Programming (COOP) attacks.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"2021 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140933239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ethics in AI through the practitioner’s view: a grounded theory literature review","authors":"Aastha Pant, Rashina Hoda, Chakkrit Tantithamthavorn, Burak Turhan","doi":"10.1007/s10664-024-10465-5","DOIUrl":"https://doi.org/10.1007/s10664-024-10465-5","url":null,"abstract":"<p>The term ethics is widely used, explored, and debated in the context of developing Artificial Intelligence (AI) based software systems. In recent years, numerous incidents have raised the profile of ethical issues in AI development and led to public concerns about the proliferation of AI technology in our everyday lives. But what do we know about the views and experiences of those who develop these systems – the AI practitioners? We conducted a grounded theory literature review (GTLR) of 38 primary empirical studies that included AI practitioners’ views on ethics in AI and analysed them to derive five categories: practitioner <i>awareness</i>, <i>perception</i>, <i>need</i>, <i>challenge</i>, and <i>approach</i>. These are underpinned by multiple codes and concepts that we explain with evidence from the included studies. We present a <i>taxonomy of ethics in AI from practitioners’ viewpoints</i> to assist AI practitioners in identifying and understanding the different aspects of AI ethics. The taxonomy provides a landscape view of the key aspects that concern AI practitioners when it comes to ethics in AI. We also share an agenda for future research studies and recommendations for practitioners, managers, and organisations to help in their efforts to better consider and implement ethics in AI.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"161 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140888028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hunting bugs: Towards an automated approach to identifying which change caused a bug through regression testing
Michel Maes-Bermejo, Alexander Serebrenik, Micael Gallego, Francisco Gortázar, Gregorio Robles, Jesús María González Barahona
Empirical Software Engineering, published 2024-05-04. DOI: 10.1007/s10664-024-10479-z

Context: Finding the code changes that introduced bugs is important for both practitioners and researchers, but doing it precisely is a manual, effort-intensive process. The perfect test method is a theoretical construct aimed at detecting Bug-Introducing Changes (BICs) through a hypothetical perfect test that always fails when the bug is present and passes otherwise.

Objective: To explore a possible automatic operationalization of the perfect test method.

Method: We use regression tests as substitutes for the perfect test. To this end, we transplant the regression tests to past snapshots of the code and use them to identify the BIC, on a well-known collection of bugs from the Defects4J dataset.

Results: Running our operationalization of the perfect test method on the 809 bugs in the dataset, the BIC was identified precisely for 95 of them, and in the remaining 4 cases a list of candidates including the BIC was provided.

Conclusions: We demonstrate that operationalizing the perfect test method through regression tests is feasible and can be completely automated in practice when tests can be transplanted and run in past snapshots of the code. Given that implementing regression tests when a bug is fixed is considered good practice, developers who follow it can effortlessly detect bug-introducing changes by using our operationalization of the perfect test method.
{"title":"VioDroid-Finder: automated evaluation of compliance and consistency for Android apps","authors":"Junren Chen, Cheng Huang, Jiaxuan Han","doi":"10.1007/s10664-024-10470-8","DOIUrl":"https://doi.org/10.1007/s10664-024-10470-8","url":null,"abstract":"<p>Rapid growth in the variety and quantity of apps makes it difficult for users to protect their privacy, although existing regulations have been introduced and the Android ecosystem is constantly being improved, there are still violations as privacy policies may not fully comply with regulations, and app behavior may not be fully consistent with privacy policies. To solve such issues, this paper proposes an automated method called VioDroid-Finder aiming at the evaluation of compliance and consistency for Android apps. We first study existing common regulations and conclude the privacy policy content into 7 aspects (i.e., privacy categories), for privacy policies, different compliance rules are required to be complied with in each privacy category. Secondly, we present a policy structure parser model based on the structure extraction/rebuilding method (which can convert the unstructured text to an XML tree) and subtitle similarity calculation algorithm. Thirdly, we propose a violation analyzer using the BERT model to classify each sentence in the privacy policy, we collect existing issues and combine them with manual observations to define 6 types of violations and detect them based on classification results. Then, we propose an inconsistency analyzer that converts permissions, APIs, and GUI into a set of personal information based on static analysis, inconsistencies are detected by comparing that set with personal information declared in the privacy policy. Finally, we evaluate 600 Chinese apps using the proposed method, from which we detect many violations and inconsistencies reflecting the current widespread privacy violation issues.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"30 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140887979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How do annotations affect Java code readability?
Eduardo Guerra, Everaldo Gomes, Jeferson Ferreira, Igor Wiese, Phyllipe Lima, Marco Gerosa, Paulo Meirelles
Empirical Software Engineering, published 2024-05-03. DOI: 10.1007/s10664-024-10460-w

Context: Code annotations have gained widespread popularity in programming languages, offering developers the ability to attach metadata to code elements to define custom behaviors. Many modern frameworks and APIs use annotations to keep integration less verbose and located nearer to the corresponding code element. Despite these advantages, practitioners' anecdotal evidence suggests that annotations might negatively affect code readability.

Objective: To better understand this effect, this paper systematically investigates the relationship between code annotations and code readability.

Method: In a survey with software developers (n=332), we presented 15 pairs of Java code snippets with and without code annotations. These pairs were designed to cover five categories of annotation usage in real-world Java frameworks and APIs. Survey participants selected the code snippet they considered more readable for each pair and answered an open question about how annotations affect code readability.

Results: Preferences were scattered across all categories of annotation usage, revealing no consensus among participants. The answers remained spread even when segregated by participants' programming or annotation-related experience. Nevertheless, some participants showed a consistent preference for or against annotations across all categories, which may indicate a personal preference. Our qualitative analysis of the open-ended answers revealed that participants often praised the impact of annotations on design, maintainability, and productivity, but expressed contrasting views on understandability and code clarity.

Conclusions: Software developers and API designers can consider our results when deciding whether to use annotations, equipped with the insight that developers express contrasting views of the annotations' impact on code readability.
{"title":"Patterns of multi-container composition for service orchestration with Docker Compose","authors":"Kalvin Eng, Abram Hindle, Eleni Stroulia","doi":"10.1007/s10664-024-10462-8","DOIUrl":"https://doi.org/10.1007/s10664-024-10462-8","url":null,"abstract":"<p>Software design patterns present general code solutions to common software design problems. Modern software systems rely heavily on containers for running their constituent service components. Yet, despite the prevalence of ready-to-use Docker service images ready to participate in multi-container service compositions of applications, developers do not have much guidance on how to compose their own Docker service orchestrations. Thus in this work, we curate a dataset of successful projects that employ Docker Compose as an orchestration tool to run multiple service containers; then, we engage in qualitative and quantitative analysis of Docker Compose configurations. The collection of data and analysis enables the identification and naming of repeating multi-container composition patterns that are used in numerous successful open-source projects, much like software design patterns. These patterns highlight how software systems are orchestrated in the real-world and can give examples to anybody wishing to compose their own service orchestrations. These contributions also advance empirical research in software engineering patterns as evidence is provided about how Docker Compose is used.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"15 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140887977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic bi-modal question title generation for Stack Overflow with prompt learning
Shaoyu Yang, Xiang Chen, Ke Liu, Guang Yang, Chi Yu
Empirical Software Engineering, published 2024-05-03. DOI: 10.1007/s10664-024-10466-4

Abstract: When drafting question posts for Stack Overflow, developers may not accurately summarize the core problem in the question title, which can prevent these questions from getting timely help. Improving the quality of question titles has therefore attracted wide attention from researchers. An initial study aimed to generate titles automatically by analyzing only the code snippets in the question body, ignoring the helpful information in the corresponding problem descriptions. We therefore propose an approach, SOTitle+, that considers the bi-modal information (i.e., the code snippets and the problem descriptions) in the question body. We formalize title generation for different programming languages as separate but related tasks and utilize multi-task learning to solve them, fine-tuning the pre-trained language model CodeT5 to generate the titles. However, the inconsistency in inputs and optimization objectives between the pre-training task and our investigated task may prevent fine-tuning from fully exploiting the knowledge in the pre-trained model. To address this issue, SOTitle+ further prompt-tunes CodeT5 with hybrid prompts (i.e., a mixture of hard and soft prompts). To verify the effectiveness of SOTitle+, we construct a large-scale, high-quality corpus from recent data dumps shared by Stack Overflow, comprising 179,119 high-quality question posts across six popular programming languages. Experimental results show that SOTitle+ significantly outperforms four state-of-the-art baselines in both automatic and human evaluation. Our ablation studies further confirm the effectiveness of the component settings of SOTitle+ (such as bi-modal information, prompt learning, hybrid prompts, and multi-task learning). Our work indicates that considering bi-modal information and prompt learning in Stack Overflow title generation is a promising exploration direction.
An empirical study into the effects of transpilation on quantum circuit smells
Manuel De Stefano, Dario Di Nucci, Fabio Palomba, Andrea De Lucia
Empirical Software Engineering, published 2024-05-02. DOI: 10.1007/s10664-024-10461-9

Abstract: Quantum computing is a promising field that can solve complex problems beyond the capabilities of traditional computers. Developing high-quality quantum software applications, a discipline called quantum software engineering, has recently gained attention. However, quantum software development faces challenges related to code quality. A recent study found that many open-source quantum programs are affected by quantum-specific code smells, with long circuit being the most common. While that study provided relevant insights into the prevalence of code smells in quantum circuits, it did not explore the potential effect of transpilation, a necessary step for executing quantum programs, on the emergence of code smells. Indeed, transpilation might alter the very characteristics used to detect the presence of a smell in a circuit. To address this limitation, we present a new study investigating the impact of transpilation on quantum-specific code smells and how different target gate sets affect the results. We conducted experiments on 17 open-source quantum programs alongside a set of 100 synthetic circuits. We found that transpilation can significantly alter the metrics used to detect code smells, even introducing smells into previously smell-free circuits, with the long circuit smell being the most susceptible to transpilation. Furthermore, the choice of gate set significantly influences the presence and severity of code smells in transpiled circuits, highlighting the need for careful gate set selection to mitigate their impact. These findings have implications for circuit optimization and high-quality quantum software development. Further research is needed to understand the consequences of code smells and their potential impact on quantum computations, considering the characteristics and constraints of different gate sets and hardware platforms.