"Challenges and practices of deep learning model reengineering: A case study on computer vision"
Wenxin Jiang, Vishnu Banna, Naveen Vivek, Abhinav Goel, Nicholas Synovic, George K. Thiruvathukal, James C. Davis
Empirical Software Engineering, published 2024-08-20. DOI: 10.1007/s10664-024-10521-0

Context: Many engineering organizations are reimplementing and extending deep neural networks from the research community. We describe this process as deep learning model reengineering. Deep learning model reengineering (reusing, replicating, adapting, and enhancing state-of-the-art deep learning approaches) is challenging for reasons including under-documented reference models, changing requirements, and the cost of implementation and testing.

Objective: Prior work has characterized the challenges of deep learning model development, but as yet we know little about the deep learning model reengineering process and its common challenges. Prior work has examined DL systems from a "product" view, examining defects from projects regardless of the engineers' purpose. Our study examines reengineering activities from a "process" view and focuses on engineers specifically engaged in the reengineering process.

Method: Our goal is to understand the characteristics and challenges of deep learning model reengineering. We conducted a mixed-methods case study of this phenomenon, focusing on the context of computer vision. Our results draw on two data sources: defects reported in open-source reengineering projects, and interviews conducted with practitioners and the leaders of a reengineering team. From the defect data source, we analyzed 348 defects from 27 open-source deep learning projects. Meanwhile, our reengineering team replicated 7 deep learning models over two years; we interviewed 2 open-source contributors, 4 practitioners, and 6 reengineering team leaders to understand their experiences.

Results: Our results describe how deep learning-based computer vision techniques are reengineered, quantitatively analyze the distribution of defects in this process, and qualitatively discuss challenges and practices. We found that most defects (58%) are reported by re-users, and that reproducibility-related defects tend to be discovered during training (68% of them). Our analysis shows that most environment defects (88%) are interface defects, and that the largest share of environment defects (46%) is caused by API defects. We found that training defects have diverse symptoms and root causes. We identified four main challenges in the DL reengineering process: model operationalization, performance debugging, portability of DL operations, and customized data pipelines. Integrating our quantitative and qualitative data, we propose a novel reengineering workflow.

Conclusions: Our findings inform several conclusions, including: standardizing model reengineering practices, developing validation tools to support model reengineering, providing automated support beyond manual model reengineering, and measuring additional, currently unknown aspects of model reengineering.
{"title":"How does parenthood affect an ICT practitioner’s work? A survey study with fathers","authors":"Larissa Rocha, Edna Dias Canedo, Claudia Pinto Pereira, Carla Bezerra, Fabiana Freitas Mendes","doi":"10.1007/s10664-024-10534-9","DOIUrl":"https://doi.org/10.1007/s10664-024-10534-9","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>Many studies have investigated the perception of software development teams about gender bias, inclusion policies, and the impact of remote work on productivity. The studies indicate that mothers and fathers working in the software industry had to reconcile homework, work activities, and child care.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>This study investigates the impact of parenthood on Information and Communications Technology (ICT) and how the fathers perceive the mothers’ challenges. Recognizing their difficulties and knowing the mothers’ challenges can be the first step towards making the work environment friendlier for everyone.</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>We surveyed 155 fathers from industry and academia from 10 different countries, however, most of the respondents are from Brazil (92.3%). Data was analyzed quantitatively and qualitatively. We employed Grounded Theory to identify factors related to (i) paternity leave, (ii) working time after paternity, (iii) childcare-related activities, (iv) prejudice at work after paternity, (v) fathers’ perception of prejudice against mothers, (vi) challenges and difficulties in paternity. We also conduct a correlational and regression analysis to explore some research questions further.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>In general, fathers do not suffer harassment or prejudice at work for being fathers. However, they perceive that mothers suffer distrust in the workplace and live with work overload because they have to dedicate themselves to many activities. They also suggested actions to mitigate parents’ difficulties.</p><h3 data-test=\"abstract-sub-heading\">Conclusions</h3><p>Despite some fathers wanting to participate more in taking care of their children, others do not even recognize the difficulties that mothers can face in the work. Therefore, it is important to explore the problems and implement actions to build a more parent-friendly work environment.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"15 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recommendations for analysing and meta-analysing small sample size software engineering experiments","authors":"Barbara Kitchenham, Lech Madeyski","doi":"10.1007/s10664-024-10504-1","DOIUrl":"https://doi.org/10.1007/s10664-024-10504-1","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>Software engineering (SE) experiments often have small sample sizes. This can result in data sets with non-normal characteristics, which poses problems as standard parametric meta-analysis, using the standardized mean difference (<i>StdMD</i>) effect size, assumes normally distributed sample data. Small sample sizes and non-normal data set characteristics can also lead to unreliable estimates of parametric effect sizes. Meta-analysis is even more complicated if experiments use complex experimental designs, such as two-group and four-group cross-over designs, which are popular in SE experiments.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>Our objective was to develop a validated and robust meta-analysis method that can help to address the problems of small sample sizes and complex experimental designs without relying upon data samples being normally distributed.</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>To illustrate the challenges, we used real SE data sets. We built upon previous research and developed a robust meta-analysis method able to deal with challenges typical for SE experiments. We validated our method via simulations comparing <i>StdMD</i> with two robust alternatives: the probability of superiority (<span>(hat{p})</span>) and Cliffs’ <i>d</i>.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>We confirmed that many SE data sets are small and that small experiments run the risk of exhibiting non-normal properties, which can cause problems for analysing families of experiments. For simulations of individual experiments and meta-analyses of families of experiments, <span>(hat{p})</span> and Cliff’s <i>d</i> consistently outperformed <i>StdMD</i> in terms of negligible small sample bias. They also had better power for log-normal and Laplace samples, although lower power for normal and gamma samples. Tests based on <span>(hat{p})</span> always had better or equal power than tests based on Cliff’s <i>d</i>, and across all but one simulation condition, <span>(hat{p})</span> Type 1 error rates were less biased.</p><h3 data-test=\"abstract-sub-heading\">Conclusions</h3><p>Using <span>(hat{p})</span> is a low-risk option for analysing and meta-analysing data from small sample-size SE randomized experiments. Parametric methods are only preferable if you have prior knowledge of the data distribution.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"281 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Augmented testing to support manual GUI-based regression testing: An empirical study","authors":"Andreas Bauer, Julian Frattini, Emil Alégroth","doi":"10.1007/s10664-024-10522-z","DOIUrl":"https://doi.org/10.1007/s10664-024-10522-z","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>Manual graphical user interface (GUI) software testing presents a substantial part of the overall practiced testing efforts, despite various research efforts to further increase test automation. Augmented Testing (AT), a novel approach for GUI testing, aims to aid manual GUI-based testing through a tool-supported approach where an intermediary visual layer is rendered between the system under test (SUT) and the tester, superimposing relevant test information.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>The primary objective of this study is to gather empirical evidence regarding AT’s efficiency compared to manual GUI-based regression testing. Existing studies involving testing approaches under the AT definition primarily focus on exploratory GUI testing, leaving a gap in the context of regression testing. As a secondary objective, we investigate AT’s benefits, drawbacks, and usability issues when deployed with the demonstrator tool, Scout.</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>We conducted an experiment involving 13 industry professionals, from six companies, comparing AT to manual GUI-based regression testing. These results were complemented by interviews and Bayesian data analysis (BDA) of the study’s quantitative results.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>The results of the Bayesian data analysis revealed that the use of AT shortens test durations in 70% of the cases on average, concluding that AT is more efficient. When comparing the means of the total duration to perform all tests, AT reduced the test duration by 36% in total. Participant interviews highlighted nine benefits and eleven drawbacks of AT, while observations revealed four usability issues.</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>This study presents empirical evidence of improved efficiency using AT in the context of manual GUI-based regression testing. We further report AT’s benefits, drawbacks, and usability issues. The majority of identified usability issues and drawbacks can be attributed to the tool implementation of AT and, thus, can serve as valuable input for future tool development.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"59 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The best ends by the best means: ethical concerns in app reviews","authors":"Neelam Tjikhoeri, Lauren Olson, Emitzá Guzmán","doi":"10.1007/s10664-024-10463-7","DOIUrl":"https://doi.org/10.1007/s10664-024-10463-7","url":null,"abstract":"<p>This work analyzes ethical concerns found in users’ app store reviews. We performed this study because ethical concerns in mobile applications (apps) are widespread, pose severe threats to end users and society, and lack systematic analysis and methods for detection and classification. In addition, app store reviews allow practitioners to collect users’ perspectives, crucial for identifying software flaws, from a geographically distributed and large-scale audience. For our analysis, we collected five million user reviews, developed a set of ethical concerns representative of user preferences, and manually labeled a sample of these reviews. We found that (1) users highly report ethical concerns about censorship, identity theft, and safety (2) user reviews with ethical concerns are longer, more popular, and lowly rated, and (3) there is high automation potential for the classification and filtering of these reviews. Our results highlight the relevance of using app store reviews for the systematic consideration of ethical concerns during software evolution.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"4 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Static analysis driven enhancements for comprehension in machine learning notebooks"
Ashwin Prasad Shivarpatna Venkatesh, Samkutty Sabu, Mouli Chekkapalli, Jiawei Wang, Li Li, Eric Bodden
Empirical Software Engineering, published 2024-08-12. DOI: 10.1007/s10664-024-10525-w

Jupyter notebooks have emerged as the predominant tool for data scientists to develop and share machine learning solutions, primarily using Python as the programming language. Despite their widespread adoption, a significant fraction of these notebooks, when shared on public repositories, suffer from insufficient documentation and a lack of coherent narrative. Such shortcomings compromise the readability and understandability of the notebook. Addressing this shortcoming, this paper introduces HeaderGen, a tool-based approach that automatically augments code cells in these notebooks with descriptive markdown headers, derived from a predefined taxonomy of machine learning operations. Additionally, it systematically classifies and displays function calls in line with this taxonomy. The mechanism that powers HeaderGen is an enhanced call graph analysis technique, building upon the foundational analysis available in PyCG. To improve precision, HeaderGen extends PyCG's analysis with return-type resolution of external function calls, type inference, and flow-sensitivity. Furthermore, leveraging type information, HeaderGen employs pattern matching techniques on the code syntax to annotate code cells. We conducted an empirical evaluation on 15 real-world Jupyter notebooks sourced from Kaggle. The results indicate a high accuracy in call graph analysis, with precision at 95.6% and recall at 95.3%. The header generation has a precision of 85.7% and a recall of 92.8% with regard to headers created manually by experts. A user study corroborated the practical utility of HeaderGen, revealing that users found HeaderGen useful in tasks related to comprehension and navigation. To further evaluate the type inference capability of static analysis tools, we introduce TypeEvalPy, a framework for evaluating type inference tools for Python with a built-in micro-benchmark containing 154 code snippets and 845 type annotations in the ground truth. Our comparative analysis of four tools revealed that HeaderGen outperforms the other tools in exact matches with the ground truth.
{"title":"Causal inference of server- and client-side code smells in web apps evolution","authors":"Américo Rio, Fernando Brito e Abreu, Diana Mendes","doi":"10.1007/s10664-024-10478-0","DOIUrl":"https://doi.org/10.1007/s10664-024-10478-0","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>Code smells (CS) are symptoms of poor design and implementation choices that may lead to increased defect incidence, decreased code comprehension, and longer times to release. Web applications and systems are seldom studied, probably due to the heterogeneity of platforms (server and client-side) and languages, and to study web code smells, we need to consider CS covering that diversity. Furthermore, the literature provides little evidence for the claim that CS are a symptom of poor design, leading to future problems in web apps.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>To study the quantitative evolution and inner relationship of CS in web apps on the server- and client-sides, and their impact on maintainability and app time-to-release (TTR).</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>We collected and analyzed 18 server-side, and 12 client-side code smells, aka web smells, from consecutive official releases of 12 PHP typical web apps, i.e., with server- and client-code in the same code base, summing 811 releases. Additionally, we collected metrics, maintenance issues, reported bugs, and release dates. We used several methodologies to devise causality relationships among the considered irregular time series, such as Granger-causality and Information Transfer Entropy(TE) with CS from previous one to four releases (lag 1 to 4).</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>The CS typically evolve the same way inside their group and its possible to analyze them as groups. The CS group trends are: Server, slowly decreasing; Client-side embed, decreasing and JavaScript,increasing. Studying the relationship between CS groups we found that the \"lack of code quality\", measured with CS density proxies, propagates from client code to server code and JavaScript in half of the applications. We found causality relationships between CS and issues. We also found causality from CS groups to bugs in Lag 1, decreasing in the subsequent lags. The values are 15% (lag1), 10% (lag2), and then decrease. The group of client-side embed CS still impacts up to 3 releases before. In group analysis, server-side CS and JavaScript contribute more to bugs. There are causality relationships from individual CS to TTR on lag 1, decreasing on lag 2, and from all CS groups to TTR in lag1, decreasing in the other lags, except for client CS.</p><h3 data-test=\"abstract-sub-heading\">Conclusions</h3><p>There is statistical inference between CS groups. There is also evidence of statistical inference from the CS to web applications’ issues, bugs, and TTR. Client and server-side CS contribute globally to the quality of web applications, this contribution is low, but significant. 
Depending on the outcome variable (issues, bugs, time-to-release), the contribution quantity from CS is between 10% and 20%.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"76 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141933883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
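Granger-causality tests of the kind used in the study can be run on per-release time series with statsmodels. The sketch below uses synthetic series standing in for code-smell density and bug counts; in the actual study the series come from 811 releases, and a real analysis would also need the usual stationarity checks before the test results are meaningful.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic per-release series (placeholders for CS density and bug counts).
rng = np.random.default_rng(0)
smell_density = rng.normal(10, 2, size=60)
bugs = 0.8 * np.roll(smell_density, 1) + rng.normal(0, 1, size=60)  # bugs follow smells by one release

data = pd.DataFrame({"bugs": bugs, "smell_density": smell_density})

# Does smell_density Granger-cause bugs at lags 1..4?
# (the second column is tested as a predictor of the first)
results = grangercausalitytests(data[["bugs", "smell_density"]], maxlag=4)
for lag, (tests, _) in results.items():
    print(f"lag {lag}: ssr F-test p-value = {tests['ssr_ftest'][1]:.4f}")
```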
{"title":"On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools","authors":"Aurora Papotti, Ranindya Paramitha, Fabio Massacci","doi":"10.1007/s10664-024-10506-z","DOIUrl":"https://doi.org/10.1007/s10664-024-10506-z","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Objective</h3><p>We investigated whether (possibly wrong) security patches suggested by Automated Program Repairs (APR) for real world projects are recognized by human reviewers. We also investigated whether knowing that a patch was produced by an allegedly specialized tool does change the decision of human reviewers.</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>We perform an experiment with <span>(n= 72)</span> Master students in Computer Science. In the first phase, using a balanced design, we propose to human reviewers a combination of patches proposed by APR tools for different vulnerabilities and ask reviewers to adopt or reject the proposed patches. In the second phase, we tell participants that some of the proposed patches were generated by security-specialized tools (even if the tool was actually a ‘normal’ APR tool) and measure whether the human reviewers would change their decision to adopt or reject a patch.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>It is easier to identify wrong patches than correct patches, and correct patches are not confused with partially correct patches. Also patches from APR Security tools are adopted more often than patches suggested by generic APR tools but there is not enough evidence to verify if ‘bogus’ security claims are distinguishable from ‘true security’ claims. Finally, the number of switches to the patches suggested by security tool is significantly higher after the security information is revealed irrespective of correctness.</p><h3 data-test=\"abstract-sub-heading\">Limitations</h3><p>The experiment was conducted in an academic setting, and focused on a limited sample of popular APR tools and popular vulnerability types.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"52 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141881763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"IRJIT: A simple, online, information retrieval approach for just-in-time software defect prediction"
Hareem Sahar, Abdul Ali Bangash, Abram Hindle, Denilson Barbosa
Empirical Software Engineering, published 2024-08-02. DOI: 10.1007/s10664-024-10514-z

Just-in-Time software defect prediction (JIT-SDP) prevents the introduction of defects into the software by identifying them at commit check-in time. Current software defect prediction approaches rely on manually crafted features such as change metrics and involve machine learning or deep learning models that are expensive to train. These models typically involve extensive training processes that may require significant computational resources and time. These characteristics can pose challenges when attempting to update the models in real time as new examples become available, potentially impacting their suitability for fast online defect prediction. Furthermore, the reliance on a complex underlying model often makes these approaches less explainable, meaning developers cannot understand the reasons behind the models' predictions. An approach that is not explainable might not be adopted in real-life development environments because of developers' lack of trust in its results. To address these limitations, we propose an approach called IRJIT that employs information retrieval on source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits. The IRJIT approach is online and explainable: it can learn from new data without expensive retraining, and developers can see the documents that support a prediction, providing additional context. By evaluating 10 open-source datasets in a within-project setting, we show that our approach is up to 112 times faster than state-of-the-art ML and DL approaches, offers explainability at the commit and line level, and has performance comparable to the state-of-the-art.
"Industrial adoption of machine learning techniques for early identification of invalid bug reports"
Muhammad Laiq, Nauman bin Ali, Jürgen Börstler, Emelie Engström
Empirical Software Engineering, published 2024-07-31. DOI: 10.1007/s10664-024-10502-3

Despite the accuracy of machine learning (ML) techniques in predicting invalid bug reports, as shown in earlier research, and the importance of early identification of invalid bug reports in software maintenance, the adoption of ML techniques for this task in industrial practice is yet to be investigated. In this study, we used a technology transfer model to guide the adoption of an ML technique at a company for the early identification of invalid bug reports. In the process, we also identify necessary conditions for adopting such techniques in practice. We followed a case study research approach with various design and analysis iterations for the technology transfer activities. We collected data from bug repositories, through focus groups, a questionnaire, and a presentation and feedback session with an expert. As expected, we found that an ML technique can identify invalid bug reports with acceptable accuracy at an early stage. However, the technique's accuracy drops over time in operational use due to changes in the product, the technologies used, or the development organization. Such changes may require retraining the ML model. During validation, practitioners highlighted the need to understand the ML technique's predictions in order to trust them. We found that a visual and descriptive explanation of a prediction (using a state-of-the-art ML interpretation framework) increases the trustworthiness of the technique compared to just presenting the results of the validity predictions. We conclude that trustworthiness, integration with the existing toolchain, and maintaining the technique's accuracy over time are critical for increasing the likelihood of adoption.