{"title":"A Multi-solution Study on GDPR AI-enabled Completeness Checking of DPAs","authors":"Muhammad Ilyas Azeem, Sallam Abualhaija","doi":"10.1007/s10664-024-10491-3","DOIUrl":"https://doi.org/10.1007/s10664-024-10491-3","url":null,"abstract":"<p>Specifying legal requirements for software systems to ensure their compliance with the applicable regulations is a major concern of requirements engineering. Personal data which is collected by an organization is often shared with other organizations to perform certain processing activities. In such cases, the General Data Protection Regulation (GDPR) requires issuing a data processing agreement (DPA) which regulates the processing and further ensures that personal data remains protected. Violating GDPR can lead to huge fines reaching to billions of Euros. Software systems involving personal data processing must adhere to the legal obligations stipulated both at a general level in GDPR as well as the obligations outlined in DPAs highlighting specific business. In other words, a DPA is yet another source from which requirements engineers can elicit legal requirements. However, the DPA must be complete according to GDPR to ensure that the elicited requirements cover the complete set of obligations. Therefore, checking the completeness of DPAs is a prerequisite step towards developing a compliant system. Analyzing DPAs with respect to GDPR entirely manually is time consuming and requires adequate legal expertise. In this paper, we propose an automation strategy that addresses the completeness checking of DPAs against GDPR provisions as a text classification problem. Specifically, we pursue ten alternative solutions which are enabled by different technologies, namely traditional machine learning, deep learning, language modeling, and few-shot learning. The goal of our work is to empirically examine how these different technologies fare in the legal domain. We computed F<sub>2</sub> score on a set of 30 real DPAs. 
Our evaluation shows that the best-performing solutions, which yield F<sub>2</sub> scores of 86.7% and 89.7%, are based on pre-trained BERT and RoBERTa language models. Our analysis further shows that other alternative solutions based on deep learning (e.g., BiLSTM) and few-shot learning (e.g., SetFit) can achieve comparable accuracy, yet are more efficient to develop.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"1 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141521636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Studying the explanations for the automated prediction of bug and non-bug issues using LIME and SHAP","authors":"Lukas Schulte, Benjamin Ledel, Steffen Herbold","doi":"10.1007/s10664-024-10469-1","DOIUrl":"https://doi.org/10.1007/s10664-024-10469-1","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>The identification of bugs within issues reported to an issue tracking system is crucial for triage. Machine learning models have shown promising results for this task. However, we have only limited knowledge of how such models identify bugs. Explainable AI methods like LIME and SHAP can be used to increase this knowledge.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>We want to understand if explainable AI provides explanations that are reasonable to us as humans and align with our assumptions about the model’s decision-making. We also want to know if the quality of predictions is correlated with the quality of explanations.</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>We conduct a study where we rate LIME and SHAP explanations based on their quality of explaining the outcome of an issue type prediction model. For this, we rate the quality of the explanations, i.e., if they align with our expectations and help us understand the underlying machine learning model.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>We found that both LIME and SHAP give reasonable explanations and that correct predictions are well explained. Further, we found that SHAP outperforms LIME due to a lower ambiguity and a higher contextuality that can be attributed to the ability of the deep SHAP variant to capture sentence fragments.</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>We conclude that the model finds explainable signals for both bugs and non-bugs. 
Also, we recommend that research dealing with the quality of explanations for classification tasks reports and investigates rater agreement, since the rating of explanations is highly subjective.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"52 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141521631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How far are we with automated machine learning? characterization and challenges of AutoML toolkits","authors":"Md Abdullah Al Alamin, Gias Uddin","doi":"10.1007/s10664-024-10450-y","DOIUrl":"https://doi.org/10.1007/s10664-024-10450-y","url":null,"abstract":"<p>Automated Machine Learning aka AutoML toolkits are low/no-code software that aim to democratize ML system application development by ensuring rapid prototyping of ML models and by enabling collaboration across different stakeholders in ML system design (e.g., domain experts, data scientists, etc.). It is thus important to know the state of current AutoML toolkits and the challenges ML practitioners face while using those toolkits. In this paper, we first offer a characterization of currently available AutoML toolkits by analyzing 37 top AutoML tools and platforms. We find that the top AutoML platforms are mostly cloud-based. Most of the tools are optimized for the adoption of shallow ML models. Second, we present an empirical study of 14.3K AutoML related posts from Stack Overflow (SO) that we analyzed using topic modelling algorithm LDA (Latent Dirichlet Allocation) to understand the challenges of ML practitioners while using the AutoML toolkits. We find 13 topics in the AutoML related discussions in SO. The 13 topics are grouped into four categories: MLOps (43% of all questions), Model (28% questions), Data (27% questions), and Documentation (2% questions). Most questions are asked during Model training (29%) and Data preparation (25%) phases. AutoML practitioners find the MLOps topic category most challenging. Topics related to the MLOps category are the most prevalent and popular for cloud-based AutoML toolkits. 
Based on our study findings, we provide 15 recommendations to improve the adoption and development of AutoML toolkits.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"61 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141521640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An empirical study of fault localization in Python programs","authors":"Mohammad Rezaalipour, Carlo A. Furia","doi":"10.1007/s10664-024-10475-3","DOIUrl":"https://doi.org/10.1007/s10664-024-10475-3","url":null,"abstract":"<p>Despite its massive popularity as a programming language, especially in novel domains like data science programs, there is comparatively little research about fault localization that targets Python. Even though it is plausible that several findings about programming languages like C/C++ and Java—the most common choices for fault localization research—carry over to other languages, whether the dynamic nature of Python and how the language is used in practice affect the capabilities of classic fault localization approaches remain open questions to investigate. This paper is the first multi-family large-scale empirical study of fault localization on real-world Python programs and faults. Using Zou et al.’s recent large-scale empirical study of fault localization in Java (Zou et al. 2021) as the basis of our study, we investigated the effectiveness (i.e., localization accuracy), efficiency (i.e., runtime performance), and other features (e.g., different entity granularities) of seven well-known fault-localization techniques in four families (spectrum-based, mutation-based, predicate switching, and stack-trace based) on 135 faults from 13 open-source Python projects from the <span>BugsInPy</span> curated collection (Widyasari et al. 2020). The results replicate for Python several results known about Java, and shed light on whether Python’s peculiarities affect the capabilities of fault localization. 
The replication package that accompanies this paper includes detailed data about our experiments, as well as the tool <span>FauxPy</span> that we implemented to conduct the study.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"68 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adoption of automated software engineering tools and techniques in Thailand","authors":"Chaiyong Ragkhitwetsagul, Jens Krinke, Morakot Choetkiertikul, Thanwadee Sunetnanta, Federica Sarro","doi":"10.1007/s10664-024-10472-6","DOIUrl":"https://doi.org/10.1007/s10664-024-10472-6","url":null,"abstract":"<p>Readiness for the adoption of Automated Software Engineering (ASE) tools and techniques can vary according to the size and maturity of software companies. ASE tools and techniques have been adopted by large or ultra-large software companies. However, little is known about the adoption of ASE tools and techniques in small and medium-sized software enterprises (SSMEs) in emerging countries, and the challenges faced by such companies. We study the adoption of ASE tools and techniques for software measurement, static code analysis, continuous integration, and software testing, and the respective challenges faced by software developers in Thailand, a developing country with a growing software economy which mainly consists of SSMEs (similar to other developing countries). Based on the answers from 103 Thai participants in an online survey, we found that Thai software developers are somewhat familiar with ASE tools and agree that adopting such tools would be beneficial. Most of the developers do not use software measurement or static code analysis tools due to a lack of knowledge or experience but agree that their use would be useful. Continuous integration tools have been used with some difficulties. Lastly, although automated testing tools are adopted despite several serious challenges, many developers are still testing the software manually. 
We call for improvements in ASE tools to be easier to use in order to lower the barrier to adoption in small and medium-sized software enterprises (SSMEs) in developing countries.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"61 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141506578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward effective secure code reviews: an empirical study of security-related coding weaknesses","authors":"Wachiraphan Charoenwet, Patanamon Thongtanunam, Van-Thuan Pham, Christoph Treude","doi":"10.1007/s10664-024-10496-y","DOIUrl":"https://doi.org/10.1007/s10664-024-10496-y","url":null,"abstract":"<p>Identifying security issues early is encouraged to reduce the latent negative impacts on the software systems. Code review is a widely-used method that allows developers to manually inspect modified code, catching security issues during a software development cycle. However, existing code review studies often focus on known vulnerabilities, neglecting coding weaknesses, which can introduce real-world security issues that are more visible through code review. The practices of code reviews in identifying such coding weaknesses are not yet fully investigated. To better understand this, we conducted an empirical case study in two large open-source projects, OpenSSL and PHP. Based on 135,560 code review comments, we found that reviewers raised security concerns in 35 out of 40 coding weakness categories. Surprisingly, some coding weaknesses related to past vulnerabilities, such as memory errors and resource management, were discussed less often than the vulnerabilities. Developers attempted to address raised security concerns in many cases (39%-41%), but a substantial portion was merely acknowledged (30%-36%), and some went unfixed due to disagreements about solutions (18%-20%). This highlights that coding weaknesses can slip through code review even when identified. Our findings suggest that reviewers can identify various coding weaknesses leading to security issues during code reviews. 
However, these results also reveal shortcomings in current code review practices, indicating the need for more effective mechanisms or support for increasing awareness of security issue management in code reviews.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"204 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141521634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The untold impact of learning approaches on software fault-proneness predictions: an analysis of temporal aspects","authors":"Mohammad Jamil Ahmad, Katerina Goseva-Popstojanova, Robyn R. Lutz","doi":"10.1007/s10664-024-10454-8","DOIUrl":"https://doi.org/10.1007/s10664-024-10454-8","url":null,"abstract":"<p>This paper aims to improve software fault-proneness prediction by investigating the unexplored effects on classification performance of the temporal decisions made by practitioners and researchers regarding (i) the interval for which they will collect longitudinal features (software metrics data), and (ii) the interval for which they will predict software bugs (the target variable). We call these specifics of the data used for training and of the target variable being predicted the <i>learning approach</i>, and explore the impact of the two most common learning approaches on the performance of software fault-proneness prediction, both within a single release of a software product and across releases. The paper presents empirical results from a study based on data extracted from 64 releases of twelve open-source projects. Results show that the learning approach has a substantial, and typically unacknowledged, impact on classification performance. Specifically, we show that one learning approach leads to significantly better performance than the other, both within-release and across-releases. Furthermore, this paper uncovers that, for within-release predictions, the difference in classification performance is due to different levels of class imbalance in the two learning approaches. Our findings show that improved specification of the learning approach is essential to understanding and explaining the performance of fault-proneness prediction models, as well as to avoiding misleading comparisons among them. 
The paper concludes with some practical recommendations and research directions based on our findings toward improved software fault-proneness prediction.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"65 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141521632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenges, adaptations, and fringe benefits of conducting software engineering research with human participants during the COVID-19 pandemic","authors":"Anuradha Madugalla, Tanjila Kanij, Rashina Hoda, Dulaji Hidellaarachchi, Aastha Pant, Samia Ferdousi, John Grundy","doi":"10.1007/s10664-024-10490-4","DOIUrl":"https://doi.org/10.1007/s10664-024-10490-4","url":null,"abstract":"<p>The COVID-19 pandemic changed the way we live, work and the way we conduct research. With the restrictions of lockdowns and social distancing, various impacts were experienced by many software engineering researchers, especially those whose studies depend on human participants. We conducted a mixed methods study to understand the extent of this impact. Through a detailed survey with 89 software engineering researchers working with human participants around the world and a further nine follow-up interviews, we identified the key challenges faced, the adaptations made, and the surprising fringe benefits of conducting research involving human participants during the pandemic. Our findings also revealed that in retrospect, many researchers did not wish to revert to the old ways of conducting human-oriented research. 
Based on our analysis and insights, we share recommendations on how to conduct remote studies with human participants effectively in an increasingly hybrid world when face-to-face engagement is not possible or where remote participation is preferred.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"238 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141521635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing and classifying developer forum posts with their intentions","authors":"Xingfang Wu, Eric Laufer, Heng Li, Foutse Khomh, Santhosh Srinivasan, Jayden Luo","doi":"10.1007/s10664-024-10487-z","DOIUrl":"https://doi.org/10.1007/s10664-024-10487-z","url":null,"abstract":"<p>With the rapid growth of the developer community, the amount of posts on online technical forums has been growing rapidly, which poses difficulties for users to filter useful posts and find important information. Tags provide a concise feature dimension for users to locate their interested posts and for search engines to index the most relevant posts according to the queries. Most tags are only focused on the technical perspective (e.g., program language, platform, tool). In most cases, forum posts in online developer communities reveal the author’s intentions to solve a problem, ask for advice, share information, etc. The modeling of the intentions of posts can provide an extra dimension to the current tag taxonomy. By referencing previous studies and learning from industrial perspectives, we create a refined taxonomy for the intentions of technical forum posts. Through manual labeling and analysis on a sampled post dataset extracted from online forums, we understand the relevance between the constitution of posts (code, error messages) and their intentions. Furthermore, inspired by our manual study, we design a pre-trained transformer-based model to automatically predict post intentions. The best variant of our intention prediction framework, which achieves a Micro F1-score of 0.589, Top 1-3 accuracy of 62.6% to 87.8%, and an average AUC of 0.787, outperforms the state-of-the-art baseline approach. 
Our characterization and automated classification of forum posts regarding their intentions may help forum maintainers or third-party tool developers improve the organization and retrieval of posts on technical forums.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"25 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141255989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VulNet: Towards improving vulnerability management in the Maven ecosystem","authors":"Zeyang Ma, Shouvick Mondal, Tse-Hsun (Peter) Chen, Haoxiang Zhang, Ahmed E. Hassan","doi":"10.1007/s10664-024-10448-6","DOIUrl":"https://doi.org/10.1007/s10664-024-10448-6","url":null,"abstract":"<p>Developers rely on software ecosystems such as Maven to manage and reuse external libraries (i.e., dependencies). Due to the complexity of the used dependencies, developers may face challenges in choosing which library to use and whether they should upgrade or downgrade a library. One important factor that affects this decision is the number of potential vulnerabilities in a library and its dependencies. Therefore, state-of-the-art platforms such as Maven Repository (MVN) and Open Source Insights (OSI) help developers in making such a decision by presenting vulnerability information associated with every dependency. In this paper, we first conduct an empirical study to understand how the two platforms, MVN and OSI, present and categorize vulnerability information. We found that these two platforms may either overestimate or underestimate the number of associated vulnerabilities in a dependency, and they lack prioritization mechanisms on which dependencies are more likely to cause an issue. Hence, we propose a tool named VulNet to address the limitations we found in MVN and OSI. Through an evaluation of 19,886 versions of the top 200 popular libraries, we find VulNet includes 90.5% and 65.8% of the dependencies that were omitted by MVN and OSI, respectively. VulNet also helps reduce 27% of potentially unreachable or less impactful vulnerabilities listed by OSI in test dependencies. 
Finally, our user study with 24 participants gave VulNet an average rating of 4.5/5 in presenting and prioritizing vulnerable dependencies, compared to 2.83 (MVN) and 3.14 (OSI).</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"38 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}