{"title":"JFeature: Know Your Corpus","authors":"Idriss Riouak, G. Hedin, Christoph Reichenbach, Niklas Fors","doi":"10.1109/SCAM55253.2022.00033","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00033","url":null,"abstract":"Software corpora are crucial for evaluating research artifacts and ensuring repeatability of outcomes. Corpora such as DaCapo and Defects4J provide a collection of real-world open-source projects for evaluating the robustness and performance of software tools like static analysers. However, what do we know about these corpora? What do we know about their composition? Are they really suited for our particular problem? We developed JFEATURE, an extensible static analysis tool that extracts syntactic and semantic features from Java programs, to assist developers in answering these questions. We demonstrate the potential of JFEATURE by applying it to four widely-used corpora in the program analysis area, and we suggest other applications, including longitudinal studies of individual Java projects and the creation of new corpora.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124984108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Modal Code Summarization with Retrieved Summary","authors":"Lile Lin, Zhiqiu Huang, Yaoshen Yu, Ya-Ping Liu","doi":"10.1109/SCAM55253.2022.00020","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00020","url":null,"abstract":"A high-quality code summary describes the functionality and purpose of a code snippet concisely, which is key to program comprehension. Automatic code summarization aims to generate natural language summaries from code snippets automatically, which can save developers time and improve efficiency in development and maintenance. Recently, researchers mainly use neural machine translation (NMT) based approaches to fill this task. They apply a neural model to translate code snippets into natural language summaries. However, the performance of existing NMT-based approaches is limited. Although a summary and a code snippet are semantically related, they may not share common lexical tokens or language structures. Such a semantic gap between codes and summaries hinders the effect of NMT-based models. Only using code tokens to represent a code snippet cannot help NMT-based models overcome this gap. To solve this problem, in this paper, we propose a code summarization approach that incorporates lexical, syntactic and semantic modalities of codes. We treat code tokens as the lexical modality and the abstract syntax tree (AST) as the syntactic modality. To obtain the semantic modality, inspired by translation memory (TM) in NMT, we use the information retrieval (IR) technique to retrieve a relevant summary for a code snippet to describe its functionality. We propose a novel approach based on contrastive learning to build a retrieval model to retrieve semantically similar summaries. Our approach learns and fuses those different modalities using Transformer. We evaluate our approach on a large Java dataset, experiment results show that our approach outperforms the state-of-the-art approaches on automatic evaluation metrics BLEU, ROUGE and METEOR by 10%, 8% and 9%.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121615446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining for Framework Instantiation Pattern Interplays","authors":"Yunior Pacheco, Ahmed Zerouali, Coen De Roover","doi":"10.1109/SCAM55253.2022.00019","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00019","url":null,"abstract":"Software frameworks define generic application blueprints which can be instantiated into an application through application-specific instantiation actions such as overriding a method or providing an object that implements an interface. In case the framework's documentation falls short, developers may use other instantiations of the same framework as a guide to the required instantiation actions. In this paper, we propose an automated approach to mining framework instantiation patterns from existing open-source instantiations. The approach leverages a graph-based representation to capture the common ways of implementing instantiation actions as well as their interplays, so called instantiation interplays. As a case study, we mined for patterns in a set of 2,028 Java projects that instantiate four of the most popular Java frameworks. We also classify the extracted interplays according to the different contexts in which they occur. We found that our approach discovers relevant practices and interplays that are not covered by previous approaches. Our results will allow developers to have a better understanding of the frameworks they instantiate.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125223603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Don't DIY: Automatically transform legacy Python code to support structural pattern matching","authors":"B. Rózsa, Gábor Antal, R. Ferenc","doi":"10.1109/SCAM55253.2022.00024","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00024","url":null,"abstract":"As data becomes more and more complex as technology evolves, the need to support more complex data types in programming languages has grown. However, without proper storage and manipulation capabilities, handling such data can result in hard-to-read, difficult-to-maintain code. Therefore, programming languages continuously evolve to provide more and more ways to handle complex data. Python 3.10 introduced structural pattern matching, which serves this exact purpose: we can split complex data into relevant parts by examining its structure, and store them for later processing. Previously, we could only use the traditional conditional branching, which could have led to long chains of nested conditionals. Maintaining such code fragments can be cumbersome. In this paper, we present a complete framework to solve the aforementioned problem. Our software is capable of examining Python source code and transforming relevant conditionals into structural pattern matching. Moreover, it is able to handle nested conditionals and it is also easily extensible, thus the set of possible transformations can be easily increased.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116240956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-Automatic Refactoring to C++20 Modules: A Semi-Success Story","authors":"Richárd Szalay, Z. Porkoláb","doi":"10.1109/SCAM55253.2022.00011","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00011","url":null,"abstract":"The component-based design of software projects is a desired property both for development and ease of code comprehension. Programming languages have long allowed component-based development (e.g., Java packages, Python modules); however, other languages, especially C and C++, had stuck to the “translation unit” model where every source file is individually compiled. The Modules system of C++20 was expected to allow cleaner encapsulation of concern. In this paper, we investigate the effort of a (semi-)automatic modularisation of existing C++ projects. Based on our investigation, upgrading existing software systems to the new Modules feature is extremely hard due to coupling issues arising from necessarily legacy design. Implementing real transition requires a significant redesign of both project-internal and user-facing programming interfaces.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114886147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Plug and Analyze: Usable Dynamic Taint Tracker for Android Apps","authors":"Hiroki Inayoshi, S. Kakei, S. Saito","doi":"10.1109/SCAM55253.2022.00008","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00008","url":null,"abstract":"Taint analyses, especially static taint analyses, are utilized to uncover hidden and suspicious behaviors in Android apps. However, current static taint analyzers use imprecise Android models, producing unreliable results and increasing the result verification cost. On the other hand, current dynamic taint trackers accurately detect execution paths. However, they depend on specific Android versions and modified devices, reducing their usability. Also, the users may not be able to analyze prepared datasets comprehensively. The results of the current analyses would be biased and less trustworthy. This paper presents a new dynamic taint analyzer called T-Recs that tracks information flows by recording the app execution at the app's bytecode level on an Android device and reconstructing the execution on a server independently of specific Android versions and devices. The users can instantly start analyzing apps with T-Recs after plugging an unmodified device into their computer. We implemented and evaluated T-Recs with 158 apps of DroidBench 3.0 in comparison with current taint analyzers: FlowDroid (w/ and w/o IC3), Amandroid, DroidSafe, and TaintDroid (w/ and w/o IntelliDroid), and only T-Recs achieved 100% accuracy. The result of privacy leak detection in 96 popular Google Play apps shows that T-Recs detected 43 true positives, the highest among compared tools. Also, T-Recs analyzed 39,480 apps from Google Play and Anzhi, showing that T-Recs can be applied to apps that vary in supported SDK versions. Further, the result of ID leak detection in 158 popular apps from Google Play in 2021 shows that T-Recs can detect leaks in recently-developed apps. T-Recs is one of the promising tools for future app analysis.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124151079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deriving Modernity Signatures for PHP Systems with Static Analysis","authors":"Wouter Van den Brink, M. Gerhold, V. Zaytsev","doi":"10.1109/SCAM55253.2022.00027","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00027","url":null,"abstract":"The PHP language has undergone many changes in its syntax and grammar, with respect to both features the language has to offer as well as the distribution of language features used by programmers in their projects. We present a novel method of using grammar usage statistics to calculate a modernity signature for a PHP system, so that we can determine its age. The system will aid developers in choosing whether or not to execute or use a PHP system, without having to perform an extensive inspection.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128815831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lint-Based Warnings in Python Code: Frequency, Awareness and Refactoring","authors":"Naelson D. C. Oliveira, Márcio Ribeiro, R. Bonifácio, Rohit Gheyi, I. Wiese, B. Neto","doi":"10.1109/SCAM55253.2022.00030","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00030","url":null,"abstract":"Python is a popular programming language characterized by its simple syntax and easy learning curve. Like many languages, Python has a set of best practices that should be followed to avoid bugs and improve other quality attributes (such as maintenance and readability). In this context, non-compliance to these practices can be detected by using linting tools. Previous work conducted studies to better understand the frequency of a class of problems that can be found using Python linters: warnings, here named as lint-based warnings. However, they either rely on small datasets or focus on few domains, such as machine learning or web-systems projects. In this paper, we provide a mixed-method study where we analyze the frequency of six lint-based warnings in 1,119 different open-source general-purpose Python projects. To go further, we also conduct a survey to check whether developers are aware of the lint-based warnings we study here. In particular, we intend to check whether they are able to identify the six lint-based warnings. To remove the lint-based warnings, we suggest the application of simple refactorings. Last but not least, we evaluate the suggestions by submitting pull requests to remove lint-based warnings from open-source projects. Our results show that 39% of the 1,119 projects have at least one lint-based warning. After analyzing the survey data, we also show that developers prefer Python code without lint-based warnings. Regarding the pull requests, we achieve a 71.8% of acceptance rate.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122732664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting the Impact of Anti-patterns on Fault-Proneness: A Differentiated Replication","authors":"Aurel Ikama, Vincent Du, Philippe Belias, B. Muse, Foutse Khomh, Mohammad Hamdaqa","doi":"10.1109/SCAM55253.2022.00012","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00012","url":null,"abstract":"Anti-patterns manifesting on software code through code smells have been investigated in terms of their prevalence, detection, refactoring, and impact on software quality attributes. In particular, leveraging heuristics to identify fault-fixing commits, Khomh et al. have found that anti-patterns and code smells have an impact on the fault-proneness of a software system. Similarly, Saboury et al. found a relationship between anti-pattern occurrences and fault-proneness, using heuristic to identify fault-fixing commits and fault-inducing changes. However, recent studies question the accuracy of heuristics, and thus the validity of empirical studies that leverage it. Hence, in this work, we would like to investigate to what extent the results of empirical studies using heuristics to identify bug fix commits are affected by the limitations of the heuristics based approach using manually validated bug fix commits as a ground truth. In particular, we conduct a differentiated replication of the work by Khomh et al. We particularly focused on the impact of anti-patterns on fault-proneness as it is the only dependent variable that may be affected by noise in the collected faults data. In our differentiated replication study, (1) we expanded the number of subject systems from 5 to 38, (2) utilized a manually validated dataset of bug-fixing commits from the work of Herbold et al., and (3) answered research questions from Khomh et al., that are related to the relationship between anti-pattern occurrences and fault-proneness. (4) We added an additional research question to investigate if combining results from several heuristic-based approaches could help reduce the impact of noise. Our findings show that the impact of the noise generated by the automatic algorithm heuristic based is negligible for the studied subject systems; meaning that the reported relation observed on noisy data still holds on the clean data. However, we also observed that combining results from several heuristic based approaches do not reduce this noise, quite the contrary.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127271458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Checking Refactoring Detection Results Using Code Changes Encoding for Improved Accuracy","authors":"Liang Tan, Christoph Bockisch","doi":"10.1109/SCAM55253.2022.00016","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00016","url":null,"abstract":"For example during software maintenance, it is often important to know the reason for a code change and therefore tools are researched to automatically detect changes due to refactorings. The tool RefDiff can achieve this supporting multiple programming languages. It provides a good precision, but at the cost of a large number of false negative results due to the necessary use of a high threshold in refactoring candidate selection. We have created a result checker that improves the overall performance of RefDiff by including more candidates and reducing false positives from RefDiff detection results afterwards. The checker encodes the textual differences (so-called diffs) corresponding to the results and uses machine learning to predict the contained refactoring type. The main contribution of this paper is the approach for extracting the diffs from the detection results and encoding them as image data for machine learning processing, as well as the training of the machine learning algorithm. We have shown that lowering the candidate threshold in conjunction with the checker improves not only the recall of RefDiff, also the precision is increased. Our approach improves the RefDiff detection results to 99.5% precision and 95.2% recall.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127834795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}