{"title":"CodeCV: Mining Expertise of GitHub Users from Coding Activities","authors":"Daniel Atzberger, Nico Scordialo, Tim Cech, W. Scheibel, Matthias Trapp, J. Döllner","doi":"10.1109/SCAM55253.2022.00021","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00021","url":null,"abstract":"The number of software projects developed collaboratively on social coding platforms is steadily increasing. One of the motivations for developers to participate in open-source software development is to make their development activities easier accessible to potential employers, e.g., in the form of a resume for their interests and skills. However, manual review of source code activities is time-consuming and requires detailed knowledge of the technologies used. Existing approaches are limited to a small subset of actual source code activity and metadata and do not provide explanations for their results. In this work, we present CodeCV, an approach to analyzing the commit activities of a GitHub user concerning the use of programming languages, software libraries, and higher-level concepts, e.g., Machine Learning or Cryptocurrency. Skills in using software libraries and programming languages are analyzed based on syntactic structures in the source code. Based on Labeled Latent Dirichlet Allocation, an automatically generated corpus of GitHub projects is used to learn the concept-specific vocabulary in identifier names and comments. This enables the capture of expertise on abstract concepts from a user's commit history. CodeCV further explains the results through links to the relevant commits in an interactive web dashboard. We tested our system on selected GitHub users who mainly contribute to popular projects to demonstrate that our approach is able to capture developers' expertise effectively.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128628647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experimental Evaluation of A New Ranking Formula for Spectrum based Fault Localization","authors":"Q. Sarhan, Árpád Beszédes","doi":"10.1109/SCAM55253.2022.00038","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00038","url":null,"abstract":"Spectrum-Based Fault Localization (SBFL) uses a mathematical formula to determine a suspicion score for each program element (such as a statement, method, or class) based on fundamental statistics (e.g., how many times each element is executed and not executed in passed and failed tests) taken from test coverage and results. Based on the calculated scores, program elements are then ordered from most suspicious to least suspicious. The elements with the highest scores are thought to be the most prone to error. The final ranking list of program elements aids developers in debugging when looking for the source of a fault in the program under test. In this paper, we present a new SBFL ranking formula that enhances a base formula by ranking code elements slightly higher than others that are executed by more failed tests and less passing ones. Its novelty is that it breaks ties between the elements that share the same suspicion score of the base formula. Experiments were conducted on six single-fault programs of the Defects4J dataset to evaluate the effectiveness of the proposed formula. The results show that our new formula when compared to three widely-studied SBFL formulas, achieved a better performance in terms of average ranking. It also achieved positive results in all of the Top-N categories and increased the number of cases where the faulty element became the top-ranked element by 13–23%.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"335 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127575716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Empirical Study of Code Smells in Transformer-based Code Generation Techniques","authors":"Mohammed Latif Siddiq, Shafayat H. Majumder, Maisha R. Mim, Sourov Jajodia, Joanna C. S. Santos","doi":"10.1109/SCAM55253.2022.00014","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00014","url":null,"abstract":"Prior works have developed transformer-based language learning models to automatically generate source code for a task without compilation errors. The datasets used to train these techniques include samples from open source projects which may not be free of security flaws, code smells, and violations of standard coding practices. Therefore, we investigate to what extent code smells are present in the datasets of coding generation techniques and verify whether they leak into the output of these techniques. To conduct this study, we used Pylint and Bandit to detect code smells and security smells in three widely used training sets (CodeXGlue, APPS, and Code Clippy). We observed that Pylint caught 264 code smell types, whereas Bandit located 44 security smell types in these three datasets used for training code generation techniques. By analyzing the output from ten different configurations of the open-source fine-tuned transformer-based GPT-Neo 125M parameters model, we observed that this model leaked the smells and non-standard practices to the generated source code. When analyzing GitHub Copilot's suggestions, a closed source code generation tool, we observed that it contained 18 types of code smells, including substandard coding patterns and 2 security smell types.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126001591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Weighted-SBFL by Blocking Spectrum","authors":"Haruka Yoshioka, Yoshiki Higo, S. Kusumoto","doi":"10.1109/SCAM55253.2022.00036","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00036","url":null,"abstract":"Debugging is a costly process in software development, and computer-aided debugging is expected to reduce the cost. In debugging, fault localization is used to identify the location of potentially faulty code. Spectrum-based fault localization (SBFL) identifies program statements that contain faults based on program spectra collected during the execution of the test cases. Conventional SBFL treats all test cases as having equal importance. A weighting technique that assigns importance to test cases based on the similarity of program spectra (where higher similarity indicates higher importance) has been proposed. However, this technique does not significantly improve fault localization accuracy. We attribute this lack of improvement to the presence of sequential program statements, which negatively affect the weighting. In this study, we apply blocking and the weighting of spectra to improve accuracy. We conduct experiments to compare the proposed technique with conventional SBFL and a recent SBFL technique. We show that the proposed technique identifies faulty program statements with higher accuracy than previous SBFL techniques. Weighting based on the similarity of spectra after blocking is thus effective.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128335771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Test Transplantation through Dynamic Test Slicing","authors":"Mehrdad Abdi, S. Demeyer","doi":"10.1109/SCAM55253.2022.00009","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00009","url":null,"abstract":"Previous research has demonstrated that the test coverage of libraries can be expanded by using existing test inputs from their dependent projects. In this paper, we propose an algorithm for test transplantation based on test slicing. The algorithm extracts test inputs, isolates them by creating mocks, and then transplants the test code onto the test suite of the libraries. To achieve test slicing, we dynamically execute the tests in the dependent project and create its graph of histories. Then, we traverse back from the interesting object state and collect the corresponding edges. Finally, we reverse the collected edges and create a sequence of method calls to reconstruct the same object state. We have implemented a proof-of-concept in Pharo-Smalltalk, in this paper we discuss the lessons learned so far.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127997780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Multimodal Architecture for Detection of Long Parameter List and Switch Statements using DistilBERT","authors":"Anushka Bhave, Roopak Sinha","doi":"10.1109/SCAM55253.2022.00018","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00018","url":null,"abstract":"Code smell detection and refactoring are crucial to sustain quality, reduce complexity and increase the efficiency of a software application. Code smells are observable patterns in the source code of a program that indicate deeper structural issues. Most traditional methods for code smell classification rely exclusively on structural object-oriented metrics and manually-designed heuristics. We propose a novel multimodal deep learning approach that combines structural and semantic information to detect two commonly-encountered code smells: Long Parameter Lists and Switch Statements. The presented architecture applies transfer learning on DistilBERT to generate vector embeddings representing classes and methods concatenated with numerical metrics for joint feature extraction using CNN, to build a complex mapping between the features and predict the output as smelly or non-smelly. Subsequently, to perform a holistic comparative analysis we also implement two multimodal machine learning pipelines, the first employs a sci-kit learn TF-IDF Vectorizer with Random Forest Classifier, and the second merges CNN with Bi-LSTM. Our approach achieves an accuracy of 91.2% as corroborated by experimental evaluation, outperforming the state-of-the-art techniques.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125946700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards the Detection of Hidden Familial Type Correlations in Java Code","authors":"Alin-Petru Roşu, Petru Florin Mihancea","doi":"10.1109/SCAM55253.2022.00022","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00022","url":null,"abstract":"Family polymorphism is an object-oriented programming feature which facilitates the definition of groups of classes (families) that are allowed to be used together while statically forbidding them to be mixed with classes outside their families. Unfortunately, this feature has not been yet adopted by mainstream industrial-strength programming languages. Consequently, in Java, the idea of non-mixable families is prone to be implemented in a statically unsafe fashion, affecting the programs' intelligibility. In order to support program comprehension, we present an approach to detect code fragments where types of references are correlated within a family; nonetheless, these correlations are hidden behind the references' declarations. We obtained promising results during the initial design and evaluation iterations, based on the analysis of a software system where the presence of families was previously reported.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126517772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Removing dependencies from large software projects: are you really sure?","authors":"Ching-Chi Chuang, Luís Cruz, R. V. Dalen, Vladimir Mikovski, A. Deursen","doi":"10.1109/SCAM55253.2022.00017","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00017","url":null,"abstract":"When developing and maintaining large software systems, a great deal of effort goes into dependency management. During the whole lifecycle of a software project, the set of dependencies keeps changing to accommodate the addition of new features or changes in the running environment. Package management tools are quite popular to automate this process, making it fairly easy to automate the addition of new dependencies and respective versions. However, over the years, a software project might evolve in a way that no longer needs a particular technology or dependency. But the choice of removing that dependency is far from trivial: one cannot be entirely sure that the dependency is not used in any part of the project. Hence, developers have a hard time confidently removing dependencies and trusting that it will not break the system in production. In this paper, we propose a decision framework to improve the detection of unused dependencies. Our approach builds on top of the existing dependency analysis tool DepClean. We start by improving the support of Java dynamic features in DepClean. We do so by augmenting the analysis with the state-of-the-art call graph generation tool OPAL. Then, we analyze the potentially unused dependencies detected by classifying their logical relationship with the other components to decide on follow-up steps, which we provide in the form of a decision diagram. Results show that developers can focus their efforts on maintaining bloated dependencies by following the recommendations of our decision framework. When applying our approach to a large industrial software project, we can reduce one-third of false positives when compared to the state-of-the-art. We also validate our approach by analyzing dependencies that were removed in the history of open-source projects. Results show consistency between our approach and the decisions taken by open-source developers.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129049740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A preliminary evaluation on the relationship among architectural and test smells","authors":"M. D. Stefano, Fabiano Pecorelli, D. D. Nucci, A. D. Lucia","doi":"10.1109/SCAM55253.2022.00013","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00013","url":null,"abstract":"Software maintenance is the software life cycle's longest and most challenging phase. Bad architectural decisions or sub-optimal solutions might lead to architectural erosion, i.e., the process that causes the system's architecture to deviate from its original design. The so-called architectural smells are the most common signs of architectural erosion. Architectural smells might affect several quality aspects of a software system, including testability. When a system is not prone to testing, sub-optimal solutions may be introduced in the test code, a.k.a. test smells. This paper explores the possible relations between architectural and test smells. By mining 798 releases of 40 open-source Java systems, we studied the correlation between class-level architectural and test smells. In particular, Eager Test and Assertion Roulette smells often occur in conjunction with Cyclically-dependent Modularization, Deficient Encapsulation, and Insufficient Encapsulation architectural smells.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114894935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Empirical Assessment on Merging and Repositioning of Static Analysis Alarms","authors":"Niloofar Mansoor, Tukaram Muske, Alexander Serebrenik, Bonita Sharif","doi":"10.1109/SCAM55253.2022.00031","DOIUrl":"https://doi.org/10.1109/SCAM55253.2022.00031","url":null,"abstract":"Static analysis tools generate a large number of alarms that require manual inspection. In prior work, repositioning of alarms is proposed to (1) merge multiple similar alarms together and replace them by a fewer alarms, and (2) report alarms as close as possible to the causes for their generation. The premise is that the proposed merging and repositioning of alarms will reduce the manual inspection effort. To evaluate the premise, this paper presents an empirical study with 249 developers on the proposed merging and repositioning of static alarms. The study is conducted using static analysis alarms generated on $C$ programs, where the alarms are representative of the merging vs. non-merging and repositioning vs. non-repositioning situations in real-life code. Developers were asked to manually inspect and determine whether assertions added corresponding to alarms in $C$ code hold. Additionally, two spatial cognitive tests are also done to determine relationship in performance. The empirical evaluation results indicate that, in contrast to expectations, there was no evidence that merging and repositioning of alarms reduces manual inspection effort or improves the inspection accuracy (at times a negative impact was found). Results on cognitive abilities correlated with comprehension and alarm inspection accuracy.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128101072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}