{"title":"How do Trivial Refactorings Affect Classification Prediction Models?","authors":"Darwin Pinheiro, C. Bezerra, Anderson G. Uchôa","doi":"10.1145/3559712.3559720","DOIUrl":null,"url":null,"abstract":"Refactoring is defined as a transformation that changes the internal structure of the source code without changing the external behavior. Keeping the external behavior means that after applying the refactoring activity, the software must produce the same output as before the activity. The refactoring activity can bring several benefits, such as: removing code with low structural quality, avoiding or reducing technical debt, improving code maintainability, reuse or readability. In this way, the benefits extend to internal and external quality attributes. The literature on software refactoring suggests carrying out studies that invest in improving automated solutions for detecting and correcting refactoring. Furthermore, few studies investigate the influence that a less complex type of refactoring can have on predicting more complex refactorings. This paper investigates how less complex (trivial) refactorings affect the prediction of more complex (non-trivial) refactorings. To do this, we classify refactorings based on their triviality, extract metrics from the code, contextualize the data and train machine learning algorithms to investigate the effect caused. Our results suggest that: (i) machine learning with tree-based models (Random Forest and Decision Tree) performed very well when trained with code metrics to detect refactorings; (ii) separating trivial from non-trivial refactorings into different classes resulted in a more efficient model, indicative of improving the accuracy of automated solutions based on machine learning; and, (iii) using balancing techniques that increase or decrease samples randomly is not the best strategy to improve datasets composed of code metrics.","PeriodicalId":119656,"journal":{"name":"Proceedings of the 16th Brazilian Symposium on Software Components, Architectures, and Reuse","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th Brazilian Symposium on Software Components, Architectures, and Reuse","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3559712.3559720","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Refactoring is defined as a transformation that changes the internal structure of the source code without changing the external behavior. Keeping the external behavior means that after applying the refactoring activity, the software must produce the same output as before the activity. The refactoring activity can bring several benefits, such as: removing code with low structural quality, avoiding or reducing technical debt, improving code maintainability, reuse or readability. In this way, the benefits extend to internal and external quality attributes. The literature on software refactoring suggests carrying out studies that invest in improving automated solutions for detecting and correcting refactoring. Furthermore, few studies investigate the influence that a less complex type of refactoring can have on predicting more complex refactorings. This paper investigates how less complex (trivial) refactorings affect the prediction of more complex (non-trivial) refactorings. To do this, we classify refactorings based on their triviality, extract metrics from the code, contextualize the data and train machine learning algorithms to investigate the effect caused. Our results suggest that: (i) machine learning with tree-based models (Random Forest and Decision Tree) performed very well when trained with code metrics to detect refactorings; (ii) separating trivial from non-trivial refactorings into different classes resulted in a more efficient model, indicative of improving the accuracy of automated solutions based on machine learning; and, (iii) using balancing techniques that increase or decrease samples randomly is not the best strategy to improve datasets composed of code metrics.