Detection of code smells using machine learning techniques combined with data-balancing methods

International Journal of Advances in Intelligent Informatics. Pub Date: 2023-11-01. DOI: 10.26555/ijain.v9i3.981

Nasraldeen Alnor Adam Khleel, Károly Nehéz



Abstract

Code smells are prevalent software design issues that arise when implementation or design principles are violated, and they manifest as symptoms or anomalies in the source code. Timely identification of code smells plays a crucial role in enhancing software quality and facilitating software maintenance. Previous studies have shown that code smells can be detected with machine learning (ML) methods. However, despite the increasing popularity of these methods, research suggests that they are not always suitable because of the problem of imbalanced data, which may reduce the effectiveness of ML models. This study proposes a novel method for detecting code smells that employs five ML algorithms: decision tree (DT), k-nearest neighbors (K-NN), support vector machine (SVM), XGBoost (XGB), and multi-layer perceptron (MLP). To tackle the challenge of imbalanced data, the proposed method incorporates the random oversampling technique. Experiments were conducted on four code smell datasets: god-class, data-class, long-method, and feature-envy. The experimental outcomes were evaluated and compared using various performance metrics. Comparing the models on the original and the balanced datasets, we found that on the original datasets the XGB model achieved the highest accuracy, 100%, for detecting the data-class and long-method smells. On the balanced datasets, the highest accuracy of 100% for the data-class and long-method smells was obtained by the DT, SVM, and XGB models. According to the empirical findings, there is significant promise in using ML techniques for the accurate prediction of code smells.
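The random oversampling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_oversample`, the toy metric values, and the seed are assumptions made for illustration, and the sketch uses only the Python standard library rather than any particular ML toolkit. It duplicates randomly chosen minority-class rows (sampling with replacement) until every class has as many rows as the largest one.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_features = list(features)
    out_labels = list(labels)
    for cls, count in counts.items():
        # Indices of the original rows that belong to this class.
        pool = [i for i, label in enumerate(labels) if label == cls]
        for _ in range(target - count):
            i = rng.choice(pool)  # sampling with replacement
            out_features.append(features[i])
            out_labels.append(cls)
    return out_features, out_labels


# Hypothetical toy data: two code metrics per class; label 1 marks a smelly class.
X = [[120, 4], [95, 3], [110, 5], [80, 2], [900, 40], [100, 3]]
y = [0, 0, 0, 0, 1, 0]

X_bal, y_bal = random_oversample(X, y, seed=42)
print(Counter(y), "->", Counter(y_bal))
```

The balanced rows would then be passed to a classifier such as a decision tree. A common precaution, which the abstract does not discuss, is to oversample only the training split so that duplicated rows cannot appear in both the training and the test data.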

