The impact of feature selection and feature reduction techniques for code smell detection: A comprehensive empirical study

Impact factor: 3.1 | Region 2 (Computer Science) | JCR Q3 (Computer Science, Software Engineering)

Automated Software Engineering | Pub Date: 2025-05-16 | DOI: 10.1007/s10515-025-00524-6

Zexian Zhang, Lin Zhu, Shuang Yin, Wenhua Hu, Shan Gao, Haoxuan Chen, Fuyang Li

Volume 32, issue 2. Full text: https://link.springer.com/article/10.1007/s10515-025-00524-6

Citations: 0

Abstract

Code smell detection using machine learning and deep learning methods aims to classify code instances as smelly or non-smelly based on extracted features. Accurate detection relies on optimizing feature sets by focusing on relevant features while discarding those that are redundant or irrelevant. However, prior studies on feature selection and reduction techniques for code smell detection have yielded inconsistent results, possibly due to limited exploration of available techniques. To address this gap, we comprehensively analyze 33 feature selection and 6 feature reduction techniques across seven classification models and six code smell datasets. We apply the Scott-Knott effect size difference test to compare performance and McNemar’s test to assess prediction diversity. The results show that (1) not all feature selection and reduction techniques significantly improve detection performance; (2) feature extraction techniques generally perform worse than feature selection techniques; (3) probabilistic significance is recommended as a “generic” feature selection technique due to its higher consistency in identifying smelly instances; and (4) high-frequency features selected by the top feature selection techniques vary by dataset, highlighting their specific relevance for identifying the corresponding code smells. Based on these findings, we provide implications for further code smell detection research.
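As an illustration of the McNemar's test mentioned in the abstract, the following is a minimal sketch of the exact (binomial) two-sided variant applied to the predictions of two detectors on the same instances. The abstract does not say which variant or implementation the authors used, so the function name `mcnemar_exact`, the toy labels, and the choice of the exact form are assumptions made only for this example.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where
      b = instances detector A classifies correctly and detector B does not,
      c = instances detector B classifies correctly and detector A does not.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")

    b = c = 0
    for truth, a, other in zip(y_true, pred_a, pred_b):
        a_ok = a == truth
        b_ok = other == truth
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1

    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0

    # Under H0 the discordant pairs split as Binomial(n, 0.5);
    # double the smaller tail for a two-sided p-value, capped at 1.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Toy data (made up): 1 = smelly, 0 = non-smelly.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]  # detector A: one false positive
pred_b = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # detector B: four missed smells

result = mcnemar_exact(y_true, pred_a, pred_b)
print(result)  # (4, 1, 0.375)
```

With only five discordant instances, the two detectors' predictions are not significantly different at the usual 0.05 level, even though their error patterns look different. A real comparison would use far more instances than this toy example.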




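Source journal: Automated Software Engineering (Engineering Technology: Computer Science, Software Engineering)

CiteScore: 4.80

Self-citation rate: 11.80%

Articles published: (value missing)

Review time: >12 weeks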

Journal description: This journal publishes research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.