Zexian Zhang, Lin Zhu, Shuang Yin, Wenhua Hu, Shan Gao, Haoxuan Chen, Fuyang Li
{"title":"The impact of feature selection and feature reduction techniques for code smell detection: A comprehensive empirical study","authors":"Zexian Zhang, Lin Zhu, Shuang Yin, Wenhua Hu, Shan Gao, Haoxuan Chen, Fuyang Li","doi":"10.1007/s10515-025-00524-6","DOIUrl":null,"url":null,"abstract":"<div><p>Code smell detection using machine/deep learning methods aims to classify code instances as smelly or non-smelly based on extracted features. Accurate detection relies on optimizing feature sets by focusing on relevant features while discarding those that are redundant or irrelevant. However, prior studies on feature selection and reduction techniques for code smell detection have yielded inconsistent results, possibly due to limited exploration of available techniques. To address this gap, we comprehensively analyze 33 feature selection and 6 feature reduction techniques across seven classification models and six code smell datasets. And we apply the Scott-Knott effect size difference test for comparing performance and McNemar’s test for assessing prediction diversity. The results show that (1) Not all feature selection and reduction techniques significantly improve detection performance. (2) Feature extraction techniques generally perform worse than feature selection techniques. (3) Probabilistic significance is recommended as a “generic” feature selection technique due to its higher consistency in identifying smelly instances. (4) High-frequency features selected by the top feature selection techniques vary by dataset, highlighting their specific relevance for identifying the corresponding code smells. Based on these findings, we provide implications for further code smell detection research.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 2","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Automated Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10515-025-00524-6","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0
Abstract
Code smell detection using machine/deep learning methods aims to classify code instances as smelly or non-smelly based on extracted features. Accurate detection relies on optimizing feature sets by focusing on relevant features while discarding those that are redundant or irrelevant. However, prior studies on feature selection and reduction techniques for code smell detection have yielded inconsistent results, possibly due to limited exploration of available techniques. To address this gap, we comprehensively analyze 33 feature selection and 6 feature reduction techniques across seven classification models and six code smell datasets. And we apply the Scott-Knott effect size difference test for comparing performance and McNemar’s test for assessing prediction diversity. The results show that (1) Not all feature selection and reduction techniques significantly improve detection performance. (2) Feature extraction techniques generally perform worse than feature selection techniques. (3) Probabilistic significance is recommended as a “generic” feature selection technique due to its higher consistency in identifying smelly instances. (4) High-frequency features selected by the top feature selection techniques vary by dataset, highlighting their specific relevance for identifying the corresponding code smells. Based on these findings, we provide implications for further code smell detection research.
期刊介绍:
This journal details research, tutorial papers, survey and accounts of significant industrial experience in the foundations, techniques, tools and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences and workshops.