The impact of unsupervised feature selection techniques on the performance and interpretation of defect prediction models
Zhiqiang Li, Wenzhi Zhu, Hongyu Zhang, Yuantian Miao, Jie Ren
Automated Software Engineering, Volume 32, Issue 2 (published 2025-04-16)
DOI: 10.1007/s10515-025-00510-y
https://link.springer.com/article/10.1007/s10515-025-00510-y
Citations: 0
Abstract
The performance and interpretation of a defect prediction model depend on the software metrics used in its construction. Feature selection techniques can enhance model performance and interpretation by removing redundant, correlated, and irrelevant metrics from defect datasets. Previous empirical studies have scrutinized the impact of feature selection techniques on the performance and interpretation of defect prediction models. However, the feature selection techniques examined in these studies are primarily supervised. In particular, the impact of unsupervised feature selection (UFS) techniques on defect prediction remains unknown and needs to be explored extensively. To address this gap, we systematically apply 21 UFS techniques and evaluate their impact on the performance and interpretation of unsupervised defect prediction models in binary classification and effort-aware ranking scenarios. Extensive experiments are conducted on 28 versions of 8 projects using 4 unsupervised models. We observe that: (1) 10–100% of the selected metrics are inconsistent between each pair of UFS techniques. (2) 29–100% of the selected metrics are inconsistent among different software modules. (3) For unsupervised defect prediction models, some UFS techniques (e.g., AutoSpearman, LS, and FMIUFS) effectively reduce the number of metrics while maintaining or even improving model performance. (4) UFS techniques alter the ranking of the top 3 groups of metrics in defect models, affecting the interpretation of these models. Based on these findings, we recommend that software practitioners utilize UFS techniques for unsupervised defect prediction. However, caution should be exercised when deriving insights and interpretations from defect prediction models.
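The abstract names AutoSpearman among the UFS techniques that shrink the metric set while maintaining model performance. As an illustration only, the sketch below implements the correlation-filtering idea behind such techniques: iteratively drop one metric from the most strongly correlated pair until no pair exceeds a Spearman correlation threshold. This is a minimal sketch, not the paper's implementation; the function name `spearman_filter`, the 0.7 threshold, the tie-breaking heuristic, and the synthetic metric columns are assumptions for illustration, and the multicollinearity (VIF-based) stage of the full AutoSpearman technique is omitted.

```python
# Minimal sketch of correlation-based unsupervised metric filtering
# (the Spearman-correlation stage used by techniques such as AutoSpearman).
# All names and thresholds here are illustrative assumptions.
import numpy as np
import pandas as pd


def spearman_filter(metrics: pd.DataFrame, threshold: float = 0.7) -> list:
    """Keep metrics whose pairwise |Spearman rho| stays below `threshold`.

    Unsupervised: defect labels are never consulted.
    """
    kept = list(metrics.columns)
    while len(kept) > 1:
        corr = metrics[kept].corr(method="spearman").abs()
        rho = corr.to_numpy(copy=True)
        np.fill_diagonal(rho, 0.0)                 # ignore self-correlation
        if rho.max() < threshold:
            break
        # Find the most strongly correlated pair of metrics ...
        i, j = np.unravel_index(rho.argmax(), rho.shape)
        m1, m2 = corr.index[i], corr.columns[j]
        # ... and drop the one that is, on average, more correlated with
        # the remaining metrics (a heuristic tie-breaker, assumed here).
        drop = m1 if corr[m1].mean() >= corr[m2].mean() else m2
        kept.remove(drop)
    return kept


# Hypothetical usage on synthetic static code metrics; the column names
# ("loc", "wmc", "cbo", "rfc") are illustrative, not the paper's datasets.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    loc = rng.integers(10, 1000, size=200).astype(float)
    data = pd.DataFrame({
        "loc": loc,
        "wmc": loc * 0.1 + rng.normal(0, 5, 200),  # strongly correlated with loc
        "cbo": rng.integers(0, 30, size=200),
        "rfc": rng.integers(0, 100, size=200),
    })
    print(spearman_filter(data))  # the loc/wmc pair is reduced to one metric
```

In this toy run, `loc` and `wmc` are monotonically related, so the filter keeps only one of them while leaving the uncorrelated metrics untouched, which mirrors the kind of metric reduction the study attributes to UFS techniques.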
About the journal:
This journal publishes research papers, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering spans both automatic and collaborative systems, as well as computational models of human software engineering activities. In addition, the journal presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, along with formal techniques that support or provide theoretical foundations. It also includes reviews of books, software, conferences, and workshops.