基于聚类集合学习的软件缺陷预测方法

IF 1.3 4区计算机科学 Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING

IET Software Pub Date : 2024-11-19 DOI:10.1049/2024/6294422

Hongwei Tao, Qiaoling Cao, Haoran Chen, Yanting Li, Xiaoxu Niu, Tao Wang, Zhenhao Geng, Songtao Shang

{"title":"基于聚类集合学习的软件缺陷预测方法","authors":"Hongwei Tao, Qiaoling Cao, Haoran Chen, Yanting Li, Xiaoxu Niu, Tao Wang, Zhenhao Geng, Songtao Shang","doi":"10.1049/2024/6294422","DOIUrl":null,"url":null,"abstract":"<p>The technique of software defect prediction aims to assess and predict potential defects in software projects and has made significant progress in recent years within software development. In previous studies, this technique largely relied on supervised learning methods, requiring a substantial amount of labeled historical defect data to train the models. However, obtaining these labeled data often demands significant time and resources. In contrast, software defect prediction based on unsupervised learning does not depend on known labeled data, eliminating the need for large-scale data labeling, thereby saving considerable time and resources while providing a more flexible solution for ensuring software quality. This paper conducts software defect prediction using unsupervised learning methods on data from 16 projects across two public datasets (PROMISE and NASA). During the feature selection step, a chi-squared sparse feature selection method is proposed. This feature selection strategy combines chi-squared tests with sparse principal component analysis (SPCA). Specifically, the chi-squared test is first used to filter out the most statistically significant features, and then the SPCA is applied to reduce the dimensionality of these significant features. In the clustering step, the dot product matrix and Pearson correlation coefficient (PCC) matrix are used to construct weighted adjacency matrices, and a clustering overlap method is proposed. This method integrates spectral clustering, Newman clustering, fluid clustering, and Clauset–Newman–Moore (CNM) clustering through ensemble learning. Experimental results indicate that, in the absence of labeled data, using the chi-squared sparse method for feature selection demonstrates superior performance, and the proposed clustering overlap method outperforms or is comparable to the effectiveness of the four baseline clustering methods.</p>","PeriodicalId":50378,"journal":{"name":"IET Software","volume":"2024 1","pages":""},"PeriodicalIF":1.3000,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/2024/6294422","citationCount":"0","resultStr":"{\"title\":\"Software Defect Prediction Method Based on Clustering Ensemble Learning\",\"authors\":\"Hongwei Tao, Qiaoling Cao, Haoran Chen, Yanting Li, Xiaoxu Niu, Tao Wang, Zhenhao Geng, Songtao Shang\",\"doi\":\"10.1049/2024/6294422\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The technique of software defect prediction aims to assess and predict potential defects in software projects and has made significant progress in recent years within software development. In previous studies, this technique largely relied on supervised learning methods, requiring a substantial amount of labeled historical defect data to train the models. However, obtaining these labeled data often demands significant time and resources. In contrast, software defect prediction based on unsupervised learning does not depend on known labeled data, eliminating the need for large-scale data labeling, thereby saving considerable time and resources while providing a more flexible solution for ensuring software quality. This paper conducts software defect prediction using unsupervised learning methods on data from 16 projects across two public datasets (PROMISE and NASA). During the feature selection step, a chi-squared sparse feature selection method is proposed. This feature selection strategy combines chi-squared tests with sparse principal component analysis (SPCA). Specifically, the chi-squared test is first used to filter out the most statistically significant features, and then the SPCA is applied to reduce the dimensionality of these significant features. In the clustering step, the dot product matrix and Pearson correlation coefficient (PCC) matrix are used to construct weighted adjacency matrices, and a clustering overlap method is proposed. This method integrates spectral clustering, Newman clustering, fluid clustering, and Clauset–Newman–Moore (CNM) clustering through ensemble learning. Experimental results indicate that, in the absence of labeled data, using the chi-squared sparse method for feature selection demonstrates superior performance, and the proposed clustering overlap method outperforms or is comparable to the effectiveness of the four baseline clustering methods.</p>\",\"PeriodicalId\":50378,\"journal\":{\"name\":\"IET Software\",\"volume\":\"2024 1\",\"pages\":\"\"},\"PeriodicalIF\":1.3000,\"publicationDate\":\"2024-11-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1049/2024/6294422\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IET Software\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/2024/6294422\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Software","FirstCategoryId":"94","ListUrlMain":"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/2024/6294422","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

摘要

软件缺陷预测技术旨在评估和预测软件项目中的潜在缺陷，近年来在软件开发领域取得了重大进展。在以往的研究中，该技术主要依赖于监督学习方法，需要大量标注的历史缺陷数据来训练模型。然而，获取这些标注数据往往需要大量的时间和资源。相比之下，基于无监督学习的软件缺陷预测不依赖于已知的标记数据，无需进行大规模的数据标记，从而节省了大量的时间和资源，同时为确保软件质量提供了更灵活的解决方案。本文使用无监督学习方法对两个公共数据集（PROMISE 和 NASA）中 16 个项目的数据进行了软件缺陷预测。在特征选择步骤中，提出了一种奇平方稀疏特征选择方法。这种特征选择策略结合了卡方检验和稀疏主成分分析（SPCA）。具体来说，首先使用卡方检验筛选出统计意义最显著的特征，然后应用 SPCA 降低这些显著特征的维度。在聚类步骤中，利用点积矩阵和皮尔逊相关系数（PCC）矩阵构建加权邻接矩阵，并提出一种聚类重叠方法。该方法通过集合学习将光谱聚类、纽曼聚类、流体聚类和克劳塞特-纽曼-摩尔（CNM）聚类整合在一起。实验结果表明，在没有标注数据的情况下，使用秩方稀疏法进行特征选择表现出更优越的性能，而所提出的聚类重叠方法则优于或相当于四种基线聚类方法的效果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Software Defect Prediction Method Based on Clustering Ensemble Learning

查看原文本刊更多论文

Software Defect Prediction Method Based on Clustering Ensemble Learning

The technique of software defect prediction aims to assess and predict potential defects in software projects and has made significant progress in recent years within software development. In previous studies, this technique largely relied on supervised learning methods, requiring a substantial amount of labeled historical defect data to train the models. However, obtaining these labeled data often demands significant time and resources. In contrast, software defect prediction based on unsupervised learning does not depend on known labeled data, eliminating the need for large-scale data labeling, thereby saving considerable time and resources while providing a more flexible solution for ensuring software quality. This paper conducts software defect prediction using unsupervised learning methods on data from 16 projects across two public datasets (PROMISE and NASA). During the feature selection step, a chi-squared sparse feature selection method is proposed. This feature selection strategy combines chi-squared tests with sparse principal component analysis (SPCA). Specifically, the chi-squared test is first used to filter out the most statistically significant features, and then the SPCA is applied to reduce the dimensionality of these significant features. In the clustering step, the dot product matrix and Pearson correlation coefficient (PCC) matrix are used to construct weighted adjacency matrices, and a clustering overlap method is proposed. This method integrates spectral clustering, Newman clustering, fluid clustering, and Clauset–Newman–Moore (CNM) clustering through ensemble learning. Experimental results indicate that, in the absence of labeled data, using the chi-squared sparse method for feature selection demonstrates superior performance, and the proposed clustering overlap method outperforms or is comparable to the effectiveness of the four baseline clustering methods.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IET Software 工程技术-计算机：软件工程

CiteScore

4.20

自引率

0.00%

发文量

审稿时长

9 months

期刊介绍： IET Software publishes papers on all aspects of the software lifecycle, including design, development, implementation and maintenance. The focus of the journal is on the methods used to develop and maintain software, and their practical application. Authors are especially encouraged to submit papers on the following topics, although papers on all aspects of software engineering are welcome: Software and systems requirements engineering Formal methods, design methods, practice and experience Software architecture, aspect and object orientation, reuse and re-engineering Testing, verification and validation techniques Software dependability and measurement Human systems engineering and human-computer interaction Knowledge engineering; expert and knowledge-based systems, intelligent agents Information systems engineering Application of software engineering in industry and commerce Software engineering technology transfer Management of software development Theoretical aspects of software development Machine learning Big data and big code Cloud computing Current Special Issue. Call for papers: Knowledge Discovery for Software Development - https://digital-library.theiet.org/files/IET_SEN_CFP_KDSD.pdf Big Data Analytics for Sustainable Software Development - https://digital-library.theiet.org/files/IET_SEN_CFP_BDASSD.pdf