A Novel Approach for Converting N-Dimensional Dataset into Two Dimensions to Improve Accuracy in Software Defect Prediction

e Informatica Softw. Eng. J. | Pub Date: 2020-11-01 | DOI: 10.17706/jsw.15.6.147-162

Rayhanul Islam, A. Satter, Atish Kumar Dipongkor, Md. Saeed Siddik, K. Sakib


Citations: 1

Abstract


A software defect prediction model is trained on code metrics and historical defect information to identify probable software defects. The accuracy and performance of a prediction model depend largely on the training dataset. To provide a proper training dataset, the data must be grouped into clusters with low variability using clustering algorithms. However, clustering is hampered by the many attributes of the dataset, such as Coupling between Objects, Response for Class, and Lines of Code. This research aims to predict software defects by reducing the code-metric dimensions to two latent variables, which in turn helps the clustering algorithms group data properly for the defect prediction model. In this paper, dataset similarities are analyzed by reducing the code metrics' attributes to two latent variables based on their impact on defects. That impact can be analyzed using regression analysis, because it identifies the relationship between a set of dependent and independent variables. The code metrics are then merged into two variables, PosImpactValue and NegImpactValue, according to whether their impact is positive or negative, respectively. As a result, a multi-dimensional dataset is mapped to a two-dimensional dataset. Plotting the dimension-reduced datasets enables distance-based clustering algorithms to group them by similarity. Experiments were performed on 18 releases of 6 open-source software datasets: jEdit, Ant, Xalan, Synapse, Tomcat, and Camel. For comparative analysis, the experiment used one of the most common dimension reduction techniques, Principal Component Analysis (PCA), and two clustering techniques popular in defect prediction, DBSCAN and WHERE. First, the dimensions of the experimental datasets were reduced using the proposed technique and PCA separately. Then, the reduced datasets were clustered using DBSCAN and WHERE independently to identify the number of defects accurately. The comparative analysis shows that defect prediction models based on the clustering algorithms are more accurate on datasets reduced by the proposed technique than on those reduced by PCA.
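
The reduction step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the metrics are combined, so two things here are assumptions rather than the authors' method: each metric's impact is estimated with a separate simple linear regression of defect count on that metric, and the two latent variables are formed as coefficient-weighted sums. The function names `slope` and `reduce_to_two_dims` and the toy data are hypothetical.

```python
from statistics import mean


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs (0.0 if xs is constant)."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var


def reduce_to_two_dims(rows, defects):
    """Map each row of code metrics to (PosImpactValue, NegImpactValue).

    Assumption: a metric's impact is the sign of its regression slope, and
    metrics are merged as a sum weighted by the slope's magnitude.
    """
    n_metrics = len(rows[0])
    coefs = [slope([r[j] for r in rows], defects) for j in range(n_metrics)]
    reduced = []
    for r in rows:
        # Metrics whose slope is positive contribute to PosImpactValue.
        pos = sum(c * v for c, v in zip(coefs, r) if c > 0)
        # Metrics whose slope is negative contribute to NegImpactValue.
        neg = sum(-c * v for c, v in zip(coefs, r) if c < 0)
        reduced.append((pos, neg))
    return coefs, reduced


# Hypothetical toy data: three metric columns per class, one defect count each.
rows = [(1, 10, 9), (2, 20, 7), (3, 30, 4), (4, 40, 2)]
defects = [0, 1, 2, 3]
coefs, points = reduce_to_two_dims(rows, defects)
```

Each entry of `points` is a two-dimensional point that a distance-based clustering algorithm such as DBSCAN could then group. A multiple regression over all metrics at once would be an equally plausible reading of the abstract.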
