Grey Relational Analysis Based k Nearest Neighbor Missing Data Imputation for Software Quality Datasets
Jianglin Huang, Hongyi Sun
2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), 2016-08-01
DOI: 10.1109/QRS.2016.20 (https://doi.org/10.1109/QRS.2016.20)
Citations: 11
Abstract
Software quality estimation is important yet difficult in software engineering studies. Historical quality datasets are used to build classification models for estimating fault-proneness. However, missing values in these datasets severely degrade estimation ability and can therefore lead to inconclusive decision-making. Among single imputation approaches, k nearest neighbor (kNN) imputation is popular in empirical studies because of its relatively high accuracy. However, researchers are still seeking the optimal parameter setting for kNN imputation. In this study, a novel grey relational analysis based incomplete-instance kNN imputation is built for software quality data. An evaluation is conducted on four quality datasets under different simulated missingness scenarios to analyze the performance of the proposed imputation. The empirical results show that the proposed approach is superior to traditional kNN imputation and mean imputation in most cases. Moreover, classification accuracy can be maintained or even improved when this approach is used in classification tasks.
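The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a grey relational analysis based kNN imputation could look like, not the authors' implementation. It assumes the standard grey relational coefficient with a distinguishing coefficient `zeta = 0.5`, min-max scaled attributes, a grade taken as the mean coefficient over attributes observed in both instances, and neighbors that may themselves be incomplete as long as they hold the needed value. The function name `gra_knn_impute` and all parameters are hypothetical.

```python
import numpy as np

def gra_knn_impute(X, k=3, zeta=0.5):
    """Fill NaNs using the k rows with the highest grey relational grade.

    Assumptions (not taken from the paper): min-max scaling, the standard
    grey relational coefficient, and unweighted mean of the donors' values.
    Every column is assumed to hold at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()

    # Min-max scale each column (ignoring NaN) so differences are comparable.
    col_min = np.nanmin(X, axis=0)
    col_range = np.nanmax(X, axis=0) - col_min
    col_range[col_range == 0] = 1.0
    Z = (X - col_min) / col_range

    for i in np.where(np.isnan(X).any(axis=1))[0]:
        others = np.delete(np.arange(len(X)), i)
        # Absolute differences; NaN wherever either row lacks the attribute.
        delta = np.abs(Z[others] - Z[i])
        if np.isnan(delta).all():
            continue  # no shared attributes with any other row
        d_min, d_max = np.nanmin(delta), np.nanmax(delta)
        if d_max == 0:
            # All shared attributes are identical: every coefficient is 1.
            coef = np.where(np.isnan(delta), np.nan, 1.0)
        else:
            coef = (d_min + zeta * d_max) / (delta + zeta * d_max)

        # Grade = mean coefficient over the attributes both rows observe.
        counts = (~np.isnan(coef)).sum(axis=1)
        valid = counts > 0
        grade = np.nansum(coef, axis=1)[valid] / counts[valid]
        ranked = others[valid][np.argsort(-grade)]  # most similar first

        for j in np.where(np.isnan(X[i]))[0]:
            # Keep the k best-ranked rows that actually observe attribute j.
            donors = [r for r in ranked if not np.isnan(X[r, j])][:k]
            if donors:
                filled[i, j] = np.mean(X[donors, j])
    return filled

# Toy data: row 1 is missing its third attribute.
data = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, np.nan],
    [5.0, 6.0, 7.0],
    [1.0, 2.0, 3.2],
])
imputed = gra_knn_impute(data, k=2)
```

On this toy input, rows 0 and 3 receive the highest grades with respect to row 1, so the missing value is filled with the average of their third attribute (about 3.1). A weighted average by grade would be an equally plausible design choice; the unweighted mean is used here only to keep the sketch short.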