An investigation of analysis techniques for software datasets
Authors: L. Pickard, B. Kitchenham, Susan Linkman
Venue: Proceedings Sixth International Software Metrics Symposium (Cat. No.PR00403)
Published: 1999-11-04
DOI: https://doi.org/10.1109/METRIC.1999.809734
The goal of the study was to investigate the efficacy of different data analysis techniques for software data. We used simulation to create datasets with a known underlying model and with non-Normal characteristics that are frequently found in software datasets: skewness, unstable variance, outliers, and combinations of these characteristics. We investigated three main statistically based data analysis techniques: residual analysis, multivariate regression, and classification and regression trees (CART). In addition to the standard "least squares" version of each technique, we also investigated robust and nonparametric versions. We found that standard multivariate regression techniques were best if the data exhibited only skewness. However, under more extreme conditions such as severe heteroscedasticity, the nonparametric residual analysis technique performed best. We also found that even when the analysis technique did not accurately recreate the true underlying model, the faulty model could generate reasonably good predictions. The study indicates that simulation is a very useful technique for evaluating different data analysis techniques.
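The simulation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the linear model, the gamma-distributed predictor, the noise scales, the outlier size, and the choice of a Theil-Sen estimator as the robust alternative are all assumptions made for illustration, and the function names `simulate`, `ols_fit`, and `theil_sen_fit` are hypothetical. The sketch generates data from a known model with optional skewness, unstable variance, and outliers, then checks how closely each fit recovers the true slope.

```python
import numpy as np

TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 3.0  # assumed "known underlying model"


def simulate(n, rng, skew=False, hetero=False, outlier_frac=0.0):
    """Draw (x, y) from the known linear model with optional non-Normal features."""
    # Skewness: a right-skewed gamma predictor instead of a uniform one.
    x = rng.gamma(shape=2.0, scale=2.0, size=n) if skew else rng.uniform(0.0, 10.0, size=n)
    # Unstable variance: noise standard deviation grows with x.
    sd = 0.5 * x if hetero else np.ones(n)
    y = TRUE_INTERCEPT + TRUE_SLOPE * x + rng.normal(0.0, 1.0, size=n) * sd
    # Outliers: shift a random subset of responses upward by a fixed amount.
    n_out = int(outlier_frac * n)
    if n_out > 0:
        idx = rng.choice(n, size=n_out, replace=False)
        y[idx] += 50.0
    return x, y


def ols_fit(x, y):
    """Ordinary least squares fit; returns (intercept, slope)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1]


def theil_sen_fit(x, y):
    """Theil-Sen fit: median of pairwise slopes; returns (intercept, slope)."""
    i, j = np.triu_indices(len(x), k=1)
    dx = x[j] - x[i]
    keep = dx != 0  # skip pairs with identical x values
    slope = np.median((y[j][keep] - y[i][keep]) / dx[keep])
    intercept = np.median(y - slope * x)
    return intercept, slope


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conditions = {
        "skewness only": dict(skew=True),
        "skew + unstable variance": dict(skew=True, hetero=True),
        "skew + variance + outliers": dict(skew=True, hetero=True, outlier_frac=0.1),
    }
    for name, kwargs in conditions.items():
        ols_err, ts_err = [], []
        for _ in range(200):  # replications per condition
            x, y = simulate(60, rng, **kwargs)
            ols_err.append(abs(ols_fit(x, y)[1] - TRUE_SLOPE))
            ts_err.append(abs(theil_sen_fit(x, y)[1] - TRUE_SLOPE))
        print(f"{name}: mean |slope error| OLS={np.mean(ols_err):.3f}, "
              f"Theil-Sen={np.mean(ts_err):.3f}")
```

Because the generating model is known, each technique can be scored directly against the true coefficients, which is what makes simulation attractive for this kind of comparison. Prediction accuracy on held-out simulated data could be scored in the same loop.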