软件缺陷预测的半监督方法

2014 IEEE 38th Annual Computer Software and Applications Conference Pub Date : 2014-07-21 DOI:10.1109/COMPSAC.2014.65

Huihua Lu, B. Cukic, M. Culp

{"title":"软件缺陷预测的半监督方法","authors":"Huihua Lu, B. Cukic, M. Culp","doi":"10.1109/COMPSAC.2014.65","DOIUrl":null,"url":null,"abstract":"Accurate detection of software components that need to be exposed to additional verification and validation offers the path to high quality products while minimizing non essential software assurance expenditures. In this type of quality modeling we assume that software modules with known fault content developed in similar environment are available. Supervised learning algorithms are the traditional methods of choice for training on existing modules. The models are then used to predict fault content for newly developed software components prior to product release. However, one needs to realize that establishing whether a module contains a fault or not, only to be used for model training, can be expensive. The basic idea behind semi-supervised learning is to learn from a small number of software modules with known fault content and supplement model training with modules for which the fault information is not available, thus reducing the overall cost of quality assurance. In this study, we investigate the performance of semi-supervised learning for software fault prediction. A preprocessing strategy, multidimensional scaling, is embedded in the approach to reduce the dimensional complexity of software metrics used for prediction. Our results show that the dimension-reduction with semi-supervised learning algorithm preforms significantly better than one of the best performing supervised learning algorithm - random forest - in situations when few modules with known fault content are available. We compare our results with the published benchmarks and clearly demonstrate performance benefits.","PeriodicalId":106871,"journal":{"name":"2014 IEEE 38th Annual Computer Software and Applications Conference","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":"{\"title\":\"A Semi-supervised Approach to Software Defect Prediction\",\"authors\":\"Huihua Lu, B. Cukic, M. Culp\",\"doi\":\"10.1109/COMPSAC.2014.65\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accurate detection of software components that need to be exposed to additional verification and validation offers the path to high quality products while minimizing non essential software assurance expenditures. In this type of quality modeling we assume that software modules with known fault content developed in similar environment are available. Supervised learning algorithms are the traditional methods of choice for training on existing modules. The models are then used to predict fault content for newly developed software components prior to product release. However, one needs to realize that establishing whether a module contains a fault or not, only to be used for model training, can be expensive. The basic idea behind semi-supervised learning is to learn from a small number of software modules with known fault content and supplement model training with modules for which the fault information is not available, thus reducing the overall cost of quality assurance. In this study, we investigate the performance of semi-supervised learning for software fault prediction. A preprocessing strategy, multidimensional scaling, is embedded in the approach to reduce the dimensional complexity of software metrics used for prediction. Our results show that the dimension-reduction with semi-supervised learning algorithm preforms significantly better than one of the best performing supervised learning algorithm - random forest - in situations when few modules with known fault content are available. We compare our results with the published benchmarks and clearly demonstrate performance benefits.\",\"PeriodicalId\":106871,\"journal\":{\"name\":\"2014 IEEE 38th Annual Computer Software and Applications Conference\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-07-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"18\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 IEEE 38th Annual Computer Software and Applications Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/COMPSAC.2014.65\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 38th Annual Computer Software and Applications Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/COMPSAC.2014.65","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 18

摘要

对需要进行额外验证和确认的软件组件的准确检测为获得高质量的产品提供了途径，同时最大限度地减少了不必要的软件保证支出。在这种类型的质量建模中，我们假设在类似环境中开发的具有已知故障内容的软件模块是可用的。监督学习算法是在现有模块上进行训练的传统方法。然后，这些模型用于在产品发布之前预测新开发的软件组件的故障内容。然而，需要认识到，建立一个模块是否包含故障，仅用于模型训练，可能是昂贵的。半监督学习的基本思想是，从少数已知故障内容的软件模块中学习，并用无法获得故障信息的模块补充模型训练，从而降低质量保证的总体成本。在这项研究中，我们研究了半监督学习在软件故障预测中的性能。一种预处理策略，多维缩放，被嵌入到方法中，以减少用于预测的软件度量的维度复杂性。我们的研究结果表明，在已知故障内容的模块很少的情况下，半监督学习算法的降维效果明显优于性能最好的监督学习算法之一——随机森林。我们将结果与发布的基准测试进行比较，并清楚地展示性能优势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Semi-supervised Approach to Software Defect Prediction

Accurate detection of software components that need to be exposed to additional verification and validation offers the path to high quality products while minimizing non essential software assurance expenditures. In this type of quality modeling we assume that software modules with known fault content developed in similar environment are available. Supervised learning algorithms are the traditional methods of choice for training on existing modules. The models are then used to predict fault content for newly developed software components prior to product release. However, one needs to realize that establishing whether a module contains a fault or not, only to be used for model training, can be expensive. The basic idea behind semi-supervised learning is to learn from a small number of software modules with known fault content and supplement model training with modules for which the fault information is not available, thus reducing the overall cost of quality assurance. In this study, we investigate the performance of semi-supervised learning for software fault prediction. A preprocessing strategy, multidimensional scaling, is embedded in the approach to reduce the dimensional complexity of software metrics used for prediction. Our results show that the dimension-reduction with semi-supervised learning algorithm preforms significantly better than one of the best performing supervised learning algorithm - random forest - in situations when few modules with known fault content are available. We compare our results with the published benchmarks and clearly demonstrate performance benefits.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 IEEE 38th Annual Computer Software and Applications Conference

自引率

0.00%

发文量