利用工业环境中提交日志的潜在特征和语义特征预测缺陷

Beyza Eken, RiFat Atar, Sahra Sertalp, Ayse Tosun Misirli
{"title":"利用工业环境中提交日志的潜在特征和语义特征预测缺陷","authors":"Beyza Eken, RiFat Atar, Sahra Sertalp, Ayse Tosun Misirli","doi":"10.1109/ASEW.2019.00038","DOIUrl":null,"url":null,"abstract":"Software defect prediction is still a challenging task in industrial settings. Noisy data and/or lack of data make it hard to build successful prediction models. In this study, we aim to build a change-level defect prediction model for a software project in an industrial setting. We combine various probabilistic models, namely matrix factorization and topic modeling, with the expectation of overcoming the noisy and limited nature of industrial settings by extracting hidden features from multiple resources. Commit level process metrics, latent features from commits, and semantic features from commit messages are combined to build the defect predictors with the use of Log Filtering and feature selection techniques, and two machine learning algorithms Naive Bayes and Extreme Gradient Boosting (XGBoost). Collecting data from various sources and applying data pre-processing techniques show a statistically significant improvement in terms of probability of detection by up to 24% when compared to a base model with process metrics only.","PeriodicalId":277020,"journal":{"name":"2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Predicting Defects with Latent and Semantic Features from Commit Logs in an Industrial Setting\",\"authors\":\"Beyza Eken, RiFat Atar, Sahra Sertalp, Ayse Tosun Misirli\",\"doi\":\"10.1109/ASEW.2019.00038\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Software defect prediction is still a challenging task in industrial settings. Noisy data and/or lack of data make it hard to build successful prediction models. In this study, we aim to build a change-level defect prediction model for a software project in an industrial setting. We combine various probabilistic models, namely matrix factorization and topic modeling, with the expectation of overcoming the noisy and limited nature of industrial settings by extracting hidden features from multiple resources. Commit level process metrics, latent features from commits, and semantic features from commit messages are combined to build the defect predictors with the use of Log Filtering and feature selection techniques, and two machine learning algorithms Naive Bayes and Extreme Gradient Boosting (XGBoost). Collecting data from various sources and applying data pre-processing techniques show a statistically significant improvement in terms of probability of detection by up to 24% when compared to a base model with process metrics only.\",\"PeriodicalId\":277020,\"journal\":{\"name\":\"2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW)\",\"volume\":\"28 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASEW.2019.00038\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASEW.2019.00038","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

摘要

在工业环境中,软件缺陷预测仍然是一项具有挑战性的任务。由于数据嘈杂和/或缺乏数据,很难建立成功的预测模型。在本研究中,我们旨在为工业环境中的软件项目建立一个变化级缺陷预测模型。我们结合了各种概率模型,即矩阵因式分解和主题建模,期望通过从多种资源中提取隐藏特征,克服工业环境中的噪声和有限性。通过使用对数过滤和特征选择技术,以及 Naive Bayes 和 Extreme Gradient Boosting (XGBoost) 两种机器学习算法,我们将提交级别的流程指标、提交中的潜在特征和提交消息中的语义特征结合起来,构建了缺陷预测器。通过收集各种来源的数据并应用数据预处理技术,与仅使用流程指标的基础模型相比,检测概率在统计学上有了显著提高,最高提高了 24%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Predicting Defects with Latent and Semantic Features from Commit Logs in an Industrial Setting
Software defect prediction is still a challenging task in industrial settings. Noisy data and/or lack of data make it hard to build successful prediction models. In this study, we aim to build a change-level defect prediction model for a software project in an industrial setting. We combine various probabilistic models, namely matrix factorization and topic modeling, with the expectation of overcoming the noisy and limited nature of industrial settings by extracting hidden features from multiple resources. Commit level process metrics, latent features from commits, and semantic features from commit messages are combined to build the defect predictors with the use of Log Filtering and feature selection techniques, and two machine learning algorithms Naive Bayes and Extreme Gradient Boosting (XGBoost). Collecting data from various sources and applying data pre-processing techniques show a statistically significant improvement in terms of probability of detection by up to 24% when compared to a base model with process metrics only.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信