{"title":"利用工业环境中提交日志的潜在特征和语义特征预测缺陷","authors":"Beyza Eken, RiFat Atar, Sahra Sertalp, Ayse Tosun Misirli","doi":"10.1109/ASEW.2019.00038","DOIUrl":null,"url":null,"abstract":"Software defect prediction is still a challenging task in industrial settings. Noisy data and/or lack of data make it hard to build successful prediction models. In this study, we aim to build a change-level defect prediction model for a software project in an industrial setting. We combine various probabilistic models, namely matrix factorization and topic modeling, with the expectation of overcoming the noisy and limited nature of industrial settings by extracting hidden features from multiple resources. Commit level process metrics, latent features from commits, and semantic features from commit messages are combined to build the defect predictors with the use of Log Filtering and feature selection techniques, and two machine learning algorithms Naive Bayes and Extreme Gradient Boosting (XGBoost). Collecting data from various sources and applying data pre-processing techniques show a statistically significant improvement in terms of probability of detection by up to 24% when compared to a base model with process metrics only.","PeriodicalId":277020,"journal":{"name":"2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Predicting Defects with Latent and Semantic Features from Commit Logs in an Industrial Setting\",\"authors\":\"Beyza Eken, RiFat Atar, Sahra Sertalp, Ayse Tosun Misirli\",\"doi\":\"10.1109/ASEW.2019.00038\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Software defect prediction is still a challenging task in industrial settings. Noisy data and/or lack of data make it hard to build successful prediction models. In this study, we aim to build a change-level defect prediction model for a software project in an industrial setting. We combine various probabilistic models, namely matrix factorization and topic modeling, with the expectation of overcoming the noisy and limited nature of industrial settings by extracting hidden features from multiple resources. Commit level process metrics, latent features from commits, and semantic features from commit messages are combined to build the defect predictors with the use of Log Filtering and feature selection techniques, and two machine learning algorithms Naive Bayes and Extreme Gradient Boosting (XGBoost). Collecting data from various sources and applying data pre-processing techniques show a statistically significant improvement in terms of probability of detection by up to 24% when compared to a base model with process metrics only.\",\"PeriodicalId\":277020,\"journal\":{\"name\":\"2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW)\",\"volume\":\"28 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASEW.2019.00038\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASEW.2019.00038","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Predicting Defects with Latent and Semantic Features from Commit Logs in an Industrial Setting
Software defect prediction is still a challenging task in industrial settings. Noisy data and/or lack of data make it hard to build successful prediction models. In this study, we aim to build a change-level defect prediction model for a software project in an industrial setting. We combine various probabilistic models, namely matrix factorization and topic modeling, with the expectation of overcoming the noisy and limited nature of industrial settings by extracting hidden features from multiple resources. Commit level process metrics, latent features from commits, and semantic features from commit messages are combined to build the defect predictors with the use of Log Filtering and feature selection techniques, and two machine learning algorithms Naive Bayes and Extreme Gradient Boosting (XGBoost). Collecting data from various sources and applying data pre-processing techniques show a statistically significant improvement in terms of probability of detection by up to 24% when compared to a base model with process metrics only.