Automated Configuration Bug Report Prediction Using Text Mining

2014 IEEE 38th Annual Computer Software and Applications Conference. Pub Date: 2014-07-21. DOI: 10.1109/COMPSAC.2014.17

Xin Xia, D. Lo, Weiwei Qiu, Xingen Wang, Bo Zhou


Citations: 49

Abstract

Configuration bugs are one of the dominant causes of software failures, and previous studies show that a single configuration bug can cause huge financial losses in a software system. Their importance has motivated a range of research on detecting, diagnosing, and fixing configuration bugs. Given a bug report, an approach that can identify whether the bug is a configuration bug could help developers reduce debugging effort. We refer to this problem as configuration bug report prediction. To address it, we develop a new automated framework that applies text mining to the natural-language descriptions of bug reports: it trains a statistical model on historical bug reports with known labels (i.e., configuration or non-configuration) and then uses that model to predict the label of a new bug report. Developers could apply our model to label bug reports automatically and so improve their productivity. Our tool first applies feature selection techniques (e.g., information gain and chi-square) to pre-process the textual information in bug reports, and then applies various text mining techniques (e.g., naive Bayes, SVM, and naive Bayes multinomial) to build statistical models. We evaluate our solution on 5 bug report datasets: accumulo, activemq, camel, flume, and wicket. Naive Bayes multinomial with information gain achieves the best performance. On average across the 5 projects, its accuracy, configuration F-measure, and non-configuration F-measure are 0.811, 0.450, and 0.880, respectively. We also compare our solution with the method proposed by Arshad et al. On average, our approach using naive Bayes multinomial with information gain improves the accuracy, configuration F-measure, and non-configuration F-measure of Arshad et al.'s method by 8.34%, 103.7%, and 4.24%, respectively.
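The pipeline described in the abstract (select informative terms by information gain, train a multinomial naive Bayes model on labeled reports, predict the label of a new report, and score each class with its own F-measure) can be illustrated with a minimal sketch. This is not the authors' implementation: the function and class names, the whitespace tokenizer, the Laplace smoothing, the choice of k = 5 features, and the seven toy bug reports are all assumptions made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokens; the paper's pre-processing may differ.
    return [t for t in text.lower().split() if t.isalpha()]

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    # Information gain of the binary feature "term occurs in the report".
    classes = sorted(set(labels))
    base = entropy([labels.count(c) for c in classes])
    cond = 0.0
    for present in (True, False):
        subset = [y for d, y in zip(docs, labels) if (term in d) == present]
        if subset:
            weight = len(subset) / len(labels)
            cond += weight * entropy([subset.count(c) for c in classes])
    return base - cond

def select_features(docs, labels, k):
    # Keep the k terms with the highest information gain (ties broken alphabetically).
    vocab = sorted({t for d in docs for t in d})
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))[:k]

class MultinomialNB:
    def fit(self, docs, labels, features):
        self.features = set(features)
        self.classes = sorted(set(labels))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(t for d in class_docs for t in d if t in self.features)
            total = sum(counts.values()) + len(self.features)  # Laplace smoothing
            self.log_like[c] = {t: math.log((counts[t] + 1) / total) for t in self.features}
        return self

    def predict(self, doc):
        # Terms outside the selected feature set are ignored.
        def score(c):
            return self.log_prior[c] + sum(self.log_like[c][t] for t in doc if t in self.features)
        return max(self.classes, key=score)

def f_measure(y_true, y_pred, positive):
    # Harmonic mean of precision and recall, treating `positive` as the target class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy reports, invented for this sketch (not from the paper's datasets).
train = [
    ("wrong value in config file breaks startup", "config"),
    ("setting property in xml config ignored", "config"),
    ("default timeout property not documented in config", "config"),
    ("null pointer exception in parser", "non-config"),
    ("race condition causes deadlock in consumer", "non-config"),
    ("memory leak when closing connection", "non-config"),
    ("index out of bounds exception in router", "non-config"),
]
docs = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]

features = select_features(docs, labels, k=5)
model = MultinomialNB().fit(docs, labels, features)

new_reports = ["timeout property missing from config file",
               "null pointer exception in consumer"]
predictions = [model.predict(tokenize(r)) for r in new_reports]
print(features)
print(predictions)
```

Reporting the F-measure separately for the configuration and non-configuration classes, as the abstract does, matters when one class is much rarer than the other: overall accuracy can look high even when the rare class is predicted poorly.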



