Identifying non-natural language artifacts in bug reports

2021 36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW). Published: 2021-10-04. DOI: 10.1109/ASEW52652.2021.00046

Thomas Hirsch, Birgit Hofer

Citations: 4

Abstract

Bug reports are a popular target for natural language processing (NLP). However, bug reports often contain artifacts such as code snippets, log outputs, and stack traces. These artifacts not only inflate the bug reports with noise, but often constitute a real problem for the NLP approach at hand and have to be removed. In this paper, we present a machine-learning-based approach, implemented in Python, that classifies content as natural language or artifact at the line level. We show how data from GitHub issue trackers can be used for automated training set generation, and present a custom preprocessing approach for bug reports. Our model scores 0.95 ROC-AUC and 0.93 F1 against our manually annotated validation set, and classifies 10k lines in 0.72 seconds. We cross-evaluated our model against a foreign dataset and a foreign R model for the same task. The Python implementation of our model and our datasets are made publicly available under an open-source license.
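To make the idea of line-level classification concrete, the following is a minimal sketch and not the authors' model. The abstract describes a trained machine learning classifier; here a handful of hand-written regular expressions and a symbol-ratio threshold stand in for it. All names (ARTIFACT_PATTERNS, symbol_ratio, is_artifact, classify_report, f1_score), the patterns, and the 0.3 threshold are assumptions made for illustration only. The sketch also shows how an F1 score over line labels, the metric reported in the abstract, can be computed.

```python
import re

# Hypothetical hand-written cues; the paper trains a model instead of using rules.
ARTIFACT_PATTERNS = [
    re.compile(r"^\s*at [\w.$<>]+\(.*\)\s*$"),            # Java-style stack frame
    re.compile(r"^\s*[\w.$]+(Exception|Error)\b"),        # exception header line
    re.compile(r'^\s*File ".+", line \d+'),               # Python traceback frame
    re.compile(r"^\s*\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}"),  # timestamped log line
    re.compile(r"[;{}]\s*$"),                             # line ends like code
]


def symbol_ratio(line):
    """Fraction of non-space characters that are neither letters nor digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def is_artifact(line, max_symbol_ratio=0.3):
    """Label a single line as artifact (True) or natural language (False)."""
    if any(p.search(line) for p in ARTIFACT_PATTERNS):
        return True
    return symbol_ratio(line) > max_symbol_ratio


def classify_report(text):
    """Return (line, is_artifact) pairs for every non-empty line."""
    return [(line, is_artifact(line)) for line in text.splitlines() if line.strip()]


def f1_score(y_true, y_pred):
    """F1 of the artifact class, computed from boolean line labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


report = """The app crashes when I open the settings page.
java.lang.NullPointerException: null
    at com.example.Settings.load(Settings.java:42)
2021-10-04 12:00:01 ERROR failed to load config
Can you take a look?"""

for line, flag in classify_report(report):
    print("ARTIFACT" if flag else "TEXT    ", line)
```

Rules like these are brittle (a sentence ending in a semicolon would be mislabeled), which is one motivation for learning the decision from labeled lines instead, as the paper does.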
