Empirically revisiting and enhancing automatic classification of bug and non-bug issues

IF 4.6 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Frontiers of Computer Science Pub Date : 2023-12-23 DOI:10.1007/s11704-023-2771-z

Zhong Li, Minxue Pan, Yu Pei, Tian Zhang, Linzhang Wang, Xuandong Li

Citations: 0

Abstract

A large body of research has been dedicated to automated issue classification for Issue Tracking Systems (ITSs). Although existing approaches have shown promising performance, their design choices, including the textual fields, feature representation methods, and machine learning (ML) algorithms they adopt, have not been comprehensively compared and analyzed. To fill this gap, we perform the first extensive study of automated issue classification on 9 state-of-the-art issue classification approaches. Our experimental results on the widely studied dataset reveal multiple practical guidelines for automated issue classification, including: (1) Training separate models for the issue titles and descriptions and then combining these two models tends to achieve better performance for issue classification; (2) Word embedding with Long Short-Term Memory (LSTM) can better extract features from the textual fields in the issues, and hence leads to better issue classification models; (3) Certain terms in the textual fields are helpful for building more discriminating classifiers between bug and non-bug issues; (4) The performance of the issue classification model is not sensitive to the choice of ML algorithm. Based on our study outcomes, we further propose an advanced issue classification approach, DeepLabel, which achieves better performance than the existing issue classification approaches.
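Guideline (1) above, training one model on titles and another on descriptions and then combining them, can be illustrated with a small late-fusion sketch. This is not the paper's implementation and not DeepLabel: the toy issues, the `TinyNaiveBayes` class, the `combined_proba` helper, and the equal weighting of the two models are all assumptions made for illustration, and a plain Naive Bayes classifier stands in for whatever models the compared approaches actually use.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict_proba(self, text):
        log_scores = {}
        for c in self.classes:
            # Laplace smoothing: add one pseudo-count per vocabulary word.
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c])
            for w in text.lower().split():
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.counts[c][w] + 1) / total)
            log_scores[c] = score
        # Normalize the log scores into probabilities.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


def combined_proba(title_model, desc_model, title, description, title_weight=0.5):
    """Weighted average of the two models' class probabilities (hypothetical fusion rule)."""
    p_title = title_model.predict_proba(title)
    p_desc = desc_model.predict_proba(description)
    return {
        c: title_weight * p_title[c] + (1 - title_weight) * p_desc[c]
        for c in p_title
    }


# Toy training data invented for this sketch: (title, description, label).
issues = [
    ("App crashes on startup", "Null pointer exception thrown when opening the app", "bug"),
    ("Wrong total in report", "The sum is incorrect and an error is logged", "bug"),
    ("Add dark mode", "Please support a dark theme in the settings page", "non-bug"),
    ("Improve documentation", "The install guide should mention the new option", "non-bug"),
]
titles = [title for title, _, _ in issues]
descriptions = [desc for _, desc, _ in issues]
labels = [label for _, _, label in issues]

# One model per textual field, trained independently.
title_model = TinyNaiveBayes().fit(titles, labels)
desc_model = TinyNaiveBayes().fit(descriptions, labels)

probs = combined_proba(
    title_model,
    desc_model,
    "Crashes when saving",
    "An exception is thrown and the app closes",
)
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

Averaging class probabilities is only one possible way to combine the two models; the abstract does not say which combination rule the studied approaches use, so `title_weight` should be read as a tunable assumption rather than a reported setting.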

Source journal

Frontiers of Computer Science (COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, SOFTWARE ENGINEERING)

CiteScore: 8.60

Self-citation rate: 2.40%

Articles published: 799

Review time: 6-12 weeks

About the journal: Frontiers of Computer Science aims to provide a forum for the publication of peer-reviewed papers to promote rapid communication and exchange between computer scientists. The journal publishes research papers and review articles on a wide range of topics, including architecture, software, artificial intelligence, theoretical computer science, networks and communication, information systems, multimedia and graphics, information security, and interdisciplinary research. The journal especially encourages papers from newly emerging and multidisciplinary areas, papers reflecting international trends in research and development, and papers on special topics reporting progress made by Chinese computer scientists.