基于文本挖掘的问题报告分类之路在何方?

2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) Pub Date : 2017-11-09 DOI:10.1109/ESEM.2017.19

Qiang Fan, Yue Yu, Gang Yin, Tao Wang, Huaimin Wang

{"title":"基于文本挖掘的问题报告分类之路在何方?","authors":"Qiang Fan, Yue Yu, Gang Yin, Tao Wang, Huaimin Wang","doi":"10.1109/ESEM.2017.19","DOIUrl":null,"url":null,"abstract":"Currently, open source projects receive various kinds of issues daily, because of the extreme openness of Issue Tracking System (ITS) in GitHub. ITS is a labor-intensive and time-consuming task of issue categorization for project managers. However, a contributor is only required a short textual abstract to report an issue in GitHub. Thus, most traditional classification approaches based on detailed and structured data (e.g., priority, severity, software version and so on) are difficult to adopt. In this paper, issue classification approaches on a large-scale dataset, including 80 popular projects and over 252,000 issue reports collected from GitHub, were investigated. First, four traditional text-based classification methods and their performances were discussed. Semantic perplexity (i.e., an issues description confuses bug-related sentences with nonbug-related sentences) is a crucial factor that affects the classification performances based on quantitative and qualitative study. Finally, A two-stage classifier framework based on the novel metrics of semantic perplexity of issue reports was designed. Results show that our two-stage classification can significantly improve issue classification performances.","PeriodicalId":213866,"journal":{"name":"2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)","volume":"452 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":"{\"title\":\"Where Is the Road for Issue Reports Classification Based on Text Mining?\",\"authors\":\"Qiang Fan, Yue Yu, Gang Yin, Tao Wang, Huaimin Wang\",\"doi\":\"10.1109/ESEM.2017.19\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Currently, open source projects receive various kinds of issues daily, because of the extreme openness of Issue Tracking System (ITS) in GitHub. ITS is a labor-intensive and time-consuming task of issue categorization for project managers. However, a contributor is only required a short textual abstract to report an issue in GitHub. Thus, most traditional classification approaches based on detailed and structured data (e.g., priority, severity, software version and so on) are difficult to adopt. In this paper, issue classification approaches on a large-scale dataset, including 80 popular projects and over 252,000 issue reports collected from GitHub, were investigated. First, four traditional text-based classification methods and their performances were discussed. Semantic perplexity (i.e., an issues description confuses bug-related sentences with nonbug-related sentences) is a crucial factor that affects the classification performances based on quantitative and qualitative study. Finally, A two-stage classifier framework based on the novel metrics of semantic perplexity of issue reports was designed. Results show that our two-stage classification can significantly improve issue classification performances.\",\"PeriodicalId\":213866,\"journal\":{\"name\":\"2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)\",\"volume\":\"452 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"39\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ESEM.2017.19\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ESEM.2017.19","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 39

摘要

目前，开源项目每天都会收到各种各样的问题，因为GitHub的问题跟踪系统(ITS)非常开放。对于项目经理来说，ITS是一项劳动密集且耗时的问题分类任务。然而，贡献者只需要一个简短的文本摘要来报告GitHub中的问题。因此，大多数基于详细和结构化数据(例如，优先级、严重性、软件版本等)的传统分类方法难以采用。本文研究了一个大型数据集上的问题分类方法，该数据集包括从GitHub收集的80个热门项目和超过25.2万个问题报告。首先，讨论了四种传统的基于文本的分类方法及其性能。语义困惑(即问题描述将bug相关的句子与非bug相关的句子混淆)是影响基于定量和定性研究的分类性能的关键因素。最后，设计了一个基于问题报告语义困惑度度量的两阶段分类器框架。结果表明，两阶段分类能显著提高问题分类性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Where Is the Road for Issue Reports Classification Based on Text Mining?

Currently, open source projects receive various kinds of issues daily, because of the extreme openness of Issue Tracking System (ITS) in GitHub. ITS is a labor-intensive and time-consuming task of issue categorization for project managers. However, a contributor is only required a short textual abstract to report an issue in GitHub. Thus, most traditional classification approaches based on detailed and structured data (e.g., priority, severity, software version and so on) are difficult to adopt. In this paper, issue classification approaches on a large-scale dataset, including 80 popular projects and over 252,000 issue reports collected from GitHub, were investigated. First, four traditional text-based classification methods and their performances were discussed. Semantic perplexity (i.e., an issues description confuses bug-related sentences with nonbug-related sentences) is a crucial factor that affects the classification performances based on quantitative and qualitative study. Finally, A two-stage classifier framework based on the novel metrics of semantic perplexity of issue reports was designed. Results show that our two-stage classification can significantly improve issue classification performances.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)

自引率

0.00%

发文量