基于文本挖掘的问题报告分类之路在何方?

Qiang Fan, Yue Yu, Gang Yin, Tao Wang, Huaimin Wang
{"title":"基于文本挖掘的问题报告分类之路在何方?","authors":"Qiang Fan, Yue Yu, Gang Yin, Tao Wang, Huaimin Wang","doi":"10.1109/ESEM.2017.19","DOIUrl":null,"url":null,"abstract":"Currently, open source projects receive various kinds of issues daily, because of the extreme openness of Issue Tracking System (ITS) in GitHub. ITS is a labor-intensive and time-consuming task of issue categorization for project managers. However, a contributor is only required a short textual abstract to report an issue in GitHub. Thus, most traditional classification approaches based on detailed and structured data (e.g., priority, severity, software version and so on) are difficult to adopt. In this paper, issue classification approaches on a large-scale dataset, including 80 popular projects and over 252,000 issue reports collected from GitHub, were investigated. First, four traditional text-based classification methods and their performances were discussed. Semantic perplexity (i.e., an issues description confuses bug-related sentences with nonbug-related sentences) is a crucial factor that affects the classification performances based on quantitative and qualitative study. Finally, A two-stage classifier framework based on the novel metrics of semantic perplexity of issue reports was designed. Results show that our two-stage classification can significantly improve issue classification performances.","PeriodicalId":213866,"journal":{"name":"2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)","volume":"452 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":"{\"title\":\"Where Is the Road for Issue Reports Classification Based on Text Mining?\",\"authors\":\"Qiang Fan, Yue Yu, Gang Yin, Tao Wang, Huaimin Wang\",\"doi\":\"10.1109/ESEM.2017.19\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Currently, open source projects receive various kinds of issues daily, because of the extreme openness of Issue Tracking System (ITS) in GitHub. ITS is a labor-intensive and time-consuming task of issue categorization for project managers. However, a contributor is only required a short textual abstract to report an issue in GitHub. Thus, most traditional classification approaches based on detailed and structured data (e.g., priority, severity, software version and so on) are difficult to adopt. In this paper, issue classification approaches on a large-scale dataset, including 80 popular projects and over 252,000 issue reports collected from GitHub, were investigated. First, four traditional text-based classification methods and their performances were discussed. Semantic perplexity (i.e., an issues description confuses bug-related sentences with nonbug-related sentences) is a crucial factor that affects the classification performances based on quantitative and qualitative study. Finally, A two-stage classifier framework based on the novel metrics of semantic perplexity of issue reports was designed. Results show that our two-stage classification can significantly improve issue classification performances.\",\"PeriodicalId\":213866,\"journal\":{\"name\":\"2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)\",\"volume\":\"452 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"39\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ESEM.2017.19\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ESEM.2017.19","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 39

摘要

目前,开源项目每天都会收到各种各样的问题,因为GitHub的问题跟踪系统(ITS)非常开放。对于项目经理来说,ITS是一项劳动密集且耗时的问题分类任务。然而,贡献者只需要一个简短的文本摘要来报告GitHub中的问题。因此,大多数基于详细和结构化数据(例如,优先级、严重性、软件版本等)的传统分类方法难以采用。本文研究了一个大型数据集上的问题分类方法,该数据集包括从GitHub收集的80个热门项目和超过25.2万个问题报告。首先,讨论了四种传统的基于文本的分类方法及其性能。语义困惑(即问题描述将bug相关的句子与非bug相关的句子混淆)是影响基于定量和定性研究的分类性能的关键因素。最后,设计了一个基于问题报告语义困惑度度量的两阶段分类器框架。结果表明,两阶段分类能显著提高问题分类性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Where Is the Road for Issue Reports Classification Based on Text Mining?
Currently, open source projects receive various kinds of issues daily, because of the extreme openness of Issue Tracking System (ITS) in GitHub. ITS is a labor-intensive and time-consuming task of issue categorization for project managers. However, a contributor is only required a short textual abstract to report an issue in GitHub. Thus, most traditional classification approaches based on detailed and structured data (e.g., priority, severity, software version and so on) are difficult to adopt. In this paper, issue classification approaches on a large-scale dataset, including 80 popular projects and over 252,000 issue reports collected from GitHub, were investigated. First, four traditional text-based classification methods and their performances were discussed. Semantic perplexity (i.e., an issues description confuses bug-related sentences with nonbug-related sentences) is a crucial factor that affects the classification performances based on quantitative and qualitative study. Finally, A two-stage classifier framework based on the novel metrics of semantic perplexity of issue reports was designed. Results show that our two-stage classification can significantly improve issue classification performances.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信