Towards more accurate content categorization of API discussions

2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC) Pub Date : 2014-06-02 DOI:10.1145/2597008.2597142

Bo Zhou, Xin Xia, D. Lo, Cong Tian, Xinyu Wang


Citations: 19

Abstract

Software developers often discuss the usage of various APIs in online forums. Automatically assigning pre-defined semantic categories to API discussions in these forums could help manage the forum data and help developers search for useful information. We refer to this process as content categorization of API discussions. To solve this problem, Hou and Mo proposed using multinomial naive Bayes, an effective classification algorithm. In this paper, we propose a Cache-bAsed compoSitE algorithm, abbreviated as CASE, to automatically categorize API discussions. Because the content of an API discussion contains both textual description and source code, CASE has 3 components that analyze an API discussion in 3 different ways: text, code, and original. The text component considers only the textual description; the code component considers only the source code; the original component considers the original content of the discussion, which might include both textual description and source code. Next, for each component, since different terms (i.e., words) have different affinities to different categories, CASE caches the subset of terms with the highest affinity scores for each category and builds a classifier based on the cached terms. Finally, CASE combines the 3 classifiers to achieve a better accuracy score. We evaluate CASE on 3 datasets that contain a total of 1,035 API discussions. The experimental results show that CASE achieves accuracy scores of 0.69, 0.77, and 0.96 on the 3 datasets, outperforming the state-of-the-art method proposed by Hou and Mo by 11%, 10%, and 2%, respectively.
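The pipeline described in the abstract (three views of each discussion, a per-category cache of high-affinity terms, and a combination of the three resulting classifiers) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the affinity score, the classifier built on the cached terms, or the combination rule, so the sketch makes the following assumptions: source code is marked with `<code>...</code>` tags; a term's affinity to a category is the fraction of training discussions containing the term that belong to that category; each component scores a category by summing the affinities of cached terms present in the discussion; and the three components are combined by adding their scores. All function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumption: code snippets are wrapped in <code>...</code> tags.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
TOKEN_RE = re.compile(r"[A-Za-z_]\w*")


def views(post):
    """Split one discussion into the three views: text, code, original."""
    code = " ".join(CODE_RE.findall(post))
    text = CODE_RE.sub(" ", post)
    original = post.replace("<code>", " ").replace("</code>", " ")
    return {"text": text, "code": code, "original": original}


def tokenize(s):
    """Lowercased identifier-like tokens."""
    return [t.lower() for t in TOKEN_RE.findall(s)]


def build_cache(docs, labels, cache_size):
    """Cache, per category, the cache_size terms with the highest affinity.

    Assumed affinity: (# docs of the category containing the term)
                      / (# docs containing the term).
    """
    df = Counter()                 # document frequency per term
    df_cat = defaultdict(Counter)  # document frequency per (category, term)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_cat[label][term] += 1
    cache = {}
    for cat, counts in df_cat.items():
        affinity = {t: counts[t] / df[t] for t in counts}
        top = sorted(affinity, key=lambda t: (-affinity[t], t))[:cache_size]
        cache[cat] = {t: affinity[t] for t in top}
    return cache


def score(cache, tokens):
    """Score each category by the summed affinity of cached terms present."""
    present = set(tokens)
    return {
        cat: sum(a for t, a in terms.items() if t in present)
        for cat, terms in cache.items()
    }


def train(posts, labels, cache_size=20):
    """Build one term cache per component (text, code, original)."""
    model = {}
    for name in ("text", "code", "original"):
        docs = [tokenize(views(p)[name]) for p in posts]
        model[name] = build_cache(docs, labels, cache_size)
    return model


def predict(model, post):
    """Combine the three components by adding their category scores."""
    v = views(post)
    total = Counter()
    for name, cache in model.items():
        for cat, s in score(cache, tokenize(v[name])).items():
            total[cat] += s
    # Ties are broken by category name so the result is deterministic.
    return max(sorted(total), key=lambda c: total[c])
```

To use the sketch, call `train(posts, labels)` on a list of labeled discussions and then `predict(model, new_post)`. The cache size controls how many terms each category keeps; choosing it, and weighting the three components rather than simply adding them, are the points where a real system would need tuning.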

