Characterizing Leveraged Stack Overflow Posts

2019 19th International Working Conference on Source Code Analysis and Manipulation (SCAM). Pub Date: 2019-09-01. DOI: 10.1109/SCAM.2019.00025

Salvatore Geremia, G. Bavota, R. Oliveto, Michele Lanza, M. D. Penta


Citations: 2

Abstract

Stack Overflow is the most popular question-and-answer website on computer programming, with more than 2.5M users, 16M questions, and a new answer posted, on average, every five seconds. This wide availability of data has led researchers to develop techniques to mine Stack Overflow posts, with the aim of finding and recommending posts that contain information useful to developers. However, and not surprisingly, not every Stack Overflow post is useful from a developer's perspective. We empirically investigate the characteristics of "useful" Stack Overflow posts. The underlying assumption of our study is that posts that developers used in the past (that is, referenced in source code) are likely to be useful. We refer to these posts as leveraged posts. We study the characteristics of leveraged posts as opposed to non-leveraged ones, focusing on community aspects (e.g., the reputation of the user who authored the post), the quality of the included code snippets (e.g., complexity), and the quality of the post's textual content (e.g., readability). We then use these features to build a prediction model that automatically identifies posts likely to be leveraged by developers. The results indicate that post metadata (e.g., the number of comments received by the answer) is particularly useful for predicting whether a post has been leveraged, whereas code readability appears to be less useful. A classifier identifies leveraged posts with a precision of 65% and a recall of 49%, and non-leveraged ones with a precision of 95% and a recall of 97%. This paves the way toward the automatic identification of "high-quality content" on Stack Overflow.
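To put the two reported precision/recall pairs on a single scale, one can combine each pair into an F1 score (the harmonic mean of precision and recall). The abstract does not report F1; the short sketch below only derives it arithmetically from the four figures quoted above, and the names `f1_score`, `reported`, and `f1_by_class` are illustrative, not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, as quoted in the abstract.
reported = {
    "leveraged": (0.65, 0.49),
    "non-leveraged": (0.95, 0.97),
}

# F1 is derived here for illustration; it is not a figure from the paper.
f1_by_class = {label: f1_score(p, r) for label, (p, r) in reported.items()}

for label, f1 in f1_by_class.items():
    print(f"{label}: F1 = {f1:.2f}")
# leveraged: F1 = 0.56
# non-leveraged: F1 = 0.96
```

Under this derived measure, the gap between the two classes is wide: the lower recall on leveraged posts (49%) pulls their F1 well below that of the non-leveraged class.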


