Characterizing Leveraged Stack Overflow Posts

Salvatore Geremia, G. Bavota, R. Oliveto, Michele Lanza, M. D. Penta
{"title":"Characterizing Leveraged Stack Overflow Posts","authors":"Salvatore Geremia, G. Bavota, R. Oliveto, Michele Lanza, M. D. Penta","doi":"10.1109/SCAM.2019.00025","DOIUrl":null,"url":null,"abstract":"Stack Overflow is the most popular question and answer website on computer programming with more than 2.5M users, 16M questions, and a new answer posted, on average, every five seconds. This wide availability of data led researchers to develop techniques to mine Stack Overflow posts. The aim is to find and recommend posts with information useful to developers. However, and not surprisingly, not every Stack Overflow post is useful from a developer's perspective. We empirically investigate what the characteristics of \"useful\" Stack Overflow posts are. The underlying assumption of our study is that posts that were used (referenced in the source code) in the past by developers are likely to be useful. We refer to these posts as leveraged posts. We study the characteristics of leveraged posts as opposed to the non-leveraged ones, focusing on community aspects (e.g., the reputation of the user who authored the post), the quality of the included code snippets (e.g., complexity), and the quality of the post's textual content (e.g., readability). Then, we use these features to build a prediction model to automatically identify posts that are likely to be leveraged by developers. Results of the study indicate that post meta-data (e.g., the number of comments received by the answer) is particularly useful to predict whether it has been leveraged or not, whereas code readability appears to be less useful. A classifier can classify leveraged posts with a precision of 65% and recall of 49% and non-leveraged ones with a precision of 95% and recall of 97%. This opens the road towards an automatic identification of \"high-quality content\" in Stack Overflow.","PeriodicalId":431316,"journal":{"name":"2019 19th International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th International Working Conference on Source Code Analysis and Manipulation (SCAM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SCAM.2019.00025","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Stack Overflow is the most popular question and answer website on computer programming with more than 2.5M users, 16M questions, and a new answer posted, on average, every five seconds. This wide availability of data led researchers to develop techniques to mine Stack Overflow posts. The aim is to find and recommend posts with information useful to developers. However, and not surprisingly, not every Stack Overflow post is useful from a developer's perspective. We empirically investigate what the characteristics of "useful" Stack Overflow posts are. The underlying assumption of our study is that posts that were used (referenced in the source code) in the past by developers are likely to be useful. We refer to these posts as leveraged posts. We study the characteristics of leveraged posts as opposed to the non-leveraged ones, focusing on community aspects (e.g., the reputation of the user who authored the post), the quality of the included code snippets (e.g., complexity), and the quality of the post's textual content (e.g., readability). Then, we use these features to build a prediction model to automatically identify posts that are likely to be leveraged by developers. Results of the study indicate that post meta-data (e.g., the number of comments received by the answer) is particularly useful to predict whether it has been leveraged or not, whereas code readability appears to be less useful. A classifier can classify leveraged posts with a precision of 65% and recall of 49% and non-leveraged ones with a precision of 95% and recall of 97%. This opens the road towards an automatic identification of "high-quality content" in Stack Overflow.
杠杆堆栈溢出职位的特征
Stack Overflow是最受欢迎的计算机编程问答网站,拥有超过250万用户,1600万个问题,平均每五秒钟发布一个新答案。数据的广泛可用性促使研究人员开发了挖掘Stack Overflow帖子的技术。其目的是寻找并推荐对开发人员有用的帖子。然而,从开发人员的角度来看,并不是每一篇Stack Overflow文章都有用,这并不奇怪。我们实证地调查了“有用的”Stack Overflow帖子的特征是什么。我们研究的基本假设是开发人员过去使用过(在源代码中引用过)的帖子可能是有用的。我们把这些帖子称为杠杆帖子。我们研究了与非杠杆帖子相反的杠杆帖子的特征,重点关注社区方面(例如,撰写帖子的用户的声誉),所包含代码片段的质量(例如,复杂性)和帖子文本内容的质量(例如,可读性)。然后,我们使用这些特性来构建一个预测模型,以自动识别可能被开发人员利用的帖子。研究结果表明,后元数据(例如,回答收到的评论数量)对于预测是否利用它特别有用,而代码可读性似乎不太有用。分类器对杠杆帖子的分类精度为65%,召回率为49%,对非杠杆帖子的分类精度为95%,召回率为97%。这为Stack Overflow中“高质量内容”的自动识别开辟了道路。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信