SOCluster - Towards Answering Unanswered Questions on Stack Overflow via Answered Questions

Proceedings of the 16th Innovations in Software Engineering Conference. Pub Date: 2023-02-23. DOI: 10.1145/3578527.3578544

Abhishek Kumar, Deep Ghadiyali, S. Chimalakonda, Akhila Sri Manasa Venigalla


Citations: 1

Abstract

Stack Overflow (SO) hosts a huge dataset of questions and answers driven by interactions between users. However, the number of unanswered questions is continuously rising, a trend also observed on similar community Question & Answering (Q&A) platforms such as Yahoo and Quora. To address this issue, these communities have explored clustering mechanisms that answer unanswered questions using answered questions in the same cluster, which could also improve the response time for new questions. In this context, we propose SOCluster, an approach and a tool to cluster SO questions using graph-based clustering. We selected four datasets of 10k, 20k, 30k, and 40k SO questions that contain no code snippets or images, and performed clustering on them. We conducted a preliminary evaluation of our tool by analyzing the resulting clusters using three commonly used metrics: the Silhouette coefficient, the Calinski-Harabasz Index, and the Davies-Bouldin Index. We performed clustering for 8 different threshold similarity values and analyzed the intriguing trends reflected by the output clusters through the three evaluation metrics. At a threshold similarity of 90%, the three evaluation metrics show the best improvement on all four datasets. We further manually assessed the content of the clusters to confirm the similarity of their elements. This revealed that clusters correspond to topics such as mouse over effect, speed optimisation, how to perform ‘some’ action in JavaScript, and so on. The source code and tool are available for download on GitHub at: https://github.com/rishalab/SOCluster, and the demo can be found here: https://youtu.be/Ewm-M_rg_x8.
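
The idea of threshold-based graph clustering can be illustrated with a small sketch. This is not the SOCluster implementation: the abstract does not state which similarity measure or graph algorithm the tool uses, so the sketch assumes bag-of-words cosine similarity and treats each connected component of the resulting graph as a cluster. All function names and the sample questions are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a question title into a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_questions(questions, threshold):
    """Add an edge between two questions when similarity >= threshold,
    then return the connected components as lists of question indices."""
    vectors = [vectorize(q) for q in questions]
    parent = list(range(len(questions)))  # union-find forest over questions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two components

    clusters = {}
    for i in range(len(questions)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical sample data, assumed only for illustration.
example_questions = [
    "How to add a mouse over effect in CSS",
    "How to add mouse over effect in CSS",
    "Speed optimisation for a large SQL query",
    "Speed optimisation for large SQL query",
    "What is a monad",
]

print(cluster_questions(example_questions, threshold=0.9))
# [[0, 1], [2, 3], [4]]
```

Under this sketch, an unanswered question that lands in the same cluster as an answered one could be pointed to that answer. Lowering the threshold merges more questions into fewer, looser clusters, which is the kind of trade-off the three evaluation metrics are meant to expose. For real datasets, scikit-learn provides `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score` in `sklearn.metrics`.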


