Investigating the Effectiveness of Clustering for Story Point Estimation

2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). Pub Date: 2022-03-01. DOI: 10.1109/saner53432.2022.00101

Vali Tawosi, A. Al-Subaihin, Federica Sarro

Citations: 5

Abstract

Automated techniques for estimating Story Points (SP) for user stories in agile software development came to the fore a decade ago, yet the accuracy of state-of-the-art estimation techniques still has room for improvement. In this paper, we present a new approach to SP estimation that analyses the textual features of software issues using latent Dirichlet allocation (LDA) and clustering. We first use LDA to represent issue reports in a new space of generated topics. We then use hierarchical clustering to group issues into clusters based on their topic similarity. Next, we build an estimation model from the issues in each cluster. Finally, for a new incoming issue, we find the closest cluster and use that cluster's model to estimate its SP. We evaluate our approach on a dataset of 26 open-source projects with a total of 31,960 issues and compare it against both baselines and state-of-the-art SP estimation techniques. The results show that the estimation performance of our approach is as good as the state of the art. However, none of these approaches is statistically significantly better than more naive estimators in all cases, so their additional complexity is not justified. We therefore encourage future work to develop alternative strategies for story point estimation. The experimental data and scripts used in this work are publicly available to allow for replication and extension.
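The pipeline described in the abstract (topic vectors, hierarchical clustering, one model per cluster, nearest-cluster lookup) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the LDA step has already produced a topic-proportion vector for each issue, uses average-linkage clustering with Euclidean distance, and stands in a simple per-cluster median for the estimation model. The function names, the toy data, and the choice of two clusters are all hypothetical.

```python
from math import dist
from statistics import median


def agglomerate(vectors, k):
    """Average-linkage agglomerative clustering down to k clusters.

    Returns a list of clusters, each a list of indices into `vectors`.
    Linkage and distance metric are assumptions, not taken from the paper.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        total = sum(dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def centroid(vectors, members):
    """Component-wise mean of the topic vectors in one cluster."""
    dims = len(vectors[0])
    return [sum(vectors[i][d] for i in members) / len(members) for d in range(dims)]


def fit(topic_vectors, story_points, k):
    """Build one (centroid, estimate) pair per cluster.

    The per-cluster "model" here is just the median story point value,
    a hypothetical stand-in for whatever estimator is fitted per cluster.
    """
    clusters = agglomerate(topic_vectors, k)
    return [
        (centroid(topic_vectors, m), median(story_points[i] for i in m))
        for m in clusters
    ]


def estimate(model, topic_vector):
    """Estimate SP for a new issue from its closest cluster."""
    _, sp = min(model, key=lambda pair: dist(pair[0], topic_vector))
    return sp


# Toy data: two-topic proportions per issue (assumed LDA output) and their SPs.
topic_vectors = [
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
]
story_points = [1, 2, 2, 8, 5, 8]

model = fit(topic_vectors, story_points, k=2)
print(estimate(model, [0.88, 0.12]))
print(estimate(model, [0.10, 0.90]))
```

A median was chosen for the stand-in because it is itself one of the naive estimators that the abstract reports the more complex approaches fail to beat consistently; any regression model trained on a cluster's issues could take its place in `fit`.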
