Duplicate Question Detection in Stack Overflow: A Reproducibility Study

2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 572-581. Pub Date: 2018-02-21. DOI: 10.1109/SANER.2018.8330262

Rodrigo F. Silva, K. V. R. Paixão, M. Maia


Citations: 30

Abstract



Stack Overflow has become a fundamental part of the developer's toolset. Its growing influence has been accompanied by an effort from the Stack Overflow community to maintain the quality of its content. One problem that jeopardizes that quality is the continuous growth of duplicated questions. To address it, prior work has focused on automatically detecting duplicated questions. Two important solutions are DupPredictor and Dupe. Despite reporting significant results, neither work makes its implementation publicly available, which hinders subsequent work in the scientific literature that relies on them. We conducted an empirical study as a reproduction of DupPredictor and Dupe. Our results, which were not robust across different sets of tools and data sets, show that the barriers to reproducing these approaches are high. Furthermore, when the reproductions are applied to more recent data, we observe that the performance of both decays in terms of recall-rate over time, as the number of questions increases. Our findings suggest that subsequent work on detecting duplicated questions in Question and Answer communities requires more investigation to support its findings.
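The abstract reports performance in terms of recall-rate but does not define it here. As a minimal sketch, assuming the common recall-rate@k formulation (the fraction of duplicate questions whose original question appears among the top-k ranked candidates), the metric could be computed roughly as follows. The function name, the data layout, and the question IDs are hypothetical illustrations, not taken from the paper or from DupPredictor or Dupe.

```python
from typing import Dict, List


def recall_rate_at_k(
    rankings: Dict[str, List[str]],
    masters: Dict[str, str],
    k: int,
) -> float:
    """Fraction of duplicate questions whose master is in the top-k candidates.

    rankings: duplicate question ID -> candidate IDs, best match first
              (hypothetical layout, assumed for this sketch)
    masters:  duplicate question ID -> ID of the question it duplicates
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not masters:
        return 0.0
    hits = sum(
        1
        for dup_id, master_id in masters.items()
        if master_id in rankings.get(dup_id, [])[:k]
    )
    return hits / len(masters)


# Toy data (invented): three duplicates, each with a ranked candidate list.
rankings = {
    "q10": ["q1", "q7", "q3"],
    "q11": ["q4", "q2", "q9"],
    "q12": ["q8", "q6", "q5"],
}
masters = {"q10": "q1", "q11": "q2", "q12": "q99"}

print(recall_rate_at_k(rankings, masters, k=1))
print(recall_rate_at_k(rankings, masters, k=2))
```

Under this assumed definition, a growing pool of candidate questions makes it harder for the true original to stay within the top k, which is one plausible reading of the decay over time that the abstract describes.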

