PANDORA: Continuous Mining Software Repository and Dataset Generation

2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) Pub Date : 2022-03-01 DOI:10.1109/saner53432.2022.00041

H. Nguyen, Francesco Lomio, Fabiano Pecorelli, Valentina Lenarduzzi


Citations: 6

Abstract

Mining software repository studies analyze large amounts of data gathered from different sources. Several tools have been developed for collecting and aggregating data from repositories, but they do not make it easy for researchers to develop new extractors or to integrate data collected from other platforms, in particular platforms that periodically delete their data. Moreover, mining software repository studies are commonly performed on old versions of software projects, and their results are not commonly updated periodically. Because the studies are not continuously updated, practitioners often do not trust the results of empirical studies. To overcome these issues, we present Pandora, a tool that automatically and continuously mines data from different existing tools and online platforms and enables researchers to run mining software repository studies and continuously update their results. To evaluate the applicability of our tool, we analyzed 365 projects (developed in different languages), continuously collecting data from December 2020 to May 2021, and ran an example study investigating the build stability of SonarQube rules.

Link to dashboard: http://sqa.rd.tuni.fi/superset/dashboard/1

Link to source code: https://github.com/clowee/PANDORA

Link to 5-minute video: https://youtu.be/CuVO9YGJ59I



