Mining and Creating a Software Repositories Dataset

2020 7th NAFOSTED Conference on Information and Computer Science (NICS). Pub Date: 2020-11-26. DOI: 10.1109/NICS51282.2020.9335894

Thai-Bao Do, Huu-Nghia H. Nguyen, Bao-Linh L. Mai, Vu Nguyen



Abstract

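Mining software repositories to extract meaningful information has become an important topic in software engineering. This paper presents a study that mines a very large dataset of over three million software repositories across many version control systems and creates derived data for future studies. Through this study, we propose a method for detecting forks and duplicates in repositories. We also present a preliminary investigation of possible correlations between forking patterns, software health and risks, and success indicators.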


