Three trillion lines: infrastructure for mining GitHub in the classroom
Authors: Toni Mattis, Patrick Rein, R. Hirschfeld
Venue: Companion Proceedings of the 4th International Conference on Art, Science, and Engineering of Programming
Published: 2020-03-23
DOI: 10.1145/3397537.3397551 (https://doi.org/10.1145/3397537.3397551)
Citations: 2
Abstract
The increasing interest in collaborative software development on platforms like GitHub has led to the availability of large amounts of data about development activities. The GHTorrent project has recorded a significant proportion of GitHub’s public event stream and hosts what is currently the largest public dataset of meta-data about open-source development. We describe our infrastructure, which makes this data locally available to researchers and students, give examples of research activities carried out on it, and report what we learned from building the system. We identify a need for domain-specific tools, especially databases, that can deal with large-scale code repositories and their associated meta-data, and we outline open challenges in using such tools more effectively in research and machine learning settings.