{"title":"三万亿行:在教室里挖掘GitHub的基础设施","authors":"Toni Mattis, Patrick Rein, R. Hirschfeld","doi":"10.1145/3397537.3397551","DOIUrl":null,"url":null,"abstract":"The increasing interest in collaborative software development on platforms like GitHub has led to the availability of large amounts of data about development activities. The GHTorrent project has recorded a significant proportion of GitHub’s public event stream and hosts the currently largest public dataset of meta-data about open-source development. We describe our infrastructure that makes this data locally available to researchers and students, examples for research activities carried out on this infrastructure, and what we learned from building the system. We identify a need for domain-specific tools, especially databases, that can deal with large-scale code repositories and associated meta-data and outline open challenges to use them more effectively for research and machine learning settings.","PeriodicalId":373173,"journal":{"name":"Companion Proceedings of the 4th International Conference on Art, Science, and Engineering of Programming","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Three trillion lines: infrastructure for mining GitHub in the classroom\",\"authors\":\"Toni Mattis, Patrick Rein, R. Hirschfeld\",\"doi\":\"10.1145/3397537.3397551\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The increasing interest in collaborative software development on platforms like GitHub has led to the availability of large amounts of data about development activities. The GHTorrent project has recorded a significant proportion of GitHub’s public event stream and hosts the currently largest public dataset of meta-data about open-source development. We describe our infrastructure that makes this data locally available to researchers and students, examples for research activities carried out on this infrastructure, and what we learned from building the system. We identify a need for domain-specific tools, especially databases, that can deal with large-scale code repositories and associated meta-data and outline open challenges to use them more effectively for research and machine learning settings.\",\"PeriodicalId\":373173,\"journal\":{\"name\":\"Companion Proceedings of the 4th International Conference on Art, Science, and Engineering of Programming\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-03-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Companion Proceedings of the 4th International Conference on Art, Science, and Engineering of Programming\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3397537.3397551\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Proceedings of the 4th International Conference on Art, Science, and Engineering of Programming","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3397537.3397551","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Three trillion lines: infrastructure for mining GitHub in the classroom
The increasing interest in collaborative software development on platforms like GitHub has led to the availability of large amounts of data about development activities. The GHTorrent project has recorded a significant proportion of GitHub’s public event stream and hosts the currently largest public dataset of meta-data about open-source development. We describe our infrastructure that makes this data locally available to researchers and students, examples for research activities carried out on this infrastructure, and what we learned from building the system. We identify a need for domain-specific tools, especially databases, that can deal with large-scale code repositories and associated meta-data and outline open challenges to use them more effectively for research and machine learning settings.