{"title":"Exploring the under-explored terrain of non-open source data for software engineering through the lens of federated learning","authors":"Shriram Shanbhag, S. Chimalakonda","doi":"10.1145/3540250.3560883","DOIUrl":null,"url":null,"abstract":"The availability of open source projects on platforms like GitHub has led to the wide use of the artifacts from these projects in software engineering research. These publicly available artifacts have been used to train artificial intelligence models used in various empirical studies and the development of tools. However, these advancements have missed out on the artifacts from non-open source projects due to the unavailability of the data. A major cause for the unavailability of the data from non-open source repositories is the issue concerning data privacy. In this paper, we propose using federated learning to address the issue of data privacy to enable the use of data from non-open source to train AI models used in software engineering research. We believe that this can potentially enable industries to collaborate with software engineering researchers without concerns about privacy. We present the preliminary evaluation of the use of federated learning to train a classifier to label bug-fix commits from an existing study to demonstrate its feasibility. The federated approach achieved an F1 score of 0.83 compared to a score of 0.84 using the centralized approach. We also present our vision of the potential implications of the use of federated learning in software engineering research.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"20 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"软件产业与工程","FirstCategoryId":"1089","ListUrlMain":"https://doi.org/10.1145/3540250.3560883","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
The availability of open source projects on platforms like GitHub has led to the wide use of the artifacts from these projects in software engineering research. These publicly available artifacts have been used to train artificial intelligence models used in various empirical studies and the development of tools. However, these advancements have missed out on the artifacts from non-open source projects due to the unavailability of the data. A major cause for the unavailability of the data from non-open source repositories is the issue concerning data privacy. In this paper, we propose using federated learning to address the issue of data privacy to enable the use of data from non-open source to train AI models used in software engineering research. We believe that this can potentially enable industries to collaborate with software engineering researchers without concerns about privacy. We present the preliminary evaluation of the use of federated learning to train a classifier to label bug-fix commits from an existing study to demonstrate its feasibility. The federated approach achieved an F1 score of 0.83 compared to a score of 0.84 using the centralized approach. We also present our vision of the potential implications of the use of federated learning in software engineering research.