How are software datasets constructed in Empirical Software Engineering studies? A systematic mapping study
Authors: J. A. Carruthers, J. A. D. Pace, E. Irrazábal
Venue: 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)
Published: 2022-08-01
DOI: https://doi.org/10.1109/SEAA56994.2022.00075
Context: Software projects are common inputs in Empirical Software Engineering (ESE) studies, although they are often selected with ad-hoc strategies that reduce the generalizability of the results. An alternative is to use available datasets of software projects, which should be current and follow explicit rules to ensure their validity over time.

Goal: In this context, it is important to assess the general state of software datasets in terms of purpose, last update, project characterization, source code metrics, and tools to extract source-code-related artifacts.

Method: We conducted a systematic mapping study retrieving software datasets used in ESE studies published from January 2013 to December 2021.

Results: We selected 74 datasets created mainly for software defect, software estimation, and software maintainability studies. The majority of these datasets (64%) explicitly stated the characteristics used to select the projects, and the most common programming languages were Java and C.

Conclusions: Our study identified scarce efforts to keep datasets updated over time, and it provides recommendations to support their construction and consumption for ESE studies.