Extracting Insights from Big Source Code Repositories with Automatic Clustering of Projects by File Names and Types
Yu. E. Yakhno, Selin Metin
2023 International Conference on Smart Applications, Communications and Networking (SmartNets), published 2023-07-25
DOI: 10.1109/SmartNets58706.2023.10215598 (https://doi.org/10.1109/SmartNets58706.2023.10215598)
Abstract
Delivering a software project requires a set of related activities. The output of these activities forms a collection of unstructured data, such as specifications, requirements, manuals, source code, and packaging files, which are stored in configuration management systems. Software repositories are infrastructures that support project management activities and can be composed of several systems, including code change management, bug tracking, code review, build systems, release binaries, wikis, and forums. This large and varied data collection provides opportunities for text mining, the results of which can serve software delivery or business intelligence goals. The proposed approach uses machine learning methods to inspect large software repositories and classifies the results to offer insights to project managers as an aid in making strategic business decisions. In the present work, every software project is described by a representation (a vector or a set of vectors) constructed from the names and types of the files in the project's content. These representations are used to find clusters of similar projects in big code repositories and to highlight specific properties of the groups found: their most frequently used file names and types.
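The abstract does not say how the project vectors are built or which clustering algorithm is used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a plain bag-of-counts representation over file names and file types (extensions), cosine similarity, and a greedy threshold-based grouping as a stand-in for a real clustering algorithm. The function names, the `threshold` value, and the toy project listings are all hypothetical.

```python
from collections import Counter
from math import sqrt
from pathlib import PurePosixPath


def project_vector(paths):
    """Count file types (extensions) and file names (stems) in one project."""
    features = Counter()
    for p in paths:
        path = PurePosixPath(p)
        features["type:" + (path.suffix.lower() or "<none>")] += 1
        features["name:" + path.stem.lower()] += 1
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def greedy_clusters(vectors, threshold=0.5):
    """Assumed stand-in for clustering: a project joins the first cluster
    whose first member is at least `threshold` similar, else starts a new one."""
    clusters = []
    for name, vec in vectors.items():
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def top_features(members, vectors, n=3):
    """Most frequent file names and types across the projects of one cluster."""
    total = Counter()
    for name in members:
        total.update(vectors[name])
    return total.most_common(n)


# Hypothetical toy repository: project name -> list of file paths.
projects = {
    "web-a": ["src/index.js", "src/app.js", "package.json", "README.md"],
    "web-b": ["lib/index.js", "lib/util.js", "package.json", "README.md"],
    "py-a": ["setup.py", "pkg/__init__.py", "pkg/main.py", "README.md"],
    "py-b": ["setup.py", "core/__init__.py", "core/cli.py", "README.md"],
}

vectors = {name: project_vector(paths) for name, paths in projects.items()}
clusters = greedy_clusters(vectors)

for members in clusters:
    print(members, top_features(members, vectors))
```

On this toy input the two JavaScript projects and the two Python projects end up in separate groups, and the per-cluster feature counts play the role of the "most used names and types" summary mentioned in the abstract. A real repository would likely need weighting (for example, down-weighting ubiquitous files such as README) and a proper clustering algorithm.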