Extracting Insights from Big Source Code Repositories with Automatic Clustering of Projects by File Names and Types

Yu. E. Yakhno, Selin Metin
Published in: 2023 International Conference on Smart Applications, Communications and Networking (SmartNets)
Publication date: 2023-07-25
DOI: 10.1109/SmartNets58706.2023.10215598 (https://doi.org/10.1109/SmartNets58706.2023.10215598)

Abstract

Delivering a software project involves a set of related activities. The output of these activities forms a collection of unstructured data, such as specifications, requirements, manuals, source code, and packaging files, which is stored in configuration management systems. Software repositories are infrastructures that support project management activities and can be composed of several systems, including code change management, bug tracking, code review, build systems, release binaries, wikis, and forums. This large and varied data collection provides opportunities for text mining, whose results can in turn serve software delivery or business intelligence goals. The proposed approach uses machine learning methods to inspect large software repositories and classifies the results to offer insights to project managers as an aid in making strategic business decisions. In the present work, every software project is described by a representation (a vector or a set of vectors) constructed from the names and types of the files in the project. These representations are used to find clusters of similar projects in big code repositories and to highlight specific properties of the groups found: the most frequently used file names and file types.
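
The abstract does not say which representation or clustering algorithm the authors use, so the following is only a minimal sketch of the general idea under stated assumptions: each project becomes a bag-of-features count vector over file types (extensions) and file-name tokens, projects are grouped by cosine similarity with a simple single-link threshold rule, and the most frequent features of each group are reported. The function names, the 0.5 threshold, and the toy project listings are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt
from pathlib import PurePosixPath
import re


def project_vector(paths):
    """Count file types (extensions) and lowercase name tokens for one project."""
    features = Counter()
    for p in paths:
        path = PurePosixPath(p)
        ext = path.suffix.lower() or "<none>"
        features["type:" + ext] += 1
        for token in re.findall(r"[a-z]+", path.stem.lower()):
            features["name:" + token] += 1
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Single-link grouping: projects with similarity >= threshold share a cluster.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    names = list(vectors)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


def top_features(vectors, group, k=3):
    """Most frequent file types and name tokens across the projects in a group."""
    total = Counter()
    for n in group:
        total.update(vectors[n])
    return [feature for feature, _ in total.most_common(k)]


# Hypothetical toy repository: project name -> list of file paths.
projects = {
    "web-shop": ["src/index.js", "src/cart.js", "package.json", "README.md"],
    "web-blog": ["src/index.js", "src/post.js", "package.json", "README.md"],
    "ml-tool": ["train.py", "model.py", "setup.py", "README.md"],
    "ml-lib": ["model.py", "utils.py", "setup.py", "README.md"],
}
vectors = {name: project_vector(paths) for name, paths in projects.items()}
clusters = cluster(vectors)

for group in clusters:
    print(group, top_features(vectors, group))
```

On this toy input the two JavaScript projects and the two Python projects fall into separate groups, and the dominant feature of each group is its file type. A real repository would likely need feature weighting (for example TF-IDF) and a more scalable clustering method, neither of which the abstract specifies.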