Feature-driven comparison of Git and Mercurial using DBSCAN and Random Forest

IF 4.5 Q2 COMPUTER SCIENCE, THEORY & METHODS
Array Pub Date : 2025-09-20 DOI:10.1016/j.array.2025.100519
Rashed Bahlool, Sameh Foulad, Sami Dagash
Array, volume 28, Article 100519. Journal article, published 2025-09-20. Not open access.
Source: https://www.sciencedirect.com/science/article/pii/S2590005625001468 (indexed via Semantic Scholar)
Citations: 0

Abstract

Git and Mercurial are prominent distributed version control systems that differ in their underlying architecture and resource consumption. This study presents a systematic comparison of their performance using a machine learning-driven approach that measures resource consumption on synthetically generated repositories, which allow key variables to be controlled. To eliminate anomalies, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm was applied. The influence of repository features was determined using Random Forest (RF) in three performance dimensions: CPU time, memory usage, and repository size. The findings indicate that Git demonstrated superior efficiency in CPU and memory usage, particularly in branching operations, while Mercurial exhibited better storage optimization and consistency, making it suitable for large-scale projects with constrained storage capacity. To interpret our results, SHapley Additive exPlanations (SHAP) was used to reveal the direction and strength of each feature's influence in relation to the repository characteristics. These findings offer practical guidance for selecting a version control system based on specific project requirements.
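The anomaly-removal step mentioned in the abstract can be illustrated with a small from-scratch sketch of DBSCAN. This is not the authors' code: the measurement values, the `eps` and `min_samples` settings, and the helper names `region_query` and `dbscan` are all assumptions made for illustration, and a real pipeline would more likely use a library implementation and standardize the features first. The sketch only shows how points that fall in no dense region are labeled as noise and can then be dropped.

```python
import math
from collections import deque

NOISE = -1  # label given to points that belong to no dense region


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned while expanding an earlier cluster
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # provisional; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Hypothetical (cpu_seconds, memory_mb) measurements; invented, not from the paper.
runs = [(1.0, 50.0), (1.1, 52.0), (0.9, 49.0), (1.05, 51.0), (1.2, 50.5), (9.0, 400.0)]

labels = dbscan(runs, eps=3.0, min_samples=3)

# Keep only the measurements that DBSCAN did not flag as noise.
filtered = [run for run, label in zip(runs, labels) if label != NOISE]
```

With these illustrative settings, the five tightly grouped measurements form one cluster and the isolated measurement is treated as an anomaly. The filtered data would then be the input to a feature-importance model such as a Random Forest.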
Source journal: Array (Computer Science, General Computer Science). CiteScore: 4.40. Self-citation rate: 0.00%. Articles published: 93. Review time: 45 days.