Feature-driven comparison of Git and Mercurial using DBSCAN and Random Forest
Authors: Rashed Bahlool, Sameh Foulad, Sami Dagash
Journal: Array, volume 28, Article 100519 (impact factor 4.5; JCR Q2, Computer Science, Theory & Methods)
Publication date: 2025-09-20
DOI: 10.1016/j.array.2025.100519
URL: https://www.sciencedirect.com/science/article/pii/S2590005625001468
Citation count: 0
Abstract
Git and Mercurial are prominent distributed version control systems that differ in their underlying architecture and resource consumption. This study presents a systematic comparison of their performance using a machine learning-driven approach that measures resource consumption on synthetically generated repositories, which allows key variables to be controlled. To eliminate anomalies, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm was applied. The influence of repository features was determined using Random Forest (RF) across three performance dimensions: CPU time, memory usage, and repository size. The findings indicate that Git demonstrated superior efficiency in CPU and memory usage, particularly in branching operations, while Mercurial exhibited better storage optimization and consistency, making it suitable for large-scale projects with constrained storage capacity. To interpret our results, SHapley Additive exPlanations (SHAP) was used to reveal the direction and strength of feature influence corresponding to the repository characteristics. These findings offer practical guidance for selecting version control systems based on specific project requirements.
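The abstract names DBSCAN as the anomaly-removal step but does not give its parameters or input features. The following is only a minimal sketch of how density-based noise filtering can work, written in plain Python so it is self-contained. The function name `dbscan`, the `eps` and `min_samples` values, and the normalized (CPU time, memory usage) measurements are all hypothetical and not taken from the paper. In practice a library implementation such as scikit-learn's `DBSCAN` would likely be used instead.

```python
import math

NOISE = -1  # label for points that belong to no dense region


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks anomalies."""
    n = len(points)
    # Precompute eps-neighbourhoods (each point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing
    return labels


# Hypothetical, already-normalized (cpu_time, memory_usage) measurements.
runs = [
    (0.10, 0.20), (0.12, 0.22), (0.11, 0.19), (0.13, 0.21),
    (0.80, 0.85), (0.82, 0.83), (0.79, 0.86), (0.81, 0.84),
    (0.45, 0.99),  # isolated measurement, far from both dense groups
]
labels = dbscan(runs, eps=0.05, min_samples=3)

# Keep only measurements that DBSCAN assigned to a cluster.
clean = [p for p, label in zip(runs, labels) if label != NOISE]
print(labels)
print(len(clean), "of", len(runs), "measurements kept")
```

Only the points labelled as noise are dropped, so the dense groups of typical measurements pass through unchanged to whatever model is fitted afterwards. Because DBSCAN is distance-based, features with different units would need to be scaled first; the sketch sidesteps this by assuming normalized inputs.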