Feature-driven comparison of Git and Mercurial using DBSCAN and Random Forest
Rashed Bahlool, Sameh Foulad, Sami Dagash
Array, vol. 28, Article 100519 (published 2025-09-20). DOI: 10.1016/j.array.2025.100519
https://www.sciencedirect.com/science/article/pii/S2590005625001468
Git and Mercurial are prominent distributed version control systems that differ in their underlying architecture and resource consumption. This study presents a systematic, machine learning-driven comparison of their performance, measuring resource consumption on synthetically generated repositories so that key variables can be controlled. To eliminate anomalies, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm was applied. The influence of repository features was then determined using Random Forest (RF) in three performance dimensions: CPU time, memory usage, and repository size. The findings indicate that Git was more efficient in CPU and memory usage, particularly in branching operations, while Mercurial exhibited better storage optimization and consistency, making it suitable for large-scale projects with constrained storage capacity. To interpret the results, SHapley Additive exPlanations (SHAP) was used to reveal the direction and strength of each repository feature's influence. These findings offer practical guidance for selecting a version control system based on specific project requirements.
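The two-stage analysis described in the abstract (DBSCAN to discard anomalous measurements, then Random Forest to rank repository features) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the feature names, the synthetic cost model, the injected anomalies, and the `eps` and `min_samples` values are all assumptions chosen for illustration, and the numbers it produces have no empirical meaning.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical repository features; the paper's actual feature set is not given here.
FEATURES = ["n_commits", "n_branches", "n_files"]


def make_synthetic_runs(n=300, n_anomalies=6, seed=0):
    """Generate made-up benchmark runs with a few injected anomalies."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([
        rng.uniform(100, 5000, n),   # n_commits
        rng.uniform(1, 50, n),       # n_branches
        rng.uniform(10, 1000, n),    # n_files
    ])
    # Invented cost model: CPU time driven mostly by commit count, plus noise.
    cpu = 0.002 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.2, n)
    anomalies = rng.choice(n, size=n_anomalies, replace=False)
    cpu[anomalies] += 60.0           # simulate runs disturbed by background load
    return X, cpu, anomalies


def inlier_mask(X, y, eps=1.0, min_samples=8):
    """Return True for rows that DBSCAN assigns to a cluster (label != -1)."""
    # Standardize so that eps means the same thing along every axis.
    joint = StandardScaler().fit_transform(np.column_stack([X, y]))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(joint)
    return labels != -1


def rank_features(X, y, names=FEATURES, seed=0):
    """Fit a Random Forest and return (name, importance) pairs, largest first."""
    model = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X, y)
    pairs = zip(names, model.feature_importances_)
    return sorted(pairs, key=lambda p: p[1], reverse=True)


if __name__ == "__main__":
    X, cpu, _ = make_synthetic_runs()
    keep = inlier_mask(X, cpu)
    print(f"kept {keep.sum()} of {len(keep)} runs")
    for name, score in rank_features(X[keep], cpu[keep]):
        print(f"{name:12s} {score:.3f}")
```

The impurity-based `feature_importances_` used here reports only the strength of each feature's influence, not its direction. The abstract attributes the directional interpretation to SHAP, which this sketch leaves out.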