{"title":"使用特征选择和堆叠异构集成的合并冲突预测:一个实证研究","authors":"Reem Alfayez, Amal Alazba","doi":"10.1002/smr.70047","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>Merge conflicts arise when multiple developers simultaneously modify the same part of a codebase and attempt to merge their changes. These conflicts occur because the version control system (VCS) cannot automatically determine which changes should take precedence. Resolving such conflicts involves manually reviewing the conflicting changes and deciding how to integrate them to maintain a functional and coherent codebase. This process is often time-consuming, complex, and prone to errors. Consequently, the software engineering community has focused on predicting merge conflicts to warn developers early and allow them to address conflicts before they escalate. Despite several efforts to predict merge conflicts, no perfect solution has been identified. Fortunately, many machine learning techniques have demonstrated potential in improving prediction performance across various contexts. This study aims to empirically investigate the effectiveness of stacking heterogeneous ensembles in enhancing merge conflict prediction performance. We empirically compared the prediction performance of the following individual models: decision trees (DT); support vector machine (SVM) with a linear kernel; naive Bayes (NB) with Bernoulli, Gaussian, and Multinomial variants; logistic regression (LR); multilayer perceptron (MLP); stochastic gradient descent (SGD); and k-nearest neighbors (KNN). Additionally, we evaluated three heterogeneous stacking ensembles: Stack-DT, Stack-SVM, and Stack-LR, which were constructed using the aforementioned individual models as base models. We utilized gain ratio (GR) to identify the most important technical and social features for predicting merge conflicts and assessed the impact of using only these important features on the performance of both individual and stacking models. The study revealed variability in the performance of individual models, with DT demonstrating the best predictive performance among them. Heterogeneous stacking ensembles demonstrated potential to enhance merge conflict prediction, with Stack-SVM emerging as the top-performing model. GR analysis highlighted the importance of both social and technical features in predicting merge conflicts. However, using only the most important features identified by GR led to a decline in the performance of most models compared to using all features. Heterogeneous stacking ensembles significantly improve prediction performance over individual models. Both social and technical features are important in predicting merge conflicts, and utilizing the full set of features instead of only the most important ones generally yields better results.</p>\n </div>","PeriodicalId":48898,"journal":{"name":"Journal of Software-Evolution and Process","volume":"37 9","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Merge Conflict Prediction Using Feature Selection and Stacking Heterogeneous Ensembles: An Empirical Investigation\",\"authors\":\"Reem Alfayez, Amal Alazba\",\"doi\":\"10.1002/smr.70047\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>Merge conflicts arise when multiple developers simultaneously modify the same part of a codebase and attempt to merge their changes. These conflicts occur because the version control system (VCS) cannot automatically determine which changes should take precedence. Resolving such conflicts involves manually reviewing the conflicting changes and deciding how to integrate them to maintain a functional and coherent codebase. This process is often time-consuming, complex, and prone to errors. Consequently, the software engineering community has focused on predicting merge conflicts to warn developers early and allow them to address conflicts before they escalate. Despite several efforts to predict merge conflicts, no perfect solution has been identified. Fortunately, many machine learning techniques have demonstrated potential in improving prediction performance across various contexts. This study aims to empirically investigate the effectiveness of stacking heterogeneous ensembles in enhancing merge conflict prediction performance. We empirically compared the prediction performance of the following individual models: decision trees (DT); support vector machine (SVM) with a linear kernel; naive Bayes (NB) with Bernoulli, Gaussian, and Multinomial variants; logistic regression (LR); multilayer perceptron (MLP); stochastic gradient descent (SGD); and k-nearest neighbors (KNN). Additionally, we evaluated three heterogeneous stacking ensembles: Stack-DT, Stack-SVM, and Stack-LR, which were constructed using the aforementioned individual models as base models. We utilized gain ratio (GR) to identify the most important technical and social features for predicting merge conflicts and assessed the impact of using only these important features on the performance of both individual and stacking models. The study revealed variability in the performance of individual models, with DT demonstrating the best predictive performance among them. Heterogeneous stacking ensembles demonstrated potential to enhance merge conflict prediction, with Stack-SVM emerging as the top-performing model. GR analysis highlighted the importance of both social and technical features in predicting merge conflicts. However, using only the most important features identified by GR led to a decline in the performance of most models compared to using all features. Heterogeneous stacking ensembles significantly improve prediction performance over individual models. Both social and technical features are important in predicting merge conflicts, and utilizing the full set of features instead of only the most important ones generally yields better results.</p>\\n </div>\",\"PeriodicalId\":48898,\"journal\":{\"name\":\"Journal of Software-Evolution and Process\",\"volume\":\"37 9\",\"pages\":\"\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2025-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Software-Evolution and Process\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/smr.70047\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Software-Evolution and Process","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/smr.70047","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Merge Conflict Prediction Using Feature Selection and Stacking Heterogeneous Ensembles: An Empirical Investigation
Merge conflicts arise when multiple developers simultaneously modify the same part of a codebase and attempt to merge their changes. These conflicts occur because the version control system (VCS) cannot automatically determine which changes should take precedence. Resolving such conflicts involves manually reviewing the conflicting changes and deciding how to integrate them to maintain a functional and coherent codebase. This process is often time-consuming, complex, and prone to errors. Consequently, the software engineering community has focused on predicting merge conflicts to warn developers early and allow them to address conflicts before they escalate. Despite several efforts to predict merge conflicts, no perfect solution has been identified. Fortunately, many machine learning techniques have demonstrated potential in improving prediction performance across various contexts. This study aims to empirically investigate the effectiveness of stacking heterogeneous ensembles in enhancing merge conflict prediction performance. We empirically compared the prediction performance of the following individual models: decision trees (DT); support vector machine (SVM) with a linear kernel; naive Bayes (NB) with Bernoulli, Gaussian, and Multinomial variants; logistic regression (LR); multilayer perceptron (MLP); stochastic gradient descent (SGD); and k-nearest neighbors (KNN). Additionally, we evaluated three heterogeneous stacking ensembles: Stack-DT, Stack-SVM, and Stack-LR, which were constructed using the aforementioned individual models as base models. We utilized gain ratio (GR) to identify the most important technical and social features for predicting merge conflicts and assessed the impact of using only these important features on the performance of both individual and stacking models. The study revealed variability in the performance of individual models, with DT demonstrating the best predictive performance among them. Heterogeneous stacking ensembles demonstrated potential to enhance merge conflict prediction, with Stack-SVM emerging as the top-performing model. GR analysis highlighted the importance of both social and technical features in predicting merge conflicts. However, using only the most important features identified by GR led to a decline in the performance of most models compared to using all features. Heterogeneous stacking ensembles significantly improve prediction performance over individual models. Both social and technical features are important in predicting merge conflicts, and utilizing the full set of features instead of only the most important ones generally yields better results.