{"title":"Mining Historical Test Failures to Dynamically Batch Tests to Save CI Resources","authors":"Amir Hossein Bavand, Peter C. Rigby","doi":"10.1109/ICSME52107.2021.00026","DOIUrl":null,"url":null,"abstract":"Testing is a costly, time-consuming, and challenging part of modern software development. During continuous integration, after submitting each change, it is tested automatically to ensure that it does not break the system's functionality. A common approach to reducing the number of test case executions is to batch changes together for testing. For example, given four changes to test, if we group them in a batch and they pass we use one execution to test all four changes. However, if they fail, additional executions are required to find the culprit change that is responsible for the failure. We evaluate five batch culprit finding approaches: Dorfman, double pool testing, BatchBisect, BatchStop4, and our novel BatchDivide4. All prior works on batching use a constant batch size. In this work, we propose a dynamic batch size technique based on the weighted historical failure rate of the project. We simulate each of the batching strategies across 12 large projects on Travis with varying failures rate. We find that dynamic batching coupled with BatchDivide4 outperforms the other approaches. Compared to TestAll, this approach decreases the number of executions by 47.49% on average across the Travis projects. It outperforms the current state-of-the-art Batch4 by 5.17 percentage points. Our historical weighting approach leads us to a metric that describes the number of consecutive build failures. We find that the correlation between batch savings and FailureSpread is r = -0.97 with a p << 0.0001. This metric easily allows developers to determine the potential of batching on their project. We also contribute a theoretical limit for the savings that can be achieved by batch testing. We show that using dynamic batching, we achieve an across project average of 58.91% of the theoretical limit. Although batching is highly effective, there is still substantial room for improving batching relative to the theoretical batch savings limit. We make our scripts and data available for replication [1].","PeriodicalId":205629,"journal":{"name":"2021 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"104 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Software Maintenance and Evolution (ICSME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSME52107.2021.00026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
Testing is a costly, time-consuming, and challenging part of modern software development. During continuous integration, after submitting each change, it is tested automatically to ensure that it does not break the system's functionality. A common approach to reducing the number of test case executions is to batch changes together for testing. For example, given four changes to test, if we group them in a batch and they pass we use one execution to test all four changes. However, if they fail, additional executions are required to find the culprit change that is responsible for the failure. We evaluate five batch culprit finding approaches: Dorfman, double pool testing, BatchBisect, BatchStop4, and our novel BatchDivide4. All prior works on batching use a constant batch size. In this work, we propose a dynamic batch size technique based on the weighted historical failure rate of the project. We simulate each of the batching strategies across 12 large projects on Travis with varying failures rate. We find that dynamic batching coupled with BatchDivide4 outperforms the other approaches. Compared to TestAll, this approach decreases the number of executions by 47.49% on average across the Travis projects. It outperforms the current state-of-the-art Batch4 by 5.17 percentage points. Our historical weighting approach leads us to a metric that describes the number of consecutive build failures. We find that the correlation between batch savings and FailureSpread is r = -0.97 with a p << 0.0001. This metric easily allows developers to determine the potential of batching on their project. We also contribute a theoretical limit for the savings that can be achieved by batch testing. We show that using dynamic batching, we achieve an across project average of 58.91% of the theoretical limit. Although batching is highly effective, there is still substantial room for improving batching relative to the theoretical batch savings limit. We make our scripts and data available for replication [1].