Mining Historical Test Failures to Dynamically Batch Tests to Save CI Resources

2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). Pub Date: 2021-09-01. DOI: 10.1109/ICSME52107.2021.00026

Amir Hossein Bavand, Peter C. Rigby


Citations: 2

Abstract

Testing is a costly, time-consuming, and challenging part of modern software development. During continuous integration, each change is tested automatically after it is submitted to ensure that it does not break the system's functionality. A common approach to reducing the number of test case executions is to batch changes together for testing. For example, given four changes to test, if we group them in a batch and the batch passes, we use one execution to test all four changes. However, if the batch fails, additional executions are required to find the culprit change that is responsible for the failure. We evaluate five batch culprit-finding approaches: Dorfman, double pool testing, BatchBisect, BatchStop4, and our novel BatchDivide4. All prior work on batching uses a constant batch size. In this work, we propose a dynamic batch size technique based on the weighted historical failure rate of the project. We simulate each of the batching strategies across 12 large projects on Travis with varying failure rates. We find that dynamic batching coupled with BatchDivide4 outperforms the other approaches. Compared to TestAll, this approach decreases the number of executions by 47.49% on average across the Travis projects. It outperforms the current state-of-the-art Batch4 by 5.17 percentage points. Our historical weighting approach leads us to a metric, FailureSpread, that describes the number of consecutive build failures. We find that the correlation between batch savings and FailureSpread is r = -0.97 with p << 0.0001. This metric allows developers to easily determine the potential of batching on their project. We also contribute a theoretical limit for the savings that can be achieved by batch testing. We show that, using dynamic batching, we achieve an across-project average of 58.91% of the theoretical limit. Although batching is highly effective, there is still substantial room for improving batching relative to the theoretical batch savings limit. We make our scripts and data available for replication [1].
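The batching arithmetic in the abstract (one execution when a batch passes, extra executions to locate a culprit when it fails) can be illustrated with a small simulation. The sketch below is a simplified, generic bisection model and is not the authors' BatchBisect, BatchStop4, or BatchDivide4 implementation. The function names `bisect_executions` and `simulate_savings`, the fixed batch size, and the rule that both halves of a failing batch are always re-run are assumptions made for illustration only.

```python
from typing import List


def bisect_executions(outcomes: List[bool]) -> int:
    """Count the executions needed to classify every change in one batch.

    outcomes[i] is True when change i would fail on its own. This ground
    truth is hypothetical and exists only to drive the simulation. A batch
    fails when it contains at least one failing change.
    """
    if not outcomes:
        return 0
    executions = 1  # run the whole batch once
    if not any(outcomes) or len(outcomes) == 1:
        # The batch passed, or a single change has been isolated.
        return executions
    # The batch failed: split it and test both halves (assumed rule).
    mid = len(outcomes) // 2
    return (executions
            + bisect_executions(outcomes[:mid])
            + bisect_executions(outcomes[mid:]))


def simulate_savings(outcomes: List[bool], batch_size: int) -> float:
    """Fraction of executions saved relative to testing every change alone."""
    if not outcomes:
        return 0.0
    total = 0
    for start in range(0, len(outcomes), batch_size):
        total += bisect_executions(outcomes[start:start + batch_size])
    return 1 - total / len(outcomes)


if __name__ == "__main__":
    all_pass = [False] * 4
    one_culprit = [False, False, True, False]
    print(bisect_executions(all_pass))     # one execution covers four changes
    print(bisect_executions(one_culprit))  # failure triggers extra executions
    print(simulate_savings(all_pass + one_culprit, batch_size=4))
```

Under this model a passing batch of four costs a single execution, while a batch with one failing change costs more than testing the four changes individually. This shows why savings depend on how often failures occur, and why a batch size that adapts to the historical failure rate can help.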


