FlakeFlagger: Predicting Flakiness Without Rerunning Tests

A. Alshammari, Christopher Morris, Michael C. Hilton, Jonathan Bell
{"title":"FlakeFlagger: Predicting Flakiness Without Rerunning Tests","authors":"A. Alshammari, Christopher Morris, Michael C Hilton, Jonathan Bell","doi":"10.1109/ICSE43902.2021.00140","DOIUrl":null,"url":null,"abstract":"When developers make changes to their code, they typically run regression tests to detect if their recent changes (re) introduce any bugs. However, many tests are flaky, and their outcomes can change non-deterministically, failing without apparent cause. Flaky tests are a significant nuisance in the development process, since they make it more difficult for developers to trust the outcome of their tests, and hence, it is important to know which tests are flaky. The traditional approach to identify flaky tests is to rerun them multiple times: if a test is observed both passing and failing on the same code, it is definitely flaky. We conducted a very large empirical study looking for flaky tests by rerunning the test suites of 24 projects 10,000 times each, and found that even with this many reruns, some previously identified flaky tests were still not detected. We propose FlakeFlagger, a novel approach that collects a set of features describing the behavior of each test, and then predicts tests that are likely to be flaky based on similar behavioral features. We found that FlakeFlagger correctly labeled as flaky at least as many tests as a state-of-the-art flaky test classifier, but that FlakeFlagger reported far fewer false positives. This lower false positive rate translates directly to saved time for researchers and developers who use the classification result to guide more expensive flaky test detection processes. Evaluated on our dataset of 23 projects with flaky tests, FlakeFlagger outperformed the prior approach (by F1 score) on 16 projects and tied on 4 projects. Our results indicate that this approach can be effective for identifying likely flaky tests prior to running time-consuming flaky test detectors.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"51","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSE43902.2021.00140","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 51

Abstract

When developers make changes to their code, they typically run regression tests to detect if their recent changes (re)introduce any bugs. However, many tests are flaky, and their outcomes can change non-deterministically, failing without apparent cause. Flaky tests are a significant nuisance in the development process, since they make it more difficult for developers to trust the outcome of their tests, and hence, it is important to know which tests are flaky. The traditional approach to identify flaky tests is to rerun them multiple times: if a test is observed both passing and failing on the same code, it is definitely flaky. We conducted a very large empirical study looking for flaky tests by rerunning the test suites of 24 projects 10,000 times each, and found that even with this many reruns, some previously identified flaky tests were still not detected. We propose FlakeFlagger, a novel approach that collects a set of features describing the behavior of each test, and then predicts tests that are likely to be flaky based on similar behavioral features. We found that FlakeFlagger correctly labeled as flaky at least as many tests as a state-of-the-art flaky test classifier, but that FlakeFlagger reported far fewer false positives. This lower false positive rate translates directly to saved time for researchers and developers who use the classification result to guide more expensive flaky test detection processes. Evaluated on our dataset of 23 projects with flaky tests, FlakeFlagger outperformed the prior approach (by F1 score) on 16 projects and tied on 4 projects. Our results indicate that this approach can be effective for identifying likely flaky tests prior to running time-consuming flaky test detectors.
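A minimal sketch of the prediction idea described in the abstract: train a supervised classifier on per-test behavioral features, then rank tests by predicted flakiness so that expensive rerun-based detection can focus on the most suspicious tests first. The synthetic data, the three feature columns, and the RandomForestClassifier choice below are illustrative assumptions, not the paper's exact feature set or pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for behavioral features collected per test
# (hypothetical examples: execution time, lines covered, assertion
# count); FlakeFlagger's real feature set is richer.
n_tests = 1000
X = rng.random((n_tests, 3))

# Synthetic ground-truth labels: 1 = flaky, i.e., the test was
# observed both passing and failing on reruns of the same code.
y = (X[:, 0] + rng.normal(0, 0.2, n_tests) > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Fit a classifier on behavioral features of tests with known labels.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Evaluate with F1, the metric the paper uses to compare approaches.
pred = clf.predict(X_test)
print(f"F1 on held-out tests: {f1_score(y_test, pred):.2f}")

# Rank unseen tests by predicted probability of flakiness, most
# suspicious first, to prioritize costly rerun-based detectors.
flaky_prob = clf.predict_proba(X_test)[:, 1]
ranking = np.argsort(-flaky_prob)
```

The key property motivating this design is that prediction needs only one instrumented run per test, whereas the rerun baseline in the study needed up to 10,000 executions and still missed some known flaky tests.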