The Impact of Flaky Tests on Historical Test Prioritization on Chrome

2022 IEEE/ACM 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). Pub Date: 2022-05-01. DOI: 10.1145/3510457.3513038

Emad Fallahzadeh, Peter C. Rigby


Citations: 3

Abstract

Test prioritization algorithms prioritize probable failing tests to give faster feedback to developers in case a failure occurs. Test prioritization approaches that use historical failures to run tests that have failed in the past may be susceptible to flaky tests, as these tests often fail and then pass without identifying a fault. Traditionally, flaky failures, like other types of failures, are considered blocking, i.e., a failure that needs to be investigated before the code can move to the next stage. However, on Google Chrome, flaky failures are non-blocking and the code still moves to the next stage in the CI pipeline. In this work, we explain the Chrome testing pipeline and classification. Then, we re-implement two important history-based test prioritization algorithms and evaluate them on over 276 million test runs from the Chrome project. We apply these algorithms in two scenarios. First, we consider flaky failures as blocking, and then we use Chrome's approach and consider flaky failures as non-blocking. Our investigation reveals that 99.58% of all failures are flaky. These types of failures are much more repetitive than non-flaky failures, and they are also well distributed over time. We conclude that the prior performance of the prioritization algorithms has been inflated by flaky failures. We release our data and scripts in our replication package [8].
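The two evaluation scenarios can be illustrated with a minimal sketch. It is not one of the two algorithms re-implemented in the paper; it assumes the simplest possible history-based heuristic (rank tests by their number of past failures), and the names `Run`, `prioritize`, and the outcome labels `"pass"`, `"fail"`, and `"flaky"` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Run(NamedTuple):
    """One historical test execution (hypothetical record layout)."""
    test: str
    outcome: str  # assumed labels: "pass", "fail", or "flaky"


def prioritize(history: Iterable[Run], tests: Iterable[str],
               flaky_is_blocking: bool) -> List[str]:
    """Order tests by past failure count, most failures first.

    When flaky_is_blocking is True, flaky failures count as failures
    (the traditional view); when False, they are ignored (Chrome's view).
    """
    counted = {"fail", "flaky"} if flaky_is_blocking else {"fail"}
    failures = Counter(run.test for run in history if run.outcome in counted)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(tests, key=lambda t: (-failures[t], t))


# Toy history: "login" fails only flakily, "render" has one real failure.
history = [
    Run("login", "flaky"), Run("login", "flaky"), Run("login", "flaky"),
    Run("render", "fail"), Run("net", "pass"),
]
tests = ["login", "net", "render"]

blocking_order = prioritize(history, tests, flaky_is_blocking=True)
non_blocking_order = prioritize(history, tests, flaky_is_blocking=False)
print(blocking_order)      # ['login', 'render', 'net']
print(non_blocking_order)  # ['render', 'login', 'net']
```

In this toy history, the repeatedly flaky test dominates the ranking when flaky failures are counted and drops behind the genuinely failing test when they are not, which is the kind of shift the two scenarios are meant to expose.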

