DeFlaker: Automatically Detecting Flaky Tests

2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). Pub Date: 2018-05-27. DOI: 10.1145/3180155.3180164

Jonathan Bell, Owolabi Legunsen, Michael C Hilton, Lamyaa Eloussi, Tifany Yung, D. Marinov

Pages: 433-444

Citations: 156

Abstract

Developers often run tests to check that their latest changes to a code repository did not break any previously working functionality. Ideally, any new test failures would indicate regressions caused by the latest changes. However, some test failures may not be due to the latest changes but due to non-determinism in the tests, popularly called flaky tests. The typical way to detect flaky tests is to rerun failing tests repeatedly. Unfortunately, rerunning failing tests can be costly and can slow down the development cycle. We present the first extensive evaluation of rerunning failing tests and propose a new technique, called DeFlaker, that detects if a test failure is due to a flaky test without rerunning and with very low runtime overhead. DeFlaker monitors the coverage of latest code changes and marks as flaky any newly failing test that did not execute any of the changes. We deployed DeFlaker live, in the build process of 96 Java projects on TravisCI, and found 87 previously unknown flaky tests in 10 of these projects. We also ran experiments on project histories, where DeFlaker detected 1,874 flaky tests from 4,846 failures, with a low false alarm rate (1.5%). DeFlaker had a higher recall (95.5% vs. 23%) of confirmed flaky tests than Maven's default flaky test detector.
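
The decision rule stated in the abstract (a newly failing test that executed none of the latest changes is marked flaky) can be illustrated with a minimal sketch. This is not DeFlaker's implementation: the `Outcome` record, the `classify_failures` function, the representation of coverage as (file, line) pairs, and the sample file and test names are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical record of one test across two consecutive builds."""
    name: str
    passed_before: bool   # result on the previous commit
    passed_now: bool      # result on the current commit
    covered: frozenset    # (file, line) pairs the test executed in the current run


def classify_failures(outcomes, changed_lines):
    """Label each newly failing test as 'flaky' or 'possible regression'."""
    changed = frozenset(changed_lines)
    labels = {}
    for o in outcomes:
        newly_failing = o.passed_before and not o.passed_now
        if not newly_failing:
            continue  # only newly failing tests are classified
        if o.covered & changed:
            # The test executed changed code, so the change may explain the failure.
            labels[o.name] = "possible regression"
        else:
            # The test executed none of the changed code, so it is marked flaky.
            labels[o.name] = "flaky"
    return labels


# Illustrative data: two changed lines and three tests (all names are made up).
changed = {("Cache.java", 42), ("Cache.java", 43)}
outcomes = [
    Outcome("testEviction", True, False, frozenset({("Cache.java", 42)})),
    Outcome("testTimeout", True, False, frozenset({("Net.java", 10)})),
    Outcome("testParse", True, True, frozenset({("Parser.java", 7)})),
]
labels = classify_failures(outcomes, changed)
print(labels)
```

In this sketch, `testTimeout` fails without touching either changed line and is labeled flaky, while `testEviction` is kept as a possible regression because it ran changed code. No rerun is needed for either decision, which is the cost saving the abstract emphasizes.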
