{"title":"Balancing Effectiveness and Flakiness of Non-Deterministic Machine Learning Tests","authors":"Chun Xia, Saikat Dutta, D. Marinov","doi":"10.1109/ICSE48619.2023.00154","DOIUrl":null,"url":null,"abstract":"Testing Machine Learning (ML) projects is challenging due to inherent non-determinism of various ML algorithms and the lack of reliable ways to compute reference results. Developers typically rely on their intuition when writing tests to check whether ML algorithms produce accurate results. However, this approach leads to conservative choices in selecting assertion bounds for comparing actual and expected results in test assertions. Because developers want to avoid false positive failures in tests, they often set the bounds to be too loose, potentially leading to missing critical bugs. We present FASER - the first systematic approach for balancing the trade-off between the fault-detection effectiveness and flakiness of non-deterministic tests by computing optimal assertion bounds. FASER frames this trade-off as an optimization problem between these competing objectives by varying the assertion bound. FASER leverages 1) statistical methods to estimate the flakiness rate, and 2) mutation testing to estimate the fault-detection effectiveness. We evaluate FASER on 87 non-deterministic tests collected from 22 popular ML projects. FASER finds that 23 out of 87 studied tests have conservative bounds and proposes tighter assertion bounds that maximizes the fault-detection effectiveness of the tests while limiting flakiness. We have sent 19 pull requests to developers, each fixing one test, out of which 14 pull requests have already been accepted.","PeriodicalId":376379,"journal":{"name":"2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSE48619.2023.00154","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Testing Machine Learning (ML) projects is challenging due to the inherent non-determinism of various ML algorithms and the lack of reliable ways to compute reference results. Developers typically rely on their intuition when writing tests to check whether ML algorithms produce accurate results. However, this approach leads to conservative choices in selecting assertion bounds for comparing actual and expected results in test assertions. Because developers want to avoid false-positive failures, they often set the bounds too loose, potentially missing critical bugs. We present FASER - the first systematic approach for balancing the trade-off between the fault-detection effectiveness and flakiness of non-deterministic tests by computing optimal assertion bounds. FASER frames this trade-off as an optimization problem between the two competing objectives, explored by varying the assertion bound. FASER leverages 1) statistical methods to estimate the flakiness rate, and 2) mutation testing to estimate the fault-detection effectiveness. We evaluate FASER on 87 non-deterministic tests collected from 22 popular ML projects. FASER finds that 23 of the 87 studied tests have conservative bounds and proposes tighter assertion bounds that maximize the fault-detection effectiveness of the tests while limiting flakiness. We have sent 19 pull requests to developers, each fixing one test; 14 of these pull requests have already been accepted.
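The abstract's core idea can be illustrated with a minimal sketch: estimate flakiness by re-running the non-deterministic computation on correct code and counting how often a candidate bound would trigger a false failure, estimate effectiveness as the fraction of mutants whose deviations exceed the bound, and pick the bound that maximizes effectiveness under a flakiness budget. This is not FASER's actual implementation; all function names, the toy deviation model, and the flakiness budget below are illustrative assumptions.

```python
# Hypothetical sketch of the flakiness vs. fault-detection trade-off described above.
# Names and the toy data are assumptions, not FASER's actual API or results.
import random


def estimate_flakiness(deviation_sampler, bound, n_runs=10_000):
    # Fraction of re-runs on the *correct* code whose |actual - expected|
    # deviation exceeds the candidate bound, i.e., a false-positive failure.
    failures = sum(1 for _ in range(n_runs) if deviation_sampler() > bound)
    return failures / n_runs


def estimate_effectiveness(mutant_deviations, bound):
    # Fraction of mutants "killed": a mutant counts as killed if its observed
    # deviation exceeds the bound, so the assertion fails on the buggy code.
    killed = sum(1 for d in mutant_deviations if d > bound)
    return killed / len(mutant_deviations)


def choose_bound(deviation_sampler, mutant_deviations, candidates, flakiness_budget=0.001):
    # Pick the bound that maximizes effectiveness subject to the flakiness budget.
    best_bound, best_eff = None, -1.0
    for b in candidates:
        if estimate_flakiness(deviation_sampler, b) > flakiness_budget:
            continue  # too flaky: correct code would fail too often at this bound
        eff = estimate_effectiveness(mutant_deviations, b)
        if eff > best_eff:
            best_bound, best_eff = b, eff
    return best_bound, best_eff


if __name__ == "__main__":
    random.seed(0)
    # Toy stand-in for a non-deterministic ML test: deviations on correct code
    # cluster near 0.02, while injected mutants produce larger deviations.
    correct = lambda: abs(random.gauss(0.0, 0.02))
    mutants = [abs(random.gauss(0.15, 0.05)) for _ in range(200)]
    bound, eff = choose_bound(correct, mutants, candidates=[0.05, 0.1, 0.2, 0.3])
    print(f"chosen bound={bound}, estimated effectiveness={eff:.2f}")
```

In this toy setup, a very loose bound (e.g., 0.3) rarely flakes but also misses most mutants, while a tighter bound that still respects the flakiness budget kills more of them, which mirrors the trade-off the paper optimizes.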