FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning

2023 IEEE/ACM International Conference on Automation of Software Test (AST). Pub Date: 2023-05-01. DOI: 10.1109/AST58925.2023.00018

Amal Akli, Guillaume Haben, Sarra Habchi, Mike Papadakis, Yves Le Traon


Citations: 1

Abstract

Flaky tests are tests that yield different outcomes when run on the same version of a program. This non-deterministic behaviour plagues continuous integration with false signals, wasting developers’ time and reducing their trust in test suites. Studies have highlighted the importance of keeping test suites free of flakiness. Recently, the research community has been pushing towards the detection of flaky tests by proposing many static and dynamic approaches. While promising, those approaches mainly focus on classifying tests as flaky or not and, even when high performance is reported, it remains challenging to understand the cause of flakiness. Understanding the cause is crucial for researchers and developers who aim to fix flaky tests. To help with the comprehension of a given flaky test, we propose FlakyCat, the first approach to classify flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages Siamese networks to train a multi-class classifier. We train and evaluate FlakyCat on a set of 451 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with an F1 score of 73%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more challenging to predict. Finally, to facilitate the comprehension of FlakyCat’s predictions, we present a new technique for CodeBERT-based model interpretability that highlights the code statements influencing the categorization.
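The abstract does not spell out how a trained Siamese model assigns a category at prediction time. As a rough illustration only, the sketch below assumes a common few-shot scheme: each test is already mapped to an embedding vector, each category is represented by the centroid of its few labelled examples, and a new test takes the category of the most similar centroid. The three-dimensional vectors are hand-made stand-ins for embeddings (not CodeBERT output), and the function names `cosine_similarity`, `class_centroids` and `predict_category` are hypothetical, not taken from the paper.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def class_centroids(support_set):
    """Average the embeddings of the labelled examples of each category."""
    grouped = defaultdict(list)
    for vector, label in support_set:
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict_category(query, centroids):
    """Return the category whose centroid is most similar to the query."""
    return max(centroids, key=lambda label: cosine_similarity(query, centroids[label]))


# Toy support set: (embedding, category). The vectors are illustrative
# assumptions, not real embeddings of flaky tests.
SUPPORT_SET = [
    ([0.9, 0.1, 0.0], "Async waits"),
    ([0.8, 0.2, 0.1], "Async waits"),
    ([0.1, 0.9, 0.0], "Unordered collections"),
    ([0.2, 0.8, 0.1], "Unordered collections"),
    ([0.0, 0.1, 0.9], "Time"),
    ([0.1, 0.2, 0.8], "Time"),
]

CENTROIDS = class_centroids(SUPPORT_SET)

if __name__ == "__main__":
    print(predict_category([0.85, 0.15, 0.05], CENTROIDS))
```

In a real pipeline the vectors would come from a code model fine-tuned with a similarity objective, so that tests sharing a root cause end up close together; only a handful of labelled examples per category are then needed to form the centroids.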

