Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection

arXiv - CS - Neural and Evolutionary Computing. Pub Date: 2024-09-16. DOI: arxiv-2409.10111

Kodjo Mawuena Amekoe, Mustapha Lebbah, Gregoire Jaffre, Hanene Azzag, Zaineb Chelly Dagdia



Abstract

Real-world tabular learning production scenarios typically involve evolving data streams, where data arrives continuously and its distribution may change over time. In such a setting, most studies in the supervised learning literature favor instance incremental algorithms because of their ability to adapt to changes in the data distribution. Another significant reason for choosing these algorithms is that they avoid storing observations in memory, as is commonly done in batch incremental settings. However, the design of instance incremental algorithms often assumes that labels are immediately available, which is an optimistic assumption. In many real-world scenarios, such as fraud detection or credit scoring, labels may be delayed. Consequently, batch incremental algorithms are widely used in many real-world tasks. This raises an important question: "In delayed settings, is instance incremental learning the best option regarding predictive performance and computational efficiency?" Unfortunately, this question has not been studied in depth, probably due to the scarcity of real datasets containing delayed information. In this study, we conduct a comprehensive empirical evaluation and analysis of this question using a real-world fraud detection problem and commonly used generated datasets. Our findings indicate that instance incremental learning is not the superior option when comparing state-of-the-art instance incremental models such as Adaptive Random Forest (ARF) with batch learning models such as XGBoost. Additionally, when considering the interpretability of the learning systems, batch incremental solutions tend to be favored. Code: https://github.com/anselmeamekoe/DelayedLabelStream
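The delayed-label setting described in the abstract can be illustrated with a minimal sketch. This is not the authors' code (their repository is linked above) and it does not use ARF or XGBoost: a small hand-written logistic regression stands in for both learners, the stream is synthetic, and every name here (make_stream, OnlineLogReg, run_delayed, the delay and batch_size parameters) is a hypothetical choice made for illustration. The "refit from scratch on each full batch" rule is likewise only one assumed way to do batch incremental learning.

```python
import math
import random
from collections import deque


def make_stream(n, seed=0):
    """Yield (x, y) pairs from a toy, synthetic 2-feature binary stream."""
    rng = random.Random(seed)
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0      # minority "fraud-like" class
        center = 1.0 if y else -1.0
        yield [rng.gauss(center, 1.0), rng.gauss(center, 1.0)], y


class OnlineLogReg:
    """Logistic regression trained one example at a time by SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))             # clamp to avoid overflow
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        g = self.predict_proba(x) - y            # gradient of the log loss
        self.w = [wi - self.lr * g * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * g


def run_delayed(stream, delay, batch_size=None, epochs=5):
    """Test-then-train accuracy when each label arrives `delay` steps late.

    batch_size=None -> instance incremental: update on every arriving label.
    batch_size=k    -> batch incremental (assumed variant): store labelled
                       rows and refit a fresh model once k have arrived.
    """
    model = OnlineLogReg(2)
    pending = deque()                            # (arrival_step, x, y)
    buffer = []
    correct = total = 0
    for t, (x, y) in enumerate(stream):
        # Predict before the label is known.
        pred = 1 if model.predict_proba(x) > 0.5 else 0
        correct += pred == y
        total += 1
        pending.append((t + delay, x, y))
        # Release every label whose delay has elapsed.
        while pending and pending[0][0] <= t:
            _, xd, yd = pending.popleft()
            if batch_size is None:
                model.learn_one(xd, yd)
            else:
                buffer.append((xd, yd))
                if len(buffer) >= batch_size:
                    model = OnlineLogReg(2)
                    for _ in range(epochs):
                        for xb, yb in buffer:
                            model.learn_one(xb, yb)
                    buffer.clear()
    return correct / total


if __name__ == "__main__":
    print("instance incremental:", run_delayed(make_stream(2000), delay=100))
    print("batch incremental:   ",
          run_delayed(make_stream(2000), delay=100, batch_size=200))
```

Varying delay and batch_size in this toy loop shows how the two update strategies can be compared under the same label latency. It says nothing about the paper's actual results, which rest on real fraud data and much stronger models.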



