{"title":"评估延迟标签环境下实例增量学习与批量学习的效果:用于欺诈检测的表格数据流实证研究","authors":"Kodjo Mawuena Amekoe, Mustapha Lebbah, Gregoire Jaffre, Hanene Azzag, Zaineb Chelly Dagdia","doi":"arxiv-2409.10111","DOIUrl":null,"url":null,"abstract":"Real-world tabular learning production scenarios typically involve evolving\ndata streams, where data arrives continuously and its distribution may change\nover time. In such a setting, most studies in the literature regarding\nsupervised learning favor the use of instance incremental algorithms due to\ntheir ability to adapt to changes in the data distribution. Another significant\nreason for choosing these algorithms is \\textit{avoid storing observations in\nmemory} as commonly done in batch incremental settings. However, the design of\ninstance incremental algorithms often assumes immediate availability of labels,\nwhich is an optimistic assumption. In many real-world scenarios, such as fraud\ndetection or credit scoring, labels may be delayed. Consequently, batch\nincremental algorithms are widely used in many real-world tasks. This raises an\nimportant question: \"In delayed settings, is instance incremental learning the\nbest option regarding predictive performance and computational efficiency?\"\nUnfortunately, this question has not been studied in depth, probably due to the\nscarcity of real datasets containing delayed information. In this study, we\nconduct a comprehensive empirical evaluation and analysis of this question\nusing a real-world fraud detection problem and commonly used generated\ndatasets. Our findings indicate that instance incremental learning is not the\nsuperior option, considering on one side state-of-the-art models such as\nAdaptive Random Forest (ARF) and other side batch learning models such as\nXGBoost. Additionally, when considering the interpretability of the learning\nsystems, batch incremental solutions tend to be favored. Code:\n\\url{https://github.com/anselmeamekoe/DelayedLabelStream}","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"4 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection\",\"authors\":\"Kodjo Mawuena Amekoe, Mustapha Lebbah, Gregoire Jaffre, Hanene Azzag, Zaineb Chelly Dagdia\",\"doi\":\"arxiv-2409.10111\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Real-world tabular learning production scenarios typically involve evolving\\ndata streams, where data arrives continuously and its distribution may change\\nover time. In such a setting, most studies in the literature regarding\\nsupervised learning favor the use of instance incremental algorithms due to\\ntheir ability to adapt to changes in the data distribution. Another significant\\nreason for choosing these algorithms is \\\\textit{avoid storing observations in\\nmemory} as commonly done in batch incremental settings. However, the design of\\ninstance incremental algorithms often assumes immediate availability of labels,\\nwhich is an optimistic assumption. In many real-world scenarios, such as fraud\\ndetection or credit scoring, labels may be delayed. Consequently, batch\\nincremental algorithms are widely used in many real-world tasks. 
This raises an\\nimportant question: \\\"In delayed settings, is instance incremental learning the\\nbest option regarding predictive performance and computational efficiency?\\\"\\nUnfortunately, this question has not been studied in depth, probably due to the\\nscarcity of real datasets containing delayed information. In this study, we\\nconduct a comprehensive empirical evaluation and analysis of this question\\nusing a real-world fraud detection problem and commonly used generated\\ndatasets. Our findings indicate that instance incremental learning is not the\\nsuperior option, considering on one side state-of-the-art models such as\\nAdaptive Random Forest (ARF) and other side batch learning models such as\\nXGBoost. Additionally, when considering the interpretability of the learning\\nsystems, batch incremental solutions tend to be favored. Code:\\n\\\\url{https://github.com/anselmeamekoe/DelayedLabelStream}\",\"PeriodicalId\":501347,\"journal\":{\"name\":\"arXiv - CS - Neural and Evolutionary Computing\",\"volume\":\"4 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Neural and Evolutionary Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.10111\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Neural and Evolutionary Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10111","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection
Real-world tabular learning production scenarios typically involve evolving
data streams, where data arrives continuously and its distribution may change
over time. In such a setting, most studies in the literature regarding
supervised learning favor the use of instance incremental algorithms due to
their ability to adapt to changes in the data distribution. Another significant
reason for choosing these algorithms is that they \textit{avoid storing
observations in memory}, as is commonly done in batch incremental settings.
However, the design of
instance incremental algorithms often assumes immediate availability of labels,
which is an optimistic assumption. In many real-world scenarios, such as fraud
detection or credit scoring, labels may be delayed. Consequently, batch
incremental algorithms are widely used in many real-world tasks. This raises an
important question: "In delayed settings, is instance incremental learning the
best option regarding predictive performance and computational efficiency?"
Unfortunately, this question has not been studied in depth, probably due to the
scarcity of real datasets containing delayed information. In this study, we
conduct a comprehensive empirical evaluation and analysis of this question
using a real-world fraud detection problem and commonly used synthetic
datasets. Our findings indicate that instance incremental learning is not the
superior option when comparing, on the one hand, state-of-the-art instance
incremental models such as Adaptive Random Forest (ARF) and, on the other
hand, batch learning models such as XGBoost. Additionally, when considering
the interpretability of the learning
systems, batch incremental solutions tend to be favored. Code:
\url{https://github.com/anselmeamekoe/DelayedLabelStream}
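To make the comparison concrete, the following is a minimal sketch (an assumption-laden illustration, not the authors' protocol or the code in the repository above) of prequential evaluation on a stream with a fixed label delay. scikit-learn's SGDClassifier (updated via partial_fit) stands in for an instance incremental learner such as ARF, a periodically refit HistGradientBoostingClassifier stands in for a batch incremental learner such as XGBoost, and DELAY and BATCH are illustrative parameters.

from collections import deque
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))                 # synthetic feature stream
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic binary labels

DELAY = 200   # the label of x_t only becomes available at time t + DELAY
BATCH = 500   # the batch learner is refit after every BATCH newly labeled rows

inc = SGDClassifier()                           # instance incremental stand-in
batch_model, buf_X, buf_y = None, [], []        # batch stand-in plus its in-memory buffer
pending = deque()                               # observations waiting for their delayed label
inc_ok = batch_ok = scored = 0

for x, label in zip(X, y):
    # 1) Prequential step: predict on the new row before its label is known.
    if hasattr(inc, "coef_") and batch_model is not None:
        inc_ok += int(inc.predict(x.reshape(1, -1))[0] == label)
        batch_ok += int(batch_model.predict(x.reshape(1, -1))[0] == label)
        scored += 1
    pending.append((x, label))

    # 2) Release the label emitted DELAY steps ago and let both learners use it.
    if len(pending) > DELAY:
        xd, yd = pending.popleft()
        inc.partial_fit(xd.reshape(1, -1), [yd], classes=[0, 1])  # per-instance update
        buf_X.append(xd)
        buf_y.append(yd)
        if len(buf_y) >= BATCH:                                   # periodic refit on stored rows
            batch_model = HistGradientBoostingClassifier().fit(np.array(buf_X), buf_y)
            buf_X, buf_y = [], []                                 # simplification: sliding batch

print("instance incremental accuracy:", inc_ok / max(scored, 1))
print("batch incremental accuracy   :", batch_ok / max(scored, 1))

The sketch highlights the trade-off discussed in the abstract: the instance incremental learner updates on each observation as soon as its delayed label arrives, whereas the batch incremental learner must store labeled observations in memory and only benefits from them at the next refit.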