{"title":"Don’t push the button! Exploring data leakage risks in machine learning and transfer learning","authors":"Andrea Apicella, Francesco Isgrò, Roberto Prevete","doi":"10.1007/s10462-025-11326-3","DOIUrl":null,"url":null,"abstract":"<div><p>Machine Learning (ML) has revolutionized various domains, offering predictive capabilities in several areas. However, there is growing evidence in the literature that ML approaches are not always used appropriately, leading to incorrect and sometimes overly optimistic results. One reason for this inappropriate use of ML may be the increasing availability of machine learning tools, leading to what we call the “push the button” approach. While this approach provides convenience, it raises concerns about the reliability of outcomes, leading to challenges such as incorrect performance evaluation. In particular, this paper addresses a critical issue in ML, known as data leakage, where unintended information contaminates the training data, impacting model performance evaluation. Indeed, crucial steps in ML pipeline can be inadvertently overlooked, leading to optimistic performance estimates that may not hold in real-world scenarios. The discrepancy between evaluated and actual performance on new data is a significant concern. In particular, this paper categorizes data leakage in ML, discussing how certain conditions can propagate through the ML approach workflow. Furthermore, it explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning framework, and compares standard inductive ML with transductive ML paradigms. The conclusion summarizes key findings, emphasizing the importance of addressing data leakage for robust and reliable ML applications considering tasks and generalization goals.</p></div>","PeriodicalId":8449,"journal":{"name":"Artificial Intelligence Review","volume":"58 11","pages":""},"PeriodicalIF":13.9000,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10462-025-11326-3.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence Review","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10462-025-11326-3","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Machine Learning (ML) has revolutionized various domains, offering powerful predictive capabilities. However, there is growing evidence in the literature that ML approaches are not always used appropriately, leading to incorrect and sometimes overly optimistic results. One reason for this inappropriate use of ML may be the increasing availability of machine learning tools, leading to what we call the “push the button” approach. While this approach provides convenience, it raises concerns about the reliability of outcomes, leading to challenges such as incorrect performance evaluation. In particular, this paper addresses a critical issue in ML, known as data leakage, where unintended information contaminates the training data, distorting model performance evaluation. Indeed, crucial steps in the ML pipeline can be inadvertently overlooked, leading to optimistic performance estimates that may not hold in real-world scenarios; the resulting discrepancy between evaluated and actual performance on new data is a significant concern. The paper categorizes data leakage in ML, discussing how certain conditions can propagate through the ML workflow. Furthermore, it explores the connection between data leakage and the specific task being addressed, investigates its occurrence in the Transfer Learning framework, and compares the standard inductive ML paradigm with transductive ML. The conclusion summarizes the key findings, emphasizing the importance of addressing data leakage for robust and reliable ML applications, with attention to the task at hand and its generalization goals.
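To make the "push the button" failure mode concrete, below is a minimal sketch (not taken from the paper) of one common form of data leakage: fitting a preprocessing step on the full dataset before splitting, so statistics from the test set contaminate training. The scikit-learn calls, synthetic data, and numbers are illustrative assumptions, not the authors' experimental setup.

```python
# Illustrative sketch of preprocessing leakage (assumed example, not from the paper).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky variant: the scaler is fit on ALL the data, so mean/variance
# estimates from the (future) test set leak into the training features.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky = SVC().fit(X_tr, y_tr)
print("leaky estimate: ", leaky.score(X_te, y_te))

# Correct variant: split first; the pipeline fits the scaler on the
# training portion only, so the test set stays unseen until evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
print("honest estimate:", clean.score(X_te, y_te))
```

The leaky estimate can be optimistically biased exactly as the abstract describes: the model appears to generalize better than it will on genuinely new data, because information about the evaluation set has propagated through an early pipeline step.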
About the journal:
Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.