{"title":"Don’t push the button! Exploring data leakage risks in machine learning and transfer learning","authors":"Andrea Apicella, Francesco Isgrò, Roberto Prevete","doi":"10.1007/s10462-025-11326-3","DOIUrl":null,"url":null,"abstract":"<div><p>Machine Learning (ML) has revolutionized various domains, offering predictive capabilities in several areas. However, there is growing evidence in the literature that ML approaches are not always used appropriately, leading to incorrect and sometimes overly optimistic results. One reason for this inappropriate use of ML may be the increasing availability of machine learning tools, leading to what we call the “push the button” approach. While this approach provides convenience, it raises concerns about the reliability of outcomes, leading to challenges such as incorrect performance evaluation. In particular, this paper addresses a critical issue in ML, known as data leakage, where unintended information contaminates the training data, impacting model performance evaluation. Indeed, crucial steps in ML pipeline can be inadvertently overlooked, leading to optimistic performance estimates that may not hold in real-world scenarios. The discrepancy between evaluated and actual performance on new data is a significant concern. In particular, this paper categorizes data leakage in ML, discussing how certain conditions can propagate through the ML approach workflow. Furthermore, it explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning framework, and compares standard inductive ML with transductive ML paradigms. The conclusion summarizes key findings, emphasizing the importance of addressing data leakage for robust and reliable ML applications considering tasks and generalization goals.</p></div>","PeriodicalId":8449,"journal":{"name":"Artificial Intelligence Review","volume":"58 11","pages":""},"PeriodicalIF":13.9000,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10462-025-11326-3.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence Review","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10462-025-11326-3","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Machine Learning (ML) has revolutionized various domains, offering powerful predictive capabilities. However, there is growing evidence in the literature that ML approaches are not always used appropriately, leading to incorrect and sometimes overly optimistic results. One reason for this inappropriate use of ML may be the increasing availability of machine learning tools, leading to what we call the “push the button” approach. While this approach provides convenience, it raises concerns about the reliability of outcomes, leading to challenges such as incorrect performance evaluation. In particular, this paper addresses a critical issue in ML, known as data leakage, where unintended information contaminates the training data, distorting model performance evaluation. Indeed, crucial steps in the ML pipeline can be inadvertently overlooked, leading to optimistic performance estimates that may not hold in real-world scenarios; the resulting discrepancy between evaluated and actual performance on new data is a significant concern. The paper categorizes data leakage in ML, discussing how certain conditions can propagate through the ML workflow. Furthermore, it explores the connection between data leakage and the specific task being addressed, investigates its occurrence in the Transfer Learning framework, and compares the standard inductive ML paradigm with transductive ML. The conclusion summarizes the key findings, emphasizing the importance of addressing data leakage for robust and reliable ML applications, with attention to the task at hand and its generalization goals.
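To make the "push the button" failure mode concrete, below is a minimal sketch (not taken from the paper) of one common form of data leakage: fitting a preprocessing step on the full dataset before splitting, so statistics from the test set contaminate training. The scikit-learn calls, synthetic data, and numbers are illustrative assumptions, not the authors' experimental setup.

```python
# Illustrative sketch of preprocessing leakage (assumed example, not from the paper).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky variant: the scaler is fit on ALL the data, so mean/variance
# estimates from the (future) test set leak into the training features.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky = SVC().fit(X_tr, y_tr)
print("leaky estimate: ", leaky.score(X_te, y_te))

# Correct variant: split first; the pipeline fits the scaler on the
# training portion only, so the test set stays unseen until evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
print("honest estimate:", clean.score(X_te, y_te))
```

The leaky estimate can be optimistically biased exactly as the abstract describes: the model appears to generalize better than it will on genuinely new data, because information about the evaluation set has propagated through an early pipeline step.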
About the journal:
Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.