The problematic case of data leakage: A case for leave-profile-out cross-validation in 3-dimensional digital soil mapping
Kingsley John, Daniel D. Saurette, Brandon Heung
Geoderma, Volume 455, Article 117223. Published 2025-03-01. DOI: 10.1016/j.geoderma.2025.117223
https://www.sciencedirect.com/science/article/pii/S0016706125000618
Journal impact factor: 5.6; JCR Q1 (Soil Science)
Citations: 0
Abstract
Data leakage occurs when the data used for model fitting and hyperparameter tuning overlap with the data used for testing. This overlap biases estimates of model performance, making them uninformative about the model's ability to generalize. It is a significant issue in machine learning and predictive soil mapping because it compromises model reliability. To demonstrate the issue, we investigated the 3-dimensional (3D) digital soil mapping (DSM) approach, in which depth is used as a predictor of soil properties. We compared two common approaches from the literature: leave-sample-out cross-validation (LSOCV) and leave-profile-out cross-validation (LPOCV). We argue that LSOCV contaminates the test dataset because soil properties of different samples within the same profile may be vertically autocorrelated, and that a more appropriate way to test 3D DSM models is to assign all soil samples from the same profile entirely to either the training or the test dataset (i.e., LPOCV). Using the Ottawa region of Ontario, Canada, as a case study, we predicted cation exchange capacity (CEC), clay content, pH, and total organic carbon (TOC) using machine learning and reported the discrepancy in accuracy metrics between the two approaches. We also evaluated the effect on accuracy metrics of data augmentation (i.e., the creation of additional synthetic data points from the original data), a common practice in 3D DSM. With the augmented dataset, LSOCV generated overly optimistic accuracy metrics (e.g., the concordance correlation coefficient, CCC) that were 29–62% higher than those from LPOCV, whereas for the non-augmented data the metrics were 8–18% higher, suggesting that vertical autocorrelation strongly inflated model accuracy through data leakage. We therefore strongly urge DSM practitioners to describe more clearly how model accuracy metrics were obtained and to consider using LPOCV for 3D DSM. Inflated accuracy metrics raise the broader concern that policymakers and stakeholders may use map products under the false impression that the maps are more accurate than they are. Future research should focus on refining DSM methods and accounting for data structure to prevent data leakage when modelling soil properties.
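The difference between the two partitioning schemes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy layout of 5 profiles with 4 depth samples each, and the choice of 3 folds are all assumptions made for illustration. The sketch assigns folds per sample (LSOCV-style) and per profile (LPOCV-style), then counts how many profiles contribute samples to both the training and the test set in each fold. It also includes a plain implementation of Lin's concordance correlation coefficient, the CCC metric named in the abstract.

```python
import random
from collections import defaultdict


def sample_folds(n_samples, k, seed=0):
    """LSOCV-style split: shuffle individual samples into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def profile_folds(profile_ids, k, seed=0):
    """LPOCV-style split: shuffle whole profiles into k folds."""
    by_profile = defaultdict(list)
    for i, pid in enumerate(profile_ids):
        by_profile[pid].append(i)
    profiles = sorted(by_profile)
    random.Random(seed).shuffle(profiles)
    folds = [[] for _ in range(k)]
    for j, pid in enumerate(profiles):
        # All samples of a profile go to the same fold.
        folds[j % k].extend(by_profile[pid])
    return folds


def leaked_profiles(profile_ids, test_idx):
    """Profiles that have samples in both the training and the test set."""
    test_set = set(test_idx)
    test_p = {profile_ids[i] for i in test_set}
    train_p = {pid for i, pid in enumerate(profile_ids) if i not in test_set}
    return test_p & train_p


def ccc(obs, pred):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    vo = sum((o - mo) ** 2 for o in obs) / n
    vp = sum((p - mp) ** 2 for p in pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred)) / n
    return 2 * cov / (vo + vp + (mo - mp) ** 2)


# Hypothetical layout: 5 profiles, each sampled at 4 depths (20 samples).
profile_ids = [f"P{p}" for p in range(5) for _ in range(4)]

lsocv = sample_folds(len(profile_ids), k=3)
lpocv = profile_folds(profile_ids, k=3)

lsocv_leaks = [len(leaked_profiles(profile_ids, f)) for f in lsocv]
lpocv_leaks = [len(leaked_profiles(profile_ids, f)) for f in lpocv]
print("profiles in both train and test per fold, LSOCV:", lsocv_leaks)
print("profiles in both train and test per fold, LPOCV:", lpocv_leaks)
```

In this toy layout the per-sample folds hold 7, 7, and 6 samples, none a multiple of the 4 samples per profile, so every LSOCV test fold necessarily splits at least one profile between training and testing. The per-profile folds never do. In practice a grouped splitter from an established library would usually replace the hand-written `profile_folds`.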
About the journal:
Geoderma - the global journal of soil science - welcomes authors, readers and soil research from all parts of the world, encourages worldwide soil studies, and embraces all aspects of soil science and its associated pedagogy. The journal particularly welcomes interdisciplinary work focusing on dynamic soil processes and functions across space and time.