IF 5.6 | Zone 1, Agricultural and Forestry Sciences | JCR Q1, Soil Science
Kingsley John, Daniel D. Saurette, Brandon Heung
Geoderma, Volume 455, Article 117223. Published 2025-03-01. DOI: 10.1016/j.geoderma.2025.117223
https://www.sciencedirect.com/science/article/pii/S0016706125000618
Citations: 0

Abstract

The problematic case of data leakage: A case for leave-profile-out cross-validation in 3-dimensional digital soil mapping
Data leakage occurs when there is an overlap between the data used for model fitting and hyperparameter tuning, and those used for testing. This overlap biases the model performance, making it uninformative regarding the model’s ability to generalize. This is a significant issue in machine learning and predictive soil mapping, compromising model reliability. To demonstrate this issue, the 3-dimensional (3D) digital soil mapping (DSM) approach, whereby depth is used as a predictor of soil properties, was investigated. We compare two common approaches from the literature: leave-sample-out cross-validation (LSOCV) versus leave-profile-out cross-validation (LPOCV). Here, we argue that LSOCV results in contamination of the test dataset due to the potential vertical autocorrelation of soil properties from different samples within the same profile, and a more appropriate approach for testing 3D DSM models should be to fully partition all soil samples from the same profile to either the training or test dataset (i.e., LPOCV). Using the Ottawa region of Ontario, Canada, as a case study, cation exchange capacity (CEC), clay content, pH, and total organic carbon (TOC) were predicted using machine learning, and the discrepancy in accuracy metrics was reported. Furthermore, we evaluated the effects of data augmentation (i.e., the creation of additional synthetic data points from the original data) on accuracy metrics, a common practice in 3D DSM. Here, it was shown that with the augmented dataset, LSOCV generated overly optimistic accuracy metrics (e.g., CCC) that were 29–62% higher than LPOCV, while for the non-augmented data, the accuracy metrics were 8–18% higher, suggesting that vertical autocorrelation had a strong influence on inflating model accuracy through data leakage. As such, we strongly urge DSM practitioners to provide greater clarity when describing how model accuracy metrics were ascertained and to consider the use of LPOCV when applied to 3D DSM. This brings about broader concerns that policymakers and stakeholders may use map products with the false impression that the maps are more accurate than they are. Future research should focus on refining DSM methods and considering data structure to prevent data leakage in modelling soil properties.
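The difference between the two partitioning schemes can be illustrated with a small sketch. This is not the authors' code: the toy dataset (20 profiles sampled at 5 depths), the fold count, and the function names `sample_folds`, `profile_folds`, and `leaked_profiles` are all hypothetical, chosen only to show how splitting by sample lets the same profile appear on both sides of a train/test split, while splitting by profile does not.

```python
import random
from collections import defaultdict


def sample_folds(samples, k, seed=0):
    """LSOCV-style split: shuffle individual samples into k folds."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def profile_folds(samples, k, seed=0):
    """LPOCV-style split: assign whole profiles to folds."""
    by_profile = defaultdict(list)
    for i, (profile_id, _depth) in enumerate(samples):
        by_profile[profile_id].append(i)
    profiles = sorted(by_profile)
    random.Random(seed).shuffle(profiles)
    folds = [[] for _ in range(k)]
    for j, p in enumerate(profiles):
        # Every sample of profile p goes to the same fold.
        folds[j % k].extend(by_profile[p])
    return folds


def leaked_profiles(samples, folds):
    """Per fold, count test profiles that also occur in the training data."""
    counts = []
    for test_idx in folds:
        test_set = set(test_idx)
        test_profiles = {samples[i][0] for i in test_set}
        train_profiles = {
            samples[i][0] for i in range(len(samples)) if i not in test_set
        }
        counts.append(len(test_profiles & train_profiles))
    return counts


# Hypothetical data: (profile_id, depth_cm) pairs, 20 profiles x 5 depths.
samples = [(p, d) for p in range(20) for d in (5, 15, 30, 60, 100)]

lsocv = sample_folds(samples, k=5)
lpocv = profile_folds(samples, k=5)
print("LSOCV leaked profiles per fold:", leaked_profiles(samples, lsocv))
print("LPOCV leaked profiles per fold:", leaked_profiles(samples, lpocv))
```

Under these assumptions the profile-wise split reports zero shared profiles in every fold, whereas the sample-wise split almost always places samples from one profile in both training and test sets. In practice, a grouped splitter such as scikit-learn's `GroupKFold`, with the profile identifier passed as the group, would likely serve the same purpose.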
Source journal
Geoderma
Category: Agricultural and Forestry Sciences, Soil Science
CiteScore: 11.80
Self-citation rate: 6.60%
Articles published: 597
Review time: 58 days
About the journal: Geoderma - the global journal of soil science - welcomes authors, readers and soil research from all parts of the world, encourages worldwide soil studies, and embraces all aspects of soil science and its associated pedagogy. The journal particularly welcomes interdisciplinary work focusing on dynamic soil processes and functions across space and time.