Addressing Data Imbalance in Hydrological Machine Learning: Impact of Advanced Sampling Methods on Performance and Interpretability

IF 5 | Zone 1, Earth Sciences | Q2 ENVIRONMENTAL SCIENCES

Water Resources Research Pub Date : 2025-10-09 DOI:10.1029/2024wr039848

Xiaoran Yin, Longcang Shu, Zhe Wang, Long Zhou, Shuyao Niu, Huazhun Ren, Bo Liu, Chengpeng Lu


Citations: 0

Abstract

Data imbalance poses a severe challenge in hydrological machine learning (ML) applications by limiting model performance and interpretability, yet solutions remain limited. This study evaluates the impact of advanced sampling methods, particularly feature space coverage sampling (FSCS), on model performance in predicting forest cover types and saturated hydraulic conductivity (Ks); the mechanism underlying the efficacy of FSCS; and its impact on model interpretability. Using ML algorithms such as random forest (RF) and LightGBM (LGB) across various training set sizes, we demonstrated that FSCS significantly mitigates data imbalance, enhancing model accuracy, feature importance estimation, and interpretability. Two widely used hydrological data sets were analyzed: a large multiclass forest cover type data set from Roosevelt National Forest (110,393 samples) and a continuous-value data set of soil properties from the USKSAT database (18,729 samples). In total, 1,720 models were constructed and optimized, combining different sampling methods, training set sizes, and algorithms. Balanced sampling, conditioned Latin hypercube sampling, and FSCS consistently outperformed simple random sampling. Despite using smaller training sets and simpler RF models, FSCS-trained models matched or surpassed the performance of those using larger data sets or more complex LGB models. SHAP analysis revealed that FSCS enhanced feature–target relationship clarity, emphasizing feature interactions and improving model interpretability. These findings highlight the potential of advanced sampling methods for not only addressing data imbalance but also providing more accurate prior information for model training, thereby enhancing reliability, accuracy, and interpretability in ML for hydrological applications.
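The abstract does not say how FSCS was implemented. A common formulation runs k-means on standardized features with as many clusters as desired samples, then keeps the observation nearest each centroid. The sketch below follows that formulation as an assumption, not as the authors' method. The names `standardize`, `sq_dist`, and `fscs_sample` and the two-blob toy data are hypothetical, and the k-means loop is a plain dependency-free version where a library implementation would normally be used.

```python
import math
import random


def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fscs_sample(rows, n_samples, n_iter=20, seed=0):
    """Return indices of n_samples distinct rows spread across feature space.

    Assumed formulation: k-means with k = n_samples on standardized
    features, then the row closest to each final centroid is selected.
    """
    rng = random.Random(seed)
    z = standardize(rows)
    # Initialize centroids from randomly chosen rows.
    centroids = [z[i] for i in rng.sample(range(len(z)), n_samples)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(n_samples)]
        for p in z:
            nearest = min(range(n_samples), key=lambda k: sq_dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            [sum(c) / len(members) for c in zip(*members)] if members else old
            for members, old in zip(clusters, centroids)
        ]
    chosen = []
    for c in centroids:
        # Skip rows already chosen so the returned indices are distinct.
        best = min((i for i in range(len(z)) if i not in chosen),
                   key=lambda i: sq_dist(z[i], c))
        chosen.append(best)
    return chosen


# Hypothetical imbalanced toy data: a dense blob and a small, distant sparse blob.
rng = random.Random(42)
dense = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(285)]
sparse = [[rng.gauss(8, 1), rng.gauss(8, 1)] for _ in range(15)]
rows = dense + sparse

idx = fscs_sample(rows, n_samples=10)
n_sparse = sum(i >= 285 for i in idx)
print(f"{n_sparse} of {len(idx)} selected points come from the sparse region")
```

Because the selection is driven by distances in feature space rather than by how often a region occurs, one would expect sparsely populated regions to be represented more reliably than under simple random sampling, where a sample of 10 from this toy set would contain on average 0.5 sparse points. The returned indices could then define the training subset passed to an RF or LGB model.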

Source journal

Water Resources Research (Environmental Sciences - Limnology)

CiteScore

8.80

Self-citation rate

13.00%

Articles published

599

Review time

3.5 months

Journal description: Water Resources Research (WRR) is an interdisciplinary journal that focuses on hydrology and water resources. It publishes original research in the natural and social sciences of water. It emphasizes the role of water in the Earth system, including physical, chemical, biological, and ecological processes in water resources research and management, as well as social, policy, and public health implications. It encompasses observational, experimental, theoretical, analytical, numerical, and data-driven approaches that advance the science of water and its management. Submissions are evaluated for the novelty, accuracy, significance, and broader implications of their findings.