Analyzing the effects of data splitting and covariate shift on machine learning based streamflow prediction in ungauged basins
Authors: Pin-Ching Li, Sayan Dey, Venkatesh Merwade
Journal: Journal of Hydrology, volume 653, Article 132731 (impact factor 5.9; JCR Q1, Engineering, Civil)
Publication date: 2025-01-26
DOI: 10.1016/j.jhydrol.2025.132731
URL: https://www.sciencedirect.com/science/article/pii/S0022169425000691
Citations: 0
Abstract
Machine learning (ML) models are alternatives to traditional hydrologic modeling for streamflow prediction in ungauged basins (PUB). However, the variability in watershed characteristics of ungauged basins adds uncertainty to PUB frameworks based on ML models. These uncertainties arise from covariate shift: an inconsistency between the statistical distributions of the dataset used to train and test an ML model and those of the real-world (global) dataset on which the trained model is applied. Covariate shift is a widespread issue for ML in real-world applications, but it has not been investigated in hydrological applications. This study evaluates the uncertainty in ML-based PUB methods, including Random Forest (RF) and Artificial Neural Network (ANN), under the influence of covariate shift. The Monte Carlo method is applied to aggregate RF and ANN simulations across various data splitting configurations into predictive distributions. The results indicate that ML performance is not robust under covariate shift. ML performance is influenced by watershed characteristics that display heterogeneity, such as drainage area, dam density, and urbanized area. Across the covariate shift scenarios, 20–48% of simulation results depart from the normal distribution. Furthermore, the efficiency and limitations of Random Forest models for PUB are highlighted by investigating their biased predictions in watersheds with varying dam density, drainage area, and meteorological variables, such as annual snowfall and annual precipitation.
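The idea of aggregating simulations over many data splitting configurations can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the data are synthetic, a one-variable least-squares line stands in for the RF and ANN models, the covariate shift score is a simple standardized mean difference, and the names `fit_line`, `standardized_mean_shift`, and `monte_carlo_splits` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a stand-in for RF/ANN)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def standardized_mean_shift(train_x, test_x):
    """Crude covariate-shift score: |mean difference| in training std units."""
    diff = statistics.fmean(test_x) - statistics.fmean(train_x)
    return abs(diff) / statistics.stdev(train_x)


def monte_carlo_splits(xs, ys, x_target, n_runs=200, train_frac=0.7, seed=0):
    """Refit the model on many random splits and collect the predictions
    for one target basin, plus the train/test shift score of each split."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    n_train = int(train_frac * len(idx))
    preds, shifts = [], []
    for _ in range(n_runs):
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds.append(a + b * x_target)
        shifts.append(
            standardized_mean_shift(
                [xs[i] for i in train], [xs[i] for i in test]
            )
        )
    return preds, shifts


# Synthetic "gauged basins": log drainage area vs. log mean streamflow.
data_rng = random.Random(42)
log_area = [data_rng.gauss(5.0, 1.5) for _ in range(120)]
log_flow = [0.8 * x - 1.0 + data_rng.gauss(0.0, 0.3) for x in log_area]

# Hypothetical ungauged basin far out in the tail of the training covariate.
preds, shifts = monte_carlo_splits(log_area, log_flow, x_target=9.0)
pred_mean = statistics.fmean(preds)
pred_std = statistics.stdev(preds)
print(f"predictive mean={pred_mean:.3f}, spread={pred_std:.3f}")
print(f"largest train/test shift score={max(shifts):.3f}")
```

The spread of `preds` is the split-induced predictive distribution for the target basin. In a fuller analysis the stand-in model would be replaced by the actual learners, and a normality test would be applied to the collected predictions.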
About the journal:
The Journal of Hydrology publishes original research papers and comprehensive reviews in all the subfields of the hydrological sciences, including water-based management and policy issues that impact economics and society. These comprise, but are not limited to, the physical, chemical, biogeochemical, stochastic and systems aspects of surface and groundwater hydrology, hydrometeorology and hydrogeology. Relevant topics incorporating the insights and methodologies of disciplines such as climatology, water resource systems, hydraulics, agrohydrology, geomorphology, soil science, instrumentation and remote sensing, and civil and environmental engineering are included. Social science perspectives on hydrological problems, such as resource and ecological economics, environmental sociology, psychology and behavioural science, and management and policy analysis, are also invited. Multi- and interdisciplinary analyses of hydrological problems are within scope. The science published in the Journal of Hydrology is relevant to catchment scales rather than exclusively to a local scale or site.