Estimating missing daily streamflow data in a tropical basin with pronounced seasonal variability: A comparative case study from the Guayas River Basin, Ecuador
Daniela Stay-Arevalo , Mijail Arias-Hidalgo , Boris Apolo-Masache , Luis Dominguez-Granda , Gonzalo Villa-Cox
Environmental Challenges, Volume 20, Article 101262. Published 2025-08-08. DOI: 10.1016/j.envc.2025.101262
https://www.sciencedirect.com/science/article/pii/S2667010025001817
Abstract
Streamflow data are essential to many environmental assessments and management frameworks. Gaps in the record can markedly reduce the precision and reliability of these assessments and practices, especially in developing countries. This study applies a predictive framework based on Seasonal Autoregressive Integrated Moving Average (SARIMA), k-Nearest Neighbors (kNN) and Random Forest (RF) models to fill missing values in a daily streamflow dataset from 22 hydrological stations within the Guayas River Basin (GRB), Ecuador. Predictive performance was assessed by comparing observed data with out-of-sample model estimates, using metrics that include Bias, Normalized Root Mean Square Error (NRMSE), Normalized Mean Absolute Error (NMAE) and the Nash–Sutcliffe model Efficiency coefficient (NSE). The kNN and RF models outperformed the SARIMA model, with NSE values ranging from 0.715 to 0.983 when estimating randomly allocated contiguous gaps. Different gap lengths were also tested; the RF model achieved more than 70% similarity to the observed series for gaps of up to 15 days. Estimates for longer gaps fail to reproduce the natural peak-flow dynamics of the original streamflow time series and exhibit step-like patterns with poorer fit metrics, generally underestimating observed values. These findings point to room for improvement in the data mining stages that precede modelling, so that data-scarce regions can be properly characterized.
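The evaluation described above (hide a contiguous run of observed values, estimate it, then score the estimate) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the exact formulas, so the sketch assumes that Bias is the mean error and that NRMSE and NMAE are normalized by the mean of the observed values; NSE follows its standard definition. The function names `gap_metrics` and `mask_contiguous_gap`, the sample flows, and the mean-value fill are hypothetical and serve only as placeholders for a real kNN, RF or SARIMA estimate.

```python
import math
import random


def gap_metrics(observed, estimated):
    """Return Bias, NRMSE, NMAE and NSE for paired observed/estimated flows.

    Assumptions (not stated in the abstract): Bias is the mean error and the
    normalized metrics are divided by the mean observed flow. A constant or
    zero-mean observed series raises ZeroDivisionError.
    """
    if len(observed) != len(estimated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    n = len(observed)
    mean_obs = sum(observed) / n
    errors = [est - obs for obs, est in zip(observed, estimated)]
    bias = sum(errors) / n  # negative values indicate underestimation
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # NSE = 1 - (sum of squared errors) / (sum of squared deviations from the observed mean)
    spread = sum((obs - mean_obs) ** 2 for obs in observed)
    nse = 1.0 - sum(e * e for e in errors) / spread
    return {"bias": bias, "nrmse": rmse / mean_obs, "nmae": mae / mean_obs, "nse": nse}


def mask_contiguous_gap(series, gap_length, rng):
    """Return a copy of `series` with one randomly placed run of None values, plus its start index."""
    start = rng.randrange(0, len(series) - gap_length + 1)
    masked = list(series)
    for i in range(start, start + gap_length):
        masked[i] = None
    return masked, start


# Usage with made-up daily flows (illustrative values, not data from the study).
flows = [12.0, 15.0, 30.0, 22.0, 18.0, 14.0, 11.0, 10.0]
masked, start = mask_contiguous_gap(flows, 3, random.Random(0))
held_out = flows[start:start + 3]
known = [v for v in masked if v is not None]
fill = sum(known) / len(known)  # placeholder estimate: mean of the remaining values
print(gap_metrics(held_out, [fill] * 3))
```

A constant fill like this one produces exactly the kind of step-like pattern the abstract describes for long gaps, which is why it tends to score poorly on NSE when the hidden segment contains a flow peak.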