Estimating missing daily streamflow data in a tropical basin with pronounced seasonal variability: A comparative case study from the Guayas River Basin, Ecuador

Q2 Environmental Science

Environmental Challenges. Pub Date: 2025-08-08. DOI: 10.1016/j.envc.2025.101262

Daniela Stay-Arevalo, Mijail Arias-Hidalgo, Boris Apolo-Masache, Luis Dominguez-Granda, Gonzalo Villa-Cox

Volume 20, Article 101262. Journal Article. https://www.sciencedirect.com/science/article/pii/S2667010025001817

Citation count: 0

Abstract

Streamflow data are essential to many environmental assessments and management frameworks. Gaps in these records can markedly affect the precision and reliability of such assessments and practices, especially in developing countries. This study applies a predictive framework based on Seasonal Autoregressive Integrated Moving Average (SARIMA), k-Nearest Neighbors (kNN) and Random Forest (RF) models to fill gaps in a daily streamflow dataset from 22 hydrological stations within the Guayas River Basin (GRB), Ecuador. Predictive performance was assessed by comparing out-of-sample model estimates against the observed data, using performance metrics such as Bias, Normalized Root Mean Square Error (NRMSE), Normalized Mean Absolute Error (NMAE) and the Nash–Sutcliffe model Efficiency coefficient (NSE). The kNN and RF models outperformed the SARIMA model, with NSE values ranging from 0.715 to 0.983 when estimating randomly allocated contiguous gaps. Different gap lengths were also tested; with the RF model, estimates showed more than 70% similarity to the observed data for gaps of up to 15 days. Estimates for longer gaps fail to reproduce the natural peak-flow dynamics of the original streamflow time series and exhibit step-like patterns with poorer goodness-of-fit metrics, generally underestimating observed values. The study points to room for improvement in the data mining stages that precede modelling, so that data-scarce regions can be properly characterized.
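
The four metrics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract does not say how NRMSE and NMAE are normalized, so dividing by the observed mean is an assumption here, and the `obs` and `sim` values are made-up numbers rather than data from the study.

```python
from math import sqrt
from statistics import fmean


def bias(obs, sim):
    """Mean error (sim - obs); negative values indicate underestimation."""
    return fmean(s - o for o, s in zip(obs, sim))


def nrmse(obs, sim):
    """RMSE divided by the observed mean (normalization is an assumption)."""
    rmse = sqrt(fmean((s - o) ** 2 for o, s in zip(obs, sim)))
    return rmse / fmean(obs)


def nmae(obs, sim):
    """MAE divided by the observed mean (normalization is an assumption)."""
    mae = fmean(abs(s - o) for o, s in zip(obs, sim))
    return mae / fmean(obs)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of obs about its mean."""
    mean_obs = fmean(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / spread


# Hypothetical daily flows inside one artificial gap (illustrative only).
obs = [10.0, 20.0, 30.0, 40.0, 50.0]   # withheld observations
sim = [12.0, 18.0, 33.0, 37.0, 50.0]   # out-of-sample model estimates

for name, fn in [("Bias", bias), ("NRMSE", nrmse), ("NMAE", nmae), ("NSE", nse)]:
    print(f"{name}: {fn(obs, sim):.3f}")
```

An NSE of 1 means the estimates match the observations exactly, while an NSE of 0 means they are no better than using the observed mean. The reported range of 0.715 to 0.983 therefore sits toward the upper end of the scale.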
