Machine learning approaches for imputing missing meteorological data in Senegal

Impact factor: 3.2; JCR Q2, Computer Science, Interdisciplinary Applications

Applied Computing and Geosciences. Pub Date: 2025-08-15. DOI: 10.1016/j.acags.2025.100281

Mory Toure, Nana Ama Browne Klutse, Mamadou Adama Sarr, Md Abul Ehsan Bhuiyan, Annine Duclaire Kenne, Wassila Mamadou Thiaw, Daouda Badiane, Amadou Thierno Gaye, Ousmane Ndiaye, Cheikh Mbow

Applied Computing and Geosciences, Volume 27, Article 100281. https://www.sciencedirect.com/science/article/pii/S2590197425000631

Citations: 0

Abstract

This study presents the first comprehensive evaluation in West Africa of four imputation methods, Decision Tree (DT), Random Forest (RF), Extreme Gradient Boosting (XGB), and Ordinary Kriging (OK), applied to six core meteorological variables across Senegal over a ten-year period (2015–2024). By simulating realistic missing data scenarios informed by field conditions (e.g., power outages, observer absences, sensor failures), it establishes a robust benchmark for climate data reconstruction using machine learning in resource-constrained settings.
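As a rough illustration of how outage-like gaps can be simulated for benchmarking, the sketch below masks contiguous runs of a series until a target missing fraction is reached. The function name `simulate_gaps`, the uniform gap-length distribution, and the 14-step maximum gap are assumptions made for this example; the abstract does not describe the study's actual masking procedure.

```python
import random


def simulate_gaps(n, missing_fraction, max_gap_len, seed=0):
    """Return a mask of length n (True = value withheld).

    Gaps are contiguous runs, loosely mimicking outages or observer
    absences. Gap lengths and positions are drawn uniformly at random,
    which is an assumption of this sketch, not the paper's protocol.
    """
    rng = random.Random(seed)
    target = round(n * missing_fraction)
    mask = [False] * n
    masked = 0
    while masked < target:
        # Never draw a gap longer than what is still needed.
        gap_len = min(rng.randint(1, max_gap_len), target - masked)
        start = rng.randrange(0, n - gap_len + 1)
        for i in range(start, start + gap_len):
            if not mask[i]:
                mask[i] = True
                masked += 1
    return mask


# Example: about ten years of daily data with 20 % of values withheld.
mask = simulate_gaps(n=3650, missing_fraction=0.20, max_gap_len=14, seed=1)
print(sum(mask) / len(mask))
```

The withheld values serve as ground truth: a model imputes them from the remaining data, and the imputations are then scored against the originals.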

The findings highlight the clear superiority of ensemble learning approaches. XGB consistently outperformed all other methods across variables and scenarios, achieving the highest average predictive accuracy with R² values up to [95 % CI: 0.82–0.88], along with lower Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). RF yielded comparable performance, especially for maximum and minimum temperature (TMAX, TMIN), maintaining strong stability even at 20 % missingness. In contrast, DT performance declined sharply with increased data loss, and OK was constrained by the sparse spatial distribution of meteorological stations, notably impairing its ability to impute precipitation (PRCP) and wind speed (WDSP).
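For reference, the three scores named above can be computed from withheld observations and their imputed counterparts as in the minimal sketch below. It uses the standard textbook definitions; the helper names and sample values are illustrative and not taken from the paper.

```python
import math


def rmse(obs, pred):
    """Root mean square error between observed and imputed values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error between observed and imputed values."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (e.g. withheld TMAX readings in degrees Celsius).
observed = [31.0, 33.5, 35.2, 34.1, 32.8]
imputed = [30.6, 33.9, 34.8, 34.5, 32.1]
print(rmse(observed, imputed), mae(observed, imputed), r2(observed, imputed))
```

Note that R² defined this way can be negative when imputations are worse than simply predicting the observed mean, which is one reason it is usually reported alongside RMSE and MAE.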

This work contributes a multivariable imputation framework specifically adapted to West African climatic and infrastructural realities. It also integrates block bootstrap methods to quantify uncertainty and derive 95 % confidence intervals for all error metrics. Results confirm that imputation effectiveness is highly variable-dependent: continuous and temporally autocorrelated variables (TMAX, TMIN, and dew point temperature, DEWP) are well reconstructed, whereas discontinuous or noisy variables (WDSP and PRCP) remain challenging.
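A block bootstrap resamples contiguous runs of the error series rather than individual points, so that temporal autocorrelation is preserved within each block. The sketch below shows one common variant, a moving block bootstrap with a percentile interval, applied to RMSE. The block length, replicate count, and percentile method are assumptions of this example; the abstract does not specify which variant or settings the authors used.

```python
import math
import random


def block_bootstrap_rmse_ci(errors, block_len, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for RMSE via a moving block bootstrap.

    errors: imputation errors (imputed minus observed) in time order.
    block_len: length of each resampled block (a tuning choice, assumed here).
    """
    rng = random.Random(seed)
    n = len(errors)
    n_blocks = math.ceil(n / block_len)
    stats = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            # Draw overlapping blocks with replacement from all valid starts.
            start = rng.randrange(0, n - block_len + 1)
            sample.extend(errors[start:start + block_len])
        sample = sample[:n]  # trim so each replicate has the original length
        stats.append(math.sqrt(sum(e * e for e in sample) / n))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Illustrative autocorrelated errors from an AR(1)-style process.
gen = random.Random(42)
errs, prev = [], 0.0
for _ in range(365):
    prev = 0.7 * prev + gen.gauss(0.0, 1.0)
    errs.append(prev)

print(block_bootstrap_rmse_ci(errs, block_len=10))
```

Resampling single points would treat the errors as independent and tend to produce intervals that are too narrow for autocorrelated series such as daily temperature.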

Although ensemble models offer clear advantages, their computational demands and need for hyperparameter tuning may limit real-time implementation in low-resource national meteorological services. Furthermore, the exclusion of satellite or reanalysis inputs may constrain model generalizability.

Ultimately, this study reinforces the role of advanced machine learning methods in improving climate data completeness and reliability in Africa. Although not a substitute for direct observations, imputation emerges as a critical complementary tool to support robust and resilient climate information systems essential for agriculture, public health, and disaster risk management under intensifying climate variability.
