Authors: Qianqian Han, Yijian Zeng, Lijie Zhang, Calimanut-Ionut Cira, Egor Prikaziuk, Ting Duan, Chao Wang, Brigitta Szabó, Salvatore Manfreda, Ruodan Zhuang, Bob Su
Journal: Geoscientific Model Development (impact factor 4.0; JCR Q1, Geosciences, Multidisciplinary)
DOI: https://doi.org/10.5194/gmd-16-5825-2023
Published: 2023-10-19
Citations: 1
Ensemble of optimised machine learning algorithms for predicting surface soil moisture content at a global scale
Abstract. Accurate information on surface soil moisture (SSM) content at a global scale under different climatic conditions is important for hydrological and climatological applications. Machine-learning-based systematic integration of in situ hydrological measurements, complex environmental and climate data, and satellite observations facilitates the generation of reliable data products to monitor and analyse the exchange of water, energy, and carbon in the Earth system at a proper space–time resolution. This study investigates the estimation of daily SSM using 8 optimised machine learning (ML) algorithms and 10 ensemble models (constructed via model bootstrap aggregating techniques and five-fold cross-validation). The algorithmic implementations were trained and tested using International Soil Moisture Network (ISMN) data collected from 1722 stations distributed across the world. The results showed that the K-neighbours Regressor (KNR) had the lowest root-mean-square error (RMSE; 0.0379 cm³ cm⁻³) on the “test_random” set (for testing the performance of randomly split data during training), the Random Forest Regressor (RFR) had the lowest RMSE (0.0599 cm³ cm⁻³) on the “test_temporal” set (for testing the performance on the period that was not used in training), and AdaBoost (AB) had the lowest RMSE (0.0786 cm³ cm⁻³) on the “test_independent-stations” set (for testing the performance on the stations that were not used in training). Independent evaluation on novel stations across different climate zones was conducted. For the optimised ML algorithms, the median RMSE values were below 0.1 cm³ cm⁻³. GradientBoosting (GB), Multi-layer Perceptron Regressor (MLPR), Stochastic Gradient Descent Regressor (SGDR), and RFR achieved a median r score of 0.6 in 12, 11, 9, and 9 climate zones, respectively, out of 15 climate zones. The performance of ensemble models improved significantly, with the median RMSE value below 0.075 cm³ cm⁻³ for all climate zones. All voting regressors achieved r scores of above 0.6 in 13 climate zones; BSh (hot semi-arid climate) and BWh (hot desert climate) were the exceptions because of the sparse distribution of training stations. The metric evaluation showed that ensemble models can improve the performance of single ML algorithms and achieve more stable results. Based on the results computed for three different test sets, the ensemble model with KNR, RFR and Extreme Gradient Boosting (XB) performed the best. Overall, our investigation shows that ensemble machine learning algorithms have a greater capability with respect to predicting SSM compared with the optimised or base ML algorithms; this indicates their huge potential applicability in estimating water cycle budgets, managing irrigation, and predicting crop yields.
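The evaluation protocol described in the abstract (three test sets, RMSE and r as metrics, and a voting ensemble that combines base regressors) can be illustrated with a minimal Python sketch. This is not the authors' code: the data are synthetic, the two base models (a hand-written k-nearest-neighbours regressor and a least-squares line) are simple stand-ins for the KNR, RFR and XB models named in the abstract, and the unweighted averaging is an assumption about how the voting is done. All function names, the held-out stations, the cut-off day, and the split fraction are hypothetical.

```python
import math
import random
from statistics import fmean

def rmse(y_true, y_pred):
    """Root-mean-square error, in the units of the target."""
    return math.sqrt(fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))

def pearson_r(y_true, y_pred):
    """Pearson correlation between observations and predictions."""
    mt, mp = fmean(y_true), fmean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def split_records(records, held_out_stations, cutoff_day, random_frac, seed=0):
    """Split (station, day, x, y) records into a training set and three test sets."""
    rng = random.Random(seed)
    sets = {"train": [], "test_random": [], "test_temporal": [],
            "test_independent_stations": []}
    for rec in records:
        station, day = rec[0], rec[1]
        if station in held_out_stations:      # stations never seen in training
            sets["test_independent_stations"].append(rec)
        elif day >= cutoff_day:               # period never seen in training
            sets["test_temporal"].append(rec)
        elif rng.random() < random_frac:      # random hold-out from the rest
            sets["test_random"].append(rec)
        else:
            sets["train"].append(rec)
    return sets

def knn_predict(train, x, k=5):
    """Mean target of the k training records whose feature is nearest to x."""
    nearest = sorted(train, key=lambda rec: abs(rec[2] - x))[:k]
    return fmean(rec[3] for rec in nearest)

def fit_line(train):
    """Fit y = a + b * x by ordinary least squares and return a predictor."""
    xs = [rec[2] for rec in train]
    ys = [rec[3] for rec in train]
    mx, my = fmean(xs), fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def voting_predict(predictors, x):
    """Unweighted voting: average the predictions of all base models."""
    return fmean(p(x) for p in predictors)

# Synthetic data (assumption): 10 stations, 60 days, one predictor x in [0, 1],
# and a target that depends linearly on x plus noise.
rng = random.Random(42)
records = []
for station in range(10):
    for day in range(60):
        x = rng.random()
        y = 0.10 + 0.30 * x + rng.gauss(0.0, 0.02)
        records.append((station, day, x, y))

sets = split_records(records, held_out_stations={8, 9}, cutoff_day=45, random_frac=0.2)
train = sets["train"]
predictors = [lambda x: knn_predict(train, x), fit_line(train)]

scores = {}
for name in ("test_random", "test_temporal", "test_independent_stations"):
    y_true = [rec[3] for rec in sets[name]]
    y_pred = [voting_predict(predictors, rec[2]) for rec in sets[name]]
    scores[name] = (rmse(y_true, y_pred), pearson_r(y_true, y_pred))
    print(f"{name}: RMSE={scores[name][0]:.4f}, r={scores[name][1]:.3f}")
```

Because the synthetic target behaves identically at every station and on every day, the three test sets score about the same here; with real in situ data the temporal and independent-station sets are the harder tests, as the RMSE values reported in the abstract suggest.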
About the journal:
Geoscientific Model Development (GMD) is an international scientific journal dedicated to the publication and public discussion of the description, development, and evaluation of numerical models of the Earth system and its components. The following manuscript types can be considered for peer-reviewed publication:
* geoscientific model descriptions, from statistical models to box models to GCMs;
* development and technical papers, describing developments such as new parameterizations or technical aspects of running models such as the reproducibility of results;
* new methods for assessment of models, including work on developing new metrics for assessing model performance and novel ways of comparing model results with observational data;
* papers describing new standard experiments for assessing model performance or novel ways of comparing model results with observational data;
* model experiment descriptions, including experimental details and project protocols;
* full evaluations of previously published models.