Data fusion-based improvements in empirical regression and machine learning for global daily ∼ 8 km resolution sea surface nitrate estimation and interpretation
Aifen Zhong , Difeng Wang , Fang Gong , Jingjing Huang , Zhuoqi Zheng , Xianqiang He , Qing Zhang , Qiankun Zhu
International Journal of Applied Earth Observation and Geoinformation, vol. 143, Article 104800. Published 2025-08-19. DOI: 10.1016/j.jag.2025.104800. Impact factor: 8.6 (JCR Q1, Remote Sensing). URL: https://www.sciencedirect.com/science/article/pii/S1569843225004479
Citations: 0
Abstract
Assessing sea surface nitrate (SSN) concentrations and dynamics is crucial for understanding marine ecosystem health, yet optical remote sensing of SSN remains challenging because of the lack of distinct spectral features. While various global-scale SSN regression and machine learning algorithms based on SSN-environment variable relationships have been developed, their prediction accuracy and spatiotemporal resolution remain limited. In addition, few previous studies have reported on the interannual variability of global SSN. Here we aim to enhance the accuracy and spatial resolution of SSN retrievals by developing improved regression and machine learning models, enabling the generation of global daily ∼ 8 km SSN products from satellite and model data. To construct the empirical regression models, the global ocean was divided into five regions on the basis of the relationship between sea surface temperature (SST) and SSN: 80° S to 40° N, the North Pacific, the North Atlantic, the Arabian Sea, and the eastern equatorial Pacific. After SSN-related physical variables were added, high-accuracy regional empirical models were developed, with root mean square deviations (RMSDs) of 1.641, 2.701, 1.221, 1.298, and 2.379 μmol/kg for the studied regions. For the machine learning models, seven algorithms were tested: extremely randomized trees (ET), multilayer perceptron (MLP), stacking random forest (SRF), Gaussian process regression (GPR), support vector machine (SVM), gradient boosting decision tree (GBDT), and extreme gradient boosting (XGBoost). After modeling, validation, and extensive tests using an independent cruise dataset, the XGBoost model outperformed the others (RMSD = 1.189 μmol/kg) and bypassed the need for regional segmentation. Mechanistic analysis revealed the driving variables influencing SSN in both the regional empirical and XGBoost models, improving interpretability. Comparative validation confirmed that our models surpass traditional approaches in accuracy and applicability, demonstrating their potential to advance global SSN monitoring. Using XGBoost-derived products, we find a slight decreasing trend in SSN over 23 years. The proposed robust and explainable SSN retrieval models have the potential to assist in ocean environmental management.
About the journal:
The International Journal of Applied Earth Observation and Geoinformation publishes original papers that utilize earth observation data for natural resource and environmental inventory and management. These data primarily originate from remote sensing platforms, including satellites and aircraft, supplemented by surface and subsurface measurements. Addressing natural resources such as forests, agricultural land, soils, and water, as well as environmental concerns like biodiversity, land degradation, and hazards, the journal explores conceptual and data-driven approaches. It covers geoinformation themes like capturing, databasing, visualization, interpretation, data quality, and spatial uncertainty.