Soil organic carbon (SOC) prediction using super learner algorithm based on the remote sensing variables
Yeonpyeong Jo, Palash Panja, Hanseup Kim, Milind Deo
Environmental Challenges, Volume 19, Article 101160. Published 2025-04-21.
DOI: 10.1016/j.envc.2025.101160
https://www.sciencedirect.com/science/article/pii/S2667010025000794
Citation count: 0
Abstract
Absorbing carbon into soil and monitoring it accurately are crucial for crop production and for mitigating global warming through increased carbon sequestration. Soil organic carbon (SOC) prediction with machine learning has been actively researched because these techniques handle non-linear relationships and predict accurately with limited prior assumptions about underlying mechanisms. However, the choice of an appropriate machine learning method remains a subject of debate: each study area has unique data patterns, so prediction performance varies across algorithm types. To address this challenge, a super learner algorithm was employed to predict SOC with data from four U.S. states: Arkansas, Idaho, Nebraska, and Utah. Remote sensing variables derived from Sentinel-2 and ALOS PALSAR were used as predictors, with feature selection applied. The linear regression-based super learner achieved higher accuracy (nRMSE: 7.6%, R²: 0.804) than the random forest-based model (nRMSE: 8.3%, R²: 0.768), likely because careful base learner selection and hyperparameter optimization let it better capture the specific data patterns. In contrast, the random forest-based model showed low variance in accuracy across different base learner combinations. Both models were then used to predict SOC at new locations in Salt Lake City, Utah, where the linear regression-based model gave the more accurate predictions (nRMSE: 52.9%, RMSE: 0.48% OC). By informing the selection of ML algorithms, this study supports more reliable SOC monitoring under varied environmental conditions, and accurate SOC quantification in turn supports strategies for addressing climate change and for agricultural production.
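The abstract reports accuracy as RMSE, nRMSE, and R² without defining them. The sketch below shows one common way to compute these metrics, using only the Python standard library. It is not taken from the paper: normalizing RMSE by the observed range is an assumption (other conventions divide by the mean or the standard deviation), the function names are hypothetical, and the SOC values are made up for illustration. The last line uses the two Salt Lake City figures from the abstract to back out the normalizer they imply.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, in the units of the target (here % OC)."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


def nrmse_percent(observed, predicted):
    """RMSE as a percentage of the observed range.

    ASSUMPTION: range normalization. The abstract does not state which
    normalizer the authors used.
    """
    return 100.0 * rmse(observed, predicted) / (max(observed) - min(observed))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical SOC values in % OC, invented purely for illustration.
observed = [0.5, 1.0, 1.5, 2.0, 2.5]
predicted = [0.6, 0.9, 1.6, 1.9, 2.6]

print(f"RMSE:  {rmse(observed, predicted):.3f} % OC")
print(f"nRMSE: {nrmse_percent(observed, predicted):.1f} %")
print(f"R^2:   {r_squared(observed, predicted):.3f}")

# Normalizer implied by the reported Salt Lake City results
# (RMSE 0.48 % OC, nRMSE 52.9 %): RMSE / (nRMSE / 100).
implied_normalizer = 0.48 / 0.529
print(f"Implied normalizer: {implied_normalizer:.2f} % OC")
```

The implied normalizer of roughly 0.9 % OC follows only from the two rounded figures in the abstract. Whether it corresponds to the range, the mean, or some other statistic of the Salt Lake City observations cannot be determined from the abstract alone.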