Soil textural class modeling using digital soil mapping approaches: Effect of resampling strategies on imbalanced dataset predictions

IF 3.3 | Zone 2, Agricultural and Forestry Sciences | Q2 SOIL SCIENCE

Geoderma Regional Pub Date : 2024-06-15 DOI:10.1016/j.geodrs.2024.e00821

Fereshteh Mirzaei, Alireza Amirian-Chakan, Ruhollah Taghizadeh-Mehrjardi, Hamid Reza Matinfar, Ruth Kerry

Citations: 0

Abstract

In digital soil mapping (DSM), machine learning (ML) algorithms are widely used to model soil textural classes (STCs). In practice, however, most soil class datasets are imbalanced. This poses a challenge because most ML algorithms implicitly assume balanced classes, which biases them towards the majority classes while the minority classes are often overlooked. Furthermore, within the DSM framework, STCs can be modeled with two strategies: a direct and an indirect approach. In the direct approach, the STCs themselves are the target variable of the model. In the indirect approach, the soil texture fractions (i.e., clay, silt, sand) are modeled first, and STCs are then derived from the predicted fractions. Limited research has been conducted on the impact of data balancing on STC predictions, and comparative analyses of the direct and indirect approaches in this context are lacking. Therefore, this study aimed to evaluate the efficacy of a resampling technique (SMOTE: synthetic minority oversampling technique) in handling an imbalanced soil texture dataset collected from the Kuhdasht region in western Iran. Additionally, the study sought to compare the performance of the direct and indirect modeling approaches. Environmental covariates derived from Landsat 8 and Sentinel 2 images, along with a digital elevation model (DEM), were used as input variables to a random forest (RF) model to predict STCs and soil texture fractions. The results revealed that terrain attributes and Euclidean distances played a more important role than remotely sensed data in modeling both the balanced and the imbalanced datasets. Kappa indices for the balanced dataset, the imbalanced dataset and the indirect approach were 89%, 68% and 38%, respectively. The corresponding overall accuracies were 91%, 79% and 68%. Among the imbalanced classes, clay loam and loam, which accounted for the majority of observations, showed the highest recall values, followed by sandy clay loam, sandy loam and silty clay loam. When employing the indirect approach, the RF model failed to capture the minority classes in terms of validation statistics. Additionally, modeling with the imbalanced dataset resulted in the exclusion of three minority STCs from the final map. Overall, this study showed the importance of balancing STCs prior to modeling to achieve more accurate estimates of STCs, as well as the superiority of the direct approach (using balanced datasets) over the indirect approach.
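The abstract names SMOTE but does not describe how it was implemented. As a rough illustration only, the core idea of SMOTE (creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbors) can be sketched as follows. The function name `smote`, its parameters, and the covariate values are hypothetical and are not taken from the study; this is not the authors' code.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length feature tuples for one minority class.
    k: number of nearest minority neighbors considered for each base point.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Random position on the segment from the base point to its neighbor.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Made-up covariate values (say, elevation in m and slope in %) for one
# under-represented textural class. Illustrative only, not study data.
minority_class = [(1210.0, 4.0), (1225.0, 5.5), (1190.0, 3.0), (1240.0, 6.0)]
new_points = smote(minority_class, n_new=6, k=2, seed=42)
for point in new_points:
    print(point)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies within the range of the observed covariate values for that class. In a workflow like the one described, oversampling of this kind would normally be applied to the training data only, so that validation statistics are computed on real observations.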

Source journal

Geoderma Regional Agricultural and Biological Sciences-Soil Science

CiteScore

6.10

Self-citation rate

7.30%

Articles published

122

Review time

76 days

About the journal: Global issues require studies and solutions on national and regional levels. Geoderma Regional focuses on studies that increase understanding and advance our scientific knowledge of soils in all regions of the world. The journal embraces every aspect of soil science and welcomes reviews of regional progress.