Multi-Modal Vision Transformer for high-resolution soil texture prediction of German agricultural soils using remote sensing imagery

IF 11.4 · Zone 1, Earth Sciences · Q1 ENVIRONMENTAL SCIENCES

Remote Sensing of Environment · Pub Date: 2025-09-04 · DOI: 10.1016/j.rse.2025.114985

Lucas Wittstruck, Björn Waske, Thomas Jarmer

Remote Sensing of Environment, Volume 331, Article 114985. Journal article; not open access. https://www.sciencedirect.com/science/article/pii/S003442572500389X

Citations: 0

Abstract

The quantification and mapping of important soil properties, such as soil texture, are vital for effective crop management and the assessment of overall soil health in agricultural systems. In this study, we propose a multi-modal Visual Transformer (MMVT) architecture to predict and map the soil particle size distribution of agricultural topsoils in Germany at a high spatial resolution of 10 meters. Our modeling utilized multi-source bare soil satellite image composites with terrain and soil-related covariates. To optimize the model’s ability to capture spatial soil context, various image sizes were evaluated. The study findings highlighted the effectiveness of our MMVT model, demonstrating improved estimation accuracies compared to a two-dimensional Convolutional Neural Network (2D CNN) and a Random Forest (RF) model. Specifically, the proposed transformer network achieved the highest averaged validated accuracy in predicting the soil texture when incorporating a contextual image surrounding of 320 × 320 m around the soil sampling positions (Sand: $R^{2}$ = 0.74, RMSE = 14.78%, and RPIQ = 3.52; Silt: $R^{2}$ = 0.73, RMSE = 12.36%, and RPIQ = 3.50; Clay: $R^{2}$ = 0.52, RMSE = 6.30%, and RPIQ = 1.95). This integrated approach underscores the potential of advanced deep learning techniques and multi-modal learning in providing comprehensive insights into soil characteristics with high resolution and at a large scale.
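
For readers unfamiliar with the three accuracy metrics reported in the abstract, the following is a minimal sketch of how they are commonly computed. It is not the authors' evaluation code. It assumes the coefficient-of-determination form of R² (some studies report the squared Pearson correlation instead) and the usual definition of RPIQ as the interquartile range of the observed values divided by the RMSE. The function names, the quartile convention, and the sample values are hypothetical. The helper `patch_size_pixels` only illustrates the arithmetic that a 320 m window at 10 m resolution spans 32 pixels per side.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in [0, 100])."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def regression_metrics(observed, predicted):
    """Return R^2, RMSE and RPIQ for paired observed/predicted values.

    Assumed definitions (not confirmed by the source):
      R^2  = 1 - SS_res / SS_tot
      RMSE = sqrt(mean squared error)
      RPIQ = (Q3 - Q1 of the observed values) / RMSE
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    iqr = percentile(observed, 75) - percentile(observed, 25)
    return {"r2": 1.0 - ss_res / ss_tot, "rmse": rmse, "rpiq": iqr / rmse}


def patch_size_pixels(extent_m, pixel_size_m):
    """Number of pixels along one side of a square window of the given extent."""
    return round(extent_m / pixel_size_m)


if __name__ == "__main__":
    # Hypothetical sand contents in percent, for illustration only.
    observed = [10.0, 20.0, 30.0, 40.0, 50.0]
    predicted = [12.0, 18.0, 33.0, 39.0, 48.0]
    print(regression_metrics(observed, predicted))
    print(patch_size_pixels(320, 10))  # 32 pixels per side
```

Because RPIQ scales the error by the spread of the observed values, it can be compared across fractions with different ranges, which is one reason it is often reported alongside RMSE in soil studies.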




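Source journal: Remote Sensing of Environment (Environmental Sciences; Imaging Science & Photographic Technology)
CiteScore: 25.10
Self-citation rate: 8.90%
Articles published: 455
Review time: 53 days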

About the journal: Remote Sensing of Environment (RSE) serves the Earth observation community by disseminating results on the theory, science, applications, and technology that contribute to advancing the field of remote sensing. With a thoroughly interdisciplinary approach, RSE encompasses terrestrial, oceanic, and atmospheric sensing. The journal emphasizes biophysical and quantitative approaches to remote sensing at local to global scales, covering a diverse range of applications and techniques. RSE serves as a vital platform for the exchange of knowledge and advancements in the dynamic field of remote sensing.