Multi-Modal Vision Transformer for high-resolution soil texture prediction of German agricultural soils using remote sensing imagery
Lucas Wittstruck, Björn Waske, Thomas Jarmer
Remote Sensing of Environment, Volume 331, Article 114985. Published 2025-09-04. DOI: 10.1016/j.rse.2025.114985
https://www.sciencedirect.com/science/article/pii/S003442572500389X
The quantification and mapping of key soil properties, such as soil texture, are vital for effective crop management and for assessing overall soil health in agricultural systems. In this study, we propose a multi-modal Vision Transformer (MMVT) architecture to predict and map the particle size distribution of agricultural topsoils in Germany at a spatial resolution of 10 m. The model combines multi-source bare-soil satellite image composites with terrain and soil-related covariates. To optimize the model's ability to capture spatial soil context, we evaluated several input image sizes. The MMVT achieved higher estimation accuracies than a two-dimensional Convolutional Neural Network (2D CNN) and a Random Forest (RF) model. It reached its highest average validation accuracy when given a 320 × 320 m image context around each soil sampling position (sand: R² = 0.74, RMSE = 14.78%, RPIQ = 3.52; silt: R² = 0.73, RMSE = 12.36%, RPIQ = 3.50; clay: R² = 0.52, RMSE = 6.30%, RPIQ = 1.95). These results underscore the potential of deep learning and multi-modal learning for mapping soil characteristics at high resolution over large areas.
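The abstract reports accuracy as R², RMSE and RPIQ and gives the image context in metres rather than pixels. The following minimal standard-library Python sketch shows how these quantities are commonly computed. It makes several assumptions: R² is 1 − SS_res/SS_tot, RMSE is in the target's units (here percent of a particle-size fraction), RPIQ is the interquartile range of the observed values divided by RMSE (with quartiles from `statistics.quantiles` using linear "inclusive" interpolation), and the window side in pixels is simply the context extent divided by the 10 m pixel size. The paper may define or compute any of these differently. The function names and toy values are hypothetical and are not taken from the study.

```python
import math
import statistics


def window_pixels(context_m: float, pixel_size_m: float = 10.0) -> int:
    """Side length in pixels of a square context of context_m metres (assumes an exact multiple)."""
    return round(context_m / pixel_size_m)


def rmse(observed: list[float], predicted: list[float]) -> float:
    """Root-mean-square error, in the same units as the target (e.g. % sand)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed: list[float], predicted: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = statistics.fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rpiq(observed: list[float], predicted: list[float]) -> float:
    """Ratio of performance to interquartile distance: (Q3 - Q1 of observed) / RMSE."""
    # Assumption: linear ("inclusive") quartile interpolation; other quartile rules shift the IQR.
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmse(observed, predicted)


if __name__ == "__main__":
    # A 320 x 320 m context at 10 m resolution corresponds to a 32 x 32 pixel patch.
    print(window_pixels(320))  # 32
    # Hypothetical toy values, not data from the study.
    obs = [10, 20, 30, 40, 50]
    pred = [12, 18, 33, 37, 50]
    print(round(r_squared(obs, pred), 3), round(rmse(obs, pred), 2), round(rpiq(obs, pred), 2))
```

RPIQ is included because, unlike R², it scales the error by the spread of the central half of the observations. This is one reason the clay results (RPIQ = 1.95) read as weaker than sand and silt (RPIQ above 3) even though clay has the lowest RMSE.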
Journal description:
Remote Sensing of Environment (RSE) serves the Earth observation community by disseminating results on the theory, science, applications, and technology that contribute to advancing the field of remote sensing. With a thoroughly interdisciplinary approach, RSE encompasses terrestrial, oceanic, and atmospheric sensing.
The journal emphasizes biophysical and quantitative approaches to remote sensing at local to global scales, covering a diverse range of applications and techniques.
RSE serves as a vital platform for the exchange of knowledge and advancements in the dynamic field of remote sensing.