Self-supervised learning of Vision Transformers for digital soil mapping using visual data

IF 5.6 | Zone 1, Agricultural and Forestry Sciences | Q1 Soil Science

Geoderma | Pub Date: 2024-10-01 | DOI: 10.1016/j.geoderma.2024.117056

Paul Tresson, Maxime Dumont, Marc Jaeger, Frédéric Borne, Stéphane Boivin, Loïc Marie-Louise, Jérémie François, Hassan Boukcim, Hervé Goëau

Volume 450, Article 117056 | Full text: https://www.sciencedirect.com/science/article/pii/S0016706124002854


Abstract

In arid environments, prospecting for cultivable land is challenging because of harsh climatic conditions and vast, hard-to-access areas. However, the soil is often bare, with little vegetation cover, which makes it easy to observe from above. Remote sensing can therefore drastically reduce the cost of exploring these areas. Over the past few years, deep learning has extended remote sensing analysis, first with Convolutional Neural Networks (CNNs), then with Vision Transformers (ViTs). The main drawback of deep learning methods is their reliance on large calibration datasets, since data collection is cumbersome and costly, particularly in drylands. However, recent studies demonstrate that ViTs can be pre-trained in a self-supervised manner on large amounts of unlabelled data. These backbone models can then be fine-tuned into a supervised regression model using only a small amount of labelled data.

In our study, we trained ViTs in a self-supervised way on a 9500 km² satellite image of drylands in Saudi Arabia with a spatial resolution of 1.5 m per pixel. The resulting models were used to extract features describing the bare soil and to predict soil attributes (pH H₂O, pH KCl, Si composition). Using only RGB data, we can accurately predict these soil properties and achieve, for instance, an RMSE of 0.40 ± 0.03 when predicting alkaline soil pH. We also assess the effect of adding covariates, such as elevation. The pretrained models can also be used as visual feature extractors. These features can be used to automatically generate a clustered map of an area or as input to random forest models, providing a versatile way to generate maps with limited labelled data and input variables.
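As an illustration of how an error such as "RMSE of 0.40 ± 0.03" can be reported from pre-extracted features, the following is a minimal sketch, not the authors' pipeline. It assumes the ± term is the standard deviation of the RMSE across cross-validation folds, which the abstract does not state. The feature vectors and pH targets are synthetic, and a small k-nearest-neighbour regressor stands in for the random forest so that the example needs only the Python standard library. All function names here are hypothetical.

```python
import math
import random
import statistics


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training vectors closest to `query`.

    Stand-in regressor (assumption): the abstract mentions random forests.
    """
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cross_validated_rmse(features, targets, n_folds=5, k=3, seed=0):
    """Return (mean, standard deviation) of the per-fold RMSE under shuffled k-fold CV."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        train_x = [features[i] for i in train]
        train_y = [targets[i] for i in train]
        preds = [knn_predict(train_x, train_y, features[i], k) for i in held_out]
        scores.append(rmse([targets[i] for i in held_out], preds))
    return statistics.mean(scores), statistics.stdev(scores)


# Synthetic stand-in for ViT features (4-D vectors) and measured soil pH values.
rng = random.Random(42)
features = [[rng.uniform(0, 1) for _ in range(4)] for _ in range(60)]
targets = [7.5 + 1.5 * f[0] + rng.gauss(0, 0.1) for f in features]

mean_rmse, std_rmse = cross_validated_rmse(features, targets)
print(f"RMSE: {mean_rmse:.2f} ± {std_rmse:.2f}")
```

In practice the `features` list would hold embeddings produced by the pretrained backbone for each sampled location, and the stand-in regressor would be replaced by whichever model is being evaluated.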


Source journal

Geoderma (Agricultural and Forestry Sciences: Soil Science)

CiteScore: 11.80

Self-citation rate: 6.60%

Publication volume: 597

Review time: 58 days

Journal description: Geoderma - the global journal of soil science - welcomes authors, readers and soil research from all parts of the world, encourages worldwide soil studies, and embraces all aspects of soil science and its associated pedagogy. The journal particularly welcomes interdisciplinary work focusing on dynamic soil processes and functions across space and time.