Geostatistical semi-supervised learning for spatial prediction

IF 4.2

Artificial Intelligence in Geosciences. Pub Date: 2022-12-01. DOI: 10.1016/j.aiig.2022.12.002

Francky Fouedjio, Hassan Talebi

Volume 3, Pages 162-178. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S2666544122000351

Citations: 0

Abstract

Geoscientists are increasingly tasked with spatially predicting a target variable from auxiliary information using supervised machine learning algorithms. Typically, the target variable is observed at only a few sampling locations because measurements are time-consuming and costly to obtain. In contrast, auxiliary variables are often observed exhaustively across the study region, thanks to the growth of remote sensing platforms and sensor networks. Supervised machine learning methods do not fully leverage this large amount of auxiliary spatial data: their training dataset includes only labeled data locations (where both target and auxiliary variables were measured), while unlabeled data locations (where auxiliary variables were measured but not the target variable) are ignored during model training. Consequently, only a limited amount of the auxiliary spatial data is used to train the model. Semi-supervised learning, which learns from both labeled and unlabeled data, can address this problem. However, conventional semi-supervised learning techniques do not account for the specificities of spatial data. This paper introduces a spatial semi-supervised learning framework that combines geostatistics and machine learning to harness a large amount of unlabeled spatial data together with a typically smaller set of labeled spatial data. The main idea is to leverage the target variable's spatial autocorrelation to generate pseudo labels at unlabeled data points that are geographically close to labeled data points. This is achieved through geostatistical conditional simulation, which generates an ensemble of pseudo labels to account for the uncertainty in the pseudo-labeling process. The observed labels are augmented with this ensemble of pseudo labels to create an ensemble of pseudo training datasets. A supervised machine learning model is then trained on each pseudo training dataset, and the trained models are aggregated. The proposed geostatistical semi-supervised learning method is applied to synthetic and real-world spatial datasets, and its predictive performance is compared with some classical supervised and semi-supervised machine learning methods. The results indicate that it can effectively leverage a large amount of unlabeled spatial data to improve the spatial prediction of the target variable.
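The workflow described in the abstract (conditional simulation of pseudo labels near labeled points, one model per pseudo training dataset, then aggregation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exponential covariance model, its `corr_range`, the `max_dist` neighborhood rule, the simple-kriging Gaussian conditional simulation, the least-squares base learner, and averaging of predictions as the aggregation step are all choices made here for illustration. All function names are hypothetical.

```python
import numpy as np

def exp_cov(a, b, sill, corr_range):
    """Exponential covariance between two coordinate sets (assumed model)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sill * np.exp(-d / corr_range)

def conditional_simulations(xy_lab, y_lab, xy_unl, n_sims, rng,
                            corr_range=0.2, jitter=1e-8):
    """Draw Gaussian realizations at xy_unl conditioned on the observed labels.

    Uses simple kriging with the sample mean and variance as plug-in
    parameters (an assumption; a fitted variogram would be used in practice).
    """
    mean, sill = y_lab.mean(), float(np.var(y_lab))
    k_ll = exp_cov(xy_lab, xy_lab, sill, corr_range) + jitter * np.eye(len(xy_lab))
    k_ul = exp_cov(xy_unl, xy_lab, sill, corr_range)
    k_uu = exp_cov(xy_unl, xy_unl, sill, corr_range)
    weights = np.linalg.solve(k_ll, k_ul.T).T           # kriging weights
    cond_mean = mean + weights @ (y_lab - mean)         # conditional mean
    cond_cov = k_uu - weights @ k_ul.T                  # conditional covariance
    cond_cov = 0.5 * (cond_cov + cond_cov.T) + jitter * np.eye(len(xy_unl))
    chol = np.linalg.cholesky(cond_cov)
    z = rng.standard_normal((len(xy_unl), n_sims))
    return (cond_mean[:, None] + chol @ z).T            # shape (n_sims, n_unl)

def fit_linear(X, y):
    """Least-squares base learner with intercept (stand-in for any regressor)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def geostat_ssl_predict(xy_lab, X_lab, y_lab, xy_unl, X_unl, X_new,
                        n_sims=20, max_dist=0.1, seed=0):
    """Train one model per pseudo training dataset and average predictions."""
    rng = np.random.default_rng(seed)
    # Keep only unlabeled points that are geographically close to a labeled one.
    dist = np.linalg.norm(xy_unl[:, None, :] - xy_lab[None, :, :], axis=-1)
    near = dist.min(axis=1) <= max_dist
    sims = conditional_simulations(xy_lab, y_lab, xy_unl[near], n_sims, rng)
    X_aug = np.vstack([X_lab, X_unl[near]])
    preds = []
    for pseudo_labels in sims:
        # Observed labels augmented by one realization of pseudo labels.
        y_aug = np.concatenate([y_lab, pseudo_labels])
        preds.append(predict_linear(fit_linear(X_aug, y_aug), X_new))
    return np.mean(preds, axis=0)                       # aggregation by averaging
```

Here `xy_*` are coordinate arrays of shape `(n, 2)`, `X_*` are auxiliary-variable arrays, and `y_lab` holds the observed target values. Swapping the least-squares learner for another regressor only requires replacing `fit_linear` and `predict_linear`. The sketch assumes at least one unlabeled point falls within `max_dist` of a labeled point.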

