Random forest for spatial prediction of censored response variables

Francky Fouedjio
Journal: Artificial Intelligence in Geosciences, Volume 2, Pages 115-127
Publication date: 2021-12-01
DOI: 10.1016/j.aiig.2022.02.001
URL: https://www.sciencedirect.com/science/article/pii/S2666544122000016
Citations: 2

Abstract

Spatial prediction of a continuous response variable, given spatially exhaustive predictor variables over the study region, has become ubiquitous in many geoscience fields. The response variable is often subject to detection limits imposed by the measuring instrument or the sampling protocol. Consequently, its observations are censored (left-censored, right-censored, or interval-censored). Machine learning methods designed for the spatial prediction of uncensored response variables cannot explicitly account for censored observations. In such cases, they are routinely applied through ad hoc approaches, such as ignoring the censored observations or replacing them with arbitrary values. As a result, the spatial prediction may be inaccurate and sensitive to the assumptions and approximations behind those arbitrary choices. This paper introduces a random forest-based machine learning method for spatially predicting a censored response variable, in which the censored observations are explicitly taken into account. The basic idea is to build an ensemble of regression tree predictors by training the classical regression random forest on the subset of data containing only the uncensored observations. Principal component analysis applied to this ensemble then translates the response variable's observations (uncensored and censored) into a system of linear equalities and inequalities. This system is solved through randomized quadratic programming, which yields an ensemble of reconstructed regression tree predictors that exactly honor the observations (uncensored and censored). The spatial prediction of the response variable is then obtained by averaging this reconstructed ensemble. The effectiveness of the proposed method is illustrated on simulated data for which ground truth is available and showcased on real-world data, including geochemical data. The results suggest that the proposed technique makes greater use of the censored observations than ad hoc methods.
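
To make the distinction between censored and uncensored observations concrete, the minimal Python sketch below stores each observation as a pair of bounds, so that an uncensored value becomes an equality and a censored value becomes an inequality. It then shows how a simple summary shifts under the ad hoc treatments mentioned in the abstract. This is not the paper's implementation: the `Observation` class, the helper names, the detection limit of 0.5, and the five sample values are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Observation:
    """A response observation stored as bounds: lower <= y <= upper."""
    lower: float
    upper: float

    @property
    def is_censored(self) -> bool:
        # Equal bounds encode an exact (uncensored) measurement.
        return self.lower != self.upper


def uncensored(value: float) -> Observation:
    return Observation(value, value)        # equality: y = value


def left_censored(limit: float) -> Observation:
    return Observation(-math.inf, limit)    # inequality: y <= limit


def right_censored(limit: float) -> Observation:
    return Observation(limit, math.inf)     # inequality: y >= limit


def interval_censored(low: float, high: float) -> Observation:
    return Observation(low, high)           # inequalities: low <= y <= high


def honors(prediction: float, obs: Observation, tol: float = 1e-9) -> bool:
    """Return True if a predicted value is consistent with the observation."""
    return obs.lower - tol <= prediction <= obs.upper + tol


# Hypothetical sample: three measured values, two below a detection limit.
DETECTION_LIMIT = 0.5
sample = [
    uncensored(1.2),
    uncensored(0.8),
    left_censored(DETECTION_LIMIT),
    uncensored(2.0),
    left_censored(DETECTION_LIMIT),
]

# Ad hoc approach 1: ignore the censored observations entirely.
mean_ignoring = mean(o.lower for o in sample if not o.is_censored)


def mean_substituting(fraction: float) -> float:
    """Ad hoc approach 2: replace each censored value by fraction * limit.

    Assumes every censored entry in `sample` is left-censored, as above.
    """
    return mean(o.upper * fraction if o.is_censored else o.lower for o in sample)


if __name__ == "__main__":
    print("ignore censored:", round(mean_ignoring, 3))
    for fraction in (0.0, 0.5, 1.0):
        print(f"substitute {fraction} x limit:", round(mean_substituting(fraction), 3))
```

In this toy sample the mean ranges from 0.8 to roughly 1.33 depending on which arbitrary choice is made, which is the kind of sensitivity the abstract warns about. The method described in the abstract instead keeps censored observations as inequalities that the reconstructed predictors must satisfy, in the spirit of the `honors` check above.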
