Random forest and spatial cross-validation performance in predicting species abundance distributions
Ciza Arsène Mushagalusa, Adandé Belarmain Fandohan, Romain Glèlè Kakaï
Environmental Systems Research, published 2024-06-28. https://doi.org/10.1186/s40068-024-00352-9
Random forests (RF) are widely used to predict spatial variables. Several studies have shown that spatial cross-validation (CV) methods consistently lead to larger RF prediction errors than standard CV methods. This study examined how species characteristics and data features affect the performance of standard RF and of spatial CV approaches in predicting species abundance distributions. It compared standard 5-fold CV, design-based validation, and three spatial CV methods: spatial buffering, environmental blocking, and spatial blocking. For design-based validation, validation samples were selected at random without replacement. We evaluated predictive performance (accuracy and discrimination metrics) using artificial species abundance data generated as a linear function of a constant term ($\beta_0$) and a random error term following a zero-mean Gaussian process whose covariance matrix is determined by an exponential correlation function. The model was tuned over multiple simulations to cover different mean levels of species abundance, degrees of spatial autocorrelation, and species detection probabilities. We found that standard RF had poor predictive performance when spatial autocorrelation was high and the species detection probability was low. For random samples, design-based validation and standard K-fold CV were more effective than spatial CV methods for evaluating RF performance, even under high spatial autocorrelation and imperfect detection. For weakly or moderately clustered samples, these two approaches yielded good modelling efficiency but overestimated RF's predictive power, whereas for strongly clustered samples with high spatial autocorrelation they overestimated modelling efficiency, predictive power, and accuracy.
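The data-generating process described above (a constant term plus a zero-mean Gaussian process error with exponential correlation) can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the grid layout, the parameter names `beta0`, `sigma2`, and `phi`, and their values are assumptions chosen for the example, and imperfect detection is not modelled here.

```python
import math
import random


def exp_covariance(coords, sigma2, phi):
    """Covariance matrix with entries sigma2 * exp(-d_ij / phi)."""
    n = len(coords)
    return [[sigma2 * math.exp(-math.dist(coords[i], coords[j]) / phi)
             for j in range(n)] for i in range(n)]


def cholesky(a):
    """Lower-triangular factor low such that low * low^T equals a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate_abundance(coords, beta0, sigma2, phi, rng):
    """Draw y = beta0 + e, where e = low * z and z is standard normal."""
    low = cholesky(exp_covariance(coords, sigma2, phi))
    z = [rng.gauss(0.0, 1.0) for _ in coords]
    return [beta0 + sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(len(coords))]


# Hypothetical 6 x 6 grid of sites with unit spacing (an assumption).
coords = [(float(x), float(y)) for x in range(6) for y in range(6)]
rng = random.Random(42)
y = simulate_abundance(coords, beta0=10.0, sigma2=4.0, phi=3.0, rng=rng)
```

In this sketch a larger `phi` makes the correlation decay more slowly with distance, which corresponds to stronger spatial autocorrelation, while `beta0` sets the mean abundance level.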
Overall, assigning blocks to folds in a checkerboard pattern in blocked spatial CV was the most effective CV approach for clustered samples, whatever the degree of clustering, spatial autocorrelation, or species abundance class. For random or systematic samples with spatial autocorrelation, the checkerboard pattern was the best of the spatial CV methods but was less effective than non-spatial CV approaches. Failing to take data features into account when validating models can lead to unrealistic predictions of species abundance and related parameters and, therefore, to incorrect interpretations of patterns and incorrect conclusions. Further research should explore the benefits of using blocked spatial K-fold CV with checkerboard assignment of blocks to folds for clustered samples with high spatial autocorrelation.
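One common reading of a checkerboard assignment of blocks to folds is sketched below, assuming square blocks and two folds that alternate like the squares of a chessboard. The function names, the block size, and the sample coordinates are hypothetical, and the abstract does not specify the exact scheme the authors used.

```python
import math


def checkerboard_folds(coords, block_size, origin=(0.0, 0.0)):
    """Return a fold id (0 or 1) for each point.

    Each point is mapped to the square block that contains it, and
    neighbouring blocks alternate between the two folds.
    """
    folds = []
    for x, y in coords:
        bx = math.floor((x - origin[0]) / block_size)
        by = math.floor((y - origin[1]) / block_size)
        folds.append((bx + by) % 2)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for k in sorted(set(folds)):
        test = [i for i, f in enumerate(folds) if f == k]
        train = [i for i, f in enumerate(folds) if f != k]
        yield train, test


# Hypothetical 4 x 4 grid of sample points, split into 2 x 2 blocks.
coords = [(x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
folds = checkerboard_folds(coords, block_size=2.0)
splits = list(train_test_splits(folds))
```

Each test block is surrounded by training blocks, so held-out points are spatially separated from most training points while both folds still cover the whole study area.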