Simulating multi-scale optimization and variable selection in species distribution modeling

IF 5.8 | Zone 2, Environmental Science and Ecology | JCR Q1 ECOLOGY

Ecological Informatics Pub Date : 2024-09-25 DOI:10.1016/j.ecoinf.2024.102832

Samuel A. Cushman, Zaneta M. Kaszta, Patrick Burns, Christopher R. Hakkenberg, Patrick Jantz, David W. Macdonald, Jedediah F. Brodie, Mairin C.M. Deith, Scott Goetz

Ecological Informatics, Volume 83, Article 102832. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S1574954124003741

Citations: 0

Abstract

Species distribution modeling (SDM) is a fundamental tool in theoretical and applied ecology. However, relatively little is known about the performance of different approaches for scale optimization, model selection, and algorithmic prediction in the context of nonlinear, multiscale and interactive relationships between environmental variables and species occurrence. Modelers often struggle to optimize a tradeoff between ecological relevance, model robustness, complexity, and overfitting. In this paper, we investigated several methods designed to optimize spatial scale and variable selection in SDMs, in each case evaluating model fitness, parsimony and predictive performance. We used a simulation approach to produce a large pool of alternative underlying habitat relationships that reflect a broad range of realistic habitat associations. We also compared several different modeling algorithms, including logistic regression with a generalized linear model (GLM), Lasso and Elastic-Net Regularized GLMs (GLMNet), and random forest (RF), as well as alternative variable and scale selection methods. We found that GLM methods employing all-subsets dredge routines for variable selection were consistently the best predictors based on all criteria of our model performance assessment and across all attributes of the simulated underlying relationship, including nonlinearity and interaction. We had expected machine learning approaches, such as random forest, to perform better in these more complex forms of species-environment relationships. GLM using dredge variable selection was also the method that included the fewest spurious covariates and included the most correct predictors as a proportion of all predictors. We found that univariate scaling was the most robust method of variable and scale selection, along with Minimal Redundancy Maximal Relevancy (MRMR), which performed equivalently.
The simulation experiment presented here provides a robust assessment of simulated multi-species distribution model performance, complexity and fidelity. By simulating a large range of potential habitat relationships with varying spatial scale, effect sizes, linearity, and interactions, we comprehensively evaluated model performance across gradients of complexity of the underlying relationships and violations of classical statistical assumptions. This study provides a valuable assessment and a broader example of the power and utility of controlled simulation experiments in habitat relationships and other ecological spatial predictive modeling.
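To make the idea of univariate scale optimization concrete, the following is a minimal sketch and not the authors' code. It assumes a toy one-dimensional "landscape", a single covariate summarized by a moving-window mean at several candidate radii, and a one-predictor logistic regression fitted at each radius, with the radius of lowest AIC selected. All names (`focal_mean`, `fit_logistic_aic`, `select_scale`), the candidate radii, the effect size, and the sample size are hypothetical choices made for illustration only.

```python
import math
import random

def focal_mean(values, radius):
    """Moving-window mean along a 1-D transect; windows are clipped at the edges."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def standardize(x):
    """Center to mean 0 and scale to unit standard deviation."""
    m = sum(x) / len(x)
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / len(x))
    return [(v - m) / sd for v in x]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic_aic(x, y, iters=25):
    """Fit y ~ intercept + slope * x by Newton-Raphson and return the AIC."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p              # gradient w.r.t. intercept
            g1 += (yi - p) * xi       # gradient w.r.t. slope
            h00 += w                  # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    loglik = 0.0
    for xi, yi in zip(x, y):
        p = min(max(sigmoid(b0 + b1 * xi), 1e-12), 1.0 - 1e-12)
        loglik += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return 2 * 2 - 2 * loglik  # two parameters: intercept and slope

def select_scale(raw, y, radii):
    """Return the radius whose univariate model has the lowest AIC, plus all AICs."""
    aic = {r: fit_logistic_aic(standardize(focal_mean(raw, r)), y) for r in radii}
    return min(aic, key=aic.get), aic

# Simulate occurrence that responds to the covariate at one "true" scale.
rng = random.Random(42)
raw = [rng.gauss(0.0, 1.0) for _ in range(2000)]
true_radius = 8
signal = standardize(focal_mean(raw, true_radius))
y = [1 if rng.random() < sigmoid(2.0 * s) else 0 for s in signal]

best, aic_by_radius = select_scale(raw, y, [1, 2, 4, 8, 16, 32])
print("selected radius:", best)
```

In a multivariate workflow, the scale chosen per covariate in this way would then feed a subsequent variable selection step; that step, and any real raster handling, is deliberately left out of this sketch.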

Source journal: Ecological Informatics (Environmental Science - Ecology)
CiteScore: 8.30
Self-citation rate: 11.80%
Articles published: 346
Review time: 46 days

About the journal: The journal Ecological Informatics is devoted to the publication of high quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data as well as the critical need for informing sustainable management in view of global environmental and climate change. The nature of the journal is interdisciplinary at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, modelling and prediction of ecological data.