Samuel A. Cushman, Zaneta M. Kaszta, Patrick Burns, Christopher R. Hakkenberg, Patrick Jantz, David W. Macdonald, Jedediah F. Brodie, Mairin C.M. Deith, Scott Goetz
Title: Simulating multi-scale optimization and variable selection in species distribution modeling
Journal: Ecological Informatics (impact factor 5.8; JCR Q1, Ecology)
Published: 2024-09-25
DOI: 10.1016/j.ecoinf.2024.102832
URL: https://www.sciencedirect.com/science/article/pii/S1574954124003741
Citations: 0
Abstract
Species distribution modeling (SDM) is a fundamental tool in theoretical and applied ecology. However, relatively little is known about the performance of different approaches for scale optimization, model selection, and algorithmic prediction in the context of nonlinear, multiscale and interactive relationships between environmental variables and species occurrence. Modelers often struggle to optimize a tradeoff between ecological relevance, model robustness, complexity, and overfitting. In this paper, we investigated several methods designed to optimize spatial scale and variable selection in SDMs, in each case evaluating model fitness, parsimony and predictive performance. We used a simulation approach to produce a large pool of alternative underlying habitat relationships that reflect a broad range of realistic habitat associations. We also compared several different modeling algorithms, including logistic regression with a generalized linear model (GLM), Lasso and Elastic-Net Regularized GLMs (GLMNet), and random forest (RF), as well as alternative variable and scale selection methods. We found that GLM methods employing all-subsets dredge routines for variable selection were consistently the best predictors based on all criteria of our model performance assessment and across all attributes of the simulated underlying relationship, including nonlinearity and interaction. We had expected machine learning approaches, such as random forest, to perform better in these more complex forms of species-environment relationships. GLM with dredge variable selection also included the fewest spurious covariates and the highest proportion of correct predictors among all predictors. We found that univariate scaling was the most robust method of variable and scale selection, along with Minimal Redundancy Maximal Relevancy (MRMR), which performed equivalently.
The simulation experiment presented here provides a robust assessment of simulated multi-species distribution model performance, complexity and fidelity. By simulating a large range of potential habitat relationships with varying spatial scale, effect sizes, linearity, and interactions, we comprehensively evaluated model performance across gradients of complexity of the underlying relationships and violations of classical statistical assumptions. This study provides a valuable assessment and a broader example of the power and utility of controlled simulation experiments in habitat relationships and other ecological spatial predictive modeling.
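As an illustration only, the two-step workflow the abstract describes (univariate scale optimization followed by all-subsets GLM selection) can be sketched in a few lines of Python. This is not the authors' implementation, which the abstract does not give. Everything concrete here is an assumption made for the example: the 1-D circular transect standing in for raster layers, the focal radii, the variable names A, B and C, the effect sizes, and the small hand-written Newton solver used in place of a statistics library.

```python
import itertools
import math
import random

def focal_mean(values, radius):
    """Moving-window mean on a circular transect (1-D stand-in for a focal raster filter)."""
    n, width = len(values), 2 * radius + 1
    return [sum(values[(i + d) % n] for d in range(-radius, radius + 1)) / width
            for i in range(n)]

def standardize(col):
    """Center to mean 0 and scale to unit standard deviation."""
    mean = sum(col) / len(col)
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / sd for v in col]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def logistic_aic(columns, y, iters=15):
    """AIC of a logistic GLM (intercept plus the given columns), fitted by Newton's method."""
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(rows[0])
    beta = [0.0] * k

    def linpred(r):
        # Clamp the linear predictor so exp() cannot overflow.
        z = sum(b * x for b, x in zip(beta, r))
        return max(-30.0, min(30.0, z))

    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-linpred(r))) for r in rows]
        grad = [sum((yi - pi) * r[j] for r, yi, pi in zip(rows, y, p)) for j in range(k)]
        hess = [[sum(pi * (1 - pi) * r[j] * r[m] for r, pi in zip(rows, p))
                 + (1e-8 if j == m else 0.0) for m in range(k)] for j in range(k)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = sum(yi * z - math.log1p(math.exp(z)) for yi, z in zip(y, map(linpred, rows)))
    return 2 * k - 2 * loglik

# Simulate three raw covariates and smooth each at several hypothetical focal radii.
random.seed(1)
n = 300
scales = [0, 1, 2, 4]
raw = {name: [random.gauss(0, 1) for _ in range(n)] for name in "ABC"}
layers = {(v, s): standardize(focal_mean(raw[v], s)) for v in raw for s in scales}

# Assumed "true" relationship: A acts at radius 4, B at radius 0, C is irrelevant.
eta = [2.0 * a + 1.5 * b for a, b in zip(layers[("A", 4)], layers[("B", 0)])]
y = [1 if random.random() < 1 / (1 + math.exp(-e)) else 0 for e in eta]

# Step 1, univariate scaling: keep, per variable, the radius with the lowest single-variable AIC.
best_scale = {v: min(scales, key=lambda s: logistic_aic([layers[(v, s)]], y)) for v in raw}

# Step 2, all-subsets selection: fit every subset of the scale-optimized variables, rank by AIC.
candidates = [combo for r in range(len(raw) + 1)
              for combo in itertools.combinations(sorted(raw), r)]
aic = {combo: logistic_aic([layers[(v, best_scale[v])] for v in combo], y)
       for combo in candidates}
best_model = min(aic, key=aic.get)
print("selected scales:", best_scale)
print("lowest-AIC subset:", best_model)
```

Because the generating relationship is known, the selected scales and the lowest-AIC subset can be compared directly with the truth, which is the basic logic of a controlled simulation experiment. The sketch covers only a linear, additive case; the nonlinear and interactive relationships and the GLMNet, random forest and MRMR comparisons reported in the abstract are outside its scope.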
About the journal:
The journal Ecological Informatics is devoted to the publication of high-quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data, and the critical need for informing sustainable management in view of global environmental and climate change.
The nature of the journal is interdisciplinary at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, modelling and prediction of ecological data.