{"title":"Model Selection from Multiple Model Families in Species Distribution Modeling Using Minimum Message Length.","authors":"Zihao Wen, David L Dowe","doi":"10.3390/e27010006","DOIUrl":null,"url":null,"abstract":"<p><p>Species distribution modeling is fundamental to biodiversity, evolution, conservation science, and the study of invasive species. Given environmental data and species distribution data, model selection techniques are frequently used to help identify relevant features. Existing studies aim to find the relevant features by selecting the best models using different criteria, and they deem the predictors in the best models as the relevant features. However, they mostly consider only a given model family, making them vulnerable to model family misspecification. To address this issue, this paper introduces the Bayesian information-theoretic minimum message length (MML) principle to species distribution model selection. In particular, we provide a framework that allows the message length of models from multiple model families to be calculated and compared, and by doing so, the model selection is both accurate and robust against model family misspecification and data aggregation. To find the relevant features efficiently, we further develop a novel search algorithm that does not require calculating the message length for all possible subsets of features. Experimental results demonstrate that our proposed method outperforms competing methods by selecting the best models on both artificial and real-world datasets. More specifically, there was one test on artificial data that all methods got wrong. On the other 10 tests on artificial data, the MML method got everything correct, but the alternative methods all failed on a variety of tests. Our real-world data pertained to two plant species from Barro Colorado Island, Panama. Compared to the alternative methods, for both the plant species, the MML method selects the simplest model while also having the overall best predictions.</p>","PeriodicalId":11694,"journal":{"name":"Entropy","volume":"27 1","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11764307/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Entropy","FirstCategoryId":"101","ListUrlMain":"https://doi.org/10.3390/e27010006","RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PHYSICS, MULTIDISCIPLINARY","Score":null,"Total":0}
引用次数: 0
Abstract
Species distribution modeling is fundamental to biodiversity, evolution, conservation science, and the study of invasive species. Given environmental data and species distribution data, model selection techniques are frequently used to help identify relevant features. Existing studies aim to find the relevant features by selecting the best models using different criteria, and they deem the predictors in the best models as the relevant features. However, they mostly consider only a given model family, making them vulnerable to model family misspecification. To address this issue, this paper introduces the Bayesian information-theoretic minimum message length (MML) principle to species distribution model selection. In particular, we provide a framework that allows the message length of models from multiple model families to be calculated and compared, and by doing so, the model selection is both accurate and robust against model family misspecification and data aggregation. To find the relevant features efficiently, we further develop a novel search algorithm that does not require calculating the message length for all possible subsets of features. Experimental results demonstrate that our proposed method outperforms competing methods by selecting the best models on both artificial and real-world datasets. More specifically, there was one test on artificial data that all methods got wrong. On the other 10 tests on artificial data, the MML method got everything correct, but the alternative methods all failed on a variety of tests. Our real-world data pertained to two plant species from Barro Colorado Island, Panama. Compared to the alternative methods, for both the plant species, the MML method selects the simplest model while also having the overall best predictions.
期刊介绍:
Entropy (ISSN 1099-4300), an international and interdisciplinary journal of entropy and information studies, publishes reviews, regular research papers and short notes. Our aim is to encourage scientists to publish as much as possible their theoretical and experimental details. There is no restriction on the length of the papers. If there are computation and the experiment, the details must be provided so that the results can be reproduced.