Cross validation for model selection: A review with examples from ecology

Impact factor 7.1; Zone 1 (Environmental Science and Ecology); JCR Q1 (Ecology)

Ecological Monographs Pub Date : 2022-11-13 DOI:10.1002/ecm.1557

Luke A. Yates, Zach Aandahl, Shane A. Richards, Barry W. Brook

Ecological Monographs, volume 93, issue 1. Full text: https://onlinelibrary.wiley.com/doi/10.1002/ecm.1557

Citations: 22

Abstract

Specifying, assessing, and selecting among candidate statistical models is fundamental to ecological research. Commonly used approaches to model selection are based on predictive scores and include information criteria such as Akaike's information criterion, and cross validation. Based on data splitting, cross validation is particularly versatile because it can be used even when it is not possible to derive a likelihood (e.g., many forms of machine learning) or count parameters precisely (e.g., mixed-effects models). However, much of the literature on cross validation is technical and spread across statistical journals, making it difficult for ecological analysts to assess and choose among the wide range of options. Here we provide a comprehensive, accessible review that explains important—but often overlooked—technical aspects of cross validation for model selection, such as: bias correction, estimation uncertainty, choice of scores, and selection rules to mitigate overfitting. We synthesize the relevant statistical advances to make recommendations for the choice of cross-validation technique and we present two ecological case studies to illustrate their application. In most instances, we recommend using exact or approximate leave-one-out cross validation to minimize bias, or otherwise k-fold with bias correction if k < 10. To mitigate overfitting when using cross validation, we recommend calibrated selection via our recently introduced modified one-standard-error rule. We advocate for the use of predictive scores in model selection across a range of typical modeling goals, such as exploration, hypothesis testing, and prediction, provided that models are specified in accordance with the stated goal. We also emphasize, as others have done, that inference on parameter estimates is biased if preceded by model selection and instead requires a carefully specified single model or further technical adjustments.
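
As a rough illustration of the workflow recommended in the abstract, the Python sketch below scores a set of nested polynomial regressions with exact leave-one-out cross validation and then applies a one-standard-error style rule based on paired score differences. The simulated data, the squared-error score, and the function names `loo_losses` and `select_one_se` are assumptions made for this example. The rule shown is a simplified stand-in and may differ in detail from the authors' modified one-standard-error rule.

```python
import numpy as np


def loo_losses(x, y, degree):
    """Exact leave-one-out squared-error loss for a polynomial of a given degree."""
    n = len(x)
    losses = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # training set: every point except i
        coefs = np.polyfit(x[keep], y[keep], degree)
        losses[i] = (y[i] - np.polyval(coefs, x[i])) ** 2
    return losses


def select_one_se(loss_by_model):
    """Pick the simplest model whose mean loss is within one standard error
    of the best-scoring model.

    loss_by_model maps a complexity value (here: polynomial degree) to an
    array of per-point losses. The standard error is computed from the
    paired per-point differences to the best model (an assumption of this
    sketch). Returns (best, chosen).
    """
    complexities = sorted(loss_by_model)
    means = {c: loss_by_model[c].mean() for c in complexities}
    best = min(complexities, key=lambda c: means[c])
    for c in complexities:  # simplest candidate first
        diff = loss_by_model[c] - loss_by_model[best]
        se = diff.std(ddof=1) / np.sqrt(len(diff))
        if diff.mean() <= se:
            return best, c
    return best, best


# Hypothetical data: a linear trend with Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 40)
y = 1.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)

# Candidate models: polynomials of degree 0 to 4.
losses = {d: loo_losses(x, y, d) for d in range(5)}
best, chosen = select_one_se(losses)
print(f"lowest LOO score: degree {best}; selected: degree {chosen}")
```

Working with per-point differences rather than each model's own standard error accounts for the fact that scores for different models are computed on the same data and are therefore correlated. The selected degree is never higher than the best-scoring one, which is how a rule of this kind guards against overfitting.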


Source journal

Ecological Monographs (Environmental Science: Ecology)

CiteScore: 12.20

Self-citation rate: 0.00%

Articles published: not stated

Review time: 3 months

About the journal: The vision for Ecological Monographs is that it should be the place for publishing integrative, synthetic papers that elaborate new directions for the field of ecology. Original Research Papers published in Ecological Monographs will continue to document complex observational, experimental, or theoretical studies that by their very integrated nature defy dissolution into shorter publications focused on a single topic or message. Reviews will be comprehensive and synthetic papers that establish new benchmarks in the field, define directions for future research, contribute to fundamental understanding of ecological principles, and derive principles for ecological management in its broadest sense (including, but not limited to: conservation, mitigation, restoration, and pro-active protection of the environment). Reviews should reflect the full development of a topic and encompass relevant natural history, observational and experimental data, analyses, models, and theory. Reviews published in Ecological Monographs should further blur the boundaries between “basic” and “applied” ecology. Concepts and Synthesis papers will conceptually advance the field of ecology. These papers are expected to go well beyond works being reviewed and include discussion of new directions, new syntheses, and resolutions of old questions. In this world of rapid scientific advancement and never-ending environmental change, there needs to be room for the thoughtful integration of scientific ideas, data, and concepts that feeds the mind and guides the development of the maturing science of ecology. Ecological Monographs provides that room, with an expansive view to a sustainable future.