Objective model selection with parallel genetic algorithms using an eradication strategy
Jean-François Plante, Maxime Larocque, Michel Adès
DOI: 10.1002/cjs.11775
Published: 2023-06-05
Journal article: https://onlinelibrary.wiley.com/doi/10.1002/cjs.11775
Citations: 0
Abstract
In supervised learning, feature selection methods identify the most relevant predictors to include in a model. For linear models, the inclusion or exclusion of each variable may be represented as a vector of bits that plays the role of the genetic material defining the model. Genetic algorithms reproduce the strategies of natural selection on a population of models to identify the best one. We derive the distribution of the importance scores for parallel genetic algorithms under the null hypothesis that none of the features has predictive power. This null distribution provides an objective threshold for feature selection that does not require the visual inspection of a bubble plot. We also introduce the eradication strategy, akin to forward stepwise selection, in which the genes of useful variables are sequentially forced into the models. The method is illustrated on real data, and simulation studies are run to describe its performance.
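To make the bit-vector encoding concrete, the following is a minimal sketch of several independent short genetic-algorithm runs for variable selection in a linear model. It is not the authors' implementation. The fitness criterion (AIC), the selection, crossover and mutation rules, the function names, and all parameter values are assumptions chosen for illustration. The importance score of a variable is taken here to be the share of final-population models that include it, averaged over the runs, which is also an assumption about the exact definition.

```python
import numpy as np

def aic(X, y, mask):
    """AIC of a least-squares fit on the columns flagged in `mask`, plus an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X[:, mask]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]

def run_ga(X, y, rng, pop_size=20, generations=15, mutation_rate=0.05):
    """One short GA run; returns the final population of boolean inclusion vectors."""
    p = X.shape[1]
    pop = rng.random((pop_size, p)) < 0.5            # random initial models
    n_keep = pop_size // 2
    for _ in range(generations):
        scores = np.array([aic(X, y, ind) for ind in pop])
        survivors = pop[np.argsort(scores)[:n_keep]]  # lower AIC is better
        n_children = pop_size - n_keep
        parents_a = survivors[rng.integers(n_keep, size=n_children)]
        parents_b = survivors[rng.integers(n_keep, size=n_children)]
        # Uniform crossover: each gene is copied from one of the two parents.
        take_a = rng.random(parents_a.shape) < 0.5
        children = np.where(take_a, parents_a, parents_b)
        # Mutation: flip each bit with a small probability.
        children ^= rng.random(children.shape) < mutation_rate
        pop = np.vstack([survivors, children])
    return pop

def importance_scores(X, y, n_runs=10, seed=0):
    """Average, over independent runs, of each variable's inclusion rate in the final population."""
    rng = np.random.default_rng(seed)
    finals = [run_ga(X, y, rng) for _ in range(n_runs)]
    return np.mean([pop.mean(axis=0) for pop in finals], axis=0)

# Hypothetical data: only the first two of eight predictors carry signal.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(200, 8))
y_demo = 3 * X_demo[:, 0] - 3 * X_demo[:, 1] + demo_rng.normal(size=200)
print(np.round(importance_scores(X_demo, y_demo, n_runs=5), 2))
```

Shuffling `y_demo` before calling `importance_scores` gives an empirical look at how the scores behave when no feature is predictive. That is only a simulation stand-in for the analytical null distribution described in the abstract.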