Matthieu Pluntz, Cyril Dalmasso, Pascale Tubert-Bitter, Ismaïl Ahmed
Statistics in Medicine, e10275. Published 2025-01-15 (Epub 2024-12-12). DOI: 10.1002/sim.10275 (https://doi.org/10.1002/sim.10275)
A Simple Information Criterion for Variable Selection in High-Dimensional Regression.
High-dimensional regression problems, for example with genomic or drug exposure data, typically involve automated selection of a sparse set of regressors. Penalized regression methods like the LASSO can deliver a family of candidate sparse models. To select one, there are criteria balancing log-likelihood and model size, the most common being AIC and BIC. These two methods do not take into account the implicit multiple testing performed when selecting variables in a high-dimensional regression, which makes them too liberal. We propose the extended AIC (EAIC), a new information criterion for sparse model selection in high-dimensional regressions. It allows for asymptotic FWER control when the candidate regressors are independent. It is based on a simple formula involving model log-likelihood, model size, the total number of candidate regressors, and the FWER target. In a simulation study over a wide range of linear and logistic regression settings, we assessed the variable selection performance of the EAIC and of other information criteria (including some that also use the number of candidate regressors: mBIC, mAIC, and EBIC) in conjunction with the LASSO. Our method controls the FWER in nearly all settings, in contrast to the AIC and BIC, which produce many false positives. We also illustrate it for the automated signal detection of adverse drug reactions on the French pharmacovigilance spontaneous reporting database.
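As a rough illustration of the baseline that the abstract calls too liberal, the sketch below scores a family of candidate sparse models with the textbook AIC (2k - 2 log L) and BIC (k log n - 2 log L) and picks the minimiser of each. It does not implement the EAIC, whose formula is not given in the abstract. The function names, the simulated data, and the nested candidate supports (standing in for a LASSO path) are all assumptions made for the example, and the Gaussian linear model is only one of the settings the abstract mentions.

```python
import numpy as np


def gaussian_loglik(X, y, support):
    """Maximised Gaussian log-likelihood of an OLS fit on `support` plus an intercept."""
    n = len(y)
    cols = np.asarray(support, dtype=int)
    design = np.column_stack([np.ones(n), X[:, cols]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    sigma2 = rss / n  # maximum-likelihood estimate of the noise variance
    return -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)


def aic(loglik, k, n):
    """Textbook AIC; `n` is unused but kept so both criteria share a signature."""
    return 2.0 * k - 2.0 * loglik


def bic(loglik, k, n):
    """Textbook BIC, with a per-variable penalty of log(n)."""
    return k * np.log(n) - 2.0 * loglik


def select_support(X, y, candidates, criterion):
    """Return the candidate support with the smallest criterion value.

    k counts selected regressors only: the intercept and the variance add the
    same constant to every candidate, so they do not change the ranking.
    """
    n = len(y)
    scores = [criterion(gaussian_loglik(X, y, s), len(s), n) for s in candidates]
    return candidates[int(np.argmin(scores))]


# Hypothetical data: 50 independent candidate regressors, 2 of them truly active.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + rng.standard_normal(n)

# Hypothetical nested candidate supports, standing in for a LASSO path.
candidates = [list(range(k)) for k in range(11)]

print("AIC choice:", select_support(X, y, candidates, aic))
print("BIC choice:", select_support(X, y, candidates, bic))
```

Neither penalty depends on the total number of candidate regressors p, which is the quantity the abstract says the EAIC, mBIC, mAIC, and EBIC bring into the criterion.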
About the journal:
The journal aims to influence practice in medicine and its associated sciences through the publication of papers on statistical and other quantitative methods. Papers will explain new methods and demonstrate their application, preferably through a substantive, real, motivating example or a comprehensive evaluation based on an illustrative example. Alternatively, papers will report on case studies where creative use or technical generalizations of established methodology are directed towards a substantive application. Reviews of, and tutorials on, general topics relevant to the application of statistics to medicine will also be published. The main criteria for publication are appropriateness of the statistical methods to a particular medical problem and clarity of exposition. Papers with primarily mathematical content will be excluded. The journal aims to enhance communication between statisticians, clinicians and medical researchers.