Josue G Martinez, Raymond J Carroll, Samuel Müller, Joshua N Sampson, Nilanjan Chatterjee
American Statistician, published 2011-11-01. DOI: 10.1198/tas.2011.11052 (https://doi.org/10.1198/tas.2011.11052). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3281424/pdf/nihms355303.pdf
Citations: 0
Abstract
Empirical Performance of Cross-Validation With Oracle Methods in a Genomics Context.
When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals are large, such cross-validation typically works well. However, in regression modeling of genomic studies involving single nucleotide polymorphisms (SNPs), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.
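The kind of variability described in the abstract can be explored with a small simulation. The sketch below is not the authors' code and does not reproduce their results. It uses the plain Lasso (which the abstract says behaves similarly) rather than SCAD or the Adaptive Lasso, synthetic Gaussian predictors rather than SNP data, and a simple proximal-gradient solver written for illustration. The sample size, number of predictors, effect size, lambda grid, and all function names are assumptions chosen only to keep the example short.

```python
import numpy as np


def lasso_ista(X, y, lam, n_iter=200):
    """Lasso by proximal gradient: minimize (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    G = X.T @ X / n                          # Gram matrix
    c = X.T @ y / n
    step = 1.0 / np.linalg.eigvalsh(G)[-1]   # 1 / largest eigenvalue of G
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (G @ beta - c)     # gradient step on the squared loss
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return beta


def n_selected_by_cv(X, y, lams, m, rng):
    """One run of m-fold CV: choose lambda, refit on all data, count nonzero coefficients."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), m)   # random fold assignment
    cv_err = np.zeros(len(lams))
    for test in folds:
        mask = np.ones(n, dtype=bool)
        mask[test] = False
        mu_x, mu_y = X[mask].mean(axis=0), y[mask].mean()
        for k, lam in enumerate(lams):
            beta = lasso_ista(X[mask] - mu_x, y[mask] - mu_y, lam)
            pred = (X[test] - mu_x) @ beta + mu_y
            cv_err[k] += np.sum((y[test] - pred) ** 2)
    best_lam = lams[int(np.argmin(cv_err))]
    beta = lasso_ista(X - X.mean(axis=0), y - y.mean(), best_lam)
    return int(np.count_nonzero(beta))


def repeated_cv_counts(n=80, p=20, n_signal=4, effect=0.25, m=10, n_repeats=10, seed=0):
    """Fix one dataset, then rerun m-fold CV with different random fold splits."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:n_signal] = effect            # sparse truth with small signals (assumed values)
    y = X @ beta_true + rng.standard_normal(n)
    lams = np.geomspace(0.5, 0.01, 10)       # assumed lambda grid
    return [n_selected_by_cv(X, y, lams, m, rng) for _ in range(n_repeats)]


counts = repeated_cv_counts()
print("selected variables per CV run:", counts)
print("min, max:", min(counts), max(counts))
```

The dataset is held fixed across repeats, so any spread in the printed counts comes only from the random assignment of observations to folds. That is the source of randomness a single run of m-fold cross-validation hides.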
About the journal:
Are you looking for general-interest articles about current national and international statistical problems and programs; interesting and fun articles of a general nature about statistics and its applications; or the teaching of statistics? Then you are looking for The American Statistician (TAS), published quarterly by the American Statistical Association. TAS contains timely articles organized into the following sections: Statistical Practice, General, Teacher's Corner, History Corner, Interdisciplinary, Statistical Computing and Graphics, Reviews of Books and Teaching Materials, and Letters to the Editor.