The winner's curse under dependence: repairing empirical Bayes using convoluted densities.

Impact factor: 2.0 · Region 3 (Mathematics) · JCR Q3 (Mathematical & Computational Biology)
Stijn Hawinkel, Olivier Thas, Steven Maere
Biostatistics, volume 26, issue 1. Publication date: 2024-12-31. DOI: https://doi.org/10.1093/biostatistics/kxaf025

Abstract

The winner's curse is a form of selection bias that arises when estimates are obtained for a large number of features, but only a subset of the most extreme estimates is reported. It occurs in large-scale significance testing as well as in rank-based selection, and imperils the reproducibility of findings and the design of follow-up studies. Several methods correcting for this selection bias have been proposed, but questions remain about their susceptibility to dependence between features, since theoretical analyses and comparative studies are few. We prove that estimation through Tweedie's formula is biased in the presence of strong dependence, and propose a convolution of its density estimator to restore its competitive performance, which also aids other empirical Bayes methods. Furthermore, we perform a comprehensive simulation study comparing different classes of winner's curse correction methods for point estimates as well as confidence intervals under dependence. We find that a bootstrap method and empirical Bayes methods with density convolution perform best at correcting the selection bias, although this correction generally does not improve the feature ranking. Finally, we apply the methods to a comparison of single-feature versus multi-feature prediction models in predicting Brassica napus phenotypes from gene expression data, demonstrating that the superiority of the best single-feature model may be illusory.
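To make the selection bias and the role of Tweedie's formula concrete, the following is a minimal simulation sketch and not the authors' implementation. It assumes true effects mu_i ~ N(0, 1), estimates z_i = mu_i + noise with unit noise variance, and Tweedie's formula in its Gaussian form, E[mu | z] = z + sigma^2 * d/dz log f(z), with the marginal density f replaced by a plain Gaussian kernel density estimate. The function names, the single shared-factor dependence model controlled by `rho`, and the bandwidth rule are all illustrative assumptions; the density convolution proposed in the paper is not implemented here.

```python
import math
import random


def simulate(m, rho, rng):
    """Draw true effects mu_i ~ N(0, 1) and estimates z_i = mu_i + noise_i.

    The noise has unit variance and pairwise correlation rho, induced by one
    shared latent factor (an assumed, purely illustrative dependence model).
    """
    shared = rng.gauss(0.0, 1.0)
    mu = [rng.gauss(0.0, 1.0) for _ in range(m)]
    z = [
        mu_i + math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        for mu_i in mu
    ]
    return mu, z


def tweedie(z_all, z0, sigma2=1.0):
    """Tweedie's formula: E[mu | z0] = z0 + sigma2 * d/dz log f(z0).

    The marginal density f is replaced by a Gaussian kernel density estimate
    with a normal-reference rule-of-thumb bandwidth (a simplification).
    """
    m = len(z_all)
    mean = sum(z_all) / m
    sd = math.sqrt(sum((z - mean) ** 2 for z in z_all) / (m - 1))
    h = 1.06 * sd * m ** (-0.2)
    weights = [math.exp(-0.5 * ((z0 - z) / h) ** 2) for z in z_all]
    # f'(z0) / f(z0) for a Gaussian KDE; the normalising constants cancel.
    score = sum(w * (z - z0) for w, z in zip(weights, z_all)) / (h * h * sum(weights))
    return z0 + sigma2 * score


def selection_bias(m=5000, k=50, rho=0.0, seed=1):
    """Mean of (estimate - truth) over the k largest z: naive vs. Tweedie."""
    rng = random.Random(seed)
    mu, z = simulate(m, rho, rng)
    top = sorted(range(m), key=lambda i: z[i], reverse=True)[:k]
    naive = sum(z[i] - mu[i] for i in top) / k
    corrected = sum(tweedie(z, z[i]) - mu[i] for i in top) / k
    return naive, corrected


if __name__ == "__main__":
    for rho in (0.0, 0.8):
        naive, corrected = selection_bias(rho=rho)
        print(f"rho={rho}: naive bias={naive:.3f}, Tweedie bias={corrected:.3f}")
```

With `rho=0` the naive estimates of the selected features overshoot their true effects, and the Tweedie-corrected estimates should sit much closer to them. Rerunning with a large `rho` and several seeds gives a rough feel for how a shared noise component can distort the estimated marginal density, which is the setting the abstract is concerned with.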

Source journal: Biostatistics
Category: Biology (Mathematical & Computational Biology)
CiteScore: 5.10
Self-citation rate: 4.80%
Articles published: 45
Review time: 6-12 weeks
Journal description: Among the important scientific developments of the 20th century is the explosive growth in statistical reasoning and methods for application to studies of human health. Examples include developments in likelihood methods for inference, epidemiologic statistics, clinical trials, survival analysis, and statistical genetics. Substantive problems in public health and biomedical research have fueled the development of statistical methods, which in turn have improved our ability to draw valid inferences from data. The objective of Biostatistics is to advance statistical science and its application to problems of human health and disease, with the ultimate goal of advancing the public's health.