False discovery rate control with unknown null distribution: Is it possible to mimic the oracle?

The Annals of Statistics Pub Date : 2022-04-01 DOI:10.1214/21-aos2141

Étienne Roquain, N. Verzelen


Citations: 10

Abstract

Classical multiple testing theory prescribes the null distribution, which is often too stringent an assumption for today's large-scale experiments. This paper presents theoretical foundations for understanding the limitations caused by not knowing the null distribution, and how it can be properly learned from the (same) data set, when possible. We explore this issue in the case where the null distributions are Gaussian with unknown rescaling parameters (mean and variance) and the alternative distribution is left arbitrary. While an oracle procedure in that case is the Benjamini-Hochberg procedure applied with the true (unknown) null distribution, we pursue the aim of building a procedure that asymptotically mimics the performance of the oracle (AMO for short). Our main result states that an AMO procedure exists if and only if the sparsity parameter k (number of false nulls) is of order less than n/log(n), where n is the total number of tests. Further sparsity boundaries are derived for general location models where the shape of the null distribution is not necessarily Gaussian. Given our impossibility results, we also pursue a weaker objective, which is to find a confidence region for the oracle. To this end, we develop a distribution-dependent confidence region for the null distribution. As practical by-products, this provides a goodness-of-fit test for the null distribution, as well as a visual method for assessing the reliability of empirical null multiple testing methods. Our results are illustrated with numerical experiments and a companion vignette (Roquain and Verzelen, 2020). AMS 2000 subject classifications: Primary 62G10; secondary 62C20.
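The setting in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it applies the Benjamini-Hochberg step-up rule with three different Gaussian nulls (the true one, playing the role of the oracle; a plug-in null estimated from the same data; and a wrongly prescribed N(0, 1) null). The median and scaled-MAD estimators, the function names, and all simulation parameters (n, k, the null mean and standard deviation, the shift of the false nulls) are assumptions chosen purely for illustration.

```python
import random
from statistics import NormalDist, median


def bh_reject(pvalues, alpha=0.1):
    """Benjamini-Hochberg step-up procedure; returns the set of rejected indices."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k_hat = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / n:
            k_hat = rank  # keep the largest rank that passes its threshold
    return set(order[:k_hat])


def two_sided_pvalues(xs, mean, sd):
    """Two-sided p-values under a Gaussian null N(mean, sd^2)."""
    null = NormalDist(mean, sd)
    return [2.0 * null.cdf(mean - abs(x - mean)) for x in xs]


def robust_null(xs):
    """Assumed plug-in estimators: median for the mean, scaled MAD for the sd."""
    m = median(xs)
    mad = median([abs(x - m) for x in xs])
    return m, 1.4826 * mad  # 1.4826 makes the MAD consistent for a Gaussian sd


def simulate(n=2000, k=50, null_mean=1.0, null_sd=2.0, shift=10.0, seed=0):
    """n test statistics; the first k are false nulls shifted by `shift`."""
    rng = random.Random(seed)
    xs = [rng.gauss(null_mean, null_sd) for _ in range(n)]
    for i in range(k):
        xs[i] += shift
    return xs


def fdp(rejected, k):
    """False discovery proportion when indices 0..k-1 are the false nulls."""
    false_discoveries = sum(1 for i in rejected if i >= k)
    return false_discoveries / max(len(rejected), 1)


if __name__ == "__main__":
    k = 50
    xs = simulate(k=k)
    est_mean, est_sd = robust_null(xs)
    candidates = {
        "oracle null N(1, 2^2)": (1.0, 2.0),
        "estimated null": (est_mean, est_sd),
        "prescribed null N(0, 1)": (0.0, 1.0),
    }
    for name, (mean, sd) in candidates.items():
        rejected = bh_reject(two_sided_pvalues(xs, mean, sd), alpha=0.1)
        print(f"{name}: {len(rejected)} rejections, FDP = {fdp(rejected, k):.3f}")
```

With few false nulls (here k is small relative to n), the estimated null stays close to the true one, so the plug-in procedure behaves much like the oracle, whereas the wrongly prescribed null produces many false discoveries. Increasing k toward n in `simulate` is a simple way to see the estimators, and hence the plug-in procedure, degrade.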

