A computationally efficient approach to false discovery rate control and power maximisation via randomisation and mirror statistic.

IF 1.9, Zone 3 (Medicine), JCR Q3, HEALTH CARE SCIENCES & SERVICES

Statistical Methods in Medical Research Pub Date : 2025-06-01 Epub Date: 2025-03-31 DOI:10.1177/09622802251329768

Marco Molinari, Magne Thoresen

Statistical Methods in Medical Research, pp. 1233-1253. Journal Article. PMC full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12209545/pdf/

Citations: 0

Abstract

Simultaneously performing variable selection and inference in high-dimensional regression models is an open challenge in statistics and machine learning. The growing availability of datasets with very large numbers of variables calls for statistical procedures that accurately select the most important predictors in a high-dimensional space while controlling the false discovery rate (FDR) of the selection. In this paper, we propose combining the Mirror Statistic approach to FDR control with outcome randomisation, in order to maximise the statistical power of the variable selection procedure, measured by the true positive rate. Through extensive simulations, we show that the proposed strategy combines the benefits of the two techniques. The Mirror Statistic is a flexible method for controlling the FDR that requires only mild model assumptions, but it needs two independent sets of regression coefficient estimates, usually obtained by splitting the original dataset. Outcome randomisation is an alternative to data splitting that generates two independent outcomes, which can then be used to estimate the coefficients that enter the Mirror Statistic. Combining the two approaches increases testing power in a number of scenarios, such as highly correlated covariates and high percentages of active variables. Moreover, the method scales to very high-dimensional problems, since the algorithm has a low memory footprint and requires only a single run on the full dataset, as opposed to iterative alternatives such as multiple data splitting.


Source journal

Statistical Methods in Medical Research (Medicine: Mathematical & Computational Biology)

CiteScore: 4.10
Self-citation rate: 4.30%
Articles published: 127
Review time: >12 weeks

About the journal: Statistical Methods in Medical Research is a peer-reviewed scholarly journal and is the leading vehicle for articles in all the main areas of medical statistics and an essential reference for all medical statisticians. This unique journal is devoted solely to statistics and medicine and aims to keep professionals abreast of the many powerful statistical techniques now available to the medical profession. This journal is a member of the Committee on Publication Ethics (COPE).