LARGE-SCALE MULTIVARIATE SPARSE REGRESSION WITH APPLICATIONS TO UK BIOBANK.

Impact factor: 1.4 · Zone 4 (Mathematics) · JCR Q2, Statistics & Probability

Annals of Applied Statistics · Pub Date: 2022-09-01 · Epub Date: 2022-07-19 · DOI: 10.1214/21-aoas1575

Junyang Qian, Yosuke Tanigawa, Ruilin Li, Robert Tibshirani, Manuel A Rivas, Trevor Hastie

Volume 16, Issue 3, Pages 1891-1918. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9454085/pdf/nihms-1830548.pdf

Abstract

In high-dimensional regression problems, often only a relatively small subset of the features is relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced-rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced-rank decomposition. For the sparse regression component, we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present the multiSnpnet package, available at http://github.com/junyangq/multiSnpnet, which works on top of PLINK2 files and which we anticipate will be a valuable tool for generating polygenic risk scores from human genetic studies.
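
The alternating scheme described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a common sparse reduced-rank formulation in which the coefficient matrix is factored as U V^T, V has orthonormal columns, and a row-wise group-lasso penalty on U selects features; that formulation is an assumption made here for illustration and is not taken from the multiSnpnet code. The sketch omits the adaptive screening step, confounders, and missing-value imputation, and the names `fit_srrr`, `lam`, `n_outer`, and `n_inner` are hypothetical.

```python
import numpy as np

def group_soft_threshold(U, t):
    """Shrink the L2 norm of every row of U by t; rows with norm <= t become zero."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return U * scale

def objective(X, Y, U, V, lam):
    """0.5 * ||Y - X U V^T||_F^2 + lam * (sum of row norms of U)."""
    resid = Y - X @ U @ V.T
    return 0.5 * np.sum(resid ** 2) + lam * np.sum(np.linalg.norm(U, axis=1))

def fit_srrr(X, Y, rank, lam, n_outer=50, n_inner=100):
    """Alternate a sparse-regression step (U) with a reduced-rank step (V)."""
    p = X.shape[1]
    # Assumption: start V at the top right singular vectors of Y.
    V = np.linalg.svd(Y, full_matrices=False)[2][:rank].T  # shape (q, rank)
    U = np.zeros((p, rank))
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    history = []
    for _ in range(n_outer):
        # Sparse regression step: group lasso of Z = Y V on X,
        # solved here by plain proximal gradient on all features.
        Z = Y @ V
        for _ in range(n_inner):
            grad = X.T @ (X @ U - Z)
            U = group_soft_threshold(U - step * grad, step * lam)
        # Reduced-rank step: orthogonal Procrustes update of V given U.
        P, _, Qt = np.linalg.svd(Y.T @ (X @ U), full_matrices=False)
        V = P @ Qt
        history.append(objective(X, Y, U, V, lam))
    return U, V, history

# Synthetic example: 5 of 30 features carry a rank-2 signal across 6 outcomes.
rng = np.random.default_rng(0)
n, p, q, r = 200, 30, 6, 2
X = rng.standard_normal((n, p))
U_true = np.zeros((p, r))
U_true[:5] = rng.standard_normal((5, r))
V_true = np.linalg.qr(rng.standard_normal((q, r)))[0]
Y = X @ U_true @ V_true.T + 0.1 * rng.standard_normal((n, q))

U, V, history = fit_srrr(X, Y, rank=r, lam=10.0)
print("selected features:", np.flatnonzero(np.linalg.norm(U, axis=1) > 0))
```

Each step cannot increase the penalized objective, so `history` is non-increasing. At UK Biobank scale the sparse regression step would not touch all features at once; a screening strategy of the kind the abstract describes would solve it on a small candidate subset and then check an optimality condition on the remaining features.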

Source journal: Annals of Applied Statistics (Social Sciences: Statistics & Probability)

CiteScore: 3.10

Self-citation rate: 5.60%

Articles published: 131

Review time: 6-12 weeks

About the journal: Statistical research spans an enormous range from direct subject-matter collaborations to pure mathematical theory. The Annals of Applied Statistics, the newest journal from the IMS, is aimed at papers in the applied half of this range. Published quarterly in both print and electronic form, the journal aims to provide a timely and unified forum for all areas of applied statistics.