A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank.

IF: 4.5; Region 2 (Biology); JCR: Q1 (Agricultural and Biological Sciences)
PLoS Genetics Pub Date : 2020-10-23 eCollection Date: 2020-10-01 DOI:10.1371/journal.pgen.1009141
Junyang Qian, Yosuke Tanigawa, Wenfei Du, Matthew Aguirre, Chris Chang, Robert Tibshirani, Manuel A Rivas, Trevor Hastie
Citations: 68

Abstract


The UK Biobank is a very large, prospective, population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso has, since it was first proposed in statistics, proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimensionality of the UK Biobank pose new challenges for applying the lasso, as many existing algorithms and their implementations do not scale to such large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data larger than the available memory. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports ℓ1-penalized linear, logistic, and Cox regression models, and also extends to the elastic net with an ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
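
The abstract names the idea (screen variables in batches, fit with an existing lasso solver, iterate) without spelling out the steps. The following is a minimal sketch of one plausible reading, not the snpnet implementation: it handles a single penalty value `lam`, has no intercept, keeps all data in memory with NumPy, and uses a small hand-written coordinate-descent solver in place of glmnet. The names `lasso_cd`, `basil_sketch`, and `batch_size` are hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, beta0=None, n_iter=2000, tol=1e-10):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else np.array(beta0, dtype=float)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y - X @ beta  # residual, kept in sync with beta
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ r / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                r -= X[:, j] * delta
                beta[j] = new
                max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    return beta


def basil_sketch(X, y, lam, batch_size=10):
    """Screen in batches, fit on the screened set, repeat until no violations."""
    n, p = X.shape
    active = np.zeros(p, dtype=bool)
    beta = np.zeros(p)
    while True:
        r = y - X @ beta
        # Full pass over all variables; with data larger than memory this
        # would be computed by streaming over blocks of columns.
        score = np.abs(X.T @ r) / n
        # Optimality check: an excluded variable must satisfy score <= lam.
        violators = (~active) & (score > lam * (1 + 1e-8))
        if not violators.any():
            return beta, active
        # Screening: admit the batch of violators with the largest scores.
        cand = np.flatnonzero(violators)
        top = cand[np.argsort(score[cand])[::-1][:batch_size]]
        active[top] = True
        idx = np.flatnonzero(active)
        # Fitting: solve the lasso on the screened columns only (warm start).
        sub = lasso_cd(X[:, idx], y, lam, beta0=beta[idx])
        beta = np.zeros(p)
        beta[idx] = sub


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 200))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)
    beta, active = basil_sketch(X, y, lam=0.5)
    print("columns ever loaded:", int(active.sum()), "of", X.shape[1])
    print("nonzero coefficients:", int(np.count_nonzero(beta)))
```

The loop stops only when every excluded variable passes the optimality check, so the result should agree with a lasso fit on all columns while the solver itself only ever sees the screened subset. The active set grows by at least one variable per round, which guarantees termination.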

Source journal: PLoS Genetics (Biology: Genetics)
CiteScore: 8.10
Self-citation rate: 2.20%
Articles published: 438
Review time: 1 month
About the journal: PLOS Genetics is run by an international Editorial Board, headed by the Editors-in-Chief, Greg Barsh (HudsonAlpha Institute of Biotechnology, and Stanford University School of Medicine) and Greg Copenhaver (The University of North Carolina at Chapel Hill). Articles published in PLOS Genetics are archived in PubMed Central and cited in PubMed.