Toward whole-genome inference of polygenic scores with fast and memory-efficient algorithms.

IF 8.1 | CAS Zone 1 (Biology) | JCR Q1, Genetics & Heredity

American Journal of Human Genetics | Pub Date: 2025-07-03 | Epub Date: 2025-05-26 | DOI: 10.1016/j.ajhg.2025.05.002

Shadi Zabad, Chirayu Anant Haryan, Simon Gravel, Sanchit Misra, Yue Li


Citation count: 0

Abstract


Toward whole-genome inference of polygenic scores with fast and memory-efficient algorithms.

With improved whole-genome sequencing and variant imputation techniques, modern genome-wide association studies (GWASs) have enriched our understanding of the landscape of genetic associations for thousands of disease phenotypes. However, translating the marginal associations for millions of genetic variants to integrated polygenic risk scores (PRSs) that capture their joint effects on the phenotype remains a major challenge. Due to technical and statistical constraints, commonly used PRS methods in this setting either perform heuristic pruning and thresholding or overlook most genetic association signals by restricting inference to small variant sets, such as HapMap3. Here, we present a set of algorithmic improvements and compact data structures that enable scaling summary-statistics-based PRS inference to tens of millions of variants while avoiding numerical instabilities common in such high-dimensional settings. These enhancements consist of a highly compressed linkage-disequilibrium (LD) matrix format, which integrates with streamlined and parallel coordinate-ascent updating schemes. When incorporated into our existing PRS method (VIPRS), the proposed algorithms yield over 50-fold reductions in storage requirements and lead to orders-of-magnitude improvements in runtime and memory efficiency. The updated VIPRS software can now perform variational Bayesian regression over 1.1 million HapMap3 variants in under a minute. Using this scalable implementation, we applied VIPRS to 75 of the most heritable, continuous phenotypes in the UK Biobank, leveraging marginal associations for up to 18 million bi-allelic variants. These experiments demonstrated that VIPRS is 1-2 orders of magnitude more efficient than popular baselines while being competitive with the best-performing methods in terms of prediction accuracy.
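The abstract names two ingredients, a compressed LD matrix and coordinate-ascent updates over summary statistics, without describing their internals. The sketch below is only an illustration of how such pieces can fit together, not the VIPRS implementation: the int8 quantization with scale 127, the sparse per-row layout, the Gaussian (ridge) prior with penalty `lam`, and the function names `compress_ld` and `coordinate_ascent` are all assumptions made for this example. VIPRS itself performs variational Bayesian regression, which is more involved than the simplified update shown here.

```python
from array import array

SCALE = 127  # int8 quantization scale (an assumption, not taken from the paper)


def compress_ld(ld, threshold=0.0):
    """Store each row's off-diagonal LD entries as (indices, int8 values)."""
    rows = []
    for j, row in enumerate(ld):
        idx = array("I")  # column indices of retained neighbours
        val = array("b")  # correlations quantized to signed 8-bit integers
        for k, r in enumerate(row):
            if k != j and abs(r) > threshold:
                idx.append(k)
                val.append(round(r * SCALE))
        rows.append((idx, val))
    return rows


def coordinate_ascent(rows, beta_hat, lam=0.1, n_iter=100):
    """Cyclic coordinate updates for a Gaussian-prior (ridge) model.

    Maximizes beta.beta_hat - 0.5 * beta.R.beta - 0.5 * lam * beta.beta,
    assuming standardized genotypes so that R_jj = 1.
    """
    beta = [0.0] * len(beta_hat)
    for _ in range(n_iter):
        for j, (idx, val) in enumerate(rows):
            # Contribution of correlated neighbours, dequantized on the fly.
            neighbour = sum(v * beta[k] for k, v in zip(idx, val)) / SCALE
            beta[j] = (beta_hat[j] - neighbour) / (1.0 + lam)
    return beta


# Toy example: three variants with made-up LD and marginal effects.
ld = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
]
beta_hat = [0.3, 0.2, -0.1]
rows = compress_ld(ld)
beta = coordinate_ascent(rows, beta_hat)
```

Each update touches only the stored neighbours of one variant, which is why a sparse, quantized row layout pairs naturally with coordinate-wise schemes: memory scales with the number of retained LD entries rather than with the square of the number of variants.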


Source journal

American Journal of Human Genetics (Biology: Genetics)

CiteScore: 14.70
Self-citation rate: 4.10%
Articles published: 185
Review time: 1 month

About the journal: The American Journal of Human Genetics (AJHG) is a monthly journal published by Cell Press, chosen by The American Society of Human Genetics (ASHG) as its premier publication starting from January 2008. AJHG represents Cell Press's first society-owned journal, and both ASHG and Cell Press anticipate significant synergies between AJHG content and that of other Cell Press titles.