Genetic Prediction Modeling in Large Cohort Studies via Boosting Targeted Loss Functions.

IF 1.8, Zone 4 (Medicine), JCR Q3 (Mathematical & Computational Biology)

Statistics in Medicine Pub Date : 2024-12-10 Epub Date: 2024-10-23 DOI:10.1002/sim.10249

Hannah Klinkhammer, Christian Staerk, Carlo Maj, Peter M Krawitz, Andreas Mayr

Pages: 5412-5430. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11586906/pdf/

Abstract

Polygenic risk scores (PRS) aim to predict a trait from genetic information, relying on common genetic variants with low to medium effect sizes. As genotype data are high-dimensional in nature, it is crucial to develop methods that can be applied to large-scale data (large $n$ and large $p$). Many PRS tools aggregate univariate summary statistics from genome-wide association studies into a single score. Recent advancements allow simultaneous modeling of variant effects from individual-level genotype data. In this context, we introduced snpboost, an algorithm that applies statistical boosting on individual-level genotype data to estimate PRS via multivariable regression models. By processing variants iteratively in batches, snpboost can deal with large-scale cohort data. Having solved the technical obstacles due to data dimensionality, the methodological scope can now be broadened, focusing on key objectives for the clinical application of PRS. Similar to most methods in this context, snpboost has, so far, been restricted to quantitative and binary traits. Now, we incorporate more advanced alternatives targeted to the particular aim and outcome. Adapting the loss function extends the snpboost framework to further data situations such as time-to-event and count data. Furthermore, alternative loss functions for continuous outcomes allow us to focus not only on the mean of the conditional distribution but also on other aspects that may be more helpful in the risk stratification of individual patients and can quantify prediction uncertainty, for example, median or quantile regression. This work enhances PRS fitting across multiple model classes previously unfeasible for this data type.
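The idea of adapting the loss function can be illustrated with a minimal sketch of component-wise gradient boosting, in which only the negative gradient changes when moving from mean regression to quantile or count regression. This is not the snpboost implementation: the function names, defaults, and the choice of univariate linear base learners are assumptions made for illustration, the batch-wise processing of variants is omitted, the offset is held fixed, and no column of `X` is assumed to be constant.

```python
import numpy as np

def neg_grad_squared(y, f):
    """Squared-error loss 0.5 * (y - f)^2: the negative gradient is the residual."""
    return y - f

def neg_grad_quantile(y, f, tau=0.5):
    """Pinball loss for the tau-quantile: tau above the fit, tau - 1 otherwise."""
    return np.where(y > f, tau, tau - 1.0)

def neg_grad_poisson(y, f):
    """Poisson negative log-likelihood with log link, exp(f) - y * f."""
    return y - np.exp(f)

def componentwise_boost(X, y, neg_grad, offset, n_steps=200, step_size=0.1):
    """Illustrative component-wise boosting (hypothetical helper, not snpboost).

    Each step fits every centered column to the current negative gradient by
    least squares and updates only the best-fitting column.
    Returns (intercept, coefficients) on the original scale of X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    col_means = X.mean(axis=0)
    Xc = X - col_means                       # centered columns
    col_ss = (Xc ** 2).sum(axis=0)           # assumes no constant columns
    coef = np.zeros(X.shape[1])
    f = np.full(y.shape, float(offset))      # start from the offset only
    for _ in range(n_steps):
        u = neg_grad(y, f)                   # loss-specific working response
        slopes = Xc.T @ u / col_ss           # univariate least-squares slopes
        gain = slopes ** 2 * col_ss          # reduction in residual sum of squares
        j = int(np.argmax(gain))             # best-fitting variable
        coef[j] += step_size * slopes[j]
        f += step_size * slopes[j] * Xc[:, j]
    intercept = float(offset) - col_means @ coef
    return intercept, coef
```

Under these assumptions, passing `neg_grad_squared` with the sample mean as offset targets the conditional mean, while passing `lambda y, f: neg_grad_quantile(y, f, tau=0.9)` with the empirical 0.9-quantile as offset targets an upper quantile. The selection and update steps stay the same in both cases.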

Source journal: Statistics in Medicine (Medicine: Public, Environmental & Occupational Health)

CiteScore: 3.40

Self-citation rate: 10.00%

Articles published: 334

Review time: 2-4 weeks

About the journal: The journal aims to influence practice in medicine and its associated sciences through the publication of papers on statistical and other quantitative methods. Papers will explain new methods and demonstrate their application, preferably through a substantive, real, motivating example or a comprehensive evaluation based on an illustrative example. Alternatively, papers will report on case studies where creative use or technical generalizations of established methodology are directed towards a substantive application. Reviews of, and tutorials on, general topics relevant to the application of statistics to medicine will also be published. The main criteria for publication are appropriateness of the statistical methods to a particular medical problem and clarity of exposition. Papers with primarily mathematical content will be excluded. The journal aims to enhance communication between statisticians, clinicians and medical researchers.