Sparse latent factor regression models for genome-wide and epigenome-wide association studies

Impact factor 0.4; Zone 4, Mathematics; JCR Q4, Biochemistry & Molecular Biology

Statistical Applications in Genetics and Molecular Biology Pub Date : 2022-01-01 DOI:10.1515/sagmb-2021-0035

Basile Jumentier, Kevin Caye, Barbara Heude, Johanna Lepeule, Olivier François



Abstract

Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to account for variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. Here, we present least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. In simulated data, sparse latent factor regression models generally achieved higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator and a Bayesian sparse linear mixed model. In generative model simulations, statistical performance was slightly lower than (while being comparable to) that of non-sparse methods, but in simulations based on empirical data, sparse latent factor regression models were more robust to departure from the model than the non-sparse approaches. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while overcoming multiple testing issues. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.
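
To make the idea of jointly estimating sparse effect sizes and latent confounders more concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes a single variable of interest `x`, a response matrix `Y` (individuals by markers), a fixed number `K` of latent factors, and an L1 penalty `lam` on the effect sizes. The function names, the rank-constrained formulation, and the simulated data are all assumptions made for illustration. The sketch alternates two exact least-squares steps: a truncated SVD for the latent part and a soft-thresholded regression for the effects.

```python
import numpy as np


def soft_threshold(a, lam):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def objective(Y, x, b, W, lam):
    """Penalized least-squares loss: 0.5 * ||Y - x b^T - W||_F^2 + lam * ||b||_1."""
    resid = Y - np.outer(x, b) - W
    return 0.5 * np.sum(resid ** 2) + lam * np.sum(np.abs(b))


def sparse_lfm_sketch(Y, x, K, lam, n_iter=50):
    """Alternating minimization for a toy sparse latent factor regression.

    Hypothetical helper (not from the paper): Y is n x p, x has length n,
    W is constrained to rank <= K, and b holds one effect size per marker.
    """
    n, p = Y.shape
    b = np.zeros(p)
    W = np.zeros((n, p))
    xtx = x @ x
    history = []
    for _ in range(n_iter):
        # Latent step: best rank-K approximation of Y minus the effect term.
        U, s, Vt = np.linalg.svd(Y - np.outer(x, b), full_matrices=False)
        W = (U[:, :K] * s[:K]) @ Vt[:K]
        # Effect step: per-marker lasso with one predictor has a closed form.
        b = soft_threshold(x @ (Y - W), lam) / xtx
        history.append(objective(Y, x, b, W, lam))
    return b, W, history


# Simulated example (all values are arbitrary assumptions).
rng = np.random.default_rng(0)
n, p, K = 100, 200, 2
x = rng.normal(size=n)
true_b = np.zeros(p)
true_b[:10] = 2.0                      # only 10 markers carry a true effect
latent = rng.normal(size=(n, K)) @ rng.normal(size=(K, p))
Y = np.outer(x, true_b) + latent + rng.normal(scale=0.5, size=(n, p))

b_hat, W_hat, history = sparse_lfm_sketch(Y, x, K=K, lam=20.0)
print("non-zero effect sizes:", int(np.count_nonzero(b_hat)))
```

Because each step exactly minimizes the loss over one block of parameters, the recorded loss cannot increase between iterations. In practice the penalty `lam` and the number of factors `K` would need to be chosen from the data, and this toy version does not address the identifiability and calibration issues that a full method has to handle.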


Source journal

Statistical Applications in Genetics and Molecular Biology (Biochemistry & Molecular Biology; Mathematical & Computational Biology)

Self-citation rate

11.10%

Journal description: Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. The focus of the papers should be on the relevant statistical issues, but they should contain a succinct description of the biological problem being considered. The range of topics is wide and includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies. Both original research and review articles will be warmly received.