Sub-sampling graph neural networks for genomic prediction of quantitative phenotypes

IF 2.1 | Zone 3 (Biology) | JCR Q3 (GENETICS & HEREDITY)
Ragini Kihlman, Ilkka Launonen, Mikko J Sillanpää, Patrik Waldmann
Journal: G3: Genes|Genomes|Genetics
Published: 2024-09-09
DOI: 10.1093/g3journal/jkae216 (https://doi.org/10.1093/g3journal/jkae216)
Citation count: 0

Abstract

In genomics, the use of deep learning (DL) is growing rapidly, and DL has demonstrated its ability to uncover complex relationships in large biological and biomedical data sets. With the development of high-throughput sequencing techniques, genomic markers can now be allocated to large sections of a genome. By analysing allele sharing between individuals, one can calculate realized genomic relationships from single nucleotide polymorphism (SNP) data rather than relying on known pedigree relationships under a polygenic model. Traditional approaches to genome-wide prediction (GWP) of quantitative phenotypes use genomic relationships in fixed global covariance modelling, possibly with a non-linear kernel mapping (for example, Gaussian processes). The DL approaches proposed so far for GWP, on the other hand, fail to take into account the non-Euclidean graph structure of relationships between individuals over several generations. In this paper, we propose one global graph convolutional neural network (GCN) and one local sub-sampling architecture (GCN-RS), both specifically designed to perform regression analysis based on genomic relationship information. A GCN is tailored to non-Euclidean spaces and consists of several layers of graph convolutions. Through these graph convolutional layers, the GCN maps input genomic markers to their quantitative phenotype values. The GCN-RS architecture is designed to further improve the GCN's performance by sub-sampling the graph to reduce the dimensionality of the input data. The graphs are constructed using an iterative nearest neighbour approach. Comparisons show that the GCN-RS outperforms the popular Genomic Best Linear Unbiased Predictor (GBLUP) method on one simulated and three real data sets from wheat, mice and pigs, with a predictive improvement of 4.4% to 49.4% in terms of test mean squared error (MSE). This indicates that GCN-RS is a promising tool for genomic prediction in plants and animals. Furthermore, GCN-RS is computationally efficient, making it a viable option for large-scale applications.
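
As a rough illustration of the pipeline the abstract describes (SNP data, then a relationship graph, then graph convolution), the following Python sketch may help. It is not the authors' implementation. The VanRaden-style relationship matrix, the single-pass k-nearest-neighbour graph (the abstract mentions an iterative nearest neighbour approach whose details are not given here), the symmetric normalisation of the adjacency matrix, and every function and variable name are assumptions made for illustration. The sub-sampling step of GCN-RS and the training of the weights are not shown.

```python
import numpy as np


def genomic_relationship_matrix(genotypes: np.ndarray) -> np.ndarray:
    """VanRaden-style relationship matrix from an (individuals x SNPs) array coded 0/1/2."""
    allele_freq = genotypes.mean(axis=0) / 2.0            # p_j for each SNP
    centered = genotypes - 2.0 * allele_freq              # Z = M - 2p
    scale = 2.0 * np.sum(allele_freq * (1.0 - allele_freq))
    return centered @ centered.T / scale


def knn_adjacency(similarity: np.ndarray, k: int) -> np.ndarray:
    """Symmetric 0/1 adjacency linking each individual to its k most related peers."""
    n = similarity.shape[0]
    masked = similarity.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)                     # never pick oneself
    neighbours = np.argsort(-masked, axis=1)[:, :k]       # top-k indices per row
    adjacency = np.zeros((n, n))
    adjacency[np.repeat(np.arange(n), k), neighbours.ravel()] = 1.0
    return np.maximum(adjacency, adjacency.T)             # make the graph undirected


def gcn_layer(adjacency: np.ndarray, features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])        # add self-loops
    inv_sqrt_deg = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)


# Toy data only: 20 individuals, 50 SNPs, random genotypes and untrained weights.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(20, 50)).astype(float)
grm = genomic_relationship_matrix(genotypes)
adjacency = knn_adjacency(grm, k=3)
weights = rng.normal(scale=0.1, size=(50, 8))
hidden = gcn_layer(adjacency, genotypes, weights)
print(hidden.shape)
```

In this sketch each individual is a node whose features are its own marker genotypes, and the layer mixes those features with the features of its closest relatives. A regression model in this style would stack such layers and end with a linear output that predicts the phenotype value.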
Source journal: G3: Genes|Genomes|Genetics (GENETICS & HEREDITY)
CiteScore: 5.10
Self-citation rate: 3.80%
Articles published: 305
Review time: 3-8 weeks
Journal description: G3: Genes, Genomes, Genetics provides a forum for the publication of high-quality foundational research, particularly research that generates useful genetic and genomic information such as genome maps, single gene studies, genome-wide association and QTL studies, as well as genome reports, mutant screens, and advances in methods and technology. The Editorial Board of G3 believes that rapid dissemination of these data is the necessary foundation for analysis that leads to mechanistic insights. G3, published by the Genetics Society of America, meets the critical and growing need of the genetics community for rapid review and publication of important results in all areas of genetics. G3 offers the opportunity to publish the puzzling finding or to present unpublished results that may not have been submitted for review and publication due to a perceived lack of a potential high-impact finding. G3 has earned the DOAJ Seal, which is a mark of certification for open access journals, awarded by DOAJ to journals that achieve a high level of openness, adhere to Best Practice and high publishing standards.