EmbedGEM: a framework to evaluate the utility of embeddings for genetic discovery.

Impact factor: 2.4 | JCR Q2 | Mathematical & Computational Biology
Bioinformatics Advances | Pub Date: 2024-09-17 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae135
Sumit Mukherjee, Zachary R McCaw, Jingwen Pei, Anna Merkoulovitch, Tom Soare, Raghav Tandon, David Amar, Hari Somineni, Christoph Klein, Santhosh Satapati, David Lloyd, Christopher Probert, Daphne Koller, Colm O'Dushlaine, Theofanis Karaletsos
Volume 4, issue 1, article vbae135. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11632179/pdf/

Abstract
EmbedGEM: a framework to evaluate the utility of embeddings for genetic discovery.

Summary: Machine learning-derived embeddings are compressed representations of high-content data modalities. Embeddings can capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have a major limitation: it is unclear whether genetic variants associated with embeddings are relevant to the disease or trait of interest. In this work, we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM focuses on comparing embeddings along two axes: heritability and disease relevance. As measures of heritability, we consider the number of genome-wide significant associations and the mean χ² statistic at significant loci. For disease relevance, we compute polygenic risk scores for each embedding principal component, then evaluate their association with high-confidence disease or trait labels in a held-out evaluation patient set. While our development of EmbedGEM is motivated by embeddings, the approach is generally applicable to multivariate traits and can readily be extended to accommodate additional metrics along the evaluation axes. We demonstrate EmbedGEM's utility by evaluating embeddings and multivariate traits in two separate datasets: (i) a synthetic dataset simulated to demonstrate the ability of the framework to correctly rank traits based on their heritability and disease relevance and (ii) real data from the UK Biobank, including metabolic and liver-related traits. Importantly, we show that greater disease relevance does not automatically follow from greater heritability.

Availability and implementation: https://github.com/insitro/EmbedGEM.
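
The two evaluation axes described in the summary can be illustrated with a small sketch. This is not the EmbedGEM implementation or its API; every function name below is hypothetical. It assumes that each locus is already summarized by one lead-variant χ² statistic with 1 degree of freedom, that the genome-wide significance threshold is the conventional p < 5e-8 (the abstract does not state a threshold), and that Pearson correlation stands in for the association test between each polygenic risk score and the held-out labels (the abstract does not name the statistic).

```python
import math
from statistics import mean

# Conventional genome-wide significance threshold (an assumption; not stated in the abstract).
GENOME_WIDE_P = 5e-8


def chi2_sf_1df(x):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def heritability_metrics(chi2_per_locus, p_threshold=GENOME_WIDE_P):
    """Return (number of significant loci, mean chi-square at those loci)."""
    significant = [x for x in chi2_per_locus if chi2_sf_1df(x) < p_threshold]
    mean_chi2 = mean(significant) if significant else float("nan")
    return len(significant), mean_chi2


def polygenic_risk_score(dosages, effect_sizes):
    """PRS for one individual: sum over variants of allele dosage times effect size."""
    return sum(d * b for d, b in zip(dosages, effect_sizes))


def pearson_r(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def disease_relevance(heldout_dosages, effects_per_pc, labels):
    """Correlate the PRS of each embedding principal component with held-out labels.

    heldout_dosages: one list of allele dosages per held-out individual.
    effects_per_pc:  one list of per-variant effect sizes per principal component.
    labels:          one disease or trait label per held-out individual.
    """
    correlations = []
    for effects in effects_per_pc:
        prs = [polygenic_risk_score(d, effects) for d in heldout_dosages]
        correlations.append(pearson_r(prs, labels))
    return correlations


if __name__ == "__main__":
    # Toy inputs, invented purely for illustration.
    n_loci, mean_chi2 = heritability_metrics([40.0, 50.0, 10.0, 2.0])
    print("significant loci:", n_loci, "mean chi2:", mean_chi2)

    dosages = [[0, 1], [1, 0], [2, 1], [2, 2]]
    effects = [[1.0, 0.0], [0.0, 1.0]]  # one effect vector per embedding PC
    labels = [0, 0, 1, 1]
    print("PRS-label correlation per PC:", disease_relevance(dosages, effects, labels))
```

Keeping the two axes as separate functions mirrors the abstract's point that they can disagree: an embedding may yield many strong loci and still produce risk scores that track the disease label poorly.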
