EmbedGEM: a framework to evaluate the utility of embeddings for genetic discovery.

Impact factor: 2.4 | JCR quartile: Q2 | Category: Mathematical & Computational Biology

Bioinformatics Advances. Publication date: 2024-09-17. eCollection date: 2024-01-01. DOI: 10.1093/bioadv/vbae135

Sumit Mukherjee, Zachary R McCaw, Jingwen Pei, Anna Merkoulovitch, Tom Soare, Raghav Tandon, David Amar, Hari Somineni, Christoph Klein, Santhosh Satapati, David Lloyd, Christopher Probert, Daphne Koller, Colm O'Dushlaine, Theofanis Karaletsos



Abstract




Summary: Machine learning-derived embeddings are a compressed representation of high-content data modalities. Embeddings can capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have a major limitation: it is unclear whether genetic variants associated with embeddings are relevant to the disease or trait of interest. In this work, we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM focuses on comparing embeddings along two axes: heritability and disease relevance. As measures of heritability, we consider the number of genome-wide significant associations and the mean $\chi^{2}$ statistic at significant loci. For disease relevance, we compute polygenic risk scores for each embedding principal component, then evaluate their association with high-confidence disease or trait labels in a held-out evaluation patient set. While our development of EmbedGEM is motivated by embeddings, the approach is generally applicable to multivariate traits and can readily be extended to accommodate additional metrics along the evaluation axes. We demonstrate EmbedGEM's utility by evaluating embeddings and multivariate traits in two separate datasets: (i) a synthetic dataset simulated to demonstrate the ability of the framework to correctly rank traits based on their heritability and disease relevance and (ii) real data from the UK Biobank, including metabolic and liver-related traits. Importantly, we show that greater disease relevance does not automatically follow from greater heritability.
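The two evaluation axes can be illustrated with a minimal Python sketch. This is not the EmbedGEM implementation or its API; every function name, the toy data, and the 5e-8 significance threshold are assumptions made for illustration. The sketch further assumes that GWAS summary statistics have already been reduced to independent loci, and it uses a plain Pearson correlation as a stand-in for whatever association test the framework applies to polygenic risk scores in the held-out set.

```python
import numpy as np

# Conventional genome-wide significance level (an assumption, not taken from the paper).
GWS_THRESHOLD = 5e-8


def heritability_proxies(chi2, pvals, threshold=GWS_THRESHOLD):
    """Return (number of significant loci, mean chi-square among those loci)."""
    chi2 = np.asarray(chi2, dtype=float)
    significant = np.asarray(pvals, dtype=float) < threshold
    n_hits = int(significant.sum())
    # The mean is undefined when no locus passes the threshold.
    mean_chi2 = float(chi2[significant].mean()) if n_hits else float("nan")
    return n_hits, mean_chi2


def polygenic_risk_score(genotypes, effect_sizes):
    """Weighted sum of allele dosages, giving one score per individual."""
    genotypes = np.asarray(genotypes, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    return genotypes @ effect_sizes


def prs_label_association(prs, labels):
    """Pearson correlation between a PRS and a held-out label (stand-in test)."""
    return float(np.corrcoef(prs, labels)[0, 1])


# Toy summary statistics for one embedding principal component (hypothetical values).
toy_chi2 = [35.0, 2.0, 40.0, 1.0]
toy_pvals = [3.3e-9, 0.16, 2.5e-10, 0.32]
n_hits, mean_chi2 = heritability_proxies(toy_chi2, toy_pvals)

# Toy held-out individuals: rows are people, columns are variant dosages (0, 1, 2).
toy_genotypes = [[0, 1, 2], [2, 1, 0], [1, 1, 1], [2, 2, 2]]
toy_effects = [0.5, -0.2, 0.1]
toy_labels = [0, 1, 0, 1]
prs = polygenic_risk_score(toy_genotypes, toy_effects)
r = prs_label_association(prs, toy_labels)

print(f"significant loci: {n_hits}, mean chi2: {mean_chi2:.2f}, PRS-label r: {r:.3f}")
```

Keeping the heritability proxies separate from the PRS association mirrors the abstract's point that the two axes are distinct: an embedding can yield many strong associations and still produce a risk score that tracks the disease label poorly.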

Availability and implementation: https://github.com/insitro/EmbedGEM.

