$Γ$-VAE: Curvature regularized variational autoencoders for uncovering emergent low dimensional geometric structure in high dimensional data

Jason Z. Kim, Nicolas Perrin-Gilbert, Erkan Narmanli, Paul Klein, Christopher R. Myers, Itai Cohen, Joshua J. Waterfall, James P. Sethna
arXiv - QuanBio - Genomics. Published 2024-03-02. DOI: arxiv-2403.01078 (https://doi.org/arxiv-2403.01078)

Abstract

Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces. For example, despite the tens of thousands of genes in the human genome, the principled study of genomics is fruitful because biological processes rely on coordinated organization that results in lower-dimensional phenotypes. To uncover this organization, many nonlinear dimensionality reduction techniques have successfully embedded high-dimensional data into low-dimensional spaces by preserving local similarities between data points. However, the nonlinearities in these methods allow for too much curvature to preserve general trends across multiple non-neighboring data clusters, thereby limiting their interpretability and generalizability to out-of-distribution data. Here, we address both of these limitations by regularizing the curvature of manifolds generated by variational autoencoders, a process we call "$\Gamma$-VAE". We demonstrate its utility using two example data sets: bulk RNA-seq from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project; and single-cell RNA-seq from a lineage tracing experiment in hematopoietic stem cell differentiation. We find that the resulting regularized manifolds identify mesoscale structure associated with different cancer cell types, and accurately re-embed tissues from completely unseen, out-of-distribution cancers as if the model had originally been trained on them. Finally, we show that preserving long-range relationships to differentiated cells separates undifferentiated cells (which have not yet specialized) according to their eventual fate. Broadly, we anticipate that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models in any high-dimensional system with emergent low-dimensional behavior.
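The abstract does not state the form of the curvature regularizer, so the following is only an illustrative sketch of one way such a penalty could be built, not the authors' method. It assumes a decoder that maps latent points to data space and penalizes the squared second directional derivative of that map, estimated by central finite differences along random unit directions. The names `curvature_penalty`, `decoder`, `eps`, and `n_dirs` are hypothetical.

```python
import numpy as np


def curvature_penalty(decoder, z, eps=1e-2, n_dirs=8, seed=0):
    """Finite-difference proxy for the curvature of a decoder map.

    Assumption (not from the paper): curvature is measured as the mean
    squared second directional derivative of decoder(z) along random
    unit directions in latent space.
    """
    rng = np.random.default_rng(seed)
    z = np.atleast_2d(np.asarray(z, dtype=float))  # shape (batch, latent_dim)
    total = 0.0
    for _ in range(n_dirs):
        # Draw one random unit direction per latent point.
        u = rng.normal(size=z.shape)
        u /= np.linalg.norm(u, axis=1, keepdims=True)
        # Central second difference approximates d^2/dt^2 decoder(z + t*u) at t = 0.
        second = (decoder(z + eps * u) - 2.0 * decoder(z) + decoder(z - eps * u)) / eps**2
        total += np.mean(np.sum(second**2, axis=1))
    return total / n_dirs


# Toy decoders for illustration: a flat (linear) map and a curved (quadratic) map.
W = np.array([[1.0, 2.0, -1.0], [0.5, -0.3, 0.8]])


def linear_decoder(z):
    return z @ W


def quadratic_decoder(z):
    return z**2


z_batch = np.array([[0.1, -0.4], [1.2, 0.7], [-0.9, 0.3]])
flat = curvature_penalty(linear_decoder, z_batch)
curved = curvature_penalty(quadratic_decoder, z_batch)
```

In a training loop, a term like this would be added to the usual VAE loss with a weighting coefficient, so that a linear (flat) decoder incurs essentially no penalty while a strongly bent one is discouraged. With an automatic-differentiation framework the second derivative could be computed exactly instead of by finite differences.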