HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data.

ArXiv Pub Date : 2025-09-25

Hiren Madhu, João Felipe Rocha, Tinglin Huang, Siddharth Viswanath, Smita Krishnaswamy, Rex Ying



Abstract

Single-cell transcriptomics and proteomics have become a rich source of data-driven insights into biology, enabling the use of advanced deep learning methods to understand cellular heterogeneity and gene expression at the single-cell level. Spatial-omics data promise to characterize cells within their tissue context, because they provide both spatial coordinates and intracellular transcriptional or protein counts. Proteomics offers a complementary view by directly measuring proteins, which are the primary effectors of cellular function and key therapeutic targets. However, existing models ignore either the spatial information or the complex genetic and proteomic programs within cells, so they cannot infer how a cell's internal regulation adapts to microenvironmental cues. Furthermore, these models often rely on fixed gene vocabularies, which hinders their generalizability to unseen genes. In this paper, we introduce HEIST, a hierarchical graph transformer foundation model for spatial transcriptomics and proteomics. HEIST models tissues as hierarchical graphs: the higher-level graph is a spatial cell graph, and each cell, in turn, is represented by its lower-level gene co-expression network graph. HEIST performs both intra-level and cross-level message passing to exploit this hierarchy in its embeddings, and can thus generalize to novel data types, including spatial proteomics, without retraining. HEIST is pretrained on 22.3M cells from 124 tissues across 15 organs using spatially-aware contrastive and masked autoencoding objectives. Unsupervised analysis of HEIST embeddings reveals spatially informed subpopulations missed by prior models. Downstream evaluations demonstrate generalizability to proteomics data and state-of-the-art performance in clinical outcome prediction, cell type annotation, and gene imputation across multiple technologies.
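The two-level structure described in the abstract can be illustrated with a minimal sketch. The abstract does not say how either graph is constructed, so everything below is an assumption made for illustration rather than the authors' method: the cell graph is built with k-nearest neighbors over spatial coordinates, the gene graph is built by thresholding Pearson correlation across cells, and a single gene edge set is shared by all cells. The function names (`knn_cell_graph`, `coexpression_graph`, `build_hierarchical_graph`) and the parameters `k` and `threshold` are hypothetical.

```python
import math
from itertools import combinations


def knn_cell_graph(coords, k):
    """Higher level: directed edge from each cell to its k nearest cells.

    Assumption: kNN over Euclidean distance; the abstract does not
    specify how the spatial cell graph is built.
    """
    edges = set()
    for i, p in enumerate(coords):
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(p, coords[j]),
        )
        for j in others[:k]:
            edges.add((i, j))
    return edges


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def coexpression_graph(expr, threshold):
    """Lower level: link genes whose |Pearson r| across cells >= threshold.

    Assumption: correlation thresholding stands in for whatever
    co-expression estimate the model actually uses.
    """
    n_genes = len(expr[0])
    cols = [[row[g] for row in expr] for g in range(n_genes)]
    return {
        (g, h)
        for g, h in combinations(range(n_genes), 2)
        if abs(pearson(cols[g], cols[h])) >= threshold
    }


def build_hierarchical_graph(coords, expr, k=2, threshold=0.9):
    """Pair the spatial cell graph with one gene graph per cell.

    Each cell keeps its own expression values as gene-node features;
    the gene edge set is shared across cells (an assumption).
    """
    gene_edges = coexpression_graph(expr, threshold)
    return {
        "cell_edges": knn_cell_graph(coords, k),
        "cell_gene_graphs": [
            {"gene_features": list(row), "gene_edges": gene_edges}
            for row in expr
        ],
    }
```

Here `coords` is a list of per-cell `(x, y)` positions and `expr` is a list of per-cell expression vectors in the same gene order. The sketch only builds the data structure; the intra-level and cross-level message passing and the pretraining objectives are not modeled.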
