EpiFoundation: A Foundation Model for Single-Cell ATAC-seq via Peak-to-Gene Alignment.

bioRxiv: the preprint server for biology. Pub Date: 2025-09-28. DOI: 10.1101/2025.02.05.636688

Juncheng Wu, Changxin Wan, Zhicheng Ji, Yuyin Zhou, Wenpin Hou


Abstract

Foundation models exhibit strong capabilities for downstream tasks by learning generalized representations through self-supervised pre-training on large datasets. While several foundation models have been developed for single-cell RNA-seq (scRNA-seq) data, models tailored to single-cell ATAC-seq (scATAC-seq), which measures epigenetic information in individual cells, are still lacking. The principal challenge in developing such a model lies in the vast number of scATAC peaks and the significant sparsity of the data, which complicates the formulation of peak-to-peak correlations. To address this challenge, we introduce EpiFoundation, a foundation model for learning cell representations from the high-dimensional and sparse space of peaks. EpiFoundation relies on a cross-modality pre-training procedure with two key technical innovations. First, EpiFoundation processes only the non-zero peak set, thereby increasing the density of cell-specific information in the input data. Second, EpiFoundation uses dense gene expression information to supervise the pre-training process, aligning peak-to-gene correlations. EpiFoundation can handle various types of downstream tasks, including cell-type annotation, batch correction, and gene expression prediction. To train and validate EpiFoundation, we curated MiniAtlas, a dataset of 100,000+ single cells with paired scRNA-seq and scATAC-seq data, along with diverse test sets spanning various tissues and cell types for robust evaluation. EpiFoundation demonstrates state-of-the-art performance across multiple tissues and diverse downstream tasks.
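The first idea in the abstract, feeding the model only the non-zero peak set, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `nonzero_peak_tokens`, the padding value, the `max_len` cap, and the random subsampling of cells with too many open peaks are all assumptions made for illustration, and the toy matrix is invented.

```python
import numpy as np

def nonzero_peak_tokens(cell_by_peak, max_len, pad_id=-1, seed=0):
    """Hypothetical helper: for each cell, keep only the indices of
    accessible (non-zero) peaks, padded or subsampled to max_len."""
    rng = np.random.default_rng(seed)
    n_cells = cell_by_peak.shape[0]
    tokens = np.full((n_cells, max_len), pad_id, dtype=np.int64)
    for i, row in enumerate(cell_by_peak):
        idx = np.flatnonzero(row)  # indices of the non-zero peaks in this cell
        if idx.size > max_len:
            # Assumption: subsample at random when a cell has too many open peaks.
            idx = np.sort(rng.choice(idx, size=max_len, replace=False))
        tokens[i, :idx.size] = idx
    return tokens

# Toy binarized accessibility matrix (invented): 3 cells x 8 peaks.
toy = np.array([
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 1],
])
tokens = nonzero_peak_tokens(toy, max_len=4)
print(tokens)
```

Each row of `tokens` is a short sequence of peak indices rather than a long, mostly zero vector, which is the sense in which the input becomes denser. How EpiFoundation actually orders, truncates, or embeds these peaks is not stated in the abstract.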

