Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings.

Impact factor: 4.0; JCR quartile: Q1 (Genetics & Heredity)

NAR Genomics and Bioinformatics. Pub Date: 2024-07-05. eCollection Date: 2024-09-01. DOI: 10.1093/nargab/lqae073

Nathan J LeRoy, Jason P Smith, Guangtao Zheng, Julia Rymuza, Erfaneh Gharavi, Donald E Brown, Aidong Zhang, Nathan C Sheffield

NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae073. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11224678/pdf/


Abstract

Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower-dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach: building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is open source, implemented in Python, and available on GitHub at https://github.com/databio/geniml. Pre-trained models from this work are publicly available on Hugging Face at https://huggingface.co/databio.
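The abstract does not spell out how region embeddings become cell embeddings, or how annotation is performed, so the following is only a rough sketch of the general idea under stated assumptions. It assumes that a cell is represented by the mean of the pre-trained vectors of its accessible regions, and that a label is transferred by majority vote among the most similar reference cells. The region names, the 2-D vectors, and the helper names `embed_cell` and `annotate` are made up for illustration and are not the geniml API.

```python
from collections import Counter
from math import sqrt

# Hypothetical pre-trained region embeddings (toy 2-D vectors, illustration only).
REGION_VECTORS = {
    "chr1:100-200": [1.0, 0.0],
    "chr1:500-600": [0.8, 0.2],
    "chr2:100-200": [0.0, 1.0],
    "chr2:700-800": [0.2, 0.8],
}


def embed_cell(accessible_regions, region_vectors=REGION_VECTORS):
    """Average the vectors of a cell's accessible regions (assumed pooling rule).

    Regions missing from the pre-trained vocabulary are skipped.
    """
    known = [region_vectors[r] for r in accessible_regions if r in region_vectors]
    if not known:
        raise ValueError("cell has no regions in the pre-trained vocabulary")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def annotate(cell_vector, reference, k=3):
    """Majority vote over the k reference cells most similar to cell_vector.

    `reference` is a list of (vector, label) pairs from labeled reference cells.
    """
    ranked = sorted(reference, key=lambda item: cosine(cell_vector, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labeled reference cells, embedded with the same pre-trained vectors.
reference = [
    (embed_cell(["chr1:100-200"]), "type A"),
    (embed_cell(["chr1:100-200", "chr1:500-600"]), "type A"),
    (embed_cell(["chr2:100-200"]), "type B"),
    (embed_cell(["chr2:100-200", "chr2:700-800"]), "type B"),
]

# An unseen cell; "chrX:1-2" is not in the vocabulary and is ignored.
query = embed_cell(["chr1:500-600", "chrX:1-2"])
predicted = annotate(query, reference, k=3)
```

In this sketch no model is retrained for the new cell: the query is projected with the existing region vectors and labeled from reference cells alone, which is the sense in which a pre-trained model can be reused on unseen data.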
