annATAC: automatic cell type annotation for scATAC-seq data based on language model.

IF 4.4, Zone 1 (Biology), JCR Q1 (Biology)

BMC Biology Pub Date : 2025-05-28 DOI:10.1186/s12915-025-02244-5

Lingyu Cui, Fang Wang, Hongfei Li, Qiaoming Liu, Murong Zhou, Guohua Wang

BMC Biology 23(1):145 (2025). Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12121080/pdf/

Citations: 0

Abstract

Background: Cell type annotation is the cornerstone of downstream analysis of single-cell data. However, scATAC-seq data are highly sparse and high-dimensional, which makes annotation challenging.

Results: We introduce annATAC, a novel language-model-based method for the automatic annotation of cell types in scATAC-seq data. The method consists of three stages. In the pre-training stage, the model is trained on a large amount of unlabeled data and learns the interactions between peaks, building a preliminary understanding of the data features. In the fine-tuning stage, a small amount of labeled data is used for a second round of training, which enables the model to identify cell types accurately. Finally, in the prediction stage, the trained model is applied to annotate scATAC-seq data.
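The three-stage workflow (pre-training on unlabeled cells, fine-tuning on a few labeled cells, prediction) can be illustrated with a deliberately small Python sketch. This is not the annATAC implementation: as a stand-in for the language model it uses peak co-accessibility frequencies, and as a stand-in for the fine-tuned classifier it uses nearest-centroid matching. The function names (`pretrain`, `embed`, `fine_tune`, `predict`), the six-peak toy data, and the cell-type labels are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each cell is the set of its accessible peak indices,
# i.e. a sparse binary accessibility profile over n_peaks peaks.


def pretrain(unlabeled_cells, n_peaks):
    """Stage 1 (stand-in): learn peak-peak co-accessibility from unlabeled cells.

    Returns a matrix where co[a][b] estimates P(peak b open | peak a open).
    """
    co = [[0.0] * n_peaks for _ in range(n_peaks)]
    counts = [0] * n_peaks
    for cell in unlabeled_cells:
        for p in cell:
            counts[p] += 1
        for a, b in combinations(sorted(cell), 2):
            co[a][b] += 1
            co[b][a] += 1
    for a in range(n_peaks):
        if counts[a]:
            co[a] = [c / counts[a] for c in co[a]]
    return co


def embed(cell, co):
    """Densify a sparse profile using the relationships learned in stage 1."""
    n = len(co)
    vec = [0.0] * n
    for p in cell:
        vec[p] += 1.0
        for q in range(n):
            vec[q] += co[p][q]
    norm = sum(v * v for v in vec) ** 0.5 or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def fine_tune(labeled_cells, co):
    """Stage 2 (stand-in): build one centroid per cell type from few labels."""
    grouped = defaultdict(list)
    for cell, label in labeled_cells:
        grouped[label].append(embed(cell, co))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(cell, co, centroids):
    """Stage 3: assign the label whose centroid best matches the cell."""
    vec = embed(cell, co)

    def score(label):
        return sum(v * c for v, c in zip(vec, centroids[label]))

    return max(centroids, key=score)


# Unlabeled cells: peaks 0-2 tend to open together, as do peaks 3-5.
unlabeled = [{0, 1}, {1, 2}, {0, 2}, {0, 1, 2}, {3, 4}, {4, 5}, {3, 5}, {3, 4, 5}]
# Only two labeled cells (labels are hypothetical).
labeled = [({0, 1}, "T cell"), ({4, 5}, "B cell")]

co = pretrain(unlabeled, n_peaks=6)
centroids = fine_tune(labeled, co)

# The query cell shares no open peak with the labeled "T cell" example,
# yet the learned co-accessibility links peak 2 to peaks 0 and 1.
print(predict({2}, co, centroids))  # prints: T cell
```

In this toy setting the query cell `{2}` has no raw overlap with either labeled cell, so a classifier working on raw profiles could not separate the two types; the relationships learned from unlabeled data are what make the assignment possible. That is the intuition behind pre-training on sparse data, shown here with a far simpler model than the one the paper describes.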

Conclusions: Compared with other automatic annotation methods across multiple datasets, annATAC achieves superior annotation performance. Further experiments show that annATAC has strong potential for identifying marker peaks and marker motifs. annATAC is expected to support deeper and more precise analysis in scATAC-seq research and thereby advance related biomedical research.

Source journal

BMC Biology (Biology)

CiteScore: 7.80

Self-citation rate: 1.90%

Articles published: 260

Review time: 3 months

About the journal: BMC Biology is a broad-scope journal covering all areas of biology. Our content includes research articles, new methods and tools. BMC Biology also publishes reviews, Q&A, and commentaries.