annATAC: automatic cell type annotation for scATAC-seq data based on language model.

IF 4.4 · Zone 1 (Biology) · JCR Q1 (BIOLOGY)
Lingyu Cui, Fang Wang, Hongfei Li, Qiaoming Liu, Murong Zhou, Guohua Wang
BMC Biology 23(1): 145. Published 2025-05-28. Journal Article.
DOI: 10.1186/s12915-025-02244-5 (https://doi.org/10.1186/s12915-025-02244-5)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12121080/pdf/
Citations: 0

Abstract

Background: Cell type annotation is the cornerstone of downstream analysis of single-cell data. However, scATAC-seq data are highly sparse and high-dimensional, which makes annotation particularly challenging.

Results: We introduce annATAC, a language-model-based method for the automatic annotation of cell types in scATAC-seq data. The method has three stages. In the pre-training stage, the model is trained on a large amount of unlabeled data to learn the interactions between peaks, which gives it a preliminary understanding of the data features. In the fine-tuning stage, a small amount of labeled data is used to train the model further so that it can identify cell types accurately. In the prediction stage, the trained model is applied to annotate scATAC-seq data.
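
The three-stage workflow described above (pre-train on unlabeled data, fine-tune on a few labeled cells, then predict) can be illustrated with a toy sketch. This is not the annATAC implementation: in place of a language model it swaps in a simple peak co-accessibility table and a nearest-centroid classifier, and every name in it (`pretrain`, `embed`, `fine_tune`, `predict`, `simulate_cell`) as well as the simulated two-type data is a hypothetical assumption made only for illustration.

```python
import random

def pretrain(unlabeled):
    """Stage 1 (stand-in): learn peak-peak co-accessibility from unlabeled binary cells."""
    n_peaks = len(unlabeled[0])
    open_count = [0] * n_peaks
    co_count = [[0] * n_peaks for _ in range(n_peaks)]
    for cell in unlabeled:
        open_idx = [i for i, v in enumerate(cell) if v]
        for i in open_idx:
            open_count[i] += 1
            for j in open_idx:
                co_count[i][j] += 1
    # cooc[i][j] estimates P(peak j open | peak i open)
    return [[co_count[i][j] / open_count[i] if open_count[i] else 0.0
             for j in range(n_peaks)] for i in range(n_peaks)]

def embed(cell, cooc):
    """Smooth a sparse cell by averaging the co-accessibility rows of its open peaks."""
    open_idx = [i for i, v in enumerate(cell) if v]
    if not open_idx:
        return [0.0] * len(cell)
    return [sum(cooc[i][j] for i in open_idx) / len(open_idx)
            for j in range(len(cell))]

def fine_tune(labeled_cells, labels, cooc):
    """Stage 2 (stand-in): fit one centroid per cell type from a few labeled cells."""
    sums, counts = {}, {}
    for cell, label in zip(labeled_cells, labels):
        vec = embed(cell, cooc)
        if label not in sums:
            sums[label], counts[label] = [0.0] * len(vec), 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def predict(cell, cooc, centroids):
    """Stage 3: annotate a cell with the label of the nearest centroid."""
    vec = embed(cell, cooc)
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

def simulate_cell(cell_type, rng, n_peaks=20, p_open=0.4, p_noise=0.02):
    """Hypothetical toy data: type "A" opens peaks 0-9, type "B" opens peaks 10-19."""
    lo, hi = (0, n_peaks // 2) if cell_type == "A" else (n_peaks // 2, n_peaks)
    return [1 if rng.random() < (p_open if lo <= j < hi else p_noise) else 0
            for j in range(n_peaks)]

rng = random.Random(0)

# Stage 1: many unlabeled cells
unlabeled = [simulate_cell(rng.choice("AB"), rng) for _ in range(400)]
cooc = pretrain(unlabeled)

# Stage 2: only five labeled cells per type
train_labels = ["A"] * 5 + ["B"] * 5
train_cells = [simulate_cell(t, rng) for t in train_labels]
centroids = fine_tune(train_cells, train_labels, cooc)

# Stage 3: annotate held-out cells
test_labels = [rng.choice("AB") for _ in range(100)]
test_cells = [simulate_cell(t, rng) for t in test_labels]
predictions = [predict(c, cooc, centroids) for c in test_cells]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(f"toy accuracy: {accuracy:.2f}")
```

The point of the sketch is the division of labor: the unlabeled data shape the representation (here, which peaks tend to open together), so that very few labeled cells are enough to separate the cell types in the final step.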

Conclusions: Across multiple datasets, annATAC outperforms other automatic annotation methods in annotation performance. Further experiments show that annATAC has strong potential for identifying marker peaks and marker motifs. annATAC is expected to enable deeper and more precise analysis in scATAC-seq research and thereby to support progress in related biomedical research.

Source journal: BMC Biology (Biology)
CiteScore: 7.80
Self-citation rate: 1.90%
Articles published: 260
Review time: 3 months
Journal description: BMC Biology is a broad-scope journal covering all areas of biology. Our content includes research articles, new methods and tools. BMC Biology also publishes reviews, Q&A, and commentaries.