Leveraging multiple labeled datasets for the automated annotation of single-cell RNA and ATAC data.

IF 4.1 · CAS Zone 2 (Biology) · JCR Q2 (Biochemistry & Molecular Biology)

Computational and Structural Biotechnology Journal. Pub Date: 2025-07-01. eCollection Date: 2025-01-01. DOI: 10.1016/j.csbj.2025.06.043

Joseba Sancho-Zamora, Akash Kanhirodan, Xabier Garrote, Juan Manuel Silva Rojas, Olivier Gevaert, Mikel Hernaez, Guillermo Serrano, Idoia Ochoa

Volume 27, pages 2863-2870. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12270792/pdf/

Citations: 0

Abstract

The creation of single-cell atlases is essential for understanding cellular diversity and heterogeneity. However, assembling these atlases is challenging due to batch effects and the need for accurate and consistent cell annotation. Current methods for single-cell RNA and ATAC sequencing (scRNA-Seq and scATAC-Seq), while effective for integration, are not optimized for cell annotation. Additionally, many annotation tools rely on external databases or reference scRNA-Seq datasets, which may limit their adaptability to specific study needs, especially for rare cell types or scATAC-Seq data. Here, we introduce JIND-Multi, a new framework designed to transfer cell-type labels across multiple annotated datasets. Notably, JIND-Multi can be applied to both scRNA-Seq and scATAC-Seq data, requiring in each case annotated data of the same type, in contrast to most methods for scATAC-Seq data, which require (paired) annotated scRNA-Seq data. In both cases, JIND-Multi significantly reduces the proportion of unclassified cells while maintaining the accuracy and performance of the original JIND model, and compares favorably to state-of-the-art methods. These results demonstrate its versatility and effectiveness across different single-cell sequencing technologies. JIND-Multi represents an improvement in cell annotation, reducing unassigned cells and offering a reliable solution for both scRNA-Seq and scATAC-Seq data. Its ability to handle multiple labeled datasets enhances the precision of annotations, making it a valuable tool for the single-cell research community. JIND-Multi is publicly available at: https://github.com/ML4BM-Lab/JIND-Multi.git.
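The idea of transferring labels from several annotated datasets while leaving low-confidence cells unassigned can be illustrated with a toy sketch. This is not JIND-Multi's implementation or API: the nearest-centroid classifier, the function names (`fit_centroids`, `predict_with_rejection`, `unassigned_rate`), the "Unassigned" label string, and the 0.9 confidence threshold are all assumptions chosen only to show how a rejection option yields a measurable proportion of unclassified cells.

```python
import numpy as np

def fit_centroids(batches):
    """Pool several labeled batches [(X, y), ...] and return (labels, centroids).

    Hypothetical helper: a nearest-centroid model stands in for a real classifier.
    """
    X = np.vstack([x for x, _ in batches])
    y = np.concatenate([lab for _, lab in batches])
    labels = np.unique(y)
    centroids = np.vstack([X[y == lab].mean(axis=0) for lab in labels])
    return labels, centroids

def predict_with_rejection(X, labels, centroids, threshold=0.9):
    """Label each cell, or mark it "Unassigned" when confidence < threshold."""
    # Squared Euclidean distance from every cell to every centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Softmax over negative distances gives pseudo-probabilities per label.
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)
    pred = labels[probs.argmax(axis=1)].astype(object)
    pred[conf < threshold] = "Unassigned"  # assumed placeholder label
    return pred, conf

def unassigned_rate(pred):
    """Fraction of cells left without a label."""
    return float(np.mean(pred == "Unassigned"))

if __name__ == "__main__":
    # Two tiny labeled "batches" with made-up 2-D features.
    batch_a = (np.array([[0.0, 0.0], [0.2, 0.0]]), np.array(["T", "T"]))
    batch_b = (np.array([[4.0, 4.0], [4.0, 4.2]]), np.array(["B", "B"]))
    labels, centroids = fit_centroids([batch_a, batch_b])
    target = np.array([[0.1, 0.0], [4.0, 4.1], [2.05, 2.05]])
    pred, conf = predict_with_rejection(target, labels, centroids)
    print(list(pred), unassigned_rate(pred))
```

In this sketch, raising the threshold trades coverage for confidence: more cells end up unassigned, but the remaining labels are those the model is most certain about.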


Source journal

Computational and Structural Biotechnology Journal (Biochemistry, Genetics and Molecular Biology: Biophysics)

CiteScore: 9.30

Self-citation rate: 3.30%

Articles published: 540

Review time: 6 weeks

About the journal: Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a prerequisite for publication in the journal. Specific areas of interest include, but are not limited to: Structure and function of proteins, nucleic acids and other macromolecules; Structure and function of multi-component complexes; Protein folding, processing and degradation; Enzymology; Computational and structural studies of plant systems; Microbial Informatics; Genomics; Proteomics; Metabolomics; Algorithms and Hypothesis in Bioinformatics; Mathematical and Theoretical Biology; Computational Chemistry and Drug Discovery; Microscopy and Molecular Imaging; Nanotechnology; Systems and Synthetic Biology.