PCLDA: An interpretable cell annotation tool for single-cell RNA-sequencing data based on simple statistical methods.

IF 4.1 | Zone 2, Biology | JCR Q2, BIOCHEMISTRY & MOLECULAR BIOLOGY
Computational and structural biotechnology journal Pub Date : 2025-07-23 eCollection Date: 2025-01-01 DOI:10.1016/j.csbj.2025.07.019
Kailun Bai, Belaid Moa, Xiaojian Shao, Xuekui Zhang
Volume 27, pages 3264-3274. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12329077/pdf/
Citations: 0

Abstract

Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, yet accurate and consistent cell-type annotation remains a crucial challenge. Numerous automated tools exist, but their complex modeling assumptions can hinder reliability across varied datasets and protocols. We propose PCLDA, a pipeline composed of three modules: t-test-based gene screening, principal component analysis (PCA), and linear discriminant analysis (LDA), all built on simple statistical methods. An ablation study shows that each module in PCLDA contributes significantly to performance and robustness, with two novel enhancements in the second module yielding substantial gains. Despite these additions, the model retains its original assumptions, computational efficiency, and interpretability. Benchmarking against nine state-of-the-art methods across 22 public scRNA-seq datasets and 35 distinct evaluation scenarios, PCLDA consistently achieves top-tier accuracy under both intra-dataset (cross-validation) and inter-dataset (cross-platform) conditions. Notably, when reference and query data are generated via different protocols, PCLDA remains stable and often outperforms more complex machine-learning approaches. Furthermore, PCLDA offers strong interpretability, attributed to the linear nature of its PCA and LDA modules. The final decision boundaries are linear combinations of the original gene expression values, directly reflecting the contribution of each gene to the classification. Top-weighted genes identified by PCLDA better capture biologically meaningful signals in enrichment analyses than those selected via marginal screening alone, offering deeper functional insights into cell-type specificity. In conclusion, our work underscores the utility of carefully enhanced simple statistical methods for single-cell annotation. PCLDA's simplicity, interpretability, and consistently high performance make it a practical, reliable alternative to more complex annotation pipelines. Code is available on GitHub: https://github.com/kellen8hao/PCLDA.
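The three-module structure described in the abstract (t-test screening, then PCA, then LDA) can be sketched roughly as follows. This is a minimal illustration written for this page, not the authors' implementation from the GitHub repository: the function names `welch_t`, `fit_pipeline`, and `predict`, the parameters `n_genes` and `n_pcs`, and the synthetic Gaussian data are all assumptions, and the two enhancements to the second module mentioned in the abstract are not reproduced because the abstract does not describe them. The sketch also shows why the final classifier is linear in the original genes: the PCA loadings and LDA coefficients multiply into a single genes-by-classes weight matrix.

```python
import numpy as np

def welch_t(a, b):
    """Per-gene Welch t-statistic between two groups of cells (rows)."""
    va = a.var(axis=0, ddof=1) / a.shape[0]
    vb = b.var(axis=0, ddof=1) / b.shape[0]
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(va + vb + 1e-12)

def fit_pipeline(X, y, n_genes=20, n_pcs=5):
    """Hypothetical t-test -> PCA -> LDA fit; X is cells x genes."""
    classes = np.unique(y)

    # Module 1: one-vs-rest t-test screening, keep top genes by |t| per class.
    keep = set()
    for c in classes:
        t = welch_t(X[y == c], X[y != c])
        keep.update(np.argsort(-np.abs(t))[:n_genes].tolist())
    genes = np.array(sorted(keep))
    Xs = X[:, genes]

    # Module 2: plain PCA via SVD of the centered matrix (no enhancements).
    mu = Xs.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xs - mu, full_matrices=False)
    V = Vt[:n_pcs].T                      # selected genes x PCs
    Z = (Xs - mu) @ V

    # Module 3: LDA with a pooled within-class covariance.
    means = np.stack([Z[y == c].mean(axis=0) for c in classes])
    resid = Z - means[np.searchsorted(classes, y)]
    cov = resid.T @ resid / (len(y) - len(classes))
    A = np.linalg.solve(cov, means.T)     # PCs x classes
    priors = np.array([(y == c).mean() for c in classes])
    b = -0.5 * np.sum(means.T * A, axis=0) + np.log(priors)

    # Collapse PCA and LDA into one linear map on the screened genes.
    W = V @ A                             # selected genes x classes
    return {"classes": classes, "genes": genes, "W": W, "b": b - mu @ W}

def predict(model, X):
    """Assign each cell to the class with the largest linear score."""
    scores = X[:, model["genes"]] @ model["W"] + model["b"]
    return model["classes"][scores.argmax(axis=1)]

# Toy data: 3 cell types, 200 genes, 10 shifted marker genes per type.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 60)
X = rng.normal(size=(y.size, 200))
for c in range(3):
    X[y == c, c * 10:(c + 1) * 10] += 3.0

idx = rng.permutation(y.size)
train, test = idx[:120], idx[120:]
model = fit_pipeline(X[train], y[train])
acc = (predict(model, X[test]) == y[test]).mean()
print(f"held-out accuracy: {acc:.2f}")

# Genes with the largest absolute weight for the first class.
top = model["genes"][np.argsort(-np.abs(model["W"][:, 0]))[:5]]
print("top-weighted genes for class 0:", top)
```

Because `W` has one row per screened gene, ranking its columns by absolute value gives a per-class list of top-weighted genes, which is the kind of list the abstract says was used for enrichment analysis. Real scRNA-seq input would need normalization and a reference/query split, which this toy example leaves out.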

Source journal
Computational and structural biotechnology journal
Subject area: Biochemistry, Genetics and Molecular Biology-Biophysics
CiteScore: 9.30
Self-citation rate: 3.30%
Articles published: 540
Review time: 6 weeks
Journal description: Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a pre-requisite for publication in the journal. Specific areas of interest include, but are not limited to: Structure and function of proteins, nucleic acids and other macromolecules; Structure and function of multi-component complexes; Protein folding, processing and degradation; Enzymology; Computational and structural studies of plant systems; Microbial Informatics; Genomics; Proteomics; Metabolomics; Algorithms and Hypothesis in Bioinformatics; Mathematical and Theoretical Biology; Computational Chemistry and Drug Discovery; Microscopy and Molecular Imaging; Nanotechnology; Systems and Synthetic Biology.