scValue: value-based subsampling of large-scale single-cell transcriptomic data for machine and deep learning tasks.

Impact factor: 6.8; Zone 2 (Biology); JCR Q1 (Biochemical Research Methods)

Briefings in bioinformatics Pub Date : 2025-05-01 DOI:10.1093/bib/bbaf279

Li Huang, Weikang Gong, Dongsheng Chen

Volume 26, Issue 3. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12165832/pdf/


Abstract

Large single-cell ribonucleic acid-sequencing (scRNA-seq) datasets offer unprecedented biological insights but present substantial computational challenges for visualization and analysis. While existing subsampling methods can enhance efficiency, they may not ensure optimal performance in downstream machine learning and deep learning (ML/DL) tasks. Here, we introduce scValue, a novel approach that ranks individual cells by 'data value' using out-of-bag estimates from a random forest model. scValue prioritizes high-value cells and allocates greater representation to cell types with higher variability in data value, effectively preserving key biological signals within subsamples. We benchmarked scValue on automatic cell-type annotation tasks across four large datasets, paired with distinct ML/DL models. Our method consistently outperformed existing subsampling methods, closely matching full-data performance across all annotation tasks. In three additional case studies (label transfer learning, cross-study label harmonization, and bulk RNA-seq deconvolution), scValue more effectively preserved T-cell annotations across human gut-colon datasets, more accurately reproduced T-cell subtype relationships in a human spleen dataset, and constructed a more reliable single-cell immune reference for cell-type deconvolution in simulated bulk tissue samples. Finally, using 16 public datasets ranging from tens of thousands to millions of cells, we evaluated subsampling quality based on computational time, Gini coefficient, and Hausdorff distance. scValue demonstrated fast execution, well-balanced cell-type representation, and distributional properties akin to uniform sampling. Overall, scValue provides a robust and scalable solution for subsampling large scRNA-seq data in ML/DL workflows. It is available as an open-source Python package installable via pip, with source code at https://github.com/LHBCB/scvalue.
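The two selection ideas named in the abstract (favor high-value cells, and give more slots to cell types whose data values vary more) can be illustrated with a minimal sketch. This is not the scValue implementation or its API. The function names are hypothetical, the per-cell data values are taken as given (in scValue they come from out-of-bag estimates of a random forest), and the weighting rule, type size times the standard deviation of data values, is an assumption chosen for illustration because the abstract does not state the exact formula.

```python
from collections import defaultdict
from statistics import pstdev


def allocate_by_variability(values, labels, sample_size):
    """Split sample_size across cell types.

    ASSUMPTION: each type is weighted by (number of cells) * (std. dev. of
    its data values). This rule is illustrative, not taken from the paper.
    """
    groups = defaultdict(list)
    for idx, (value, label) in enumerate(zip(values, labels)):
        groups[label].append((value, idx))

    weights = {k: len(g) * pstdev([v for v, _ in g]) for k, g in groups.items()}
    total = sum(weights.values())
    if total == 0:  # no variability anywhere: fall back to type sizes
        weights = {k: float(len(g)) for k, g in groups.items()}
        total = sum(weights.values())

    # Integer part of each share, capped at the number of cells available.
    shares = {k: sample_size * w / total for k, w in weights.items()}
    quotas = {k: min(len(groups[k]), int(s)) for k, s in shares.items()}

    # Hand out the remaining slots by largest fractional remainder.
    by_remainder = sorted(shares, key=lambda k: shares[k] - int(shares[k]), reverse=True)
    leftover = sample_size - sum(quotas.values())
    while leftover > 0 and any(quotas[k] < len(groups[k]) for k in groups):
        for k in by_remainder:
            if leftover > 0 and quotas[k] < len(groups[k]):
                quotas[k] += 1
                leftover -= 1
    return groups, quotas


def subsample(values, labels, sample_size):
    """Return sorted indices of the highest-value cells within each type's quota."""
    groups, quotas = allocate_by_variability(values, labels, sample_size)
    chosen = []
    for k, members in groups.items():
        top = sorted(members, reverse=True)[: quotas[k]]
        chosen.extend(idx for _, idx in top)
    return sorted(chosen)


# Toy data values (hypothetical): "T" cells vary widely, "B" cells barely.
values = [0.9, 0.1, 0.8, 0.2, 0.95, 0.05, 0.5, 0.55, 0.45, 0.5, 0.6, 0.4]
labels = ["T"] * 6 + ["B"] * 6
picked = subsample(values, labels, 6)
print(picked)  # [0, 1, 2, 3, 4, 10]
```

In this toy run the high-variability type receives five of the six slots and the low-variability type one, and within each type the cells with the largest data values are kept. For real use, the published package at the repository above is the reference.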


Source journal: Briefings in Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 13.20

Self-citation rate: 13.70%

Articles published: 549

Review time: 6 months

About the journal: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.