scValue: value-based subsampling of large-scale single-cell transcriptomic data for machine and deep learning tasks.

IF 6.8 2区 生物学 Q1 BIOCHEMICAL RESEARCH METHODS
Li Huang, Weikang Gong, Dongsheng Chen
{"title":"scValue: value-based subsampling of large-scale single-cell transcriptomic data for machine and deep learning tasks.","authors":"Li Huang, Weikang Gong, Dongsheng Chen","doi":"10.1093/bib/bbaf279","DOIUrl":null,"url":null,"abstract":"<p><p>Large single-cell ribonucleic acid-sequencing (scRNA-seq) datasets offer unprecedented biological insights but present substantial computational challenges for visualization and analysis. While existing subsampling methods can enhance efficiency, they may not ensure optimal performance in downstream machine learning and deep learning (ML/DL) tasks. Here, we introduce scValue, a novel approach that ranks individual cells by 'data value' using out-of-bag estimates from a random forest model. scValue prioritizes high-value cells and allocates greater representation to cell types with higher variability in data value, effectively preserving key biological signals within subsamples. We benchmarked scValue on automatic cell-type annotation tasks across four large datasets, paired with distinct ML/DL models. Our method consistently outperformed existing subsampling methods, closely matching full-data performance across all annotation tasks. In three additional case studies-label transfer learning, cross-study label harmonization, and bulk RNA-seq deconvolution-scValue more effectively preserved T-cell annotations across human gut-colon datasets, more accurately reproduced T-cell subtype relationships in a human spleen dataset, and constructed a more reliable single-cell immune reference for cell-type deconvolution in simulated bulk tissue samples. Finally, using 16 public datasets ranging from tens of thousands to millions of cells, we evaluated subsampling quality based on computational time, Gini coefficient, and Hausdorff distance. scValue demonstrated fast execution, well-balanced cell-type representation, and distributional properties akin to uniform sampling. Overall, scValue provides a robust and scalable solution for subsampling large scRNA-seq data in ML/DL workflows. It is available as an open-source Python package installable via pip, with source code at https://github.com/LHBCB/scvalue.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 3","pages":""},"PeriodicalIF":6.8000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12165832/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bib/bbaf279","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0

Abstract

Large single-cell ribonucleic acid-sequencing (scRNA-seq) datasets offer unprecedented biological insights but present substantial computational challenges for visualization and analysis. While existing subsampling methods can enhance efficiency, they may not ensure optimal performance in downstream machine learning and deep learning (ML/DL) tasks. Here, we introduce scValue, a novel approach that ranks individual cells by 'data value' using out-of-bag estimates from a random forest model. scValue prioritizes high-value cells and allocates greater representation to cell types with higher variability in data value, effectively preserving key biological signals within subsamples. We benchmarked scValue on automatic cell-type annotation tasks across four large datasets, paired with distinct ML/DL models. Our method consistently outperformed existing subsampling methods, closely matching full-data performance across all annotation tasks. In three additional case studies-label transfer learning, cross-study label harmonization, and bulk RNA-seq deconvolution-scValue more effectively preserved T-cell annotations across human gut-colon datasets, more accurately reproduced T-cell subtype relationships in a human spleen dataset, and constructed a more reliable single-cell immune reference for cell-type deconvolution in simulated bulk tissue samples. Finally, using 16 public datasets ranging from tens of thousands to millions of cells, we evaluated subsampling quality based on computational time, Gini coefficient, and Hausdorff distance. scValue demonstrated fast execution, well-balanced cell-type representation, and distributional properties akin to uniform sampling. Overall, scValue provides a robust and scalable solution for subsampling large scRNA-seq data in ML/DL workflows. It is available as an open-source Python package installable via pip, with source code at https://github.com/LHBCB/scvalue.

scValue:用于机器和深度学习任务的大规模单细胞转录组数据的基于值的子采样。
大型单细胞核糖核酸测序(scRNA-seq)数据集提供了前所未有的生物学见解,但在可视化和分析方面提出了实质性的计算挑战。虽然现有的子采样方法可以提高效率,但它们可能无法确保下游机器学习和深度学习(ML/DL)任务的最佳性能。在这里,我们引入scValue,这是一种新颖的方法,它使用随机森林模型的外袋估计,根据“数据值”对单个细胞进行排名。scValue优先考虑高价值细胞,并为数据值变化较大的细胞类型分配更大的代表性,有效地保留了子样本中的关键生物信号。我们在四个大型数据集的自动单元格类型注释任务上对scValue进行了基准测试,这些数据集与不同的ML/DL模型配对。我们的方法始终优于现有的子抽样方法,在所有注释任务中与全数据性能非常接近。在另外三个案例研究——标签迁移学习、交叉研究标签协调和大量RNA-seq反卷积中,scvalue更有效地保存了人类肠道-结肠数据集上的t细胞注释,更准确地再现了人类脾脏数据集中的t细胞亚型关系,并为模拟的大量组织样本中的细胞类型反卷积构建了更可靠的单细胞免疫参考。最后,使用16个公共数据集,从数万到数百万个单元格,我们基于计算时间、基尼系数和豪斯多夫距离来评估子采样质量。scValue展示了快速执行、均衡的单元格类型表示和类似于均匀抽样的分布特性。总的来说,scValue为ML/DL工作流程中的大型scRNA-seq数据的子采样提供了一个强大且可扩展的解决方案。它是一个开源Python包,可以通过pip安装,源代码在https://github.com/LHBCB/scvalue。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Briefings in bioinformatics
Briefings in bioinformatics 生物-生化研究方法
CiteScore
13.20
自引率
13.70%
发文量
549
审稿时长
6 months
期刊介绍: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信