Partial label learning for automated classification of single-cell transcriptomic profiles

Impact factor 3.6; Zone 2 (Biology)

PLoS Computational Biology Pub Date : 2024-04-01 DOI:10.1371/journal.pcbi.1012006

Malek Senoussi, Thierry Artières, Paul Villoutreix



Abstract

Single-cell RNA sequencing (scRNASeq) data plays a major role in advancing our understanding of developmental biology. An important current question is how to classify transcriptomic profiles obtained from scRNASeq experiments into the various cell types and identify the lineage relationships of individual cells. Because of the fast accumulation of datasets and the high dimensionality of the data, it has become challenging to explore and annotate single-cell transcriptomic profiles by hand. To overcome this challenge, automated classification methods are needed. Classical approaches rely on supervised training datasets. However, because data annotated at single-cell resolution is difficult to obtain, we propose instead to take advantage of partial annotations. The partial label learning framework assumes that, for each data point, we can obtain a set of candidate labels that contains the correct one, a simpler setting than requiring a fully supervised training dataset. We study state-of-the-art multi-class classification methods, such as SVM, kNN, prototype-based, logistic regression and ensemble methods, and extend them when needed to the partial label learning framework. Moreover, we study the effect of incorporating the structure of the label set into the methods. We focus particularly on the hierarchical structure of the labels, as commonly observed in developmental processes. We show, on simulated and real datasets, that these extensions make it possible to learn from partially labeled data and to predict with high accuracy, particularly with a nonlinear prototype-based method. We demonstrate that our methods trained with partially annotated data reach the same performance as methods trained with fully supervised data. Finally, we study the level of uncertainty present in the partially annotated data, and derive some prescriptive results on the effect of this uncertainty on the accuracy of the partial label learning methods. Overall, our findings show how hierarchical and non-hierarchical partial label learning strategies can help solve the problem of automated classification of single-cell transcriptomic profiles. Interestingly, these methods rely on a much less stringent type of annotated dataset than fully supervised learning methods do.
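To make the partial label setting concrete, the following is a minimal sketch of a kNN classifier whose training points carry candidate label sets rather than single labels. It is not the authors' implementation. The toy lineage tree, the label names (`typeA`, `typeB`, `neuron`), the helper names and the voting rule (each neighbor splits one vote evenly across its candidates) are all assumptions made for illustration. A coarse annotation at an internal node of the tree is read here as "any leaf below that node", which is one possible way a label hierarchy can give rise to candidate sets.

```python
import numpy as np

# Hypothetical toy lineage tree (an assumption, not taken from the paper):
# internal nodes map to their children; leaves are the cell types to predict.
TREE = {"root": ["progenitor", "neuron"], "progenitor": ["typeA", "typeB"]}
LEAVES = ["typeA", "typeB", "neuron"]


def leaves_under(node):
    """Return the set of leaf labels at or below `node` in TREE."""
    if node not in TREE:
        return {node}
    out = set()
    for child in TREE[node]:
        out |= leaves_under(child)
    return out


def pl_knn_predict(X_train, candidate_sets, X_test, k=3):
    """Partial-label kNN: each neighbor votes for all of its candidate labels."""
    idx = {name: j for j, name in enumerate(LEAVES)}
    # Vote matrix: each training point splits one vote evenly over its candidates.
    votes_per_sample = np.zeros((len(X_train), len(LEAVES)))
    for i, cands in enumerate(candidate_sets):
        for name in cands:
            votes_per_sample[i, idx[name]] = 1.0 / len(cands)
    # Pairwise Euclidean distances, shape (n_test, n_train).
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]      # k nearest training points
    votes = votes_per_sample[nearest].sum(axis=1)   # shape (n_test, n_leaves)
    return [LEAVES[j] for j in votes.argmax(axis=1)]


# Two well-separated toy clusters standing in for expression profiles.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
# Annotations at mixed resolution: some cells are labeled only at a coarse node.
annotations = ["progenitor", "typeA", "typeA", "root", "neuron", "neuron"]
candidate_sets = [leaves_under(a) for a in annotations]

X_test = np.array([[0.2, 0.1], [5.5, 5.2]])
preds = pl_knn_predict(X_train, candidate_sets, X_test, k=3)
```

In this sketch, coarsely annotated cells still contribute to the vote, but with less weight per label than precisely annotated ones, so larger candidate sets carry less information per cell.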


Source journal

PLoS Computational Biology (Biology: Biochemical Research Methods)

CiteScore: 7.10

Self-citation rate: 4.70%

Articles published: 820

Journal description: PLOS Computational Biology features works of exceptional significance that further our understanding of living systems at all scales—from molecules and cells, to patient populations and ecosystems—through the application of computational methods. Readers include life and computational scientists, who can take the important findings presented here to the next level of discovery. Research articles must be declared as belonging to a relevant section. More information about the sections can be found in the submission guidelines. Research articles should model aspects of biological systems, demonstrate both methodological and scientific novelty, and provide profound new biological insights. Generally, reliability and significance of biological discovery through computation should be validated and enriched by experimental studies. Inclusion of experimental validation is not required for publication, but should be referenced where possible. Inclusion of experimental validation of a modest biological discovery through computation does not render a manuscript suitable for PLOS Computational Biology. Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities.