Where single-cell transcriptomics fails T cells: The misuse of unsupervised clustering for T-cell annotation

Immunoinformatics (Amsterdam, Netherlands). Pub Date: 2025-12-01. Epub Date: 2025-10-21. DOI: 10.1016/j.immuno.2025.100063

Kerry A. Mullan, Sebastiaan Valkiers, Nicky de Vrij, Chen Li, Sara Verbandt, Ting Pu, Pieter Meysman

Volume 20, Article 100063. Publication type: Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S2667119025000163


Abstract

Single-cell transcriptomic analysis currently tends to follow one recipe: unsupervised clustering, then annotation based on expert opinion. The underlying assumption is that this process accurately identifies transcriptional differences between cellular subsets and can therefore separate, for example, CD8+ T cells from CD4+ T cells. However, this widely applied assumption that the clustering reflects T-cell biology has never been validated. We used a large T-cell atlas (V2) that combined twelve 10x Genomics single T-cell transcriptomics datasets (approximately 500,000 cells), as well as an independent CITE-seq dataset, to assess whether the unsupervised clustering produced by Seurat reflected the biology. Annotations were then evaluated using the expression of key marker genes. The main T-cell markers CD8 and CD4 were mixed in most clusters, regardless of the feature selection and of whether the clustering was run on principal components, Harmony components, or the features themselves. The factors driving the clustering were related to cellular functions (glucose metabolism) and to T-cell receptor (TCR), immunoglobulin, and HLA transcripts, not to typical T-cell markers. Against current assumptions, the clustering was not driven by T-cell phenotypes and could not accurately segregate CD4+ from CD8+ T cells, let alone their sub-classifications. This implies that many T cells would be incorrectly classified under the standard cluster-based annotation approach. Methods relying on unsupervised clustering should be used with care, as improper handling can misrepresent the data, and alternatives such as semi-supervised approaches with TCR-seq or protein-based annotations should be preferred.
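The marker-based evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline (they report using Seurat); it is a plain-Python toy that assumes each cell comes with a cluster label and raw transcript counts for CD4 and a CD8 gene. The function names `call_lineage` and `cluster_purity`, the `min_count` cutoff, and the toy data are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def call_lineage(cd4_count, cd8_count, min_count=1):
    """Label one cell from its marker counts (hypothetical rule, not from the paper)."""
    cd4_pos = cd4_count >= min_count
    cd8_pos = cd8_count >= min_count
    if cd4_pos and cd8_pos:
        return "double_positive"
    if cd4_pos:
        return "CD4"
    if cd8_pos:
        return "CD8"
    return "negative"


def cluster_purity(cluster_ids, cd4_counts, cd8_counts, min_count=1):
    """Return {cluster: purity}.

    Purity is the share of the dominant lineage (CD4 or CD8) among the
    single-positive cells of a cluster; it is None when a cluster has no
    single-positive cells. A value near 0.5 means the markers are mixed.
    """
    tallies = defaultdict(Counter)
    for cluster, cd4, cd8 in zip(cluster_ids, cd4_counts, cd8_counts):
        tallies[cluster][call_lineage(cd4, cd8, min_count)] += 1

    purity = {}
    for cluster, tally in tallies.items():
        single_positive = tally["CD4"] + tally["CD8"]
        if single_positive == 0:
            purity[cluster] = None
        else:
            purity[cluster] = max(tally["CD4"], tally["CD8"]) / single_positive
    return purity


# Toy data (invented): eight cells in two clusters.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
cd4_counts = [3, 2, 0, 0, 0, 0, 0, 1]
cd8_counts = [0, 0, 4, 1, 2, 5, 3, 0]

purity = cluster_purity(cluster_ids, cd4_counts, cd8_counts)
for cluster in sorted(purity):
    print(f"cluster {cluster}: purity = {purity[cluster]}")
```

In this toy input, cluster 0 holds two CD4 and two CD8 single-positive cells, so its purity is 0.5, the fully mixed case the abstract reports for most clusters. Any purity cutoff for calling a cluster "mixed" would be a further assumption not given in the source.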
