Where single-cell transcriptomics fails T cells: The misuse of unsupervised clustering for T-cell annotation

Kerry A. Mullan, Sebastiaan Valkiers, Nicky de Vrij, Chen Li, Sara Verbandt, Ting Pu, Pieter Meysman
Journal: Immunoinformatics (Amsterdam, Netherlands), volume 20, Article 100063
DOI: 10.1016/j.immuno.2025.100063
Published: 2025-12-01 (Epub 2025-10-21)
URL: https://www.sciencedirect.com/science/article/pii/S2667119025000163

Abstract

Single-cell transcriptomic analysis currently tends to follow one workflow: unsupervised clustering followed by annotation based on expert opinion. The underlying assumption is that this process accurately identifies transcriptional differences between cellular subsets and can therefore, for example, cluster CD8+ T cells apart from CD4+ T cells. However, this widely applied assumption that the clustering reflects T-cell biology has never been validated. We used a large T-cell atlas (V2) that combined twelve 10x Genomics single T-cell transcriptomics datasets (∼500K cells), as well as an independent CITE-seq dataset, to assess whether the unsupervised clustering produced by Seurat reflected the biology. Annotations were then evaluated using the expression of key marker genes. The main T-cell markers CD8 and CD4 were mixed in most clusters, regardless of the feature selection and of whether clustering was run on principal components, Harmony components, or the features directly. The factors driving the clustering were related to cellular functions (glucose metabolism) and to T-cell receptor (TCR), immunoglobulin and HLA transcripts, rather than to typical T-cell markers. Against current assumptions, the clustering was not driven by T-cell phenotypes and could not accurately segregate CD4+ from CD8+ T cells, let alone their sub-classifications. This implies that many T cells would be incorrectly classified under the standard cluster-based annotation approach. Methods relying on unsupervised clustering should be used with care, as improper handling can misrepresent the data, and alternatives such as semi-supervised approaches with TCR-seq or protein-based annotations should be preferred.
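The marker-based evaluation described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline (which used Seurat). It is a minimal Python illustration, assuming each cell is given as a cluster label plus a dictionary of raw transcript counts. The gene names `CD8A`/`CD8B`, the function `cluster_marker_mixing` and the `min_fraction` cutoff are assumptions introduced for this example only.

```python
from collections import defaultdict

# Assumed marker genes: the abstract names only "CD4" and "CD8".
CD4_GENES = ("CD4",)
CD8_GENES = ("CD8A", "CD8B")


def cluster_marker_mixing(cells, min_fraction=0.2):
    """Summarise CD4/CD8 detection per cluster.

    cells: iterable of (cluster_id, {gene: count}) pairs.
    min_fraction: hypothetical cutoff; a lineage counts as present in a
        cluster when at least this fraction of its cells express a marker.
    """
    totals = defaultdict(lambda: {"n": 0, "cd4": 0, "cd8": 0})
    for cluster_id, counts in cells:
        stats = totals[cluster_id]
        stats["n"] += 1
        # A cell is called positive if any marker transcript is detected.
        if any(counts.get(gene, 0) > 0 for gene in CD4_GENES):
            stats["cd4"] += 1
        if any(counts.get(gene, 0) > 0 for gene in CD8_GENES):
            stats["cd8"] += 1

    summary = {}
    for cluster_id, stats in totals.items():
        frac_cd4 = stats["cd4"] / stats["n"]
        frac_cd8 = stats["cd8"] / stats["n"]
        summary[cluster_id] = {
            "frac_cd4": frac_cd4,
            "frac_cd8": frac_cd8,
            # "Mixed" means both lineages pass the cutoff in one cluster.
            "mixed": frac_cd4 >= min_fraction and frac_cd8 >= min_fraction,
        }
    return summary


# Toy data: cluster 0 contains both lineages, cluster 1 only CD8.
toy_cells = [
    (0, {"CD4": 3}),
    (0, {"CD8A": 2}),
    (0, {"CD4": 1, "CD8B": 1}),
    (0, {}),
    (1, {"CD8A": 5}),
    (1, {"CD8B": 1}),
]
result = cluster_marker_mixing(toy_cells)
for cluster_id in sorted(result):
    print(cluster_id, result[cluster_id])
```

In this toy input, cluster 0 is flagged as mixed and cluster 1 is not. On real data, transcript dropout means many cells show no CD4 or CD8 counts at all, so such fractions are a rough check and not a definitive lineage call. That limitation is consistent with the abstract's recommendation of protein-based or TCR-seq-assisted annotation.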