TCRCluster: a novel approach to T-cell receptor latent featurization and clustering using contrastive learning-guided two-stage variational autoencoders.
Authors: Yat-Tsai Richie Wan, Morten Nielsen
Journal: NAR Genomics and Bioinformatics, 7(2), lqaf065
Published: 2025-05-27
DOI: https://doi.org/10.1093/nargab/lqaf065
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12107435/pdf/
Citations: 0
Abstract
T cells play a vital role in adaptive immunity by targeting pathogen-infected or cancerous cells, but predicting their specificity remains challenging. Encoding T-cell receptor (TCR) sequences into informative feature spaces is therefore crucial for advancing specificity prediction and downstream applications. For this, we developed a variational autoencoder (VAE)-based model trained on paired TCR α-β chain data, incorporating all six complementarity-determining regions. A semi-supervised 'two-stage VAE' framework, integrating cosine triplet loss and a classifier, was found to further refine peptide-specific latent representations, outperforming sequence-based methods in specificity prediction. Clustering analyses leveraging our VAE latent space were evaluated using K-means, agglomerative clustering, and a novel graph-based method. Agglomerative clustering achieved the most biologically relevant results, balancing cluster purity and retention despite noise in TCR specificity annotations. We extended these insights to evaluate TCR repertoire data. Across datasets, VAE-based models outperformed sequence-based methods, particularly in retention metrics, with notable improvements in the SARS-CoV-2 repertoire dataset. Moreover, the cancer repertoire analysis highlighted the generalizability of our approach, where the model displayed high performance despite minimal similarity between the training and test data. Collectively, these results demonstrate the potential of VAE-based latent representations to offer a robust framework for prediction, clustering, and repertoire analysis.
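As an illustration of the cosine triplet loss mentioned in the abstract, the following is a minimal sketch of the generic form of that loss, not the authors' implementation. It assumes the common formulation max(0, d(a, p) - d(a, n) + margin) with d taken as 1 minus cosine similarity. The function names, the margin value of 0.2, and the toy two-dimensional "latent vectors" are all hypothetical and chosen only for readability.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet loss on cosine distance (margin is an assumed value).

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin`; otherwise it grows linearly.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical latent vectors: in the setting described above, the anchor and
# positive would be TCRs annotated with the same peptide, and the negative a
# TCR annotated with a different peptide.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]   # same direction as the anchor
negative = [0.0, 1.0]   # orthogonal to the anchor

# Well-separated triplet: the margin is already satisfied, so no penalty.
easy = cosine_triplet_loss(anchor, positive, negative)

# Swapped roles: the "positive" is farther away than the "negative".
hard = cosine_triplet_loss(anchor, negative, positive)

print(easy, hard)
```

In a semi-supervised setup of the kind the abstract describes, a term like this would be added to the VAE objective so that TCRs sharing a peptide label are pulled together in latent space. How the terms are weighted and how triplets are sampled is not stated in the abstract and is left open here.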