{"title":"Masked language modeling pretraining dynamics for downstream peptide: T-cell receptor binding prediction.","authors":"Brock Landry, Jian Zhang","doi":"10.1093/bioadv/vbaf028","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Predicting antigen peptide and T-cell receptor (TCR) binding is difficult due to the combinatoric nature of peptides and the scarcity of labeled peptide-binding pairs. The masked language modeling method of pretraining is reliably used to increase the downstream performance of peptide:TCR binding prediction models by leveraging unlabeled data. In the literature, binding prediction models are commonly trained until the validation loss converges. To evaluate this method, cited transformer model architectures pretrained with masked language modeling are investigated to assess the benefits of achieving lower loss metrics during pretraining. The downstream performance metrics for these works are recorded after each subsequent interval of masked language modeling pretraining.</p><p><strong>Results: </strong>The results demonstrate that the downstream performance benefit achieved from masked language modeling peaks substantially before the pretraining loss converges. Using the pretraining loss metric is largely ineffective for precisely identifying the best downstream performing pretrained model checkpoints (or saved states). However, the pretraining loss metric in these scenarios can be used to mark a threshold in which the downstream performance benefits from pretraining have fully diminished. Further pretraining beyond this threshold does not negatively impact downstream performance but results in unpredictable bilateral deviations from the post-threshold average downstream performance benefit.</p><p><strong>Availability and implementation: </strong>The datasets used in this article for model training are publicly available from each original model's authors at https://github.com/SFGLab/bertrand, https://github.com/wukevin/tcr-bert, https://github.com/NKI-AI/STAPLER, and https://github.com/barthelemymp/TULIP-TCR.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf028"},"PeriodicalIF":2.4000,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11908642/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics advances","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioadv/vbaf028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
Abstract
Motivation: Predicting antigen peptide and T-cell receptor (TCR) binding is difficult due to the combinatorial nature of peptides and the scarcity of labeled peptide:TCR binding pairs. Masked language modeling pretraining is reliably used to increase the downstream performance of peptide:TCR binding prediction models by leveraging unlabeled data. In the literature, binding prediction models are commonly trained until the validation loss converges. To evaluate this practice, transformer model architectures from the literature that are pretrained with masked language modeling are investigated to assess the benefit of achieving lower loss during pretraining. The downstream performance metrics for these models are recorded after each successive interval of masked language modeling pretraining.
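The following is a minimal sketch of the evaluation protocol described above, not the authors' code: a small transformer encoder is pretrained with masked language modeling (MLM), a checkpoint is taken at fixed intervals, and each checkpoint is fine-tuned on a binding classification task so its downstream metric can be recorded alongside the pretraining loss. All model sizes, interval lengths, and the synthetic sequences and labels are illustrative assumptions.

```python
# Sketch only: MLM pretraining in intervals, with a downstream evaluation per checkpoint.
import copy, random
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
MASK, VOCAB, MAXLEN = 20, 21, 24          # 20 amino acids + one mask token

def random_seq(n=MAXLEN):
    return [AA.index(random.choice(AA)) for _ in range(n)]

class TinyEncoder(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, dim_feedforward=128, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(d, VOCAB)   # used only during pretraining
        self.cls_head = nn.Linear(d, 1)       # used only during fine-tuning

    def forward(self, x):
        return self.enc(self.emb(x))

def mlm_step(model, opt, batch, p_mask=0.15):
    """One MLM update: mask ~15% of positions and predict the original tokens."""
    inputs, labels = batch.clone(), batch.clone()
    mask = torch.rand(batch.shape) < p_mask
    inputs[mask] = MASK
    labels[~mask] = -100                      # ignore unmasked positions in the loss
    logits = model.mlm_head(model(inputs))
    loss = nn.functional.cross_entropy(logits.view(-1, VOCAB), labels.view(-1), ignore_index=-100)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def finetune_and_score(checkpoint, pairs, labels, epochs=3):
    """Fine-tune a copy of the checkpoint on labeled pairs and return its accuracy.
    A real study would evaluate held-out pairs with a metric such as AUROC."""
    model = copy.deepcopy(checkpoint)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = torch.tensor(pairs), torch.tensor(labels, dtype=torch.float32)
    for _ in range(epochs):
        logits = model.cls_head(model(x).mean(dim=1)).squeeze(-1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        preds = (model.cls_head(model(x).mean(dim=1)).squeeze(-1) > 0).float()
    return (preds == y).float().mean().item()

# Toy unlabeled sequences for pretraining and toy labeled "binding" examples.
unlabeled = torch.tensor([random_seq() for _ in range(256)])
pairs = [random_seq() for _ in range(64)]
labels = [random.randint(0, 1) for _ in range(64)]

model = TinyEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for interval in range(5):                     # pretraining intervals (checkpoints)
    for step in range(50):                    # MLM updates per interval
        idx = torch.randint(0, len(unlabeled), (32,))
        mlm_loss = mlm_step(model, opt, unlabeled[idx])
    acc = finetune_and_score(model, pairs, labels)
    print(f"interval {interval}: pretrain loss {mlm_loss:.3f}, downstream acc {acc:.3f}")
```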
Results: The results demonstrate that the downstream performance benefit achieved from masked language modeling peaks substantially before the pretraining loss converges. The pretraining loss is largely ineffective for precisely identifying the pretrained model checkpoints (or saved states) with the best downstream performance. However, in these scenarios the pretraining loss can be used to mark a threshold beyond which the downstream performance benefit from pretraining has fully diminished. Further pretraining beyond this threshold does not negatively impact downstream performance but results in unpredictable deviations in both directions from the post-threshold average downstream performance benefit.
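A hypothetical illustration of the thresholding idea: given per-checkpoint records of pretraining loss and downstream score, pick the earliest checkpoint whose downstream score is already within a tolerance of the average score of all later checkpoints, and treat its loss as the threshold. The selection rule, the tolerance, and the toy numbers are illustrative assumptions, not a procedure taken from the paper.

```python
# Sketch only: estimate the pretraining-loss threshold past which downstream benefit is flat.
def loss_threshold(checkpoints, tol=0.01):
    """checkpoints: list of (pretrain_loss, downstream_score), ordered by pretraining step."""
    for i, (loss, score) in enumerate(checkpoints[:-1]):
        tail = [s for _, s in checkpoints[i:]]
        if abs(score - sum(tail) / len(tail)) <= tol:
            return loss      # pretraining past this loss adds no average downstream benefit
    return checkpoints[-1][0]

records = [(2.9, 0.61), (2.4, 0.70), (2.1, 0.74), (1.9, 0.75), (1.8, 0.74), (1.8, 0.75)]
print(loss_threshold(records))   # -> 2.1 with these toy numbers
```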
Availability and implementation: The datasets used in this article for model training are publicly available from each original model's authors at https://github.com/SFGLab/bertrand, https://github.com/wukevin/tcr-bert, https://github.com/NKI-AI/STAPLER, and https://github.com/barthelemymp/TULIP-TCR.