{"title":"隐藏语言建模预训练动态下游肽:t细胞受体结合预测。","authors":"Brock Landry, Jian Zhang","doi":"10.1093/bioadv/vbaf028","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Predicting antigen peptide and T-cell receptor (TCR) binding is difficult due to the combinatoric nature of peptides and the scarcity of labeled peptide-binding pairs. The masked language modeling method of pretraining is reliably used to increase the downstream performance of peptide:TCR binding prediction models by leveraging unlabeled data. In the literature, binding prediction models are commonly trained until the validation loss converges. To evaluate this method, cited transformer model architectures pretrained with masked language modeling are investigated to assess the benefits of achieving lower loss metrics during pretraining. The downstream performance metrics for these works are recorded after each subsequent interval of masked language modeling pretraining.</p><p><strong>Results: </strong>The results demonstrate that the downstream performance benefit achieved from masked language modeling peaks substantially before the pretraining loss converges. Using the pretraining loss metric is largely ineffective for precisely identifying the best downstream performing pretrained model checkpoints (or saved states). However, the pretraining loss metric in these scenarios can be used to mark a threshold in which the downstream performance benefits from pretraining have fully diminished. Further pretraining beyond this threshold does not negatively impact downstream performance but results in unpredictable bilateral deviations from the post-threshold average downstream performance benefit.</p><p><strong>Availability and implementation: </strong>The datasets used in this article for model training are publicly available from each original model's authors at https://github.com/SFGLab/bertrand, https://github.com/wukevin/tcr-bert, https://github.com/NKI-AI/STAPLER, and https://github.com/barthelemymp/TULIP-TCR.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf028"},"PeriodicalIF":2.4000,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11908642/pdf/","citationCount":"0","resultStr":"{\"title\":\"Masked language modeling pretraining dynamics for downstream peptide: T-cell receptor binding prediction.\",\"authors\":\"Brock Landry, Jian Zhang\",\"doi\":\"10.1093/bioadv/vbaf028\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Motivation: </strong>Predicting antigen peptide and T-cell receptor (TCR) binding is difficult due to the combinatoric nature of peptides and the scarcity of labeled peptide-binding pairs. The masked language modeling method of pretraining is reliably used to increase the downstream performance of peptide:TCR binding prediction models by leveraging unlabeled data. In the literature, binding prediction models are commonly trained until the validation loss converges. To evaluate this method, cited transformer model architectures pretrained with masked language modeling are investigated to assess the benefits of achieving lower loss metrics during pretraining. 
The downstream performance metrics for these works are recorded after each subsequent interval of masked language modeling pretraining.</p><p><strong>Results: </strong>The results demonstrate that the downstream performance benefit achieved from masked language modeling peaks substantially before the pretraining loss converges. Using the pretraining loss metric is largely ineffective for precisely identifying the best downstream performing pretrained model checkpoints (or saved states). However, the pretraining loss metric in these scenarios can be used to mark a threshold in which the downstream performance benefits from pretraining have fully diminished. Further pretraining beyond this threshold does not negatively impact downstream performance but results in unpredictable bilateral deviations from the post-threshold average downstream performance benefit.</p><p><strong>Availability and implementation: </strong>The datasets used in this article for model training are publicly available from each original model's authors at https://github.com/SFGLab/bertrand, https://github.com/wukevin/tcr-bert, https://github.com/NKI-AI/STAPLER, and https://github.com/barthelemymp/TULIP-TCR.</p>\",\"PeriodicalId\":72368,\"journal\":{\"name\":\"Bioinformatics advances\",\"volume\":\"5 1\",\"pages\":\"vbaf028\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2025-02-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11908642/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Bioinformatics advances\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/bioadv/vbaf028\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics advances","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioadv/vbaf028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
Masked language modeling pretraining dynamics for downstream peptide: T-cell receptor binding prediction.
Motivation: Predicting antigen peptide and T-cell receptor (TCR) binding is difficult due to the combinatorial nature of peptides and the scarcity of labeled peptide-binding pairs. Masked language modeling pretraining is routinely used to improve the downstream performance of peptide:TCR binding prediction models by leveraging unlabeled data. In the literature, such models are commonly pretrained until the validation loss converges. To evaluate this practice, published transformer architectures pretrained with masked language modeling are investigated to assess whether reaching lower loss during pretraining actually benefits downstream performance. Downstream performance metrics for these models are recorded after each successive interval of masked language modeling pretraining.
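The interval-based evaluation protocol can be sketched in code. The snippet below is a minimal, hypothetical illustration, not the authors' implementation: a toy transformer encoder is pretrained with BERT-style masked language modeling on stand-in amino-acid sequences, and a checkpoint is saved after every fixed number of steps so that each saved state can later be fine-tuned and scored on labeled peptide:TCR pairs. The vocabulary layout, model size, corpus, and the CHECKPOINT_EVERY interval are all assumptions made for illustration.

import random
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, MASK = 0, 1                       # special token ids (assumed layout)
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}
VOCAB_SIZE = len(VOCAB) + 2
MAX_LEN = 24

def encode(seq):
    ids = [VOCAB[aa] for aa in seq][:MAX_LEN]
    return torch.tensor(ids + [PAD] * (MAX_LEN - len(ids)))

class TinyEncoder(nn.Module):
    # Small transformer encoder with an MLM head (stand-in architecture).
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.mlm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, x):
        return self.mlm_head(self.encoder(self.embed(x)))

def mask_tokens(batch, mask_prob=0.15):
    # BERT-style corruption: the loss is computed only on masked positions.
    labels = batch.clone()
    chosen = (torch.rand(batch.shape) < mask_prob) & (batch != PAD)
    labels[~chosen] = -100             # ignored by cross-entropy
    corrupted = batch.clone()
    corrupted[chosen] = MASK
    return corrupted, labels

# Toy unlabeled corpus; the cited studies pretrain on large unlabeled TCR repertoires.
corpus = ["".join(random.choices(AMINO_ACIDS, k=random.randint(10, 20)))
          for _ in range(512)]
data = torch.stack([encode(s) for s in corpus])

model = TinyEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

CHECKPOINT_EVERY = 50                  # pretraining interval (assumed value)
for step in range(1, 301):
    batch = data[torch.randint(0, len(data), (32,))]
    inputs, labels = mask_tokens(batch)
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), labels.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % CHECKPOINT_EVERY == 0:
        # Each saved state is later fine-tuned on labeled peptide:TCR pairs
        # and scored downstream, rather than keeping only the converged model.
        torch.save(model.state_dict(), f"mlm_step{step}.pt")
        print(f"step {step}: MLM loss {loss.item():.3f} (checkpoint saved)")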
Results: The results demonstrate that the downstream performance benefit from masked language modeling peaks well before the pretraining loss converges. The pretraining loss is largely ineffective for precisely identifying the pretrained checkpoints (saved model states) that perform best downstream. However, it can be used to mark a threshold beyond which the downstream benefit of pretraining has fully diminished. Pretraining past this threshold does not harm downstream performance, but it produces unpredictable deviations in both directions around the post-threshold average downstream benefit.
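As a hypothetical illustration of this threshold idea, the sketch below marks the pretraining step at which the validation MLM loss stops improving by more than a small tolerance and contrasts it with the checkpoint that happened to score best downstream. Neither the code nor the numbers come from the paper; all values are invented solely to show the pattern described above, and the tolerance is an arbitrary choice.

from dataclasses import dataclass

@dataclass
class Checkpoint:
    step: int                # pretraining step at which the state was saved
    mlm_loss: float          # validation MLM loss at that step
    downstream_auroc: float  # AUROC after fine-tuning on peptide:TCR pairs

def loss_plateau_step(ckpts, tol=0.02):
    # First step after which the MLM loss improves by less than `tol`.
    for prev, cur in zip(ckpts, ckpts[1:]):
        if prev.mlm_loss - cur.mlm_loss < tol:
            return cur.step
    return ckpts[-1].step

# Invented numbers, purely to show the pattern described in the abstract:
# downstream AUROC peaks well before the MLM loss converges.
history = [
    Checkpoint(50,  2.60, 0.71),
    Checkpoint(100, 2.10, 0.78),
    Checkpoint(150, 1.85, 0.80),   # downstream peak
    Checkpoint(200, 1.75, 0.77),
    Checkpoint(250, 1.72, 0.79),   # loss still creeping down
    Checkpoint(300, 1.71, 0.78),   # improvement below tolerance: threshold
]

threshold = loss_plateau_step(history)
best = max(history, key=lambda c: c.downstream_auroc)
print(f"MLM-loss plateau reached at step {threshold}; "
      f"best downstream checkpoint was step {best.step} "
      f"(AUROC {best.downstream_auroc:.2f})")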
Availability and implementation: The datasets used in this article for model training are publicly available from each original model's authors at https://github.com/SFGLab/bertrand, https://github.com/wukevin/tcr-bert, https://github.com/NKI-AI/STAPLER, and https://github.com/barthelemymp/TULIP-TCR.