Proteins need extra attention: improving the predictive power of protein language models on mutational datasets with hint tokens.

Xinning Li, Ryann M Perez, Sam Giannakoulias, E James Petersson

NAR Genomics and Bioinformatics, 7(3), lqaf128 (published 2025-09-26). doi: 10.1093/nargab/lqaf128. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12464817/pdf/
In this computational study, we address the challenge of predicting protein function after mutation by fine-tuning protein language models (PLMs) with a novel tokenization strategy, hint token learning (HTL). To evaluate the effectiveness of HTL, we benchmarked the approach across four pretrained models of varying architectures and sizes on four diverse protein mutational datasets. Applying HTL yielded significant improvements in weighted F1 scores in most cases. To understand how HTL enhances protein mutational predictions, we trained sparse autoencoders on embeddings derived from the fine-tuned PLMs. Analysis of the latent spaces revealed that the number of activated residues within functional protein domains increased when PLMs were trained with HTL. These findings indicate that PLMs fine-tuned with HTL may capture more biologically relevant representations of proteins. Our study highlights the potential of HTL to advance protein function prediction and provides insights into how HTL enables PLMs to capture mutational impacts at the functional level. All data and code are available at: https://github.com/ejp-lab/EJPLab_Computational_Projects/tree/master/HintTokenLearning.
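The abstract does not specify how hint tokens are injected into a sequence; one plausible reading is that a special token is added to the model's vocabulary and placed at mutated positions so attention can focus there. The following is a minimal, self-contained sketch of that idea only; the token name, vocabulary layout, and insertion rule are all assumptions, not the authors' actual implementation (see the linked repository for that).

```python
# Hypothetical sketch of hint-token tokenization for a mutant sequence.
# The hint token, vocabulary, and insertion-before-residue rule are
# assumptions for illustration, not the paper's actual scheme.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HINT = "<hint>"  # assumed special token flagging a mutated residue


def build_vocab():
    """Map the 20 canonical amino acids to ids, then append the hint token."""
    vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vocab[HINT] = len(vocab)
    return vocab


def tokenize_with_hints(seq, mutated_positions, vocab):
    """Insert a hint-token id immediately before each mutated residue."""
    ids = []
    for i, aa in enumerate(seq):
        if i in mutated_positions:
            ids.append(vocab[HINT])
        ids.append(vocab[aa])
    return ids


vocab = build_vocab()
wild_type = "MKTAYIAK"
mutant = "MKTAYVAK"  # I->V substitution at 0-indexed position 5
mutated = {i for i, (a, b) in enumerate(zip(wild_type, mutant)) if a != b}
ids = tokenize_with_hints(mutant, mutated, vocab)
```

In a real fine-tuning run, adding such a token would also require growing the PLM's embedding matrix by one row (e.g. `model.resize_token_embeddings` in Hugging Face transformers) before training on the hinted sequences.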