Proteins need extra attention: improving the predictive power of protein language models on mutational datasets with hint tokens.

Xinning Li, Ryann M Perez, Sam Giannakoulias, E James Petersson

NAR Genomics and Bioinformatics, 7(3), lqaf128 (published 2025-09-26). doi: 10.1093/nargab/lqaf128. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12464817/pdf/
In this computational study, we address the challenge of predicting protein function after mutation by fine-tuning protein language models (PLMs) with a novel tokenization strategy, hint token learning (HTL). To evaluate the effectiveness of HTL, we benchmarked the approach across four pretrained models of varying architectures and sizes on four diverse protein mutational datasets. Applying HTL yielded significant improvements in weighted F1 scores in most cases. To understand how HTL enhances protein mutational predictions, we trained sparse autoencoders on embeddings derived from the fine-tuned PLMs. Analysis of the latent spaces revealed that the number of activated residues within functional protein domains increased when PLMs were trained with HTL. These findings indicate that PLMs fine-tuned with HTL may capture more biologically relevant representations of proteins. Our study highlights the potential of HTL to advance protein function prediction and provides insights into how HTL enables PLMs to capture mutational impacts at the functional level. All data and code are available at: https://github.com/ejp-lab/EJPLab_Computational_Projects/tree/master/HintTokenLearning.
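The abstract does not specify how hint tokens are injected into a sequence; one plausible reading is that a special token is added to the model's vocabulary and placed at mutated positions so attention can focus there. The following is a minimal, self-contained sketch of that idea only; the token name, vocabulary layout, and insertion rule are all assumptions, not the authors' actual implementation (see the linked repository for that).

```python
# Hypothetical sketch of hint-token tokenization for a mutant sequence.
# The hint token, vocabulary, and insertion-before-residue rule are
# assumptions for illustration, not the paper's actual scheme.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HINT = "<hint>"  # assumed special token flagging a mutated residue


def build_vocab():
    """Map the 20 canonical amino acids to ids, then append the hint token."""
    vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vocab[HINT] = len(vocab)
    return vocab


def tokenize_with_hints(seq, mutated_positions, vocab):
    """Insert a hint-token id immediately before each mutated residue."""
    ids = []
    for i, aa in enumerate(seq):
        if i in mutated_positions:
            ids.append(vocab[HINT])
        ids.append(vocab[aa])
    return ids


vocab = build_vocab()
wild_type = "MKTAYIAK"
mutant = "MKTAYVAK"  # I->V substitution at 0-indexed position 5
mutated = {i for i, (a, b) in enumerate(zip(wild_type, mutant)) if a != b}
ids = tokenize_with_hints(mutant, mutated, vocab)
```

In a real fine-tuning run, adding such a token would also require growing the PLM's embedding matrix by one row (e.g. `model.resize_token_embeddings` in Hugging Face transformers) before training on the hinted sequences.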