Boost Protein Language Model with Injected Structure Information Through Parameter Efficient Fine-tuning

IF 7.0 · JCR Q1 (Biology) · CAS Zone 2 (Medicine)
Zixun Zhang, Yuzhe Zhou, Jiayou Zheng, Chunmei Feng, Shuguang Cui, Sheng Wang, Zhen Li
{"title":"Boost Protein Language Model with Injected Structure Information Through Parameter Efficient Fine-tuning","authors":"Zixun Zhang ,&nbsp;Yuzhe Zhou ,&nbsp;Jiayou Zheng ,&nbsp;Chunmei Feng ,&nbsp;Shuguang Cui ,&nbsp;Sheng Wang ,&nbsp;Zhen Li","doi":"10.1016/j.compbiomed.2025.110607","DOIUrl":null,"url":null,"abstract":"<div><div>Large-scale Protein Language Models (PLMs), such as the Evolutionary Scale Modeling (ESM) family, have significantly advanced our understanding of protein structures and functions. These models have shown immense potential in biomedical applications, including drug discovery, protein design, and understanding disease mechanisms at the molecular level. However, PLMs are typically pre-trained on residue sequences alone, with limited incorporation of structural information, presenting opportunities for further enhancement. In this paper, we propose Structure Information Injecting Tuning (SI-Tuning), a parameter-efficient fine-tuning method, to integrate structural information into PLMs. SI-Tuning maintains the original model parameters in a frozen state while optimizing task-specific vectors for input embedding and attention maps. Structural features, including dihedral angles and distance maps, are used to derive this vector, injecting the structural information that improves model performance in downstream tasks. Extensive experiments on 650M ESM-2 demonstrate the effectiveness of our SI-Tuning across multiple downstream tasks. Specifically, our SI-Tuning achieves an accuracy of 93.95% on DeepLoc binary classification, and 76.05% on Metal Ion Binding, outperforming SaProt, a large-scale pre-trained PLM with structural modeling. SI-Tuning effectively enhances the performance of PLMs by incorporating structural information in a parameter-efficient manner. Our method not only advances downstream task performance, but also offers significant computational efficiency, making it a valuable strategy for applying large-scale PLM to broad biomedical downstream applications. Code is available at <span><span>https://github.com/Nocturne0256/SI-tuning</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":10578,"journal":{"name":"Computers in biology and medicine","volume":"195 ","pages":"Article 110607"},"PeriodicalIF":7.0000,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers in biology and medicine","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0010482525009588","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Large-scale Protein Language Models (PLMs), such as the Evolutionary Scale Modeling (ESM) family, have significantly advanced our understanding of protein structures and functions. These models have shown immense potential in biomedical applications, including drug discovery, protein design, and understanding disease mechanisms at the molecular level. However, PLMs are typically pre-trained on residue sequences alone, with limited incorporation of structural information, presenting opportunities for further enhancement. In this paper, we propose Structure Information Injecting Tuning (SI-Tuning), a parameter-efficient fine-tuning method that integrates structural information into PLMs. SI-Tuning keeps the original model parameters frozen while optimizing task-specific vectors for the input embeddings and attention maps. Structural features, including dihedral angles and distance maps, are used to derive these vectors, injecting the structural information that improves model performance on downstream tasks. Extensive experiments on the 650M-parameter ESM-2 demonstrate the effectiveness of SI-Tuning across multiple downstream tasks. Specifically, SI-Tuning achieves an accuracy of 93.95% on DeepLoc binary classification and 76.05% on Metal Ion Binding, outperforming SaProt, a large-scale PLM pre-trained with structural modeling. SI-Tuning effectively enhances the performance of PLMs by incorporating structural information in a parameter-efficient manner. Our method not only advances downstream task performance but also offers significant computational efficiency, making it a valuable strategy for applying large-scale PLMs to a broad range of biomedical downstream applications. Code is available at https://github.com/Nocturne0256/SI-tuning.
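To make the mechanism described in the abstract concrete, the following is a minimal sketch of the SI-Tuning idea, not the authors' implementation (see the linked repository for that). It assumes structure features are precomputed as per-residue dihedral angles (sin/cos of phi and psi) and a residue-residue distance map; the module and function names (StructureInjector, embed_proj, attn_proj, load_esm2_650m) are hypothetical.

```python
# Minimal sketch of the SI-Tuning idea from the abstract -- NOT the authors'
# implementation (see https://github.com/Nocturne0256/SI-tuning for that).
# Assumptions (hypothetical names throughout): dihedrals are precomputed as
# sin/cos of phi/psi per residue, shape (B, L, 4); dist_map is a pairwise
# residue-residue distance matrix, shape (B, L, L).
import torch
import torch.nn as nn

class StructureInjector(nn.Module):
    """Trainable adapter deriving injection vectors from structure features."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        # Maps per-residue dihedral features to an additive offset
        # on the frozen PLM's input embeddings.
        self.embed_proj = nn.Linear(4, embed_dim)
        # Maps pairwise distances to a per-head additive attention bias.
        self.attn_proj = nn.Linear(1, num_heads)

    def forward(self, dihedrals: torch.Tensor, dist_map: torch.Tensor):
        embed_offset = self.embed_proj(dihedrals)               # (B, L, D)
        attn_bias = self.attn_proj(dist_map.unsqueeze(-1))      # (B, L, L, H)
        attn_bias = attn_bias.permute(0, 3, 1, 2).contiguous()  # (B, H, L, L)
        return embed_offset, attn_bias

# Parameter-efficient setup: the backbone stays frozen; only the
# injector (and a task head) receive gradients.
# plm = load_esm2_650m()          # hypothetical loader for the frozen PLM
# for p in plm.parameters():
#     p.requires_grad = False
# injector = StructureInjector(embed_dim=1280, num_heads=20)  # ESM-2 650M dims
```

In this reading, embed_offset would be added to the token embeddings before the first transformer layer and attn_bias added to the pre-softmax attention logits; the paper's actual injection points and vector parameterization may differ.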
Source Journal

Computers in Biology and Medicine (Engineering, Biomedical)
CiteScore: 11.70
Self-citation rate: 10.40%
Annual publications: 1086
Review time: 74 days
Journal Description: Computers in Biology and Medicine is an international forum for sharing groundbreaking advancements in the use of computers in bioscience and medicine. This journal serves as a medium for communicating essential research, instruction, ideas, and information regarding the rapidly evolving field of computer applications in these domains. By encouraging the exchange of knowledge, we aim to facilitate progress and innovation in the utilization of computers in biology and medicine.