Predicting nucleic acid binding sites by attention map-guided graph convolutional network with protein language embeddings and physicochemical information.

IF 7.7 2区生物学 Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics Pub Date : 2025-08-31 DOI:10.1093/bib/bbaf457

Xiang Li, Wei Peng, Xiaolei Zhu

{"title":"Predicting nucleic acid binding sites by attention map-guided graph convolutional network with protein language embeddings and physicochemical information.","authors":"Xiang Li, Wei Peng, Xiaolei Zhu","doi":"10.1093/bib/bbaf457","DOIUrl":null,"url":null,"abstract":"<p><p>Protein-nucleic acid binding sites play a crucial role in biological processes such as gene expression, signal transduction, replication, and transcription. In recent years, with the development of artificial intelligence, protein language models, graph neural networks, and transformer architectures have been adopted to develop both structure-based and sequence-based predictive models. Structure-based methods benefit from the spatial relationship between residues and have shown promising performance. However, structure-based information requires 3D protein structures, which is a challenge for large-scale protein sequence spaces. To address this limitation, researchers have attempted to use predicted protein structure information to guide binding site prediction. While this strategy has improved accuracy, it still depends on the quality of structure predictions. Thus, some studies have returned to prediction methods based solely on protein sequences, particularly those using protein language models, which have greatly enhanced the prediction accuracy. This paper proposes a novel protein-nucleic acid binding site prediction framework, ATtention Maps and Graph convolutional neural networks to predict nucleic acid-protein Binding sites (ATMGBs), which first fuses protein language embeddings with physicochemical properties to obtain multiview information, then leverages the attention map of a protein language model to simulate the relationship between residues, and then utilizes graph convolutional networks for enhancing the feature representations for final prediction. ATMGBs was evaluated on several different independent test sets. The results indicate that the proposed approach significantly improves sequence-based prediction performance, even achieving prediction accuracy comparable to structure-based frameworks. The dataset and code used in this study are available at https://github.com/lixiangli01/ATMGBs.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7000,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12415854/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bib/bbaf457","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Protein-nucleic acid binding sites play a crucial role in biological processes such as gene expression, signal transduction, replication, and transcription. In recent years, with the development of artificial intelligence, protein language models, graph neural networks, and transformer architectures have been adopted to develop both structure-based and sequence-based predictive models. Structure-based methods benefit from the spatial relationship between residues and have shown promising performance. However, structure-based information requires 3D protein structures, which is a challenge for large-scale protein sequence spaces. To address this limitation, researchers have attempted to use predicted protein structure information to guide binding site prediction. While this strategy has improved accuracy, it still depends on the quality of structure predictions. Thus, some studies have returned to prediction methods based solely on protein sequences, particularly those using protein language models, which have greatly enhanced the prediction accuracy. This paper proposes a novel protein-nucleic acid binding site prediction framework, ATtention Maps and Graph convolutional neural networks to predict nucleic acid-protein Binding sites (ATMGBs), which first fuses protein language embeddings with physicochemical properties to obtain multiview information, then leverages the attention map of a protein language model to simulate the relationship between residues, and then utilizes graph convolutional networks for enhancing the feature representations for final prediction. ATMGBs was evaluated on several different independent test sets. The results indicate that the proposed approach significantly improves sequence-based prediction performance, even achieving prediction accuracy comparable to structure-based frameworks. The dataset and code used in this study are available at https://github.com/lixiangli01/ATMGBs.

Abstract Image

查看原文本刊更多论文

基于蛋白质语言嵌入和物理化学信息的注意图引导图卷积网络预测核酸结合位点。

蛋白核酸结合位点在基因表达、信号转导、复制和转录等生物过程中起着至关重要的作用。近年来，随着人工智能的发展，蛋白质语言模型、图神经网络和变压器架构被用于开发基于结构和基于序列的预测模型。基于结构的方法受益于残基之间的空间关系，并显示出良好的性能。然而，基于结构的信息需要三维蛋白质结构，这对大规模蛋白质序列空间是一个挑战。为了解决这一限制，研究人员试图使用预测的蛋白质结构信息来指导结合位点的预测。虽然这种策略提高了准确性，但它仍然依赖于结构预测的质量。因此，一些研究回归到单纯基于蛋白质序列的预测方法，特别是使用蛋白质语言模型的预测方法，大大提高了预测精度。本文提出了一种新的蛋白质-核酸结合位点预测框架，即注意图和图卷积神经网络来预测核酸-蛋白结合位点（atmbs），该框架首先将蛋白质语言嵌入与物理化学性质融合以获得多视图信息，然后利用蛋白质语言模型的注意图来模拟残基之间的关系。然后利用图卷积网络增强特征表示以进行最终预测。在几个不同的独立测试集上对atmbs进行了评估。结果表明，该方法显著提高了基于序列的预测性能，甚至达到了与基于结构的框架相当的预测精度。本研究使用的数据集和代码可在https://github.com/lixiangli01/ATMGBs上获得。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Briefings in bioinformatics 生物-生化研究方法

CiteScore

13.20

自引率

13.70%

发文量

549

审稿时长

6 months

期刊介绍： Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.