USPDB: A novel U-shaped equivariant graph neural network with subgraph sampling for protein-DNA binding site prediction

Impact factor: 7.5 · CAS Zone 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Expert Systems with Applications, vol. 291, Article 128554 · Pub Date: 2025-06-14 · DOI: 10.1016/j.eswa.2025.128554

Jia Mi, Chang Li, Han Wang, Ying Du, Chong Chu, Jing Wan, Kunfeng Wang


Citations: 0

Abstract


Protein-DNA binding regulates gene expression and thereby directly influences the normal functioning of biological processes. Accurately identifying binding sites can reveal the mechanisms of protein-DNA interaction and guide drug target development. However, traditional experimental methods are time-consuming and costly, which motivates efficient computational methods. Although existing computational methods have made significant progress in protein binding site prediction, they have difficulty extracting key residue features and atomic-level features. To address this, we propose USPDB, a novel method for protein-DNA binding site prediction based on a U-shaped Equivariant Graph Neural Network (U-EGNNet) and subgraph sampling. USPDB reformulates binding site prediction by converting the protein into a graph and performing binary classification on each residue. It leverages protein large language models, such as ProtTrans, ESM2, and ESM3, to extract sequence and structural features. The General Equivariant Transformer (GET) module is employed to capture geometric features of residues and atoms. Additionally, the U-EGNNet, composed of EGNN and subgraph sampling, preserves more global information while sampling subgraphs that contain key residues for further computation. Experimental results on the DNA_test_181 and DNA_test_129 datasets show that USPDB achieves prediction accuracies of 0.532 and 0.361, respectively, outperforming all baseline methods. Interpretability analysis shows that USPDB effectively focuses on residues within DNA-binding domains without requiring prior knowledge, thereby enhancing the performance of DNA-binding protein prediction. The code is publicly available at https://github.com/MiJia-ID/USPDB
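The pipeline outlined in the abstract (protein as a residue graph, an equivariant message-passing step, subgraph sampling of key residues, and a per-residue binary decision) can be illustrated with a toy sketch. This is not the authors' implementation, which lives in the linked repository. Everything below is an assumption made for illustration: the function names, the 10 Å contact cutoff, the scalar weights `w_msg` and `w_pos` standing in for learned MLPs, the `0.01` distance scale, and the top-k scoring rule used for subgraph sampling.

```python
import math

def contact_edges(coords, cutoff=10.0):
    """Directed edges (i, j) between distinct residues closer than `cutoff`."""
    n = len(coords)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and math.dist(coords[i], coords[j]) < cutoff]

def egnn_layer(h, x, edges, w_msg, w_pos):
    """One simplified EGNN-style update (scalar weights replace learned MLPs).

    Messages depend only on features and squared distances, so the updated
    features are invariant and the updated coordinates are equivariant
    under rotation and translation of the input coordinates.
    """
    n, f = len(h), len(h[0])
    agg = [[0.0] * f for _ in range(n)]
    shift = [[0.0] * 3 for _ in range(n)]
    deg = [0] * n
    for i, j in edges:
        d2 = sum((a - b) ** 2 for a, b in zip(x[i], x[j]))
        # Invariant message built from the feature pair and the distance.
        m = [math.tanh(w_msg * (h[i][k] + h[j][k]) - 0.01 * d2) for k in range(f)]
        gate = math.tanh(w_pos * sum(m))  # scalar weight on the relative vector
        for k in range(f):
            agg[i][k] += m[k]
        for c in range(3):
            shift[i][c] += (x[i][c] - x[j][c]) * gate
        deg[i] += 1
    h_new = [[h[i][k] + agg[i][k] / max(deg[i], 1) for k in range(f)] for i in range(n)]
    x_new = [[x[i][c] + shift[i][c] / max(deg[i], 1) for c in range(3)] for i in range(n)]
    return h_new, x_new

def top_k_subgraph(h, edges, w_score, k):
    """Keep the k highest-scoring residues and the edges among them."""
    scores = [sum(w * v for w, v in zip(w_score, row)) for row in h]
    ranked = sorted(range(len(h)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    kept = set(keep)
    return keep, [(i, j) for i, j in edges if i in kept and j in kept]

def binding_probability(row, w_out):
    """Per-residue binary classifier: sigmoid of a linear score."""
    z = sum(w * v for w, v in zip(w_out, row))
    return 1.0 / (1.0 + math.exp(-z))

if __name__ == "__main__":
    coords = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 30.0]]
    feats = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [1.0, 1.0]]
    edges = contact_edges(coords)
    feats, coords = egnn_layer(feats, coords, edges, w_msg=0.5, w_pos=0.3)
    keep, sub_edges = top_k_subgraph(feats, edges, w_score=[1.0, 1.0], k=2)
    print(keep, sub_edges)
    print([round(binding_probability(row, [1.0, -1.0]), 3) for row in feats])
```

In a U-shaped design, the sampled subgraph would be processed further and its features merged back into the full graph before classification; that decoder path is omitted here to keep the sketch short.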


Source journal

Expert Systems with Applications (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

13.80

Self-citation rate

10.60%

Articles published

2045

Review time

8.7 months

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.