ToxDL 2.0: Protein toxicity prediction using a pretrained language model and graph neural networks.

IF 4.4; Zone 2 (Biology); JCR Q2 (Biochemistry & Molecular Biology)

Computational and Structural Biotechnology Journal; Pub Date: 2025-04-02; eCollection Date: 2025-01-01; DOI: 10.1016/j.csbj.2025.04.002

Lin Zhu, Yi Fang, Shuting Liu, Hong-Bin Shen, Wesley De Neve, Xiaoyong Pan

Volume 27, pages 1538-1549. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12018212/pdf/

Citations: 0

Abstract

Motivation: Assessing the potential toxicity of proteins is crucial for both therapeutic and agricultural applications. Traditional experimental methods for protein toxicity evaluation are time-consuming, expensive, and labor-intensive, highlighting the requirement for efficient computational approaches. Recent advancements in language models and deep learning have significantly improved protein toxicity prediction, yet current models often lack the ability to integrate evolutionary and structural information, which is crucial for accurate toxicity assessment of proteins.

Results: In this study, we present ToxDL 2.0, a novel multimodal deep learning model for protein toxicity prediction that integrates both evolutionary and structural information derived from a pretrained language model and AlphaFold2. ToxDL 2.0 consists of three key modules: (1) a Graph Convolutional Network (GCN) module for generating protein graph embeddings based on AlphaFold2-predicted structures, (2) a domain embedding module for capturing protein domain representations, and (3) a dense module that combines these embeddings to predict the toxicity. After constructing a comprehensive toxicity benchmark dataset, we obtained experimental results on both an original non-redundant test set (comprising pre-2022 protein sequences) and an independent non-redundant test set (a holdout set of post-2022 protein sequences), demonstrating that ToxDL 2.0 outperforms existing state-of-the-art methods. Additionally, we utilized Integrated Gradients to discover known toxic motifs associated with protein toxicity. A web server for ToxDL 2.0 is publicly available at www.csbio.sjtu.edu.cn/bioinf/ToxDL2/.
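The three-module layout described in the Results can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the contact cutoff, layer sizes, pooling scheme, or feature dimensions, so the 10 Å C-alpha cutoff, the single GCN layer, the mean pooling, and every name below (`contact_graph`, `gcn_layer`, `predict_toxicity`, `params`) are assumptions made for illustration. Random arrays stand in for the language-model residue embeddings, the AlphaFold2 coordinates, and the domain embedding.

```python
import numpy as np


def contact_graph(coords, cutoff=10.0):
    """Adjacency matrix: residues are linked if their C-alpha atoms are
    closer than `cutoff` angstroms (assumed threshold). The diagonal is 1,
    so every node carries a self-loop."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (dists < cutoff).astype(float)


def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(D^-1/2 A D^-1/2 X W)."""
    deg_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))  # degree >= 1 due to self-loops
    norm_adj = adj * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(norm_adj @ feats @ weight, 0.0)


def predict_toxicity(coords, residue_feats, domain_vec, params):
    """Forward pass: (1) GCN graph embedding, (2) domain embedding,
    (3) dense module on the concatenation, sigmoid output."""
    adj = contact_graph(coords)
    hidden = gcn_layer(adj, residue_feats, params["w_gcn"])
    graph_vec = hidden.mean(axis=0)                  # mean-pool over residues (assumed)
    fused = np.concatenate([graph_vec, domain_vec])  # combine both embeddings
    dense = np.maximum(fused @ params["w_dense"] + params["b_dense"], 0.0)
    logit = dense @ params["w_out"] + params["b_out"]
    return float(1.0 / (1.0 + np.exp(-logit)))


# Toy inputs: all values are random placeholders, not real protein data.
rng = np.random.default_rng(0)
n_res, d_feat, d_hid, d_dom, d_dense = 30, 16, 8, 4, 6
coords = rng.normal(scale=8.0, size=(n_res, 3))    # stand-in for predicted C-alpha coordinates
residue_feats = rng.normal(size=(n_res, d_feat))   # stand-in for language-model embeddings
domain_vec = rng.normal(size=d_dom)                # stand-in for the domain embedding
params = {
    "w_gcn": rng.normal(scale=0.1, size=(d_feat, d_hid)),
    "w_dense": rng.normal(scale=0.1, size=(d_hid + d_dom, d_dense)),
    "b_dense": np.zeros(d_dense),
    "w_out": rng.normal(scale=0.1, size=d_dense),
    "b_out": 0.0,
}
prob = predict_toxicity(coords, residue_feats, domain_vec, params)
print(f"toy toxicity probability: {prob:.3f}")
```

With untrained random weights the printed probability carries no meaning; the sketch only shows how a structure-derived graph embedding and a domain embedding might be fused before a dense classifier.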

Source journal: Computational and Structural Biotechnology Journal

Subject area: Biochemistry, Genetics and Molecular Biology - Biophysics

CiteScore: 9.30

Self-citation rate: 3.30%

Articles published: 540

Review time: 6 weeks

About the journal: Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a prerequisite for publication in the journal. Specific areas of interest include, but are not limited to: Structure and function of proteins, nucleic acids and other macromolecules; Structure and function of multi-component complexes; Protein folding, processing and degradation; Enzymology; Computational and structural studies of plant systems; Microbial Informatics; Genomics; Proteomics; Metabolomics; Algorithms and Hypothesis in Bioinformatics; Mathematical and Theoretical Biology; Computational Chemistry and Drug Discovery; Microscopy and Molecular Imaging; Nanotechnology; Systems and Synthetic Biology.