Exo-Tox: Identifying Exotoxins from secreted bacterial proteins.

IF 6.1 | Zone 3, Biology | JCR Q1, MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining Pub Date : 2025-08-08 DOI:10.1186/s13040-025-00469-2

Tanja Krueger, Damla A Durmaz, Luisa F Jimenez-Soto

Citation: Biodata Mining 18(1):52, published 2025-08-08. Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12333140/pdf/

Citations: 0

Abstract

Background: Bacterial exotoxins are secreted proteins that affect target cells and are associated with disease. Identifying them accurately can enhance drug discovery and ensure the safety of bacteria-based medical applications. However, current toxin predictors prioritize broad coverage, mixing toxins from multiple biological kingdoms with diverse control sets. This general approach has proven suboptimal for identifying niche toxins such as bacterial exotoxins. Recent Protein Language Models offer an opportunity to improve toxin prediction by capturing global sequence context and biochemical properties from protein sequences.

Results: We introduce Exo-Tox, a specialized predictor trained exclusively on curated datasets of bacterial exotoxins and secreted non-toxic bacterial proteins, each represented as Protein Language Model embeddings. Exo-Tox outperforms Basic Local Alignment Search Tool (BLAST)-based methods and generalized toxin predictors across multiple metrics, achieving a Matthews correlation coefficient > 0.9. Notably, its performance remains robust regardless of protein length or the presence of signal peptides. We also analyze its limited transferability to bacteriophage proteins and non-secreted proteins.

Conclusion: Exo-Tox reliably identifies bacterial exotoxins, filling a niche overlooked by generalized predictors. Our findings highlight the importance of domain-specific training data and emphasize that specialized predictors are necessary for accurate classification. We provide open access to the model, training data, and usage guidelines via the LMU Munich Open Data repository.
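For readers unfamiliar with the Matthews correlation coefficient (MCC) reported in the Results, the sketch below shows how it is computed from a binary confusion matrix (toxin vs. non-toxin). It is an illustration only and is not taken from the Exo-Tox code; the function name and the counts are hypothetical, and returning 0.0 for an undefined MCC is a common convention assumed here.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Compute MCC from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (an assumed convention
    for the undefined case, e.g. when one class is never predicted).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration; these are NOT results from the paper.
example_mcc = matthews_corrcoef(tp=95, tn=96, fp=4, fn=5)
print(f"MCC = {example_mcc:.3f}")
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction). Because it uses all four cells of the confusion matrix, it remains informative when the toxin and non-toxin classes differ in size.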




Source journal

Biodata Mining (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 7.90

Self-citation rate: 0.00%

Articles published: (not listed)

Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.