Accurate prediction of virulence factors using pre-train protein language model and ensemble learning.

IF 3.5, Zone 2 (Biology), JCR Q2, BIOTECHNOLOGY & APPLIED MICROBIOLOGY

BMC Genomics Pub Date : 2025-05-21 DOI:10.1186/s12864-025-11694-8

Guanghui Li, Jian Zhou, Jiawei Luo, Cheng Liang

BMC Genomics, volume 26, issue 1, page 517. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12093764/pdf/

Citations: 0

Abstract

Background: As bacterial pathogens develop increasing resistance to antibiotics, strategies targeting virulence factors (VFs) have emerged as a promising and effective approach for treating bacterial infections. Existing methods rely mainly on sequence similarity, but remote homology relationships cannot be discovered by sequence analysis alone.

Results: To address this limitation, we developed PLMVF, an approach to VF identification that combines a protein language model with ensemble learning. Specifically, we extracted features from protein sequences using ESM-2 and predicted the proteins' three-dimensional (3D) structures using ESMFold. We calculated the true TM-scores of the proteins from these 3D structures and trained a TM-predictor model to predict structural similarity, thereby capturing remote homology information hidden in the sequences. We then concatenated the sequence-level features extracted by ESM-2 with the predicted TM-score features to form a comprehensive feature set for prediction. In extensive experimental validation, PLMVF achieved an accuracy (ACC) of 86.1%, significantly outperforming existing models across multiple evaluation metrics. This study provides an ideal tool for identifying novel targets in the development of anti-virulence therapies, offering promise for the effective prevention and control of pathogenic bacterial infections.
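
For readers unfamiliar with the TM-score used as the structural-similarity target, the following is a minimal sketch of the standard TM-score formula evaluated for one fixed superposition and alignment. It is not the authors' code: the function names are hypothetical, the real TM-score additionally maximises over superpositions (which this sketch omits), and the 0.5 Å floor on the distance scale for very short chains is an assumption made here to keep the function defined.

```python
import math


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 from the standard TM-score definition."""
    # The closed form is only meaningful for chains longer than 15 residues.
    # The 0.5 Angstrom floor used below is an assumption of this sketch.
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_fixed_alignment(distances, target_length: int) -> float:
    """TM-score for a single, already-given superposition and alignment.

    distances: distances in Angstroms between aligned C-alpha atom pairs.
    target_length: length of the protein used for normalisation.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    d0 = tm_d0(target_length)
    # Each aligned pair contributes a value in (0, 1]; unaligned residues add 0.
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# Toy example: 100-residue target with every residue aligned at zero distance.
perfect = tm_score_fixed_alignment([0.0] * 100, 100)
# Toy example: only half of the residues are aligned (still at zero distance).
half = tm_score_fixed_alignment([0.0] * 50, 100)
```

Because the score is normalised by the target length rather than the number of aligned residues, a partial alignment is penalised even when the aligned part matches exactly.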

Conclusions: The proposed PLMVF model offers an efficient computational approach for VF identification.
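
The feature concatenation and ensemble step described in the Results can be illustrated with a small sketch. The abstract does not specify the base classifiers or how their outputs are combined, so the soft-voting average below is an assumption chosen for illustration, as are all function names, the toy vectors, and the reading of "TM-score features" as predicted TM-scores against reference proteins.

```python
def concatenate_features(sequence_embedding, tm_features):
    """Join a sequence-level embedding with predicted TM-score features."""
    return list(sequence_embedding) + list(tm_features)


def soft_vote(probabilities, threshold=0.5):
    """Average base-model probabilities and threshold the mean.

    Soft voting is an assumed combination rule, not one stated in the abstract.
    """
    if not probabilities:
        raise ValueError("need at least one base-model probability")
    mean_prob = sum(probabilities) / len(probabilities)
    return mean_prob, mean_prob >= threshold


# Toy values only; none of these numbers come from the paper.
embedding = [0.12, -0.40, 0.88]   # stand-in for an ESM-2 sequence-level vector
tm_features = [0.71, 0.35]        # stand-in for predicted TM-score features
features = concatenate_features(embedding, tm_features)

base_probs = [0.9, 0.6, 0.3]      # hypothetical outputs of three base classifiers
mean_prob, is_vf = soft_vote(base_probs)
```

In a real pipeline each base classifier would be trained on the concatenated vector; the sketch only shows how the two feature sources and the base-model outputs fit together.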

Source journal: BMC Genomics (Biology: Biotechnology & Applied Microbiology)
CiteScore: 7.40
Self-citation rate: 4.50%
Articles published: 769
Review time: 6.4 months

About the journal: BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics. BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.