Predicting the bacterial host range of plasmid genomes using the language model-based one-class support vector machine algorithm.

IF 4 2区 生物学 Q1 GENETICS & HEREDITY
Tao Feng, Xirao Chen, Shufang Wu, Waijiao Tang, Hongwei Zhou, Zhencheng Fang
{"title":"Predicting the bacterial host range of plasmid genomes using the language model-based one-class support vector machine algorithm.","authors":"Tao Feng, Xirao Chen, Shufang Wu, Waijiao Tang, Hongwei Zhou, Zhencheng Fang","doi":"10.1099/mgen.0.001355","DOIUrl":null,"url":null,"abstract":"<p><p>The prediction of the plasmid host range is crucial for investigating the dissemination of plasmids and the transfer of resistance and virulence genes mediated by plasmids. Several machine learning-based tools have been developed to predict plasmid host ranges. These tools have been trained and tested based on the bacterial host records of plasmids in related databases. Typically, a plasmid genome in databases such as the National Center for Biotechnology Information is annotated with only one or a few bacterial hosts, which does not encompass all possible hosts. Consequently, existing methods may significantly underestimate the host ranges of mobile plasmids. In this work, we propose a novel method named HRPredict, which employs a word vector model to digitally represent the encoded proteins on plasmid genomes. Since it is difficult to confirm which host a particular plasmid definitely cannot enter, we developed a machine learning approach for predicting whether a plasmid can enter a specific bacterium as a no-negative samples learning task. Using multiple one-class support vector machine (SVM) models that do not require negative samples for training, HRPredict predicts the host range of plasmids across 45 families, 56 genera and 56 species. In the benchmark test set, we constructed reliable negative samples for each host taxonomic unit via two indirect methods, and we found that the area under the curve (AUC), F1-score, recall, precision and accuracy of most taxonomic unit prediction models exceeded 0.9. Among the 13 broad-host-range plasmid types, HRPredict demonstrated greater coverage than HOTSPOT and PlasmidHostFinder, thus successfully predicting the majority of hosts previously reported. Through feature importance calculation for each SVM model, we found that genes closely related to the plasmid host range are involved in functions such as bacterial adaptability, pathogenicity and survival. These findings provide significant insight into the mechanisms through which bacteria adjust to diverse environments through plasmids. The HRPredict algorithm is expected to facilitate in-depth research on the spread of broad-host-range plasmids and enable host-range predictions for novel plasmids reconstructed from microbiome sequencing data.</p>","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 2","pages":""},"PeriodicalIF":4.0000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Microbial Genomics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1099/mgen.0.001355","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"GENETICS & HEREDITY","Score":null,"Total":0}
引用次数: 0

Abstract

The prediction of the plasmid host range is crucial for investigating the dissemination of plasmids and the transfer of resistance and virulence genes mediated by plasmids. Several machine learning-based tools have been developed to predict plasmid host ranges. These tools have been trained and tested based on the bacterial host records of plasmids in related databases. Typically, a plasmid genome in databases such as the National Center for Biotechnology Information is annotated with only one or a few bacterial hosts, which does not encompass all possible hosts. Consequently, existing methods may significantly underestimate the host ranges of mobile plasmids. In this work, we propose a novel method named HRPredict, which employs a word vector model to digitally represent the encoded proteins on plasmid genomes. Since it is difficult to confirm which host a particular plasmid definitely cannot enter, we developed a machine learning approach for predicting whether a plasmid can enter a specific bacterium as a no-negative samples learning task. Using multiple one-class support vector machine (SVM) models that do not require negative samples for training, HRPredict predicts the host range of plasmids across 45 families, 56 genera and 56 species. In the benchmark test set, we constructed reliable negative samples for each host taxonomic unit via two indirect methods, and we found that the area under the curve (AUC), F1-score, recall, precision and accuracy of most taxonomic unit prediction models exceeded 0.9. Among the 13 broad-host-range plasmid types, HRPredict demonstrated greater coverage than HOTSPOT and PlasmidHostFinder, thus successfully predicting the majority of hosts previously reported. Through feature importance calculation for each SVM model, we found that genes closely related to the plasmid host range are involved in functions such as bacterial adaptability, pathogenicity and survival. These findings provide significant insight into the mechanisms through which bacteria adjust to diverse environments through plasmids. The HRPredict algorithm is expected to facilitate in-depth research on the spread of broad-host-range plasmids and enable host-range predictions for novel plasmids reconstructed from microbiome sequencing data.

基于语言模型的一类支持向量机算法预测细菌宿主质粒基因组范围。
质粒宿主范围的预测对于研究质粒的传播以及质粒介导的抗性和毒力基因的转移具有重要意义。已经开发了几种基于机器学习的工具来预测质粒宿主范围。这些工具已经根据相关数据库中质粒的细菌宿主记录进行了培训和测试。通常,在诸如美国国家生物技术信息中心这样的数据库中,一个质粒基因组只注释了一个或几个细菌宿主,这并不包括所有可能的宿主。因此,现有的方法可能大大低估了移动质粒的宿主范围。在这项工作中,我们提出了一种名为HRPredict的新方法,该方法采用词向量模型对质粒基因组上编码的蛋白质进行数字表示。由于很难确定一个特定的质粒绝对不能进入哪个宿主,我们开发了一种机器学习方法来预测一个质粒是否可以进入一个特定的细菌,作为一个无阴性样本学习任务。HRPredict利用不需要负样本进行训练的多个单类支持向量机(SVM)模型,预测了45个科、56个属和56个物种的质粒宿主范围。在基准测试集中,我们通过两种间接方法构建了每个宿主分类单元的可靠阴性样本,结果发现大多数分类单元预测模型的曲线下面积(AUC)、f1得分、召回率、精密度和准确度均超过0.9。在13种宿主范围广泛的质粒类型中,HRPredict比HOTSPOT和PlasmidHostFinder的覆盖范围更广,因此成功预测了之前报道的大多数宿主。通过对每个SVM模型的特征重要性计算,我们发现与质粒宿主范围密切相关的基因参与了细菌的适应性、致病性和生存等功能。这些发现为细菌通过质粒适应不同环境的机制提供了重要的见解。HRPredict算法有望促进对广泛宿主质粒传播的深入研究,并使从微生物组测序数据重建的新型质粒能够预测宿主范围。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Microbial Genomics
Microbial Genomics Medicine-Epidemiology
CiteScore
6.60
自引率
2.60%
发文量
153
审稿时长
12 weeks
期刊介绍: Microbial Genomics (MGen) is a fully open access, mandatory open data and peer-reviewed journal publishing high-profile original research on archaea, bacteria, microbial eukaryotes and viruses.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信