Predicting the bacterial host range of plasmid genomes using the language model-based one-class support vector machine algorithm.

IF 4; Region 2 (Biology); JCR Q1, GENETICS & HEREDITY
Tao Feng, Xirao Chen, Shufang Wu, Waijiao Tang, Hongwei Zhou, Zhencheng Fang
Journal: Microbial Genomics, volume 11, issue 2. Published 2025-02-01. Journal Article.
DOI: 10.1099/mgen.0.001355 (https://doi.org/10.1099/mgen.0.001355)

Abstract

Predicting the host range of plasmids is crucial for investigating how plasmids disseminate and how they mediate the transfer of resistance and virulence genes. Several machine learning-based tools have been developed to predict plasmid host ranges. These tools have been trained and tested on the bacterial host records of plasmids in related databases. Typically, however, a plasmid genome in databases such as that of the National Center for Biotechnology Information is annotated with only one or a few bacterial hosts, which do not encompass all possible hosts. Consequently, existing methods may significantly underestimate the host ranges of mobile plasmids. In this work, we propose a novel method named HRPredict, which employs a word vector model to represent the proteins encoded on plasmid genomes numerically. Because it is difficult to confirm which hosts a particular plasmid definitely cannot enter, we framed the prediction of whether a plasmid can enter a specific bacterium as a learning task without negative samples. Using multiple one-class support vector machine (SVM) models, which do not require negative samples for training, HRPredict predicts the host range of plasmids across 45 families, 56 genera and 56 species. In the benchmark test set, we constructed reliable negative samples for each host taxonomic unit via two indirect methods and found that the area under the curve (AUC), F1-score, recall, precision and accuracy of most taxonomic unit prediction models exceeded 0.9. Among the 13 broad-host-range plasmid types, HRPredict demonstrated greater coverage than HOTSPOT and PlasmidHostFinder, successfully predicting the majority of previously reported hosts. By calculating feature importance for each SVM model, we found that genes closely related to the plasmid host range are involved in functions such as bacterial adaptability, pathogenicity and survival. These findings provide significant insight into the mechanisms by which bacteria adapt to diverse environments through plasmids. The HRPredict algorithm is expected to facilitate in-depth research on the spread of broad-host-range plasmids and to enable host-range predictions for novel plasmids reconstructed from microbiome sequencing data.
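
The positive-only framing described in the abstract can be illustrated with a small sketch. It is not the HRPredict implementation: the abstract does not give the feature construction or SVM settings, so the sketch assumes each plasmid is already a fixed-length numeric vector, and it swaps in a simple centroid-distance classifier as a stand-in for a one-class SVM so that it runs with the Python standard library alone. All names (`CentroidOneClass`, `train_host_models`, `predict_host_range`, `binary_metrics`), the taxon labels and the toy vectors are hypothetical. It shows one model per host taxonomic unit trained only on positives, a host range read off as the set of taxa whose model accepts the plasmid, and threshold metrics computed once negatives are available for testing.

```python
import math


class CentroidOneClass:
    """Positive-only classifier: accept points close to the training centroid.

    Assumption: a deliberately simple stand-in for a one-class SVM,
    not the model used by HRPredict.
    """

    def __init__(self, nu=0.1):
        # nu: approximate fraction of training positives allowed outside the boundary
        self.nu = nu

    def fit(self, X):
        n, dims = len(X), len(X[0])
        self.center = [sum(x[j] for x in X) / n for j in range(dims)]
        dists = sorted(math.dist(x, self.center) for x in X)
        # Radius that keeps roughly a (1 - nu) fraction of the positives inside
        k = max(0, math.ceil((1 - self.nu) * n) - 1)
        self.radius = dists[k]
        return self

    def predict(self, X):
        return [math.dist(x, self.center) <= self.radius for x in X]


def train_host_models(positives_by_taxon, nu=0.1):
    """Fit one positive-only model per host taxonomic unit."""
    return {taxon: CentroidOneClass(nu).fit(X)
            for taxon, X in positives_by_taxon.items()}


def predict_host_range(models, vector):
    """Return every taxon whose model accepts the plasmid vector."""
    return sorted(t for t, m in models.items() if m.predict([vector])[0])


def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy from boolean labels and predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
        "accuracy": (tp + tn) / len(y_true),
    }


# Toy plasmid vectors (made up): only known positives per taxon are used for training.
positives_by_taxon = {
    "taxon_A": [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]],
    "taxon_B": [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]],
}
models = train_host_models(positives_by_taxon)

# A plasmid can be accepted by several taxon models, or by none.
broad = predict_host_range(models, [0.05, 0.05])
unmatched = predict_host_range(models, [3.0, 3.0])

# Evaluation of one taxon model once negatives have been constructed for testing.
test_vectors = [[0.05, 0.05], [0.02, 0.08], [0.5, 0.5], [3.0, 3.0], [1.0, 1.0]]
y_true = [True, True, True, False, False]
y_pred = models["taxon_A"].predict(test_vectors)
metrics = binary_metrics(y_true, y_pred)
print(broad, unmatched, metrics)
```

Because each taxon has its own independent model, a plasmid is never forced into a single host label, which matches the concern in the abstract that database annotations list only one or a few hosts. Negatives appear only at evaluation time, never in `fit`.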

Source journal: Microbial Genomics (Medicine-Epidemiology)
CiteScore: 6.60
Self-citation rate: 2.60%
Articles published: 153
Review time: 12 weeks
Journal description: Microbial Genomics (MGen) is a fully open access, mandatory open data and peer-reviewed journal publishing high-profile original research on archaea, bacteria, microbial eukaryotes and viruses.