PlasGO: enhancing GO-based function prediction for plasmid-encoded proteins based on genetic structure.

IF 11.8 2区 生物学 Q1 MULTIDISCIPLINARY SCIENCES
Yongxin Ji, Jiayu Shang, Jiaojiao Guan, Wei Zou, Herui Liao, Xubo Tang, Yanni Sun
{"title":"PlasGO: enhancing GO-based function prediction for plasmid-encoded proteins based on genetic structure.","authors":"Yongxin Ji, Jiayu Shang, Jiaojiao Guan, Wei Zou, Herui Liao, Xubo Tang, Yanni Sun","doi":"10.1093/gigascience/giae104","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces 2 major challenges: the high diversity of functions and the limited availability of high-quality GO annotations.</p><p><strong>Results: </strong>In this study, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against 7 state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the 3 GO categories, respectively, as measured on the novel protein test set.</p><p><strong>Conclusions: </strong>PlasGO, a hierarchical tool incorporating protein language models and BERT, significantly expanded plasmid protein annotations by predicting high-confidence GO terms. These annotations have been compiled into a database, which will serve as a valuable contribution to downstream plasmid analysis and research.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":11.8000,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11659980/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GigaScience","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/gigascience/giae104","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces 2 major challenges: the high diversity of functions and the limited availability of high-quality GO annotations.

Results: In this study, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against 7 state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the 3 GO categories, respectively, as measured on the novel protein test set.

Conclusions: PlasGO, a hierarchical tool incorporating protein language models and BERT, significantly expanded plasmid protein annotations by predicting high-confidence GO terms. These annotations have been compiled into a database, which will serve as a valuable contribution to downstream plasmid analysis and research.

PlasGO:基于基因结构加强质粒编码蛋白的 GO 功能预测。
背景:质粒作为一种可移动的遗传元件,在促进细菌群落中抗菌素耐药性等性状的转移中起着关键作用。用广泛使用的基因本体(Gene Ontology, GO)词汇对质粒编码的蛋白质进行注释是包括质粒迁移率分类在内的各种任务的基本步骤。然而,质粒编码蛋白的氧化石墨烯预测面临两个主要挑战:功能的高度多样性和高质量氧化石墨烯注释的有限可用性。结果:在本研究中,我们引入了PlasGO,这是一种利用层次结构来预测质粒蛋白的GO术语的工具。PlasGO利用强大的蛋白质语言模型来学习蛋白质句子中的局部上下文,利用BERT模型来捕获质粒句子中的全局上下文。此外,PlasGO允许用户通过结合自我关注自信加权机制来控制精度。我们对PlasGO进行了严格的评估,并在一系列实验中对7种最先进的工具进行了基准测试。实验结果表明,PlasGO取得了良好的性能。PlasGO通过为超过95%的先前未注释的蛋白质分配高置信度的GO术语,显著扩展了质粒编码蛋白质数据库的注释,在新的蛋白质测试集上测量的3个GO类别分别显示出令人印象深刻的精度为0.8229,0.7941和0.8870。结论:PlasGO是一个结合蛋白质语言模型和BERT的分层工具,通过预测高置信度的GO术语,显著扩展了质粒蛋白质注释。这些注释已汇编成一个数据库,将为下游质粒分析和研究提供有价值的贡献。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
GigaScience
GigaScience MULTIDISCIPLINARY SCIENCES-
CiteScore
15.50
自引率
1.10%
发文量
119
审稿时长
1 weeks
期刊介绍: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信