Comprehensive Assessment of BERT-Based Methods for Predicting Antimicrobial Peptides

IF 5.3 · CAS Tier 2 (Chemistry) · JCR Q1 (Chemistry, Medicinal)
Wanling Gao, Jun Zhao, Jianfeng Gui, Zehan Wang, Jie Chen* and Zhenyu Yue*
Journal: Journal of Chemical Information and Modeling, 64 (19), 7772–7785
DOI: 10.1021/acs.jcim.4c00507
Published: 2024-09-24
URL: https://pubs.acs.org/doi/10.1021/acs.jcim.4c00507
Citations: 0

Abstract


In recent years, the prediction of antimicrobial peptides (AMPs) has gained prominence because of their high antibacterial activity and reduced susceptibility to drug resistance, which make them potential antibiotic substitutes. To advance the field of AMP recognition, an increasing number of natural language processing methods are being applied. These methods differ in their pretraining models, pretraining data sets, word-vector embeddings, feature-encoding methods, and downstream classification models. Here, we provide a comprehensive survey of current BERT-based methods for AMP prediction. An independent benchmark test data set is constructed to evaluate the predictive capabilities of the surveyed tools. Furthermore, we compared the predictive performance of these computational methods on six different public AMP databases. LM_pred (BFD) outperformed all other surveyed tools owing to its abundant pretraining data set and unique vector-embedding approach. To avoid the impact that the varying training data sets used by different methods have on prediction performance, we performed 5-fold cross-validation experiments, with retraining, using the same data set. Additionally, to explore the applicability and generalization ability of the models, we constructed a short-peptide data set and an external data set to test the retrained models. Although these BERT-based prediction methods achieve good prediction performance, there is still room for improvement in recognition accuracy. Building on the continuing advances in protein language models, we propose an AMP prediction method, iAMP-bert, based on the ESM-2 pretrained model. Experimental results demonstrate that iAMP-bert outperforms the other approaches. iAMP-bert is freely accessible to the public at http://iamp.aielab.cc/.
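The evaluation protocol the abstract describes — retraining every method on one shared data set and scoring it with 5-fold cross-validation, so that differences in training data no longer confound the comparison — can be sketched in miniature. The snippet below is illustrative only: it uses synthetic toy peptides, simple amino-acid-composition features, and a nearest-centroid classifier, none of which are the paper's actual data or models (the surveyed methods are BERT/ESM-2 based); every name here is an assumption made for the sketch.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_features(peptide):
    """Amino-acid composition (AAC): frequency of each of the 20 residues."""
    counts = Counter(peptide)
    return [counts.get(a, 0) / len(peptide) for a in AMINO_ACIDS]

def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

def fit_centroids(X, y):
    """Per-class mean feature vector -- a stand-in for a trained classifier."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign the class whose centroid is nearest in squared distance."""
    dist2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda lab: dist2(centroids[lab], x))

# Synthetic toy data: "AMP-like" peptides enriched in cationic K/R residues,
# negatives enriched in acidic D/E residues (a caricature, not real biology).
rng = random.Random(42)
def make_peptide(enriched):
    pool = enriched * 5 + AMINO_ACIDS
    return "".join(rng.choice(pool) for _ in range(20))

peptides = [make_peptide("KR") for _ in range(50)] + \
           [make_peptide("DE") for _ in range(50)]
labels = [1] * 50 + [0] * 50
X = [aac_features(p) for p in peptides]

# The key point of the protocol: every "method" is retrained from scratch
# inside each fold, on identical folds of one shared data set.
accs = []
for train, test in kfold_indices(len(X), k=5, seed=0):
    model = fit_centroids([X[i] for i in train], [labels[i] for i in train])
    hits = sum(predict(model, X[i]) == labels[i] for i in test)
    accs.append(hits / len(test))
mean_acc = sum(accs) / len(accs)
```

Because folds and training data are identical across methods, any remaining performance gap can be attributed to the model rather than to its training set — which is exactly why the paper reruns all surveyed tools under this protocol.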

Source journal
CiteScore: 9.80
Self-citation rate: 10.70%
Annual publications: 529
Review time: 1.4 months
Journal description: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly's insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you'll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.