Classification and Prediction of Antimicrobial Peptides Using N-gram Representation and Machine Learning

Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Pub Date: 2017-08-20. DOI: 10.1145/3107411.3108215

M. Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, Katherine Du, I. Vaisman


Citations: 1

Abstract


Current antibiotic treatments for infectious diseases are drastically losing effectiveness, as the organisms they target have developed resistance to the drugs over time. In the United States, antibiotic-resistant bacterial infections result in more than 23,000 deaths annually, and morbidity rates are much higher. A promising alternative to current antibiotic treatments is antimicrobial peptides (AMPs), short sequences of amino acid residues that have been experimentally shown to inhibit the propagation of pathogens. In this study, we demonstrated that an N-gram representation of AMP sequences over a reduced amino acid alphabet, combined with machine learning (ML) methods, provides simple and efficient AMP classification with performance comparable to that of more complex algorithms.

All AMP sequences were retrieved from public data sources. Our AMP set consists of 7,760 sequences, regardless of AMP subclass. We also used class-specific AMP sets (antibacterial, antiviral, antifungal, and antiparasitic). We created a raw negative set of 20,258 non-antimicrobial peptides (non-AMPs) using sequence fragments from annotated protein sequence databases. The datasets ranged from 200 to 7,760 sequences per class.

Models classifying all AMPs against non-AMP sequences achieved a maximum accuracy of 85.0% using frequency-based N-gram analysis and a Random Forest (RF) model with 10-fold cross-validation. Classification using more specific classes of AMPs was conducted next. Classification of antibacterial peptides (ABPs) against non-ABP sequences achieved an accuracy of up to 100%, depending on the ML algorithm and alphabet reduction used. Classification of ABPs against antiviral peptides (AVPs) yielded a maximum accuracy of 81.8%, AVPs against non-AVPs 80.7%, and AVPs against antifungal peptides (AFPs) 80.5%.

Several trends were common across the experiment series. Random Forest frequently outperforms the other algorithms. The optimal size of the reduced alphabet is either 3 or 4 letters: reduction to 2 letters leads to a significant drop in accuracy, while using 5 or more letters does not provide any noticeable gain in classification accuracy. The results of this study indicate that N-gram-based classification of AMPs is a promising approach with strong potential for providing important insights into AMP mechanisms and for the computational design of new AMPs.
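The feature construction described in the abstract (reduce the amino acid alphabet, then count N-grams) can be illustrated with a minimal sketch. The 3-letter grouping below, the group labels `a`/`b`/`c`, and the function names `reduce_sequence` and `ngram_frequencies` are assumptions made for illustration only; the abstract does not state which reduced alphabets or N-gram lengths the authors actually used.

```python
from collections import Counter
from itertools import product

# Hypothetical 3-letter reduced alphabet (an assumption; the abstract does
# not specify the grouping used in the study).
REDUCED_GROUPS = {
    "a": "AVLIMFWCGP",  # broadly hydrophobic / nonpolar residues
    "b": "STYNQ",       # polar, uncharged residues
    "c": "DEKRH",       # charged residues
}

# Invert the grouping: one reduced letter per standard amino acid.
AA_TO_GROUP = {aa: g for g, members in REDUCED_GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its reduced-alphabet letter, skipping unknown symbols."""
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper() if aa in AA_TO_GROUP)


def ngram_frequencies(seq, n=2):
    """Return a fixed-length vector of relative N-gram frequencies.

    The vector has one entry per possible N-gram over the reduced alphabet,
    in lexicographic order, so every sequence maps to the same feature space.
    """
    reduced = reduce_sequence(seq)
    letters = sorted(REDUCED_GROUPS)
    vocab = ["".join(p) for p in product(letters, repeat=n)]
    total = max(len(reduced) - n + 1, 0)  # number of N-grams in the sequence
    counts = Counter(reduced[i:i + n] for i in range(total))
    return [counts[g] / total if total else 0.0 for g in vocab]


if __name__ == "__main__":
    # Toy peptide, not taken from the study's data.
    print(reduce_sequence("KLAKLAK"))
    print(ngram_frequencies("KLAKLAK", n=2))
```

With a 3-letter alphabet and N = 2 this gives 9 features per peptide, which keeps the feature space small. Vectors like these could then be passed to any standard classifier, such as a Random Forest evaluated with 10-fold cross-validation as the abstract describes.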
