使用简化字母的抗菌肽机器学习分类

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI:10.1145/3233547.3233657

M. Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, I. Vaisman

{"title":"使用简化字母的抗菌肽机器学习分类","authors":"M. Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, I. Vaisman","doi":"10.1145/3233547.3233657","DOIUrl":null,"url":null,"abstract":"Antimicrobial peptides (AMPs) are being considered as a promising replacement for antibiotics. They take action in the bodies' adaptive immune system. While its effect inside the body is primarily known, a problem of correctly identifying AMPs based on their sequence features remains a subject of active investigations. Here we optimize the use of the reduced alphabet, simplify 20-letter amino acid alphabet to 2-4 letters, and the use of N-grams, short strings of amino acids, to find a correlation between a profile of N-gram frequencies. The calculations were carried out using java programs written for this study and WEKA machine learning software. Classification using machine learning methods was then conducted for AMP subclasses, including antibacterial, antifungal, and antiviral peptides. The results show that reduced alphabets with N-gram frequency analysis are a promising alternative in the area of AMP classification and prediction. All AMP sequences were retrieved from different sources. AMP set consists of 7984 sequences, not necessarily of any specific class. We also used class-specific AMP sets (antibacterial, antiviral, and antifungal). A raw negative set consisting of 20258 non-AMPs using sequence fragments from annotated protein sequence databases. The classification of AMPs against non-AMPs was successful. Models achieved maximum accuracy of 87.71% using frequency N-gram analysis, alphabet reduction option 47, and the RF model with 10 trees cross-validation. Classification using more specific classes of AMPs was conducted next. First, classification of ABPs against non-ABPs AMPs achieved maximum accuracy of 86.83% using frequency N-gram analysis, alphabet reduction option 47, and RF model, while with bagging algorithm 84.35%. Second, classification of AVPs against non-AVP AMPs achieved an accuracy of 92.75% and 92.30% using frequency N-gram analysis, alphabet reduction option 47 and 29 respectively, and with RF model. This experiment also consisted of many other successful trials. RF significantly outperforms each of the other six learning algorithms. Alphabet reduction 47 most often yielded the highest classification accuracies. This finding implies that 4-cluster alphabet is optimal for N-gram frequency analysis and machine learning. Our results suggest that the classifiers produced possess great predictive power and can be of significant use in various biological and medical applications, potentially saving tens or hundreds of thousands of lives.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Machine Learning Classification of Antimicrobial Peptides Using Reduced Alphabets\",\"authors\":\"M. Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, I. Vaisman\",\"doi\":\"10.1145/3233547.3233657\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Antimicrobial peptides (AMPs) are being considered as a promising replacement for antibiotics. They take action in the bodies' adaptive immune system. While its effect inside the body is primarily known, a problem of correctly identifying AMPs based on their sequence features remains a subject of active investigations. Here we optimize the use of the reduced alphabet, simplify 20-letter amino acid alphabet to 2-4 letters, and the use of N-grams, short strings of amino acids, to find a correlation between a profile of N-gram frequencies. The calculations were carried out using java programs written for this study and WEKA machine learning software. Classification using machine learning methods was then conducted for AMP subclasses, including antibacterial, antifungal, and antiviral peptides. The results show that reduced alphabets with N-gram frequency analysis are a promising alternative in the area of AMP classification and prediction. All AMP sequences were retrieved from different sources. AMP set consists of 7984 sequences, not necessarily of any specific class. We also used class-specific AMP sets (antibacterial, antiviral, and antifungal). A raw negative set consisting of 20258 non-AMPs using sequence fragments from annotated protein sequence databases. The classification of AMPs against non-AMPs was successful. Models achieved maximum accuracy of 87.71% using frequency N-gram analysis, alphabet reduction option 47, and the RF model with 10 trees cross-validation. Classification using more specific classes of AMPs was conducted next. First, classification of ABPs against non-ABPs AMPs achieved maximum accuracy of 86.83% using frequency N-gram analysis, alphabet reduction option 47, and RF model, while with bagging algorithm 84.35%. Second, classification of AVPs against non-AVP AMPs achieved an accuracy of 92.75% and 92.30% using frequency N-gram analysis, alphabet reduction option 47 and 29 respectively, and with RF model. This experiment also consisted of many other successful trials. RF significantly outperforms each of the other six learning algorithms. Alphabet reduction 47 most often yielded the highest classification accuracies. This finding implies that 4-cluster alphabet is optimal for N-gram frequency analysis and machine learning. Our results suggest that the classifiers produced possess great predictive power and can be of significant use in various biological and medical applications, potentially saving tens or hundreds of thousands of lives.\",\"PeriodicalId\":131906,\"journal\":{\"name\":\"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics\",\"volume\":\"30 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-08-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3233547.3233657\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3233547.3233657","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

抗菌肽(AMPs)被认为是一种很有前途的抗生素替代品。它们在人体的适应性免疫系统中起作用。虽然其在体内的作用是已知的，但基于其序列特征正确识别amp的问题仍然是一个积极研究的主题。在这里，我们优化了简化字母表的使用，将20个字母的氨基酸字母表简化为2-4个字母，并使用N-grams(氨基酸的短串)来寻找N-gram频率之间的相关性。计算使用为本研究编写的java程序和WEKA机器学习软件进行。然后使用机器学习方法对AMP亚类进行分类，包括抗菌、抗真菌和抗病毒肽。结果表明，基于N-gram频率分析的简化字母表在AMP分类和预测领域是一个很有前途的选择。所有AMP序列均来自不同来源。AMP集合由7984个序列组成，不一定是任何特定的类。我们还使用了特定类别的AMP套装(抗菌、抗病毒和抗真菌)。由20258个非amp组成的原始阴性集合，使用来自带注释的蛋白质序列数据库的序列片段。amp与非amp的分类成功。使用频率N-gram分析、字母减少选项47和10树交叉验证的RF模型，模型的准确率达到87.71%。接下来使用更具体的amp类别进行分类。首先，使用频率N-gram分析、字母约简选项47和RF模型对ABPs和非ABPs的AMPs进行分类，准确率达到86.83%，而使用bagging算法的准确率为84.35%。其次，使用频率N-gram分析、字母减少选项47和29以及RF模型，avp与非avp amp的分类准确率分别达到92.75%和92.30%。这个实验还包括许多其他成功的试验。RF显著优于其他六种学习算法。字母还原47通常产生最高的分类准确率。这一发现意味着4聚类字母表对于N-gram频率分析和机器学习是最佳的。我们的研究结果表明，所产生的分类器具有很强的预测能力，可以在各种生物和医学应用中发挥重要作用，可能挽救数万或数十万人的生命。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Machine Learning Classification of Antimicrobial Peptides Using Reduced Alphabets

Antimicrobial peptides (AMPs) are being considered as a promising replacement for antibiotics. They take action in the bodies' adaptive immune system. While its effect inside the body is primarily known, a problem of correctly identifying AMPs based on their sequence features remains a subject of active investigations. Here we optimize the use of the reduced alphabet, simplify 20-letter amino acid alphabet to 2-4 letters, and the use of N-grams, short strings of amino acids, to find a correlation between a profile of N-gram frequencies. The calculations were carried out using java programs written for this study and WEKA machine learning software. Classification using machine learning methods was then conducted for AMP subclasses, including antibacterial, antifungal, and antiviral peptides. The results show that reduced alphabets with N-gram frequency analysis are a promising alternative in the area of AMP classification and prediction. All AMP sequences were retrieved from different sources. AMP set consists of 7984 sequences, not necessarily of any specific class. We also used class-specific AMP sets (antibacterial, antiviral, and antifungal). A raw negative set consisting of 20258 non-AMPs using sequence fragments from annotated protein sequence databases. The classification of AMPs against non-AMPs was successful. Models achieved maximum accuracy of 87.71% using frequency N-gram analysis, alphabet reduction option 47, and the RF model with 10 trees cross-validation. Classification using more specific classes of AMPs was conducted next. First, classification of ABPs against non-ABPs AMPs achieved maximum accuracy of 86.83% using frequency N-gram analysis, alphabet reduction option 47, and RF model, while with bagging algorithm 84.35%. Second, classification of AVPs against non-AVP AMPs achieved an accuracy of 92.75% and 92.30% using frequency N-gram analysis, alphabet reduction option 47 and 29 respectively, and with RF model. This experiment also consisted of many other successful trials. RF significantly outperforms each of the other six learning algorithms. Alphabet reduction 47 most often yielded the highest classification accuracies. This finding implies that 4-cluster alphabet is optimal for N-gram frequency analysis and machine learning. Our results suggest that the classifiers produced possess great predictive power and can be of significant use in various biological and medical applications, potentially saving tens or hundreds of thousands of lives.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

自引率

0.00%

发文量