M. Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, I. Vaisman
Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Published 2018-08-15. DOI: 10.1145/3233547.3233657 (https://doi.org/10.1145/3233547.3233657)
Machine Learning Classification of Antimicrobial Peptides Using Reduced Alphabets
Antimicrobial peptides (AMPs) are considered a promising replacement for antibiotics. They act as part of the body's innate immune system. While their effects inside the body are largely known, correctly identifying AMPs from their sequence features remains a subject of active investigation. Here we optimize the use of reduced alphabets, which simplify the 20-letter amino acid alphabet to 2-4 letters, and of N-grams, short strings of amino acids, to find a correlation between a peptide's N-gram frequency profile and its AMP class. The calculations were carried out using Java programs written for this study and the WEKA machine learning software. Classification using machine learning methods was then conducted for AMP subclasses, including antibacterial, antifungal, and antiviral peptides. The results show that reduced alphabets with N-gram frequency analysis are a promising alternative for AMP classification and prediction. All AMP sequences were retrieved from different sources. The AMP set consists of 7984 sequences, not necessarily of any specific class. We also used class-specific AMP sets (antibacterial, antiviral, and antifungal). A raw negative set of 20258 non-AMPs was built from sequence fragments taken from annotated protein sequence databases. The classification of AMPs against non-AMPs was successful: models achieved a maximum accuracy of 87.71% using N-gram frequency analysis, alphabet reduction option 47, and the random forest (RF) model with 10 trees under cross-validation. Classification of more specific classes of AMPs was conducted next. First, classification of antibacterial peptides (ABPs) against non-ABP AMPs achieved a maximum accuracy of 86.83% using N-gram frequency analysis, alphabet reduction option 47, and the RF model, compared with 84.35% with the bagging algorithm. Second, classification of antiviral peptides (AVPs) against non-AVP AMPs achieved accuracies of 92.75% and 92.30% using N-gram frequency analysis with alphabet reduction options 47 and 29, respectively, and the RF model.
The experiments also included many other successful trials. RF significantly outperformed each of the other six learning algorithms. Alphabet reduction option 47 most often yielded the highest classification accuracies, which suggests that a 4-cluster alphabet is optimal for N-gram frequency analysis and machine learning. Our results suggest that the resulting classifiers have strong predictive power and can be of significant use in various biological and medical applications, potentially saving tens or hundreds of thousands of lives.
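
The feature extraction described in the abstract (alphabet reduction followed by N-gram frequency counting) can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' Java code. The 4-cluster grouping in `CLUSTERS` is a hypothetical one based on rough physicochemical classes; the abstract does not list the clusters behind "option 47", so none of them are reproduced here. The function names `reduce_sequence` and `ngram_profile` and the toy peptide string are likewise assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical 4-letter reduced alphabet (assumption, NOT the paper's option 47).
# Each cluster letter stands for a group of the 20 standard amino acids.
CLUSTERS = {
    "H": "AVLIMFWC",  # hydrophobic
    "P": "STNQYGP",   # polar or small
    "B": "KRH",       # basic
    "A": "DE",        # acidic
}

# Lookup table: one-letter amino acid code -> cluster letter.
REDUCE = {aa: letter for letter, group in CLUSTERS.items() for aa in group}


def reduce_sequence(seq):
    """Rewrite a peptide in the reduced alphabet.

    Residues outside the 20 standard codes are skipped, which joins
    their neighbours; a stricter version could raise an error instead.
    """
    return "".join(REDUCE[aa] for aa in seq.upper() if aa in REDUCE)


def ngram_profile(seq, n=3):
    """Return relative frequencies for every possible reduced N-gram.

    The result always has len(CLUSTERS) ** n entries in a fixed order,
    so profiles of different peptides line up as feature vectors.
    """
    reduced = reduce_sequence(seq)
    # Count overlapping windows of length n.
    counts = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    total = sum(counts.values())
    keys = ["".join(p) for p in product(sorted(CLUSTERS), repeat=n)]
    return {k: (counts[k] / total if total else 0.0) for k in keys}


# Toy peptide string chosen only to exercise the code.
print(reduce_sequence("KKLLDE"))  # BBHHAA
profile = ngram_profile("KKLLDE", n=2)
print(len(profile))  # 16
```

With a 4-letter alphabet and N = 3 each peptide becomes a 64-value vector, compared with 8000 possible trigrams over the full 20-letter alphabet. Vectors of this kind could then be passed to a classifier such as a random forest.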