A distance-based feature-encoding technique for protein sequence classification in bioinformatics
Authors: M. Iqbal, I. Faye, A. Said, B. Samir
Venue: 2013 IEEE International Conference on Computational Intelligence and Cybernetics (CYBERNETICSCOM)
Published: 2013-12-01
DOI: https://doi.org/10.1109/CYBERNETICSCOM.2013.6865770
Bioinformatics has emerged since the last century as a research field that combines techniques from computer science and biology for the automatic analysis of biological sequence data. The volume of biological data gathered by sequencing projects is increasing exponentially. These sequences contain extremely important information about genes and their structure and function. Computational techniques involving machine learning and pattern recognition are becoming very useful for bioinformatics data such as DNA and protein sequences. Classifying proteins into groups can help determine the structure or function of an unknown protein sequence. Classifying protein amino acid sequences into a family or superfamily is a very complex problem. Among the major issues in protein classification, the critical one is an accurate representation of the amino acid sequence during feature extraction. In this work, we propose a distance-based feature-encoding method. The proposed technique was tested with different classifiers and gave better results than previously available techniques for superfamily classification of protein sequences. The maximum average classification accuracy obtained was 91.2%. The dataset used in the experiments was taken from the well-known UniProtKB protein database.
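The abstract does not describe the encoding itself, so the following is only an illustrative sketch of what a distance-based feature encoding can look like, not the authors' method. It assumes one simple scheme: for each distance d from 1 to a chosen maximum, count how often each ordered pair of the 20 standard amino acids occurs d positions apart, then normalize by the number of positions compared. The function name `distance_pair_features` and the parameter `max_distance` are hypothetical names introduced for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (one-letter codes)


def distance_pair_features(sequence, max_distance=2):
    """Hypothetical encoding: normalized counts of residue pairs that lie
    1..max_distance positions apart. Returns 400 * max_distance floats."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for d in range(1, max_distance + 1):
        counts = dict.fromkeys(pairs, 0)
        windows = max(len(sequence) - d, 0)  # number of (i, i + d) positions
        for i in range(windows):
            pair = sequence[i] + sequence[i + d]
            if pair in counts:  # skip non-standard symbols such as X or B
                counts[pair] += 1
        # Normalize so that sequences of different lengths are comparable.
        features.extend(counts[p] / windows if windows else 0.0 for p in pairs)
    return features


# Toy sequence, chosen only to show the shape of the output.
vector = distance_pair_features("ACDA", max_distance=2)
print(len(vector))
```

A fixed-length vector of this kind can be passed to any standard classifier. The choice of pair counts and of the maximum distance is an assumption made for illustration; the paper's actual distance definition may differ.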