J. Ma, M. N. Nguyen, G.W.L. Pang, Jagath Rajapakse
Gene Classification using Codon Usage and SVMs
2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology
https://doi.org/10.1109/CIBCB.2005.1594951
A novel approach to gene classification is proposed that uses the codon usage bias pattern as the feature vector for classification with Support Vector Machines (SVMs). A given DNA sequence is first converted to a 59-dimensional feature vector, each element of which corresponds to the relative synonymous usage frequency of a codon. The input to the classifier is therefore independent of the length of the DNA sequence. This makes our approach useful when the genes to be classified differ in length, a setting in which homology-based methods are inapplicable because sequences of different lengths are difficult to align. The applicability of the method is demonstrated by classifying 1841 HLA (Human Leukocyte Antigen) coding sequences selected from the IMGT/HLA database. Using the codon usage frequencies, the binary SVM achieved an accuracy of up to 99.30% in classifying human MHC (Major Histocompatibility Complex) molecules into their major classes, MHC-I and MHC-II. With a multi-class SVM approach, accuracy rates of 99.73% and 98.38% were achieved for subclass classification within the MHC-I and MHC-II classes, respectively. The results show that the proposed method can accurately classify MHC molecules into their major classes as well as into the subclasses within each major class. Moreover, the gene classification obtained from the codon usage bias pattern is consistent with the molecular structures and biological functions, further validating our approach.
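The length-independent feature vector described in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not spell out: the 59 dimensions are taken to be the codons of the standard genetic code minus the three stop codons and the two single-codon amino acids (Met, Trp), and each element is taken to be the codon's count divided by the total count of its synonymous family. The names `FEATURE_CODONS` and `codon_usage_vector` are hypothetical, and the paper's exact normalization may differ.

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order (assumed here;
# the abstract does not state which code table was used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))

# Keep only codons that have at least one synonymous alternative:
# 64 codons - 3 stops - ATG (Met) - TGG (Trp) = 59.
FAMILY_SIZE = Counter(AMINO)
FEATURE_CODONS = [
    c for c in CODONS
    if CODON_TO_AA[c] != "*" and FAMILY_SIZE[CODON_TO_AA[c]] > 1
]


def codon_usage_vector(seq):
    """Return a 59-element list of within-family codon frequencies.

    Assumption: element = count(codon) / count(all codons for the same
    amino acid). Families absent from the sequence yield 0.0.
    """
    seq = seq.upper()
    # Read the sequence in frame, ignoring any trailing partial codon.
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    family_totals = Counter()
    for codon in FEATURE_CODONS:
        family_totals[CODON_TO_AA[codon]] += counts[codon]
    vector = []
    for codon in FEATURE_CODONS:
        total = family_totals[CODON_TO_AA[codon]]
        vector.append(counts[codon] / total if total else 0.0)
    return vector


short_gene = "ATGGCTGCTGCC"          # Met, Ala (GCT), Ala (GCT), Ala (GCC)
long_gene = "ATG" + "GCTGCTGCC" * 5  # same codon preferences, longer sequence
v_short = codon_usage_vector(short_gene)
v_long = codon_usage_vector(long_gene)
```

Both vectors have 59 elements regardless of sequence length, which is what lets sequences of different lengths share one classifier input space. In this sketch the vectors would then be passed to an SVM implementation of the reader's choice.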